Finding unique values in a large dataset is trivial in SQL with the DISTINCT keyword. That convenience has a cost, however: DISTINCT puts load on the SQL box, and past a certain point it is more beneficial to do the work in managed code, where you can scale out horizontally. Typically, it is much easier to spin up more web boxes to handle traffic than it is to stand up a beefier database. Moving the work into managed code lets throughput grow as you add boxes and reduces pressure on SQL.
In this take, I will show you some techniques for working with duplicate data in managed code. I will explore common gotchas and