Programming: SQL/BI Analyst

How do you implement duplicate detection and handling in large SQL tables to ensure the uniqueness of key data?


Answer.

The problem of duplicate rows is one of the oldest in SQL, and it was sharply exacerbated by the mass growth of internet services and the migration of large data sets. Initially it was solved by hand: duplicates were found with one-off scripts and deleted, but as data volumes grew, this approach became ineffective.

Problem: during careless loading, migration, or failures, it is common to end up with tens of thousands of rows sharing identical key attributes (e.g., email or passport number). This leads to errors in integrations, incorrect analytics, and loss of customer trust.

Solution:

  • Build duplicate reports using grouping and window functions
  • Delete duplicate records while retaining one "correct" record per key
  • Add UNIQUE constraints on key fields and audit the data regularly

Code example

WITH Duplicates AS (
    SELECT email, COUNT(*) AS cnt
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
)
SELECT u.*
FROM users u
JOIN Duplicates d ON u.email = d.email
ORDER BY u.email, u.id;

Deleting duplicates (keeping the row with the minimum id):

DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY email
);

Key features:

  • Use of GROUP BY, HAVING for finding duplicates
  • Automation of deletion via NOT IN/EXISTS/ROW_NUMBER()
  • Introduction of uniqueness constraints for key fields
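Beyond NOT IN, deletion can be automated with ROW_NUMBER(). A minimal sketch, assuming the same users table with id and email columns (CTE-based DELETE syntax varies by dialect; this form works in PostgreSQL and SQL Server):

```sql
-- Rank rows within each email group by id; rn = 1 is the row we keep.
WITH Ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM Ranked WHERE rn > 1);
```

Changing the ORDER BY inside the window lets you choose which duplicate survives, e.g. the most recently updated row instead of the one with the smallest id.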

Tricky questions.

Can DISTINCT be used to delete duplicates from a table?

No. DISTINCT only affects query results (SELECT); it does not delete rows from the table. To remove duplicates you need DELETE, or INSERT ... SELECT DISTINCT into a new, clean table.
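A minimal sketch of the "new clean table" approach (table and column names are illustrative; CREATE TABLE ... AS SELECT is PostgreSQL/MySQL syntax, while SQL Server uses SELECT ... INTO):

```sql
-- Build a clean copy containing each distinct row once.
CREATE TABLE users_clean AS
SELECT DISTINCT email, name
FROM users;

-- Then swap the tables (wrap in a transaction where supported):
-- DROP TABLE users;
-- ALTER TABLE users_clean RENAME TO users;
```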

Does the DELETE ... WHERE id NOT IN (SELECT MIN(id) ...) command guarantee the removal of all duplicates?

No. If the key column contains NULLs, the behavior can be surprising: should the subquery ever return a NULL, NOT IN evaluates to unknown for every row and nothing is deleted at all, and GROUP BY treats all NULL keys as a single group, which may not match the intended uniqueness rules. A correlated EXISTS-based deletion is safer.
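A sketch of the correlated variant, assuming the same users table (the alias in DELETE is PostgreSQL-style; MySQL writes it as DELETE u FROM users u ...):

```sql
-- Delete every row for which an older row (smaller id) with the
-- same email exists. Rows with a NULL email never match the
-- equality predicate, so they are left untouched rather than
-- silently collapsed into one group.
DELETE FROM users u
WHERE EXISTS (
    SELECT 1
    FROM users u2
    WHERE u2.email = u.email
      AND u2.id < u.id
);
```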

Is it sufficient to create a UNIQUE INDEX to prevent duplicates from occurring again?

No. An index only blocks new attempts to insert duplicates; it does nothing about duplicates already in the table, and in most databases creating it will fail outright while such duplicates exist.
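Once the table is clean, the constraint can be added. A sketch in standard DDL (constraint and index names are illustrative):

```sql
-- Declarative constraint on the key column:
ALTER TABLE users
    ADD CONSTRAINT uq_users_email UNIQUE (email);

-- Equivalent protection as an explicit index:
-- CREATE UNIQUE INDEX uq_users_email ON users (email);
```

Note that many databases (e.g. PostgreSQL, MySQL) allow multiple NULLs under a UNIQUE constraint, so NULL keys still need separate handling.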

Common mistakes and anti-patterns

  • Direct deletion by a single column without considering NULLs or complex keys
  • Lack of constant monitoring and logging of duplicates
  • Adding uniqueness constraints without cleaning the table first, which makes index creation fail
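The first two anti-patterns can be addressed with a composite-key duplicate report that makes NULL handling explicit. A sketch, assuming hypothetical email and passport_no columns and a sentinel string that cannot occur in real data:

```sql
-- Group on a composite key, folding NULLs into a sentinel so that
-- rows with missing values are compared deliberately, not ignored.
SELECT COALESCE(email, '<null>')       AS email_key,
       COALESCE(passport_no, '<null>') AS passport_key,
       COUNT(*) AS cnt
FROM users
GROUP BY COALESCE(email, '<null>'), COALESCE(passport_no, '<null>')
HAVING COUNT(*) > 1;
```

Scheduling such a report is one simple form of the regular duplicate monitoring recommended above.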

Real-life example

Negative case

A CRM system copies users from various sources without enforcing email uniqueness, accumulating 50,000 duplicates. A mass attempt to add a UNIQUE index then fails, and the service goes down.

Pros:

  • Quick implementation (before the index)

Cons:

  • Data loss, service outages, support failure

Positive case

An engineer regularly analyzes duplicates through dedicated reports, cleans the database, and creates unique indexes. New data is validated before every migration.

Pros:

  • Clean key data
  • Minimization of errors in analytics

Cons:

  • Requires setting up monitoring and automation procedures
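The pre-migration validation from the positive case can be sketched as two checks against a staging table (staging_users is an assumed name for the incoming batch):

```sql
-- 1) Keys in the incoming batch that already exist in production:
SELECT s.email
FROM staging_users s
JOIN users u ON u.email = s.email;

-- 2) Keys duplicated inside the batch itself:
SELECT email, COUNT(*) AS cnt
FROM staging_users
GROUP BY email
HAVING COUNT(*) > 1;
```

Only when both queries return no rows is the batch loaded into users.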