Programming: SQL/BI Analyst

How do you implement duplicate detection and handling in large SQL tables to ensure the uniqueness of key data?


Answer.

The problem of duplicate rows is one of the oldest in SQL, and it was sharply exacerbated by the mass growth of internet services and the migration of large data sets. Initially it was solved by hand: duplicates were found with one-off scripts and deleted, but as data volumes grew, this approach became ineffective.

Problem: during careless loading, migration, or failures, it is common to end up with tens of thousands of rows sharing identical key attributes (e.g., email or passport number). This leads to errors in integrations, incorrect analytics, and loss of customer trust.

Solution:

  • Build duplicate reports using grouping and window functions
  • Delete duplicate records while retaining one "correct" record per key
  • Add UNIQUE constraints on key fields and audit the data regularly

Code example

WITH Duplicates AS (
    SELECT email, COUNT(*) AS cnt
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
)
SELECT u.*
FROM users u
JOIN Duplicates d ON u.email = d.email
ORDER BY u.email, u.id;

Deleting duplicates (keeping the row with the minimum id):

DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY email
);

Key features:

  • Use of GROUP BY, HAVING for finding duplicates
  • Automation of deletion via NOT IN/EXISTS/ROW_NUMBER()
  • Introduction of uniqueness constraints for key fields
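Beyond NOT IN, deletion can be automated with ROW_NUMBER(). A minimal sketch, assuming the same users table with id and email columns (CTE-based DELETE syntax varies by dialect; this form works in PostgreSQL and SQL Server):

```sql
-- Rank rows within each email group by id; rn = 1 is the row we keep.
WITH Ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM Ranked WHERE rn > 1);
```

Changing the ORDER BY inside the window lets you choose which duplicate survives, e.g. the most recently updated row instead of the one with the smallest id.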

Tricky questions.

Can DISTINCT be used to delete duplicates from a table?

No. DISTINCT only affects query results (SELECT); it does not delete rows from the table. To remove duplicates you need DELETE, or INSERT ... SELECT DISTINCT into a new, clean table.
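A minimal sketch of the "new clean table" approach (table and column names are illustrative; CREATE TABLE ... AS SELECT is PostgreSQL/MySQL syntax, while SQL Server uses SELECT ... INTO):

```sql
-- Build a clean copy containing each distinct row once.
CREATE TABLE users_clean AS
SELECT DISTINCT email, name
FROM users;

-- Then swap the tables (wrap in a transaction where supported):
-- DROP TABLE users;
-- ALTER TABLE users_clean RENAME TO users;
```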

Does the DELETE ... WHERE id NOT IN (SELECT MIN(id) ...) command guarantee the removal of all duplicates?

No. If the key column contains NULLs, the behavior can be surprising: should the subquery ever return a NULL, NOT IN evaluates to unknown for every row and nothing is deleted at all, and GROUP BY treats all NULL keys as a single group, which may not match the intended uniqueness rules. A correlated EXISTS-based deletion is safer.
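A sketch of the correlated variant, assuming the same users table (the alias in DELETE is PostgreSQL-style; MySQL writes it as DELETE u FROM users u ...):

```sql
-- Delete every row for which an older row (smaller id) with the
-- same email exists. Rows with a NULL email never match the
-- equality predicate, so they are left untouched rather than
-- silently collapsed into one group.
DELETE FROM users u
WHERE EXISTS (
    SELECT 1
    FROM users u2
    WHERE u2.email = u.email
      AND u2.id < u.id
);
```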

Is it sufficient to create a UNIQUE INDEX to prevent duplicates from occurring again?

No. An index only blocks new attempts to insert duplicates; it does nothing about duplicates already in the table, and in most databases creating it will fail outright while such duplicates exist.
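Once the table is clean, the constraint can be added. A sketch in standard DDL (constraint and index names are illustrative):

```sql
-- Declarative constraint on the key column:
ALTER TABLE users
    ADD CONSTRAINT uq_users_email UNIQUE (email);

-- Equivalent protection as an explicit index:
-- CREATE UNIQUE INDEX uq_users_email ON users (email);
```

Note that many databases (e.g. PostgreSQL, MySQL) allow multiple NULLs under a UNIQUE constraint, so NULL keys still need separate handling.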

Common mistakes and anti-patterns

  • Direct deletion by a single column without considering NULLs or complex keys
  • Lack of constant monitoring and logging of duplicates
  • Adding uniqueness constraints without cleaning the table first, which makes index creation fail
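The first two anti-patterns can be addressed with a composite-key duplicate report that makes NULL handling explicit. A sketch, assuming hypothetical email and passport_no columns and a sentinel string that cannot occur in real data:

```sql
-- Group on a composite key, folding NULLs into a sentinel so that
-- rows with missing values are compared deliberately, not ignored.
SELECT COALESCE(email, '<null>')       AS email_key,
       COALESCE(passport_no, '<null>') AS passport_key,
       COUNT(*) AS cnt
FROM users
GROUP BY COALESCE(email, '<null>'), COALESCE(passport_no, '<null>')
HAVING COUNT(*) > 1;
```

Scheduling such a report is one simple form of the regular duplicate monitoring recommended above.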

Real-life example

Negative case

A CRM system copies users from various sources without enforcing email uniqueness, accumulating 50,000 duplicates. A mass attempt to add a UNIQUE index then fails, and the service goes down.

Pros:

  • Quick implementation (before the index)

Cons:

  • Data loss, service outages, support failure

Positive case

An engineer regularly analyzes duplicates through dedicated reports, cleans the database, and creates unique indexes. New data is validated before every migration.

Pros:

  • Clean key data
  • Minimization of errors in analytics

Cons:

  • Requires setting up monitoring and automation procedures
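The pre-migration validation from the positive case can be sketched as two checks against a staging table (staging_users is an assumed name for the incoming batch):

```sql
-- 1) Keys in the incoming batch that already exist in production:
SELECT s.email
FROM staging_users s
JOIN users u ON u.email = s.email;

-- 2) Keys duplicated inside the batch itself:
SELECT email, COUNT(*) AS cnt
FROM staging_users
GROUP BY email
HAVING COUNT(*) > 1;
```

Only when both queries return no rows is the batch loaded into users.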