ProgrammingSQL Analyst

How to retrieve only unique records from a complex data structure with several duplicates across different columns, and what are the specifics of using DISTINCT vs GROUP BY vs ROW_NUMBER()?

Pass interviews with Hintsage AI assistant

Answer.

Extracting unique records in SQL has become a critical task with the widespread transition of organizations to store multidimensional data. Sometimes it's necessary to output non-repeating rows based on a combination of several columns, sometimes — just based on one key.

Background:

Early versions of SQL only offered DISTINCT for filtering duplicates. Then structural techniques emerged, including GROUP BY for aggregations over unique sets of values and window functions like ROW_NUMBER() for more flexible scenarios when handling duplicates, for example: selection by "latest" or "first" record.

Problem:

DISTINCT works only at the level of the fields in SELECT, while GROUP BY requires aggregations. Window functions allow advanced logic, but their use often leads to errors if the row selection order is not considered. Developers often confuse these approaches, and mistakes lead to incorrect results.

Solution:

  • Use DISTINCT to obtain unique rows based on required fields.
  • GROUP BY — when aggregates are needed (for example, sum or date for unique pairs).
  • Window functions (ROW_NUMBER()) — for tasks like selecting "one row from a group of duplicates based on a certain criterion".

Example code:

Get the latest order record for each customer:

WITH OrdersRank AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) as rn FROM Orders ) SELECT * FROM OrdersRank WHERE rn = 1;

Key features:

  • DISTINCT — returns unique rows only for the fields specified in SELECT.
  • GROUP BY — required if aggregation is needed.
  • ROW_NUMBER() — maximally flexible for row selection with the needed priority/date/versioning.

Tricky questions.

Can DISTINCT be used with aggregate functions without GROUP BY?

No, aggregate functions require grouping, otherwise there will be a syntax error.

SELECT COUNT(DISTINCT CustomerID) -- correct SELECT SUM(Amount), DISTINCT CustomerID -- error!

What happens if not all non-aggregated fields from SELECT are specified in GROUP BY?

This will cause an error in most DBMS: all fields in SELECT, except for aggregates, must be listed in GROUP BY.

Can duplicates be "removed" using window functions without a subquery?

No: using ROW_NUMBER() within a single SELECT does not automatically filter out "duplicates"; an outer query is necessary to select the required rows.

Common mistakes and anti-patterns

  • Using DISTINCT with a large number of columns and rows — sharp performance drop.
  • GROUP BY without the necessary aggregate — meaningless and resource-intensive.
  • Window functions without subsequent filtering — data is returned with duplicates.

Real-life example

Negative case

Used DISTINCT across all columns for a table with 20 million rows: the query ran for hours, resulting in a timeout or performance drop of the DB.

Pros:

  • Easy to write.

Cons:

  • Extremely inefficient on large data sets.

Positive case

Used window functions: retrieved only the required latest record per customer in milliseconds; previous and duplicate records were not loaded.

Pros:

  • Extremely high performance.
  • Flexibility.

Cons:

  • Requires a well-designed query architecture and knowledge of window functions.