Extracting unique records in SQL has become a critical task as more organizations store multidimensional data. Sometimes the result must contain non-repeating rows based on a combination of several columns; sometimes uniqueness on a single key is enough.
Background:
DISTINCT is the oldest and simplest tool for filtering duplicates. GROUP BY covers aggregations over unique sets of values, and window functions such as ROW_NUMBER(), standardized in SQL:2003, handle more flexible duplicate-handling scenarios, for example selecting the "latest" or "first" record in each group.
Problem:
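DISTINCT applies to the entire select list, not to a single column: two rows count as duplicates only if every selected column matches. GROUP BY collapses rows by key, but every other selected column must then be wrapped in an aggregate. Window functions allow more advanced logic, but they produce wrong or non-deterministic results when the ORDER BY inside OVER() does not fully determine which row comes first. Developers often confuse these approaches, and the mistakes lead to incorrect results.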
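To make the difference concrete, here is a minimal sketch that runs the same de-duplication with DISTINCT and with GROUP BY. It uses Python's built-in sqlite3 module as a throwaway harness; the Orders table and its sample rows are assumptions made up for illustration, with column names borrowed from the article's example.

```python
import sqlite3

# Hypothetical Orders table; sample data is invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, OrderDate TEXT, Amount REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 50.0),
        (2, 10, "2024-01-05", 70.0),  # same customer and date as order 1
        (3, 10, "2024-02-01", 20.0),
        (4, 20, "2024-01-05", 30.0),
    ],
)

# DISTINCT de-duplicates whole result rows, here the (CustomerID, OrderDate) pair.
distinct_pairs = conn.execute(
    "SELECT DISTINCT CustomerID, OrderDate FROM Orders ORDER BY CustomerID, OrderDate"
).fetchall()

# GROUP BY yields the same unique pairs and can also summarize each group.
grouped = conn.execute(
    """
    SELECT CustomerID, OrderDate, COUNT(*) AS OrderCount, SUM(Amount) AS Total
    FROM Orders
    GROUP BY CustomerID, OrderDate
    ORDER BY CustomerID, OrderDate
    """
).fetchall()

print(distinct_pairs)
print(grouped)
```

Both queries return three rows for the four inserted orders; only the GROUP BY version can also report how many orders were collapsed into each row.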
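Solution:
Match the tool to the task: DISTINCT when uniqueness of the whole select list is enough, GROUP BY when each unique key also needs aggregates, and ROW_NUMBER() with an outer query when one specific row per key must be kept.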
Example code:
Get the latest order record for each customer:
WITH OrdersRank AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS rn
    FROM Orders
)
SELECT *
FROM OrdersRank
WHERE rn = 1;
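The query above can be tried end to end with the following sketch, again using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The sample rows and the OrderID tie-breaker are assumptions added for illustration: when two orders share the same OrderDate, ordering by OrderDate alone leaves the choice of "latest" row undefined, so a unique column is appended to the ORDER BY.

```python
import sqlite3

# Hypothetical sample data; OrderID is assumed to be a unique key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT, Amount REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 50.0),
        (2, 10, "2024-03-01", 70.0),
        (3, 10, "2024-03-01", 20.0),  # ties with order 2 on OrderDate
        (4, 20, "2024-02-10", 30.0),
    ],
)

latest = conn.execute(
    """
    WITH OrdersRank AS (
        SELECT OrderID, CustomerID, OrderDate, Amount,
               ROW_NUMBER() OVER (
                   PARTITION BY CustomerID
                   ORDER BY OrderDate DESC, OrderID DESC  -- OrderID breaks date ties
               ) AS rn
        FROM Orders
    )
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM OrdersRank
    WHERE rn = 1
    ORDER BY CustomerID
    """
).fetchall()

print(latest)
```

Listing the columns explicitly in the outer SELECT also keeps the helper column rn out of the final result.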
Key features:
Can DISTINCT be used with aggregate functions without GROUP BY?
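Yes, when DISTINCT sits inside the aggregate: COUNT(DISTINCT CustomerID) is valid without GROUP BY, because the whole result set is treated as a single group. What fails is DISTINCT in the middle of the select list: as a row-level modifier it may appear only directly after SELECT, so the second query below is a syntax error.
SELECT COUNT(DISTINCT CustomerID) FROM Orders;        -- correct
SELECT SUM(Amount), DISTINCT CustomerID FROM Orders;  -- error!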
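Both statements can be checked with a short sketch. It uses Python's sqlite3 module and a made-up three-row Orders table (an assumption for illustration); the exact error text differs between database engines, so the code only records that an error occurred.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (CustomerID INTEGER, Amount REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(10, 50.0), (10, 70.0), (20, 30.0)],
)

# DISTINCT inside an aggregate needs no GROUP BY: the whole table is one group.
unique_customers, total = conn.execute(
    "SELECT COUNT(DISTINCT CustomerID), SUM(Amount) FROM Orders"
).fetchone()

# DISTINCT in the middle of the select list is rejected by the parser.
try:
    conn.execute("SELECT SUM(Amount), DISTINCT CustomerID FROM Orders")
    misplaced_distinct_error = None
except sqlite3.Error as exc:
    misplaced_distinct_error = str(exc)

print(unique_customers, total)
print(misplaced_distinct_error)
```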
What happens if not all non-aggregated fields from SELECT are specified in GROUP BY?
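In most DBMSs (PostgreSQL, SQL Server, Oracle, MySQL with ONLY_FULL_GROUP_BY enabled) this is an error: every column in SELECT that is not inside an aggregate must be listed in GROUP BY. The permissive exceptions are more dangerous than helpful: SQLite, and MySQL with ONLY_FULL_GROUP_BY disabled, accept the query and return the column's value from an arbitrary row of each group.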
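The sketch below contrasts a non-portable query with a portable rewrite. It runs on SQLite through Python's sqlite3 module, which is one of the permissive engines, so the first query executes here even though stricter engines such as PostgreSQL or SQL Server would reject it. The table contents are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, OrderDate TEXT, Amount REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 50.0),
        (2, 10, "2024-03-01", 70.0),
        (3, 20, "2024-02-10", 30.0),
    ],
)

# Non-portable: OrderDate is neither grouped nor aggregated.
# SQLite tolerates this and picks OrderDate from an arbitrary row per group.
lax = conn.execute(
    "SELECT CustomerID, OrderDate, SUM(Amount) FROM Orders "
    "GROUP BY CustomerID ORDER BY CustomerID"
).fetchall()

# Portable: every non-grouped column is wrapped in an aggregate.
strict = conn.execute(
    """
    SELECT CustomerID, MAX(OrderDate) AS LastOrder, SUM(Amount) AS Total
    FROM Orders
    GROUP BY CustomerID
    ORDER BY CustomerID
    """
).fetchall()

print(lax)
print(strict)
```

The portable version states explicitly which OrderDate is wanted, so the result no longer depends on how a particular engine happens to scan the rows.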
Can duplicates be "removed" using window functions without a subquery?
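Not in standard SQL. Window functions are evaluated after WHERE, GROUP BY, and HAVING, so ROW_NUMBER() cannot be referenced in the WHERE clause of the same SELECT; computing it in the select list only numbers the rows and filters nothing. An outer query (a subquery or CTE) is needed to keep the rows with the required number. Some dialects, for example Snowflake, BigQuery, and Teradata, offer a QUALIFY clause that filters on window results directly.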
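A minimal sketch of both attempts, run on SQLite (3.25 or newer) through Python's sqlite3 module. The sample rows and the OrderID tie-breaker are assumptions for illustration; other engines report the first query's failure with different wording, so only the presence of an error is recorded.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, 10, "2024-01-05"), (2, 10, "2024-03-01"), (3, 20, "2024-02-10")],
)

# Attempt 1: filter on the window function in the same SELECT. This is rejected.
try:
    conn.execute(
        "SELECT * FROM Orders "
        "WHERE ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) = 1"
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

# Attempt 2: number the rows in a subquery, filter in the outer query.
rows = conn.execute(
    """
    SELECT OrderID, CustomerID
    FROM (
        SELECT OrderID, CustomerID,
               ROW_NUMBER() OVER (
                   PARTITION BY CustomerID
                   ORDER BY OrderDate DESC, OrderID DESC
               ) AS rn
        FROM Orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY CustomerID
    """
).fetchall()

print(where_error)
print(rows)
```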
DISTINCT was applied across all columns of a table with 20 million rows: the query ran for hours and ended in a timeout or a drop in database performance.
Pros:
Cons:
Window functions were used: only the required latest record per customer was retrieved, in milliseconds, and older and duplicate records were never loaded.
Pros:
Cons: