Reports on unique users are essential for analytics and statistics. However, in real data, duplicate accounts and NULL values (e.g., unspecified email) are often present, and various criteria must be taken into account (e.g., uniqueness by name, email, IP, and sometimes their combinations).
A typical mistake is to calculate COUNT(DISTINCT user_id) without considering that the relevant columns may contain NULLs or non-obvious duplicates (e.g., one person with different emails, or multiple rows with the same user_id but different statuses). Complex queries with GROUP BY can yield incorrect results if the uniqueness logic is not well thought out.
It is important to combine DISTINCT, GROUP BY, and NULL filtering. Sometimes it is necessary to prepare the data in a CTE or a subquery, grouping by the appropriate set of attributes.
Example code:
-- Counting unique users by email and IP, ignoring NULL
SELECT COUNT(*) AS unique_users
FROM (
    SELECT DISTINCT email, ip_address
    FROM users
    WHERE email IS NOT NULL
      AND ip_address IS NOT NULL
) u;
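To try the idea of preparing data in a CTE on a small data set, here is a minimal sketch that runs the query through Python's built-in sqlite3 module. The sample rows are invented for illustration, and the LOWER(TRIM(email)) normalization is an assumption about what counts as a "non-obvious duplicate"; real data may need different rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, email TEXT, ip_address TEXT);
    INSERT INTO users VALUES
        (1, 'ann@example.com',  '10.0.0.1'),
        (2, 'Ann@Example.com ', '10.0.0.1'),  -- same person, different case and a trailing space
        (3, 'bob@example.com',  '10.0.0.2'),
        (4, NULL,               '10.0.0.3'),  -- email not specified
        (5, 'eve@example.com',  NULL);        -- IP not recorded
""")

# The CTE filters out NULLs and normalizes the email (hypothetical rule)
# before the outer query counts distinct (email, ip_address) pairs.
query = """
    WITH cleaned AS (
        SELECT LOWER(TRIM(email)) AS email, ip_address
        FROM users
        WHERE email IS NOT NULL AND ip_address IS NOT NULL
    )
    SELECT COUNT(*) AS unique_users
    FROM (SELECT DISTINCT email, ip_address FROM cleaned) AS u
"""
unique_users = conn.execute(query).fetchone()[0]
print(unique_users)
```

Without the normalization step, the two spellings of the same address would be counted as two different users.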
Key features:
Does COUNT(DISTINCT ...) consider NULL rows?
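No: COUNT(DISTINCT col) skips rows where col is NULL, so NULLs are never counted, no matter how many there are (only COUNT(*) counts them). Multi-column forms are dialect-specific; for example, MySQL's COUNT(DISTINCT a, b) skips every combination that contains a NULL. It is usually more convenient to remove NULLs explicitly with a filter and count over a SELECT DISTINCT subquery.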
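The difference between the three COUNT forms can be checked directly. This is a small sketch on invented sample data, run through Python's built-in sqlite3 module; the table layout is an assumption for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, email TEXT);
    INSERT INTO users VALUES
        (1, 'ann@example.com'),
        (2, 'ann@example.com'),
        (3, 'bob@example.com'),
        (4, NULL),
        (5, NULL);
""")

# COUNT(*) counts every row, COUNT(email) only rows with a non-NULL email,
# COUNT(DISTINCT email) only the different non-NULL emails.
total_rows, non_null_emails, distinct_emails = conn.execute(
    "SELECT COUNT(*), COUNT(email), COUNT(DISTINCT email) FROM users"
).fetchone()
print(total_rows, non_null_emails, distinct_emails)
```

The two rows with a NULL email contribute to COUNT(*) only, so the three numbers differ.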
Can NULL be compared to NULL through DISTINCT?
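Not with the = operator: NULL = NULL evaluates to UNKNOWN rather than TRUE. DISTINCT, however, does not use ordinary equality: for DISTINCT, GROUP BY, and UNION, NULLs are treated as duplicates of each other. SELECT DISTINCT email therefore returns at most one row with a NULL email, and repeated rows such as (NULL, '10.0.0.1') collapse into one. If such a row should not count as a user, filter it out with IS NOT NULL.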
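A quick way to see how NULLs behave under = versus DISTINCT is to run both on a tiny table. The sketch below uses Python's built-in sqlite3 module and invented rows; SQLite reports the UNKNOWN result of NULL = NULL as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, email TEXT);
    INSERT INTO users VALUES
        (1, 'ann@example.com'),
        (2, NULL),
        (3, NULL),
        (4, 'ann@example.com');
""")

# Ordinary comparison: NULL = NULL is not TRUE, it is NULL (unknown).
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# DISTINCT treats all NULLs as duplicates of each other.
distinct_rows = conn.execute("SELECT DISTINCT email FROM users").fetchall()
null_rows = [row for row in distinct_rows if row[0] is None]

print(null_equals_null, len(distinct_rows), len(null_rows))
```

Both NULL emails end up in a single result row, alongside one row for the repeated non-NULL address.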
Does GROUP BY always give the same result as DISTINCT?
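No: without aggregates, SELECT DISTINCT a, b and SELECT a, b ... GROUP BY a, b return the same set of rows. The results diverge once the select list contains aggregates or other computed expressions: GROUP BY forms the groups before the aggregates are evaluated, whereas DISTINCT removes duplicates from the final result rows afterwards.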
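The following sketch compares the two on invented data, using Python's built-in sqlite3 module. The users table with a status column is a hypothetical layout chosen so that one user_id appears in several rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, status TEXT);
    INSERT INTO users VALUES
        (1, 'active'), (1, 'blocked'),
        (2, 'active'),
        (3, 'active'), (3, 'blocked');
""")

def rows(sql):
    # Sort the result so the comparison does not depend on row order.
    return sorted(conn.execute(sql).fetchall())

# Without aggregates the two forms return the same rows.
distinct_ids = rows("SELECT DISTINCT user_id FROM users")
grouped_ids = rows("SELECT user_id FROM users GROUP BY user_id")

# With an aggregate they differ: one count per group versus
# the distinct values among those counts.
counts_per_user = rows("SELECT COUNT(*) FROM users GROUP BY user_id")
distinct_counts = rows("SELECT DISTINCT COUNT(*) FROM users GROUP BY user_id")

print(distinct_ids == grouped_ids, counts_per_user, distinct_counts)
```

Here GROUP BY alone yields three rows of counts, one per user, while adding DISTINCT collapses the two users that share the same count into a single row.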
A business analyst builds a report on unique clients using COUNT(DISTINCT user_id), but user_id may actually be NULL or duplicated (e.g., temporary accounts created for the same person). Duplicate accounts push the reported number above the real one, while rows with a NULL user_id are silently skipped, so the metrics in the report are distorted.
Pros:
Cons:
An analyst cleans the data in advance, filters out NULLs and obvious duplicates in subqueries, and uses set operations (UNION, INTERSECT, EXCEPT) for complex uniqueness criteria.
Pros:
Cons: