SQL (ANSI)ProgrammingSQL Developer

Elaborate on the technique for calculating a cumulative distinct count across ordered partitions, where the running total of unique entities must increment only upon first appearance within each group, using strictly ANSI SQL window functions without correlated subqueries?

Pass interviews with Hintsage AI assistant

Answer to the question

History of the question. The necessity for running distinct counts emerged from analytical workloads tracking metrics like cumulative unique customer acquisitions or distinct SKU introductions over time. Prior to the ANSI SQL:2003 window function extensions, analysts relied on self-joins or correlated subqueries, resulting in quadratic time complexity unacceptable for modern data volumes. The standardization of window functions provided a linear-time, set-based mechanism to maintain running cardinality without procedural loops.

The problem. ANSI SQL explicitly prohibits the DISTINCT keyword within window aggregate functions (e.g., COUNT(DISTINCT col) OVER (...)). This restriction prevents direct computation of distinct values within a cumulative or sliding frame. The core challenge lies in identifying the inaugural appearance of each entity within the partition's sort order and summing these binary flags (first appearance = 1, else = 0) progressively.

The solution. The canonical approach combines ROW_NUMBER() to flag first occurrences with a conditional SUM() window function. By partitioning ROW_NUMBER() by the entity identifier, the chronologically first occurrence receives value 1; subsequent appearances receive incrementing integers. An outer query then sums a case expression emitting 1 only when the row number equals 1, evaluated over an unbounded preceding frame.

SELECT event_date, region_id, user_id, SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY region_id ORDER BY event_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_unique_users FROM ( SELECT event_date, region_id, user_id, ROW_NUMBER() OVER ( PARTITION BY region_id, user_id ORDER BY event_date, event_id -- event_id as tie-breaker ) AS rn FROM user_activity ) flagged;

Situation from life

Problem description. A fintech startup needed to monitor regulatory compliance by tracking cumulative unique merchants onboarded per sales region throughout the fiscal year. Their merchant_signups table contained 120 million rows with region_code, merchant_id, and signup_timestamp. Existing Python batch jobs took 35 minutes to compute these metrics nightly, causing reporting delays and stale dashboard data. The requirement was to produce real-time cumulative counts within strict ANSI SQL for portability across cloud data warehouses.

Solution A: The self-join approach. This method joins the table to itself on matching region and earlier timestamps, counting distinct merchants per outer row. Pros: It requires no window function support and functions on legacy SQL-92 engines. Cons: The algorithm exhibits O(n²) complexity; for millions of rows, this generates intermediate cartesian products consuming terabytes of temporary storage and fails to complete within hours, making it operationally infeasible.

Solution B: The correlated scalar subquery. Here, the SELECT clause embeds a subquery: (SELECT COUNT(DISTINCT merchant_id) FROM merchant_signups m2 WHERE m2.region_code = m1.region_code AND m2.signup_timestamp <= m1.signup_timestamp). Pros: It is declarative and logically transparent to read. Cons: The subquery executes once per row (120 million times), preventing predicate pushdown and causing massive random I/O; database optimizers cannot decorrelate distinct aggregates across varying temporal ranges, resulting in estimated execution times exceeding 90 minutes.

Solution C: The ANSI SQL window function technique. Utilizing ROW_NUMBER() to identify first appearances followed by a running SUM() as shown in the code example above. Pros: This performs a single table scan with sorting, utilizing the optimizer's window spooling capabilities for O(n log n) complexity and bounded memory usage. Cons: It requires careful handling of temporal ties; if two signups share identical timestamps, non-deterministic ordering could double-count unless a unique tie-breaker (like event_id) is appended to the ORDER BY clause.

Chosen solution and result. Solution C was implemented. By including event_id in the ORDER BY to ensure deterministic first-appearance detection, the query executed in 4 minutes on the existing cluster—a 9x improvement. The result enabled real-time compliance dashboards, allowing risk officers to monitor onboarding diversity without ETL delays, and the query was fully portable to PostgreSQL, Snowflake, and BigQuery without modification.

What candidates often miss

Why does COUNT(DISTINCT column) OVER (ORDER BY ...) raise a syntax error in strict ANSI SQL?

The SQL standard explicitly disallows the DISTINCT keyword within the argument of a window aggregate function such as COUNT, SUM, or AVG. While specific vendors (e.g., PostgreSQL 16+, Oracle) offer this as a proprietary extension, ANSI SQL:2011 and earlier versions restrict window aggregates to operate on all rows within the defined frame. This limitation exists because maintaining a distinct-set hash table for every possible window frame during streaming evaluation is not mandated by the standard grammar. Candidates must recognize that DISTINCT is permitted only in standard aggregate functions lacking OVER clauses, or within inverse distribution functions like PERCENTILE_CONT, but never as a windowed distinct count.

How do you handle duplicate timestamps when determining the "first" occurrence of an entity?

ROW_NUMBER() assigns arbitrary values among ties unless the ORDER BY clause specifies a total ordering. If a merchant has two entries with identical timestamps, both rows could potentially receive rn = 1 if the ordering is non-deterministic, causing the cumulative count to increment twice erroneously. The resolution is to append a unique primary key or auto-incrementing ID to the ORDER BY clause: ORDER BY signup_timestamp, merchant_signup_id. This ensures deterministic sequencing where the earlier-assigned ID is considered the first occurrence, preserving the mathematical integrity of the running distinct count.

Can this technique be adapted for a moving distinct count over a fixed row-count window (e.g., last 100 transactions) rather than unbounded preceding?

No, not efficiently with pure ANSI SQL. The unbounded preceding method succeeds because distinctness is monotonic—once an entity appears, it remains "counted" forever. In a sliding window (e.g., ROWS BETWEEN 100 PRECEDING AND CURRENT ROW), an entity exiting the window must decrement the count, requiring knowledge of whether the departing row represents the only instance of that entity within the current frame. ANSI SQL lacks array aggregation or set-difference operators within window frames to track such egress efficiently. Implementing this requires either recursive CTEs (which degrade to O(n²) for this scenario) or proprietary extensions like ARRAY_AGG combined with set operations, both of which violate strict ANSI compliance.