History: In temporal data warehousing, the Last Observation Carried Forward (LOCF) technique dominates missing-value imputation, using preceding valid records to fill gaps. However, specific analytical domains—such as applying end-of-day reconciliations to intraday financial transactions or back-propagating laboratory confirmations to earlier provisional diagnoses—require the inverse Next Observation Carried Backward (NOCB) approach. Historically, NOCB was implemented via correlated subqueries or procedural cursors, both of which exhibit O(n²) complexity and fail to leverage modern set-based optimizers.
The Problem: Given a totally ordered sequence (e.g., event_time), each NULL value must be replaced with the nearest non-NULL value that occurs after it in the sequence. Consecutive NULLs preceding a valid record should all receive the same subsequent value. Standard functions like LEAD() read only a fixed offset (the immediate next row by default), so they fail when multiple consecutive NULLs precede a non-NULL anchor. Self-joins and recursive CTEs are prohibited by performance constraints.
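The LEAD() limitation is easy to reproduce. A minimal sketch using Python's bundled sqlite3 driver (SQLite ≥ 3.25 supports window functions); the table t and its values are invented for illustration:

```python
import sqlite3

# Illustrative table: two consecutive NULLs precede the anchor value 'X'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, None), (2, None), (3, "X")])

rows = conn.execute(
    "SELECT id, LEAD(val) OVER (ORDER BY id) FROM t ORDER BY id"
).fetchall()
print(rows)  # [(1, None), (2, 'X'), (3, None)]
```

Row 1's LEAD() returns row 2's NULL rather than the anchor 'X' on row 3, which is exactly the multiple-consecutive-NULL failure described above.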
The Solution: The solution exploits the NULL-ignoring semantics of COUNT(expression). By counting non-NULL values from the current row to the end of the partition (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING), we generate a stable "bucket identifier" that is identical for all rows between two non-NULL anchors. Within each bucket, MAX(val)—which also ignores NULLs—retrieves the anchor value and broadcasts it to all rows in that group.
WITH bucketed AS (
    SELECT record_id,
           event_time,
           status_code,
           -- Count of non-NULL values from here to the end of the partition;
           -- identical for every row sharing the same next non-NULL anchor.
           COUNT(status_code) OVER (
               ORDER BY event_time, record_id
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS bucket_id
    FROM audit_log
)
SELECT record_id,
       event_time,
       -- With PARTITION BY and no ORDER BY, the default frame is the whole
       -- partition, so MAX broadcasts the bucket's single non-NULL anchor.
       COALESCE(
           MAX(status_code) OVER (PARTITION BY bucket_id),
           'UNKNOWN'
       ) AS confirmed_status
FROM bucketed;
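To sanity-check the pattern end to end, here is a self-contained harness using Python's sqlite3 driver; the five sample rows are invented for illustration:

```python
import sqlite3

# Miniature audit_log: NULLs at ids 2-3 should inherit 'B' from id 4;
# the trailing NULL at id 5 has no future anchor and falls back to 'UNKNOWN'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (record_id INTEGER, event_time INTEGER, status_code TEXT)"
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?)",
    [(1, 100, "A"), (2, 200, None), (3, 300, None), (4, 400, "B"), (5, 500, None)],
)

filled = conn.execute("""
    WITH bucketed AS (
        SELECT record_id, event_time, status_code,
               COUNT(status_code) OVER (
                   ORDER BY event_time, record_id
                   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
               ) AS bucket_id
        FROM audit_log
    )
    SELECT record_id,
           COALESCE(MAX(status_code) OVER (PARTITION BY bucket_id), 'UNKNOWN')
    FROM bucketed
    ORDER BY record_id
""").fetchall()
print(filled)  # [(1, 'A'), (2, 'B'), (3, 'B'), (4, 'B'), (5, 'UNKNOWN')]
```

The bucket ids come out as 2, 1, 1, 1, 0: rows 2–4 share bucket 1, so MAX propagates 'B' backward to both NULLs, and the bucket-0 trailing NULL receives the COALESCE default.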
Context and Problem Description: A high-frequency trading firm maintains an execution table capturing microsecond-level equity trades. Due to exchange reporting protocols, the final "consolidated price" for any given minute—verified by the clearinghouse—arrives 30 seconds after the minute ends and is stamped only at the boundary (e.g., 14:30:00.000). For regulatory TWAP (Time-Weighted Average Price) calculations, every millisecond of that minute must reflect the final consolidated price, requiring backfill to all preceding 14:29:00.000 - 14:29:59.999 records. The daily volume exceeds 50 million rows, and the batch window is 10 minutes.
Solution 1: Correlated Scalar Subquery. This approach uses a scalar subquery for each row to locate the MIN(event_time) of future rows where consolidated_price IS NOT NULL, then joins back to retrieve that price.
Pros: Conceptually straightforward for developers with procedural backgrounds.
Cons: Executes O(n²) comparisons. On production data, query runtime exceeded 45 minutes, violating the batch window. Handling multiple consecutive NULLs requires additional logic to skip ahead, increasing complexity and error rates.
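For comparison, a sketch of the correlated-subquery variant described above, run against a tiny stand-in for the execution table (table name, columns, and values are hypothetical). Each outer row triggers a fresh scan for its nearest future non-NULL, which is the source of the O(n²) cost:

```python
import sqlite3

# Hypothetical trades table: consolidated_price arrives only at times 3 and 5.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (event_time INTEGER PRIMARY KEY, consolidated_price REAL)"
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [(1, None), (2, None), (3, 101.5), (4, None), (5, 102.0)],
)

# For every row, locate MIN(event_time) of the future non-NULL prices,
# then join back to fetch that price -- two nested scans per outer row.
rows = conn.execute("""
    SELECT t.event_time,
           (SELECT u.consolidated_price
              FROM trades u
             WHERE u.event_time = (SELECT MIN(v.event_time)
                                     FROM trades v
                                    WHERE v.event_time >= t.event_time
                                      AND v.consolidated_price IS NOT NULL)
           ) AS filled_price
      FROM trades t
     ORDER BY t.event_time
""").fetchall()
print(rows)  # [(1, 101.5), (2, 101.5), (3, 101.5), (4, 102.0), (5, 102.0)]
```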
Solution 2: Recursive CTE Traversal. A recursive CTE iterates backwards row-by-row, carrying the non-NULL price backward until encountering another non-NULL.
Pros: Works on any database that implements ANSI SQL recursive CTEs (standardized in SQL:1999).
Cons: Recursive CTEs process rows sequentially in many engines (e.g., PostgreSQL), resulting in single-threaded execution and potential stack overflow on deep partitions. Benchmarks showed a 20-minute runtime with high memory pressure, making it unsuitable for production SLAs.
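A sketch of this recursive traversal (invented table t). The one-row-per-iteration join structure makes the sequential, single-threaded nature of the approach explicit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1, None), (2, None), (3, "X"), (4, None)]
)

# Walk the rows in reverse id order, carrying the last non-NULL value backward;
# each recursive step advances exactly one row.
rows = conn.execute("""
    WITH RECURSIVE
    ordered AS (
        SELECT id, val, ROW_NUMBER() OVER (ORDER BY id DESC) AS rn FROM t
    ),
    walk(rn, id, carried) AS (
        SELECT rn, id, val FROM ordered WHERE rn = 1
        UNION ALL
        SELECT o.rn, o.id, COALESCE(o.val, w.carried)
          FROM ordered o
          JOIN walk w ON o.rn = w.rn + 1
    )
    SELECT id, carried FROM walk ORDER BY id
""").fetchall()
print(rows)  # [(1, 'X'), (2, 'X'), (3, 'X'), (4, None)]
```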
Solution 3: Window Function Bucketization (Chosen). Implement the COUNT and MAX pattern. The backward-looking COUNT creates identical buckets for all rows requiring the same future value, while MAX propagates that value within the bucket.
Pros: Fully set-based, parallelizable, and executes in O(n log n) time due to the sort operation. It scales near-linearly with volume and uses standard ANSI SQL portable across PostgreSQL, SQL Server, Oracle, and DB2.
Cons: Requires two passes over the data (the CTE and the outer query), though modern optimizers often fuse these. It mandates a total ordering; duplicate timestamps require a tie-breaker column to ensure determinism.
Result: The pipeline runtime dropped from 45 minutes to 8 seconds on the 50-million-row dataset. The firm eliminated a fragile Python backfill script, reducing infrastructure complexity and ensuring regulatory reports are generated within the compliance window.
Why must COUNT(column) be used instead of COUNT(*) or ROW_NUMBER() when constructing the grouping key?
Many candidates intuitively use COUNT(*) or ROW_NUMBER(), believing these can segment the data. COUNT(*) counts every row regardless of NULLs, producing a unique, monotonically changing value for each row in the backward frame, which prevents the formation of stable groups. ROW_NUMBER() assigns a unique identifier to each row, similarly destroying the grouping. Only COUNT(column) increments exclusively when encountering non-NULL values, thereby assigning the same "bucket ID" to all preceding NULLs until the next non-NULL boundary. This distinction is crucial because it leverages the NULL-ignoring semantics of aggregate window functions to simulate a "look-ahead" without procedural logic.
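The difference is visible side by side on a four-row toy table (illustrative data): only COUNT(column) yields a value that is stable across each NULL run and its anchor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1, "A"), (2, None), (3, None), (4, "B")]
)

rows = conn.execute("""
    SELECT id,
           COUNT(val) OVER w AS cnt_col,   -- stable per bucket: 2, 1, 1, 1
           COUNT(*)   OVER w AS cnt_star,  -- unique per row:    4, 3, 2, 1
           ROW_NUMBER() OVER (ORDER BY id) AS rn
      FROM t
    WINDOW w AS (ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 2, 4, 1), (2, 1, 3, 2), (3, 1, 2, 3), (4, 1, 1, 4)]
```

Rows 2–4 (the two NULLs plus their anchor 'B') share cnt_col = 1, while cnt_star and rn are unique on every row and therefore cannot form a group.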
How does the query behave if the partition ends with trailing NULL values, and what modification ensures deterministic handling when no future observation exists?
If the final rows in the ordered partition are NULL, COUNT(status_code) evaluates to zero for those rows. Consequently, MAX(status_code) returns NULL, which is logically correct—no future observation exists to carry backward. Candidates often forget to handle this in downstream business logic. To provide a default value (e.g., a static placeholder or a value from an external lookup), one must wrap the result in COALESCE. Furthermore, to distinguish between "filled NULL" and "unfillable NULL" for data quality monitoring, one should compare the original and filled values: CASE WHEN status_code IS NULL AND bucket_id = 0 THEN 'UNCONFIRMED' END.
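The trailing-NULL behavior and the bucket_id = 0 data-quality check can be sketched together (toy table; the 'DEFAULT' placeholder and the 'OK' flag label are illustrative, while 'UNCONFIRMED' follows the text above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "A"), (2, None), (3, None)])

rows = conn.execute("""
    WITH bucketed AS (
        SELECT id, val,
               COUNT(val) OVER (
                   ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
               ) AS bucket_id
        FROM t
    )
    SELECT id,
           -- bucket 0 has no anchor, so MAX is NULL and COALESCE supplies a default
           COALESCE(MAX(val) OVER (PARTITION BY bucket_id), 'DEFAULT') AS filled,
           CASE WHEN val IS NULL AND bucket_id = 0
                THEN 'UNCONFIRMED' ELSE 'OK' END AS dq_flag
      FROM bucketed
     ORDER BY id
""").fetchall()
print(rows)  # [(1, 'A', 'OK'), (2, 'DEFAULT', 'UNCONFIRMED'), (3, 'DEFAULT', 'UNCONFIRMED')]
```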
What determinism issue arises if the ORDER BY clause contains duplicate values, and why does switching from ROWS to RANGE exacerbate the problem?
When ordering keys contain duplicates (ties), the window frame definition becomes ambiguous. Using ROWS (physical offsets) assigns groups based on physical table order, which is arbitrary unless a unique secondary sort key is provided. Switching to RANGE (logical value ranges) treats all rows with the same ordering value as peers, causing them to share the same frame. In this solution, if multiple rows share the same event_time, RANGE might incorrectly group NULL rows with non-NULL rows from the same timestamp or split groups unpredictably. Candidates must ensure a total ordering by appending a unique key (e.g., record_id) to the ORDER BY clause: ORDER BY event_time, record_id to guarantee deterministic bucket assignment across all ANSI SQL implementations.
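The hazard can be demonstrated with a duplicate timestamp (toy data): the NULL at id 3 shares ts = 2 with the anchor 'B' at id 2, but under the intended (ts, id) ordering it should be filled by the later 'C'. ROWS with the tie-breaker gets this right; RANGE treats the two ts = 2 rows as peers and lumps the NULL into the earlier bucket. The fill helper is a hypothetical convenience for running both frame definitions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, ts INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, None), (2, 2, "B"), (3, 2, None), (4, 3, "C")],
)

def fill(frame_clause):
    # Hypothetical helper: runs the bucketization with the given frame definition.
    return conn.execute(f"""
        WITH bucketed AS (
            SELECT id, val,
                   COUNT(val) OVER ({frame_clause}) AS bucket_id
            FROM t
        )
        SELECT id, MAX(val) OVER (PARTITION BY bucket_id)
        FROM bucketed ORDER BY id
    """).fetchall()

rows_frame = fill("ORDER BY ts, id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING")
range_frame = fill("ORDER BY ts RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING")
print(rows_frame)   # [(1, 'B'), (2, 'B'), (3, 'C'), (4, 'C')] -- id 3 correctly gets 'C'
print(range_frame)  # [(1, 'B'), (2, 'B'), (3, 'B'), (4, 'C')] -- id 3 wrongly grouped with 'B'
```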