History: Before ANSI SQL:2003 introduced analytical functions, filling sparse time-series data required inefficient self-joins or procedural cursors that processed rows individually. The LOCF pattern originated in statistical packages like SAS and R, where carrying forward the last known observation is a standard data cleaning technique. Database vendors later implemented this logic in SQL through window functions, with the IGNORE NULLS clause formalized in ANSI SQL:2011 specifically to handle such gaps declaratively.
Problem: Sensor networks and financial trading systems frequently generate NULL values due to transmission failures or non-trading hours. Simple LAG functions fail because they return the immediate predecessor, which might also be NULL, creating gaps in the calculated metrics. The challenge requires scanning backwards through an ordered partition until encountering the most recent non-null value, without using self-joins that degrade performance quadratically.
Solution: Use the LAST_VALUE window function with the IGNORE NULLS option and a frame specification extending from the partition start to the current row. With NULLs ignored, the function looks back through the gap and returns the most recent valid observation in a single ordered pass. Note that vendors disagree on where IGNORE NULLS goes: Oracle and BigQuery accept it inside the parentheses, while SQL Server and Snowflake place it after them. For systems lacking IGNORE NULLS, a workaround uses a running COUNT of non-null values to create stable carry groups, though this requires a derived table or subquery.
SELECT
    device_id,
    reading_time,
    temperature,
    LAST_VALUE(temperature IGNORE NULLS) OVER (
        PARTITION BY device_id
        ORDER BY reading_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS locf_temperature
FROM sensor_readings;
A healthcare analytics platform monitors continuous glucose levels for diabetic patients using wearable devices. Due to Bluetooth interference, approximately 12% of readings arrive as NULL, but clinicians require complete curves for insulin dosage calculations where interpolation could be medically unsafe. Accurate LOCF logic is essential because missing values during sleep or meals could trigger false hypoglycemia alerts.
Solution A: Cursor-based procedural update. A PL/SQL stored procedure iterates through patient records chronologically, maintaining a session variable to store the last valid glucose reading and updating NULL rows immediately. Pros: compatible with legacy Oracle versions predating window function support; easy to understand for developers with imperative programming backgrounds. Cons: row-by-row processing creates excessive I/O and table locking; processing 10 million rows requires 45 minutes, making real-time dashboards impossible.
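The carry-forward logic of Solution A can be sketched in a few lines of Python (illustrative only; the production version described above would live in a PL/SQL procedure that updates rows in place):

```python
def locf(readings):
    """Carry the last non-null value forward through an ordered sequence.

    Mirrors the cursor approach: one variable holds the last valid
    reading; each NULL (None) row receives that value. Leading NULLs
    stay NULL because nothing has been observed yet.
    """
    last_valid = None
    filled = []
    for value in readings:
        if value is not None:
            last_valid = value  # remember the most recent valid reading
        filled.append(last_valid)
    return filled

print(locf([110, None, None, 124, None]))  # [110, 110, 110, 124, 124]
```

The loop itself is linear; the cost in the database version comes from issuing one UPDATE per NULL row, not from the algorithm.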
Solution B: Self-join with correlated subquery. The query performs a left join to find the maximum timestamp less than the current row's where glucose IS NOT NULL, effectively looking up the previous valid value for each gap. Pros: declarative SQL without procedural code; runs on any ANSI SQL-92 compliant system. Cons: O(n²) complexity causes quadratic slowdown as data grows; the query times out after 6 hours on production datasets due to repeatedly re-scanning earlier rows.
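A minimal sketch of the Solution B pattern, run against an in-memory SQLite database (table and column names are illustrative, not from any real schema). The correlated subquery that re-scans earlier rows once per outer row is exactly what drives the quadratic cost:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (patient_id INT, ts INT, glucose REAL);
    INSERT INTO readings VALUES
        (1, 1, 110), (1, 2, NULL), (1, 3, NULL), (1, 4, 124), (1, 5, NULL);
""")
rows = conn.execute("""
    SELECT r.ts,
           COALESCE(r.glucose,
                    (SELECT r2.glucose            -- re-runs for every row
                     FROM readings r2
                     WHERE r2.patient_id = r.patient_id
                       AND r2.ts < r.ts
                       AND r2.glucose IS NOT NULL
                     ORDER BY r2.ts DESC
                     LIMIT 1)) AS locf_glucose
    FROM readings r
    ORDER BY r.ts;
""").fetchall()
print(rows)  # [(1, 110.0), (2, 110.0), (3, 110.0), (4, 124.0), (5, 124.0)]
```

On five rows this is instant; on ten million rows the inner scan repeated per row is what produces the multi-hour runtimes described above.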
Solution C: Window function with IGNORE NULLS. Implements LAST_VALUE(glucose IGNORE NULLS) partitioned by patient and ordered by time, requiring only a single ordered pass over the data. Pros: O(n log n) complexity executes in 28 seconds on the same 10-million-row dataset; minimal memory footprint and no locking issues. Cons: requires native IGNORE NULLS support, which the team's PostgreSQL 9.5 instance lacks (PostgreSQL did not offer IGNORE NULLS for many releases after 9.5 either), so adopting it meant migrating to an engine with native support or emulating it with the COUNT-based workaround.
The team selected Solution C after determining that the migration cost was justified by the roughly 99% reduction in query time. The implementation enabled real-time glucose alerts and reduced server CPU utilization by 94%. Consequently, the clinic successfully monitored 50,000 concurrent patients without latency issues or missed critical glucose spikes.
Question 1: Why does LAST_VALUE without IGNORE NULLS return NULL even when previous non-null values exist in the partition?
By default, the frame ends at CURRENT ROW, so LAST_VALUE sees the current row as the last row of the window; when that row holds NULL, NULL is what comes back. Candidates often assume the function scans backward until it finds data, but without IGNORE NULLS it treats NULL as a perfectly valid value. In effect, LAST_VALUE over ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW simply returns the current row's own value unless explicitly told to disregard NULLs.
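SQLite supports LAST_VALUE but not IGNORE NULLS, which makes it a convenient place to reproduce this pitfall (table t and its columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (ts INT, v INT);
    INSERT INTO t VALUES (1, 10), (2, NULL), (3, 30);
""")
rows = conn.execute("""
    SELECT ts,
           LAST_VALUE(v) OVER (
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS last_v  -- frame ends at the current row
    FROM t ORDER BY ts;
""").fetchall()
print(rows)  # [(1, 10), (2, None), (3, 30)]
```

The row at ts=2 yields NULL even though a non-null value at ts=1 is inside the frame, because the last row of the frame is the current (NULL) row.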
Question 2: How can you implement LOCF in pre-2011 ANSI SQL without IGNORE NULLS, and what is the logical error in using ROW_NUMBER differences instead of COUNT?
You can use COUNT(non_null_col) OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING) to create a grouping identifier that increments only when a non-null value is encountered. All subsequent NULLs share that count, forming a carry group; a MAX or FIRST_VALUE computed per group then recovers the carried value. Candidates sometimes try subtracting ROW_NUMBER() OVER (ORDER BY ...) from ROW_NUMBER() OVER (PARTITION BY non_null_flag ORDER BY ...). This fails because it creates a new group at every gap between non-nulls rather than extending the previous group forward. The COUNT method works because it produces a stable identifier for the entire span governed by the last known value.
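The COUNT workaround can be verified on SQLite, which has window functions but no IGNORE NULLS (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (ts INT, v INT);
    INSERT INTO t VALUES (1, 10), (2, NULL), (3, NULL), (4, 40), (5, NULL);
""")
rows = conn.execute("""
    SELECT ts,
           MAX(v) OVER (PARTITION BY grp) AS locf_v  -- one non-null per group
    FROM (
        SELECT ts, v,
               COUNT(v) OVER (
                   ORDER BY ts
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS grp  -- COUNT skips NULLs, so it only increments on data
        FROM t
    )
    ORDER BY ts;
""").fetchall()
print(rows)  # [(1, 10), (2, 10), (3, 10), (4, 40), (5, 40)]
```

Each carry group contains exactly one non-null value (its leading row), so MAX merely retrieves it; leading NULLs land in group 0 and stay NULL.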
Question 3: When using RANGE instead of ROWS framing for LOCF on timestamps with duplicates, why might the results become non-deterministic?
RANGE framing groups rows with identical ORDER BY values into peer groups and treats them as a single unit. If multiple sensor readings share the same millisecond timestamp, RANGE UNBOUNDED PRECEDING cannot distinguish their physical order: every peer sees the same frame, so when some duplicates contain NULLs and others contain values, which value the function reports depends on the order in which the execution plan happens to deliver the peers. ROWS framing evaluates each row at its physical position, so the result is at least reproducible for a given row order, but the standard does not guarantee tie order either; the robust fix is to add a unique tiebreaker (such as an insertion sequence number) to the ORDER BY. This distinction is critical for high-frequency trading data where microseconds matter.
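The peer-group effect can be observed on SQLite with two readings sharing a timestamp (schema is illustrative). With RANGE, both peers receive the same, order-dependent result; with ROWS, each row is pinned to its own physical position:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (ts INT, v INT);
    INSERT INTO t VALUES (1, 10), (2, NULL), (2, 20);  -- duplicate ts=2
""")
rows = conn.execute("""
    SELECT ts, v,
           LAST_VALUE(v) OVER (
               ORDER BY ts
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS range_last,  -- frame includes ALL ts=2 peers
           LAST_VALUE(v) OVER (
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_last    -- frame ends at this physical row
    FROM t ORDER BY ts, v;
""").fetchall()
for r in rows:
    print(r)
```

Both ts=2 rows get an identical range_last (whichever peer the engine happens to order last), while rows_last is simply each row's own value, NULL included.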