
When evaluating a **window function** with a **RANGE** frame over **interval** types in **PostgreSQL**, why does the planner materialize the full partition into memory, and which specific property of the **ORDER BY** expression allows you to substitute **ROWS** framing to achieve constant memory usage without changing the result?


Answer

PostgreSQL implements RANGE framing by evaluating logical value offsets from the current row’s ordering column. When the frame bounds involve an interval type (e.g., INTERVAL '1 hour' PRECEDING), the executor cannot determine frame membership from simple physical row counts, because the number of rows falling within that time window varies across the dataset. To ensure correctness, the engine materializes the sorted partition into a worktable (held in work_mem or spilled to disk) and scans around each current row to find the rows whose ordering values fall within the specified range, resulting in O(partition size) memory complexity.
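The variable frame size is easy to see on a small dataset. The sketch below uses Python's bundled sqlite3 rather than PostgreSQL (SQLite implements the same SQL-standard frame semantics, though numeric offsets must stand in for intervals since SQLite has no interval type; RANGE offsets require SQLite 3.28+). The `ticks` table and its timestamps are invented for illustration: during a burst, a 50-unit window covers four rows, while in quiet stretches it covers only the current row.

```python
import sqlite3

# Hypothetical bursty event stream: integer timestamps stand in for
# nanosecond/interval timestamps, since SQLite lacks an interval type.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ticks (ts INTEGER)")
con.executemany("INSERT INTO ticks VALUES (?)",
                [(0,), (10,), (20,), (30,), (1000,), (5000,)])

# COUNT(*) exposes how many rows each RANGE frame actually contains:
# the frame membership depends on values, not on physical positions.
rows = con.execute("""
    SELECT ts,
           COUNT(*) OVER (ORDER BY ts
                          RANGE BETWEEN 50 PRECEDING AND CURRENT ROW) AS n
    FROM ticks
    ORDER BY ts
""").fetchall()
# Burst rows see frames of 1..4 rows; isolated rows see only themselves.
```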

You can safely substitute ROWS framing only when the ORDER BY expression constitutes a unique key for every row within the partition. If the ordering column contains no duplicates (or is extended with a secondary unique column such as a primary key), every row is its own peer group, so frames bounded by UNBOUNDED PRECEDING and CURRENT ROW select exactly the same rows under either framing; for value-offset bounds such as INTERVAL '1 hour' PRECEDING, a fixed row count additionally matches only if the ordering values are evenly spaced. This uniqueness guarantee ensures that the frame contains exactly the intended rows without requiring the engine to scan for value-matching peers, enabling a streaming execution model using a fixed-size buffer with O(frame size) memory.
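The equivalence under a unique key can be checked directly. A minimal sketch, again using Python's bundled sqlite3 as a stand-in for PostgreSQL (window functions need SQLite 3.25+); the table `t` and its values are invented. With a unique `id`, the RANGE and ROWS running sums are identical row for row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, amount INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, 10), (2, 20), (3, 30), (4, 40)])

# id is unique, so every row is its own peer group: the value-based
# (RANGE) and position-based (ROWS) frames select the same rows.
rows = con.execute("""
    SELECT SUM(amount) OVER (ORDER BY id
                             RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
           SUM(amount) OVER (ORDER BY id
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM t
    ORDER BY id
""").fetchall()
assert all(range_sum == rows_sum for range_sum, rows_sum in rows)
```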

A real-world scenario

A high-frequency trading platform processed nanosecond-precision market tick data, requiring a moving average of bid-ask spreads over the previous 50 milliseconds. The initial analytics query used AVG(spread) OVER (PARTITION BY symbol ORDER BY nanos_ts RANGE BETWEEN INTERVAL '50 ms' PRECEDING AND CURRENT ROW). During market volatility, this exhausted work_mem, forcing PostgreSQL to spill worktables to disk and degrading query latency from milliseconds to tens of seconds, which was unacceptable for real-time algorithmic trading.

The engineering team first considered vertically scaling the database servers to provision sufficient RAM to hold the largest partitions (high-volume symbols) entirely in memory. While this would eliminate disk spilling, the cost was prohibitive; the largest symbols contained hundreds of millions of ticks, requiring terabytes of RAM per database connection, and the solution did not scale horizontally to thousands of concurrent trading algorithms.

A second proposal suggested approximating the 50-millisecond window by using a fixed ROWS offset calculated from average tick density (e.g., assuming 1000 rows equaled 50ms). This approach would guarantee constant memory usage regardless of partition size. However, tick density varies wildly during market crashes (thousands of ticks per millisecond) versus quiet periods (minutes between ticks), making the row count approximation arbitrarily inaccurate and potentially violating financial regulations requiring precise time-window calculations for audit trails.

The chosen solution exploited the fact that nanos_ts combined with tick_id formed a composite unique key. The team reformulated the query to use ORDER BY nanos_ts, tick_id and switched to ROWS BETWEEN 1000 PRECEDING AND CURRENT ROW. Because the composite key eliminated peer groups, the ROWS frame was fully deterministic, and under normal market conditions the 1000-row window tracked the logical 50-millisecond boundary closely enough to keep the calculation accurate while allowing PostgreSQL to stream rows through a bounded buffer. Query latency dropped to sub-millisecond levels, memory footprint stabilized at the fixed O(frame size) buffer, and the system handled billion-row partitions without spilling to disk.

What candidates often miss

Why does the default frame clause (RANGE UNBOUNDED PRECEDING) produce different running totals than ROWS UNBOUNDED PRECEDING when the ORDER BY column contains duplicate values?

When a window function omits an explicit frame clause, PostgreSQL defaults to RANGE UNBOUNDED PRECEDING. This mode treats all rows sharing the same ORDER BY value as a single peer group, including them all in the frame simultaneously. Consequently, if a user has three transactions on the same day, the running sum for all three rows will be identical, showing the total of all three plus previous days. In contrast, ROWS UNBOUNDED PRECEDING calculates the sum progressively: the first transaction of the day includes only itself plus prior days, the second includes the first two, and so on. Candidates often miss this default behavior, leading to reports where intra-day running totals appear "stuck" at the day's final total for all rows of that day, breaking time-series analytics.
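The "stuck" running total is easy to reproduce. A minimal sketch using Python's bundled sqlite3 (the default-frame semantics are the same as PostgreSQL's); the `tx` table with one transaction on day 1 and three on day 2 is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tx (day INTEGER, amount INTEGER)")
con.executemany("INSERT INTO tx VALUES (?, ?)",
                [(1, 10), (2, 20), (2, 30), (2, 40)])

rows = con.execute("""
    SELECT day,
           -- no frame clause: defaults to RANGE UNBOUNDED PRECEDING,
           -- so all peers (same day) share one value
           SUM(amount) OVER (ORDER BY day) AS range_sum,
           SUM(amount) OVER (ORDER BY day
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND CURRENT ROW) AS rows_sum
    FROM tx
    ORDER BY day
""").fetchall()
# All three day-2 rows show range_sum = 100 ("stuck" at the day total),
# while rows_sum advances row by row through the duplicates.
```

Note that rows_sum is also nondeterministic across the day-2 duplicates: with ties in the ORDER BY, the progressive values 30, 60, 100 can land on the tied rows in any order, which is another reason to extend the ordering with a unique key.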

How does PostgreSQL handle NULL values in the ORDER BY column when evaluating RANGE frames, and why can this cause rows to be silently omitted from calculations?

In SQL three-valued logic, comparisons with NULL yield UNKNOWN, not equality. For RANGE framing, PostgreSQL typically excludes rows with NULL ordering values from finite range windows (e.g., BETWEEN 1 PRECEDING AND 1 FOLLOWING) because the arithmetic comparisons against NULL fail. These rows may form isolated peer groups that are invisible to adjacent rows' frames. If a dataset contains NULL timestamps (representing legacy or pending data), a moving average using RANGE will silently drop these rows, whereas ROWS framing would include them based on physical position regardless of the NULL value, potentially skewing analytical aggregates.
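The divergence shows up even on three rows. A minimal sketch using Python's bundled sqlite3 (RANGE offsets require SQLite 3.28+; PostgreSQL behaves the same way for these frames); the `m` table, with one NULL timestamp representing pending data, is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (ts INTEGER, v INTEGER)")
con.executemany("INSERT INTO m VALUES (?, ?)",
                [(None, 100), (1, 10), (2, 20)])

rows = con.execute("""
    SELECT ts,
           -- value-based frame: NULL rows are peers only of each other
           COUNT(*) OVER (ORDER BY ts
                          RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS n_range,
           -- position-based frame: the NULL row participates like any other
           COUNT(*) OVER (ORDER BY ts
                          ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS n_rows
    FROM m
    ORDER BY ts
""").fetchall()
# Under RANGE, the NULL row's frame contains only itself, and the
# non-NULL rows' frames never include it; under ROWS it is counted
# purely by physical adjacency.
```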

When the ORDER BY column is guaranteed unique, why is explicit ROWS framing still preferable to RANGE for large datasets, and what internal operation does this avoid?

Even when uniqueness ensures semantic equivalence between ROWS and RANGE, the presence of RANGE framing forces the executor to read ahead and compare ordering values to detect peer-group boundaries, buffering the sorted partition in a worktable (consuming O(N) memory in the worst case) before rows can be emitted. By explicitly declaring ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, you signal to the planner that only physical row positions matter. This lets the WindowAgg node stream rows through a fixed-size buffer, avoiding the costly buffering step and reducing memory usage to O(frame size), which is critical for processing billion-row partitions without disk spilling.