The introduction of parallel query in PostgreSQL 9.6 brought the Gather node, which funnels results from background workers into the leader process. However, Gather emits tuples in whatever order the workers happen to deliver them, discarding any per-worker ordering and necessitating an expensive final Sort step in the leader to re-establish sequence. To eliminate this redundancy when the parallel subplans already produce ordered streams, version 10 introduced the Gather Merge node, which performs a k-way merge of the workers' sorted outputs, bypassing the need for leader-side materialization and sorting.
The planner considers Gather Merge only when the parallel subplan guarantees output sorted on the required keys, typically produced by Index Scans or Merge Joins that preserve tuple sequence. If the subplan loses ordering through operations like Hash Joins or unordered aggregation, Gather Merge becomes ineligible, forcing the optimizer to choose between a Gather followed by a costly Sort or abandoning parallelism entirely to maintain order in a single process.
When the subplan guarantees ordered output, Gather Merge allows the leader to perform a streaming merge using minimal memory rather than materializing and sorting all tuples. The memory strategy shifts from a single large sort allocation in the leader to a small read buffer per worker plus a heap holding one tuple per stream, significantly reducing the risk of work_mem exhaustion and disk spills during large-scale ordered retrievals.
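The leader-side mechanics can be sketched as follows. This is a minimal Python simulation of the streaming k-way merge, not PostgreSQL code; the stream names and `gather_merge` function are illustrative only. The key property is that only one buffered tuple per worker stream is resident at any moment.

```python
import heapq

def gather_merge(worker_streams):
    """Merge already-sorted per-worker tuple streams into one ordered stream.

    heapq.merge keeps a heap with a single pending tuple per input stream,
    which is why the leader never needs a large sort allocation.
    """
    yield from heapq.merge(*worker_streams)

# Three simulated "workers", each emitting (timestamp, reading) tuples
# that are already sorted within the stream.
workers = [
    iter([(1, "a"), (4, "d"), (7, "g")]),
    iter([(2, "b"), (5, "e")]),
    iter([(3, "c"), (6, "f")]),
]
print(list(gather_merge(workers)))
# Timestamps come out globally ordered: 1, 2, 3, 4, 5, 6, 7
```

The merge consumes each stream lazily, so memory stays proportional to the number of workers rather than the number of tuples.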
Our team managed a time-series analytics platform storing sensor readings in a PostgreSQL table partitioned by hour, containing over 2 billion rows. A critical dashboard required displaying the latest 1000 readings across all partitions sorted by timestamp descending, with a latency budget under 500 milliseconds. The initial single-threaded query plan failed to meet these requirements, creating a bottleneck in the user experience during peak analytical loads.
Single-process Index Scan: We initially considered a backward Index Scan on each partition, executed sequentially and capped by a Limit node. This approach offered implementation simplicity and deterministic ordering without complex parallel coordination. However, it failed to saturate the I/O bandwidth of our NVMe storage array and consistently exceeded 2 seconds during peak load, rendering it unacceptable for real-time dashboard updates.
Parallel Seq Scan with Gather and Sort: The second approach involved raising max_parallel_workers_per_gather and using a Parallel Seq Scan under a standard Gather node, collecting all rows into the leader for a final Sort and Limit. This leveraged CPU parallelism and improved scan throughput significantly. Nevertheless, it caused the leader process to consume over 4 GB sorting millions of rows, frequently triggering disk spills and out-of-memory errors on our constrained leader node, which compromised system stability.
Parallel Index Scan with Gather Merge: We ultimately selected a plan where workers performed Parallel Index Scans in descending timestamp order, feeding into a Gather Merge node. Workers scanned index leaf pages in the required sequence, streaming sorted tuples to the leader, which performed a lightweight k-way merge to extract the top 1000 rows. This architecture eliminated the need for a final sort in the leader, drastically reducing memory pressure while maintaining streaming efficiency.
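The Limit-over-Gather-Merge pattern can be illustrated with a short Python sketch. This is a hypothetical simulation, not executor code: each stream stands in for a worker doing a backward (descending-timestamp) Parallel Index Scan, and the leader stops pulling as soon as the top n tuples have been merged.

```python
import heapq
from itertools import islice

def top_n_descending(worker_streams, n):
    # reverse=True tells heapq.merge that each input arrives in descending
    # order, mirroring workers scanning the index backwards. islice stops
    # the merge after n tuples, so the remaining rows are never fetched.
    return list(islice(heapq.merge(*worker_streams, reverse=True), n))

readings_by_worker = [
    iter([900, 700, 100]),   # each stream already sorted descending
    iter([800, 600, 200]),
    iter([500, 400, 300]),
]
print(top_n_descending(readings_by_worker, 4))
# → [900, 800, 700, 600]
```

Because the merge is lazy, a LIMIT 1000 touches only slightly more than 1000 tuples in total across all streams, which is what kept the leader's memory footprint small in the plan described above.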
We selected the Gather Merge approach because it uniquely satisfied both the latency and memory constraints by leveraging the existing index structure rather than fighting it with hash-based operations. This solution reduced the leader's memory footprint to under 64MB for the merge buffers and achieved consistent sub-300ms response times. The system now handles peak loads without memory exhaustion, validating the architectural choice to preserve ordering through parallel execution.
Why does placing a Hash Aggregate below a Gather Merge node cause the PostgreSQL planner to either reject the plan or insert an explicit Sort step, and how does this differ from GroupAggregate behavior?
Hash Aggregate builds an unordered hash table to group tuples, which inherently destroys any input sequence produced by underlying scans. Since Gather Merge requires strictly ordered input streams from all parallel workers to perform its streaming k-way merge, unordered output from aggregation blocks its direct usage. Conversely, GroupAggregate can operate on pre-sorted inputs and preserve tuple ordering when the GROUP BY keys match the sort order, making it compatible with Gather Merge without requiring an intermediate sort step.
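The contrast can be shown with a small Python sketch; it is illustrative only and not PostgreSQL code. A hash-based aggregate emits groups in hash-table order with no guarantee tied to the key's sort order, while a group-aggregate over sorted input emits keys in the input's order, so the ordering property survives.

```python
from itertools import groupby

rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]  # already sorted on the key

def hash_aggregate(rows):
    # Groups via a hash table. Python dicts happen to preserve insertion
    # order, but a real hash table offers no ordering guarantee relative
    # to the key's sort order, which is the planner's concern.
    table = {}
    for key, val in rows:
        table[key] = table.get(key, 0) + val
    return list(table.items())

def group_aggregate(sorted_rows):
    # Consumes sorted input sequentially; output keys appear in the input's
    # sort order, so the ordering property is preserved through aggregation.
    return [(key, sum(v for _, v in grp))
            for key, grp in groupby(sorted_rows, key=lambda r: r[0])]

print(group_aggregate(rows))  # keys emerge in sorted order: a, b, c
```

This is why a GroupAggregate whose GROUP BY keys match the sort order can sit below Gather Merge, while a Hash Aggregate cannot without an intervening Sort.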
How does the parallel_tuple_cost GUC influence the threshold at which the planner switches from a Gather plan to a Gather Merge plan when estimating the cost of merging sorted streams from eight parallel workers?
parallel_tuple_cost adds a per-tuple CPU overhead for transferring rows between parallel workers and the leader process. For Gather Merge, this cost is slightly higher than for a standard Gather node due to the additional comparison logic required to maintain the merge heap. When the estimated result set is small, the planner may favor a Gather node coupled with a cheap Sort in the leader over Gather Merge, because the cumulative overhead of eight merge streams can exceed the cost of sorting a small batch of tuples centrally.
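The trade-off can be made concrete with a toy cost comparison. This is not the planner's real cost model: parallel_tuple_cost's 0.1 default is genuine, but the merge-heap and sort-comparison constants below are invented for illustration.

```python
import math

PARALLEL_TUPLE_COST = 0.1     # default value of the parallel_tuple_cost GUC
MERGE_HEAP_COST = 0.05        # assumed extra per-tuple heap-comparison cost
SORT_COMPARISON_COST = 0.005  # assumed per-comparison cost of a leader sort

def gather_plus_sort_cost(n_tuples):
    # Transfer every tuple, then sort centrally: ~n * log2(n) comparisons.
    transfer = n_tuples * PARALLEL_TUPLE_COST
    sort = n_tuples * math.log2(max(n_tuples, 2)) * SORT_COMPARISON_COST
    return transfer + sort

def gather_merge_cost(n_tuples):
    # Transfer cost plus merge-heap maintenance on every tuple.
    return n_tuples * (PARALLEL_TUPLE_COST + MERGE_HEAP_COST)

# With these constants, Gather + Sort is cheaper for small result sets,
# and Gather Merge wins once the n*log2(n) sort term dominates.
print(gather_plus_sort_cost(100) < gather_merge_cost(100))
print(gather_plus_sort_cost(100_000) < gather_merge_cost(100_000))
```

The crossover point moves as parallel_tuple_cost is tuned: raising it penalizes both parallel plans, while the fixed per-tuple merge overhead keeps Gather Merge unattractive for small estimated row counts.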
What specific limitation arises when using DECLARE CURSOR with the SCROLL option over a query plan containing a Gather Merge node, and why might the executor silently materialize the entire result set despite the streaming nature of the merge?
SCROLL cursors require the ability to move backwards through the result set, which necessitates materializing rows in work_mem or spilling to disk to support backward fetching. Although Gather Merge produces a streaming, ordered output efficiently, the SCROLL option forces the executor to insert a Materialize node above the Gather Merge to buffer rows for potential reverse traversal. This materialization consumes memory proportional to the result set size, effectively negating the memory efficiency of the streaming merge and potentially causing the very disk spills that Gather Merge was chosen to avoid.
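The buffering requirement can be sketched in a few lines of Python. The `MaterializedCursor` class and its method names are illustrative, not executor APIs: the point is that every forward fetch must be retained so that a later backward fetch can replay it, so the buffer grows with the number of rows fetched, not the number of streams.

```python
class MaterializedCursor:
    """Toy model of a SCROLL cursor over a streaming (e.g. merged) source."""

    def __init__(self, stream):
        self._stream = iter(stream)  # stands in for the Gather Merge output
        self._buffer = []            # grows with every forward fetch
        self._pos = 0                # index of the next row to fetch forward

    def fetch_forward(self):
        if self._pos == len(self._buffer):
            # Pull one row from the stream and retain it: this retention is
            # what makes memory grow proportionally to the result set.
            self._buffer.append(next(self._stream))
        row = self._buffer[self._pos]
        self._pos += 1
        return row

    def fetch_backward(self):
        # Step back one row and return it from the retained buffer; without
        # the buffer, the streaming source could not be rewound.
        self._pos -= 2
        row = self._buffer[self._pos]
        self._pos += 1
        return row
```

A NO SCROLL cursor avoids this retention entirely, which is why declaring cursors over Gather Merge plans as NO SCROLL preserves the streaming memory profile.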