Cumulative totals and running sums are traditionally solved in SQL using window functions (e.g., SUM() OVER(ORDER BY ...)), however, in earlier or simplified versions of DBMS, only subqueries and groupings are available. Historically, database architects sought workarounds until the SQL:2003 standard with support for window functions appeared.
Problem — in the absence of window functions, for each row, it is necessary to explicitly calculate the sum of all previous values, which leads to O(N^2) nested queries for sufficiently large datasets if tricks are not applied.
Solution:
Typically, correlated subqueries or temporary tables with value updating are used:
Sample code:
-- Cumulative sum using a correlated subquery SELECT t1.id, t1.amount, ( SELECT SUM(t2.amount) FROM transactions t2 WHERE t2.id <= t1.id ) AS running_total FROM transactions t1 ORDER BY t1.id; -- Through a temporary table with manual value updates CREATE TEMPORARY TABLE temp_running (id INT, amount INT, running_total INT); -- Iterate over rows using external code (e.g., pl/pgsql), sequentially adding sums
Key features:
Does an ORDER BY guarantee sorting in a correlated subquery?
No — the subquery does not necessarily affect the result itself. The sorting of the final selection is always set externally in the main query: the result depends solely on filtering by WHERE.
Can the calculation of the cumulative sum be parallelized in this approach?
No — the sequence is very important, especially when calculations depend on previous rows, which makes simple parallelization impossible in regular SQL.
Why is the correlated subquery so slow with a large number of rows?
Because for each row, the sum over the set of previous rows is recalculated. This leads to O(N^2) operations. With a sample of 100 thousand rows, this can already take minutes or even hours.
An analyst calculated daily cumulative revenue by date through a correlated subquery, while the table periodically had deleted ids (gaps). The final sum had sharp drops and depended not on the date but on the order of ids.
Pros:
Cons:
An engineer moved the cumulative sum processing to an ETL script (Python/pandas), then uploaded the final values to a separate table, synchronizing only new entries. The totals are always consistent by date, and the code works quickly and with millions of records.
Pros:
Cons: