
How to implement cumulative totals calculation in SQL without window functions, considering performance on thousands or millions of rows?


Answer

Cumulative totals and running sums are traditionally computed in SQL with window functions (e.g., SUM() OVER (ORDER BY ...)). However, older or stripped-down DBMS versions offer only subqueries and grouping. Historically, database developers relied on workarounds until the SQL:2003 standard introduced window function support.

Problem: without window functions, each row must explicitly sum all preceding values, which degenerates into O(N^2) work on sufficiently large datasets unless tricks are applied.

Solution:

Typically, correlated subqueries or temporary tables with value updating are used:

Sample code:

-- Cumulative sum using a correlated subquery
SELECT t1.id,
       t1.amount,
       (SELECT SUM(t2.amount)
        FROM transactions t2
        WHERE t2.id <= t1.id) AS running_total
FROM transactions t1
ORDER BY t1.id;

-- Via a temporary table with manual value updates
CREATE TEMPORARY TABLE temp_running (id INT, amount INT, running_total INT);
-- Iterate over the rows in external code (e.g., PL/pgSQL), adding to the running sum sequentially
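The correlated-subquery variant can be checked end to end with SQLite from Python; the table name matches the sample above, and the data is purely illustrative:

```python
import sqlite3

# In-memory table with a few sample transactions (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, 100), (2, 50), (3, 200), (4, 25)])

# Correlated subquery: for each row, re-sum all rows with id <= the current id
rows = conn.execute("""
    SELECT t1.id, t1.amount,
           (SELECT SUM(t2.amount)
            FROM transactions t2
            WHERE t2.id <= t1.id) AS running_total
    FROM transactions t1
    ORDER BY t1.id
""").fetchall()

for row in rows:
    print(row)  # (1, 100, 100), (2, 50, 150), (3, 200, 350), (4, 25, 375)
```

The running_total column grows monotonically here only because id order matches insertion order; the "Key features" below spell out when that assumption holds.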

Key features:

  • The method works only with a unique, stable sort key (id, creation date)
  • The correlated subquery scales poorly: execution time grows quadratically with row count
  • For large data volumes it makes more sense to move the aggregation into an ETL step outside SQL or into stored procedures

Tricky questions

Does an ORDER BY guarantee sorting in a correlated subquery?

No. An ORDER BY inside the subquery does not affect its result: SUM() is order-independent, and which rows enter the total is determined solely by the WHERE filter (t2.id <= t1.id). The ordering of the final output is set only by the ORDER BY of the outer query.

Can the calculation of the cumulative sum be parallelized in this approach?

No. The order of rows is essential: each running total depends on all previous rows, which makes naive parallelization impossible in plain SQL.

Why is the correlated subquery so slow with a large number of rows?

Because the sum over all preceding rows is recomputed for each row, giving O(N^2) operations in total. On a table of 100 thousand rows this can already take minutes or even hours.
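The quadratic blow-up is easy to count directly. This sketch mirrors the correlated subquery in pure Python and tallies the additions it performs (synthetic data):

```python
def naive_running_totals(amounts):
    """Re-sum the whole prefix for every row, like the correlated subquery."""
    ops = 0
    totals = []
    for i in range(len(amounts)):
        prefix_sum = 0
        for j in range(i + 1):  # re-reads rows 0..i on every iteration
            prefix_sum += amounts[j]
            ops += 1
        totals.append(prefix_sum)
    return totals, ops

totals, ops = naive_running_totals([10] * 1000)
print(ops)  # 500500 additions for just 1000 rows: N * (N + 1) / 2
```

At 100 thousand rows the same formula gives roughly 5 billion additions, which is why the timings in the answer above are plausible.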

Typical mistakes and anti-patterns

  • Incorrect filtering by id instead of the actual date — sums "jump" on gaps in id
  • Trying to perform summation without ordering data
  • Using such an approach for huge tables when ETL or partitioned processing is needed
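The first anti-pattern can be reproduced in miniature. Here id 3 is inserted after id 2 but carries an earlier date (illustrative data standing in for a table with deleted/reused ids), so keying the prefix on id disagrees with the date order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (id INTEGER PRIMARY KEY, day TEXT, amount INTEGER)")
# id order disagrees with date order: id 3 holds the middle date
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)",
                 [(1, "2024-01-01", 100), (2, "2024-01-03", 60), (3, "2024-01-02", 40)])

# Wrong: prefix keyed on id, displayed by date
by_id = conn.execute("""
    SELECT t1.day,
           (SELECT SUM(t2.amount) FROM revenue t2 WHERE t2.id <= t1.id)
    FROM revenue t1 ORDER BY t1.day
""").fetchall()

# Right: prefix keyed on the actual date
by_day = conn.execute("""
    SELECT t1.day,
           (SELECT SUM(t2.amount) FROM revenue t2 WHERE t2.day <= t1.day)
    FROM revenue t1 ORDER BY t1.day
""").fetchall()

print(by_id)   # the "cumulative" total jumps to 200 and then drops to 160
print(by_day)  # [('2024-01-01', 100), ('2024-01-02', 140), ('2024-01-03', 200)]
```

The id-keyed totals are not even monotonic, which is exactly the "sharp drops" symptom described in the negative case below.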

Real-life example

Negative case

An analyst computed daily cumulative revenue through a correlated subquery keyed on id, while the table periodically had ids deleted (leaving gaps). The final totals showed sharp drops and depended on id order rather than on the date.

Pros:

  • Works for small datasets, does not require window functions

Cons:

  • The results are incorrect and do not match expectations
  • Difficult to maintain

Positive case

An engineer moved the cumulative-sum processing into an ETL script (Python/pandas), then loaded the final values into a separate table, synchronizing only new entries. The totals are always consistent by date, and the code runs quickly even on millions of records.
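A minimal sketch of that ETL step, using stdlib Python with SQLite standing in for the warehouse (in the described setup this would be pandas cumsum against the production database; table and column names here are assumptions):

```python
import sqlite3
from itertools import accumulate

# Stand-in for the source table in the warehouse (illustrative data)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue INTEGER)")
db.executemany("INSERT INTO daily_revenue VALUES (?, ?)",
               [("2024-01-01", 100), ("2024-01-02", 40), ("2024-01-03", 60)])

# ETL step: pull rows ordered by date, accumulate in one O(N) pass, write back
rows = db.execute("SELECT day, revenue FROM daily_revenue ORDER BY day").fetchall()
days = [day for day, _ in rows]
cumulative = accumulate(revenue for _, revenue in rows)

db.execute("CREATE TABLE cumulative_revenue (day TEXT PRIMARY KEY, running_total INTEGER)")
db.executemany("INSERT INTO cumulative_revenue VALUES (?, ?)", zip(days, cumulative))

final = db.execute("SELECT * FROM cumulative_revenue ORDER BY day").fetchall()
print(final)  # [('2024-01-01', 100), ('2024-01-02', 140), ('2024-01-03', 200)]
```

Keying the accumulation on the date (not the id) and materializing the result into its own table is what makes the totals reproducible and cheap to re-query.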

Pros:

  • Reliability, ability to recalculate without downtime
  • Support for large volumes

Cons:

  • More complex infrastructure: external processing tools are required