SQL (ANSI)ProgrammingSQL Developer

How do you identify contiguous ranges (islands) in sequential data using only ANSI SQL window functions when row-by-row processing is prohibited?

Pass interviews with Hintsage AI assistant

Answer to the question

History of the question

This question emerged from the evolution of SQL standards from SQL-92 to SQL:2003, when window functions were formally introduced to the standard. Prior to this advancement, developers relied on procedural cursors or computationally expensive self-joins to solve sequence-oriented problems. The gaps-and-islands pattern represents the paradigm shift from procedural algorithms to declarative, set-based logic that defines result sets rather than processing steps.

The problem

When working with tables containing sequential values such as timestamps, IDs, or dates, you must group consecutive values into contiguous blocks (islands) while distinguishing them from breaks (gaps). The fundamental challenge arises because ANSI SQL tables represent unordered mathematical sets, yet detecting sequences requires explicit ordering. Traditional GROUP BY clauses aggregate similar values but destroy the sequential relationships necessary to identify contiguity.

The solution

Leverage the arithmetic difference between ROW_NUMBER() over the entire dataset and ROW_NUMBER() partitioned by the grouping key to generate a constant island identifier. This technique creates identical calculated values for all rows within the same contiguous sequence, allowing standard aggregation to reconstruct the islands.

WITH numbered AS ( SELECT event_date, ROW_NUMBER() OVER (ORDER BY event_date) AS rn_global, event_date - ROW_NUMBER() OVER (ORDER BY event_date) AS island_grp FROM events ) SELECT MIN(event_date) AS island_start, MAX(event_date) AS island_end, COUNT(*) AS consecutive_days FROM numbered GROUP BY island_grp;

Situation from life

A retail analytics team needed to reconstruct customer shopping sessions from clickstream data stored in PostgreSQL. The system logged millions of events containing user_id and event_time, but lacked pre-calculated session identifiers. Business requirements defined a session as a sequence of events where no gap exceeded 30 minutes of inactivity.

The first approach considered using a self-join with a correlated subquery to locate each event's immediate predecessor. This method required O(n²) row comparisons, causing query timeouts when processing daily batches exceeding five million rows, though it maintained compatibility with legacy SQL-92 systems lacking modern window functions.

The team subsequently evaluated pl/pgSQL cursors to iterate through events row-by-row while maintaining session state in procedural variables. While this approach offered intuitive logic familiar to application developers, it abandoned set-based processing principles and required over four hours to complete daily batches, creating unacceptable ETL latency and significant table locking issues.

The selected solution utilized ANSI SQL window functions exclusively. By applying LAG() to capture the previous timestamp per user and calculating time differences, the team identified session boundaries where gaps exceeded 30 minutes. A conditional running sum generated unique session identifiers, enabling set-based aggregation. This method processed the entire dataset in eight minutes, scaled linearly with volume, and remained portable across Oracle, SQL Server, and PostgreSQL without vendor-specific syntax modifications.

What candidates often miss


Why can't I simply truncate timestamps to the hour and group by that value to find sessions?

Truncating timestamps using DATE_TRUNC or similar functions forces artificial boundaries at clock hours rather than relative time differences. Two events occurring at 10:55 and 11:05 would be separated into different groups despite being only 10 minutes apart, while events at 10:01 and 10:59 would group together despite a 58-minute gap. True session detection requires calculating the interval from each event's immediate predecessor, not aligning to calendar boundaries.


How do NULL values in the ordering column affect island detection using LAG or LEAD?

LAG and LEAD return NULL for the first and last rows of each partition respectively. When subtracting the lagged timestamp from the current timestamp to calculate gaps, arithmetic with NULL produces NULL results, which can cause entire islands to disappear from aggregations. You must utilize the optional default parameter in LAG (for example, LAG(event_time, 1, event_time) OVER (...)) or explicitly handle NULL with COALESCE to prevent island fragmentation at partition boundaries.


What changes when detecting islands across multiple categories simultaneously, such as per user or per device?

Candidates frequently omit the PARTITION BY clause in window functions, calculating ROW_NUMBER globally across the entire table rather than per category. Without partitioning by user_id or equivalent grouping columns, islands from different users incorrectly merge when their sequences happen to align temporally. Every window function involved in the island calculation must include PARTITION BY user_id to ensure the arithmetic resets for each distinct entity, maintaining independent island detection per partition.