Gaps and Islands
The streak-detection template: subtract row number from a value, group by the difference. Identifies runs of consecutive rows.
What it is#
Gaps and islands is the canonical SQL template for finding runs of consecutive rows in an ordered sequence. The trick: subtract a row number from a value such that consecutive rows produce the same difference. Then group by that difference — each unique value is one “island” (a run), and every island can be aggregated for its length, start, end, or member rows.
-- Template: islands of consecutive daily check-ins per userWITH numbered AS ( SELECT user_id, check_in_date, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY check_in_date) AS rn, check_in_date - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY check_in_date)) * INTERVAL '1 day' AS grp FROM daily_checkins)SELECT user_id, MIN(check_in_date) AS streak_start, MAX(check_in_date) AS streak_end, COUNT(*) AS streak_lenFROM numberedGROUP BY user_id, grp;Read that closely. For three consecutive dates 2026-01-01, 2026-01-02, 2026-01-03 with row numbers 1, 2, 3:
- Row 1:
2026-01-01 - 1 day = 2025-12-31 - Row 2:
2026-01-02 - 2 days = 2025-12-31 - Row 3:
2026-01-03 - 3 days = 2025-12-31
All three rows share the same grp value — they’re one island. A four-day gap and then two more consecutive dates would produce a different grp value, identifying a separate island. The pattern works for any sequence: integers, dates, timestamps (with a fixed-interval assumption), even arbitrary order positions.
“Gaps” — the dual problem — uses the same machinery to find the missing values. Most interview questions are about islands; gaps are a quick variant once you know the islands trick.
When to use it#
Pull out gaps and islands whenever the question is shaped like:
- “Find every continuous run of N+ consecutive [days / weeks / event indices] where some condition holds.”
- “What is the longest streak of [active status / consecutive logins / passing scores]?”
- “Identify sessions: events for the same user with gaps less than 30 minutes are one session.”
- “Find missing primary-key values in a sequence” (the gaps half).
- “Compress sequential rows where status doesn’t change” — represent
(2026-01-01..2026-01-10, active)as one row instead of ten.
The technique sits at the intersection of Lead-Lag Navigation and Rank and Row. Lead-lag detects boundaries — gaps and islands turns those boundaries into group labels using a running calculation. If you only need to detect boundaries (not aggregate islands), lead-lag is enough. If you only need ordinal positions (not runs), rank-and-row is enough. Use gaps-and-islands when you need both — boundary detection plus per-run aggregation.
How it works#
Two ways to express the same idea — pick the one that fits the data type.
Method 1: subtract row number from a numeric / date value. Works when the sequence has a known fixed step (consecutive integers, consecutive days). The arithmetic is value - rownum * step.
-- Integer sequence: find runs of consecutive numbersWITH numbered AS ( SELECT n, n - ROW_NUMBER() OVER (ORDER BY n) AS grp FROM integer_seq)SELECT grp, MIN(n) AS first, MAX(n) AS last, COUNT(*) AS lenFROM numberedGROUP BY grpORDER BY first;For input 1, 2, 3, 7, 8, 10:
1 - 1 = 0,2 - 2 = 0,3 - 3 = 0— island A (grp = 0)7 - 4 = 3,8 - 5 = 3— island B (grp = 3)10 - 6 = 4— island C (grp = 4)
Three islands: [1..3], [7..8], [10..10].
Method 2: running sum of a transition flag. Works for any sequence — no fixed-step requirement. Mark each row where the value transitions; run a cumulative sum over the markers; rows in the same island share a marker count.
-- Streak of "active" status: islands keyed by transition countWITH marked AS ( SELECT user_id, recorded_at, status, CASE WHEN LAG(status) OVER (PARTITION BY user_id ORDER BY recorded_at) IS DISTINCT FROM status THEN 1 ELSE 0 END AS is_transition FROM user_status_history),grouped AS ( SELECT user_id, recorded_at, status, SUM(is_transition) OVER (PARTITION BY user_id ORDER BY recorded_at) AS grp FROM marked)SELECT user_id, status, MIN(recorded_at) AS start_at, MAX(recorded_at) AS end_at, COUNT(*) AS row_countFROM groupedGROUP BY user_id, status, grpORDER BY user_id, start_at;The running SUM over is_transition increments by 1 at every status change. Rows between two transitions share the same sum — they’re one island. This generalises to any “session boundary” problem: define what counts as a transition, sum the flag, group.
IS DISTINCT FROM (Postgres / SQL Server / SQLite) is the null-safe inequality. MySQL: NOT (a <=> b). Plain != fails on the first row of every partition because LAG returns NULL there.
Date series with timezone. Dates in databases can be tricky around DST or timezone boundaries. The safest approach: cast both sides to a tz-aware DATE in a known zone before any arithmetic.
-- Postgres: islands of daily activity, normalised to UTCWITH normalised AS ( SELECT user_id, (event_at AT TIME ZONE 'UTC')::date AS day FROM events),numbered AS ( SELECT user_id, day, day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day))::integer AS grp FROM normalised)SELECT user_id, MIN(day) AS streak_start, MAX(day) AS streak_end, COUNT(*) AS lenFROM numberedGROUP BY user_id, grp;Variants#
Gaps (the dual problem). Instead of grouping rows into islands, find the missing values in a sequence. The easiest form uses LEAD:
-- Find gaps in a sequential id columnSELECT id + 1 AS gap_start, next_id - 1 AS gap_endFROM ( SELECT id, LEAD(id) OVER (ORDER BY id) AS next_id FROM orders) xWHERE next_id - id > 1;For id = 1, 2, 5, 6, 10, the output is (3, 4) and (7, 9). To enumerate every missing integer in the gap, cross join with a generated series.
Island-with-condition. Restrict islands to rows where a condition holds — for example, only count consecutive days where the daily metric exceeded a threshold.
-- Runs of 3+ consecutive days with metric > 100WITH high_days AS ( SELECT day FROM daily_metrics WHERE metric > 100),numbered AS ( SELECT day, day - (ROW_NUMBER() OVER (ORDER BY day))::integer AS grp FROM high_days),runs AS ( SELECT grp, MIN(day) AS start_day, MAX(day) AS end_day, COUNT(*) AS len FROM numbered GROUP BY grp)SELECT start_day, end_day, lenFROM runsWHERE len >= 3ORDER BY start_day;Notice the filter happens before the row-numbering: only “high” days are numbered, so the trick of day - rownum = constant produces a single island per consecutive run of high days. Other days are silently absent.
Session detection by gap threshold. Define a session as events with less than 30 minutes between consecutive events. Combine lead-lag (boundary detection) with running sum (group label):
WITH marked AS ( SELECT user_id, event_at, CASE WHEN event_at - LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at) > INTERVAL '30 minutes' OR LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at) IS NULL THEN 1 ELSE 0 END AS is_session_start FROM events)SELECT user_id, event_at, SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_at) AS session_idFROM marked;Each user’s session_id increments by 1 every time a new session starts. Group by (user_id, session_id) to compute session duration, event count, or any per-session metric.
Compressing sequential rows. A change-data-capture table records every status update. Compress runs of unchanged status into one row per run.
WITH marked AS ( SELECT entity_id, status, recorded_at, CASE WHEN LAG(status) OVER (PARTITION BY entity_id ORDER BY recorded_at) IS DISTINCT FROM status THEN 1 ELSE 0 END AS is_change FROM status_history),grouped AS ( SELECT entity_id, status, recorded_at, SUM(is_change) OVER (PARTITION BY entity_id ORDER BY recorded_at) AS grp FROM marked)SELECT entity_id, status, MIN(recorded_at) AS from_at, MAX(recorded_at) AS to_atFROM groupedGROUP BY entity_id, status, grpORDER BY entity_id, from_at;The output is a “history table” with one row per status period, much smaller than the raw event log.
Numeric-difference shortcut. When the values themselves form an arithmetic progression with step 1, you can use value - rownum directly. When the step is k, use value - rownum * k. For weekly data, that’s week - rownum * 7. For data with no fixed step, use Method 2 (running sum).
Common pitfalls#
Off-by-one in the row-number subtraction. If the data is daily and the row number starts at 1, the subtraction is date - rownum * 1 day. Sometimes engineers write date - rownum (forgetting the interval multiplier) — which silently fails in date arithmetic on Postgres but quietly returns a wrong-but-syntactically-valid result in MySQL.
Wrong — bare integer subtraction.
SELECT day, day - ROW_NUMBER() OVER (ORDER BY day) AS grpFROM daily_events;On Postgres this errors with operator does not exist: date - bigint. On MySQL it silently coerces to days, but is fragile.
Right — interval-typed subtraction.
SELECT day, day - (ROW_NUMBER() OVER (ORDER BY day))::integer AS grpFROM daily_events;Cast the row number to integer so Postgres treats it as days. For multi-day steps, multiply by the step.
Mixed timezones in date-island queries. Two events that happen seconds apart but straddle midnight in different time zones can end up on different calendar days — splitting one island in two, or merging two. Normalise to a single timezone (UTC is the safe default) before the row-numbering step.
Forgetting PARTITION BY in the row number. A query that should give per-user islands but writes ROW_NUMBER() OVER (ORDER BY day) without partitioning produces one row number per row across the whole table — every user’s islands collide. Always PARTITION BY on the entity that defines a sequence.
Running sum without ORDER BY. SUM(is_change) OVER (PARTITION BY user_id) without ORDER BY produces the total sum per user, not a running sum. Add ORDER BY recorded_at to get the running form.
Counting an island as one when the condition has gaps in the underlying data. The “island-with-condition” variant relies on filtering before row-numbering. If you keep both rows (high and low) in the same input and try to handle the condition in the group expression, you get one island per row instead of one island per run. Always filter first.
Treating ties as adjacent. When ordering by a non-unique column (day when multiple events per day exist), row numbers are still distinct (no ties), but the value - rownum arithmetic breaks because the same day appears multiple times. De-duplicate first (SELECT DISTINCT day) or pick the right unique key.
Performance: window without an index. Both methods involve ORDER BY in the window — that’s a sort. On a billion-row events table, an index matching the partition + order columns is essential. See B-tree indexes and query execution and plans.
UNION ALL vs UNION in island-aggregation queries. Once islands are identified, the final aggregation step groups by the group key. Use plain GROUP BY; no set operation is involved. If you find yourself reaching for UNION here, the query is almost certainly mis-structured.
Filtering on len >= N in the wrong place. Putting WHERE len >= 3 in the same SELECT as COUNT(*) errors out — len doesn’t exist yet at WHERE time. Use HAVING COUNT(*) >= 3 if filtering during aggregation, or wrap the aggregation in a CTE and filter outside.
Practice problem#
You have a daily_metrics table:
CREATE TABLE daily_metrics ( metric_date DATE NOT NULL PRIMARY KEY, value NUMERIC NOT NULL);Find every continuous run of 3 or more consecutive days where value exceeded 100. Return one row per run with run_start (first date), run_end (last date), and run_length (number of days). Order by run_start ascending.
Solution
WITH high_days AS ( SELECT metric_date FROM daily_metrics WHERE value > 100),numbered AS ( SELECT metric_date, metric_date - (ROW_NUMBER() OVER (ORDER BY metric_date))::integer AS grp FROM high_days),runs AS ( SELECT grp, MIN(metric_date) AS run_start, MAX(metric_date) AS run_end, COUNT(*) AS run_length FROM numbered GROUP BY grp)SELECT run_start, run_end, run_lengthFROM runsWHERE run_length >= 3ORDER BY run_start;Filter to high days first so the date - rownum trick produces one constant per consecutive run. Group by the constant to collapse each island. Filter on length in the outer query — HAVING COUNT(*) >= 3 inside the aggregation CTE would work equally. The cast to integer after ROW_NUMBER is required on Postgres because date arithmetic needs an integer for “days”.
Related patterns#
- Lead-Lag Navigation — the boundary-detection step. Gaps-and-islands extends it with running sums to label each island.
- Rank and Row —
ROW_NUMBERis the engine of the row-number-subtraction trick. - Window functions — frames and partitioning underlie both methods.
- Aggregates, GROUP BY, HAVING — the final step in every gaps-and-islands query.
- Subqueries and CTEs — multi-step island queries chain naturally through CTEs.