Gaps and Islands — DBMS · Engineering Playbook

What it is#

Gaps and islands is the canonical SQL template for finding runs of consecutive rows in an ordered sequence. The trick: subtract a row number from a value such that consecutive rows produce the same difference. Then group by that difference — each unique value is one “island” (a run), and every island can be aggregated for its length, start, end, or member rows.

-- Template: islands of consecutive daily check-ins per user
WITH numbered AS (
  SELECT user_id, check_in_date,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY check_in_date) AS rn,
         check_in_date - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY check_in_date)) * INTERVAL '1 day' AS grp
  FROM   daily_checkins
)
SELECT user_id, MIN(check_in_date) AS streak_start, MAX(check_in_date) AS streak_end, COUNT(*) AS streak_len
FROM   numbered
GROUP BY user_id, grp;

Read that closely. For three consecutive dates 2026-01-01, 2026-01-02, 2026-01-03 with row numbers 1, 2, 3:

Row 1: 2026-01-01 - 1 day = 2025-12-31
Row 2: 2026-01-02 - 2 days = 2025-12-31
Row 3: 2026-01-03 - 3 days = 2025-12-31

All three rows share the same grp value — they’re one island. A four-day gap and then two more consecutive dates would produce a different grp value, identifying a separate island. The pattern works for any sequence: integers, dates, timestamps (with a fixed-interval assumption), even arbitrary order positions.

“Gaps” — the dual problem — uses the same machinery to find the missing values. Most interview questions are about islands; gaps are a quick variant once you know the islands trick.

When to use it#

Pull out gaps and islands whenever the question is shaped like:

“Find every continuous run of N+ consecutive [days / weeks / event indices] where some condition holds.”
“What is the longest streak of [active status / consecutive logins / passing scores]?”
“Identify sessions: events for the same user with gaps less than 30 minutes are one session.”
“Find missing primary-key values in a sequence” (the gaps half).
“Compress sequential rows where status doesn’t change” — represent (2026-01-01..2026-01-10, active) as one row instead of ten.

The technique sits at the intersection of Lead-Lag Navigation and Rank and Row. Lead-lag detects boundaries — gaps and islands turns those boundaries into group labels using a running calculation. If you only need to detect boundaries (not aggregate islands), lead-lag is enough. If you only need ordinal positions (not runs), rank-and-row is enough. Use gaps-and-islands when you need both — boundary detection plus per-run aggregation.

How it works#

Two ways to express the same idea — pick the one that fits the data type.

Method 1: subtract row number from a numeric / date value. Works when the sequence has a known fixed step (consecutive integers, consecutive days). The arithmetic is value - rownum * step.

-- Integer sequence: find runs of consecutive numbers
WITH numbered AS (
  SELECT n, n - ROW_NUMBER() OVER (ORDER BY n) AS grp
  FROM   integer_seq
)
SELECT grp, MIN(n) AS first, MAX(n) AS last, COUNT(*) AS len
FROM   numbered
GROUP BY grp
ORDER BY first;

For input 1, 2, 3, 7, 8, 10:

1 - 1 = 0, 2 - 2 = 0, 3 - 3 = 0 — island A (grp = 0)
7 - 4 = 3, 8 - 5 = 3 — island B (grp = 3)
10 - 6 = 4 — island C (grp = 4)

Three islands: [1..3], [7..8], [10..10].

Method 2: running sum of a transition flag. Works for any sequence — no fixed-step requirement. Mark each row where the value transitions; run a cumulative sum over the markers; rows in the same island share a marker count.

-- Streak of "active" status: islands keyed by transition count
WITH marked AS (
  SELECT user_id, recorded_at, status,
         CASE
           WHEN LAG(status) OVER (PARTITION BY user_id ORDER BY recorded_at) IS DISTINCT FROM status
           THEN 1 ELSE 0
         END AS is_transition
  FROM   user_status_history
),
grouped AS (
  SELECT user_id, recorded_at, status,
         SUM(is_transition) OVER (PARTITION BY user_id ORDER BY recorded_at) AS grp
  FROM   marked
)
SELECT user_id, status, MIN(recorded_at) AS start_at, MAX(recorded_at) AS end_at, COUNT(*) AS row_count
FROM   grouped
GROUP BY user_id, status, grp
ORDER BY user_id, start_at;

The running SUM over is_transition increments by 1 at every status change. Rows between two transitions share the same sum — they’re one island. This generalises to any “session boundary” problem: define what counts as a transition, sum the flag, group.

IS DISTINCT FROM (Postgres / SQL Server / SQLite) is the null-safe inequality. MySQL: NOT (a <=> b). Plain != fails on the first row of every partition because LAG returns NULL there.

Date series with timezone. Dates in databases can be tricky around DST or timezone boundaries. The safest approach: cast both sides to a tz-aware DATE in a known zone before any arithmetic.

-- Postgres: islands of daily activity, normalised to UTC
WITH normalised AS (
  SELECT user_id, (event_at AT TIME ZONE 'UTC')::date AS day
  FROM   events
),
numbered AS (
  SELECT user_id, day,
         day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day))::integer AS grp
  FROM   normalised
)
SELECT user_id, MIN(day) AS streak_start, MAX(day) AS streak_end, COUNT(*) AS len
FROM   numbered
GROUP BY user_id, grp;

Variants#

Gaps (the dual problem). Instead of grouping rows into islands, find the missing values in a sequence. The easiest form uses LEAD:

-- Find gaps in a sequential id column
SELECT id + 1 AS gap_start, next_id - 1 AS gap_end
FROM (
  SELECT id, LEAD(id) OVER (ORDER BY id) AS next_id
  FROM   orders
) x
WHERE next_id - id > 1;

For id = 1, 2, 5, 6, 10, the output is (3, 4) and (7, 9). To enumerate every missing integer in the gap, cross join with a generated series.

Island-with-condition. Restrict islands to rows where a condition holds — for example, only count consecutive days where the daily metric exceeded a threshold.

-- Runs of 3+ consecutive days with metric > 100
WITH high_days AS (
  SELECT day FROM daily_metrics WHERE metric > 100
),
numbered AS (
  SELECT day, day - (ROW_NUMBER() OVER (ORDER BY day))::integer AS grp
  FROM   high_days
),
runs AS (
  SELECT grp, MIN(day) AS start_day, MAX(day) AS end_day, COUNT(*) AS len
  FROM   numbered
  GROUP BY grp
)
SELECT start_day, end_day, len
FROM   runs
WHERE  len >= 3
ORDER BY start_day;

Notice the filter happens before the row-numbering: only “high” days are numbered, so the trick of day - rownum = constant produces a single island per consecutive run of high days. Other days are silently absent.

Session detection by gap threshold. Define a session as events with less than 30 minutes between consecutive events. Combine lead-lag (boundary detection) with running sum (group label):

WITH marked AS (
  SELECT user_id, event_at,
         CASE
           WHEN event_at - LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at)
                > INTERVAL '30 minutes'
           OR   LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at) IS NULL
           THEN 1 ELSE 0
         END AS is_session_start
  FROM   events
)
SELECT user_id, event_at,
       SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_at) AS session_id
FROM   marked;

Each user’s session_id increments by 1 every time a new session starts. Group by (user_id, session_id) to compute session duration, event count, or any per-session metric.

Compressing sequential rows. A change-data-capture table records every status update. Compress runs of unchanged status into one row per run.

WITH marked AS (
  SELECT entity_id, status, recorded_at,
         CASE
           WHEN LAG(status) OVER (PARTITION BY entity_id ORDER BY recorded_at) IS DISTINCT FROM status
           THEN 1 ELSE 0
         END AS is_change
  FROM   status_history
),
grouped AS (
  SELECT entity_id, status, recorded_at,
         SUM(is_change) OVER (PARTITION BY entity_id ORDER BY recorded_at) AS grp
  FROM   marked
)
SELECT entity_id, status,
       MIN(recorded_at) AS from_at,
       MAX(recorded_at) AS to_at
FROM   grouped
GROUP BY entity_id, status, grp
ORDER BY entity_id, from_at;

The output is a “history table” with one row per status period, much smaller than the raw event log.

Numeric-difference shortcut. When the values themselves form an arithmetic progression with step 1, you can use value - rownum directly. When the step is k, use value - rownum * k. For weekly data, that’s week - rownum * 7. For data with no fixed step, use Method 2 (running sum).

Common pitfalls#

Off-by-one in the row-number subtraction. If the data is daily and the row number starts at 1, the subtraction is date - rownum * 1 day. Sometimes engineers write date - rownum (forgetting the interval multiplier) — which silently fails in date arithmetic on Postgres but quietly returns a wrong-but-syntactically-valid result in MySQL.

Wrong — bare integer subtraction.

SELECT day, day - ROW_NUMBER() OVER (ORDER BY day) AS grp
FROM   daily_events;

On Postgres this errors with operator does not exist: date - bigint. On MySQL it silently coerces to days, but is fragile.

Right — interval-typed subtraction.

SELECT day,
       day - (ROW_NUMBER() OVER (ORDER BY day))::integer
         AS grp
FROM   daily_events;

Cast the row number to integer so Postgres treats it as days. For multi-day steps, multiply by the step.

Mixed timezones in date-island queries. Two events that happen seconds apart but straddle midnight in different time zones can end up on different calendar days — splitting one island in two, or merging two. Normalise to a single timezone (UTC is the safe default) before the row-numbering step.

Forgetting PARTITION BY in the row number. A query that should give per-user islands but writes ROW_NUMBER() OVER (ORDER BY day) without partitioning produces one row number per row across the whole table — every user’s islands collide. Always PARTITION BY on the entity that defines a sequence.

Running sum without ORDER BY. SUM(is_change) OVER (PARTITION BY user_id) without ORDER BY produces the total sum per user, not a running sum. Add ORDER BY recorded_at to get the running form.

Counting an island as one when the condition has gaps in the underlying data. The “island-with-condition” variant relies on filtering before row-numbering. If you keep both rows (high and low) in the same input and try to handle the condition in the group expression, you get one island per row instead of one island per run. Always filter first.

Treating ties as adjacent. When ordering by a non-unique column (day when multiple events per day exist), row numbers are still distinct (no ties), but the value - rownum arithmetic breaks because the same day appears multiple times. De-duplicate first (SELECT DISTINCT day) or pick the right unique key.

Performance: window without an index. Both methods involve ORDER BY in the window — that’s a sort. On a billion-row events table, an index matching the partition + order columns is essential. See B-tree indexes and query execution and plans.

UNION ALL vs UNION in island-aggregation queries. Once islands are identified, the final aggregation step groups by the group key. Use plain GROUP BY; no set operation is involved. If you find yourself reaching for UNION here, the query is almost certainly mis-structured.

Filtering on len >= N in the wrong place. Putting WHERE len >= 3 in the same SELECT as COUNT(*) errors out — len doesn’t exist yet at WHERE time. Use HAVING COUNT(*) >= 3 if filtering during aggregation, or wrap the aggregation in a CTE and filter outside.

Practice problem#

You have a daily_metrics table:

CREATE TABLE daily_metrics (
  metric_date DATE    NOT NULL PRIMARY KEY,
  value       NUMERIC NOT NULL
);

Find every continuous run of 3 or more consecutive days where value exceeded 100. Return one row per run with run_start (first date), run_end (last date), and run_length (number of days). Order by run_start ascending.

Solution

WITH high_days AS (
  SELECT metric_date
  FROM   daily_metrics
  WHERE  value > 100
),
numbered AS (
  SELECT metric_date,
         metric_date - (ROW_NUMBER() OVER (ORDER BY metric_date))::integer AS grp
  FROM   high_days
),
runs AS (
  SELECT grp,
         MIN(metric_date) AS run_start,
         MAX(metric_date) AS run_end,
         COUNT(*)         AS run_length
  FROM   numbered
  GROUP BY grp
)
SELECT run_start, run_end, run_length
FROM   runs
WHERE  run_length >= 3
ORDER BY run_start;

Filter to high days first so the date - rownum trick produces one constant per consecutive run. Group by the constant to collapse each island. Filter on length in the outer query — HAVING COUNT(*) >= 3 inside the aggregation CTE would work equally. The cast to integer after ROW_NUMBER is required on Postgres because date arithmetic needs an integer for “days”.

Lead-Lag Navigation — the boundary-detection step. Gaps-and-islands extends it with running sums to label each island.
Rank and Row — ROW_NUMBER is the engine of the row-number-subtraction trick.
Window functions — frames and partitioning underlie both methods.
Aggregates, GROUP BY, HAVING — the final step in every gaps-and-islands query.
Subqueries and CTEs — multi-step island queries chain naturally through CTEs.