Deduplication
ROW_NUMBER() OVER (PARTITION BY ...) = 1 and the DISTINCT ON (Postgres) alternatives for keeping one row per group.
What it is#
Deduplication is the “keep exactly one row per group” template — when a table holds many rows that share a logical key and you want one representative row per key. Common shapes: keep the latest session per user, the highest-priced order per customer, the most recent address per account, the first event per session. The grouping key is rarely the primary key (otherwise duplicates wouldn’t exist); it’s some business identifier that the data model permits to repeat.
The canonical form uses a window function:
WITH ranked AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY started_at DESC) AS rn FROM sessions)SELECT *FROM rankedWHERE rn = 1;The window numbers rows inside each user_id partition by started_at DESC, and the outer filter keeps only the top row per partition. That structure — partition by group, order by tiebreak, keep rn = 1 — solves an enormous family of “give me the latest / largest / first / one-of-each” problems in a single pass.
When to use it#
Reach for this pattern whenever a table is logically “one row per X” but physically holds multiple rows per X and you need to collapse to the canonical one. Concretely:
- Slowly changing dimensions. Customer addresses, employee titles, product prices — history tables keep every version; queries usually want the current one.
- Idempotent ingestion. Event pipelines that retry can produce duplicate event rows. Dedup on
(event_id)keeping the row with the earliestreceived_at. - Snapshot-style reporting. Sessions, sign-ups, orders — “latest per user” / “first per user” feeds dashboards and cohort analyses.
- Cleaning legacy data. A table without a
UNIQUEconstraint may have accumulated duplicates over years; this pattern is the first half of the cleanup (the second half isDELETE WHERE rn > 1).
Reach for something else when:
- The duplicates are exact across all columns and you just want
DISTINCT. - You want all rows tied for first (multiple “latest” rows on the same timestamp) — use
RANK() = 1instead ofROW_NUMBER() = 1. - You want aggregates per group rather than a representative row —
GROUP BYplusMAX()/MIN()is simpler.
How it works#
The window function partitions the input by the grouping column, sorts each partition by the tiebreak, and assigns a per-partition sequence number. Filtering on that number picks one row per group.
-- Sessions table: many rows per user, want the latest one per user.SELECT user_id, session_id, started_at, deviceFROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY started_at DESC, session_id DESC ) AS rn FROM sessions) rankedWHERE rn = 1;The ORDER BY started_at DESC, session_id DESC carries the tiebreak intent explicitly — if two sessions for the same user start at the same instant, the one with the larger session_id wins. Without that second column the choice is undefined and may flip between runs of the same query.
The execution plan is one sort per partition. On indexed columns the engine can often stream the partition in order — Postgres’s WindowAgg over an index scan is a typical fast plan. Without a covering index the planner sorts the whole table once, which is acceptable up to tens of millions of rows.
Variants#
Postgres DISTINCT ON — cleaner syntax for the exact same job:
SELECT DISTINCT ON (user_id) user_id, session_id, started_at, deviceFROM sessionsORDER BY user_id, started_at DESC, session_id DESC;DISTINCT ON (cols) keeps the first row per cols group in the order given by the outer ORDER BY. The first column(s) of ORDER BY must match the DISTINCT ON list — that’s the only ergonomic trap. Postgres-only; not standard SQL.
MIN() / MAX() group-by trick — works when you only need the tiebreak column itself:
SELECT user_id, MAX(started_at) AS latest_startFROM sessionsGROUP BY user_id;If you also need the other columns of the latest row, you have to join back:
SELECT s.*FROM sessions sJOIN ( SELECT user_id, MAX(started_at) AS latest_start FROM sessions GROUP BY user_id) m ON m.user_id = s.user_id AND m.latest_start = s.started_at;That works but reintroduces duplicates if two sessions tie on started_at. The window-function form handles ties cleanly with the secondary sort key.
Correlated NOT EXISTS — the pre-window-function classic, still useful when the engine lacks windows (very old MySQL):
SELECT s.*FROM sessions sWHERE NOT EXISTS ( SELECT 1 FROM sessions s2 WHERE s2.user_id = s.user_id AND (s2.started_at > s.started_at OR (s2.started_at = s.started_at AND s2.session_id > s.session_id)));Reads as “keep s only if no row beats it on the tiebreak.” Plan-wise this is a semi-anti-join — often slower than the window variant on large tables but sometimes faster on small ones with good indexes.
RANK() vs ROW_NUMBER() — pick deliberately. ROW_NUMBER() always returns exactly one 1 per partition; ties broken by an arbitrary order if the ORDER BY doesn’t fully disambiguate. RANK() returns multiple 1s when rows tie on the order key — useful when you want “everyone tied for the latest start,” not “one of the people tied.”
Common pitfalls#
- Non-deterministic kept row.
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY started_at DESC)without a unique tiebreak column means the engine picks arbitrarily among ties. Two runs of the same query can return different rows. Always add a unique secondary sort column (session_id,id) when ties are possible. - Mistaking
ROW_NUMBER()forRANK(). If two rows tie for “first” and you want both,ROW_NUMBER() = 1silently drops one. UseRANK() = 1orDENSE_RANK() = 1. - Filtering inside the window instead of outside. You cannot write
WHERE ROW_NUMBER() OVER (...) = 1directly — window functions evaluate afterWHERE. Wrap in a CTE or subquery. DISTINCT ONordering mismatch. Forgetting that the leadingORDER BYcolumns must match theDISTINCT ONlist produces a syntax error in Postgres; or worse, an unintuitive result on older versions.- Group-by trick losing the rest of the row.
GROUP BY user_id, MAX(started_at)cannot also projectsession_id— the engine doesn’t know whichsession_idbelongs to the max. Either join back or use the window form. - Performance on huge partitions. A partition with millions of rows forces a large sort. If you only need the top row, an index on
(user_id, started_at DESC)lets Postgres use an index-only scan forDISTINCT ONand skips the sort entirely. - NULLs in the partition column. Rows with
user_id IS NULLall fall into a single partition together. If that’s not what you want, filterWHERE user_id IS NOT NULLfirst orCOALESCEto a sentinel. - NULLs in the order column.
ORDER BY started_at DESCputs NULLs first on Postgres, last on MySQL. Make it explicit:ORDER BY started_at DESC NULLS LAST.
Practice problem#
You have a sessions table with columns (session_id, user_id, started_at, device, ip). A user can have many sessions; rows are append-only. Return one row per user representing their most recent session — include session_id, started_at, and device. Ties on started_at should be broken by the larger session_id.
Solution
WITH ranked AS ( SELECT session_id, user_id, started_at, device, ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY started_at DESC, session_id DESC ) AS rn FROM sessions)SELECT user_id, session_id, started_at, deviceFROM rankedWHERE rn = 1;Partition by user_id, sort each partition by started_at DESC with session_id DESC as the explicit tiebreak, keep the row numbered 1. On Postgres, SELECT DISTINCT ON (user_id) ... ORDER BY user_id, started_at DESC, session_id DESC is the equivalent one-liner.
Related patterns#
- Rank and Row — the broader family of “number the rows within a group” problems.
- Existence Check — when “does at least one row match” is the question rather than “which row to keep.”
- Group Bucket — when you want aggregates per group, not a representative row.
- Window Functions — the underlying mechanism.
- Subqueries and CTEs — the CTE wrapper used in the canonical form.