Deduplication — DBMS · Engineering Playbook

What it is#

Deduplication is the “keep exactly one row per group” template — when a table holds many rows that share a logical key and you want one representative row per key. Common shapes: keep the latest session per user, the highest-priced order per customer, the most recent address per account, the first event per session. The grouping key is rarely the primary key (otherwise duplicates wouldn’t exist); it’s some business identifier that the data model permits to repeat.

The canonical form uses a window function:

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY started_at DESC) AS rn
  FROM sessions
)
SELECT *
FROM ranked
WHERE rn = 1;

The window numbers rows inside each user_id partition by started_at DESC, and the outer filter keeps only the top row per partition. That structure — partition by group, order by tiebreak, keep rn = 1 — solves an enormous family of “give me the latest / largest / first / one-of-each” problems in a single pass.

When to use it#

Reach for this pattern whenever a table is logically “one row per X” but physically holds multiple rows per X and you need to collapse to the canonical one. Concretely:

Slowly changing dimensions. Customer addresses, employee titles, product prices — history tables keep every version; queries usually want the current one.
Idempotent ingestion. Event pipelines that retry can produce duplicate event rows. Dedup on (event_id) keeping the row with the earliest received_at.
Snapshot-style reporting. Sessions, sign-ups, orders — “latest per user” / “first per user” feeds dashboards and cohort analyses.
Cleaning legacy data. A table without a UNIQUE constraint may have accumulated duplicates over years; this pattern is the first half of the cleanup (the second half is DELETE WHERE rn > 1).

Reach for something else when:

The duplicates are exact across all columns and you just want DISTINCT.
You want all rows tied for first (multiple “latest” rows on the same timestamp) — use RANK() = 1 instead of ROW_NUMBER() = 1.
You want aggregates per group rather than a representative row — GROUP BY plus MAX() / MIN() is simpler.

How it works#

The window function partitions the input by the grouping column, sorts each partition by the tiebreak, and assigns a per-partition sequence number. Filtering on that number picks one row per group.

-- Sessions table: many rows per user, want the latest one per user.
SELECT user_id, session_id, started_at, device
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY started_at DESC, session_id DESC
         ) AS rn
  FROM sessions
) ranked
WHERE rn = 1;

The ORDER BY started_at DESC, session_id DESC carries the tiebreak intent explicitly — if two sessions for the same user start at the same instant, the one with the larger session_id wins. Without that second column the choice is undefined and may flip between runs of the same query.

The execution plan is one sort per partition. On indexed columns the engine can often stream the partition in order — Postgres’s WindowAgg over an index scan is a typical fast plan. Without a covering index the planner sorts the whole table once, which is acceptable up to tens of millions of rows.

Variants#

Postgres DISTINCT ON — cleaner syntax for the exact same job:

SELECT DISTINCT ON (user_id)
       user_id, session_id, started_at, device
FROM sessions
ORDER BY user_id, started_at DESC, session_id DESC;

DISTINCT ON (cols) keeps the first row per cols group in the order given by the outer ORDER BY. The first column(s) of ORDER BY must match the DISTINCT ON list — that’s the only ergonomic trap. Postgres-only; not standard SQL.

MIN() / MAX() group-by trick — works when you only need the tiebreak column itself:

SELECT user_id, MAX(started_at) AS latest_start
FROM sessions
GROUP BY user_id;

If you also need the other columns of the latest row, you have to join back:

SELECT s.*
FROM sessions s
JOIN (
  SELECT user_id, MAX(started_at) AS latest_start
  FROM sessions
  GROUP BY user_id
) m ON m.user_id = s.user_id AND m.latest_start = s.started_at;

That works but reintroduces duplicates if two sessions tie on started_at. The window-function form handles ties cleanly with the secondary sort key.

Correlated NOT EXISTS — the pre-window-function classic, still useful when the engine lacks windows (very old MySQL):

SELECT s.*
FROM sessions s
WHERE NOT EXISTS (
  SELECT 1
  FROM sessions s2
  WHERE s2.user_id = s.user_id
    AND (s2.started_at > s.started_at
      OR (s2.started_at = s.started_at AND s2.session_id > s.session_id))
);

Reads as “keep s only if no row beats it on the tiebreak.” Plan-wise this is a semi-anti-join — often slower than the window variant on large tables but sometimes faster on small ones with good indexes.

RANK() vs ROW_NUMBER() — pick deliberately. ROW_NUMBER() always returns exactly one 1 per partition; ties broken by an arbitrary order if the ORDER BY doesn’t fully disambiguate. RANK() returns multiple 1s when rows tie on the order key — useful when you want “everyone tied for the latest start,” not “one of the people tied.”

Common pitfalls#

Non-deterministic kept row. ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY started_at DESC) without a unique tiebreak column means the engine picks arbitrarily among ties. Two runs of the same query can return different rows. Always add a unique secondary sort column (session_id, id) when ties are possible.
Mistaking ROW_NUMBER() for RANK(). If two rows tie for “first” and you want both, ROW_NUMBER() = 1 silently drops one. Use RANK() = 1 or DENSE_RANK() = 1.
Filtering inside the window instead of outside. You cannot write WHERE ROW_NUMBER() OVER (...) = 1 directly — window functions evaluate after WHERE. Wrap in a CTE or subquery.
DISTINCT ON ordering mismatch. Forgetting that the leading ORDER BY columns must match the DISTINCT ON list produces a syntax error in Postgres; or worse, an unintuitive result on older versions.
Group-by trick losing the rest of the row. GROUP BY user_id, MAX(started_at) cannot also project session_id — the engine doesn’t know which session_id belongs to the max. Either join back or use the window form.
Performance on huge partitions. A partition with millions of rows forces a large sort. If you only need the top row, an index on (user_id, started_at DESC) lets Postgres use an index-only scan for DISTINCT ON and skips the sort entirely.
NULLs in the partition column. Rows with user_id IS NULL all fall into a single partition together. If that’s not what you want, filter WHERE user_id IS NOT NULL first or COALESCE to a sentinel.
NULLs in the order column. ORDER BY started_at DESC puts NULLs first on Postgres, last on MySQL. Make it explicit: ORDER BY started_at DESC NULLS LAST.

Practice problem#

You have a sessions table with columns (session_id, user_id, started_at, device, ip). A user can have many sessions; rows are append-only. Return one row per user representing their most recent session — include session_id, started_at, and device. Ties on started_at should be broken by the larger session_id.

Solution

WITH ranked AS (
  SELECT session_id, user_id, started_at, device,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY started_at DESC, session_id DESC
         ) AS rn
  FROM sessions
)
SELECT user_id, session_id, started_at, device
FROM ranked
WHERE rn = 1;

Partition by user_id, sort each partition by started_at DESC with session_id DESC as the explicit tiebreak, keep the row numbered 1. On Postgres, SELECT DISTINCT ON (user_id) ... ORDER BY user_id, started_at DESC, session_id DESC is the equivalent one-liner.

Rank and Row — the broader family of “number the rows within a group” problems.
Existence Check — when “does at least one row match” is the question rather than “which row to keep.”
Group Bucket — when you want aggregates per group, not a representative row.
Window Functions — the underlying mechanism.
Subqueries and CTEs — the CTE wrapper used in the canonical form.