Set Compare

INTERSECT / EXCEPT / UNION and the equivalent JOIN / NOT EXISTS forms. When set algebra beats joins.

Pattern Intermediate
7 min read
sql sql-pattern set-operations union intersect except

What it is#

Set comparison is the family of queries that combine two result sets with set algebra: union, intersection, difference. SQL exposes three operators — UNION, INTERSECT, EXCEPT (called MINUS in Oracle) — and each has an ALL variant that skips duplicate elimination.

-- All users who appear in either table
SELECT user_id FROM active_users
UNION
SELECT user_id FROM paid_users;
-- Users in both tables
SELECT user_id FROM active_users
INTERSECT
SELECT user_id FROM paid_users;
-- Users active but not paid
SELECT user_id FROM active_users
EXCEPT
SELECT user_id FROM paid_users;

The two inputs must have the same number of columns and compatible types. Set operators compare whole rows, not just keys — INTERSECT keeps a row only if every column matches.

Almost every set-algebra query can be rewritten as a JOIN or a subquery, and vice versa. The choice matters because the resulting plans, NULL handling, and duplicate semantics differ.

When to use it#

Reach for set operators when the natural shape of the question is “elements in A versus elements in B”:

  • Combining heterogeneous sources. Audit log entries from three tables, all with (actor_id, ts, action)UNION ALL is the natural concatenation.
  • Cross-checking membership. “Users who are in both the trial cohort and the converted cohort” — INTERSECT.
  • Finding the difference. “Users who logged in yesterday but not today” — EXCEPT.
  • Schema-equivalence checks. Validating that two pipelines produce identical results — (A EXCEPT B) UNION ALL (B EXCEPT A) should be empty.

Reach for JOIN or NOT EXISTS when:

  • You need columns from both sides, not just the shared key.
  • The grouping or filter logic is asymmetric (only one side needs filtering).
  • The optimizer produces a better plan for a join — set operations typically sort or hash both sides; a join might use an index lookup.

How it works#

Each operator has well-defined semantics on multisets:

OperatorBehaviourDedup?
UNIONRows in A or BYes
UNION ALLRows in A or BNo
INTERSECTRows in bothYes
INTERSECT ALLRows in both (multiplicity = min)No
EXCEPTRows in A not in BYes
EXCEPT ALLRows in A not in B (multiplicity = a − b)No

INTERSECT ALL and EXCEPT ALL are Postgres-specific (and added recently to a few other engines). MySQL added INTERSECT and EXCEPT in 8.0.31; before that you had to fake them.

-- Users active but not paid — EXCEPT form
SELECT user_id FROM active_users
EXCEPT
SELECT user_id FROM paid_users;
-- Same thing — JOIN form (anti-join)
SELECT a.user_id
FROM active_users a
LEFT JOIN paid_users p ON p.user_id = a.user_id
WHERE p.user_id IS NULL;
-- Same thing — NOT EXISTS form
SELECT a.user_id
FROM active_users a
WHERE NOT EXISTS (
SELECT 1 FROM paid_users p WHERE p.user_id = a.user_id
);

All three return the same rows when active_users.user_id is unique. They differ on:

  • Duplicates. EXCEPT deduplicates the left side too; the JOIN and NOT EXISTS forms preserve duplicates.
  • NULLs. EXCEPT treats two NULLs as equal (set semantics). The JOIN equality p.user_id = a.user_id is unknown for NULL, so the join handles NULL-vs-NULL differently — NOT EXISTS does too.
  • Plans. Postgres typically does a hash-based set difference for EXCEPT (sort or hash both sides). The anti-join form can use a merge anti-join with an index, sometimes faster on indexed columns.
Set operators — declarative, symmetric, NULL-as-equal semantics. Best for “are these two result sets the same?” and one-shot reporting queries. Plans tend to materialize and sort/hash both sides.
JOIN / NOT EXISTS — better when you need columns from both sides, when indexes are available on the join key, or when the question is naturally one-sided (“active users without a payment”). NULLs follow three-valued logic.

Variants#

UNION ALL vs UNION. Almost always prefer UNION ALL unless deduplication is the intent. UNION forces a sort or hash to dedup, which is expensive on large inputs and pointless if you already know the inputs are disjoint. The query planner cannot tell — it dedups every time you write UNION.

-- Wrong: dedup pass on millions of rows for no reason.
SELECT order_id, ts FROM orders_2024
UNION
SELECT order_id, ts FROM orders_2025;
-- Right: the two partitions are disjoint by year.
SELECT order_id, ts FROM orders_2024
UNION ALL
SELECT order_id, ts FROM orders_2025;

Anti-join (EXCEPT / LEFT JOIN ... NULL / NOT EXISTS). All three express “in A, not in B.” Performance ordering on Postgres roughly: NOT EXISTS ≈ anti-join LEFT JOIN/IS NULLEXCEPT. The anti-join can use a hash anti-join or merge anti-join with indexes; EXCEPT always sorts/hashes both inputs.

Semi-join (INTERSECT / INNER JOIN / EXISTS). Same three-way choice for intersection. EXISTS short-circuits as soon as it finds one match, which is unbeatable when the right side has many rows per key.

NOT IN vs NOT EXISTS. WHERE x NOT IN (SELECT y FROM t) becomes wrong if t.y contains any NULL — the comparison returns UNKNOWN and the row is dropped. NOT EXISTS doesn’t have this bug. Prefer NOT EXISTS.

Set ops on different column counts. Most engines error out; some (SQLite historically) right-pad with NULLs. Be explicit.

ORDER BY placement. Goes at the end of the whole expression, not inside the branches. Each branch’s own ORDER BY is illegal except inside a parenthesised subquery.

(SELECT user_id, signed_up FROM active_users ORDER BY signed_up DESC LIMIT 100)
UNION ALL
(SELECT user_id, signed_up FROM paid_users ORDER BY signed_up DESC LIMIT 100)
ORDER BY signed_up DESC;

Common pitfalls#

  • UNION when UNION ALL was meant. Silent performance bug — the dedup pass costs O(n log n) for no functional reason.
  • NOT IN with NULLs. A single NULL in the right side makes the whole result empty. Switch to NOT EXISTS reflexively.
  • Different column counts. Set operators require matching arity. Refactor branches to project the same shape, padding with literal NULLs if needed: SELECT id, NULL::text AS reason FROM t1 UNION ALL SELECT id, reason FROM t2.
  • Type coercion across branches. SELECT 1 UNION SELECT '1' may work via implicit cast or fail depending on engine. Cast explicitly when types differ.
  • Row equality vs key equality. INTERSECT between (id, name) lists requires both columns to match. If you only want shared IDs, project just the ID column in each branch.
  • EXCEPT deduplicates the left side. If active_users.user_id had duplicates and you wanted to preserve them in the result, EXCEPT collapses them. Use the anti-join or EXCEPT ALL (where supported).
  • Engine version gaps. MySQL only got INTERSECT and EXCEPT in 8.0.31 (late 2022). Older deployments need the JOIN form. SQLite has INTERSECT and EXCEPT but no ALL variants.
  • Ordering within branches. Each branch is unordered until the final ORDER BY. Don’t rely on UNION ALL preserving the input order of either side — most engines do, but it’s not guaranteed.
  • NULL equality. Set operators treat NULL = NULL as true; joins on = treat it as unknown. A query that “works” with EXCEPT may behave differently when rewritten as a join.

Practice problem#

You have two tables: active_users(user_id, last_login_at) for users who logged in in the last 30 days, and paid_users(user_id, plan, started_at) for users with an active paid subscription. Both tables have user_id as a unique key. Write two queries that find users who are active but not paying: once using set algebra, once using a JOIN. Then explain when you’d prefer each.

Solution
-- Set algebra
SELECT user_id FROM active_users
EXCEPT
SELECT user_id FROM paid_users;
-- Anti-join with LEFT JOIN
SELECT a.user_id, a.last_login_at
FROM active_users a
LEFT JOIN paid_users p ON p.user_id = a.user_id
WHERE p.user_id IS NULL;

Use the EXCEPT form when you only need the user_id and want compact, symmetric code — both tables play equal roles. Use the JOIN form when you also need columns from active_users (here, last_login_at) without rejoining, or when an index on paid_users(user_id) makes a hash/merge anti-join faster than the sort-based set-difference plan. NOT EXISTS is the third form and often produces the best plan on indexed columns.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.