Set Compare
INTERSECT / EXCEPT / UNION and the equivalent JOIN / NOT EXISTS forms. When set algebra beats joins.
What it is#
Set comparison is the family of queries that combine two result sets with set algebra: union, intersection, difference. SQL exposes three operators — UNION, INTERSECT, EXCEPT (called MINUS in Oracle) — and each has an ALL variant that skips duplicate elimination.
-- All users who appear in either tableSELECT user_id FROM active_usersUNIONSELECT user_id FROM paid_users;
-- Users in both tablesSELECT user_id FROM active_usersINTERSECTSELECT user_id FROM paid_users;
-- Users active but not paidSELECT user_id FROM active_usersEXCEPTSELECT user_id FROM paid_users;The two inputs must have the same number of columns and compatible types. Set operators compare whole rows, not just keys — INTERSECT keeps a row only if every column matches.
Almost every set-algebra query can be rewritten as a JOIN or a subquery, and vice versa. The choice matters because the resulting plans, NULL handling, and duplicate semantics differ.
When to use it#
Reach for set operators when the natural shape of the question is “elements in A versus elements in B”:
- Combining heterogeneous sources. Audit log entries from three tables, all with
(actor_id, ts, action)—UNION ALLis the natural concatenation. - Cross-checking membership. “Users who are in both the trial cohort and the converted cohort” —
INTERSECT. - Finding the difference. “Users who logged in yesterday but not today” —
EXCEPT. - Schema-equivalence checks. Validating that two pipelines produce identical results —
(A EXCEPT B) UNION ALL (B EXCEPT A)should be empty.
Reach for JOIN or NOT EXISTS when:
- You need columns from both sides, not just the shared key.
- The grouping or filter logic is asymmetric (only one side needs filtering).
- The optimizer produces a better plan for a join — set operations typically sort or hash both sides; a join might use an index lookup.
How it works#
Each operator has well-defined semantics on multisets:
| Operator | Behaviour | Dedup? |
|---|---|---|
UNION | Rows in A or B | Yes |
UNION ALL | Rows in A or B | No |
INTERSECT | Rows in both | Yes |
INTERSECT ALL | Rows in both (multiplicity = min) | No |
EXCEPT | Rows in A not in B | Yes |
EXCEPT ALL | Rows in A not in B (multiplicity = a − b) | No |
INTERSECT ALL and EXCEPT ALL are Postgres-specific (and added recently to a few other engines). MySQL added INTERSECT and EXCEPT in 8.0.31; before that you had to fake them.
-- Users active but not paid — EXCEPT formSELECT user_id FROM active_usersEXCEPTSELECT user_id FROM paid_users;
-- Same thing — JOIN form (anti-join)SELECT a.user_idFROM active_users aLEFT JOIN paid_users p ON p.user_id = a.user_idWHERE p.user_id IS NULL;
-- Same thing — NOT EXISTS formSELECT a.user_idFROM active_users aWHERE NOT EXISTS ( SELECT 1 FROM paid_users p WHERE p.user_id = a.user_id);All three return the same rows when active_users.user_id is unique. They differ on:
- Duplicates.
EXCEPTdeduplicates the left side too; the JOIN andNOT EXISTSforms preserve duplicates. - NULLs.
EXCEPTtreats two NULLs as equal (set semantics). The JOIN equalityp.user_id = a.user_idis unknown for NULL, so the join handles NULL-vs-NULL differently —NOT EXISTSdoes too. - Plans. Postgres typically does a hash-based set difference for
EXCEPT(sort or hash both sides). The anti-join form can use a merge anti-join with an index, sometimes faster on indexed columns.
Variants#
UNION ALL vs UNION. Almost always prefer UNION ALL unless deduplication is the intent. UNION forces a sort or hash to dedup, which is expensive on large inputs and pointless if you already know the inputs are disjoint. The query planner cannot tell — it dedups every time you write UNION.
-- Wrong: dedup pass on millions of rows for no reason.SELECT order_id, ts FROM orders_2024UNIONSELECT order_id, ts FROM orders_2025;
-- Right: the two partitions are disjoint by year.SELECT order_id, ts FROM orders_2024UNION ALLSELECT order_id, ts FROM orders_2025;Anti-join (EXCEPT / LEFT JOIN ... NULL / NOT EXISTS). All three express “in A, not in B.” Performance ordering on Postgres roughly: NOT EXISTS ≈ anti-join LEFT JOIN/IS NULL ≥ EXCEPT. The anti-join can use a hash anti-join or merge anti-join with indexes; EXCEPT always sorts/hashes both inputs.
Semi-join (INTERSECT / INNER JOIN / EXISTS). Same three-way choice for intersection. EXISTS short-circuits as soon as it finds one match, which is unbeatable when the right side has many rows per key.
NOT IN vs NOT EXISTS. WHERE x NOT IN (SELECT y FROM t) becomes wrong if t.y contains any NULL — the comparison returns UNKNOWN and the row is dropped. NOT EXISTS doesn’t have this bug. Prefer NOT EXISTS.
Set ops on different column counts. Most engines error out; some (SQLite historically) right-pad with NULLs. Be explicit.
ORDER BY placement. Goes at the end of the whole expression, not inside the branches. Each branch’s own ORDER BY is illegal except inside a parenthesised subquery.
(SELECT user_id, signed_up FROM active_users ORDER BY signed_up DESC LIMIT 100)UNION ALL(SELECT user_id, signed_up FROM paid_users ORDER BY signed_up DESC LIMIT 100)ORDER BY signed_up DESC;Common pitfalls#
UNIONwhenUNION ALLwas meant. Silent performance bug — the dedup pass costs O(n log n) for no functional reason.NOT INwith NULLs. A single NULL in the right side makes the whole result empty. Switch toNOT EXISTSreflexively.- Different column counts. Set operators require matching arity. Refactor branches to project the same shape, padding with literal NULLs if needed:
SELECT id, NULL::text AS reason FROM t1 UNION ALL SELECT id, reason FROM t2. - Type coercion across branches.
SELECT 1 UNION SELECT '1'may work via implicit cast or fail depending on engine. Cast explicitly when types differ. - Row equality vs key equality.
INTERSECTbetween(id, name)lists requires both columns to match. If you only want shared IDs, project just the ID column in each branch. EXCEPTdeduplicates the left side. Ifactive_users.user_idhad duplicates and you wanted to preserve them in the result,EXCEPTcollapses them. Use the anti-join orEXCEPT ALL(where supported).- Engine version gaps. MySQL only got
INTERSECTandEXCEPTin 8.0.31 (late 2022). Older deployments need the JOIN form. SQLite hasINTERSECTandEXCEPTbut noALLvariants. - Ordering within branches. Each branch is unordered until the final
ORDER BY. Don’t rely onUNION ALLpreserving the input order of either side — most engines do, but it’s not guaranteed. - NULL equality. Set operators treat NULL = NULL as true; joins on
=treat it as unknown. A query that “works” withEXCEPTmay behave differently when rewritten as a join.
Practice problem#
You have two tables: active_users(user_id, last_login_at) for users who logged in in the last 30 days, and paid_users(user_id, plan, started_at) for users with an active paid subscription. Both tables have user_id as a unique key. Write two queries that find users who are active but not paying: once using set algebra, once using a JOIN. Then explain when you’d prefer each.
Solution
-- Set algebraSELECT user_id FROM active_usersEXCEPTSELECT user_id FROM paid_users;
-- Anti-join with LEFT JOINSELECT a.user_id, a.last_login_atFROM active_users aLEFT JOIN paid_users p ON p.user_id = a.user_idWHERE p.user_id IS NULL;Use the EXCEPT form when you only need the user_id and want compact, symmetric code — both tables play equal roles. Use the JOIN form when you also need columns from active_users (here, last_login_at) without rejoining, or when an index on paid_users(user_id) makes a hash/merge anti-join faster than the sort-based set-difference plan. NOT EXISTS is the third form and often produces the best plan on indexed columns.
Related patterns#
- Existence Check —
EXISTS/NOT EXISTSagainst a single side. - Join Variants — anti-joins and semi-joins as join-style equivalents of
EXCEPTandINTERSECT. - Relational Algebra Basics — the algebraic foundation for
UNION,INTERSECT,EXCEPT. - Joins (Inner, Outer, Cross) — the JOIN equivalents.
- Subqueries and CTEs —
NOT EXISTSis a correlated subquery.