Join Variants
Self-join, anti-join, semi-join, lateral join, theta-join. Beyond the standard four kinds of join.
What it is#
Past the standard four joins (INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER), there’s a second tier of join shapes that map to specific question types: self-join, anti-join, semi-join, lateral join, theta-join. Some have dedicated syntax; most are expressed as combinations of standard joins with predicates.
-- Self-join: employees with their managersSELECT e.name AS employee, m.name AS managerFROM employees eLEFT JOIN employees m ON m.id = e.manager_id;
-- Anti-join: users with no ordersSELECT u.id, u.emailFROM users uLEFT JOIN orders o ON o.user_id = u.idWHERE o.user_id IS NULL;
-- Semi-join: users who have at least one orderSELECT u.id, u.emailFROM users uWHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
-- Lateral join: three most recent orders per userSELECT u.id, recent.order_id, recent.created_atFROM users u,LATERAL ( SELECT order_id, created_at FROM orders o WHERE o.user_id = u.id ORDER BY created_at DESC LIMIT 3) recent;
-- Theta-join: pairs of overlapping reservationsSELECT a.id, b.idFROM reservations aJOIN reservations b ON a.id < b.id AND a.room_id = b.room_id AND a.starts_at < b.ends_at AND b.starts_at < a.ends_at;These shapes show up in every SQL interview and most real production queries. Recognising which one fits the question is half the work.
When to use it#
- Self-join. Same table joined to itself with different aliases — hierarchies (manager-employee), pairwise comparisons (find duplicates), graph edges, time-windowed comparisons.
- Anti-join. “Rows in A with no match in B” — users without orders, products without reviews, sessions without conversions.
- Semi-join. “Rows in A with at least one match in B” — users who have ordered, files that have been downloaded. Distinct from
INNER JOINbecause it returns one row per A regardless of how many matches in B. - Lateral join. Per-row subqueries — top-N-per-group, expanding JSON arrays, calling table-valued functions per outer row.
- Theta-join. Non-equality predicates — overlapping intervals, range matches (price between min and max), nearest-neighbour pairs.
Reach for window functions instead of self-joins when the question is “compare each row to other rows in the same group” — LAG() / LEAD() / ROW_NUMBER() are usually faster and clearer.
How it works#
Self-join. A table joined to itself produces a Cartesian-ish result restricted by the join predicate. Use a distinct alias for each occurrence; the engine treats each alias as a separate logical relation.
-- Find pairs of duplicate emails: same email, different user_idSELECT a.user_id AS user_a, b.user_id AS user_b, a.emailFROM users aJOIN users b ON b.email = a.email AND b.user_id > a.user_id;The b.user_id > a.user_id predicate avoids reporting each pair twice and skips self-pairs.
Anti-join. Two canonical forms:
-- LEFT JOIN ... IS NULL formSELECT u.idFROM users uLEFT JOIN orders o ON o.user_id = u.idWHERE o.user_id IS NULL;
-- NOT EXISTS form (usually preferred)SELECT u.idFROM users uWHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);Postgres recognises both shapes and plans them as a “hash anti-join” or “merge anti-join” — semantically equivalent but often produces slightly different plans on indexed columns. NOT EXISTS short-circuits per row and is correct even when orders.user_id can be NULL; NOT IN is not (see Set Compare).
Semi-join. EXISTS is the canonical form:
SELECT u.idFROM users uWHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);INNER JOIN orders would also return matching users, but it returns one row per matching order and would need DISTINCT or GROUP BY to collapse. EXISTS short-circuits and returns one row per user.
Lateral join. A LATERAL subquery on the right side of a JOIN can reference columns from the left side. Without LATERAL, the subquery is evaluated independently and can’t see outer columns.
SELECT u.id AS user_id, recent.order_id, recent.created_atFROM users uCROSS JOIN LATERAL ( SELECT order_id, created_at FROM orders o WHERE o.user_id = u.id ORDER BY created_at DESC LIMIT 3) recent;For each u, the subquery runs with u.id substituted. The result is at most three rows per user — the canonical top-N-per-group shape. SQL Server calls this CROSS APPLY (and OUTER APPLY for the left-join variant). Postgres has had LATERAL since 9.3. MySQL added LATERAL in 8.0.14. Oracle has CROSS APPLY and LATERAL. SQLite has no lateral support.
Theta-join. Any join with a predicate that isn’t a simple equality on columns. The “theta” comes from relational algebra where the predicate is denoted θ.
-- Overlapping reservation intervalsSELECT a.id, b.idFROM reservations a, reservations bWHERE a.id < b.id AND a.room_id = b.room_id AND a.starts_at < b.ends_at AND b.starts_at < a.ends_at;Theta-joins are expensive without good indexes — the planner can’t reduce to a hash or merge join because the predicate isn’t equality. Range indexes (Postgres GIST on tstzrange) make these queries practical at scale.
Variants#
OUTER APPLY / left-lateral. Postgres: LEFT JOIN LATERAL (...) returns NULLs for outer rows with no inner match. SQL Server: OUTER APPLY. Useful when you want “give me the top-3 orders per user, including users with zero orders.”
SELECT u.id, recent.order_idFROM users uLEFT JOIN LATERAL ( SELECT order_id FROM orders o WHERE o.user_id = u.id ORDER BY created_at DESC LIMIT 3) recent ON true;Self-anti-join. Anti-join against the same table — useful for “find rows that have no predecessor / successor.”
-- Events with no later event in the same sessionSELECT e.session_id, e.event_id, e.tsFROM events eWHERE NOT EXISTS ( SELECT 1 FROM events e2 WHERE e2.session_id = e.session_id AND e2.ts > e.ts);Non-equi self-join for time-series gaps. “Find the next event for each event in the same session” — a classic theta + self pattern:
SELECT e.event_id, MIN(n.ts) AS next_tsFROM events eLEFT JOIN events n ON n.session_id = e.session_id AND n.ts > e.tsGROUP BY e.event_id;A LAG() / LEAD() window does the same job faster on most engines.
Range-overlap joins. Postgres GIST indexes on range types make WHERE r1.range && r2.range efficient. Without them, an interval overlap join is O(n²).
JOIN LATERAL for unnesting. LATERAL + unnest(array_col) expands one row per array element while keeping outer-row columns visible:
SELECT u.id, tagFROM users uCROSS JOIN LATERAL unnest(u.tags) AS tag;Common pitfalls#
- Using
INNER JOINwhenEXISTSwas intended. Joininguserstoordersand not deduplicating returns one row per order; the user-list-multiplied-by-order-count bug. If you only want users-with-orders,EXISTS. NOT INwith NULLs. As covered in Set Compare: a single NULL in the subquery makes the wholeNOT INreturn zero rows. UseNOT EXISTSreflexively for anti-joins.- Forgetting
LATERAL. Without it, the right-side subquery cannot reference outer columns; the engine errors out with “column does not exist.” Easy mistake when porting from SQL Server’sCROSS APPLY(whereLATERALis implicit). - MySQL doesn’t have lateral pre-8.0.14. A query that works on Postgres might fail on older MySQL — rewrite as a correlated subquery in the
SELECTlist or use a window function. - Theta-joins without indexes. A self-join on
WHERE a.x < b.xagainst a million-row table is a trillion-pair scan. Range indexes or aWHEREpredicate that bounds one side dramatically. - Self-join double-counting.
JOIN users b ON b.email = a.emailreturns each pair twice plus self-pairs. Theb.id > a.idpredicate is the standard fix. LATERALperformance trap. Lateral subqueries run per outer row. With a million users and a slow subquery, that’s a million slow executions. Make sure the inner query is index-supported.- Outer apply gotcha. A
LEFT JOIN LATERAL (...) ON truealways succeeds (one outer row, zero or more inner). If the inner query returns zero rows, you get one outer row with NULL columns. If you wroteLEFT JOIN LATERAL (...) ON false, you always get the NULL row regardless of inner content — usually a bug.
Practice problem#
You have an orders(order_id, user_id, total_cents, created_at) table. For each user, return their three most recent orders (order_id, total_cents, created_at). Users with fewer than three orders should still appear with however many they have; users with zero orders can be omitted. Aim for a query that uses a single index scan per user rather than reading the whole orders table.
Solution
Lateral join with an index on orders(user_id, created_at DESC):
SELECT u.id AS user_id, recent.order_id, recent.total_cents, recent.created_atFROM users uCROSS JOIN LATERAL ( SELECT order_id, total_cents, created_at FROM orders o WHERE o.user_id = u.id ORDER BY created_at DESC LIMIT 3) recent;Each outer user triggers an index range scan on (user_id, created_at DESC) that stops after three rows — three index reads per user, not a full table scan. The window-function alternative (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) filtered to rn <= 3) is also correct but reads every row of orders because the window must see them all before numbering — slower for sparse selections.
Related patterns#
- Existence Check — the semi-join / anti-join shape as a stand-alone pattern.
- Set Compare —
INTERSECT/EXCEPTas set-algebra equivalents. - Joins (Inner, Outer, Cross) — the standard joins these variants extend.
- Subqueries and CTEs —
EXISTSis a correlated subquery; lateral joins are first-class subqueries inFROM. - Window Functions — the alternative for many self-join problems.