Join Variants

Self-join, anti-join, semi-join, lateral join, theta-join. Beyond the standard four kinds of join.

Pattern Intermediate
8 min read
sql sql-pattern joins lateral anti-join semi-join

What it is#

Past the standard four joins (INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER), there’s a second tier of join shapes that map to specific question types: self-join, anti-join, semi-join, lateral join, theta-join. Some have dedicated syntax; most are expressed as combinations of standard joins with predicates.

-- Self-join: employees with their managers
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.id = e.manager_id;
-- Anti-join: users with no orders
SELECT u.id, u.email
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.user_id IS NULL;
-- Semi-join: users who have at least one order
SELECT u.id, u.email
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
-- Lateral join: three most recent orders per user
SELECT u.id, recent.order_id, recent.created_at
FROM users u,
LATERAL (
SELECT order_id, created_at
FROM orders o
WHERE o.user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) recent;
-- Theta-join: pairs of overlapping reservations
SELECT a.id, b.id
FROM reservations a
JOIN reservations b
ON a.id < b.id
AND a.room_id = b.room_id
AND a.starts_at < b.ends_at
AND b.starts_at < a.ends_at;

These shapes show up in every SQL interview and most real production queries. Recognising which one fits the question is half the work.

When to use it#

  • Self-join. Same table joined to itself with different aliases — hierarchies (manager-employee), pairwise comparisons (find duplicates), graph edges, time-windowed comparisons.
  • Anti-join. “Rows in A with no match in B” — users without orders, products without reviews, sessions without conversions.
  • Semi-join. “Rows in A with at least one match in B” — users who have ordered, files that have been downloaded. Distinct from INNER JOIN because it returns one row per A regardless of how many matches in B.
  • Lateral join. Per-row subqueries — top-N-per-group, expanding JSON arrays, calling table-valued functions per outer row.
  • Theta-join. Non-equality predicates — overlapping intervals, range matches (price between min and max), nearest-neighbour pairs.

Reach for window functions instead of self-joins when the question is “compare each row to other rows in the same group” — LAG() / LEAD() / ROW_NUMBER() are usually faster and clearer.

How it works#

Self-join. A table joined to itself produces a Cartesian-ish result restricted by the join predicate. Use a distinct alias for each occurrence; the engine treats each alias as a separate logical relation.

-- Find pairs of duplicate emails: same email, different user_id
SELECT a.user_id AS user_a, b.user_id AS user_b, a.email
FROM users a
JOIN users b ON b.email = a.email
AND b.user_id > a.user_id;

The b.user_id > a.user_id predicate avoids reporting each pair twice and skips self-pairs.

Anti-join. Two canonical forms:

-- LEFT JOIN ... IS NULL form
SELECT u.id
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.user_id IS NULL;
-- NOT EXISTS form (usually preferred)
SELECT u.id
FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

Postgres recognises both shapes and plans them as a “hash anti-join” or “merge anti-join” — semantically equivalent but often produces slightly different plans on indexed columns. NOT EXISTS short-circuits per row and is correct even when orders.user_id can be NULL; NOT IN is not (see Set Compare).

Semi-join. EXISTS is the canonical form:

SELECT u.id
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

INNER JOIN orders would also return matching users, but it returns one row per matching order and would need DISTINCT or GROUP BY to collapse. EXISTS short-circuits and returns one row per user.

Lateral join. A LATERAL subquery on the right side of a JOIN can reference columns from the left side. Without LATERAL, the subquery is evaluated independently and can’t see outer columns.

SELECT u.id AS user_id, recent.order_id, recent.created_at
FROM users u
CROSS JOIN LATERAL (
SELECT order_id, created_at
FROM orders o
WHERE o.user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) recent;

For each u, the subquery runs with u.id substituted. The result is at most three rows per user — the canonical top-N-per-group shape. SQL Server calls this CROSS APPLY (and OUTER APPLY for the left-join variant). Postgres has had LATERAL since 9.3. MySQL added LATERAL in 8.0.14. Oracle has CROSS APPLY and LATERAL. SQLite has no lateral support.

Theta-join. Any join with a predicate that isn’t a simple equality on columns. The “theta” comes from relational algebra where the predicate is denoted θ.

-- Overlapping reservation intervals
SELECT a.id, b.id
FROM reservations a, reservations b
WHERE a.id < b.id
AND a.room_id = b.room_id
AND a.starts_at < b.ends_at
AND b.starts_at < a.ends_at;

Theta-joins are expensive without good indexes — the planner can’t reduce to a hash or merge join because the predicate isn’t equality. Range indexes (Postgres GIST on tstzrange) make these queries practical at scale.

Variants#

OUTER APPLY / left-lateral. Postgres: LEFT JOIN LATERAL (...) returns NULLs for outer rows with no inner match. SQL Server: OUTER APPLY. Useful when you want “give me the top-3 orders per user, including users with zero orders.”

SELECT u.id, recent.order_id
FROM users u
LEFT JOIN LATERAL (
SELECT order_id
FROM orders o
WHERE o.user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) recent ON true;

Self-anti-join. Anti-join against the same table — useful for “find rows that have no predecessor / successor.”

-- Events with no later event in the same session
SELECT e.session_id, e.event_id, e.ts
FROM events e
WHERE NOT EXISTS (
SELECT 1 FROM events e2
WHERE e2.session_id = e.session_id AND e2.ts > e.ts
);

Non-equi self-join for time-series gaps. “Find the next event for each event in the same session” — a classic theta + self pattern:

SELECT e.event_id, MIN(n.ts) AS next_ts
FROM events e
LEFT JOIN events n
ON n.session_id = e.session_id AND n.ts > e.ts
GROUP BY e.event_id;

A LAG() / LEAD() window does the same job faster on most engines.

Range-overlap joins. Postgres GIST indexes on range types make WHERE r1.range && r2.range efficient. Without them, an interval overlap join is O(n²).

JOIN LATERAL for unnesting. LATERAL + unnest(array_col) expands one row per array element while keeping outer-row columns visible:

SELECT u.id, tag
FROM users u
CROSS JOIN LATERAL unnest(u.tags) AS tag;

Common pitfalls#

  • Using INNER JOIN when EXISTS was intended. Joining users to orders and not deduplicating returns one row per order; the user-list-multiplied-by-order-count bug. If you only want users-with-orders, EXISTS.
  • NOT IN with NULLs. As covered in Set Compare: a single NULL in the subquery makes the whole NOT IN return zero rows. Use NOT EXISTS reflexively for anti-joins.
  • Forgetting LATERAL. Without it, the right-side subquery cannot reference outer columns; the engine errors out with “column does not exist.” Easy mistake when porting from SQL Server’s CROSS APPLY (where LATERAL is implicit).
  • MySQL doesn’t have lateral pre-8.0.14. A query that works on Postgres might fail on older MySQL — rewrite as a correlated subquery in the SELECT list or use a window function.
  • Theta-joins without indexes. A self-join on WHERE a.x < b.x against a million-row table is a trillion-pair scan. Range indexes or a WHERE predicate that bounds one side dramatically.
  • Self-join double-counting. JOIN users b ON b.email = a.email returns each pair twice plus self-pairs. The b.id > a.id predicate is the standard fix.
  • LATERAL performance trap. Lateral subqueries run per outer row. With a million users and a slow subquery, that’s a million slow executions. Make sure the inner query is index-supported.
  • Outer apply gotcha. A LEFT JOIN LATERAL (...) ON true always succeeds (one outer row, zero or more inner). If the inner query returns zero rows, you get one outer row with NULL columns. If you wrote LEFT JOIN LATERAL (...) ON false, you always get the NULL row regardless of inner content — usually a bug.

Practice problem#

You have an orders(order_id, user_id, total_cents, created_at) table. For each user, return their three most recent orders (order_id, total_cents, created_at). Users with fewer than three orders should still appear with however many they have; users with zero orders can be omitted. Aim for a query that uses a single index scan per user rather than reading the whole orders table.

Solution

Lateral join with an index on orders(user_id, created_at DESC):

SELECT u.id AS user_id, recent.order_id, recent.total_cents, recent.created_at
FROM users u
CROSS JOIN LATERAL (
SELECT order_id, total_cents, created_at
FROM orders o
WHERE o.user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) recent;

Each outer user triggers an index range scan on (user_id, created_at DESC) that stops after three rows — three index reads per user, not a full table scan. The window-function alternative (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) filtered to rn <= 3) is also correct but reads every row of orders because the window must see them all before numbering — slower for sparse selections.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.