Subqueries and CTEs
Correlated vs non-correlated subqueries, the readability case for CTEs, recursive CTEs for trees.
What it is#
A subquery is a query nested inside another query. A CTE (Common Table Expression) is a named subquery introduced with WITH name AS (...) that you can reference once or many times in the outer query. Both are ways to compose SQL out of smaller pieces; both compile to the same underlying relational algebra; both can usually be rewritten as joins, but readability and reuse make them distinct tools.
The most common forms:
-- non-correlated subquerySELECT * FROM orders WHERE customer_id IN (SELECT id FROM vip_customers);
-- correlated subquerySELECT o.* FROM orders oWHERE o.amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id);
-- CTEWITH recent_orders AS ( SELECT customer_id, COUNT(*) AS n FROM orders WHERE created_at > NOW() - INTERVAL '30 days' GROUP BY 1)SELECT c.name, r.n FROM customers c JOIN recent_orders r ON c.id = r.customer_id;CTEs are a SQL-99 feature; recursive CTEs (WITH RECURSIVE ...) were added in SQL:1999 too but landed in major engines years later. MySQL only got both in 8.0 (2018).
When to use it#
Reach for a non-correlated subquery when you need a small lookup set inside IN / NOT IN / EXISTS / =. Reach for a correlated subquery when each outer row needs a different inner answer (“the average for this customer”). Reach for a CTE when:
- The same derived table is referenced more than once.
- The query is deep enough that flattening it into one block hurts readability.
- You need a recursive query (tree walk, graph reachability, generated series).
Don’t reach for subqueries/CTEs when a plain JOIN would do — the planner can almost always handle them, but inline joins read more naturally and don’t tempt junior readers into believing the subquery is somehow “separate” work.
How it works#
Logically every SQL construct is an algebraic expression, and subqueries/CTEs are just labelled subtrees in that expression. Operationally:
- Non-correlated subquery. Executed once. The result is materialised (or kept in a hash) and the outer query consumes it. For
WHERE x IN (...), modern planners convert it to a semijoin — equivalent toINNER JOIN ... DISTINCTagainst the subquery’s output. - Correlated subquery. Logically executed per outer row, because it depends on a value from that row. Planners aggressively rewrite — often into a join or window function — but if rewriting fails you can hit per-row execution and quadratic cost.
- Non-recursive CTE. A named subquery. Postgres 12+ inlines CTEs by default unless materialisation is forced (
WITH x AS MATERIALIZED (...)); MySQL and SQL Server have always inlined them. Pre-12 Postgres treated CTEs as an “optimisation fence” — they materialised, and queries that joined a CTE to a table multiple times got slow. - Recursive CTE. Two arms: a non-recursive anchor and a recursive
UNION ALLstep. The engine iterates the recursive step until it produces no new rows. Used for trees, hierarchies, graph reachability, and generated number/date series.
-- Walk a manager hierarchy down from a CEOWITH RECURSIVE chain AS ( SELECT id, name, manager_id, 0 AS depth FROM employees WHERE name = 'CEO' UNION ALL SELECT e.id, e.name, e.manager_id, c.depth + 1 FROM employees e JOIN chain c ON e.manager_id = c.id)SELECT * FROM chain ORDER BY depth, name;Variants#
EXISTSvsINvsJOIN. All three express “is there a match in the inner set?”.EXISTSis usually fastest for correlated existence checks because it stops on the first match per outer row.NOT INhas a NULL trap: if the inner set ever produces a NULL,NOT INreturns nothing.NOT EXISTSdoesn’t have this trap.- Scalar subquery. A subquery that returns a single row, single column, usable wherever a value is expected (
SELECT ..., (SELECT MAX(...) FROM ...) AS x FROM ...). If the subquery returns more than one row, runtime error. - Lateral subquery.
LEFT JOIN LATERAL (subquery) ON true— the subquery can reference earlier rows in the FROM clause. Useful for “top-N per group” patterns when a window function is awkward. - Materialised CTE. Postgres’
WITH x AS MATERIALIZED (...)forces materialisation and re-use; the inverseNOT MATERIALIZEDforces inlining. Use sparingly — the planner usually picks correctly. - Recursive vs iterative. A recursive CTE walks a hierarchy in one query; an iterative procedure (PL/pgSQL loop, app-side recursion) walks it in many. The CTE is one round-trip; the procedure is N. CTEs win for short, well-bounded hierarchies; procedures win when the iteration body needs side effects.
Trade-offs#
WITH). Named; reusable; readable; recursive cases possible. One extra mental step (you read the CTE, then the outer query). Materialisation behaviour varied by engine until ~2018 — always test on your version. Other axes:
INvsEXISTS. Functionally close; performance can differ.EXISTSshort-circuits per outer row;INagainst a small static list is unbeatable on cardinality. Read the plan.- CTE vs subquery vs temp table. A CTE is logically a one-shot named subquery; a temp table is a real on-disk table you write once and query many times. For very large derived sets queried repeatedly across statements, a temp table can outperform a CTE that re-materialises.
- Recursive depth limit. Recursive CTEs can run away. Postgres has no default limit; MySQL’s
cte_max_recursion_depthdefaults to 1000. Always addLIMITor a depth guard for production queries.
Common pitfalls#
NOT IN+ NULL.WHERE x NOT IN (SELECT y FROM t)returns empty if any row oft.yis NULL. UseNOT EXISTSinstead, or filter NULLs explicitly.- Correlated subquery in
SELECT. Looks innocuous (SELECT (SELECT ... WHERE inner.id = outer.id) AS x FROM outer) but executes per outer row. Often a window function or LATERAL is the better shape. - Self-referencing CTE without termination. Forgetting the join condition in a recursive CTE’s recursive step produces a runaway scan. The recursive step’s predicate must shrink the working set.
- CTE as “view”. A CTE is not a view — it’s scoped to the statement, evaluated as part of it, and may or may not be materialised. Don’t expect view-level reuse.
- Misleading row counts. Subqueries inside
INdeduplicate; subqueries insideEXISTSdon’t. Subtle when refactoring between them. - CTE inlining surprises. Postgres pre-12 always materialised CTEs (optimisation fence); 12+ inlines unless told otherwise. A query that was fast on 11 can be slow on 12 because the planner now joins a CTE multiple times instead of materialising once.
A recursive-CTE pattern worth memorising
WITH RECURSIVE series AS (SELECT 1 AS n UNION ALL SELECT n+1 FROM series WHERE n < 100) SELECT * FROM series; — generates 1..100 without a generate_series table. Works on every engine with recursive-CTE support. Adapt to date ranges, calendar tables, or padding zero-count buckets in time-series queries.
Related building blocks#