Nested Query
Correlated vs uncorrelated subqueries; subquery in SELECT vs WHERE vs FROM. When CTEs beat nested subqueries.
What it is#
A nested query is a SELECT inside another SELECT. SQL allows subqueries in three positions, and each position has its own semantics, performance profile, and idiomatic use:
-- 1. SELECT-clause subquery (scalar — one row, one column)SELECT u.id, u.email, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_countFROM users u;
-- 2. WHERE-clause subquery (filter — EXISTS, IN, comparison)SELECT *FROM usersWHERE id IN (SELECT user_id FROM orders WHERE total_cents > 10000);
-- 3. FROM-clause subquery (derived table)SELECT t.bucket, AVG(t.total)FROM ( SELECT user_id, SUM(total_cents) AS total, NTILE(10) OVER (ORDER BY SUM(total_cents)) AS bucket FROM orders GROUP BY user_id) tGROUP BY t.bucket;The orthogonal axis is correlated vs uncorrelated. A correlated subquery references columns from the outer query — conceptually, it re-runs per outer row. An uncorrelated subquery is self-contained and runs once.
When to use it#
- Scalar subquery in SELECT — when you want one extra value per outer row that depends on another table. Cheap when the inner query is index-supported; otherwise prefer a join or a window function.
- Subquery in WHERE —
IN,NOT IN,EXISTS,NOT EXISTS, and comparison subqueries. The right form depends on whether NULLs are possible, whether you need multiple matches, and the available indexes. - Subquery in FROM (derived table) — gives a name and shape to an intermediate result. Most useful when you need to compute a value (an aggregate, a window) and then filter or join against it.
- CTE (
WITH ... AS) — same role as a derived table but named and reusable within the same statement. Preferred when the same intermediate appears more than once or when the query is long enough that naming aids readability.
Reach for window functions instead of correlated subqueries when the question is “compare each row to others in the same partition” — the window function reads the table once where the correlated subquery may read it once per outer row.
How it works#
Correlated subquery. Logically evaluates once per outer row.
SELECT u.id, (SELECT MAX(o.total_cents) FROM orders o WHERE o.user_id = u.id) AS best_orderFROM users u;For each u, the inner query runs with u.id substituted. With a million users and no index on orders.user_id, that’s a million scans — pathologically slow. With an index, it’s a million single-key lookups — usually fine.
Modern optimisers often rewrite correlated subqueries into joins or anti-joins. Postgres calls this “subquery pull-up.” It’s not guaranteed — particularly when the subquery has aggregates, LIMIT, or sits inside SELECT.
Uncorrelated subquery. Runs once; its result is treated as a constant or a virtual table.
SELECT *FROM ordersWHERE total_cents > (SELECT AVG(total_cents) FROM orders);The inner average is computed once and cached for the duration of the query.
EXISTS vs IN. Both are uncorrelated-or-correlated set-membership checks.
-- Correlated EXISTS — short-circuits per outer rowSELECT *FROM users uWHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
-- Uncorrelated IN — collects all order user_ids firstSELECT *FROM usersWHERE id IN (SELECT user_id FROM orders);On a modern Postgres, the planner produces a hash semi-join for both. On older or simpler engines they can differ: EXISTS stops at the first match; IN materialises the whole inner set first. NOT EXISTS is safer than NOT IN (NULL pitfall — see Set Compare).
Derived tables and CTEs.
WITH ... AS) — named, can be referenced multiple times in the same statement, supports recursion (WITH RECURSIVE). Postgres 12+ inlines non-recursive non-materialised CTEs by default; before 12, every CTE was an optimisation fence — materialised regardless. MySQL 8 and SQL Server inline by default. -- Same query, two shapes-- Derived-table formSELECT t.user_id, t.totalFROM ( SELECT user_id, SUM(total_cents) AS total FROM orders GROUP BY user_id) tWHERE t.total > 100000;
-- CTE form (preferred for readability)WITH user_totals AS ( SELECT user_id, SUM(total_cents) AS total FROM orders GROUP BY user_id)SELECT user_id, totalFROM user_totalsWHERE total > 100000;On Postgres 12+ the plans are identical. On Postgres 11 and older, the CTE form materialises into a temporary result and may be slower — WITH user_totals AS MATERIALIZED (...) and AS NOT MATERIALIZED (...) let you force the choice on modern Postgres.
Variants#
Subquery in SELECT for lookups. Common in reporting queries that need a side-value per row.
SELECT o.order_id, o.user_id, (SELECT email FROM users u WHERE u.id = o.user_id) AS emailFROM orders o;This is equivalent to a LEFT JOIN users u ON u.id = o.user_id and projecting u.email. Plans are usually identical when users.id is indexed. The join form is more flexible — you can project multiple columns from users without writing one subquery per column.
Scalar subquery returning multiple rows. Runtime error in every major engine. Use LIMIT 1 with an ORDER BY to make the result deterministic.
SELECT u.id, (SELECT total_cents FROM orders o WHERE o.user_id = u.id ORDER BY created_at DESC LIMIT 1) AS latest_order_centsFROM users u;Aggregate in a correlated subquery. A common shape — “per outer row, compute something from a related table.”
SELECT u.id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count, (SELECT MAX(total_cents) FROM orders o WHERE o.user_id = u.id) AS biggestFROM users u;This pattern is readable but does two scans per outer row. A single LEFT JOIN ... GROUP BY is one pass over orders:
SELECT u.id, COALESCE(agg.order_count, 0) AS order_count, agg.biggestFROM users uLEFT JOIN ( SELECT user_id, COUNT(*) AS order_count, MAX(total_cents) AS biggest FROM orders GROUP BY user_id) agg ON agg.user_id = u.id;Recursive CTE. A CTE that references itself — covered separately in Recursive Tree. Used for hierarchies, graph traversal, and generating sequences.
MATERIALIZED hint. Postgres 12+ — force the optimiser’s hand:
WITH user_totals AS MATERIALIZED ( -- always materialise SELECT user_id, SUM(total_cents) AS total FROM orders GROUP BY user_id)SELECT * FROM user_totals WHERE total > 100000;Useful when the optimiser keeps inlining a CTE you wanted to compute once and reference many times.
Common pitfalls#
- Scalar subquery returning multiple rows. Runtime error — usually surfaces only when a particular outer row happens to match multiple inner rows.
LIMIT 1with an explicitORDER BYmakes it deterministic. - N+1 in correlated SELECT-clause subqueries. Each subquery in
SELECTruns once per outer row. Three subqueries against the same related table means three correlated scans per row — turn into oneLEFT JOIN ... GROUP BY. NOT IN+ NULL. Covered in Set Compare —WHERE x NOT IN (SELECT y FROM t)returns zero rows if anyt.yis NULL. UseNOT EXISTS.- Forgetting the correlation. Writing
SELECT 1 FROM orders o WHERE user_id = idwithout aliasingidtou.idaccidentally referencesorders.user_idagainst itself, making the subquery always true. Always qualify column names in correlated subqueries. - CTE optimisation-fence assumption. Code written for Postgres 11 may rely on
WITHbeing a materialisation barrier (“compute this once and reuse”). On Postgres 12+ the CTE is inlined by default and the plan changes. UseAS MATERIALIZEDto keep the old behaviour explicit. - CTE inlined when you wanted materialisation. Mirror of the above — a complex CTE referenced four times may be evaluated four times after inlining. Add
MATERIALIZEDif the CTE is expensive. - Derived table without an alias. Most engines require every
FROM-clause subquery to have an alias ((SELECT ...) AS t). MySQL is strict about this. - CTE column-list mismatch.
WITH t(a, b) AS (SELECT 1, 2, 3)errors — column count must match. Forgetting to include a column you reference later is a common bug. - Subquery cost not visible without
EXPLAIN. A correlated subquery inSELECTlooks innocuous in source but can dominate the plan. Always checkEXPLAIN ANALYZEfor the per-row cost.
Practice problem#
You have users(id, email) and orders(order_id, user_id, total_cents, created_at). Find each user’s most expensive order, returning (user_id, email, order_id, total_cents). Write it three ways: (a) a correlated subquery in the WHERE clause, (b) a window function, (c) a CTE with an aggregate join. Compare the plans on a million-row orders table.
Solution
-- (a) Correlated subquerySELECT u.id, u.email, o.order_id, o.total_centsFROM users uJOIN orders o ON o.user_id = u.idWHERE o.total_cents = ( SELECT MAX(o2.total_cents) FROM orders o2 WHERE o2.user_id = u.id);
-- (b) Window functionWITH ranked AS ( SELECT u.id AS user_id, u.email, o.order_id, o.total_cents, ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.total_cents DESC, o.order_id DESC) AS rn FROM users u JOIN orders o ON o.user_id = u.id)SELECT user_id, email, order_id, total_centsFROM rankedWHERE rn = 1;
-- (c) CTE with aggregate, then re-joinWITH max_per_user AS ( SELECT user_id, MAX(total_cents) AS max_total FROM orders GROUP BY user_id)SELECT u.id, u.email, o.order_id, o.total_centsFROM users uJOIN max_per_user m ON m.user_id = u.idJOIN orders o ON o.user_id = u.id AND o.total_cents = m.max_total;Plan-wise: (a) does one scan of orders per user — fine with an index on (user_id, total_cents), expensive without. (b) does a single pass over orders plus a sort per partition — predictable, scales well, but reads every row. (c) does one aggregate scan plus a re-join — usually the fastest on large orders tables with a (user_id, total_cents) index, but returns multiple rows when ties exist on total_cents. Pick (b) for cleanest semantics, (c) for raw throughput when ties are impossible.
Related patterns#
- Existence Check —
EXISTS/NOT EXISTSas a focused use of correlated subqueries. - Set Compare —
IN/NOT INand their set-algebra equivalents. - Join Variants — anti-join and semi-join forms that replace many nested-query shapes.
- Subqueries and CTEs — the underlying mechanism.
- Window Functions — the common rewrite target for “per outer row” correlated subqueries.