Nested Query — DBMS · Engineering Playbook

What it is#

A nested query is a SELECT inside another SELECT. SQL allows subqueries in three positions, and each position has its own semantics, performance profile, and idiomatic use:

-- 1. SELECT-clause subquery (scalar — one row, one column)
SELECT u.id,
       u.email,
       (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM users u;

-- 2. WHERE-clause subquery (filter — EXISTS, IN, comparison)
SELECT *
FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total_cents > 10000);

-- 3. FROM-clause subquery (derived table)
SELECT t.bucket, AVG(t.total)
FROM (
  SELECT user_id, SUM(total_cents) AS total,
         NTILE(10) OVER (ORDER BY SUM(total_cents)) AS bucket
  FROM orders
  GROUP BY user_id
) t
GROUP BY t.bucket;

The orthogonal axis is correlated vs uncorrelated. A correlated subquery references columns from the outer query — conceptually, it re-runs per outer row. An uncorrelated subquery is self-contained and runs once.

When to use it#

Scalar subquery in SELECT — when you want one extra value per outer row that depends on another table. Cheap when the inner query is index-supported; otherwise prefer a join or a window function.
Subquery in WHERE — IN, NOT IN, EXISTS, NOT EXISTS, and comparison subqueries. The right form depends on whether NULLs are possible, whether you need multiple matches, and the available indexes.
Subquery in FROM (derived table) — gives a name and shape to an intermediate result. Most useful when you need to compute a value (an aggregate, a window) and then filter or join against it.
CTE (WITH ... AS) — same role as a derived table but named and reusable within the same statement. Preferred when the same intermediate appears more than once or when the query is long enough that naming aids readability.

Reach for window functions instead of correlated subqueries when the question is “compare each row to others in the same partition” — the window function reads the table once where the correlated subquery may read it once per outer row.

How it works#

Correlated subquery. Logically evaluates once per outer row.

SELECT u.id,
       (SELECT MAX(o.total_cents)
        FROM orders o
        WHERE o.user_id = u.id) AS best_order
FROM users u;

For each u, the inner query runs with u.id substituted. With a million users and no index on orders.user_id, that’s a million scans — pathologically slow. With an index, it’s a million single-key lookups — usually fine.

Modern optimisers often rewrite correlated subqueries into joins or anti-joins. Postgres calls this “subquery pull-up.” It’s not guaranteed — particularly when the subquery has aggregates, LIMIT, or sits inside SELECT.

Uncorrelated subquery. Runs once; its result is treated as a constant or a virtual table.

SELECT *
FROM orders
WHERE total_cents > (SELECT AVG(total_cents) FROM orders);

The inner average is computed once and cached for the duration of the query.

EXISTS vs IN. Both are uncorrelated-or-correlated set-membership checks.

-- Correlated EXISTS — short-circuits per outer row
SELECT *
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

-- Uncorrelated IN — collects all order user_ids first
SELECT *
FROM users
WHERE id IN (SELECT user_id FROM orders);

On a modern Postgres, the planner produces a hash semi-join for both. On older or simpler engines they can differ: EXISTS stops at the first match; IN materialises the whole inner set first. NOT EXISTS is safer than NOT IN (NULL pitfall — see Set Compare).

Derived tables and CTEs.

Derived table (subquery in FROM) — anonymous (aliased), inline, scoped to the surrounding query. Always inlined by the optimiser. Good for one-off intermediate results.

CTE (WITH ... AS) — named, can be referenced multiple times in the same statement, supports recursion (WITH RECURSIVE). Postgres 12+ inlines non-recursive non-materialised CTEs by default; before 12, every CTE was an optimisation fence — materialised regardless. MySQL 8 and SQL Server inline by default.

-- Same query, two shapes
-- Derived-table form
SELECT t.user_id, t.total
FROM (
  SELECT user_id, SUM(total_cents) AS total
  FROM orders
  GROUP BY user_id
) t
WHERE t.total > 100000;

-- CTE form (preferred for readability)
WITH user_totals AS (
  SELECT user_id, SUM(total_cents) AS total
  FROM orders
  GROUP BY user_id
)
SELECT user_id, total
FROM user_totals
WHERE total > 100000;

On Postgres 12+ the plans are identical. On Postgres 11 and older, the CTE form materialises into a temporary result and may be slower — WITH user_totals AS MATERIALIZED (...) and AS NOT MATERIALIZED (...) let you force the choice on modern Postgres.

Variants#

Subquery in SELECT for lookups. Common in reporting queries that need a side-value per row.

SELECT o.order_id,
       o.user_id,
       (SELECT email FROM users u WHERE u.id = o.user_id) AS email
FROM orders o;

This is equivalent to a LEFT JOIN users u ON u.id = o.user_id and projecting u.email. Plans are usually identical when users.id is indexed. The join form is more flexible — you can project multiple columns from users without writing one subquery per column.

Scalar subquery returning multiple rows. Runtime error in every major engine. Use LIMIT 1 with an ORDER BY to make the result deterministic.

SELECT u.id,
       (SELECT total_cents FROM orders o
        WHERE o.user_id = u.id
        ORDER BY created_at DESC
        LIMIT 1) AS latest_order_cents
FROM users u;

Aggregate in a correlated subquery. A common shape — “per outer row, compute something from a related table.”

SELECT u.id,
       (SELECT COUNT(*)        FROM orders o WHERE o.user_id = u.id) AS order_count,
       (SELECT MAX(total_cents) FROM orders o WHERE o.user_id = u.id) AS biggest
FROM users u;

This pattern is readable but does two scans per outer row. A single LEFT JOIN ... GROUP BY is one pass over orders:

SELECT u.id,
       COALESCE(agg.order_count, 0) AS order_count,
       agg.biggest
FROM users u
LEFT JOIN (
  SELECT user_id, COUNT(*) AS order_count, MAX(total_cents) AS biggest
  FROM orders
  GROUP BY user_id
) agg ON agg.user_id = u.id;

Recursive CTE. A CTE that references itself — covered separately in Recursive Tree. Used for hierarchies, graph traversal, and generating sequences.

MATERIALIZED hint. Postgres 12+ — force the optimiser’s hand:

WITH user_totals AS MATERIALIZED (  -- always materialise
  SELECT user_id, SUM(total_cents) AS total FROM orders GROUP BY user_id
)
SELECT * FROM user_totals WHERE total > 100000;

Useful when the optimiser keeps inlining a CTE you wanted to compute once and reference many times.

Common pitfalls#

Scalar subquery returning multiple rows. Runtime error — usually surfaces only when a particular outer row happens to match multiple inner rows. LIMIT 1 with an explicit ORDER BY makes it deterministic.
N+1 in correlated SELECT-clause subqueries. Each subquery in SELECT runs once per outer row. Three subqueries against the same related table means three correlated scans per row — turn into one LEFT JOIN ... GROUP BY.
NOT IN + NULL. Covered in Set Compare — WHERE x NOT IN (SELECT y FROM t) returns zero rows if any t.y is NULL. Use NOT EXISTS.
Forgetting the correlation. Writing SELECT 1 FROM orders o WHERE user_id = id without aliasing id to u.id accidentally references orders.user_id against itself, making the subquery always true. Always qualify column names in correlated subqueries.
CTE optimisation-fence assumption. Code written for Postgres 11 may rely on WITH being a materialisation barrier (“compute this once and reuse”). On Postgres 12+ the CTE is inlined by default and the plan changes. Use AS MATERIALIZED to keep the old behaviour explicit.
CTE inlined when you wanted materialisation. Mirror of the above — a complex CTE referenced four times may be evaluated four times after inlining. Add MATERIALIZED if the CTE is expensive.
Derived table without an alias. Most engines require every FROM-clause subquery to have an alias ((SELECT ...) AS t). MySQL is strict about this.
CTE column-list mismatch. WITH t(a, b) AS (SELECT 1, 2, 3) errors — column count must match. Forgetting to include a column you reference later is a common bug.
Subquery cost not visible without EXPLAIN. A correlated subquery in SELECT looks innocuous in source but can dominate the plan. Always check EXPLAIN ANALYZE for the per-row cost.

Practice problem#

You have users(id, email) and orders(order_id, user_id, total_cents, created_at). Find each user’s most expensive order, returning (user_id, email, order_id, total_cents). Write it three ways: (a) a correlated subquery in the WHERE clause, (b) a window function, (c) a CTE with an aggregate join. Compare the plans on a million-row orders table.

Solution

-- (a) Correlated subquery
SELECT u.id, u.email, o.order_id, o.total_cents
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.total_cents = (
  SELECT MAX(o2.total_cents) FROM orders o2 WHERE o2.user_id = u.id
);

-- (b) Window function
WITH ranked AS (
  SELECT u.id AS user_id, u.email, o.order_id, o.total_cents,
         ROW_NUMBER() OVER (PARTITION BY o.user_id
                            ORDER BY o.total_cents DESC, o.order_id DESC) AS rn
  FROM users u
  JOIN orders o ON o.user_id = u.id
)
SELECT user_id, email, order_id, total_cents
FROM ranked
WHERE rn = 1;

-- (c) CTE with aggregate, then re-join
WITH max_per_user AS (
  SELECT user_id, MAX(total_cents) AS max_total
  FROM orders
  GROUP BY user_id
)
SELECT u.id, u.email, o.order_id, o.total_cents
FROM users u
JOIN max_per_user m ON m.user_id = u.id
JOIN orders o ON o.user_id = u.id AND o.total_cents = m.max_total;

Plan-wise: (a) does one scan of orders per user — fine with an index on (user_id, total_cents), expensive without. (b) does a single pass over orders plus a sort per partition — predictable, scales well, but reads every row. (c) does one aggregate scan plus a re-join — usually the fastest on large orders tables with a (user_id, total_cents) index, but returns multiple rows when ties exist on total_cents. Pick (b) for cleanest semantics, (c) for raw throughput when ties are impossible.

Existence Check — EXISTS / NOT EXISTS as a focused use of correlated subqueries.
Set Compare — IN / NOT IN and their set-algebra equivalents.
Join Variants — anti-join and semi-join forms that replace many nested-query shapes.
Subqueries and CTEs — the underlying mechanism.
Window Functions — the common rewrite target for “per outer row” correlated subqueries.