Existence Check — DBMS · Engineering Playbook

What it is#

The existence check pattern asks: for each row in table A, does a related row exist (or not exist) in table B? SQL gives four ways to say this, and they are not interchangeable — they differ in NULL handling, performance, and readability.

-- Users who have placed at least one order

-- 1. EXISTS (correlated subquery)
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

-- 2. IN (subquery)
SELECT u.* FROM users u
WHERE u.id IN (SELECT user_id FROM orders);

-- 3. LEFT JOIN ... IS NOT NULL
SELECT u.* FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.user_id IS NOT NULL;

-- 4. INNER JOIN with DISTINCT
SELECT DISTINCT u.* FROM users u
INNER JOIN orders o ON o.user_id = u.id;

For the affirmative case (“users with orders”), all four return the same rows on a NULL-free schema and modern optimisers often produce identical plans. The interesting case is the negative form (“users without orders”) — where NOT IN will silently return zero rows if the subquery contains any NULL, and the other three are safer.

Picking the right form is a frequent interview discussion point and a real correctness hazard in production.

When to use it#

Use an existence check when:

The question is “does at least one matching row exist?” — not “how many” or “what value.”
You want to filter parent rows by the presence or absence of child rows.
You don’t need any columns from the child table in the output.

Use a regular JOIN instead when:

You need columns from the child table.
You want one row per parent–child match, not one row per parent.

Use Tally Count instead when you actually need the count of child rows. Use a HAVING COUNT(*) > N when the threshold is > N, not just “at least one.”

How it works#

EXISTS — semi-join, short-circuits, NULL-safe#

EXISTS (subquery) returns TRUE if the subquery returns one or more rows, FALSE if it returns zero. The values inside the subquery don’t matter — SELECT 1, SELECT *, SELECT user_id all behave the same.

SELECT u.* FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.user_id = u.id
);

This is a semi-join: every user appears at most once in the output, regardless of how many matching orders exist. It is the cleanest and safest form, and modern optimisers (Postgres, SQL Server 2017+, Oracle, MySQL 8.0+) translate it to a hash-semi-join or nested-loop-semi-join automatically.

EXISTS is NULL-safe: NULL user_id values in orders simply don’t match any user, and NULL id in users doesn’t match anything either — no silent zero-row trap.

NOT EXISTS — anti-join, also NULL-safe#

-- Users who have NEVER placed an order
SELECT u.* FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o
  WHERE o.user_id = u.id
);

NOT EXISTS is an anti-join: every user with zero matching orders. NULL-safe in the same way EXISTS is. This is the recommended form for “missing” queries.

IN — concise for static lists, NULL-trap for subqueries#

-- Users in a known set (no NULL risk)
SELECT * FROM users WHERE id IN (1, 2, 3, 4);

-- Users with orders (subquery — works when no NULLs)
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders);

For the affirmative case, IN is fine and concise. Modern optimisers will rewrite IN (subquery) into a semi-join — identical to EXISTS.

The trap is the negative case with a nullable column.

-- DANGER: if any order has user_id IS NULL, this returns nothing
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM orders);

The semantics are three-valued logic: x NOT IN (a, b, NULL) evaluates to x <> a AND x <> b AND x <> NULL — and x <> NULL is NULL, not TRUE. The whole AND chain becomes NULL, which is treated as FALSE by WHERE. Result: every row is filtered out.

Workarounds (if you must use NOT IN):

-- Filter NULLs from the subquery
SELECT * FROM users
WHERE id NOT IN (SELECT user_id FROM orders WHERE user_id IS NOT NULL);

But NOT EXISTS is simpler, safer, and identically fast on modern planners. Just use it.

LEFT JOIN … IS NULL — anti-join via joins#

-- Users with no orders
SELECT u.* FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.user_id IS NULL;

This expresses the anti-join via a LEFT JOIN: every user is joined to their matching orders; users with no matches get a single row with all orders columns NULL; the WHERE o.user_id IS NULL filter keeps only those.

It is NULL-safe for the join condition (NULL never equals anything, so unmatched rows correctly come through), but the choice of column for the IS NULL check matters: pick a NOT NULL column of the right side (typically the primary key or join key), not an arbitrary column that could legitimately be NULL.

-- Wrong if orders.notes can be NULL: matches users WITH orders that have null notes
WHERE o.notes IS NULL

-- Right: orders.id is NOT NULL when the join matched
WHERE o.id IS NULL

Performance: when does each win?#

On modern optimisers (Postgres 12+, SQL Server 2017+, Oracle, BigQuery, Snowflake), EXISTS, IN, and LEFT JOIN ... IS NULL for an affirmative or negative semi/anti-join are typically planned identically — a hash-semi-join or hash-anti-join with the same cost.

Edge cases where the planner picks differently:

Large outer, small inner: nested-loop-semi-join with the inner table indexed is cheapest. EXISTS makes this most obvious to the planner.
Small outer, large inner: hash-anti-join with the outer hashed. All three forms should plan this way, but LEFT JOIN ... IS NULL sometimes loses materialisation opportunities on older versions of MySQL.
Subquery returns very few distinct values: IN (subquery) can hash the inner set and probe the outer — sometimes faster than EXISTS if the inner is much smaller than the outer.

When in doubt, EXPLAIN the query. See Query execution and plans for how to read the output.

EXPLAIN ANALYZE
SELECT u.* FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

Variants#

Existence with extra conditions in the subquery.

-- Users with at least one paid order in the last 30 days
SELECT u.* FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.user_id = u.id
    AND o.status = 'paid'
    AND o.created_at >= NOW() - INTERVAL '30 days'
);

Cleaner than the equivalent join + DISTINCT.

Existence in multiple tables (OR EXISTS).

-- Users with an order OR a review
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders  o WHERE o.user_id  = u.id)
   OR EXISTS (SELECT 1 FROM reviews r WHERE r.user_id = u.id);

The two existence checks can be planned independently and combined.

Existence with ANY / SOME / ALL.

-- ANY/SOME are aliases — equivalent to IN
SELECT * FROM users WHERE id = ANY (SELECT user_id FROM orders);

-- ALL means: u.amount is greater than every value in the subquery
SELECT * FROM orders WHERE amount > ALL (SELECT amount FROM refunds);

ANY and IN are interchangeable but IN is more idiomatic. ALL and <> ALL have NULL-handling quirks similar to NOT IN — prefer NOT EXISTS for “greater than the max” style queries.

Existence with INTERSECT / EXCEPT.

-- Users who appear in BOTH tables (intersect)
SELECT id FROM users
INTERSECT
SELECT user_id FROM orders;

-- Users in the first table but NOT the second (except / anti-join)
SELECT id FROM users
EXCEPT
SELECT user_id FROM orders;

EXCEPT is NULL-safe the way NOT EXISTS is — it doesn’t have the NOT IN trap. MySQL added EXCEPT in 8.0.31; before that, use NOT EXISTS.

Common pitfalls#

The NOT IN + NULL trap (covered above). Default to NOT EXISTS for any “missing” query. The performance is equivalent; the correctness is not.

Forgetting the correlation in EXISTS. A subquery inside EXISTS without a reference to the outer table is TRUE whenever the subquery returns any row at all — usually a bug.

-- BUG: EXISTS is TRUE for every user as long as ANY order exists in the table
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders);

-- Correct: correlate on user_id
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

Picking the wrong column for IS NULL in the join form.

-- BUG: orders.amount can legitimately be NULL
SELECT u.* FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.amount IS NULL;
-- Matches users with NO orders AND users with orders that had a NULL amount

Use the primary key or the join key — they’re guaranteed non-null on a successful match.

INNER JOIN with DISTINCT instead of EXISTS. It works, but DISTINCT is more expensive than a semi-join — the engine has to materialise all joined rows and then dedupe.

Semi-join — efficient:

SELECT u.*
FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.user_id = u.id
);

Stops scanning orders as soon as one match is found per user. Hash-semi-join is O(n + m).

INNER JOIN + DISTINCT — wasteful:

SELECT DISTINCT u.*
FROM users u
INNER JOIN orders o
  ON o.user_id = u.id;

Materialises every user × matching-order row, then deduplicates by all columns of users. Becomes O(n × k) where k is the average orders per user, plus a sort/hash for DISTINCT.

Subquery rewrite quirks on older engines. MySQL 5.6 famously executed IN (subquery) as a correlated subquery — re-running the inner query per outer row, O(n × m). MySQL 5.7 fixed this with semi-join optimisation; 8.0 is faster still. If you’re stuck on an old engine, rewriting IN (subquery) to a JOIN can produce a 100x speedup. On modern engines, the planner handles it.

Mixing existence with aggregation. WHERE EXISTS (... HAVING COUNT(*) > 5 ...) is almost right but slow — the subquery materialises the count. Prefer WHERE id IN (SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 5).

Practice problem#

You have a users table with 100M rows and an orders table with 500M rows:

CREATE TABLE users (
  id    BIGSERIAL PRIMARY KEY,
  email TEXT NOT NULL
);
CREATE TABLE orders (
  id      BIGSERIAL PRIMARY KEY,
  user_id BIGINT REFERENCES users(id),     -- nullable: anonymous orders allowed
  amount  NUMERIC NOT NULL
);
CREATE INDEX ON orders (user_id);

Find every user who has never placed an order. Show three correct queries, then identify which one is safest for production given that orders.user_id is nullable.

Solution

-- 1. NOT EXISTS (NULL-safe, recommended)
SELECT u.* FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);

-- 2. LEFT JOIN ... IS NULL (also NULL-safe, sometimes faster on large outer sets)
SELECT u.* FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.id IS NULL;

-- 3. NOT IN (UNSAFE here — orders.user_id is nullable)
SELECT * FROM users
WHERE id NOT IN (SELECT user_id FROM orders);
-- Returns 0 rows if any order has user_id IS NULL.

Use form (1) NOT EXISTS for production. It is NULL-safe, the optimiser plans it as a hash-anti-join, and on a 100M × 500M dataset with the index on orders(user_id) the cost is the same as form (2). Form (3) is incorrect for this schema because orders.user_id is nullable; even if the data happens to have no NULLs today, a single future row with user_id IS NULL will silently break the report.

Joins (inner, outer, cross) — the LEFT JOIN form of the anti-join.
Subqueries and CTEs — the building block for EXISTS and IN subqueries.
Tally Count — when “at least one” should become “how many.”
Query execution and plans — how to use EXPLAIN to verify the optimiser picked the semi-join.
SELECT and the WHERE clause — the three-valued NULL logic that drives the NOT IN trap.

What it is#

When to use it#

How it works#

EXISTS — semi-join, short-circuits, NULL-safe#

NOT EXISTS — anti-join, also NULL-safe#

IN — concise for static lists, NULL-trap for subqueries#

LEFT JOIN … IS NULL — anti-join via joins#

Performance: when does each win?#

Variants#

Common pitfalls#

Practice problem#

Related patterns#