SELECT and the WHERE Clause
The shape of every read. Predicate logic, NULL handling, the difference between AND/OR and short-circuit eval.
What it is#
SELECT is the only SQL verb that reads data, and every read query is some elaboration of SELECT col-list FROM table WHERE predicate. The clauses are evaluated in a fixed logical order — FROM (and JOIN), WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT — which is not the order they’re written. The clause that decides which rows are eligible is WHERE; the clause that decides which columns come back is SELECT. The rest are filters, groupings, sorts, and limits layered on top.
The WHERE clause is where most of the interesting questions live: predicate composition, NULL handling, index-usability, the difference between IN and EXISTS, and the subtle ways a slightly-wrong predicate quietly returns the wrong answer. SQL’s predicate logic is three-valued (TRUE / FALSE / UNKNOWN) because of NULL, and that fact contaminates everything downstream.
When to use it#
Always. Every relational read is a SELECT. The interesting design questions are which form:
- Point lookup —
WHERE id = ?on an indexed column. Sub-millisecond. - Range query —
WHERE created_at BETWEEN ? AND ?on an indexed column. Uses a B-tree range scan. - Filtered scan —
WHERE status = 'active' AND region = 'us'. Composite index ideally, otherwise scan + filter. - Top-N —
ORDER BY created_at DESC LIMIT 20. Index-ordered scan stops early. - Aggregate —
SELECT COUNT(*) FROM t WHERE .... Often answerable from an index-only scan. - Existence check —
SELECT 1 FROM t WHERE pk = ?orEXISTS (...). Cheaper than the equivalent count when you only care about presence.
Resist SELECT * in application code — it reads more columns than needed (often blocking index-only scans), couples your code to the schema, and breaks loudly when the table grows a new column.
How it works#
The shape#
SELECT id, email, created_atFROM usersWHERE status = 'active' AND created_at >= now() - INTERVAL '30 days'ORDER BY created_at DESCLIMIT 50;FROM users chooses the input relation. WHERE filters to active recent users. SELECT id, email, created_at projects. ORDER BY sorts. LIMIT truncates.
Predicate logic with NULL#
SQL is three-valued. Every comparison is TRUE, FALSE, or UNKNOWN:
1 = 1 → TRUE1 = 2 → FALSE1 = NULL → UNKNOWN (not FALSE)NULL = NULL → UNKNOWN (not TRUE)NULL <> 5 → UNKNOWNWHERE keeps only rows where the predicate is TRUE. UNKNOWN rows are dropped — same as FALSE for filtering, but the distinction matters when you negate:
WHERE col = 5 -- TRUE only when col is 5WHERE col <> 5 -- TRUE only when col is non-NULL and not 5 -- rows with col IS NULL are dropped here tooAND and OR have three-valued truth tables:
TRUE AND UNKNOWN = UNKNOWNFALSE AND UNKNOWN = FALSETRUE OR UNKNOWN = TRUEFALSE OR UNKNOWN = UNKNOWNNOT UNKNOWN = UNKNOWNThe right way to test for NULL is IS NULL and IS NOT NULL — they always return TRUE or FALSE, never UNKNOWN.
Predicates that the index can use#
A predicate is sargable (“Search ARGument ABLE”) if the planner can use an index on the column. Sargable forms:
WHERE col = 5WHERE col > 5WHERE col BETWEEN 1 AND 10WHERE col LIKE 'foo%' -- prefix, sargableWHERE col IN (1, 2, 3)Non-sargable: a function applied to the column hides the index:
WHERE LOWER(email) = 'a@b.com' -- non-sargable on emailWHERE date(created_at) = '2026-01-01' -- non-sargable on created_atWHERE col + 1 = 10 -- non-sargable on colWHERE col LIKE '%foo' -- leading wildcard, non-sargableFix by indexing the expression (CREATE INDEX ON users (LOWER(email))) or rewriting the predicate (WHERE created_at >= '2026-01-01' AND created_at < '2026-01-02').
Operators worth knowing#
WHERE col IS NULL -- the only correct NULL testWHERE col IS DISTINCT FROM 5 -- treats NULL as a real valueWHERE col BETWEEN 10 AND 20 -- inclusive both endsWHERE col IN ('a', 'b', 'c') -- set membershipWHERE col LIKE 'foo_%' -- _ = one char, % = anyWHERE col ~ '^foo[0-9]+$' -- regex (Postgres)WHERE EXISTS (SELECT 1 FROM other WHERE other.fk = t.id)WHERE col = ANY (ARRAY[1, 2, 3]) -- Postgres array formPredicate evaluation order#
The SQL standard does not guarantee left-to-right short-circuit evaluation. WHERE a = 5 AND b/a = 2 may divide by zero in some plans even when a = 5 filters it out — the planner may evaluate b/a first if the cost model thinks that’s cheaper. Use CASE WHEN for explicit short-circuit:
WHERE a = 5 AND (CASE WHEN a = 0 THEN false ELSE b/a = 2 END)Most engines do short-circuit in practice but the spec doesn’t require it. Don’t rely on it for correctness.
DISTINCT and ORDER BY#
SELECT DISTINCT region FROM users;SELECT * FROM events ORDER BY created_at DESC, id DESC LIMIT 100;DISTINCT deduplicates the projection. ORDER BY ... LIMIT is the canonical pagination shape — always order by a tie-breaker (id) when the main key isn’t unique, otherwise pagination skips or duplicates rows at page boundaries.
Variants#
SELECT DISTINCT ON (col)(Postgres) — keep one row percolvalue, picked byORDER BY. Concise “latest per group” pattern.SELECT ... FOR UPDATE— pessimistic row lock; see DML writeup.SELECT INTO(Postgres pl/pgsql) — assign result into a variable.SELECT ... LIMIT ... OFFSET— paginate. Cheap for small offsets, painful for large ones because the engine still scans and discards. Keyset pagination is the fix.- Lateral joins —
LATERALlets a subquery inFROMreference columns from earlierFROMitems, replacing many correlated-subquery patterns. SELECT 1for existence checks —EXISTS (SELECT 1 FROM ...)is idiomatic; the engine doesn’t care what you select inside an EXISTS, only whether a row matches.
Trade-offs#
IN (subquery) — readable, often planner-rewritten to a semi-join. Loads the inner result into a hash; works well when the inner result is small. Behaves oddly with NULL inside the IN list — WHERE col IN (1, NULL) matches when col=1 but not when col is NULL, while NOT IN (1, NULL) matches nothing because of three-valued logic. EXISTS (correlated subquery) — explicit semi-join, planner-friendly, doesn’t suffer the NULL trap of NOT IN. Marginally more verbose. The defensive choice when the inner result may contain NULLs. Other lever choices:
- Sargable predicates over fancy ones —
WHERE created_at >= '2026-01-01'beatsWHERE EXTRACT(YEAR FROM created_at) = 2026every time. LIMITwithoutORDER BYis non-deterministic — the engine returns some rows; which ones depends on storage order. Always pair withORDER BYwhen the order matters.OFFSETpaginating large result sets — scales linearly with offset. Switch to keyset pagination (WHERE id < last_seen_id ORDER BY id DESC LIMIT 20) past a few thousand rows.- Column lists vs
SELECT *— explicit columns enable index-only scans, document intent, and survive schema changes more gracefully.
Common pitfalls#
WHERE col <> 5dropping NULL rows. If you want rows wherecolis not 5 including NULLs, writeWHERE col IS DISTINCT FROM 5(Postgres) orWHERE col <> 5 OR col IS NULL.NOT INwith a NULL inside the list returns nothing.WHERE col NOT IN (1, 2, NULL)is always UNKNOWN, never TRUE. UseNOT EXISTSor filter NULLs out of the subquery.- Function on the indexed column kills the index.
WHERE date(created_at) = ?doesn’t use an index oncreated_at. Rewrite as a range. SELECT *blocking index-only scans. A covering index can answer the query without touching the heap, but only if every selected column is in the index.SELECT *reads them all.- Forgetting the tie-breaker in pagination.
ORDER BY created_at DESC LIMIT 50 OFFSET 50will return inconsistent results if multiple rows share a timestamp. Add, id DESC. - Relying on short-circuit evaluation — the spec doesn’t promise it. Use
CASE WHENif correctness depends on order. COUNT(col)vsCOUNT(*)—COUNT(col)excludes NULLs,COUNT(*)counts rows. UseCOUNT(*)unless you specifically mean “how many non-NULL values”.SELECT DISTINCThiding bugs — slappingDISTINCTon a query that’s returning unexpected duplicates usually masks a join cardinality bug. Find the source duplication first.
The mental model that helps
Treat WHERE as the only place you control row eligibility, SELECT as projection-only (no filtering, just column choice), and ORDER BY / LIMIT as the only ordering and truncation. When something doesn’t work, ask: at which logical step does the wrong row appear? Wrong rows mean WHERE; wrong columns mean SELECT; wrong order means ORDER BY. The mental model lines up cleanly with the algebra: WHERE is σ, SELECT is π, JOIN is ⋈, ORDER BY / LIMIT are SQL extensions that sit on top.
Related building blocks#