SELECT and the WHERE Clause — DBMS

What it is#

SELECT is the only SQL verb that reads data, and every read query is some elaboration of SELECT col-list FROM table WHERE predicate. The clauses are evaluated in a fixed logical order — FROM (and JOIN), WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT — which is not the order they’re written. The clause that decides which rows are eligible is WHERE; the clause that decides which columns come back is SELECT. The rest are filters, groupings, sorts, and limits layered on top.

The WHERE clause is where most of the interesting questions live: predicate composition, NULL handling, index-usability, the difference between IN and EXISTS, and the subtle ways a slightly-wrong predicate quietly returns the wrong answer. SQL’s predicate logic is three-valued (TRUE / FALSE / UNKNOWN) because of NULL, and that fact contaminates everything downstream.

When to use it#

Always. Every relational read is a SELECT. The interesting design questions are which form:

Point lookup — WHERE id = ? on an indexed column. Sub-millisecond.
Range query — WHERE created_at BETWEEN ? AND ? on an indexed column. Uses a B-tree range scan.
Filtered scan — WHERE status = 'active' AND region = 'us'. Composite index ideally, otherwise scan + filter.
Top-N — ORDER BY created_at DESC LIMIT 20. Index-ordered scan stops early.
Aggregate — SELECT COUNT(*) FROM t WHERE .... Often answerable from an index-only scan.
Existence check — SELECT 1 FROM t WHERE pk = ? or EXISTS (...). Cheaper than the equivalent count when you only care about presence.

Resist SELECT * in application code — it reads more columns than needed (often blocking index-only scans), couples your code to the schema, and breaks loudly when the table grows a new column.

How it works#

The shape#

SELECT id, email, created_at
FROM users
WHERE status = 'active'
  AND created_at >= now() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 50;

FROM users chooses the input relation. WHERE filters to active recent users. SELECT id, email, created_at projects. ORDER BY sorts. LIMIT truncates.

Predicate logic with NULL#

SQL is three-valued. Every comparison is TRUE, FALSE, or UNKNOWN:

1 = 1       → TRUE
1 = 2       → FALSE
1 = NULL    → UNKNOWN     (not FALSE)
NULL = NULL → UNKNOWN     (not TRUE)
NULL <> 5   → UNKNOWN

WHERE keeps only rows where the predicate is TRUE. UNKNOWN rows are dropped — same as FALSE for filtering, but the distinction matters when you negate:

WHERE col = 5      -- TRUE only when col is 5
WHERE col <> 5     -- TRUE only when col is non-NULL and not 5
                   -- rows with col IS NULL are dropped here too

AND and OR have three-valued truth tables:

TRUE  AND UNKNOWN = UNKNOWN
FALSE AND UNKNOWN = FALSE
TRUE  OR  UNKNOWN = TRUE
FALSE OR  UNKNOWN = UNKNOWN
NOT UNKNOWN       = UNKNOWN

The right way to test for NULL is IS NULL and IS NOT NULL — they always return TRUE or FALSE, never UNKNOWN.

Predicates that the index can use#

A predicate is sargable (“Search ARGument ABLE”) if the planner can use an index on the column. Sargable forms:

WHERE col = 5
WHERE col > 5
WHERE col BETWEEN 1 AND 10
WHERE col LIKE 'foo%'        -- prefix, sargable
WHERE col IN (1, 2, 3)

Non-sargable: a function applied to the column hides the index:

WHERE LOWER(email) = 'a@b.com'      -- non-sargable on email
WHERE date(created_at) = '2026-01-01'   -- non-sargable on created_at
WHERE col + 1 = 10                   -- non-sargable on col
WHERE col LIKE '%foo'                -- leading wildcard, non-sargable

Fix by indexing the expression (CREATE INDEX ON users (LOWER(email))) or rewriting the predicate (WHERE created_at >= '2026-01-01' AND created_at < '2026-01-02').

Operators worth knowing#

WHERE col IS NULL                          -- the only correct NULL test
WHERE col IS DISTINCT FROM 5               -- treats NULL as a real value
WHERE col BETWEEN 10 AND 20                -- inclusive both ends
WHERE col IN ('a', 'b', 'c')               -- set membership
WHERE col LIKE 'foo_%'                     -- _ = one char, % = any
WHERE col ~ '^foo[0-9]+$'                  -- regex (Postgres)
WHERE EXISTS (SELECT 1 FROM other WHERE other.fk = t.id)
WHERE col = ANY (ARRAY[1, 2, 3])           -- Postgres array form

Predicate evaluation order#

The SQL standard does not guarantee left-to-right short-circuit evaluation. WHERE a = 5 AND b/a = 2 may divide by zero in some plans even when a = 5 filters it out — the planner may evaluate b/a first if the cost model thinks that’s cheaper. Use CASE WHEN for explicit short-circuit:

WHERE a = 5 AND (CASE WHEN a = 0 THEN false ELSE b/a = 2 END)

Most engines do short-circuit in practice but the spec doesn’t require it. Don’t rely on it for correctness.

DISTINCT and ORDER BY#

SELECT DISTINCT region FROM users;
SELECT * FROM events ORDER BY created_at DESC, id DESC LIMIT 100;

DISTINCT deduplicates the projection. ORDER BY ... LIMIT is the canonical pagination shape — always order by a tie-breaker (id) when the main key isn’t unique, otherwise pagination skips or duplicates rows at page boundaries.

Variants#

SELECT DISTINCT ON (col) (Postgres) — keep one row per col value, picked by ORDER BY. Concise “latest per group” pattern.
SELECT ... FOR UPDATE — pessimistic row lock; see DML writeup.
SELECT INTO (Postgres pl/pgsql) — assign result into a variable.
SELECT ... LIMIT ... OFFSET — paginate. Cheap for small offsets, painful for large ones because the engine still scans and discards. Keyset pagination is the fix.
Lateral joins — LATERAL lets a subquery in FROM reference columns from earlier FROM items, replacing many correlated-subquery patterns.
SELECT 1 for existence checks — EXISTS (SELECT 1 FROM ...) is idiomatic; the engine doesn’t care what you select inside an EXISTS, only whether a row matches.

Trade-offs#

IN (subquery) — readable, often planner-rewritten to a semi-join. Loads the inner result into a hash; works well when the inner result is small. Behaves oddly with NULL inside the IN list — WHERE col IN (1, NULL) matches when col=1 but not when col is NULL, while NOT IN (1, NULL) matches nothing because of three-valued logic.

EXISTS (correlated subquery) — explicit semi-join, planner-friendly, doesn’t suffer the NULL trap of NOT IN. Marginally more verbose. The defensive choice when the inner result may contain NULLs.

Other lever choices:

Sargable predicates over fancy ones — WHERE created_at >= '2026-01-01' beats WHERE EXTRACT(YEAR FROM created_at) = 2026 every time.
LIMIT without ORDER BY is non-deterministic — the engine returns some rows; which ones depends on storage order. Always pair with ORDER BY when the order matters.
OFFSET paginating large result sets — scales linearly with offset. Switch to keyset pagination (WHERE id < last_seen_id ORDER BY id DESC LIMIT 20) past a few thousand rows.
Column lists vs SELECT * — explicit columns enable index-only scans, document intent, and survive schema changes more gracefully.

Common pitfalls#

WHERE col <> 5 dropping NULL rows. If you want rows where col is not 5 including NULLs, write WHERE col IS DISTINCT FROM 5 (Postgres) or WHERE col <> 5 OR col IS NULL.
NOT IN with a NULL inside the list returns nothing. WHERE col NOT IN (1, 2, NULL) is always UNKNOWN, never TRUE. Use NOT EXISTS or filter NULLs out of the subquery.
Function on the indexed column kills the index. WHERE date(created_at) = ? doesn’t use an index on created_at. Rewrite as a range.
SELECT * blocking index-only scans. A covering index can answer the query without touching the heap, but only if every selected column is in the index. SELECT * reads them all.
Forgetting the tie-breaker in pagination. ORDER BY created_at DESC LIMIT 50 OFFSET 50 will return inconsistent results if multiple rows share a timestamp. Add , id DESC.
Relying on short-circuit evaluation — the spec doesn’t promise it. Use CASE WHEN if correctness depends on order.
COUNT(col) vs COUNT(*) — COUNT(col) excludes NULLs, COUNT(*) counts rows. Use COUNT(*) unless you specifically mean “how many non-NULL values”.
SELECT DISTINCT hiding bugs — slapping DISTINCT on a query that’s returning unexpected duplicates usually masks a join cardinality bug. Find the source duplication first.

The mental model that helps

Treat WHERE as the only place you control row eligibility, SELECT as projection-only (no filtering, just column choice), and ORDER BY / LIMIT as the only ordering and truncation. When something doesn’t work, ask: at which logical step does the wrong row appear? Wrong rows mean WHERE; wrong columns mean SELECT; wrong order means ORDER BY. The mental model lines up cleanly with the algebra: WHERE is σ, SELECT is π, JOIN is ⋈, ORDER BY / LIMIT are SQL extensions that sit on top.