Relational Algebra Basics
Selection, projection, join, union, difference. The algebra SQL is compiled into.
What it is#
Relational algebra is the small, closed set of operators Codd defined to manipulate relations: selection (σ, filter rows), projection (π, pick columns), join (⋈, combine relations on a predicate), union (∪), difference (−), Cartesian product (×), and rename (ρ). Every operator takes one or two relations and returns a relation — this closure is what lets the operators be composed without limit. A few derived operators (intersection, division, the various outer joins, grouping with aggregation) round out what real query engines need.
Practically, SQL compiles into relational algebra. The planner takes a parsed query, builds an algebra tree (logical plan), applies rewrite rules that preserve semantics, and then chooses a physical implementation for each operator (sequential scan, index scan, nested loop, hash join, merge join). Knowing the algebra is what lets you read an EXPLAIN plan and predict what the engine will do.
When to use it#
The algebra is a thinking tool, not something you write directly in production. Reach for it when:
- Reading a query plan. Plans are algebra trees in disguise. Every
Seq Scanis a σ on top of a relation; everyNested Loopis a ⋈; everyHashAggregateis a γ (grouping with aggregation). - Reasoning about query equivalence. “Is
SELECT a FROM t WHERE b = 1the same asSELECT a FROM (SELECT * FROM t WHERE b = 1) s?” The algebra makes the answer immediate (yes, by projection-selection commutativity). - Debugging optimizer choices. When you push a predicate inside a subquery and the query speeds up by 100x, you’ve just done a predicate pushdown the planner missed.
- Teaching or explaining. “Why do we always say
JOIN ... ONrather thanWHERE?” — because the algebraic distinction (join is binary, selection is unary on one relation) is real and lets the planner reorder them independently.
In day-to-day SQL you’ll never write π_{a,b}(σ_{c=1}(R ⋈ S)). But the planner does, and understanding what it’s doing under the hood is a force multiplier for performance work.
How it works#
Selection (σ)#
σ_{predicate}(R) returns the subset of R whose tuples satisfy the predicate.
σ_{age > 30}(employees) = { t ∈ employees | t.age > 30 }SQL form:
SELECT * FROM employees WHERE age > 30;Selection is unary, preserves schema, and is idempotent and commutative with itself: σ_p(σ_q(R)) = σ_q(σ_p(R)) = σ_{p ∧ q}(R).
Projection (π)#
π_{a,b}(R) returns R with only the listed attributes kept. In the pure model it deduplicates; in SQL it doesn’t unless you say DISTINCT.
π_{name, salary}(employees)SELECT name, salary FROM employees;Selection and projection commute when the predicate only references projected columns: π_A(σ_p(R)) = σ_p(π_A(R)) provided p references only attributes in A. This is one of the most important rewrite rules — the planner uses it to push projections as low in the tree as possible to reduce data volume early.
Cartesian product (×) and join (⋈)#
R × S is the unconstrained pair-up: every tuple of R glued to every tuple of S. The result has |R| * |S| tuples — almost never what you want directly.
A theta join is selection on top of product: R ⋈_p S = σ_p(R × S). The common case, equi-join, has p of the form R.a = S.b:
SELECT * FROM orders o JOIN users u ON o.user_id = u.id;A natural join (rare in SQL but common in the textbook algebra) equi-joins on every attribute that shares a name in both relations.
Outer joins (left, right, full) preserve unmatched tuples by padding with NULLs. They aren’t pure-algebra primitives — they’re a SQL extension that requires NULL semantics to be defined.
Set operations: ∪, −, ∩#
These require union-compatible relations: same arity, matching attribute types.
R ∪ S -- tuples in R or S (set; deduplicated)R − S -- tuples in R but not in SR ∩ S -- tuples in bothSELECT id FROM active_usersUNIONSELECT id FROM premium_users;UNION deduplicates (set); UNION ALL keeps duplicates (bag). EXCEPT is difference; INTERSECT is intersection.
Rename (ρ)#
ρ_{new_name}(R) renames a relation or its attributes — needed when joining a relation to itself, or when two operands have colliding attribute names.
SELECT e.name, m.name AS managerFROM employees e JOIN employees m ON e.manager_id = m.id;The e and m aliases are the SQL syntax for ρ.
Grouping and aggregation (γ)#
Not in Codd’s original six, but every shipping engine needs it. Group tuples by some attributes, then compute aggregates per group:
γ_{region; COUNT(*) AS n}(users)SELECT region, COUNT(*) AS n FROM users GROUP BY region;The result is a relation with attributes (region, n) — closure preserved.
Variants#
- Bag algebra — relaxes the set semantics to allow duplicates. Closer to how SQL actually behaves; the operators are defined to handle multiplicity.
- Tuple relational calculus / domain relational calculus — declarative alternatives (“the relation of all t such that P(t)” rather than a chain of operators). SQL is closer to TRC; Datalog and QUEL are closer to DRC.
- Extended algebra — adds aggregation (γ), outer joins (⟕, ⟖, ⟗), grouping, window functions. What modern engines actually implement.
- Streaming / temporal algebra — extensions for unbounded inputs and event-time semantics, used by systems like Flink and ksqlDB.
Trade-offs#
Practical implications worth knowing:
- Predicate pushdown —
σ_p(R ⋈ S) = (σ_p(R)) ⋈ Swhenpreferences onlyR. The planner applies this aggressively; you can help by writing predicates where they belong. - Join reordering — joins are associative and commutative (modulo NULL), so the planner can pick any order. Join order is one of the deepest optimisation problems; the planner uses cost estimates to decide.
- Projection pushdown — keep only the attributes you need as early as possible to reduce intermediate row width.
- Selection-join commutativity is fragile under outer joins —
σ_pof aLEFT JOINdoes not generally equalLEFT JOINofσ_pbecause of NULL padding. The planner has to be careful here.
Common pitfalls#
- Confusing
JOIN ... ONwithWHEREfor outer joins.LEFT JOIN t ON t.col = ?andLEFT JOIN t ... WHERE t.col = ?are different — the second filters out the NULL-padded rows, turning the outer join effectively into an inner join. Algebraically,ONis part of the ⟕ operator;WHEREis a σ on top. - Forgetting projection deduplicates in the pure algebra. SQL doesn’t.
SELECT region FROM usersmay return duplicates;π_{region}(users)in the pure algebra would not. UseDISTINCTwhen you want set semantics. - NULL breaks intuition.
σ_{a = b}(R)doesn’t return rows where bothaandbare NULL, becauseNULL = NULLis UNKNOWN, not TRUE. Most algebra tutorials gloss over this; SQL forces you to handle it. - Over-projecting too late. Writing
SELECT *from a subquery and then projecting at the outer query hides which columns are actually needed. The planner usually figures it out, but explicit projections help reviewers and make column-pruning visible. - Cartesian products in disguise. A missing JOIN condition reduces to
R × S— the result is the multiplied row count, often with no error. Always specify every join condition explicitly, and preferJOIN ... ONover comma-joins inFROM.
Why the planner can rewrite your query so aggressively
Every rewrite rule the planner applies — predicate pushdown, projection pushdown, join reordering, subquery flattening, CTE inlining — is a theorem about the algebra. The optimizer’s job is to enumerate plans that are provably equivalent to the original query and pick the cheapest. Provability rests on closure: every step takes a relation and returns a relation, with semantics defined operator by operator. This is why SQL is declarative: you describe what you want in terms of operators with stable meaning, and the planner does whatever clever thing produces the same result. Take the algebra away and the planner is reduced to executing your query literally.
Related building blocks#