Relational Algebra Basics — DBMS · Engineering Playbook

What it is#

Relational algebra is the small, closed set of operators Codd defined to manipulate relations: selection (σ, filter rows), projection (π, pick columns), join (⋈, combine relations on a predicate), union (∪), difference (−), Cartesian product (×), and rename (ρ). Every operator takes one or two relations and returns a relation — this closure is what lets the operators be composed without limit. A few derived operators (intersection, division, the various outer joins, grouping with aggregation) round out what real query engines need.

Practically, SQL compiles into relational algebra. The planner takes a parsed query, builds an algebra tree (logical plan), applies rewrite rules that preserve semantics, and then chooses a physical implementation for each operator (sequential scan, index scan, nested loop, hash join, merge join). Knowing the algebra is what lets you read an EXPLAIN plan and predict what the engine will do.

When to use it#

The algebra is a thinking tool, not something you write directly in production. Reach for it when:

Reading a query plan. Plans are algebra trees in disguise. Every Seq Scan is a σ on top of a relation; every Nested Loop is a ⋈; every HashAggregate is a γ (grouping with aggregation).
Reasoning about query equivalence. “Is SELECT a FROM t WHERE b = 1 the same as SELECT a FROM (SELECT * FROM t WHERE b = 1) s?” The algebra makes the answer immediate (yes, by projection-selection commutativity).
Debugging optimizer choices. When you push a predicate inside a subquery and the query speeds up by 100x, you’ve just done a predicate pushdown the planner missed.
Teaching or explaining. “Why do we always say JOIN ... ON rather than WHERE?” — because the algebraic distinction (join is binary, selection is unary on one relation) is real and lets the planner reorder them independently.

In day-to-day SQL you’ll never write π_{a,b}(σ_{c=1}(R ⋈ S)). But the planner does, and understanding what it’s doing under the hood is a force multiplier for performance work.

How it works#

Selection (σ)#

σ_{predicate}(R) returns the subset of R whose tuples satisfy the predicate.

σ_{age > 30}(employees) = { t ∈ employees | t.age > 30 }

SQL form:

SELECT * FROM employees WHERE age > 30;

Selection is unary, preserves schema, and is idempotent and commutative with itself: σ_p(σ_q(R)) = σ_q(σ_p(R)) = σ_{p ∧ q}(R).

Projection (π)#

π_{a,b}(R) returns R with only the listed attributes kept. In the pure model it deduplicates; in SQL it doesn’t unless you say DISTINCT.

π_{name, salary}(employees)

SELECT name, salary FROM employees;

Selection and projection commute when the predicate only references projected columns: π_A(σ_p(R)) = σ_p(π_A(R)) provided p references only attributes in A. This is one of the most important rewrite rules — the planner uses it to push projections as low in the tree as possible to reduce data volume early.

Cartesian product (×) and join (⋈)#

R × S is the unconstrained pair-up: every tuple of R glued to every tuple of S. The result has |R| * |S| tuples — almost never what you want directly.

A theta join is selection on top of product: R ⋈_p S = σ_p(R × S). The common case, equi-join, has p of the form R.a = S.b:

SELECT * FROM orders o JOIN users u ON o.user_id = u.id;

A natural join (rare in SQL but common in the textbook algebra) equi-joins on every attribute that shares a name in both relations.

Outer joins (left, right, full) preserve unmatched tuples by padding with NULLs. They aren’t pure-algebra primitives — they’re a SQL extension that requires NULL semantics to be defined.

Set operations: ∪, −, ∩#

These require union-compatible relations: same arity, matching attribute types.

R ∪ S    -- tuples in R or S (set; deduplicated)
R − S    -- tuples in R but not in S
R ∩ S    -- tuples in both

SELECT id FROM active_users
UNION
SELECT id FROM premium_users;

UNION deduplicates (set); UNION ALL keeps duplicates (bag). EXCEPT is difference; INTERSECT is intersection.

Rename (ρ)#

ρ_{new_name}(R) renames a relation or its attributes — needed when joining a relation to itself, or when two operands have colliding attribute names.

SELECT e.name, m.name AS manager
FROM employees e JOIN employees m ON e.manager_id = m.id;

The e and m aliases are the SQL syntax for ρ.

Grouping and aggregation (γ)#

Not in Codd’s original six, but every shipping engine needs it. Group tuples by some attributes, then compute aggregates per group:

γ_{region; COUNT(*) AS n}(users)

SELECT region, COUNT(*) AS n FROM users GROUP BY region;

The result is a relation with attributes (region, n) — closure preserved.

Variants#

Bag algebra — relaxes the set semantics to allow duplicates. Closer to how SQL actually behaves; the operators are defined to handle multiplicity.
Tuple relational calculus / domain relational calculus — declarative alternatives (“the relation of all t such that P(t)” rather than a chain of operators). SQL is closer to TRC; Datalog and QUEL are closer to DRC.
Extended algebra — adds aggregation (γ), outer joins (⟕, ⟖, ⟗), grouping, window functions. What modern engines actually implement.
Streaming / temporal algebra — extensions for unbounded inputs and event-time semantics, used by systems like Flink and ksqlDB.

Trade-offs#

Operator-level reasoning (algebra) — clean, finite, every operator has a precise meaning. Lets you prove equivalences, design optimizer rewrite rules, and predict query behaviour. Doesn’t cover NULL, ordering, or aggregation natively without extensions.

SQL-level reasoning — the language you actually write. Captures the messy real-world parts (NULL, three-valued logic, ordering, window functions) but is verbose for proofs and equivalence reasoning. Use SQL to ship; drop down to the algebra when the planner does something unexpected.

Practical implications worth knowing:

Predicate pushdown — σ_p(R ⋈ S) = (σ_p(R)) ⋈ S when p references only R. The planner applies this aggressively; you can help by writing predicates where they belong.
Join reordering — joins are associative and commutative (modulo NULL), so the planner can pick any order. Join order is one of the deepest optimisation problems; the planner uses cost estimates to decide.
Projection pushdown — keep only the attributes you need as early as possible to reduce intermediate row width.
Selection-join commutativity is fragile under outer joins — σ_p of a LEFT JOIN does not generally equal LEFT JOIN of σ_p because of NULL padding. The planner has to be careful here.

Common pitfalls#

Confusing JOIN ... ON with WHERE for outer joins. LEFT JOIN t ON t.col = ? and LEFT JOIN t ... WHERE t.col = ? are different — the second filters out the NULL-padded rows, turning the outer join effectively into an inner join. Algebraically, ON is part of the ⟕ operator; WHERE is a σ on top.
Forgetting projection deduplicates in the pure algebra. SQL doesn’t. SELECT region FROM users may return duplicates; π_{region}(users) in the pure algebra would not. Use DISTINCT when you want set semantics.
NULL breaks intuition. σ_{a = b}(R) doesn’t return rows where both a and b are NULL, because NULL = NULL is UNKNOWN, not TRUE. Most algebra tutorials gloss over this; SQL forces you to handle it.
Over-projecting too late. Writing SELECT * from a subquery and then projecting at the outer query hides which columns are actually needed. The planner usually figures it out, but explicit projections help reviewers and make column-pruning visible.
Cartesian products in disguise. A missing JOIN condition reduces to R × S — the result is the multiplied row count, often with no error. Always specify every join condition explicitly, and prefer JOIN ... ON over comma-joins in FROM.

Why the planner can rewrite your query so aggressively

Every rewrite rule the planner applies — predicate pushdown, projection pushdown, join reordering, subquery flattening, CTE inlining — is a theorem about the algebra. The optimizer’s job is to enumerate plans that are provably equivalent to the original query and pick the cheapest. Provability rests on closure: every step takes a relation and returns a relation, with semantics defined operator by operator. This is why SQL is declarative: you describe what you want in terms of operators with stable meaning, and the planner does whatever clever thing produces the same result. Take the algebra away and the planner is reduced to executing your query literally.