Functional Dependencies — DBMS · Engineering Playbook

Summary#

A functional dependency (FD) is a statement that within one relation, the values of one set of attributes uniquely determine the values of another. Written X → Y, read “X functionally determines Y”: for every two tuples that agree on X, they must also agree on Y. student_id → name says “two rows with the same student_id always have the same name”. FDs are the formal vocabulary normalisation theory is built on — every normal form (2NF, 3NF, BCNF, 4NF) is defined in terms of which FDs are allowed in a relation.

The reason to know the theory, even casually, is that it gives you a precise way to argue “this schema has a redundancy bug”. Without FDs, the argument is hand-wavy (“looks denormalised”); with FDs, you can point to a specific dependency that violates a specific normal form.

Why it matters#

Schema bugs caused by missing or wrong dependencies are silent and durable. A schema that allows (course_code, course_name) to drift apart will eventually drift apart — different rows will record different names for the same course — and the bug surfaces months later as inconsistent reports. FD analysis catches this at design time by asking: what depends on what, and is the key strong enough to enforce it?

In interviews, FDs are the prerequisite for any serious normalisation discussion. The exam-style version is “given FDs {A → B, B → C}, is the relation in 3NF?”. The applied version is “you have a users table with (user_id, email, country_code, country_name). Find the redundancy.” Both want FD reasoning.

How it works#

What `X → Y` actually means#

X → Y is a constraint on the relation, not on the current instance. It says: for any valid state of this relation, no two tuples can agree on X and disagree on Y. The fact that today’s data happens to satisfy X → Y is not enough; you have to know it always will.

This matters because FDs come from the domain, not from the data:

ssn → name — true if we know SSN identifies a person and their name is recorded.
name → ssn — usually false (two people can share a name).
(order_id, line_number) → product_id — true if line numbers are unique within an order.

Trivial vs non-trivial FDs#

X → Y is trivial if Y ⊆ X — every set of attributes determines itself. (user_id, email) → email is trivial. Non-trivial FDs are the interesting ones; normalisation cares only about non-trivial dependencies.

Closure#

The closure of an attribute set X under a set of FDs F, written X+, is the set of all attributes that can be inferred to depend on X. Compute it by repeatedly applying FDs in F:

F = { A → B, B → C, CD → E }

X = {A}
X+ = {A}                  -- start with X itself
A → B  applies            => X+ = {A, B}
B → C  applies            => X+ = {A, B, C}
CD → E does NOT apply     (D is not in X+ yet)
done. X+ = {A, B, C}

X is a superkey of the relation iff X+ equals all attributes. X is a candidate key if it’s a minimal superkey — removing any attribute makes X+ lose at least one attribute.

This is the algorithmic core of every normalisation check: compute closures, find candidate keys, then evaluate which FDs do or don’t satisfy the form you’re targeting.

Minimal cover#

A minimal cover of an FD set F is an equivalent set F' (same closures) with three properties:

Every FD in F' has a single attribute on the right-hand side.
No FD in F' is redundant (removing it changes some closure).
No left-hand side has an extraneous attribute.

Minimal covers are what synthesis-style 3NF decomposition operates on. A relation can have several minimal covers; any of them works.

How FDs drive normal forms#

Each normal form is a constraint on the allowed FDs:

2NF — no non-prime attribute depends on only part of a candidate key. (Forbids partial dependency on composite keys.)
3NF — for every non-trivial X → A, either X is a superkey or A is a prime attribute. (Forbids transitive dependency through a non-key.)
BCNF — for every non-trivial X → A, X is a superkey. (Strict version of 3NF.)
4NF — generalises BCNF to multi-valued dependencies.

You cannot evaluate any of these without first writing down the FDs the relation is supposed to satisfy. That’s the whole reason FDs come first.

Variants and trade-offs#

Inferred from the domain — the schema author writes down the FDs they expect to hold based on business knowledge (“email uniquely identifies a user”, “ISBN determines title and author”). Robust to data drift; survives schema lifetime; the textbook approach.

Inferred from existing data — tools scan a sample of rows and propose candidate FDs that hold in the current data. Useful for archaeology on legacy schemas. Brittle: today’s FDs may not be tomorrow’s. Always validate domain-inferred FDs before trusting them.

Other angles:

Multi-valued dependencies (MVDs). X →→ Y means: given X, the set of Y values is independent of the rest. MVDs drive 4NF; they show up when a single relation tries to record two independent multi-valued facts about the same entity.
Inclusion dependencies. R[X] ⊆ S[Y] — every value of X in R appears as a value of Y in S. This is what foreign keys enforce. Inclusion dependencies are a separate family from FDs but interact with them in design.
Probabilistic / approximate FDs. “Holds in 99.5% of rows”. Used in data cleaning and quality work, not in schema design. Don’t normalise based on them.
Embedded FDs. An FD that holds on a projection of a relation but not on the relation itself. Mostly an academic concept; rare in practice.

Worked example — finding candidate keys via closures

Suppose R(A, B, C, D, E) with F = { A → BC, CD → E, B → D, E → A }. To find candidate keys, try minimal supersets and compute closures:

{A}+ = {A, B, C, D, E} ✓ (A → BC gives B and C; B → D gives D; CD → E gives E). So A is a superkey.
{E}+ = {E, A, B, C, D} ✓ (E → A, then the same chain). E is a superkey.
{B}+ = {B, D} ✗ — B alone is not a superkey.
{C}+ = {C} ✗ — not a superkey.
{BC}+ = {B, C, D, E, A} ✓ — superkey, and minimal (drop either B or C and it fails).

So the candidate keys are {A}, {E}, {BC}. Multiple candidate keys is common; the choice of primary key among them is a design call.

When this is asked in interviews#

Three shapes. As theory — “what’s a functional dependency?” — the answer is the formal definition plus an example plus how it relates to keys. As a worked problem — “given these FDs, find the candidate keys / decompose to 3NF” — the answer walks closures and applies the synthesis algorithm.

As an applied design question — “where’s the redundancy in this table?” — the candidate is expected to write down the FDs they see, find the one that violates BCNF (or 3NF), and propose a decomposition. Senior interviewers care less about the formal proof and more about whether the candidate can name the offending dependency.

A common trap is reading FDs off the data rather than the domain. “Looking at this sample, name → email” — no, that’s a coincidence in the current rows; the domain doesn’t promise it.