Functional Dependencies

A → B as the unit of schema reasoning; closures; minimal covers; how FDs drive the normal forms.

Concept Intermediate
7 min read
normalization functional-dependencies closure theory

Summary#

A functional dependency (FD) is a statement that within one relation, the values of one set of attributes uniquely determine the values of another. Written X → Y, read “X functionally determines Y”: for every two tuples that agree on X, they must also agree on Y. student_id → name says “two rows with the same student_id always have the same name”. FDs are the formal vocabulary normalisation theory is built on — every normal form (2NF, 3NF, BCNF, 4NF) is defined in terms of which FDs are allowed in a relation.

The reason to know the theory, even casually, is that it gives you a precise way to argue “this schema has a redundancy bug”. Without FDs, the argument is hand-wavy (“looks denormalised”); with FDs, you can point to a specific dependency that violates a specific normal form.

Why it matters#

Schema bugs caused by missing or wrong dependencies are silent and durable. A schema that allows (course_code, course_name) to drift apart will eventually drift apart — different rows will record different names for the same course — and the bug surfaces months later as inconsistent reports. FD analysis catches this at design time by asking: what depends on what, and is the key strong enough to enforce it?

In interviews, FDs are the prerequisite for any serious normalisation discussion. The exam-style version is “given FDs {A → B, B → C}, is the relation in 3NF?”. The applied version is “you have a users table with (user_id, email, country_code, country_name). Find the redundancy.” Both want FD reasoning.

How it works#

What X → Y actually means#

X → Y is a constraint on the relation, not on the current instance. It says: for any valid state of this relation, no two tuples can agree on X and disagree on Y. The fact that today’s data happens to satisfy X → Y is not enough; you have to know it always will.

This matters because FDs come from the domain, not from the data:

  • ssn → name — true if we know SSN identifies a person and their name is recorded.
  • name → ssn — usually false (two people can share a name).
  • (order_id, line_number) → product_id — true if line numbers are unique within an order.

Trivial vs non-trivial FDs#

X → Y is trivial if Y ⊆ X — every set of attributes determines itself. (user_id, email) → email is trivial. Non-trivial FDs are the interesting ones; normalisation cares only about non-trivial dependencies.

Closure#

The closure of an attribute set X under a set of FDs F, written X+, is the set of all attributes that can be inferred to depend on X. Compute it by repeatedly applying FDs in F:

F = { A → B, B → C, CD → E }
X = {A}
X+ = {A} -- start with X itself
A → B applies => X+ = {A, B}
B → C applies => X+ = {A, B, C}
CD → E does NOT apply (D is not in X+ yet)
done. X+ = {A, B, C}

X is a superkey of the relation iff X+ equals all attributes. X is a candidate key if it’s a minimal superkey — removing any attribute makes X+ lose at least one attribute.

This is the algorithmic core of every normalisation check: compute closures, find candidate keys, then evaluate which FDs do or don’t satisfy the form you’re targeting.

Minimal cover#

A minimal cover of an FD set F is an equivalent set F' (same closures) with three properties:

  • Every FD in F' has a single attribute on the right-hand side.
  • No FD in F' is redundant (removing it changes some closure).
  • No left-hand side has an extraneous attribute.

Minimal covers are what synthesis-style 3NF decomposition operates on. A relation can have several minimal covers; any of them works.

How FDs drive normal forms#

Each normal form is a constraint on the allowed FDs:

  • 2NF — no non-prime attribute depends on only part of a candidate key. (Forbids partial dependency on composite keys.)
  • 3NF — for every non-trivial X → A, either X is a superkey or A is a prime attribute. (Forbids transitive dependency through a non-key.)
  • BCNF — for every non-trivial X → A, X is a superkey. (Strict version of 3NF.)
  • 4NF — generalises BCNF to multi-valued dependencies.

You cannot evaluate any of these without first writing down the FDs the relation is supposed to satisfy. That’s the whole reason FDs come first.

Variants and trade-offs#

Inferred from the domain — the schema author writes down the FDs they expect to hold based on business knowledge (“email uniquely identifies a user”, “ISBN determines title and author”). Robust to data drift; survives schema lifetime; the textbook approach.
Inferred from existing data — tools scan a sample of rows and propose candidate FDs that hold in the current data. Useful for archaeology on legacy schemas. Brittle: today’s FDs may not be tomorrow’s. Always validate domain-inferred FDs before trusting them.

Other angles:

  • Multi-valued dependencies (MVDs). X →→ Y means: given X, the set of Y values is independent of the rest. MVDs drive 4NF; they show up when a single relation tries to record two independent multi-valued facts about the same entity.
  • Inclusion dependencies. R[X] ⊆ S[Y] — every value of X in R appears as a value of Y in S. This is what foreign keys enforce. Inclusion dependencies are a separate family from FDs but interact with them in design.
  • Probabilistic / approximate FDs. “Holds in 99.5% of rows”. Used in data cleaning and quality work, not in schema design. Don’t normalise based on them.
  • Embedded FDs. An FD that holds on a projection of a relation but not on the relation itself. Mostly an academic concept; rare in practice.
Worked example — finding candidate keys via closures

Suppose R(A, B, C, D, E) with F = { A → BC, CD → E, B → D, E → A }. To find candidate keys, try minimal supersets and compute closures:

  • {A}+ = {A, B, C, D, E} ✓ (A → BC gives B and C; B → D gives D; CD → E gives E). So A is a superkey.
  • {E}+ = {E, A, B, C, D} ✓ (E → A, then the same chain). E is a superkey.
  • {B}+ = {B, D} ✗ — B alone is not a superkey.
  • {C}+ = {C} ✗ — not a superkey.
  • {BC}+ = {B, C, D, E, A} ✓ — superkey, and minimal (drop either B or C and it fails).

So the candidate keys are {A}, {E}, {BC}. Multiple candidate keys is common; the choice of primary key among them is a design call.

When this is asked in interviews#

Three shapes. As theory — “what’s a functional dependency?” — the answer is the formal definition plus an example plus how it relates to keys. As a worked problem — “given these FDs, find the candidate keys / decompose to 3NF” — the answer walks closures and applies the synthesis algorithm.

As an applied design question — “where’s the redundancy in this table?” — the candidate is expected to write down the FDs they see, find the one that violates BCNF (or 3NF), and propose a decomposition. Senior interviewers care less about the formal proof and more about whether the candidate can name the offending dependency.

A common trap is reading FDs off the data rather than the domain. “Looking at this sample, name → email” — no, that’s a coincidence in the current rows; the domain doesn’t promise it.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.