Normalization — 1NF, 2NF, 3NF — DBMS

Summary#

The first three normal forms each rule out a specific kind of redundancy. 1NF says columns hold atomic values — no repeating groups, no comma-separated lists. 2NF says no non-key attribute depends on only part of a composite candidate key — no partial dependency. 3NF says no non-key attribute depends on another non-key attribute — no transitive dependency. Schemas meeting all three avoid the three classical update anomalies (insertion, update, deletion).

In practice, most relational schemas are in 3NF by construction if you follow the ER-to-relational rules and use surrogate primary keys. The forms still matter as a vocabulary for diagnosing schemas that aren’t — the spreadsheet-derived legacy table, the denormalised reporting table, the “we’ll just add it as a column” creep.

Why it matters#

Every normalisation step prevents a specific class of bug. A schema that violates 1NF can’t enforce per-value constraints (you can’t CHECK a comma-separated string). A schema that violates 2NF stores the same fact under multiple key prefixes; an update to one prefix leaves the other inconsistent. A schema that violates 3NF stores derived facts; the source-of-truth column moves but the cached one doesn’t.

In interviews, normalisation is one of the few places where the formal vocabulary genuinely pays off. “This table isn’t in 3NF because zip → city is a transitive dependency” is precise; “this looks denormalised” is not. Senior interviewers expect the candidate to identify the violating FD by name.

How it works#

Worked example — the bad table#

Consider an unnormalised enrolment record:

Enrollment
  student_id     student_name    course_code    course_name      instructor_id   instructor_name   grade
  ─────────────────────────────────────────────────────────────────────────────────────────────────────
  101            Ada Lovelace    CS101          Intro to CS      I-7             Grace Hopper      A
  101            Ada Lovelace    CS201          Data Structures  I-9             Edsger Dijkstra   B+
  102            Alan Turing     CS101          Intro to CS      I-7             Grace Hopper      A-

Candidate key: (student_id, course_code).

Multiple things are wrong here. Walk it through the three forms.

1NF — atomicity#

A relation is in 1NF if every attribute holds a single, atomic value — no arrays, no nested rows, no comma-separated lists, no JSON-as-a-stringly-typed-list. The textbook violation:

Enrollment_BAD
  student_id   courses_taken
  101          'CS101, CS201, CS301'

courses_taken is multi-valued. The fix is a separate row per course, which is exactly what the original table above already does. 1NF is mostly a “did you actually use rows” check.

Note. 1NF is famously contentious. Postgres JSONB, array columns, and composite types all technically violate the strict textbook 1NF. In practice “1NF” in a working-engineer conversation usually means “atomic enough that you can query it with normal predicates and indexes” — JSON with a clear schema and indexable expressions is fine; JSON as a dumping ground is not.

2NF — no partial dependency on a composite key#

A relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on every candidate key — never on a proper subset.

In the Enrollment table, candidate key is (student_id, course_code). But:

student_id → student_name (a partial dependency — only on student_id).
course_code → course_name, instructor_id (only on course_code).

grade is the only attribute that genuinely depends on the full key (student_id, course_code).

Decomposition to 2NF:

Student     (student_id PK, student_name)
Course      (course_code PK, course_name, instructor_id, instructor_name)
Enrollment  (student_id PK FK, course_code PK FK, grade)

2NF is only meaningful when the candidate key is composite. With a single-column surrogate primary key, every non-key attribute trivially depends on the whole key, so 2NF holds by construction.

3NF — no transitive dependency through a non-key#

A relation is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on a candidate key through another non-prime attribute. Formally: for every non-trivial X → A, either X is a superkey or A is part of some candidate key.

In the Course table from above:

course_code → instructor_id (good — direct).
instructor_id → instructor_name (a transitive dependency — through a non-key attribute).

So course_code → instructor_name exists, but routes through instructor_id. The redundancy: every course row taught by Grace Hopper repeats her name.

Decomposition to 3NF:

Instructor  (instructor_id PK, instructor_name)
Course      (course_code PK, course_name, instructor_id FK)

Now instructor_name lives in exactly one row, and Course references it through the FK. Updating Grace’s name touches one row.

The fully-normalised result#

CREATE TABLE student (
  student_id   bigserial PRIMARY KEY,
  student_name text NOT NULL
);

CREATE TABLE instructor (
  instructor_id   bigserial PRIMARY KEY,
  instructor_name text NOT NULL
);

CREATE TABLE course (
  course_code   text PRIMARY KEY,
  course_name   text NOT NULL,
  instructor_id bigint NOT NULL REFERENCES instructor (instructor_id)
);

CREATE TABLE enrollment (
  student_id  bigint NOT NULL REFERENCES student (student_id),
  course_code text   NOT NULL REFERENCES course  (course_code),
  grade       text,
  PRIMARY KEY (student_id, course_code)
);

Every fact is stored exactly once.

Variants and trade-offs#

Fully 3NF schema — every fact stored once, no update / insertion / deletion anomalies. Joins required for almost every read; modest query complexity; trivial integrity. The right default for OLTP.

Denormalised schema — facts repeated across rows for query speed (one-table reads, columnar-friendly layout, materialised join). Faster reads; integrity now an application concern; update anomalies real. Right for read-heavy analytics, reporting tables, well-defined materialised views.

Where the textbook 3NF and the engineering 3NF diverge:

3NF doesn’t object to surrogate keys. Adding id bigserial PRIMARY KEY doesn’t break anything — every relation in 3NF stays in 3NF when you add a surrogate.
Pre-computed columns are tolerated. total = quantity * unit_price stored alongside the inputs is a derived attribute; strictly violates 3NF if you treat total as dependent on (quantity, unit_price). Engineers ship it anyway when the column is computed by a GENERATED ALWAYS AS expression or maintained by a trigger.
Reporting tables intentionally denormalise. A wide reporting table that pulls in name, city, country, etc., is in 1NF only — and that’s fine because no human is updating it directly; it’s rebuilt from the normalised source.
3NF, BCNF, 4NF, 5NF, 6NF. The progression continues — see BCNF and beyond. For most engineering work, 3NF is the practical ceiling, with BCNF reached for as a check.

Why 2NF basically disappears with surrogate keys

2NF only fires when a candidate key is composite. The moment you make id bigserial PRIMARY KEY the primary key of every table — which is the default in ER-to-relational with surrogate IDs — every non-key attribute trivially depends on the whole key (because the key is one column, and there’s no proper subset to depend on). 2NF holds by construction. This is one reason 3NF is the form people actually talk about; 2NF is mostly a relic of natural-composite-key schemas.

When this is asked in interviews#

Three shapes. As a vocabulary check — “what’s the difference between 2NF and 3NF?” — the answer is partial dependency vs transitive dependency, with one example each. As a worked decomposition — “decompose this table into 3NF” — the candidate writes down FDs, finds the candidate key, identifies the offending dependencies, and produces the three or four resulting tables.

As a design discussion — “this table seems wrong; what would you change?” — senior interviewers want the candidate to identify the FD that’s violated and propose the split, rather than just saying “looks denormalised”.

A common trap is over-normalising — splitting a table that’s already in 3NF further “because it can be smaller”. The right counter is “is there a redundancy or anomaly that this decomposition removes?”. If no, leave it.