Entity-Relationship Model — Entities, Attributes, Keys

Summary#

The entity-relationship (ER) model is the vocabulary used to sketch a schema before writing any DDL. It has three nouns — entities (the real-world things you store data about), attributes (the facts you record about each entity), and keys (the attributes that uniquely identify entities) — and a handful of relationships and constraints that connect them. Five minutes of ER thinking saves five months of migration pain.

The model predates SQL by about a decade (Chen, 1976) and survived every fashion since because it carves the design problem at a useful joint: figure out the things and their identities first, figure out their relationships next, and only then worry about tables, columns, indexes, and queries. Most schema-design interview answers walk this exact path.

Why it matters#

Schemas outlive code. The class hierarchy you ship today gets refactored next quarter; the users table you ship today has the same primary key three years from now, with five teams’ worth of foreign keys pointing at it. Getting the entities and keys right is the highest-leverage decision in the data layer, and ER is the only standard notation for thinking about that decision before you’ve committed to any specific DBMS.

In interviews, “design a schema for X” is a default warm-up — library, e-commerce checkout, ride-sharing app, ticketing system. The expected motion is ER first, tables second. Skipping straight to CREATE TABLE is a tell that you treat data modelling as syntax rather than design.

How it works#

The ER model has a small, precise vocabulary. Get the nouns straight and the rest follows.

Entity vs entity set vs entity type#

Entity — a single real-world thing the database tracks. Specific instance: “user with email ada@example.com”.
Entity set — the collection of all current entities of one type. The contents of the users table at any moment.
Entity type — the schema describing those entities. The definition of the users table itself.

This three-way distinction matters because schema migrations change the entity type, while everyday writes change the entity set. Confusing the two leads to muddled discussions about “the data model”.

Attributes#

Attributes are the named facts about an entity. The ER model recognises four flavours:

Simple (atomic) — one indivisible value. email, created_at.
Composite — decomposable. address is composed of street, city, zip. In the relational translation, you usually flatten these into separate columns.
Multi-valued — one entity has many values. A user has many phone_numbers. In the relational translation, this becomes a separate table.
Derived — computable from other attributes. age from date_of_birth. Usually not stored; computed on read.

Keys#

A key is an attribute (or set of attributes) that uniquely identifies an entity. The ER model defines several flavours; the relational model inherits them:

Superkey — any set of attributes that uniquely identifies an entity. {email, name} is a superkey if email alone is unique.
Candidate key — a minimal superkey. Removing any attribute breaks uniqueness. email is a candidate key; {email, name} is not (it’s not minimal).
Primary key — the chosen candidate key. Every entity type has exactly one. Used for identity, for foreign-key references, and usually for the clustering index.
Surrogate key — an artificial value (UUID, auto-increment integer) introduced as the primary key, independent of any business attribute.

Entity:        User
Attributes:    user_id (PK), email (candidate key), name, created_at
Constraint:    email UNIQUE, NOT NULL

Notation#

The classic Chen notation uses rectangles for entities, ovals for attributes, diamonds for relationships, and underlines for primary keys. The Crow’s-foot notation (used in most modern ERDs and ORM-generated diagrams) drops the diamonds and uses line-end markers for cardinality. Either works; consistency within one diagram matters more than the choice.

Variants and trade-offs#

The ER model has one big interpretive decision and several small ones.

Natural primary keys — use a real-world attribute (email, ISBN, license plate, SSN) as the primary key. Saves one column, makes the key human-meaningful, removes the need for a join when the natural value is what callers know.

Surrogate primary keys — introduce a synthetic column (UUID, bigserial) as the primary key. Natural attributes become unique constraints. Stable under business rule changes; consistent column type across the schema; safer for distributed ID generation.

The surrogate side usually wins in production. Natural keys break when the business decides email can change, when a country’s SSN format updates, when a vendor restructures their ISBN. Refactoring a primary key referenced by five other tables is one of the worst migrations to write. The cost of carrying a 16-byte UUID per row is small; the cost of a primary-key migration is enormous.

Other variants:

Single-column vs composite primary keys. Junction tables for many-to-many relationships often have composite keys (user_id, role_id). Strong entities almost always have single-column surrogate keys.
Integer vs UUID surrogate. Auto-increment integers are smaller and pack better in B-tree indexes; UUIDs are independent of a central allocator and safer for offline / multi-region writers. Modern compromise: time-ordered UUIDs (UUIDv7, ULID) get most of the locality benefit without the central-allocator cost.
Strong vs weak entities. Most entities are strong (have their own primary key). Weak entities exist only in relation to a parent and use a composite key combining the parent’s key with a discriminator. Covered in detail in Weak Entities.

When natural keys do win

ISO country codes (country_code = 'IN'), currency codes (USD), language tags (en-US), and similar dictionary-style tables. These are stable, short, human-readable, and the value is what every caller already has. Adding a surrogate would just force a country_code lookup on every join. The rule: natural keys are fine when the natural value is itself a stable, universally-agreed identifier maintained by someone other than your engineering team.

When this is asked in interviews#

Three places. First, as the warm-up to a schema design question — “design the data model for X”. Walk the ER drill aloud; the interviewer is checking that you have a method rather than improvising. Second, as a follow-up to a system design — “what does your orders table look like?” — where the right answer pulls out three or four entities, names the keys, and notes which columns are nullable. Third, as a pure DBMS question — “explain the difference between a candidate key and a primary key” — where the expected answer is one sentence each plus an example.

A common follow-up: “why did you pick user_id over email as the primary key?” The answer wants to mention stability (emails change), distribution (surrogate IDs are easier to shard), and the cost of cascading FK references when the key value moves.

The trap question is “are nulls allowed in a primary key?” — no, entity integrity says they aren’t. Composite keys can’t have any null component either.