MVCC — Multi-Version Concurrency Control — DBMS

What it is#

Multi-Version Concurrency Control (MVCC) is the concurrency mechanism where every write creates a new version of the affected row instead of overwriting in place, and every read sees the version visible to its transaction’s snapshot. Readers see a consistent point-in-time view of the data without locking; writers don’t block readers; readers don’t block writers. Conflicts between concurrent writers on the same row are detected at write time (first-committer-wins or first-updater-wins) rather than prevented by exclusive read locks.

MVCC is the default in PostgreSQL, MySQL InnoDB, Oracle, SQL Server (when snapshot isolation is enabled), CockroachDB, TiDB, YugabyteDB, FoundationDB, and Spanner. Pure lock-based concurrency (2PL — two-phase locking) survives in some embedded systems and DB2, but every major distributed and most major OLTP engines moved to MVCC because reads not blocking writes is the single biggest practical win for mixed workloads.

When to use it#

MVCC is the right default whenever:

The workload mixes long-running reads with short writes. Analytics queries, reporting jobs, or ad-hoc dashboards on a live OLTP database all degrade catastrophically under 2PL because read locks pile up; MVCC handles them gracefully.
Snapshot isolation is the contract the application can build against. Most applications can be expressed correctly against snapshot isolation; the genuine write-skew cases that require serializable are rare and explicit.
The engine is distributed — every modern distributed transactional store (Spanner, Cockroach, FoundationDB, TiKV) uses MVCC because version IDs (timestamps) are the natural unit of coordination across nodes, while distributed lock managers don’t scale.
Time travel queries are useful — Postgres doesn’t expose them directly but Snowflake, BigQuery, Cockroach, and Spanner all let users read at a past timestamp, which is essentially free in an MVCC engine.

Avoid MVCC (or accept that some workloads suffer under it) when:

The workload is update-heavy on a hot row — incrementing a single counter at 10K/sec under MVCC generates 10K versions per second; vacuum or compaction must keep pace.
Tight latency on a small in-memory dataset matters more than read-write concurrency — some embedded engines and ultra-low-latency systems still pick fine-grained locks.

How it works#

Versions and visibility#

Every row physically stored has two transaction IDs attached:

xmin — the ID of the transaction that created this version.
xmax — the ID of the transaction that deleted (or updated, producing a new version) this row, or zero if still live.

Each transaction takes a snapshot when it starts (REPEATABLE READ) or at the start of each statement (READ COMMITTED). The snapshot contains the highest committed transaction ID and the list of in-progress transactions. A row version is visible to a transaction T iff:

xmin is committed and xmin < snapshot_max and xmin is not in the in-progress list.
xmax is zero, or xmax is not yet committed at snapshot time, or xmax is in the in-progress list.

Reads walk the heap (or the index, then the heap) and apply the visibility check; rows invisible to the snapshot are skipped. Writers don’t block this — they may have added or marked versions, but those don’t pass the visibility test.

Write conflicts#

When two transactions update the same row, MVCC engines use one of:

First-updater-wins (Postgres, MySQL InnoDB). The first transaction to acquire the row-level write lock proceeds; the second blocks until the first commits or aborts. If the first commits, the second’s update applies to the new version (READ COMMITTED) or aborts with a serialization failure (REPEATABLE READ).
First-committer-wins / optimistic (Spanner, FoundationDB, CockroachDB). Both transactions proceed without blocking; at commit time, the engine checks whether the row’s version has changed since the read; if so, the transaction aborts and the application retries.

The optimistic variant scales better under low contention but bursts of retries under contention; the pessimistic variant offers more predictable latency at the cost of write-side blocking.

Garbage: vacuum, undo, or compaction#

MVCC generates dead row versions — old versions superseded by newer ones, and rows deleted by committed transactions. Three approaches to cleanup:

In-place with vacuum (Postgres) — dead tuples stay in the heap until a VACUUM (autovacuum) reclaims their space. Vacuum is concurrent with reads and writes but can lag under heavy update load, producing bloat — tables larger than their live data.
Undo segments (Oracle, MySQL InnoDB) — old versions live in a separate undo tablespace rather than the main heap. The main heap holds only the current version; old versions are reconstructed from undo when needed. The undo log is purged in transaction-ID order once no snapshot needs it.
Compaction (LSM-based MVCC engines, Spanner, Cockroach) — old versions live as separate KV pairs keyed by (key, timestamp); a background compaction process drops versions older than the configured retention plus any active snapshot.

The garbage collector — vacuum, purge thread, compactor — is the single most operationally significant background process in an MVCC engine. Tune it for the write rate or accept the bloat.

Isolation levels#

MVCC naturally implements snapshot isolation (SI): each transaction sees a consistent snapshot. SI prevents dirty reads, non-repeatable reads, and most phantom reads, but it allows write skew — two transactions read disjoint sets, write into each other’s read sets, and both commit because they don’t conflict on any single row.

To get serializable on top of MVCC, engines layer on:

Serializable Snapshot Isolation (SSI) — Postgres’s approach. Track read-write dependencies between concurrent transactions; if a cycle forms, abort one. Detects write skew without taking read locks.
Predicate locking — Spanner and some others; locks the range of keys the predicate could cover, not just the keys read.
Timestamp-based serialization — Spanner’s TrueTime, FoundationDB’s deterministic timestamps. The commit timestamp orders transactions globally; the engine can prove the schedule is serializable.

Indexes and MVCC#

Indexes in an MVCC engine must also handle versions. Postgres’s secondary indexes point to every version of a row — index entries for dead tuples persist until vacuum cleans both. InnoDB’s secondary indexes only point to the primary key; the visibility check happens after the primary-key lookup, which is why InnoDB indexes don’t suffer the same bloat as Postgres’s.

Variants#

Postgres MVCC — in-place, vacuum-driven. Simple model, well-documented, prone to bloat under update-heavy workloads.
InnoDB MVCC — undo segments, secondary indexes point only at primary key, no index bloat from MVCC directly.
Oracle MVCC — undo segments with explicit retention period; “snapshot too old” error if the undo is recycled before a long-running query finishes.
CockroachDB / Spanner MVCC — KV pairs keyed by (key, timestamp); older versions are just older keys in the LSM, dropped by compaction past the configured retention.
HyPer / hekaton style — in-memory MVCC where versions are linked lists and the engine optimises for the all-in-RAM case.
Hybrid Logical Clocks (HLC) — used by CockroachDB and YugabyteDB to assign timestamps in a distributed cluster without a TrueTime-style external clock.

Trade-offs#

MVCC — readers and writers don’t block each other, snapshot isolation comes naturally, distributed transactions tractable. Garbage collection (vacuum / undo / compaction) competes with foreground work; storage amplification on update-heavy tables; write skew is allowed by default.

2PL (lock-based) — straight forward serializability semantics, no version bookkeeping, no GC overhead. Read locks block writes and vice versa; long-running reads cripple throughput; deadlocks and lock convoys are recurring operational problems.

Specific axes:

Storage overhead. MVCC adds 16-24 bytes per row for version metadata (Postgres: 23-byte tuple header). For wide rows this is rounding error; for narrow rows it’s a meaningful fraction.
Hot-row throughput. A single row updated 100K times per second produces 100K versions per second; the GC must keep pace or storage explodes and reads slow down chasing version chains.
Long-running transactions. They pin the snapshot, preventing GC of any version newer than them. A reporting query that runs for hours can balloon bloat on every table it touches. Postgres exposes this as xmin horizon; the operational answer is “don’t run hour-long transactions on a write-heavy primary, use a read replica or a snapshot dump”.
Cross-version reads. An index range scan that crosses many dead versions does extra work checking visibility. InnoDB’s primary-key-pointing indexes sidestep this; Postgres’s “index-only scans” require the visibility map to be up to date.

Common pitfalls#

Forgetting that snapshot isolation allows write skew. Two transactions each read “no on-call doctor is off-duty” and each mark a different doctor off-duty; both commit; now there’s no on-call doctor. The fix is either SELECT FOR UPDATE on the constraint-relevant rows, an explicit SERIALIZABLE isolation level, or modeling the invariant as a uniqueness constraint.
Letting autovacuum lag. Postgres deployments where autovacuum is too conservative end up with bloated heaps and indexes; queries get slower and the database needs more RAM to keep working sets hot. Tune autovacuum_vacuum_scale_factor for the workload, not the default.
Long-lived transactions pinning the horizon. A connection that ran BEGIN and forgot to commit can hold off vacuum across the whole cluster. Add idle-in-transaction timeouts; monitor pg_stat_activity for backend_xid and backend_xmin.
Transaction ID wraparound (Postgres). 32-bit transaction IDs eventually wrap; autovacuum must “freeze” old tuples before that. Misconfiguration plus heavy write rates have caused real outages where the database goes into protective read-only mode.
Indexes growing under updates (Postgres). Every update creates a new version with a new TID; every secondary index gets a new entry. HOT (Heap-Only Tuple) updates avoid this when no indexed column changes, but a workload that updates indexed columns frequently bloats every index. REINDEX CONCURRENTLY is the operational fix.
Assuming SELECT FOR UPDATE makes things faster. It re-introduces 2PL-style blocking on the path it’s added to. Use it for correctness, not performance.
Mixing isolation levels per query and expecting serializable. Postgres applies serializable per-transaction; one READ COMMITTED statement in a SERIALIZABLE transaction doesn’t downgrade the isolation, but mixing levels across the application without thinking is a recipe for surprise anomalies.

Why MVCC won despite the GC tax

Pre-MVCC databases had SELECT block on writers and writers block readers; the operational reality was that any analytical query ran on a separate snapshot dump because the primary couldn’t tolerate the read locks. MVCC eliminated that — a long-running report can run against the production database alongside live writes. That single property turned out to be worth the entire vacuum / undo machinery. The distributed-database era cemented the choice: distributed locking is a hard problem, distributed versioning (timestamps + visibility checks) is tractable. Spanner, CockroachDB, FoundationDB, TiKV — every modern distributed transactional engine ships MVCC. The cost is real (autovacuum, undo retention, version chasing) but it’s bought a property the alternative architecture can’t match.