Write-Ahead Logging and Recovery — DBMS

What it is#

A write-ahead log (WAL) is an append-only log file that records every change a database is about to make to its data pages, written and flushed to durable storage before the corresponding data-page change is allowed to reach disk. The discipline is captured in the WAL protocol’s single rule: log first, then page. On crash, the recovery process replays the WAL to bring the on-disk pages back to a consistent state — redoing committed work and undoing uncommitted work.

Every serious durable database uses a WAL. PostgreSQL calls it pg_wal, MySQL InnoDB calls it the redo log, Oracle calls it the redo log too, SQL Server calls it the transaction log, SQLite calls it the WAL or rollback journal depending on the journal mode, RocksDB and LevelDB use a manifest-plus-write-ahead-log pair. The names differ; the mechanism is the same. The reference algorithm is ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), published by Mohan et al. at IBM in 1992, and it is the template every production WAL implementation follows in some form.

When to use it#

The question isn’t really “when to use a WAL” — it’s “when can you possibly skip one”. You can skip a WAL when:

The data is derived and rebuildable — a search index that can be regenerated from the source of truth, a cache, an analytical replica.
The store explicitly trades durability for throughput — Redis without AOF, MemSQL’s columnstore before checkpoint, in-memory test fixtures.
The workload is batch idempotent — a Spark job writing Parquet files where re-running the whole job on failure is the recovery story.

Use a WAL when:

Crashes must not lose committed data — every OLTP database, every transactional KV store, every queue with at-least-once delivery semantics.
Atomic multi-page writes are required — most B-tree page splits, LSM compaction handoffs, and index creations touch multiple pages and need to either fully happen or fully not.
Replication is in scope — physical replication is “ship the WAL”; logical replication is “decode the WAL”. A database without a WAL has no natural change feed.
Point-in-time recovery is required — restore a base backup, replay WAL up to a chosen timestamp.

How it works#

The WAL protocol#

The protocol has two rules:

Log before page — a data-page modification record must be in the durable log before the modified page itself reaches disk.
Log before commit ack — the log records for a transaction (including its COMMIT record) must be durably flushed before the client is told the commit succeeded.

Both are necessary. Rule 1 ensures recovery can redo or undo any in-flight change. Rule 2 ensures durability — once a client sees a commit ack, the transaction’s effects survive a crash even if the data pages were still dirty in RAM.

LSN: the heartbeat#

Every log record is tagged with a monotonically increasing Log Sequence Number (LSN). Pages carry the LSN of the last log record that touched them (the “page LSN”). On recovery, the engine reads a page, checks its LSN, and decides whether each log record needs to be reapplied. LSNs also order replication, drive checkpointing, and let backups know how much WAL to ship.

The recovery algorithm (ARIES, in three passes)#

[ANALYSIS]   scan WAL from last checkpoint forward
             reconstruct dirty page table + active transaction table
                            │
                            ▼
[REDO]       scan from earliest dirty-page recLSN forward
             reapply every change whose LSN > page LSN
             (idempotent — safe to repeat)
                            │
                            ▼
[UNDO]       for each transaction still active at crash:
             walk its log records backwards, generating
             compensation records, until BEGIN

After all three passes, the database is in a state where:

Every transaction that committed before the crash has its effects applied.
Every transaction that was active at the crash has its effects rolled back.
The WAL contains compensation records so a crash during recovery still recovers correctly.

The brilliance of ARIES is that the algorithm is idempotent — recovery can be interrupted by another crash, and the next recovery just resumes from the start without correctness problems. CLRs (compensation log records) ensure that the undo work itself is logged and won’t be redone.

Checkpointing#

Without checkpoints, recovery would replay the entire WAL from the database’s creation. A checkpoint writes a marker record into the WAL plus enough information that recovery can skip the WAL prefix before the marker:

Write dirty pages older than some threshold to disk.
Note the oldest LSN of any still-dirty page (the “recovery LSN”).
Write a CHECKPOINT record into the WAL with that recovery LSN.

After a crash, recovery starts from the recovery LSN of the most recent CHECKPOINT, not from the WAL’s beginning. Fuzzy checkpoints (the modern default) don’t pause the database; they flush dirty pages in the background while normal work continues. Postgres’s checkpoint_timeout and checkpoint_completion_target tune this.

Redo vs undo records#

WAL records describe the change at two levels:

Redo information — what to do to reapply the change (e.g., “page 47, offset 128, set bytes to X”).
Undo information — what to do to roll back the change (e.g., “page 47, offset 128, restore bytes to Y”).

A pure redo log (most LSM engines, RocksDB) doesn’t need undo because uncommitted transactions are not persisted anywhere durable — the memtable is in-memory only. A B-tree engine with steal-buffer policy (where dirty uncommitted pages can be evicted to disk to free buffer pool) needs both redo and undo, hence ARIES.

Group commit#

The naive WAL has one fsync per commit. At 100 transactions/sec on a disk with 5 ms fsync latency, that’s the ceiling. Group commit batches commits: the engine collects pending commits for a few microseconds, fsyncs once, then acks all of them. Postgres commit_delay, MySQL binlog_group_commit_sync_delay, and the equivalent in every engine are knobs on this trade-off — latency for throughput.

Variants#

ARIES (steal + no-force) — the reference. Allows dirty uncommitted pages to be stolen from the buffer pool and doesn’t force committed pages to disk on commit. Maximises throughput; requires full redo + undo.
No-steal + force — never write uncommitted pages, always flush committed pages. Simple recovery (no undo) but terrible performance. Only seen in toy databases.
Redo-only logs — LSM engines, LevelDB / RocksDB. Memtable is the only place uncommitted writes live; WAL replay reconstructs the memtable.
Logical replication WAL — Postgres logical decoding, MySQL row-based binlog. The same WAL is decoded into row-level change events for downstream consumers.
Combined log + storage (some NewSQL engines) — the WAL is the storage. The Raft log in Spanner, CockroachDB, and TiKV plays the role of both replication log and durability log; data is materialised lazily.
WAL on a separate device — co-locating the WAL on the same disk as data competes for IOPS. High-end OLTP deployments put WAL on dedicated NVMe with battery-backed cache (or its modern equivalent).

Trade-offs#

With WAL — durable commits, atomic multi-page operations, replication-ready, point-in-time recovery, well-understood crash semantics. Every commit pays an fsync; sequential WAL write competes with data writes for disk; some workloads spend 30%+ of total I/O on the log.

Without WAL (or with WAL disabled) — peak throughput on workloads that can lose recent writes; some bulk-load paths in every database support a WAL-bypass mode. No crash safety; no replication; restart starts from the last full backup.

Tuning surface:

fsync mode — strict (every commit fsyncs), batched (group commit), or off (durability disabled). Most engines default to strict.
WAL segment size — Postgres ships 16 MB; some deployments raise to 64 MB or more to reduce filesystem overhead at scale.
Retention — how much WAL to keep for replication slots, archive shipping, PITR. Keep too little and lagging replicas need a base-backup reseed; keep too much and disk fills.
Compression — many engines compress WAL records (wal_compression in Postgres, binlog row-image compression in MySQL). Trades CPU for I/O.

Common pitfalls#

Disabling fsync to make benchmarks look good — a database with fsync=off is a fast non-durable database, which is not a useful artifact. The fsync overhead is the price of “didn’t lose data”.
WAL on the same volume as data — both compete for IOPS; under load, commit latency tracks data-write contention. Always put WAL on its own volume past modest scale.
Forgetting to monitor WAL retention — replication slots that aren’t consuming, archives that aren’t shipping, or PITR retention misconfigured all silently grow pg_wal until the disk fills and the database refuses writes.
Mixing replication and PITR retention — Postgres archive_mode = on plus an unread replication slot can each pin different WAL ranges. Reason about both before pruning.
Believing single-disk durability — even an fsync on a single SSD can be lost if the device fails. WAL is necessary but not sufficient; replication and backups handle the failure modes WAL doesn’t.
Long-running transactions block WAL recycling — Postgres can’t recycle WAL past the oldest in-progress transaction’s snapshot. A leaked transaction (e.g., from a stuck application) can fill the disk.
WAL-ignorant bulk loads — COPY with wal_level=minimal and TRUNCATE in the same transaction can skip WAL writes in Postgres; outside that pattern, bulk loads fully log and amplify disk usage dramatically.

Why ARIES from 1992 is still the reference

ARIES nailed three properties at once that no prior algorithm did: it allowed steal + no-force (the buffer-management policy that gives the best throughput), it produced idempotent recovery (so a crash during recovery is just another crash), and it supported partial rollback (savepoints, nested transactions). Every property was already known in isolation; ARIES was the synthesis that worked in production. Every commercial database adopted it or a close variant within a decade. Thirty years later, the names of the log records have changed and the storage media are different, but if you sketch the data flow of Postgres’s WAL or InnoDB’s redo log on a whiteboard, the diagram is recognisably ARIES.