Crash Consistency — fsck and Journaling

Write-ahead logging, metadata-only vs full journaling, ordered mode, soft updates, and why fsck stopped scaling.

Building Block Intermediate
10 min read
crash-consistency journaling file-system write-ahead-log fsck

What it is

Crash consistency is the property that a file system survives a power loss or kernel panic without being corrupted. The hard problem: a single high-level operation like “append one block to this file” updates several on-disk structures — the inode (to record the new file size and the new data-block pointer), the data block (with the new bytes), and the data-block bitmap (to mark that block as allocated). If the machine crashes after some of those writes have hit disk but not others, the file system is in a state that doesn’t make sense.
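To make the failure space concrete, the sketch below enumerates every subset of the three writes that could have reached disk before a crash and labels the resulting state. The `classify` function and its verdict strings are illustrative assumptions, not part of any real file system.

```python
from itertools import combinations

# The three on-disk updates behind "append one block to this file".
WRITES = ("inode", "bitmap", "data")

def classify(on_disk: frozenset) -> str:
    """Describe the state left behind if only `on_disk` writes persisted."""
    inode, bitmap, data = (w in on_disk for w in WRITES)
    if not inode and not bitmap:
        # Nothing references or reserves the block: the file looks unchanged.
        return "consistent (append lost)"
    if inode and bitmap and data:
        return "consistent (append complete)"
    if inode and bitmap:
        return "metadata consistent, but the file exposes garbage data"
    if inode:
        return "inconsistent: inode points to a block the bitmap calls free"
    return "inconsistent: block marked allocated but unreferenced (leak)"

def crash_states():
    """Yield (persisted_writes, verdict) for all 2**3 possible crash outcomes."""
    for k in range(len(WRITES) + 1):
        for subset in combinations(WRITES, k):
            yield subset, classify(frozenset(subset))

for subset, verdict in crash_states():
    print(f"{', '.join(subset) or 'nothing':22} -> {verdict}")
```

Only two of the eight outcomes are fully correct, which is why the update needs to be made atomic rather than merely fast.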

Journaling is the dominant solution: write a description of the intended change to a separate journal area first, force it to disk, then apply the change to the file system proper. On crash recovery, replay the journal: transactions that were fully logged are re-applied to completion, transactions that were only partially logged are discarded as if they had never started, and operations that never reached the journal are simply lost.

When to use it

Nearly every mainstream file system that updates in place uses journaling: ext3, ext4, and XFS on Linux, FreeBSD’s UFS with SU+J (soft updates plus a journal), NTFS, and HFS+. The alternatives have receded:

  • No journal, no recovery aid — used in early UNIX and on FAT-style file systems. Crash recovery means running fsck over the entire disk, which on a multi-TB disk can take hours.
  • Soft updates — the FreeBSD UFS approach, carefully ordering writes so a crashed file system is always recoverable without a full fsck. Elegant; complex to implement; was overtaken by journaling because journaling is easier to extend.
  • Copy-on-write file systems (ZFS, btrfs, APFS) — never overwrite live data; new versions land in fresh blocks and a single atomic pointer swap commits. Crash consistency comes for free; the journal disappears as a separate concept.

The “when to use” question for the engineer designing application-level persistence is the same shape: any state that must survive crashes goes through an append-only log first, then a separate process applies the log to the live state. Databases (WAL), message queues (log segments), and even some application caches use the same trick.
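As a minimal sketch of that application-level pattern, the hypothetical `TinyKV` class below keeps its only durable state in an append-only log and rebuilds its in-memory dictionary by replaying the log on startup. The JSON-lines record format and the class name are assumptions made for illustration; a real store would also add checksums and log compaction.

```python
import json
import os
import tempfile

class TinyKV:
    """Key-value store whose only durable state is an append-only log."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.state = {}
        self._replay()
        self.log = open(log_path, "ab")

    def _replay(self):
        """Rebuild state from the log and cut off a torn final record."""
        if not os.path.exists(self.log_path):
            return
        good = 0  # byte offset just past the last complete record
        with open(self.log_path, "rb") as f:
            for line in f:
                if not line.endswith(b"\n"):
                    break  # record was cut short by a crash mid-append
                try:
                    key, value = json.loads(line)
                except ValueError:
                    break
                self.state[key] = value
                good += len(line)
        os.truncate(self.log_path, good)

    def put(self, key, value):
        self.log.write((json.dumps([key, value]) + "\n").encode())
        self.log.flush()              # Python buffer -> page cache
        os.fsync(self.log.fileno())   # page cache -> stable storage
        self.state[key] = value       # apply only after the log is durable

    def close(self):
        self.log.close()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "kv.log")
    kv = TinyKV(path)
    kv.put("a", 1)
    kv.close()
    print(TinyKV(path).state)  # state is rebuilt purely from the log
```

The ordering inside `put` is the point: the record is forced to disk before the live state changes, so a crash at any moment leaves a log from which the state can be rebuilt.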

How it works

The fsck era

Before journaling, the recovery path was a separate program — fsck (file system check) — that walked the entire on-disk structure looking for inconsistencies and patching them up. Some of what it had to detect:

  • Inodes marked allocated but not pointed to by any directory (orphans → moved to lost+found).
  • Data blocks pointed to by inodes but not marked allocated in the bitmap.
  • Data blocks marked allocated but not pointed to by any inode.
  • Two inodes claiming the same data block (cross-linked files).
  • Directory entries pointing to invalid inodes.
  • Link counts not matching the number of directory entries.

The problem with this approach is that running time scales with the size of the file system, not with the amount of unflushed work. A 4 TB disk takes hours; a 40 TB disk is unreasonable. When disk sizes grew past the point where fsck could finish in a reasonable boot window, journaling took over.
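A toy version of several of the checks listed above shows where the cost comes from. The data model here (dictionaries for inodes and directory entries, a set for the bitmap) is an assumption chosen for brevity; link-count checking and the root directory are left out.

```python
from collections import Counter

def toy_fsck(inodes: dict, bitmap: set, dirents: dict) -> list:
    """Cross-check toy metadata.

    inodes:  inode number -> list of data-block numbers
    bitmap:  set of block numbers marked allocated
    dirents: file name -> inode number
    Returns a sorted list of human-readable problems.
    """
    problems = []
    refs = Counter(b for blocks in inodes.values() for b in blocks)
    named = set(dirents.values())

    for ino in inodes:                      # pass over every inode
        if ino not in named:
            problems.append(f"orphan inode {ino}")
    for name, ino in dirents.items():       # pass over every directory entry
        if ino not in inodes:
            problems.append(f"dirent {name!r} -> invalid inode {ino}")
    for block, count in refs.items():       # pass over every referenced block
        if block not in bitmap:
            problems.append(f"block {block} referenced but free in bitmap")
        if count > 1:
            problems.append(f"block {block} cross-linked ({count} owners)")
    for block in bitmap - set(refs):        # pass over the whole bitmap
        problems.append(f"block {block} allocated but unreferenced")
    return sorted(problems)

for p in toy_fsck(
    inodes={1: [10, 11], 2: [11], 3: [12]},
    bitmap={10, 11, 13},
    dirents={"a": 1, "b": 2, "c": 9},
):
    print(p)
```

Every loop visits all inodes, all directory entries, or the whole bitmap, regardless of how little was in flight at the time of the crash.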

Write-ahead logging: the journal

The basic idea: before modifying the on-disk structures, write a record describing the modification to a separate journal area, and force that record to be durable. Only then proceed to apply the change to the file system. On recovery, replay any logged-but-not-yet-applied transactions.

A typical journaled write of “append a block to file X”:

1. Write to journal:
     TxBegin
     write inode for X (new size, new block ptr)
     write block bitmap (new block marked allocated)
     write data block (the bytes themselves)
     TxEnd
2. fsync the journal, so the transaction is durable on disk.
3. Checkpoint: apply the same writes to the file system in place.
4. Free the journal space for the transaction.

If the machine crashes between (2) and (3), recovery sees a committed transaction in the journal and replays it. If it crashes during (1), the transaction is incomplete (no TxEnd) and is discarded. No promise is broken in that case: the transaction was never reported as durable, so no fsync covering it can have returned.

The crucial invariant is write ordering: the journal record must hit disk before any of the corresponding in-place updates. File systems achieve this with explicit barriers (cache-flush commands to the drive) and ordered I/O dependencies.
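The protocol and its crash cases can be simulated in a few lines. This is a sketch under strong assumptions: the "disk" is a dictionary, the journal is a list, every append is treated as immediately durable, and write ordering is modelled simply by statement order. The `crash_at` labels are made up for this example.

```python
class Crash(Exception):
    """Raised to simulate power loss at a chosen point."""

OLD = {"inode": "size=1", "bitmap": "0100", "block7": ""}
NEW = {"inode": "size=2", "bitmap": "0110", "block7": "hello"}

def journaled_update(disk, journal, writes, crash_at=None):
    journal.append(("TxBegin",))                       # step 1: log the intent
    for location, value in writes.items():
        journal.append(("write", location, value))
    if crash_at == "logging":
        raise Crash                                    # no TxEnd reached the journal
    journal.append(("TxEnd",))                         # step 2: transaction committed
    if crash_at == "committed":
        raise Crash
    for i, (location, value) in enumerate(writes.items()):
        disk[location] = value                         # step 3: checkpoint in place
        if crash_at == "checkpointing" and i == 0:
            raise Crash                                # only the first write landed
    journal.clear()                                    # step 4: free journal space

def recover(disk, journal):
    """Replay committed transactions; drop anything without a TxEnd."""
    pending = None
    for record in journal:
        if record[0] == "TxBegin":
            pending = {}
        elif record[0] == "write" and pending is not None:
            pending[record[1]] = record[2]
        elif record[0] == "TxEnd" and pending is not None:
            disk.update(pending)
            pending = None
    journal.clear()

def crash_then_recover(crash_at):
    disk, journal = dict(OLD), []
    try:
        journaled_update(disk, journal, NEW, crash_at)
    except Crash:
        pass
    recover(disk, journal)
    return disk

for point in ("logging", "committed", "checkpointing", None):
    print(point, "->", crash_then_recover(point))
```

After recovery the disk holds either the complete old state or the complete new state, never a mixture, which is the atomicity the journal exists to provide.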

What’s in the journal

Three flavours, in decreasing order of safety and increasing order of speed:

| Mode | What’s journaled | Speed | Safety |
| --- | --- | --- | --- |
| Full data + metadata | Every byte of every write, plus all metadata | Slowest (2× write amplification) | Strongest: data is always consistent with metadata |
| Ordered (metadata only, data first) | Only metadata is journaled; data is written to its final location before the metadata that points to it is committed | Middle | No data corruption can be created by the FS, but data can still be lost (a partially written data block points nowhere) |
| Writeback (metadata only) | Metadata journaled, data written whenever | Fastest | Metadata is consistent, but a file can end up pointing to stale or garbage data from a previous file |

ext3 and ext4 default to ordered mode, the sensible compromise. Databases that already do their own WAL sometimes mount in writeback mode and accept the file-system-level data risk because they have their own recovery.
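The difference between ordered and writeback mode comes down to which write can land first. The sketch below models a single appended block and a crash after the first of the two writes. It assumes the worst case for writeback (metadata always lands before data), whereas in practice writeback simply leaves the order unconstrained. All names are hypothetical.

```python
def append_block(disk, mode, crash_between=False):
    """Append block7 to a file; optionally crash after the first write.

    disk: {'block7': contents, 'inode': list of block names in the file}
    mode: 'ordered' (data before metadata) or 'writeback' (no constraint,
          modelled here as metadata before data).
    """
    steps = {
        "ordered": ["data", "metadata"],
        "writeback": ["metadata", "data"],
    }[mode]
    for i, step in enumerate(steps):
        if step == "data":
            disk["block7"] = "new bytes"
        else:
            disk["inode"] = ["block7"]
        if crash_between and i == 0:
            break  # power loss after the first write
    return disk

def visible_contents(disk):
    """What a reader of the file sees after the crash."""
    return [disk[b] for b in disk["inode"]]

def fresh_disk():
    # block7 was freed by a deleted file and still holds its old bytes
    return {"block7": "deleted file's bytes", "inode": []}

print(visible_contents(append_block(fresh_disk(), "ordered", crash_between=True)))
print(visible_contents(append_block(fresh_disk(), "writeback", crash_between=True)))
```

In the ordered case the append is lost but nothing wrong is visible; in the writeback case the file now exposes bytes that belonged to another file.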

Recovery

On mount after an unclean shutdown:

1. Find the start of the journal (a known location, or a superblock pointer).
2. Walk forward through journal records.
3. For each TxBegin..TxEnd pair, replay the writes.
4. Discard any incomplete transaction (no TxEnd found).
5. Mark the journal empty, mount the file system normally.

This runs in time proportional to the amount of un-checkpointed data in the journal, not the size of the file system: seconds to tens of seconds, versus hours for a full fsck.
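The recovery steps above can be sketched as a single forward scan. The record layout (tuples tagged `begin`, `write`, `end` with a transaction id) is an assumption for illustration; real journals use binary block formats. Because each record carries a whole block image, replaying the same journal twice gives the same result.

```python
def replay(journal, disk):
    """Scan the journal once and apply only committed transactions.

    journal: list of ('begin', txid) | ('write', txid, block, data) | ('end', txid)
    disk:    dict mapping block number -> contents
    Returns (replayed, discarded) transaction counts.
    """
    open_tx = {}      # txid -> buffered (block, data) writes
    replayed = 0
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "begin":
            open_tx[txid] = []
        elif kind == "write" and txid in open_tx:
            open_tx[txid].append(record[2:])
        elif kind == "end" and txid in open_tx:
            for block, data in open_tx.pop(txid):
                disk[block] = data          # whole-block image: safe to repeat
            replayed += 1
    return replayed, len(open_tx)           # leftovers never saw their 'end'

journal = [
    ("begin", 1), ("write", 1, 5, "A"), ("end", 1),
    ("begin", 2), ("write", 2, 5, "B"), ("write", 2, 9, "C"), ("end", 2),
    ("begin", 3), ("write", 3, 9, "X"),     # crash before ('end', 3)
]
disk = {5: "old", 9: "old", 100: "untouched"}
print(replay(journal, disk), disk)
```

The loop touches only the journal records and the blocks they name; block 100 and the rest of the disk are never read.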

Checkpointing and the journal’s circular buffer

The journal isn’t infinite: it’s a fixed-size circular buffer (commonly 128 MB on ext4, though the default depends on the volume size and the mke2fs version). After a transaction is durably applied to the file system, its journal space can be reused. The file system periodically checkpoints: it writes back dirty buffers to their final locations and advances the journal’s tail pointer.

If checkpointing falls behind and the journal fills, writes block until space is freed. A common production issue on write-heavy workloads is checkpoint contention.
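The head and tail bookkeeping can be sketched with two monotonically increasing counters. `RingJournal` is a hypothetical class that tracks only space accounting, not the records themselves; a `False` return from `log` stands in for a writer blocking until a checkpoint frees space.

```python
class RingJournal:
    """Space accounting for a fixed-size circular journal."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # journal size in blocks
        self.head = 0              # total blocks ever logged
        self.tail = 0              # total blocks ever checkpointed

    @property
    def used(self) -> int:
        return self.head - self.tail

    def log(self, nblocks: int) -> bool:
        """Reserve journal space; False means the writer must wait."""
        if nblocks > self.capacity - self.used:
            return False
        self.head += nblocks
        return True

    def checkpoint(self, nblocks: int) -> None:
        """Advance the tail once nblocks have been written in place."""
        self.tail += min(nblocks, self.used)

    def slot(self, sequence: int) -> int:
        """Physical position in the ring for a logical sequence number."""
        return sequence % self.capacity

j = RingJournal(capacity=8)
print(j.log(5))      # fits
print(j.log(4))      # does not fit: only 3 blocks free
j.checkpoint(3)      # 3 blocks applied in place, tail advances
print(j.log(4))      # fits now
print(j.used, j.slot(j.head))
```

Keeping `head` and `tail` as ever-growing counters and wrapping only in `slot` avoids the usual ambiguity between a full and an empty ring.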

Variants

Metadata-only vs. full journaling

Already touched on above. The math is brutal: full data journaling roughly doubles the bytes written (every byte goes to disk twice: once to the journal, once to its final location). Metadata journaling adds only the small overhead of metadata blocks. Most file systems make metadata-only the default; full data journaling is a mount option for the paranoid.
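A back-of-the-envelope calculation makes the gap visible. The 4 KiB block size and the example of a 1 MiB write touching two metadata blocks are assumptions; the TxBegin and TxEnd records are ignored.

```python
BLOCK = 4096  # assumed block size in bytes

def bytes_written(data_blocks: int, metadata_blocks: int, mode: str) -> int:
    """Total bytes hitting disk for one transaction under a journaling mode."""
    if mode == "full":
        journal = data_blocks + metadata_blocks   # everything is logged
    elif mode in ("ordered", "writeback"):
        journal = metadata_blocks                 # only metadata is logged
    else:
        raise ValueError(f"unknown mode: {mode}")
    in_place = data_blocks + metadata_blocks      # final locations, always written
    return (journal + in_place) * BLOCK

def amplification(data_blocks: int, metadata_blocks: int, mode: str) -> float:
    """Bytes written to disk per byte of user data."""
    return bytes_written(data_blocks, metadata_blocks, mode) / (data_blocks * BLOCK)

# A 1 MiB append (256 data blocks) that updates 2 metadata blocks:
print(amplification(256, 2, "full"))      # 2.015625
print(amplification(256, 2, "ordered"))   # 1.015625
```

For large sequential writes the metadata term nearly vanishes, so full journaling costs about twice the bandwidth while metadata-only journaling costs almost nothing extra.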

Soft updates (FreeBSD UFS)

An alternative that uses no journal. The file system carefully tracks dependencies between writes and only issues them in an order that guarantees any crash leaves the file system in a recoverable state — recoverable in the sense that the only inconsistencies are leaked resources (e.g. allocated-but-unreferenced blocks), which a background fsck can clean up while the system is online.

Pros: no journal write-amplification. Cons: implementation complexity is enormous (the dependency tracking touches every operation), and extending the file system means re-deriving the dependency rules.
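The core ordering idea can be illustrated with a topological sort over write dependencies. This is a heavily simplified sketch: the dependency table is a made-up example for a one-block append, and real soft updates must also cope with cyclic dependencies between blocks that hold several structures, which this ignores. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each write maps to the set of writes that must reach disk BEFORE it.
# Rule illustrated: never write a pointer before its target is initialised
# and marked allocated.
DEPS = {
    "data block 7 contents": set(),
    "bitmap: block 7 allocated": set(),
    "inode 12: pointer to block 7": {
        "data block 7 contents",
        "bitmap: block 7 allocated",
    },
}

def write_order(deps):
    """Return one issue order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def crash_is_recoverable(written, deps):
    """True if every write on disk has all of its prerequisites on disk too."""
    done = set(written)
    return all(deps[w] <= done for w in written)

order = write_order(DEPS)
for k in range(len(order) + 1):
    print(order[:k], crash_is_recoverable(order[:k], DEPS))
```

Crashing after any prefix of this order can at worst leave block 7 allocated but unreferenced, which is exactly the kind of leak a background fsck can reclaim.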

Log-structured file systems

LFS goes further: instead of writing changes in place after journaling them, always write to the log. The file system is the log. Reads consult an in-memory map to find the latest version of each block; writes are sequential. Garbage collection reclaims old log segments. Covered in detail in Log-Structured File System (LFS); the design influenced flash file systems and SSD FTLs.
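A toy model of the "file system is the log" idea: writes only ever append, an in-memory map points at the latest version of each block, and garbage collection copies live entries into a fresh log. `ToyLFS` is a hypothetical class; real implementations work on fixed-size segments and persist the map in the log as well.

```python
class ToyLFS:
    """Append-only block store with an in-memory latest-version map."""

    def __init__(self):
        self.log = []       # append-only list of (block_id, data)
        self.latest = {}    # block_id -> index of newest entry in self.log

    def write(self, block_id, data):
        self.log.append((block_id, data))        # always sequential
        self.latest[block_id] = len(self.log) - 1

    def read(self, block_id):
        return self.log[self.latest[block_id]][1]

    def garbage_collect(self) -> int:
        """Drop superseded entries; return how many were reclaimed."""
        before = len(self.log)
        live = sorted(self.latest.values())
        self.log = [self.log[i] for i in live]
        self.latest = {bid: i for i, (bid, _) in enumerate(self.log)}
        return before - len(self.log)

fs = ToyLFS()
fs.write("a", "v1")
fs.write("b", "v1")
fs.write("a", "v2")               # supersedes the first entry, which stays in the log
print(fs.read("a"), len(fs.log))
print(fs.garbage_collect(), len(fs.log), fs.read("a"))
```

Overwrites cost nothing up front because they are just appends; the price is paid later, when the collector has to copy live data to reclaim space.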

Copy-on-write (ZFS, btrfs, APFS)

A different solution to the same problem. Instead of mutating blocks in place, new versions are written to fresh blocks, and a single root-pointer write atomically commits the new tree. On crash, the file system reads the last successfully committed root and is automatically consistent — no recovery pass needed. The cost is fragmentation (related blocks land far from each other over time) and a more complex on-disk format.
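The root-pointer commit can be sketched with a two-level tree: a root block mapping names to data blocks. `CowStore` and its `crash_before_commit` flag are hypothetical, and the single assignment to `self.root` stands in for the atomic on-disk root write.

```python
class CowStore:
    """Two-level copy-on-write store: root block -> data blocks."""

    def __init__(self):
        self.blocks = {0: {}}   # block id -> contents; block 0 is an empty root
        self.root = 0           # the one committed root pointer
        self._next = 1

    def _alloc(self, contents):
        block_id = self._next
        self._next += 1
        self.blocks[block_id] = contents   # fresh block; nothing is overwritten
        return block_id

    def update(self, name, data, crash_before_commit=False):
        leaf = self._alloc(data)                      # new data block
        new_root = dict(self.blocks[self.root])       # copy of the root
        new_root[name] = leaf
        new_root_id = self._alloc(new_root)
        if crash_before_commit:
            return                                    # new blocks stay unreachable
        self.root = new_root_id                       # atomic commit point

    def read(self, name, root=None):
        tree = self.blocks[self.root if root is None else root]
        return self.blocks[tree[name]]

s = CowStore()
s.update("f", "v1")
snapshot = s.root
s.update("f", "v2", crash_before_commit=True)
print(s.read("f"))                   # still v1: the crash left the old tree intact
s.update("f", "v3")
print(s.read("f"), s.read("f", root=snapshot))
```

Holding on to an old root id gives a consistent view of the past for free, which is how snapshots fall out of the design.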

Trade-offs

  • Journaling. Recovery is proportional to journal size, not disk size. Well-understood implementation. Easy to extend (add a new operation, log a new record type). Cost: every metadata write becomes two writes (journal + in-place), and write barriers are required for ordering.
  • Copy-on-write. Recovery is instant; there is no journal at all. Snapshots and clones come for free. Cost: write amplification from rewriting parent blocks up the tree on every change, fragmentation over time, and background space reclamation (freeing superseded blocks, rebalancing or defragmenting) that can be hard to tune.

Other axes:

  • Journal location. Inline with the file system (ext4), separate device (ext4 external journal on a fast SSD), or in NVRAM (high-end storage arrays). Putting the journal on a fast device is one of the cheapest tuning wins for write-heavy workloads.
  • Ordered vs. writeback. Ordered avoids data corruption but pays the cost of forcing data before metadata. Writeback is faster but exposes the “metadata says size=4 KB, data block contains last user’s garbage” hazard.
  • Disk caches and barriers. Disks lie. They report writes complete from their own DRAM cache, before the data hits oxide / flash. Without explicit FLUSH CACHE commands (or FUA writes), the journal’s ordering guarantees are illusory. Linux’s barrier=1 mount option (now the default on ext4) makes this correct, at the cost of some performance.
  • fsync semantics. A file system can be journaled and still have surprising fsync behaviour. The classic bug is “rename to overwrite without fsync” — many file systems used to lose data on this pattern. ext4’s auto_da_alloc heuristic was added specifically to make naive rename-and-replace patterns safe.

Common pitfalls

  • Trusting write() for durability. A successful write call only copies bytes into the page cache. Without fsync (or O_SYNC / O_DSYNC flags), nothing is durable. Applications that crash-test this discover it the hard way.
  • fsync on the file but not on the directory. Creating a new file and calling fsync(file_fd) flushes the file’s data, but the directory entry that names the file might still be cached. Crash recovery can lose the directory entry, leaving the data orphaned. Correct pattern: fsync(file_fd); fsync(dir_fd);.
  • Renaming without fsync. The “write to tmp, then rename” pattern is atomic at the file-system level — but the contents of the new file must be fsync’d before the rename, or the rename can commit while the data is still in cache.
  • Mounting with barrier=0 for performance. Disables the journal’s write ordering. Faster benchmarks; corrupts the file system on power loss. Don’t.
  • Assuming the journal protects user data in ordered mode. It doesn’t — only metadata is in the journal. A partially-written data block produces a file with garbage at the end, not file-system inconsistency.
  • Filling the journal. A burst of metadata-heavy operations (rm -rf, untar) can saturate the journal. Symptoms: writes block intermittently. Tuning options: larger journal, faster journal device, batch writes in the application.
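The fsync-related pitfalls above combine into one pattern for durably replacing a file. The sketch below assumes a POSIX system where a directory can be opened and fsync’d (as on Linux); `atomic_replace` is a hypothetical helper name.

```python
import contextlib
import os
import tempfile

def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the rename stays within one file system.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                  # Python buffer -> page cache
            os.fsync(f.fileno())       # contents durable BEFORE the rename
        os.replace(tmp, path)          # atomic rename over the old file
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)             # do not leave a stray temp file behind
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)               # make the new directory entry durable
    finally:
        os.close(dir_fd)
```

A call such as `atomic_replace("settings.json", b"{}")` performs three distinct durability steps: file contents, rename, directory entry. Skipping the first risks an empty or partial file after a crash; skipping the last risks the rename itself being lost.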

Why is fsync still slow in 2026 even on NVMe?

fsync forces all pending writes for that file (and on most file systems, the journal up to and including the relevant transaction) to durable storage. On an NVMe drive with a volatile write cache, that means flushing the cache to NAND, which can take milliseconds even on fast drives because NAND program operations are slow. Some workloads use fdatasync (which skips metadata such as mtime unless it is needed to read the data back, for example a changed file size) or rely on application-level batching to amortise the cost. On enterprise arrays with battery-backed NVRAM caches, fsync is back to microseconds because the “durable” point is RAM with a UPS.
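Application-level batching can be sketched as one sync per group of records rather than one per record. `append_records` is a hypothetical helper; it falls back to `os.fsync` on platforms where Python does not provide `os.fdatasync`, and it assumes small records that `os.write` accepts in a single call.

```python
import os
import tempfile

# fdatasync is missing on some platforms (for example macOS); fall back to fsync.
_sync = getattr(os, "fdatasync", os.fsync)

def append_records(path: str, records: list, batch_size: int) -> int:
    """Append byte records, syncing once per batch. Returns the sync count."""
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i, record in enumerate(records, 1):
            os.write(fd, record)
            if i % batch_size == 0:
                _sync(fd)              # one flush covers the whole batch
                syncs += 1
        if len(records) % batch_size:
            _sync(fd)                  # flush the final partial batch
            syncs += 1
    finally:
        os.close(fd)
    return syncs

with tempfile.TemporaryDirectory() as d:
    n = append_records(os.path.join(d, "log"), [b"rec\n"] * 10, batch_size=4)
    print(n)  # 3 syncs: after records 4 and 8, plus one for the tail
```

The trade-off is latency for throughput: records in a batch that has not yet been synced are not durable, so nothing in that batch should be acknowledged to a client until its sync returns.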
