Data Integrity — Checksums and Scrubbing — Operating Systems

What it is

Storage devices fail in ways that don’t show up as a read error. The drive returns bytes; the bytes are wrong. Causes: a cosmic ray flipped a bit in a DRAM buffer, the controller wrote the right data to the wrong location, the cable corrupted a sector in transit, the firmware has a latent bug. To the OS and to RAID, the disk looks healthy.

Data integrity at the file-system layer is the set of techniques (checksums on every block, periodic scrubbing of the entire device, misdirected-write protection) that catch these errors and, with sufficient redundancy, repair them. ZFS made this mainstream in 2005; btrfs followed with data and metadata checksums, ReFS offers optional integrity streams for data, APFS checksums metadata only, and ext4 added optional metadata checksums.

When to use it

You always want some level of integrity protection. The question is how much:

Critical data on commodity hardware: full data-and-metadata checksums + scrubbing + redundancy (ZFS mirrors or raidz / btrfs RAID1 / Reed-Solomon erasure coding). Worth the CPU and space overhead.
Bulk storage on enterprise hardware: metadata checksums + RAID 6 + occasional scrubbing. The drives have stronger internal ECC; the file system catches the rare miss.
Ephemeral data (caches, scratch): no checksums needed; rebuild from source if the data looks wrong.

The hyperscaler argument that integrity matters at all: at a per-bit error rate of 10^-15, every petabyte read (8 × 10^15 bits) is expected to contain about eight corrupted bits, so a fleet that moves petabytes per day sees silent corruption daily. ZFS’s “Cosmic Ray Per Hour” calculation popularised this; modern data-protection programs all assume corruption is normal, not exceptional.
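The arithmetic behind that claim is short enough to check directly. The sketch below assumes a workload that reads one decimal petabyte, which is an illustrative figure rather than a measured one, and `expected_corruptions` is a hypothetical helper name.

```python
def expected_corruptions(bytes_read: float, per_bit_error_rate: float) -> float:
    """Expected number of silently corrupted bits for a given read volume."""
    return bytes_read * 8 * per_bit_error_rate


PETABYTE = 10**15  # decimal petabyte, in bytes (assumed read volume)

# Evaluate both ends of the 10^-15 to 10^-14 range quoted in this article.
for rate in (1e-15, 1e-14):
    expected = expected_corruptions(PETABYTE, rate)
    print(f"per-bit error rate {rate:.0e}: ~{expected:.0f} corrupted bits per PB read")
```

The result scales linearly with read volume, so a system that reads ten petabytes per day should expect ten times as many events.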

How it works

What goes wrong

Latent sector errors (LSE): a sector that was successfully written becomes unreadable later. Drives report this as a read error (UNRECOVERED READ). Because the drive names the failing sector, RAID can rebuild it from parity or a mirror as long as the array still has redundancy.
Silent corruption: the drive returns data without flagging an error, but the data is wrong. Causes: bit flips in cache, firmware bugs, cable / connector issues, misdirected writes (writing block A’s data to block B’s location).
Misdirected writes: a hardware or firmware fault causes a write intended for LBA X to land at LBA Y. The drive reports success. LBA X still holds its old contents and LBA Y has been overwritten, so reads of either LBA return wrong data.
Memory corruption above the drive: standard (SECDED) ECC corrects single-bit DRAM errors and detects double-bit errors; wider multi-bit errors and corruption in paths that lack end-to-end protection (buffers in controllers and HBAs, for example) slip through.

The frequency: published numbers from CERN and NetApp put silent corruption at 10^-14 to 10^-15 per bit, with order-of-magnitude variation across drive families.

Checksums on every block

A checksum (CRC32C, fletcher4, SHA-256, xxhash) is computed over the data and stored separately — not in the same block as the data, so a mis-direction or torn write can be detected.

write(block):
  checksum = H(data)
  data      =>  written to data block
  checksum  =>  stored in parent inode / extent record, not in the data block
read(block):
  read data + recompute checksum
  compare with checksum stored in parent
  if mismatch: report integrity error, try repair from redundancy

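The checksum-elsewhere placement is critical. A simple inline checksum (last bytes of the same block) catches bit-flips but not mis-direction: a misdirected write deposits a self-consistent block (data plus matching checksum) at the wrong location, the intended location keeps its old, equally self-consistent block, and both pass verification. A checksum held in the parent fails in both cases, because neither block is the one the parent expects.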
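The difference between the two placements can be shown with a toy device. This is a minimal sketch, not any file system's on-disk format: `disk`, `parent`, and the `lands_at` parameter are hypothetical names, and `zlib.crc32` (plain CRC-32) stands in for CRC32C or fletcher4.

```python
import zlib
from typing import Dict, Optional, Tuple

disk: Dict[int, Tuple[bytes, int]] = {}   # toy device: LBA -> (payload, inline checksum)
parent: Dict[int, int] = {}               # toy parent metadata: LBA -> expected checksum


def write(lba: int, data: bytes, lands_at: Optional[int] = None) -> None:
    """Write data for `lba`; `lands_at` simulates a misdirected write."""
    parent[lba] = zlib.crc32(data)                 # checksum recorded in the parent
    target = lba if lands_at is None else lands_at
    disk[target] = (data, zlib.crc32(data))        # inline checksum travels with the data


def read_inline(lba: int) -> bool:
    data, inline = disk[lba]
    return zlib.crc32(data) == inline              # proves only self-consistency


def read_parent(lba: int) -> bool:
    data, _ = disk[lba]
    return zlib.crc32(data) == parent[lba]         # proves it is the expected block


write(1, b"old contents of block 1")
write(2, b"contents of block 2")
write(1, b"new contents of block 1", lands_at=2)   # simulated fault: lands on LBA 2

for lba in (1, 2):
    print(f"LBA {lba}: inline ok={read_inline(lba)} parent ok={read_parent(lba)}")
```

Both blocks pass the inline check because each is internally consistent; only the parent-held checksums notice that LBA 1 is stale and LBA 2 holds someone else's data.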

Scrubbing

A background process reads every block on the file system periodically (weekly or monthly), verifies its checksum, and repairs any errors from redundancy.

Why bother? Bit rot is silent. A cold archive can sit for years without being read. The bit that flips today won’t be noticed until you read the file in 2031 — and by then, the redundant copy may have rotted too. Periodic scrubbing catches errors before correlated failures destroy all copies.

ZFS’s zpool scrub and btrfs’s btrfs scrub start are the canonical commands; mdadm has echo check > /sys/block/mdX/md/sync_action for RAID arrays without integrated checksums.
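What a scrub does can be sketched in a few lines. The example assumes a two-way mirror modelled as Python dictionaries and a separate table of trusted checksums; `scrub` and the mirror names are hypothetical, and real scrubbers also throttle I/O and walk on-disk metadata rather than a flat table.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def scrub(copies: List[Dict[int, bytes]], checksums: Dict[int, bytes]) -> Dict[str, int]:
    """Verify every block on every copy and repair bad copies from a good one."""
    stats = {"checked": 0, "repaired": 0, "unrecoverable": 0}
    for block_id, expected in checksums.items():
        stats["checked"] += 1
        # Find any copy whose contents still match the trusted checksum.
        good = next((c[block_id] for c in copies if digest(c[block_id]) == expected), None)
        if good is None:
            stats["unrecoverable"] += 1        # every copy has rotted
            continue
        for copy in copies:
            if digest(copy[block_id]) != expected:
                copy[block_id] = good          # self-heal from the verified copy
                stats["repaired"] += 1
    return stats


mirror_a = {0: b"alpha", 1: b"bravo", 2: b"charlie"}
mirror_b = dict(mirror_a)
checksums = {block_id: digest(data) for block_id, data in mirror_a.items()}

mirror_a[1] = b"brav0"                         # rot on one side: repairable
mirror_a[2], mirror_b[2] = b"x", b"y"          # rot on both sides: too late

result = scrub([mirror_a, mirror_b], checksums)
print(result)
```

Block 2 illustrates the argument for scrubbing on a schedule: once both copies have rotted, the checksum can still report the damage but nothing is left to repair from.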

End-to-end checksums

The strongest pattern: compute the checksum at the application (or at least at the highest layer that has the raw data), pass it through the entire stack, verify at every layer. Standards:

T10 PI (Protection Information): an 8-byte trailer on every 512-byte sector containing a 16-bit guard CRC, an application tag, and a reference tag derived from the LBA. When the host generates it, the trailer travels with the data through HBAs and switches to the drive, which verifies it before committing the sector to media. Catches in-flight corruption.
Object stores and content-hash addressing: S3 object checksums, IPFS, content-addressable backups. Where the hash is the identifier, any read that rehashes the content detects corruption without extra metadata.
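The protection-information trailer described above can be sketched as follows. This assumes the common layout of a 2-byte guard CRC, a 2-byte application tag, and a 4-byte reference tag holding the low 32 bits of the LBA, all big-endian; `protect` and `verify` are hypothetical helpers, and the bitwise CRC is written for clarity, not speed.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 using polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append an 8-byte tuple: guard (2), application tag (2), reference tag (4)."""
    if len(sector) != 512:
        raise ValueError("expected a 512-byte sector")
    trailer = struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)
    return sector + trailer


def verify(protected: bytes, expected_lba: int) -> bool:
    """Check both the guard CRC and that the sector belongs at expected_lba."""
    sector, trailer = protected[:512], protected[512:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", trailer)
    return guard == crc16_t10dif(sector) and ref_tag == (expected_lba & 0xFFFFFFFF)


payload = bytes(range(256)) * 2
on_wire = protect(payload, lba=7)
flipped = bytes([on_wire[0] ^ 0x01]) + on_wire[1:]   # single bit flipped in transit

print("intact, right LBA :", verify(on_wire, 7))
print("intact, wrong LBA :", verify(on_wire, 8))
print("bit flip in flight:", verify(flipped, 7))
```

Every hop that understands the trailer can run the same check, which is what lets corruption be pinned to a specific link rather than discovered later on read.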

Misdirected-write protection

The file system stores not just (data, checksum) but (data, checksum, expected location). On read, the location is verified: if the read returns block X but the stored “expected location” says Y, the data is rejected. ZFS does this through its 128-byte block pointers, which live in the parent and hold both the child block’s address and a 256-bit checksum of its contents.
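One way to realise the (data, checksum, expected location) triple is to stamp each block with its own address and keep the pointer in the parent. The sketch below is a hypothetical design for illustration, not ZFS's on-disk format; `BlockPointer`, `seal`, `open_block`, and `IntegrityError` are invented names.

```python
import hashlib
import struct
from dataclasses import dataclass
from typing import Tuple


class IntegrityError(Exception):
    """Raised when a block fails location or checksum verification."""


@dataclass(frozen=True)
class BlockPointer:
    lba: int          # where the parent expects the child block to live
    checksum: bytes   # hash over the location stamp plus the payload


def seal(data: bytes, lba: int) -> Tuple[bytes, BlockPointer]:
    """Prefix the payload with its LBA and return the block plus its pointer."""
    stamped = struct.pack("<Q", lba) + data
    return stamped, BlockPointer(lba, hashlib.sha256(stamped).digest())


def open_block(stamped: bytes, ptr: BlockPointer) -> bytes:
    """Return the payload only if location stamp and checksum both match."""
    (found_lba,) = struct.unpack_from("<Q", stamped)
    if found_lba != ptr.lba:
        raise IntegrityError(f"block stamped for LBA {found_lba}, expected {ptr.lba}")
    if hashlib.sha256(stamped).digest() != ptr.checksum:
        raise IntegrityError("checksum mismatch")
    return stamped[8:]


block_a, ptr_a = seal(b"payload A", lba=10)
block_b, ptr_b = seal(b"payload B", lba=20)

print(open_block(block_a, ptr_a))
try:
    open_block(block_b, ptr_a)       # the device returned the wrong block
except IntegrityError as err:
    print("rejected:", err)
```

The explicit location check is redundant with the checksum here, since the stamp is covered by the hash, but it turns a generic mismatch into a specific diagnosis.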

Variants

Checksum algorithm choice

CRC32C: fast, hardware-accelerated on Intel and ARM, catches all common error patterns. Default for ext4 metadata and for btrfs; also used by iSCSI. (T10 PI uses a separate 16-bit CRC.)
Fletcher4: ZFS default, fast on CPU, slightly weaker than CRC32C.
xxhash: fast non-cryptographic, used in newer storage stacks (selectable in btrfs since Linux 5.5, where crc32c remains the default).
SHA-256, BLAKE3: cryptographic, used where security matters (content-addressable storage, deduplication, immutable backups).

CPU overhead with hardware-accelerated CRC32C is ~1% on a typical workload; not the bottleneck.
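To make the families concrete, the sketch below computes three of them over the same bytes. The `fletcher4` function is a plain-Python rendering of the four-running-sums idea over 32-bit little-endian words and is an assumption about the general scheme, not a byte-for-byte copy of any implementation. The Python standard library has no CRC32C, so `zlib.crc32` (plain CRC-32) stands in for it.

```python
import hashlib
import struct
import zlib
from typing import Tuple


def fletcher4(data: bytes) -> Tuple[int, int, int, int]:
    """Four cascaded 64-bit running sums over 32-bit little-endian words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d


block = struct.pack("<2I", 1, 2)
swapped = struct.pack("<2I", 2, 1)   # same words, different order

print("fletcher4(block)  :", fletcher4(block))
print("fletcher4(swapped):", fletcher4(swapped))
print("crc32             :", hex(zlib.crc32(block)))
print("sha256 (prefix)   :", hashlib.sha256(block).hexdigest()[:16])
```

The first sum is identical for both orderings; the cascaded sums are what make the checksum sensitive to word position, which a plain additive checksum is not.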

Per-block vs per-file

Per-block is the dominant approach — granular repair, easy to integrate with RAID. Per-file (file-level Merkle tree) is what content-addressable systems use; useful for deduplication and signing but harder to repair partial corruption.
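The trade-off shows up in a small Merkle-tree sketch. It assumes SHA-256 and the convention of duplicating the last node on odd levels, which is one common choice among several; `merkle_root` and `bad_blocks` are hypothetical helpers.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Fold per-block hashes pairwise until a single file-level root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])            # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def bad_blocks(blocks: List[bytes], leaf_hashes: List[bytes]) -> List[int]:
    """Per-block hashes pinpoint which blocks need repair."""
    return [i for i, (blk, want) in enumerate(zip(blocks, leaf_hashes)) if h(blk) != want]


blocks = [b"block-0", b"block-1", b"block-2"]
root = merkle_root(blocks)
leaf_hashes = [h(block) for block in blocks]

damaged = [b"block-0", b"block-X", b"block-2"]
print("root changed:", merkle_root(damaged) != root)
print("bad blocks  :", bad_blocks(damaged, leaf_hashes))
```

The root alone says only that the file changed somewhere; locating and repairing the damaged block requires keeping the leaf (or interior) hashes as well, which is the per-block approach in another form.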

Erasure coding integration

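Reed-Solomon K + M codes supply the redundancy: any K of the K + M shards are enough to rebuild the data. The code can in principle locate a corrupted shard from its syndrome, but object stores (S3, Ceph) generally pair each shard with its own checksum, so a bad shard is detected on read, treated as a known erasure, and reconstructed on the fly. The cost: reconstruction needs K good shards, so a single-block read that hits a bad or missing shard pays a fan-out tax.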
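The division of labour between checksum and code can be illustrated with the simplest erasure code. The sketch below deliberately swaps Reed-Solomon for single XOR parity (K + 1), which tolerates one lost shard, and uses `zlib.crc32` as the per-shard checksum; `encode` and `read_all` are hypothetical names, and shards are assumed to be of equal length.

```python
import zlib
from functools import reduce
from typing import List, Tuple


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(shards: List[bytes]) -> List[Tuple[bytes, int]]:
    """K equal-length data shards -> K + 1 stored shards, each with its CRC."""
    parity = reduce(xor, shards)
    return [(shard, zlib.crc32(shard)) for shard in shards + [parity]]


def read_all(stored: List[Tuple[bytes, int]]) -> List[bytes]:
    """Return the K data shards, rebuilding at most one that fails its CRC."""
    bad = [i for i, (shard, crc) in enumerate(stored) if zlib.crc32(shard) != crc]
    if len(bad) > 1:
        raise ValueError("more corrupt shards than single parity can repair")
    shards = [shard for shard, _ in stored]
    if bad:
        survivors = [s for j, s in enumerate(shards) if j != bad[0]]
        shards[bad[0]] = reduce(xor, survivors)   # XOR of the rest recovers the lost shard
    return shards[:-1]                            # drop the parity shard


data = [b"aaaa", b"bbbb", b"cccc"]
stored = encode(data)
stored[1] = (b"bXbb", stored[1][1])               # silent corruption of one data shard

recovered = read_all(stored)
print(recovered)
```

The checksum answers "which shard is wrong", and the parity answers "what should it contain"; Reed-Solomon generalises the second half to M simultaneous losses.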

Trade-offs

No file-system checksums (rely on the drive’s own ECC): fastest, simplest. Trusts the drive. Adequate for ephemeral data. Cost: silent corruption is undetectable and propagates into backups.

Integrated checksums + scrubbing (ZFS, btrfs) — catches everything from cosmic rays to firmware bugs. Self-healing on read if redundancy exists. Cost: 1-5% CPU overhead, extra IOs for scrubbing, larger metadata, more complex file system.

Where to checksum. Application-level catches the most (everything below is in scope) but requires application changes. File-system level catches everything below the page cache. Drive-level (ECC + T10 PI) catches in-transit and on-media but not memory corruption.
Scrub schedule. Too often: wastes IO bandwidth and wears SSDs. Too rare: silent corruption goes undetected long enough that all redundant copies rot. Weekly or monthly is typical.
Throughput vs latency on errors. A read that hits a corrupted block must wait for repair from redundancy — 5-50 ms extra. Storage systems usually log and continue rather than failing the read, which is the right call for availability.
Tail-latency cost of in-line verification. Hot data with checksums verified on every read pays a small CPU tax. Caching the verification result (verified at most once per cache lifetime) is the optimisation.

Common pitfalls

Trusting “successfully written” as durable. A drive can return “write OK” from its volatile cache, then lose power before the data hits NAND. Checksums + redundancy + power-loss protection are all required to actually be durable.
Single-copy data on bit-rot-sensitive media. A single SSD, even a good one, will silently flip bits over years. Backups on bit-rot-prone media without scrubbing eventually rot, and you discover it when you try to restore.
Relying on plain RAID scrubs for integrity. Plain RAID can detect that mirrors or parity disagree, but it can’t tell which side is correct; without checksums, “self-healing” just means it picks one and trusts it.
Checksum stored next to data. Inline checksums catch flips but miss mis-directed writes. Always store checksums in the parent metadata structure.
Not monitoring iostat, SMART, or zpool status. Drives often emit early-warning signs (slow reads, increasing reallocated sectors, CRC error counts) hours or days before silent corruption begins. SMART exposes this; many operators never look.
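Disabling ECC RAM “for cost”. ECC corrects bit-flips in DRAM before they reach the drive. Non-ECC systems with checksummed file systems still have windows where corruption can sneak in: a bit that flips in memory before the checksum is computed is checksummed and written as if it were valid, and one that flips after a read has been verified reaches the application unnoticed. Use ECC RAM on anything you care about.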

Why doesn't ext4 do data checksums by default?

ext4 supports metadata checksums (the journal, inodes, directory blocks) but not data checksums; the design choice was that data integrity is the application’s or the storage layer’s job. Adding data checksums to an update-in-place file system is expensive: every data block needs a checksum stored somewhere, and overwriting that block requires updating both the data and its checksum atomically. ZFS and btrfs can do it cheaply because their COW design already rewrites parent metadata on every write. ext4 would have to either embed the checksum in the data block (bad, see mis-direction) or pay a write to the parent on every data write. The team picked simplicity; integrity-conscious users layer ext4 on top of dm-integrity (standalone, under LUKS, or under LVM RAID) or move to a COW file system.

RAID — Striping, Mirroring, Parity — the redundancy layer that checksums combine with for self-healing.
Hard Disk Drives — the substrate that produces latent sector errors.
Flash SSDs and the Flash Translation Layer — different failure modes; same need for checksums above the device.
File System Implementation — where checksums live in the layout.
Crash Consistency — fsck and Journaling — the orthogonal defence against torn writes.