Log-Structured File System (LFS)

Sequential writes, segments, inode maps, garbage collection — the design that influenced flash file systems and modern databases.

System Advanced
8 min read
lfs log-structured sequential-write garbage-collection file-system
Companies this resembles: UC Berkeley

What it is#

The Log-Structured File System (LFS), introduced by Rosenblum and Ousterhout at UC Berkeley in 1991, took FFS’s locality idea to its extreme: never write in place; always append to the end of a single ever-growing log. The file system is the log. Every modification — data, inode updates, directory changes — lands at the current write pointer; old versions are left behind until a garbage collector eventually reclaims their space.

LFS was motivated by two observations: (1) write performance on HDDs was bounded by seek + rotation, but a workload that always wrote sequentially could approach raw media bandwidth; (2) RAM caches were growing fast enough that reads would increasingly hit the cache, so reads-from-disk were becoming less common — making it acceptable to trade some read locality for huge write throughput.

LFS itself didn’t become a mainstream HDD file system, but its ideas are everywhere: flash SSD firmware (the FTL is an LFS), modern key-value stores (RocksDB, LevelDB, Cassandra’s SSTables), log-structured merge trees in databases, and the Btrfs / ZFS / APFS copy-on-write designs.

Architecture overview#

LFS divides the disk into large fixed-size segments (typically 0.5-1 MB). Writes accumulate in an in-memory segment buffer and are flushed sequentially when the buffer fills:

disk: | segment 0 | segment 1 | segment 2 | ... | free | free | ... |
^
write pointer (next to be written)

A segment can contain anything — data blocks for many different files, inodes, directory blocks, even fragments of inode-mapping metadata. The point is that within a segment, all blocks are written together, sequentially, and segments themselves are written in physical order.

The hard problem LFS solves: how do you find an inode that lives wherever the log happened to be when it was last written? The answer is the inode map (imap):

  • Inodes are no longer at fixed locations.
  • An imap entry maps inode_number → disk_address_of_latest_version.
  • The imap itself is also written to the log in chunks.
  • A small fixed-location checkpoint region stores pointers to the latest imap chunks.

To look up inode N: read the checkpoint, find the imap chunk that covers N, find the inode’s current disk address, read it. With caching, this is one lookup; without caching, it’s a few extra reads compared to FFS.

Layout and allocation policy#

LFS has no allocation policy in the FFS sense — there are no cylinder groups to pick, no locality to maintain. The write pointer advances; the next free segment receives the next batch of writes.

What gets buffered before a segment flush:

  • All dirty data blocks.
  • All dirty inodes (each pointing at the new disk addresses of its data).
  • New imap chunks reflecting the moved inodes.
  • Directory updates.

Everything is written in one big sequential I/O. On HDDs of the era, a 1 MB sequential write was an order of magnitude faster than a million scattered random writes — even if the random writes touched the same total amount of data.

The crucial dependency: write order#

Within a segment, blocks are ordered so that on a crash, recovery can scan forward and rebuild the imap from the log itself:

[segment header] [data block 1] [data block 2] [...] [inode] [imap chunk]

The inode’s block pointers reference disk addresses within this segment (now known, because we’re filling the segment in order). The imap chunk references the inode’s address. The segment header records the segment’s place in log order.

Read and write paths#

Writing a 4 KB block to file F at offset 100 KB:

  1. Compute which data block of F is being modified.
  2. Allocate a new disk address for the block from the in-memory segment buffer.
  3. Update F’s inode in-memory: the block pointer now points at the new address.
  4. Mark F’s inode dirty (its new version will go into the same segment).
  5. When the segment fills, flush sequentially. The new data block, the new inode, and a new imap chunk all land in one I/O.

Reading the same block back:

  1. Look up file’s inode number from the directory entry.
  2. Read the imap entry for that inode number (often cached).
  3. Read the inode at the address from the imap.
  4. Follow the block pointer to the data block address.
  5. Read the data block.

The extra indirection (imap) is the cost of not pinning inodes to fixed locations. With a hot inode cache, the cost vanishes.

Crash recovery#

LFS recovery is simple because the log is contiguous and self-describing:

  1. Start from the last checkpoint region (one of two; LFS alternates between them for atomic commit).
  2. The checkpoint records “log was complete up to segment S, imap pointed at X, Y, Z.”
  3. Scan forward from segment S, replaying any complete segments that appear after the checkpoint.
  4. Each segment’s header lets recovery validate that it’s a real segment and not garbage.
  5. Stop at the first incomplete segment.

Recovery is bounded by the amount of unflushed work since the last checkpoint, not by file system size. Checkpoints happen every ~30 seconds or so; worst-case recovery walks ~30 seconds of log.

Compare to FFS + fsck: hours on a multi-TB file system. LFS’s recovery design is one of its lasting contributions; modern journaled file systems and database WALs use the same pattern.

Operational characteristics#

  • Write throughput approaches raw media bandwidth on workloads dominated by writes. Benchmark numbers from the original LFS paper showed 4-10x improvements over Sprite FFS.
  • Read locality is unpredictable. A file written across many segments (because each modification happened with other unrelated work) ends up scattered. Reads of cold data hit the disk many times.
  • Checkpoint overhead is modest — a few writes per checkpoint interval to commit the new imap and roll the head.
  • Memory footprint is higher than FFS because of the in-memory segment buffer and the active portion of the imap. On 1990s hardware, this was a real constraint.

Trade-offs and gotchas#

LFS — write at media bandwidth, simple crash recovery, natural snapshots (old log versions persist until GC’d), perfect fit for write-once media (flash). Cost: garbage collection, fragmentation on reads, more complex data structures (imap, segment summary blocks).
FFS — predictable layout, fast reads, simple on-disk structures. Cost: in-place updates make crashes painful (motivated journaling), random small writes fight disk geometry.

Garbage collection — the central challenge#

A segment that was once full of valid data ages: files get rewritten elsewhere in the log, deletions invalidate blocks. After enough time, segments become a patchwork of valid and stale blocks. LFS must reclaim space:

  1. Pick a victim segment (heuristics: most stale blocks, oldest, coolest).
  2. Scan it; identify still-live blocks using the imap.
  3. Copy live blocks to the current write pointer.
  4. Mark the victim segment free.

This is cleaning, and it’s expensive — every byte cleaned is a byte read and a byte rewritten. The “write amplification” from cleaning can dwarf the user’s actual write workload on a nearly-full file system. The original LFS paper noted that cleaning policy choice is the single biggest determinant of real-world performance.

Cleaning policies tried over the years:

  • Greedy — pick segment with most free space. Works well on stationary workloads, badly on workloads with hot and cold data mixed.
  • Cost-benefit — weight free space against segment age. Avoids cleaning cold segments that are mostly full.
  • Hot/cold separation — write hot data to one log, cold data to another. The cold log barely needs cleaning.

The flash FTL uses exactly these algorithms in firmware today. See Flash SSDs and the Flash Translation Layer.

Why LFS didn’t take over HDDs#

Several reasons, summarised:

  • Reads matter. RAM caches help but never reach 100%. LFS’s read penalty was real on workloads with cold reads of small files.
  • Cleaning is hard to tune. A poorly-tuned cleaner can grind a busy file system to a halt. FFS + journaling was an “incremental improvement” path with familiar operational characteristics.
  • Sun’s UFS won. Sun Microsystems backed FFS-derived UFS as Solaris’s default; the industry standardised around that lineage.
  • The win was bigger on flash. When SSDs arrived, the LFS pattern became mandatory because of erase-before-write. The FTL is essentially LFS, hidden in drive firmware.
What modern systems are LFS-shaped?

Almost every system that does heavy writes on flash. RocksDB and LevelDB use log-structured merge trees — sorted runs that are merged together, written sequentially. Cassandra’s SSTable layer is similar. Kafka’s log segments are explicitly log-structured. The Linux F2FS file system is LFS for flash. ZNS NVMe drives expose the log structure directly to the host. The FTL inside any SSD. Even modern PostgreSQL is moving toward more log-structured storage with HEAP-only-tuple optimisations. The pattern is everywhere — just rarely named “LFS.”

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.