File System Implementation — Operating Systems

What it is#

A file system implementation is the on-disk layout and the code that maintains it. It answers: “Where on disk is inode N? Which blocks belong to it? Which blocks are free? How do you find a name inside a directory?” The classic UNIX answer — inodes plus indirect block pointers plus a bitmap — is still the bones of every modern file system from ext4 to APFS, even when they layer journaling, copy-on-write, or extent trees on top.

This page is about the implementation layer that sits between the Files and Directories abstraction (what user code sees) and the block device (what the disk presents). It’s the layer where I/O cost is determined.

When to use it#

You don’t choose to have a file system implementation — every OS ships one. The interesting choices are between implementations:

ext4 — Linux default. Mature, fast, journaled, extent-based. Good for everything from laptops to multi-PB servers.
XFS — high-throughput parallel writes, allocation groups (similar to FFS cylinder groups), no shrink. Default on RHEL, fits big-data workloads.
Btrfs / ZFS — copy-on-write, integrated RAID, snapshots, checksums. Worth the operational complexity for storage servers; less ideal for laptops.
APFS — Apple’s COW file system. Snapshots, encryption, clones, optimised for SSD.
NTFS — Windows default. B-tree directories, journaled, extensible attributes.
tmpfs — RAM-backed. Volatile, fast, no on-disk format.

How it works#

A simple layout#

Pretend we have a 64-block disk. A minimalist file system divides it into regions:

| S | i-bmap | d-bmap | inode table (5 blks) | data blocks (56 blks) |
  0     1        2          3-7                    8-63

Superblock (S) — describes the file system: total size, where each region is, magic number, root inode.
Inode bitmap (i-bmap) — one bit per inode; 1 = allocated, 0 = free.
Data bitmap (d-bmap) — one bit per data block.
Inode table — fixed-size array of inodes. With 256-byte inodes and a 4 KB block, each block holds 16 inodes.
Data blocks — the file contents.

Reading a file from the root: open("/foo/bar") walks (root inode) → (root dir block) → "foo" → (foo inode) → (foo dir block) → "bar" → (bar inode). That’s potentially 4-8 disk I/Os just to resolve the path before reading any bytes. Caching is everything; see below.

Inodes and block pointers#

A classical UNIX inode contains:

12 direct block pointers — 12 * 4 KB = 48 KB directly addressable.
1 indirect pointer — points at a block of 1024 pointers (4-byte each on 32-bit). 1024 * 4 KB = 4 MB.
1 double-indirect pointer — 1024 * 1024 * 4 KB = 4 GB.
1 triple-indirect pointer — 1024^3 * 4 KB = 4 TB.

Small files cost no indirection — the direct pointers are inline. Large files pay one extra read for every ~4 MB. Modern file systems (ext4, XFS) replace this scheme with extents — (start_block, length) records — which describe contiguous runs in much less metadata.

Free-space tracking#

Bitmaps — one bit per block, simple, dense. Cost: scan to find free runs.
Free lists — chain of free blocks. Easy to allocate one block; bad for finding contiguous runs.
Extents trees — like the data-block extents but for free space. XFS uses B+ trees of free extents indexed by both size and location.

Modern file systems typically use bitmaps for inodes (small) and extent trees for data (large).

Directories as files#

A directory is a file whose contents are (name, inode) records. Looking up a name in a directory is just sequential scan in the simplest case. Real implementations:

Linear — historical UNIX. Fine for small directories; O(n) on lookup.
Hashed (ext3 htree, ext4) — hash table on top of the linear records; O(1) lookups, still readable as a linear file for backward compatibility.
B+ tree (XFS, NTFS, APFS) — sorted by name. O(log n) lookups, naturally supports range scans (ls sorted).

Read and write paths#

Reading a single byte at offset O of a file:

Resolve the path to an inode (often free from dcache).
Compute block_index = O / BLOCK_SIZE and offset_in_block = O % BLOCK_SIZE.
Walk the inode’s pointer tree (or extent map) to find the disk block.
Read the block (free if it’s in the page cache).
Return the byte at offset_in_block.

Writing a byte is similar but with allocation: if the target block doesn’t exist yet, the file system finds a free block in the data bitmap, marks it allocated, updates the inode, and writes the data. Each of these touches a different region of disk and is the source of the “consistency between metadata structures” problem solved by journaling.

Caching#

The page cache holds blocks of files in RAM keyed by (inode, offset). The dcache holds path-component → inode lookups. The inode cache holds recently-used inodes. These caches are the reason real workloads don’t perform like the worst-case I/O math: a hot directory walk costs zero disk I/O once it’s cached.

Variants#

Block sizes#

Small (1 KB) — minimises internal fragmentation on directories full of tiny files. Wastes metadata.
Large (64 KB+) — better sequential throughput, fewer pointers per large file. Wastes space on small files.

Most general-purpose systems pick 4 KB — matches CPU page size, balances waste and throughput. Object stores and analytical workloads use larger blocks (64 KB-1 MB).

Extent-based vs block-list#

Old ext2/ext3 used per-block pointers; ext4 and XFS use extents. A 100 MB file allocated contiguously needs a single extent (one pointer pair) instead of 25,000 individual block pointers.

Copy-on-write file systems#

ZFS, btrfs, APFS never overwrite live data. Each modification allocates new blocks, including writing new parent pointers up to a single root. The root pointer write commits atomically; the old version remains until garbage-collected. Crash recovery is just “use the last committed root” — no journal needed.

Log-structured#

Take COW to its extreme: every write goes to the end of a single log; in-place updates never happen. Reads consult an in-memory map to find the latest version. Covered in Log-Structured File System (LFS).

Trade-offs#

Update-in-place + journal (ext4, XFS, NTFS) — writes go to their final location; a journal protects metadata consistency. Compact on disk, cheap reads, harder to do snapshots. Dominant choice for most workloads.

Copy-on-write (ZFS, btrfs, APFS) — never overwrite live data; new versions land in fresh blocks. Atomic root swap commits. Cheap snapshots, instant clones, built-in checksums. Cost: write amplification, fragmentation over time, in-memory GC.

Inline data. Tiny files (<= 60 bytes) can fit inside the inode itself, skipping data-block allocation entirely. ext4 and XFS both support this.
Allocation strategy. Pick a free block close to other blocks of the same file (locality) and close to the inode (FFS cylinder group idea). On SSDs, locality matters less for performance but still helps GC and TRIM behaviour.
Block group / cylinder group. Modern HDDs no longer expose cylinders, but the abstraction of “a group of nearby blocks containing both inodes and data” is preserved — ext4 calls them block groups, XFS calls them allocation groups.
Directory hashing collisions. Hashed directory implementations have to handle collisions gracefully — ext4 falls back to scanning the bucket linearly. With a well-chosen hash, average lookups stay O(1).

Common pitfalls#

Inode exhaustion. A file system with millions of tiny files can hit the inode limit before running out of disk space. df -i shows this. Reformatting with a higher inode-density (mkfs.ext4 -i 4096) is the fix.
Internal fragmentation in directories. Many file systems don’t shrink directory files when entries are removed — a directory that briefly held a million files will keep its allocated size. Symptom: an “empty” directory takes seconds to list.
Block-size mismatch with the workload. A 64 KB-block file system storing millions of 1 KB files wastes 63 KB per file. The OS-level “block” doesn’t equal the user-level “file size”.
Forgetting that the dcache is finite. A walk over a billion-file tree can blow out the dcache and slow subsequent operations to disk speed. find / on a cold cache is O(disk reads).
Assuming directory listings are sorted. They aren’t, by default — readdir returns entries in implementation order. ls sorts as a courtesy. Scripts that depend on order should sort explicitly.

Why don't modern file systems just use a B+ tree for everything?

XFS does, essentially — extents, free space, directories, all B+ trees. The cost is constant: every update potentially rewrites a parent node, which means small writes get amplified. ext4 keeps direct/indirect for backward compatibility with ext2/ext3 and uses extents (a much lighter abstraction than a full B+ tree) for new data. The pragmatic answer is “the right structure for each metadata type, balanced against complexity of the code that has to maintain them all.”

Files and Directories — the abstraction this implementation supports.
The Fast File System (FFS) — the canonical disk-aware implementation.
Crash Consistency — fsck and Journaling — how this layout survives crashes.
Log-Structured File System (LFS) — the alternative where in-place updates don’t exist.
Hard Disk Drives — the substrate whose costs drove the design.