Flash SSDs and the Flash Translation Layer — Operating Systems

What it is#

A solid-state drive (SSD) stores data in NAND flash cells, organised into pages (typically 4-16 KB) inside blocks (typically 128-512 pages). NAND has three physical constraints that shape everything: reads and writes happen at page granularity, erases happen only at block granularity, and a page cannot be overwritten in place; it must be erased first.

Worse: each flash cell wears out. After ~3,000 program/erase (P/E) cycles on consumer TLC NAND (and ~100,000 on SLC, ~10,000 on enterprise MLC), the cell loses its ability to reliably hold charge. Drives compensate via the Flash Translation Layer (FTL) — firmware that hides the erase-before-write constraint, levels wear across cells, and presents the OS with a standard block device interface.

When to use it#

SSDs are the default storage in 2026 for almost everything except cold tiers and high-capacity archives:

OS boot drives, application binaries — fast random reads (~10-100 µs) make cold start dramatically faster.
Databases, OLTP workloads — random 4 KB IOPS at 100k-1M per drive vs ~100 for HDDs.
Build and CI workloads — random small reads of millions of source files. The dcache helps, but cold builds benefit enormously.
Hot tiers in tiered storage — recent data on SSD, older data tiered to HDD or object storage.

HDDs survive only where capacity per dollar matters more than latency (see Hard Disk Drives).

How it works#

Flash hierarchy#

SSD
 └── 8-16 channels (parallel buses to flash chips)
       └── 1-8 chips per channel
             └── 1-4 dies per chip
                   └── 2-4 planes per die
                         └── thousands of blocks per plane
                               └── 128-512 pages per block
                                     └── 4-16 KB per page

The number of parallel channels times planes per channel sets the IOPS ceiling — a modern enterprise NVMe SSD has 8-16 channels and exposes 1M+ random read IOPS by issuing reads concurrently across them.
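As a rough illustration of how the hierarchy multiplies out, the sketch below computes the number of planes that could work in parallel, the raw capacity, and an idealised read IOPS ceiling. The `FlashGeometry` class and the specific geometry are hypothetical values picked from the ranges above, and the ceiling ignores channel bandwidth and controller limits, so real drives land well below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashGeometry:
    """Hypothetical drive geometry; every field is an illustrative assumption."""
    channels: int
    chips_per_channel: int
    dies_per_chip: int
    planes_per_die: int
    blocks_per_plane: int
    pages_per_block: int
    page_bytes: int

    def parallel_units(self) -> int:
        # Planes that could, in principle, service an operation at the same time.
        return (self.channels * self.chips_per_channel
                * self.dies_per_chip * self.planes_per_die)

    def raw_bytes(self) -> int:
        # Total NAND capacity before over-provisioning is taken out.
        return (self.parallel_units() * self.blocks_per_plane
                * self.pages_per_block * self.page_bytes)

    def read_iops_ceiling(self, page_read_us: float) -> float:
        # Idealised: every plane reads back-to-back with no bus contention.
        return self.parallel_units() * 1_000_000 / page_read_us


geo = FlashGeometry(channels=8, chips_per_channel=4, dies_per_chip=2,
                    planes_per_die=2, blocks_per_plane=2048,
                    pages_per_block=256, page_bytes=16 * 1024)

print(geo.parallel_units())          # 128 planes
print(geo.raw_bytes() == 2 ** 40)    # True: exactly 1 TiB of raw NAND
print(geo.read_iops_ceiling(50.0))   # 2560000.0 at 50 µs per page read
```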

The erase-before-write asymmetry#

Read a page — fast, 25-100 µs, non-destructive.
Program a page (write to an erased page) — 200-1000 µs.
Erase a block — slow, 2-5 ms, and “wears” the block by one P/E cycle.

Because erase is block-granular and slow, the FTL never updates in place. A write to logical block address (LBA) X goes to whatever free physical page is currently being written; the FTL’s mapping table is updated so that LBA X now points at that new physical page. The old physical page is marked invalid but stays where it is until the block it lives in is eventually erased.

This is exactly the log-structured design from Log-Structured File System (LFS), implemented in drive firmware.
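The out-of-place update described above can be sketched as a toy page-mapped FTL. `TinyFTL` and its fields are hypothetical names for illustration only; real firmware also handles garbage collection, power-loss recovery, and parallelism, none of which is modelled here.

```python
class TinyFTL:
    """Toy page-mapped FTL: append-only writes, no garbage collection."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.next_free = 0      # next erased physical page (the log head)
        self.l2p = {}           # logical page number -> physical page number
        self.invalid = set()    # physical pages that hold stale data
        self.flash = {}         # physical page number -> stored bytes

    def write(self, lba, data):
        if self.next_free >= self.total_pages:
            raise RuntimeError("no erased pages left; a real FTL would run GC")
        ppn = self.next_free
        self.next_free += 1
        self.flash[ppn] = data            # program a fresh, erased page
        old = self.l2p.get(lba)
        if old is not None:
            self.invalid.add(old)         # old copy stays put, marked stale
        self.l2p[lba] = ppn               # remap the LBA to the new page
        return ppn

    def read(self, lba):
        return self.flash[self.l2p[lba]]


ftl = TinyFTL(total_pages=4)
ftl.write(5, b"v1")       # lands in physical page 0
ftl.write(5, b"v2")       # lands in physical page 1; page 0 becomes invalid
print(ftl.read(5))        # b'v2'
print(ftl.invalid)        # {0}
```

Note that the stale `b"v1"` is still physically present in page 0 after the second write; only the mapping changed.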

The FTL mapping table#

The FTL holds a logical → physical page map. With one 4-byte entry per 4 KB page, a 1 TB SSD needs roughly 250 million entries, or about 1 GB of mapping table, which is too big to cache fully in DRAM on cheaper drives. Drives compromise:

Page-level mapping with a smaller DRAM cache (recent translations) and the rest backed by flash itself.
Block-level mapping — coarser, smaller table, but worse on small random writes.
Hybrid mapping — most LBAs at block granularity, recently-written “log blocks” at page granularity.

Modern enterprise drives use page-level mapping with 1 GB of onboard DRAM per 1 TB of NAND, which is why drives without DRAM (cheap consumer drives) have noticeably worse random-write performance.
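The table-size arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch: `mapping_table_bytes` is a hypothetical helper, the 4-byte entry and 4 KB page come from the text above, and the 256-page block used for the block-level case is an assumed value from the typical range.

```python
def mapping_table_bytes(capacity_bytes, mapping_unit_bytes=4096, entry_bytes=4):
    """Size of a flat logical-to-physical table with one entry per mapping unit."""
    entries = capacity_bytes // mapping_unit_bytes
    return entries * entry_bytes


TIB = 1 << 40
MIB = 1 << 20
GIB = 1 << 30

page_level = mapping_table_bytes(TIB)                               # one entry per 4 KB page
block_level = mapping_table_bytes(TIB, mapping_unit_bytes=4096 * 256)  # one entry per block

print(page_level / GIB)    # 1.0 (GiB)
print(block_level / MIB)   # 4.0 (MiB)
```

The 256x difference in table size is what makes block-level and hybrid schemes attractive when DRAM is scarce, at the cost of more copying on small random writes.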

Garbage collection#

When most blocks contain a mix of valid and invalid pages and few clean blocks remain, the FTL must reclaim space:

Pick a victim block (heuristics: most invalid pages, oldest, coolest).
Copy its valid pages to a fresh block.
Erase the victim.

This valid-page copying is the source of write amplification (WA): a user write of one page can trigger several physical page writes during GC. WA is typically 2-3x on enterprise workloads and can spike to 10x under sustained random writes on a nearly full drive. The headline endurance number assumes a particular workload's WA, so a sequential-write workload wears the drive less than a random-write one.
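The three GC steps and their cost can be sketched with per-block page counts. This is a minimal model assuming the greedy "most invalid pages" heuristic; the function names and the block representation are hypothetical, and the destination of the copied pages is not tracked.

```python
def pick_victim(blocks):
    """Greedy policy: the block with the most invalid pages is cheapest to reclaim."""
    return max(range(len(blocks)), key=lambda i: blocks[i]["invalid"])


def collect(blocks):
    """Reclaim one block. Returns (victim index, valid pages that had to be copied)."""
    victim = pick_victim(blocks)
    copied = blocks[victim]["valid"]               # these pages are rewritten elsewhere
    blocks[victim] = {"valid": 0, "invalid": 0}    # the victim is now erased
    return victim, copied


def write_amplification(host_pages, gc_copied_pages):
    """Physical pages programmed per page the host asked to write."""
    return (host_pages + gc_copied_pages) / host_pages


blocks = [
    {"valid": 200, "invalid": 56},
    {"valid": 64, "invalid": 192},
    {"valid": 128, "invalid": 128},
]
victim, copied = collect(blocks)
print(victim, copied)                       # 1 64
# Reclaiming block 1 frees room for 192 new host pages at the cost of 64 copies.
print(write_amplification(192, copied))     # about 1.33
```

If the best available victim were 90% valid instead of 25% valid, the same calculation would give a WA of 10, which matches the spike described for nearly full drives.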

Wear leveling#

The FTL spreads writes across all blocks so that no single block hits the P/E limit before the others. Two flavours:

Dynamic wear leveling — when picking a free block to write to, prefer ones with lower P/E counts.
Static wear leveling — periodically migrate cold data (rarely modified) off blocks with low wear so those blocks can take more writes.

Without static leveling, blocks holding the boot image would never wear out while heavily-modified blocks would burn through their cycles.
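The two flavours can be sketched as two selection functions over per-block erase counts. The names and the wear-gap `threshold` are illustrative assumptions, not a description of any particular firmware's policy.

```python
def pick_block_dynamic(free_blocks, erase_counts):
    """Dynamic leveling: among free blocks, write to the least-worn one."""
    return min(free_blocks, key=lambda b: erase_counts[b])


def pick_static_migration(erase_counts, cold_blocks, threshold):
    """Static leveling: if the gap between the most-worn block and the least-worn
    cold block exceeds `threshold`, return that cold block so its data can be
    moved elsewhere and the block put back into rotation. Otherwise return None."""
    if not cold_blocks:
        return None
    least_worn_cold = min(cold_blocks, key=lambda b: erase_counts[b])
    gap = max(erase_counts.values()) - erase_counts[least_worn_cold]
    return least_worn_cold if gap > threshold else None


erase_counts = {0: 900, 1: 12, 2: 450, 3: 870}   # block id -> P/E cycles used
print(pick_block_dynamic([0, 2, 3], erase_counts))          # 2
print(pick_static_migration(erase_counts, [1], 500))        # 1
print(pick_static_migration(erase_counts, [1], 1000))       # None
```

Block 1 here plays the role of the boot-image block: it is barely worn because its data never changes, so only static leveling ever frees it up.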

TRIM and over-provisioning#

TRIM (or DISCARD / UNMAP) is a command the OS sends when it deletes a file: “logical blocks X through Y are no longer in use, free to discard.” Without TRIM, the drive doesn’t know those LBAs are dead and copies their stale data during GC, wasting cycles and wearing the drive.
Over-provisioning — the drive secretly reserves 7-28% of its NAND as spare capacity not exposed to the OS. More OP means more headroom for GC, less write amplification, longer endurance. Enterprise drives ship with 28% OP by default; consumer drives 7%.
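Both ideas are easy to express in code. In the sketch below, `trim` shows what a discard does to a toy mapping table (the dictionary and set are hypothetical stand-ins for FTL state), and `over_provisioning` computes spare capacity as a fraction of user capacity. The GiB-versus-GB example is one common way a drive can end up near 7%; actual vendor practice varies.

```python
def trim(l2p, invalid, first_lba, last_lba):
    """Drop the mappings for an inclusive LBA range and mark their pages stale,
    so garbage collection no longer has to copy them."""
    for lba in range(first_lba, last_lba + 1):
        ppn = l2p.pop(lba, None)
        if ppn is not None:
            invalid.add(ppn)


def over_provisioning(raw_bytes, user_bytes):
    """Spare NAND as a fraction of the capacity exposed to the OS."""
    return (raw_bytes - user_bytes) / user_bytes


l2p = {10: 3, 11: 7, 12: 9}     # logical page -> physical page
invalid = set()
trim(l2p, invalid, 10, 11)
print(l2p)        # {12: 9}
print(invalid)    # {3, 7}

# One binary gigabyte of NAND sold as one decimal gigabyte of user space.
print(round(over_provisioning(2 ** 30, 10 ** 9), 4))   # 0.0737
```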

Variants#

SLC, MLC, TLC, QLC#

Type	Bits/cell	Endurance (P/E cycles)	Speed	Cost/GB
SLC	1	~100,000	Fastest	Highest
MLC	2	~10,000	Fast	Medium-high
TLC	3	~3,000	Medium	Medium
QLC	4	~1,000	Slow	Lowest

Consumer drives in 2026 are almost all TLC or QLC; enterprise mixes TLC with SLC caches.

SLC caching#

QLC and TLC drives reserve some cells configured as SLC (faster, more endurance) as a write cache. Bursty writes land in SLC at full speed; when the cache fills, write performance falls to native TLC/QLC speed — the “post-cache plateau” you see in sustained-write benchmarks.
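A simple two-speed model shows why the plateau matters for sustained writes. The function name is hypothetical, the figures in the example (50 GB cache, 2 GB/s in cache, 0.1 GB/s native) are illustrative, and the model assumes the cache does not drain while the write is in progress.

```python
def sustained_write_seconds(total_gb, cache_gb, cache_gbps, native_gbps):
    """Time to write `total_gb` when the first `cache_gb` land in the SLC cache
    and everything after that is written at native TLC/QLC speed."""
    in_cache = min(total_gb, cache_gb)
    after_cache = total_gb - in_cache
    return in_cache / cache_gbps + after_cache / native_gbps


burst = sustained_write_seconds(40, cache_gb=50, cache_gbps=2.0, native_gbps=0.1)
long_run = sustained_write_seconds(200, cache_gb=50, cache_gbps=2.0, native_gbps=0.1)

print(burst)                       # 20.0 seconds: the write never leaves the cache
print(round(long_run))             # 1525 seconds
print(round(200 / long_run, 3))    # 0.131 GB/s average, far below the 2 GB/s burst rate
```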

Interface and form factor#

SATA SSD — ~550 MB/s, ~100k IOPS. Limited by the 6 Gb/s SATA link and AHCI's single command queue.
NVMe — over PCIe, sequential throughput up to ~7 GB/s (PCIe 4.0 x4), ~1M IOPS spread across many parallel queues.
U.2 / E1.S / E3.S — hot-swap form factors for data centres.

Zoned namespaces (ZNS)#

A newer NVMe spec where the drive exposes its log structure to the host: writes must be sequential within zones, and the host (not the FTL) does garbage collection. Reduces write amplification dramatically when paired with zone-aware host software such as F2FS, btrfs in zoned mode, or RocksDB with ZenFS.
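The sequential-write rule can be sketched as a zone with a write pointer. `Zone` is a hypothetical toy model, not the NVMe command set: it only shows that writes must land at the pointer and that reclaiming space means the host resets a whole zone.

```python
class Zone:
    """Toy model of one zone: writes must land exactly at the write pointer."""

    def __init__(self, start_lba, size):
        self.start = start_lba
        self.size = size
        self.write_pointer = start_lba

    def write(self, lba, nblocks):
        if lba != self.write_pointer:
            raise ValueError("non-sequential write rejected")
        if lba + nblocks > self.start + self.size:
            raise ValueError("write crosses the zone boundary")
        self.write_pointer += nblocks

    def reset(self):
        # Host-initiated: the whole zone becomes writable from its start again.
        self.write_pointer = self.start


zone = Zone(start_lba=0, size=1024)
zone.write(0, 8)
zone.write(8, 8)
print(zone.write_pointer)    # 16
try:
    zone.write(512, 8)       # skips ahead of the write pointer
except ValueError as err:
    print(err)               # non-sequential write rejected
```

Because the host decides when to copy live data out of a zone and reset it, it can group data with similar lifetimes, which is where the write-amplification savings come from.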

Trade-offs#

HDD — $15/TB, ~10 ms latency, no wear-out, unlimited overwrites. Wins on cost for cold and archival.

SSD — $70-150/TB (TLC NVMe, 2026), ~100 µs latency, P/E-cycle limited, write amplification. Wins on latency, IOPS, power, density — anywhere random access matters.

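Write amplification. The single number that determines real-world endurance. A 1 TB drive rated 3 DWPD (drive writes per day) for 5 years is rated for about 5.5 PB of host writes under the WA the vendor assumed. If your workload's WA is 5 times higher than that assumption, the NAND wears out 5 times sooner and the effective rating drops to roughly 0.6 DWPD.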
Read disturb. Reading a page slightly disturbs adjacent pages. After enough reads, neighbours need to be refreshed. Drives handle this transparently but it contributes to background activity.
Data retention. A powered-off SSD slowly loses charge. Enterprise drives are spec’d for 3 months at 40°C; consumer drives for 1 year at 30°C. Long-term archives on SSD need periodic powering.
Tail latency. GC runs are unpredictable. p99 latency on a write-heavy SSD can be 10-100x the median. Enterprise drives reserve more OP and use deterministic GC to control this; consumer drives are noisier.
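The endurance arithmetic behind the write-amplification item can be worked through directly. Both helpers are hypothetical, and the second is a deliberately crude estimate that ignores over-provisioning and assumes perfectly even wear.

```python
def rated_host_writes_tb(dwpd, capacity_tb, years):
    """Total host writes implied by a DWPD rating over the warranty period."""
    return dwpd * capacity_tb * 365 * years


def host_tb_before_wearout(capacity_tb, pe_cycles, write_amplification):
    """Crude estimate: total NAND writes available (capacity x P/E cycles),
    divided by the physical writes each host write costs."""
    return capacity_tb * pe_cycles / write_amplification


print(rated_host_writes_tb(dwpd=3, capacity_tb=1, years=5))    # 5475 TB
print(host_tb_before_wearout(1, pe_cycles=3000, write_amplification=2))   # 1500.0 TB
print(host_tb_before_wearout(1, pe_cycles=3000, write_amplification=5))   # 600.0 TB
```

Raising WA from 2 to 5 cuts the host data the same NAND can absorb from 1500 TB to 600 TB, which is why reducing WA (more free space, TRIM, sequential writes) extends drive life.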

Common pitfalls#

Filling the drive to 100%. Headroom is where GC happens. A >95% full drive has degraded performance and accelerated wear. Keep at least 10-20% free.
Disabling TRIM. Common on RAID configurations that historically didn’t support TRIM pass-through. Without TRIM, the drive treats deleted data as live, GC-copies it, and burns through its life faster than expected.
Benchmarking with the drive’s SLC cache. A consumer QLC drive can sustain 2 GB/s for the first 50 GB of writes, then collapse to 100 MB/s. Single-burst benchmarks miss the cliff.
Treating SSDs as a drop-in HDD replacement in RAID arrays. SSDs from the same batch under the same write load have correlated wear-out: they tend to fail after about the same number of writes. A RAID 5 of 5 identical SSDs is at risk of multiple correlated failures.
fsync on a drive with volatile write cache. Cheap consumer drives sometimes lie about FLUSH CACHE completion. Enterprise drives have power-loss protection (PLP) — a small capacitor that lets the drive drain its DRAM cache to flash on sudden power loss. PLP is the single most important spec for any production write workload.

Why do FTL-level log structures look so similar to LFS?

Because the shape of the problem is the same: new data is always written to a fresh location, and a separate cleanup pass reclaims the stale copies, with the cleanup costing real time (and, on flash, wear). LFS adopted this design on magnetic disks in 1991 to turn random writes into sequential ones; SSDs in the 2000s rediscovered it because flash physically cannot overwrite in place. The historical irony is that LFS was unpopular on HDDs but is the only workable design for flash. The FTL is essentially “LFS, but in firmware, hidden from the host.”

Hard Disk Drives — the contrast that makes SSD trade-offs vivid.
Log-Structured File System (LFS) — the design pattern the FTL implements internally.
RAID — Striping, Mirroring, Parity — correlated-failure concerns for SSD arrays.
File System Implementation — the layer above; modern file systems increasingly speak SSD-aware (TRIM, alignment, ZNS).
I/O Devices and Drivers — the NVMe driver path that makes 1M IOPS possible.