Hard Disk Drives
Geometry, seek + rotational latency + transfer time, disk scheduling (SSTF, SCAN, C-SCAN), and the I/O-cost math.
What it is
A hard disk drive is a stack of magnetic platters spinning at constant angular velocity, each side served by a read/write head on a moving arm. Data sits in concentric tracks, each track is divided into sectors (historically 512 bytes, now 4 KB on Advanced Format drives), and the set of all tracks at the same arm position across all platters forms a cylinder.
To read a sector, the drive has to (1) move the arm to the right track — seek time, milliseconds — and (2) wait for the platter to rotate the target sector under the head — rotational latency, also milliseconds. Only then does the transfer happen at the drive’s media rate, which is hundreds of MB/s. The cost model T = seek + rotation + transfer is the foundation of every disk-aware piece of OS code.
When to use it
HDDs survive in 2026 because their cost per byte is still 5-10x lower than that of SSDs at the high-capacity end. Where they fit:
- Cold storage and archives. Backups, video archives, large datasets accessed infrequently. The latency penalty doesn’t matter when you read once a month.
- Sequential bulk workloads. Big-data scans (Hadoop circa 2010, modern object-store backends for cold tiers) where every read is >= 1 MB, so the per-IO seek overhead amortises.
- Capacity-bound databases. The “warm tier” of analytical databases that need many TB per node and can tolerate ~5 ms reads when caches miss.
For anything random or latency-sensitive, SSDs (and increasingly NVMe) won by 2020. The interesting OS-level question today is the scheduling and layout policy that still applies: modern file systems still arrange blocks to favour sequential access, and the same preference for large sequential I/O carries over to log-structured designs on SSDs, even though the seek and rotation terms of the cost model disappear there.
How it works
Geometry and latency math
A 7200 RPM drive completes one rotation in 60 / 7200 s = 8.33 ms. Average rotational latency is half that: ~4.2 ms. Full-stroke seek (innermost to outermost track) is ~10-15 ms; average seek (between a random pair of tracks) is ~4-8 ms. Add the time to actually stream the bytes (a 4 KB sector at 200 MB/s is 0.02 ms), and a single random 4 KB read costs roughly 10 ms.
- Random 4 KB read on a 7200 RPM HDD ≈ 5 ms seek + 4 ms rotation + ~0 ms transfer ≈ 9 ms
- Random 4 KB read on a SATA SSD ≈ 0.1 ms
- Random 4 KB read on an NVMe SSD ≈ 0.01 ms (10x faster than SATA)
That gap of roughly two orders of magnitude between HDD and SSD is why so much OS code worries about disk layout: on an HDD, a bad layout can turn a 1-second workload into a 100-second workload.
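
The arithmetic above can be written out as a small calculator. This is a minimal sketch of the seek + rotation + transfer model; the function names and the choice to treat 1 MB as 10^6 bytes are assumptions made for illustration, and the 5 ms average seek is the round figure used in the text rather than a measured value.

```python
def rotation_ms(rpm: float) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def random_read_ms(rpm: float, avg_seek_ms: float, size_bytes: int,
                   media_rate_mb_s: float) -> float:
    """T = seek + rotation + transfer for a single random read.

    Assumes the average rotational latency is half a rotation and
    that 1 MB means 10**6 bytes.
    """
    rotational_ms = rotation_ms(rpm) / 2
    transfer_ms = size_bytes / (media_rate_mb_s * 1e6) * 1000
    return avg_seek_ms + rotational_ms + transfer_ms


if __name__ == "__main__":
    t = random_read_ms(rpm=7200, avg_seek_ms=5.0, size_bytes=4096,
                       media_rate_mb_s=200.0)
    print(f"random 4 KB read: {t:.2f} ms")
    print(f"upper bound on random IOPS: {1000 / t:.0f}")
```

With these inputs the transfer term is about 0.02 ms, so positioning accounts for almost all of the roughly 9 ms total.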
Disk scheduling
When multiple requests are queued, the order in which you service them matters enormously. The classical algorithms:
| Algorithm | What it does | Pros | Cons |
|---|---|---|---|
| FIFO | Serve in arrival order | Fair, simple | Random-access workloads thrash the arm |
| SSTF (shortest seek time first) | Pick the pending request closest to the current head position | Low average seek time | Starvation: far requests may never be serviced under load |
| SCAN / elevator | Sweep arm one direction, service everything in path, then reverse | No starvation, good throughput | Mid-tracks get serviced 2x more often than edges |
| C-SCAN | Like SCAN but only serve on one sweep direction, fast return | Fairer than SCAN | Slightly higher average latency |
| SPTF (shortest positioning time first) | SSTF that also accounts for rotational position | Best throughput | Needs accurate rotation model |
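Modern Linux (post-blk-mq) typically uses mq-deadline or bfq on rotating disks. mq-deadline dispatches requests in sector order and attaches a deadline to each one so that none starves; bfq adds per-process queues with proportional-share fairness. Most modern HDDs also reorder requests internally through NCQ (Native Command Queuing): the OS hands a SATA drive up to 32 outstanding requests and the drive picks the service order.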
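
A toy simulation makes the trade-offs in the table easier to see. The sketch below orders a queue of track numbers under each policy and sums the head travel. Assumptions: the function names are illustrative, the queue and start position are an arbitrary example, no new requests arrive while the queue drains, and `scan` and `c_scan` are the LOOK-style variants that turn around at the last pending request instead of the physical edge of the disk. SPTF is left out because it needs a rotation model.

```python
def fifo(start, requests):
    """Serve requests in arrival order."""
    return list(requests)


def sstf(start, requests):
    """Repeatedly pick the pending track closest to the head."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda track: abs(track - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order


def scan(start, requests):
    """Sweep toward higher tracks first, then reverse (LOOK variant)."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down


def c_scan(start, requests):
    """Sweep upward, jump back to the lowest pending track, sweep upward again."""
    up = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    return up + wrapped


def head_travel(start, order):
    """Total number of tracks the head crosses while serving `order`."""
    total, pos = 0, start
    for track in order:
        total += abs(track - pos)
        pos = track
    return total


if __name__ == "__main__":
    queue = [98, 183, 37, 122, 14, 124, 65, 67]
    for policy in (fifo, sstf, scan, c_scan):
        print(policy.__name__, head_travel(53, policy(53, queue)))
```

For this queue the totals are 640 tracks for FIFO, 236 for SSTF, 299 for SCAN, and 322 for C-SCAN. The C-SCAN figure counts the return jump as ordinary head travel, which overstates its cost on a real drive where the return is a single fast seek.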
How the file system helps
A file system that knows it’s on an HDD will:
- Place a file’s inode in the same cylinder group as its data blocks (FFS) so a `stat` + `read` pair doesn’t need a long seek.
- Allocate contiguous blocks for sequential writes when possible, so reading the file later costs one seek instead of N.
- Cluster directory entries with their inodes for fast directory traversal.
- Pre-fetch neighbouring blocks on read, because the head is already positioned there and reading them now is far cheaper than seeking back later.
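
The payoff of contiguous allocation can be estimated with the same cost model. The sketch below charges one positioning delay (seek plus rotation) per extent and streams all bytes at the media rate. The function name, the 9 ms positioning cost, and the 200 MB/s rate are illustrative assumptions based on the round numbers used earlier, not measurements.

```python
def file_read_ms(n_blocks: int, n_extents: int, block_bytes: int = 4096,
                 positioning_ms: float = 9.0,
                 media_rate_mb_s: float = 200.0) -> float:
    """Estimated time to read a file stored as `n_extents` contiguous runs.

    Each extent costs one seek + rotation; the data itself streams at the
    media rate (1 MB taken as 10**6 bytes).
    """
    if not 1 <= n_extents <= n_blocks:
        raise ValueError("need 1 <= n_extents <= n_blocks")
    transfer_ms = n_blocks * block_bytes / (media_rate_mb_s * 1e6) * 1000
    return n_extents * positioning_ms + transfer_ms


if __name__ == "__main__":
    blocks = 25_600  # 100 MiB in 4 KB blocks
    print(f"contiguous:       {file_read_ms(blocks, 1) / 1000:.2f} s")
    print(f"fully fragmented: {file_read_ms(blocks, blocks) / 1000:.2f} s")
```

Under these assumptions a 100 MiB file takes about half a second to read from a single extent and close to four minutes when every block sits in its own extent.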
Variants
Capacities and form factors
- 3.5” enterprise HDDs — 18-26 TB in 2026, 7200 RPM, helium-filled, used in storage arrays and hyperscaler cold tiers.
- 2.5” enterprise — 1-2 TB, 10k/15k RPM, lower latency, mostly displaced by SSDs except for niche budget use.
- Consumer 3.5” — 4-12 TB, 5400 or 7200 RPM, still common in DIY NAS and home backups.
SMR (shingled magnetic recording)
Tracks overlap like roof shingles to push density. Sequential writes are fine; random writes are catastrophic because rewriting one track clobbers the tracks that overlap it, so those must be rewritten too. SMR drives come in a host-managed mode, where the OS knows it must write sequentially, or a drive-managed mode, where the firmware does internal logging, much like an SSD’s FTL.
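
The sequential-write rule of host-managed SMR can be sketched as a zone with a write pointer. This models only the rule itself and is not a real driver or kernel API; the `Zone` class and its method names are hypothetical, and sizes are counted in blocks.

```python
class Zone:
    """Hypothetical model of one sequential-write-required SMR zone."""

    def __init__(self, start_lba: int, length: int) -> None:
        self.start_lba = start_lba
        self.length = length
        self.write_pointer = start_lba  # next LBA that may be written

    def write(self, lba: int, n_blocks: int) -> None:
        """Accept a write only if it lands exactly at the write pointer."""
        if lba != self.write_pointer:
            raise ValueError(
                f"non-sequential write at {lba}, expected {self.write_pointer}")
        if lba + n_blocks > self.start_lba + self.length:
            raise ValueError("write crosses the zone boundary")
        self.write_pointer += n_blocks

    def reset(self) -> None:
        """Discard the zone's contents so it can be rewritten from the start."""
        self.write_pointer = self.start_lba


if __name__ == "__main__":
    zone = Zone(start_lba=0, length=1024)
    zone.write(0, 8)   # accepted: starts at the write pointer
    zone.write(8, 8)   # accepted: continues sequentially
    try:
        zone.write(512, 8)  # rejected: skips ahead of the write pointer
    except ValueError as err:
        print("rejected:", err)
```

Updating a block in the middle of a zone therefore means copying the live data elsewhere and resetting the zone, which is the same bookkeeping a drive-managed SMR disk performs internally.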
MAMR / HAMR
Microwave / heat-assisted magnetic recording — the technologies pushing past ~30 TB per drive. Same access model as conventional HDDs; capacity uplift only.
Trade-offs
- HDD. Benefit: low cost per byte ($15/TB in 2026), high capacity (26 TB+), tolerable for sequential workloads. Cost: ~10 ms random access, mechanical wear, sensitive to vibration, power-hungry (~10 W idle on spinning drives).
- SSD. Benefit: fast random access (100 µs-class), no mechanical wear, low power. Cost: 5-10x $/TB premium, finite write endurance (P/E cycles), write amplification, FTL complexity.
Other axes:
- Write caching. Drives have onboard DRAM caches (64-256 MB) that buffer writes. Without `FLUSH CACHE` or `FUA` commands, a “successful write” might still be lost on power failure. The journal in any modern file system depends on explicit flush commands hitting the drive.
- NCQ depth. Production HDD workloads benefit from a queue depth of 8-32; deeper queues let the drive reorder more aggressively. Too deep and tail latency suffers.
- Bad sectors and reallocation. Drives maintain a spare pool of sectors. When a sector goes bad, the firmware silently remaps it. Reads of a borderline sector get slow (multiple retries) before being remapped. SMART data exposes this.
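
From an application's point of view, the write-caching issue surfaces as the need to call `fsync`. The following is a minimal sketch, assuming a POSIX-like system; `durable_write` is a hypothetical helper name. Whether `os.fsync` results in a cache-flush command actually reaching the drive depends on the operating system, file system, and mount options, so treat the call as necessary rather than sufficient.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and ask the OS to push it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit in the page cache (and later
        # in the drive's DRAM cache) when the function returns.
        os.fsync(fd)
    finally:
        os.close(fd)
```

For a newly created file, making the directory entry durable as well usually requires an `fsync` on the containing directory; that step is omitted here to keep the sketch short.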
Common pitfalls
- Treating disk latency as a constant. Average latency hides a distribution with a long tail: p99 can be 10x the median, especially under load when the elevator queue grows.
- Random 4 KB workloads on HDDs. A 1 GB random-read workload on a 7200 RPM drive is 262,144 reads * 9 ms ≈ 40 minutes. The same on a SATA SSD at 0.1 ms per read: ~26 seconds. Architectures that worked on SSD do not work on HDD.
- Ignoring the rotational position. Two requests for sectors on the same track but on opposite sides of the platter still cost ~4 ms (half a rotation), and a request for a sector that has just passed under the head costs ~8 ms (one full rotation). Scheduling that only accounts for track position misses this.
- Trusting the drive’s write cache without a flush. A power blip with dirty cache loses the writes. Journaled file systems issue `FLUSH CACHE` at every commit; disabling barriers for performance is a footgun.
- Forgetting that fragmentation matters. On a heavily fragmented HDD, sequential reads turn into random reads. Modern Linux file systems mostly avoid this; older FAT-style file systems on Windows historically needed periodic defrag.
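
The rotational-position pitfall can be made concrete with a small calculation. This is a sketch under simplifying assumptions: sectors are evenly spaced around the track, the head is already on the right track, and the function name and the 1000-sectors-per-track figure are made up for illustration.

```python
def rotational_wait_ms(current_sector: int, target_sector: int,
                       sectors_per_track: int, rpm: float = 7200) -> float:
    """Time until `target_sector` rotates under the head.

    The platter only spins one way, so the gap is measured forward
    around the track (modulo the track length).
    """
    rotation_ms = 60_000.0 / rpm
    gap = (target_sector - current_sector) % sectors_per_track
    return gap / sectors_per_track * rotation_ms


if __name__ == "__main__":
    spt = 1000
    print(f"opposite side: {rotational_wait_ms(0, 500, spt):.2f} ms")
    print(f"just missed:   {rotational_wait_ms(10, 9, spt):.2f} ms")
```

A sector on the opposite side of the track costs half a rotation, while a sector that has just gone past costs almost a full one, even though both requests need zero seek distance.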
Why is enterprise HDD throughput often quoted as 250 MB/s and not 500 MB/s?
Spec sheets quote the sustained transfer rate on the outer tracks, where media bandwidth is highest. Inner tracks hold fewer sectors per revolution (constant angular velocity, smaller circumference) and read at roughly 50-60% of the outer-track rate. Workloads accessing data across the full platter average down. Because logical block addresses typically start at the outer edge and move inward, the same drive that does 250 MB/s on a fresh sequential write can drop to 120 MB/s as it fills.
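
The fall-off can be approximated with a simple geometric model. This is a rough sketch, not a datasheet formula: it assumes constant areal density (so capacity is proportional to platter area), logical blocks laid out from the outer edge inward, and an inner-to-outer radius ratio of 0.48, a value chosen only so that the endpoints match the 250 and 120 MB/s figures above. The function name is illustrative.

```python
import math


def sequential_rate_mb_s(fraction_full: float,
                         outer_rate_mb_s: float = 250.0,
                         inner_to_outer_radius: float = 0.48) -> float:
    """Estimated sequential rate at the point where the drive is `fraction_full`."""
    if not 0.0 <= fraction_full <= 1.0:
        raise ValueError("fraction_full must be between 0 and 1")
    # Normalise the outer radius to 1. Capacity is proportional to area, so
    # the used band runs from radius 1 down to the radius r that satisfies
    # (1 - r**2) / (1 - r_in**2) == fraction_full.
    r_in_sq = inner_to_outer_radius ** 2
    r = math.sqrt(1.0 - fraction_full * (1.0 - r_in_sq))
    # At constant angular velocity, bytes per second scale with radius.
    return outer_rate_mb_s * r


if __name__ == "__main__":
    for pct in (0, 25, 50, 75, 100):
        print(f"{pct:3d}% full: {sequential_rate_mb_s(pct / 100):.0f} MB/s")
```

Because the outer tracks hold more data per revolution, the model loses speed slowly at first and faster toward the end: the half-full point still sits well above the midpoint between the two extremes.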
Related building blocks#
- I/O Devices and Drivers — the driver layer above the disk.
- RAID — Striping, Mirroring, Parity — how multiple disks combine into a single logical volume.
- Flash SSDs and the Flash Translation Layer — the modern replacement for most random-access workloads.
- File System Implementation — the layout policies that exploit disk geometry.
- The Fast File System (FFS) — the canonical disk-aware file system design.