Hard Disk Drives

Geometry, seek + rotational latency + transfer time, disk scheduling (SSTF, SCAN, C-SCAN), and the I/O-cost math.

What it is#

A hard disk drive is a stack of magnetic platters spinning at constant angular velocity, each surface served by a read/write head on a moving arm. Data sits in concentric tracks; each track is divided into sectors (historically 512 bytes, now 4 KB on Advanced Format drives), and the set of all tracks at the same arm position across all platters forms a cylinder.

To read a sector, the drive has to (1) move the arm to the right track — seek time, milliseconds — and (2) wait for the platter to rotate the target sector under the head — rotational latency, also milliseconds. Only then does the transfer happen at the drive’s media rate, which is hundreds of MB/s. The cost model T = seek + rotation + transfer is the foundation of every disk-aware piece of OS code.

When to use it#

HDDs survive in 2026 because their cost per byte is still 5-10x lower than that of SSDs at the high-capacity end. Where they fit:

  • Cold storage and archives. Backups, video archives, large datasets accessed infrequently. The latency penalty doesn’t matter when you read once a month.
  • Sequential bulk workloads. Big-data scans (Hadoop circa 2010, modern object-store backends for cold tiers) where every read is >= 1 MB — the per-IO seek overhead amortises.
  • Capacity-bound databases. The “warm tier” of analytical databases that need many TB per node and can tolerate ~5 ms reads when caches miss.

For anything random or latency-sensitive, SSDs (SATA first, then increasingly NVMe) won by 2020. The interesting OS-level question today is which scheduling and layout policies still apply: modern file systems still arrange blocks to favour sequential access, and the same preference for large sequential writes carries over to log-structured designs on SSDs, even though an SSD has no seek or rotation to pay for.
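To make the amortisation argument concrete, here is a minimal sketch of effective throughput as a function of request size. The 9 ms positioning cost and 200 MB/s media rate are assumed round numbers, and `effective_mb_per_s` is a hypothetical helper, not part of any library.

```python
def effective_mb_per_s(io_bytes, positioning_ms=9.0, media_mb_per_s=200.0):
    """Throughput seen by the caller when every request pays one
    positioning delay (seek + rotation) before streaming its bytes.
    Both default values are assumptions chosen for illustration."""
    transfer_s = io_bytes / (media_mb_per_s * 1e6)
    total_s = positioning_ms / 1000 + transfer_s
    return (io_bytes / 1e6) / total_s


for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    print(f"{size:>10} bytes per request -> {effective_mb_per_s(size):7.2f} MB/s")
```

Under these assumptions a 4 KB request stream delivers well under 1 MB/s, a 1 MB stream reaches roughly a third of the media rate, and only requests of tens of MB approach it.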

How it works#

Geometry and latency math#

A 7200 RPM drive completes one rotation in 60 s / 7200 = 8.33 ms. Average rotational latency is half that: ~4.2 ms. Full-stroke seek (innermost to outermost track) is ~10-15 ms; average seek (across a random pair of tracks) is ~4-8 ms. Add the time to actually stream the bytes (a 4 KB sector at 200 MB/s is 0.02 ms), and a single random 4 KB read costs roughly 9-10 ms.

random 4 KB read on 7200 RPM HDD ≈ 5 ms seek + 4 ms rotation + ~0.02 ms xfer ≈ 9 ms
random 4 KB read on a SATA SSD ≈ 0.1 ms (~90x faster than the HDD)
random 4 KB read on an NVMe SSD ≈ 0.01 ms (10x faster than SATA)

That gap of roughly two orders of magnitude between HDD and SSD is why so much OS code worries about disk layout: on an HDD, a bad layout can turn a 1-second workload into a 100-second workload.
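The arithmetic above can be reproduced with a few lines of Python. This is a sketch under the figures quoted in this section (7200 RPM, 5 ms average seek, 200 MB/s media rate); the function names are illustrative only.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def random_read_ms(rpm, avg_seek_ms, io_bytes, media_mb_per_s):
    """T = seek + rotation + transfer for one random read."""
    avg_rotational_ms = rotation_ms(rpm) / 2  # on average, half a turn
    transfer_ms = io_bytes / (media_mb_per_s * 1e6) * 1000
    return avg_seek_ms + avg_rotational_ms + transfer_ms


cost = random_read_ms(rpm=7200, avg_seek_ms=5.0, io_bytes=4096, media_mb_per_s=200)
print(f"one rotation: {rotation_ms(7200):.2f} ms")  # one rotation: 8.33 ms
print(f"random 4 KB read: {cost:.2f} ms")           # random 4 KB read: 9.19 ms
```

Almost all of the cost is positioning; the transfer term only starts to matter once requests reach the megabyte range.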

Disk scheduling#

When multiple requests are queued, the order in which you service them matters enormously. The classical algorithms:

| Algorithm | What it does | Pros | Cons |
|---|---|---|---|
| FIFO | Serve in arrival order | Fair, simple | Random-access workloads thrash the arm |
| SSTF (shortest seek time first) | Pick the closest pending sector | Best average latency | Starvation: far requests never serviced under load |
| SCAN / elevator | Sweep arm one direction, service everything in path, then reverse | No starvation, good throughput | Mid-tracks get serviced 2x more often than edges |
| C-SCAN | Like SCAN but only serve on one sweep direction, fast return | Fairer than SCAN | Slightly higher average latency |
| SPTF (shortest positioning time first) | SSTF that also accounts for rotational position | Best throughput | Needs accurate rotation model |

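Modern Linux (post-blk-mq) typically uses mq-deadline or bfq on rotational disks. mq-deadline dispatches requests in sector order, much like a one-way elevator, and adds per-request deadlines so that nothing starves; bfq adds per-process fairness on top. Many modern HDDs also run their own internal scheduler (NCQ, Native Command Queuing): the OS hands the drive up to 32 outstanding requests and the drive reorders them.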
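The classical policies are easy to compare on a toy queue. The sketch below counts total head movement in tracks for a fixed set of pending requests; it ignores rotation and new arrivals, and its SCAN and C-SCAN reverse at the last pending request rather than at the physical disk edge (the LOOK / C-LOOK simplification). The request list is a made-up example.

```python
def fifo(start, requests):
    """Serve in arrival order."""
    return list(requests)


def sstf(start, requests):
    """Repeatedly pick the pending track closest to the current head position."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order


def scan(start, requests):
    """Sweep upward first, then reverse and serve the rest on the way down."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down


def c_scan(start, requests):
    """Sweep upward, jump back to the lowest pending track, sweep upward again."""
    up = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    return up + wrapped


def head_movement(start, order):
    """Total tracks travelled, including any return jump."""
    total, pos = 0, start
    for track in order:
        total += abs(track - pos)
        pos = track
    return total


start = 53
queue = [98, 183, 37, 122, 14, 124, 65, 67]
for policy in (fifo, sstf, scan, c_scan):
    print(f"{policy.__name__:>6}: {head_movement(start, policy(start, queue))} tracks")
# fifo: 640, sstf: 236, scan: 299, c_scan: 322
```

SSTF wins on this static queue, but with a steady stream of arrivals near the head it would never reach track 183, which is the starvation problem the sweep-based policies avoid.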

How the file system helps#

A file system that knows it’s on an HDD will:

  1. Place a file’s inode in the same cylinder group as its data blocks (FFS) so a stat + read pair doesn’t need a long seek.
  2. Allocate contiguous blocks for sequential writes when possible, so reading the file later costs one seek instead of N.
  3. Cluster directory entries with their inodes for fast directory traversal.
  4. Pre-fetch neighbouring blocks on read, because the head is already positioned there and the extra blocks cost only transfer time, not another seek.
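The payoff of contiguous allocation follows directly from the cost model. As a rough sketch, assume each extent (contiguous run of blocks) costs one positioning delay of about 9 ms and the media streams at 200 MB/s; both numbers and the helper name are assumptions for illustration.

```python
def file_read_ms(file_bytes, extents, positioning_ms=9.0, media_mb_per_s=200.0):
    """Time to read a whole file stored in `extents` contiguous pieces:
    one positioning delay per extent plus the streaming time."""
    transfer_ms = file_bytes / (media_mb_per_s * 1e6) * 1000
    return extents * positioning_ms + transfer_ms


size = 100_000_000  # a 100 MB file
print(f"contiguous:    {file_read_ms(size, 1):.0f} ms")     # contiguous:    509 ms
print(f"1000 extents:  {file_read_ms(size, 1000):.0f} ms")  # 1000 extents:  9500 ms
```

The same bytes take almost twenty times longer to read once the file is scattered, which is why allocators work hard to keep extents long.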

Variants#

Capacities and form factors#

  • 3.5” enterprise HDDs — 18-26 TB in 2026, 7200 RPM, helium-filled, used in storage arrays and hyperscaler cold tiers.
  • 2.5” enterprise — 1-2 TB, 10k/15k RPM, lower latency, mostly displaced by SSDs except for niche budget use.
  • Consumer 3.5” — 4-12 TB, 5400 or 7200 RPM, still common in DIY NAS and home backups.

SMR (shingled magnetic recording)#

Tracks overlap like roof shingles to push density. Sequential writes are fine; random writes are catastrophic because rewriting one track clobbers the tracks that overlap it, so those have to be rewritten too. SMR drives come in a host-managed flavour, where the OS knows it must write each zone sequentially, and a drive-managed flavour, where the firmware does internal logging, much like an SSD’s FTL.
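The host-managed contract can be pictured as a zone with a write pointer: writes must land exactly at the pointer, and space is reclaimed only by resetting the whole zone. The class below is a toy model of that idea, not a real device interface; `Zone` and its methods are hypothetical names.

```python
class Zone:
    """Toy model of a sequential-write-required zone."""

    def __init__(self, start_lba, length):
        self.start_lba = start_lba
        self.length = length
        self.write_pointer = start_lba  # next block that may be written

    def write(self, lba, blocks):
        """Accept a write only if it starts at the write pointer and fits."""
        if lba != self.write_pointer:
            raise ValueError(f"non-sequential write at {lba}, expected {self.write_pointer}")
        if lba + blocks > self.start_lba + self.length:
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += blocks

    def reset(self):
        """Discard the zone's contents so it can be written from the start again."""
        self.write_pointer = self.start_lba


zone = Zone(start_lba=0, length=1000)
zone.write(0, 8)   # ok: starts at the write pointer
zone.write(8, 8)   # ok: appends directly after the previous write
try:
    zone.write(500, 8)  # rejected: would skip ahead of the write pointer
except ValueError as err:
    print("rejected:", err)
```

A file system or log-structured store sitting on such zones has to turn every update into an append, which is the same discipline an SSD’s FTL applies internally.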

MAMR / HAMR#

Microwave / heat-assisted magnetic recording — the technologies pushing past ~30 TB per drive. Same access model as conventional HDDs; capacity uplift only.

Trade-offs#

HDD — cheap ($15/TB in 2026), high capacity (26 TB+), tolerable for sequential workloads. Cost: ~10 ms random access, mechanical wear, sensitive to vibration, power-hungry (~10 W idle on spinning drives).
SSD — fast random access (100 µs-class), no mechanical wear, low power. Cost: 5-10x $/TB premium, finite write endurance (P/E cycles), write amplification, FTL complexity.

Other axes:

  • Write caching. Drives have onboard DRAM caches (64-256 MB) that buffer writes. Without FLUSH CACHE or FUA commands, a “successful write” might still be lost on power failure. The journal in any modern file system depends on explicit flush commands hitting the drive.
  • NCQ depth. Production HDD workloads benefit from a queue depth of 8-32; deeper queues let the drive reorder more aggressively. Too deep and tail latency suffers.
  • Bad sectors and reallocation. Drives maintain a spare pool of sectors. When a sector goes bad, the firmware silently remaps it. Reads of a borderline sector get slow (multiple retries) before being remapped. SMART data exposes this.
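On the write-caching point, an application that needs durability has to ask for it explicitly. A minimal sketch in Python follows, assuming a POSIX-like system where `os.fsync` pushes the file's data to the device; whether that also flushes the drive's own cache depends on the OS, the file system and its mount options. `durable_write` is a hypothetical helper name.

```python
import os


def durable_write(path, data):
    """Write `data` to `path` and return only after fsync has completed.
    Assumption: the file system translates fsync into a drive cache flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # block until the OS reports the data as stable
    finally:
        os.close(fd)
```

Without the `os.fsync` call the function could return while the bytes still sit in the page cache or the drive's DRAM, which is exactly the window a power failure exploits.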

Common pitfalls#

  • Treating disk latency as a constant. Average latency hides a distribution with a long tail — p99 can be 10x the median, especially under load when the elevator queue grows.
  • Random 4 KB workloads on HDDs. A 1 GB random-read workload on a 7200 RPM drive is 262,144 reads * 9 ms ≈ 40 minutes. The same on SSD: ~26 seconds. Architectures that worked on SSD do not work on HDD.
  • Ignoring the rotational position. Two requests for sectors on the same track but on opposite sides of the platter still cost ~4 ms (half a rotation), and a sector that has just passed under the head costs nearly a full ~8 ms rotation. Scheduling that only accounts for track position misses this.
  • Trusting the drive’s write cache without a flush. A power blip with dirty cache loses the writes. Journaled file systems issue FLUSH CACHE at every commit; disabling barriers for performance is a footgun.
  • Forgetting that defragmentation matters. On a heavily fragmented HDD, sequential reads turn into random reads. Modern Linux file systems mostly avoid this; older FAT-style file systems on Windows historically needed periodic defrag.
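Two of the pitfalls above are plain arithmetic and can be checked in a few lines. The sketch assumes the per-read costs used earlier (9 ms on a 7200 RPM HDD, 0.1 ms on a SATA SSD) and treats angular positions as degrees measured in the direction in which sectors pass under the head; both helpers are illustrative only.

```python
def workload_seconds(total_bytes, io_bytes, per_io_ms):
    """Total time for a workload made of independent fixed-size random reads."""
    reads = total_bytes // io_bytes
    return reads * per_io_ms / 1000


def rotational_wait_ms(head_angle_deg, target_angle_deg, rpm=7200):
    """Wait for a sector on the current track to come around to the head.
    The platter turns one way only, so the wait depends on the angular
    distance ahead of the head, not the shorter of the two arcs."""
    rotation_ms = 60_000 / rpm
    ahead_deg = (target_angle_deg - head_angle_deg) % 360
    return ahead_deg / 360 * rotation_ms


GIB = 1 << 30
print(f"HDD: {workload_seconds(GIB, 4096, 9.0) / 60:.1f} min")  # HDD: 39.3 min
print(f"SSD: {workload_seconds(GIB, 4096, 0.1):.1f} s")         # SSD: 26.2 s
print(f"sector 180 degrees ahead: {rotational_wait_ms(0, 180):.2f} ms")
print(f"sector 10 degrees behind: {rotational_wait_ms(10, 0):.2f} ms")
```

A sector that has only just passed the head is the worst case: the drive has to wait almost a full turn even though no seek is needed.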

Why is enterprise HDD throughput often quoted as 250 MB/s and not 500 MB/s?

Spec sheets quote the outer-track sustained transfer rate, which is the full media bandwidth. Inner tracks hold fewer sectors per revolution (constant angular velocity, smaller circumference) and read at roughly half the outer-track rate. Workloads that access data across the full platter average somewhere in between. The same drive that does 250 MB/s on a fresh sequential write can drop to 120 MB/s as it fills.
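The outer-versus-inner gap falls out of geometry. Here is a rough sketch that assumes constant linear bit density, so the sequential rate scales with track radius, and hypothetical usable radii of 46 mm (outer) and 22 mm (inner) for a 3.5” platter; real drives use discrete zones and different dimensions.

```python
def track_rate_mb_per_s(radius_mm, outer_radius_mm=46.0, outer_rate_mb_per_s=250.0):
    """Sequential rate at a given radius, assuming the same bits per mm of
    track everywhere and a constant rotation speed. Radii are assumed values."""
    return outer_rate_mb_per_s * radius_mm / outer_radius_mm


for radius in (46.0, 34.0, 22.0):
    print(f"radius {radius:4.1f} mm -> {track_rate_mb_per_s(radius):6.1f} MB/s")
```

With these assumed radii the innermost track streams at a little under half the outer-track rate, in line with the 250 MB/s to 120 MB/s drop described above.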
