I/O Devices and Drivers — Operating Systems

What it is

I/O devices — disks, network cards, GPUs, keyboards — sit on a bus and talk to the CPU through a small set of hardware registers and a memory region for bulk data. The OS speaks to each device through a device driver: a piece of kernel code that knows the device’s wire protocol and exposes a uniform abstraction to the rest of the kernel (a block device, a character stream, a packet queue).

The canonical interaction is a four-step protocol: poll the device’s status register until it’s ready, write a command and operands to its registers, kick off the operation, and wait for completion. Every modern OS layers smarter mechanisms — interrupts, DMA, memory-mapped I/O — on top of that primitive to avoid burning CPU on dumb spin loops.

When to use it

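You don’t choose whether to have drivers: they make up the bulk of most kernels. Studies of the Linux source tree have found that well over half of the code is device drivers, although only a small fraction of it is loaded on any one machine. The interesting choice is how each device is driven: with polling, with interrupts, with DMA, or with a hybrid of these.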

Polling fits very-low-latency devices that complete in microseconds (modern NVMe at small queue depths, RDMA NICs). Interrupt handling overhead would dominate.
Interrupts fit devices that complete in milliseconds (HDDs, slow network) or that arrive at unpredictable times (keyboards, mice). The CPU does something useful instead of spinning.
DMA is mandatory for any bulk transfer. Without it, the CPU itself copies every word from the device’s data register to memory, so its cycles are consumed at the device’s transfer rate instead of running other work.
Memory-mapped I/O (MMIO) is the modern register-access mechanism on PCIe — the device’s registers appear in the CPU’s physical address space and are accessed with ordinary loads and stores.

How it works

A simplified canonical I/O sequence, with the CPU on the left and the device on the right:

CPU                                          Device
 |  while (STATUS == BUSY)  ;   <----------   spin until ready
 |  write DATA register          ---------->
 |  write COMMAND register       ---------->  start operation
 |  while (STATUS == BUSY)  ;   <----------   spin until done
 |  read DATA register           <----------
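
The sequence above can be sketched in C against a simulated device. This is only an illustration: the `sim_device` struct, the `CMD_DOUBLE` command, and the rule that each status read advances simulated time are all assumptions made so the loop terminates in user space; real hardware changes its status register on its own.

```c
#include <stdint.h>

/* Hypothetical register layout for a simulated device (not real hardware). */
enum { STATUS_READY = 0, STATUS_BUSY = 1 };
enum { CMD_DOUBLE = 1 };   /* made-up command: DATA = DATA * 2 */

struct sim_device {
    uint32_t status;
    uint32_t data;
    uint32_t command;
    int busy_ticks;        /* simulated time left until the device is ready */
};

/* Each status read advances simulated time; real hardware runs on its own. */
static uint32_t read_status(struct sim_device *dev)
{
    if (dev->busy_ticks > 0 && --dev->busy_ticks == 0) {
        if (dev->command == CMD_DOUBLE)
            dev->data *= 2;
        dev->command = 0;
        dev->status = STATUS_READY;
    }
    return dev->status;
}

/* Writing the command register is what starts the operation. */
static void write_command(struct sim_device *dev, uint32_t cmd)
{
    dev->command = cmd;
    dev->status = STATUS_BUSY;
    dev->busy_ticks = 3;   /* pretend the operation takes three polls */
}

/* The canonical protocol: spin, write data, write command, spin, read. */
static uint32_t polled_io(struct sim_device *dev, uint32_t operand,
                          unsigned *spins)
{
    while (read_status(dev) == STATUS_BUSY)   /* wait until ready */
        (*spins)++;
    dev->data = operand;                      /* write DATA register */
    write_command(dev, CMD_DOUBLE);           /* write COMMAND register */
    while (read_status(dev) == STATUS_BUSY)   /* wait until done */
        (*spins)++;
    return dev->data;                         /* read DATA register */
}
```

Calling `polled_io` on a zero-initialised device with operand 21 returns 42, and the `spins` counter records how many loop iterations the CPU spent doing nothing but waiting.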

The two spin loops are the obvious waste: the CPU could be running another process. Interrupts replace them. The OS puts the waiting process to sleep and schedules something else; when the device changes state it raises an IRQ line, the CPU traps to an interrupt handler in the driver, and the operating system wakes the process that was waiting on this I/O. Cost: the trap-and-restore round-trip is on the order of 1-5 µs on modern hardware.
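
A rough user-space model of the interrupt-driven flow, under stated assumptions: the `irq_device` struct, the function-pointer "IRQ line", and the increment operation are all invented for illustration, and `device_tick` stands in for the passage of time on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

struct irq_device;
typedef void (*irq_handler_fn)(struct irq_device *dev);

struct irq_device {
    uint32_t data;
    int ticks_left;          /* simulated time until the operation completes */
    irq_handler_fn handler;  /* stands in for the IRQ line plus vector table */
    bool request_done;       /* set by the handler, read by the waiter */
};

/* Interrupt handler: runs only when the device signals completion. */
static void on_irq(struct irq_device *dev)
{
    dev->request_done = true;   /* a real kernel would wake the sleeping task */
}

/* Start an operation and return at once instead of spinning.
 * Assumes duration > 0, otherwise no completion is ever signalled. */
static void submit(struct irq_device *dev, uint32_t operand, int duration)
{
    dev->data = operand;
    dev->ticks_left = duration;
    dev->request_done = false;
}

/* One unit of simulated time; the device raises its IRQ when finished. */
static void device_tick(struct irq_device *dev)
{
    if (dev->ticks_left > 0 && --dev->ticks_left == 0) {
        dev->data += 1;          /* made-up operation: increment the operand */
        dev->handler(dev);       /* raise the interrupt */
    }
}

/* The CPU runs other work until the handler flags completion. */
static unsigned run_until_done(struct irq_device *dev)
{
    unsigned useful_work = 0;
    while (!dev->request_done) {
        useful_work++;           /* stands in for running another process */
        device_tick(dev);
    }
    return useful_work;
}
```

The loop in `run_until_done` looks like a spin, but each iteration represents work done for some other process; the driver itself only runs inside `on_irq`.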

DMA replaces the per-byte copy through CPU registers. The driver tells the DMA engine “transfer N bytes from this device buffer to physical address P, then interrupt me.” The CPU is free during the transfer. For a 1 MB disk read moved 4 bytes at a time, DMA saves the CPU 262,144 register reads and as many stores to memory.
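
To make the difference concrete, here is a small sketch that contrasts the two paths. The `dma_desc` layout and both function names are hypothetical, the "device buffer" is ordinary memory, and the engine is simulated with `memcpy`; the point is only who executes the copy loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA descriptor: what the driver hands to the engine. */
struct dma_desc {
    const uint8_t *src;   /* device buffer (simulated as ordinary memory) */
    uint8_t *dst;         /* destination in main memory */
    size_t len;           /* bytes to transfer */
    int done;             /* set by the engine; stands in for the interrupt */
};

/* PIO: the CPU moves one 4-byte word per iteration.
 * Assumes len is a multiple of 4. Returns the number of register reads. */
static size_t pio_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t reads = 0;
    for (size_t off = 0; off + 4 <= len; off += 4) {
        uint32_t word;
        memcpy(&word, src + off, 4);   /* read the data register */
        memcpy(dst + off, &word, 4);   /* store the word to memory */
        reads++;
    }
    return reads;
}

/* DMA: the CPU fills in one descriptor; the engine does the copy. */
static void dma_engine_run(struct dma_desc *d)
{
    memcpy(d->dst, d->src, d->len);    /* performed by the engine, not the CPU */
    d->done = 1;                       /* completion "interrupt" */
}
```

For a 1 MB buffer, `pio_copy` reports 262,144 word reads, while the DMA path costs the CPU one descriptor write and one completion check.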

The kernel side of the driver typically exposes a small set of entry points: open, close, read, write, ioctl for character devices; submit_bio for block devices; start_xmit and a receive poll for network devices. The bulk of driver code is state-machine plumbing: tracking outstanding commands, mapping them to completion interrupts, and handling errors.
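
One way to picture the entry points is as a table of function pointers that the rest of the kernel calls through. The sketch below is a user-space toy, not the Linux `file_operations` structure: `chardev_ops`, `echo_state`, and the echo behaviour are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry-point table, loosely modelled on a character device. */
struct chardev_ops {
    int  (*open)(void *state);
    long (*read)(void *state, char *buf, size_t len);
    long (*write)(void *state, const char *buf, size_t len);
    int  (*close)(void *state);
};

/* A toy "device": a 64-byte buffer that echoes back what was written. */
struct echo_state {
    char buf[64];
    size_t used;
    int is_open;
};

static int echo_open(void *s)  { ((struct echo_state *)s)->is_open = 1; return 0; }
static int echo_close(void *s) { ((struct echo_state *)s)->is_open = 0; return 0; }

static long echo_write(void *s, const char *buf, size_t len)
{
    struct echo_state *st = s;
    if (!st->is_open)
        return -1;                      /* device must be opened first */
    if (len > sizeof st->buf)
        len = sizeof st->buf;           /* truncate to capacity */
    memcpy(st->buf, buf, len);
    st->used = len;
    return (long)len;
}

static long echo_read(void *s, char *buf, size_t len)
{
    struct echo_state *st = s;
    if (!st->is_open)
        return -1;
    if (len > st->used)
        len = st->used;                 /* never return more than was written */
    memcpy(buf, st->buf, len);
    return (long)len;
}

/* The rest of the "kernel" only ever sees this table. */
static const struct chardev_ops echo_ops = {
    .open = echo_open, .read = echo_read,
    .write = echo_write, .close = echo_close,
};
```

Because callers go through `echo_ops` and never name `echo_read` directly, swapping in a different device means supplying a different table, which is the uniform abstraction the driver layer is there to provide.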

Variants

Polling vs interrupt-driven

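The classical trade-off. Polling wastes CPU when the device is slow; interrupts add per-event overhead that hurts when the device is fast. NAPI in the Linux network stack toggles between the two: when packets start arriving, the driver masks the device’s receive interrupt and the kernel polls the receive ring in bounded batches, re-enabling the interrupt once the ring drains. The same idea applies to NVMe: Linux can poll for completions instead of taking an interrupt, and io_uring can opt into this with IORING_SETUP_IOPOLL. That polling, together with io_uring’s shared submission and completion rings that cut system-call overhead, is much of why it performs well at high queue depths.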
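
The interrupt-then-poll hand-off can be modelled in a few lines. This is a simplified sketch in the spirit of NAPI, not its actual API: the `sim_nic` struct and all three functions are hypothetical, and the rule for unmasking (ring empty) is a simplification.

```c
#include <stdbool.h>

/* Simulated NIC receive state; all names are hypothetical. */
struct sim_nic {
    int pending;        /* packets waiting in the receive ring */
    bool irq_enabled;   /* whether the device may raise interrupts */
    int irq_count;      /* interrupts actually taken */
    int processed;      /* packets handled by the driver */
};

/* Interrupt handler: mask further interrupts and hand over to polling. */
static void nic_irq(struct sim_nic *nic)
{
    nic->irq_count++;
    nic->irq_enabled = false;
}

/* A packet arrives; it interrupts only if interrupts are unmasked. */
static void nic_packet_arrives(struct sim_nic *nic)
{
    nic->pending++;
    if (nic->irq_enabled)
        nic_irq(nic);
}

/* Poll pass: handle up to `budget` packets, then unmask if the ring is
 * empty. Returns the number of packets handled in this pass. */
static int nic_poll(struct sim_nic *nic, int budget)
{
    int done = 0;
    while (done < budget && nic->pending > 0) {
        nic->pending--;
        nic->processed++;
        done++;
    }
    if (nic->pending == 0)
        nic->irq_enabled = true;   /* drained: go back to interrupt mode */
    return done;
}
```

With interrupts initially enabled, a burst of ten packets costs one interrupt; the remaining nine are picked up by `nic_poll` in batches bounded by the budget.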

Programmed I/O (PIO) vs DMA

PIO has the CPU copy data through device registers. DMA has a separate engine do the copy. Every modern bulk device uses DMA; PIO survives only for tiny transfers where setting up a DMA descriptor would cost more than the copy itself (a few-byte command response, for example).

Memory-mapped I/O vs port I/O

x86 historically had separate in / out instructions that addressed a separate I/O address space (port I/O). Modern PCIe devices expose their registers in the regular physical address space, where the driver reaches them with ordinary load and store instructions; in C that means volatile accesses (or the kernel’s accessor helpers) plus memory barriers where ordering matters. MMIO is what nearly every new device uses; port I/O survives mainly for legacy x86 hardware.
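
A minimal sketch of what MMIO register accessors look like in C. The register offsets and the helper names `mmio_read32` / `mmio_write32` are hypothetical; in a real Linux driver the base pointer would come from a mapping call such as ioremap() and the accesses would go through readl() / writel(), which also take care of ordering.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register offsets for an imaginary device. */
enum { REG_STATUS = 0x00, REG_DATA = 0x04, REG_COMMAND = 0x08 };

/* volatile forces the compiler to emit one real load per call. */
static inline uint32_t mmio_read32(const volatile void *base, size_t off)
{
    return *(const volatile uint32_t *)((const volatile uint8_t *)base + off);
}

/* Likewise, every call produces exactly one store to the register. */
static inline void mmio_write32(volatile void *base, size_t off, uint32_t val)
{
    *(volatile uint32_t *)((volatile uint8_t *)base + off) = val;
}
```

In a test harness the "register block" can simply be a `uint32_t` array in ordinary memory, which is enough to check offsets and widths; it does not exercise ordering or caching behaviour, which only show up on real device memory.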

Doorbell-based command queues

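Modern devices (NVMe, virtio, RDMA NICs) don’t use a single command register. The driver writes a command into a ring buffer in main memory and rings a doorbell: a single MMIO write that tells the device “look at the queue.” The device fetches commands from the ring and writes completions to a separate ring. Completions are not guaranteed to arrive in submission order (NVMe, for one, makes no such promise), so each one carries an identifier that ties it back to its command. One doorbell write can cover a whole batch, which amortises register-write cost across many commands and is the foundation for 1M+ IOPS on NVMe.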
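
The queue-plus-doorbell pattern can be sketched as a pair of rings in memory. Everything here is a simplified assumption: the entry layouts, the `queue_pair` struct, the doubling "operation", and a simulated device that happens to complete in order. It does not follow the NVMe or virtio wire formats.

```c
#include <stdint.h>

#define RING_SIZE 8   /* power of two so indices wrap with a mask */

struct sq_entry { uint16_t id; uint32_t operand; };
struct cq_entry { uint16_t id; uint32_t result; };

/* Hypothetical queue pair; on real hardware both rings live in DMA memory. */
struct queue_pair {
    struct sq_entry sq[RING_SIZE];
    struct cq_entry cq[RING_SIZE];
    uint32_t sq_tail;          /* driver-owned: next free submission slot */
    uint32_t sq_head;          /* device-owned: next submission to fetch */
    uint32_t cq_tail;          /* device-owned: next free completion slot */
    uint32_t cq_head;          /* driver-owned: next completion to consume */
    uint32_t doorbell;         /* stands in for the MMIO doorbell register */
    uint32_t doorbell_writes;  /* how many "MMIO" writes the driver issued */
};

/* Driver: queue a command in memory. No register access happens here.
 * Refuses when RING_SIZE commands are submitted but not yet reaped. */
static int qp_submit(struct queue_pair *qp, uint16_t id, uint32_t operand)
{
    if (qp->sq_tail - qp->cq_head == RING_SIZE)
        return -1;
    qp->sq[qp->sq_tail & (RING_SIZE - 1)] = (struct sq_entry){ id, operand };
    qp->sq_tail++;
    return 0;
}

/* Driver: one "MMIO" write publishes every command queued so far. */
static void qp_ring_doorbell(struct queue_pair *qp)
{
    qp->doorbell = qp->sq_tail;
    qp->doorbell_writes++;
}

/* Simulated device: fetch up to the doorbell value, post completions. */
static void device_process(struct queue_pair *qp)
{
    while (qp->sq_head != qp->doorbell) {
        struct sq_entry e = qp->sq[qp->sq_head++ & (RING_SIZE - 1)];
        qp->cq[qp->cq_tail++ & (RING_SIZE - 1)] =
            (struct cq_entry){ e.id, e.operand * 2 };   /* made-up operation */
    }
}

/* Driver: consume one completion if available. Returns 1 on success. */
static int qp_reap(struct queue_pair *qp, struct cq_entry *out)
{
    if (qp->cq_head == qp->cq_tail)
        return 0;
    *out = qp->cq[qp->cq_head++ & (RING_SIZE - 1)];
    return 1;
}
```

Submitting three commands and then calling `qp_ring_doorbell` once shows the amortisation: three commands, one register write. The `id` field is what lets a driver match each completion to the command that produced it.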

Trade-offs

Interrupts — CPU is free between events. Wins when device latency is large compared to interrupt overhead (HDD seek time ~5-10 ms, interrupt cost ~1 µs). Loses when events come in tight bursts — interrupt storms can pin a CPU.

Polling — predictable, no trap overhead, no interrupt storms. Wins when latency matters and the device is fast (NVMe ~10 µs completions). Loses when the device is mostly idle — the CPU spins for nothing.

Interrupt coalescing. NICs and SSDs aggregate completions and fire a single interrupt once a configured time window or event count is reached (for example ~64 µs or ~64 events). Reduces interrupt rate at the cost of a small latency increase.
Per-CPU queues. Modern NICs and NVMe expose multiple submission/completion queues — one per CPU — so locking on the hot path disappears. Driver complexity rises sharply.
IOMMU. The OS wants DMA to be safe — a buggy device shouldn’t be able to scribble over kernel memory. The IOMMU is a paging unit for DMA: every device DMA goes through an address translation that the kernel controls. Cost: TLB misses on the IOMMU’s translation cache.
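
The interrupt-coalescing rule from the list above ("fire after N events or T microseconds, whichever comes first") can be sketched as a small state machine. The `coalescer` struct and its function names are hypothetical, and real devices implement this in hardware or firmware with their own thresholds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical coalescing state; thresholds would be device settings. */
struct coalescer {
    uint32_t max_events;     /* fire after this many completions ... */
    uint64_t max_delay_us;   /* ... or once the oldest has waited this long */
    uint32_t pending;        /* completions not yet signalled */
    uint64_t first_us;       /* arrival time of the oldest pending completion */
    uint32_t interrupts;     /* interrupts raised so far */
};

/* Raise one interrupt that covers every pending completion. */
static void fire(struct coalescer *c)
{
    c->interrupts++;
    c->pending = 0;
}

/* Record a completion at time now_us; returns true if an interrupt fired. */
static bool coalesce_event(struct coalescer *c, uint64_t now_us)
{
    if (c->pending == 0)
        c->first_us = now_us;
    c->pending++;
    if (c->pending >= c->max_events) {
        fire(c);
        return true;
    }
    return false;
}

/* Timer check: fire if the oldest pending completion has waited too long. */
static bool coalesce_timer(struct coalescer *c, uint64_t now_us)
{
    if (c->pending > 0 && now_us - c->first_us >= c->max_delay_us) {
        fire(c);
        return true;
    }
    return false;
}
```

The time bound is what caps the added latency: a lone completion on an otherwise idle device waits at most `max_delay_us` before the CPU hears about it.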

Common pitfalls

Forgetting volatile on MMIO registers. The compiler will helpfully cache the value of *device_status in a register and never re-read it. The device never appears to change state. Symptom: driver works in -O0, hangs in -O2.
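Missing memory barriers. The compiler may reorder ordinary memory accesses around MMIO accesses on any architecture, and on weakly-ordered architectures (ARM, POWER) the CPU may reorder them as well. A common failure is ringing a doorbell before the descriptor it points at is visible in memory. Always pair MMIO with the architecture’s barrier (mb(), wmb(), rmb() in Linux).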
Stale DMA descriptors. Driver hands a DMA buffer to the device, then frees it before the device finishes. The device writes into freed memory; later allocations get corrupted. Always wait for completion before recycling DMA memory.
Interrupt handler doing too much. Linux splits interrupt handling into a top half (acknowledge, schedule work) and a bottom half (softirq / tasklet / threaded IRQ). A naive handler that does memcpy(1 MB) in the top half keeps other interrupts on that CPU blocked for the whole copy.
Polling at full speed. A spin loop on the status register uses 100% of a CPU. If the device takes a millisecond, that’s about 3 million wasted cycles on a 3 GHz core. Put a spin-wait hint in the loop (cpu_relax() in Linux, which is the PAUSE instruction on x86) for short waits, and sleep or yield to the scheduler for long ones.
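
Two of the pitfalls above (missing volatile, unbounded spinning) can be addressed together with a bounded poll helper. This is a portable sketch: `poll_ready`, `spin_hint`, and the `STATUS_BUSY` bit are hypothetical names, and `spin_hint` is left empty where a kernel would place its spin-wait hint.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_BUSY 0x1u   /* hypothetical busy bit in the status register */

/* Placeholder spin-wait hint; a kernel would use cpu_relax() here. */
static inline void spin_hint(void)
{
    /* intentionally empty in this portable sketch */
}

/* Poll a status register at most max_spins times. Returns true once the
 * busy bit clears, false if the limit is reached first. The volatile
 * qualifier makes the compiler re-read the register on every iteration. */
static bool poll_ready(const volatile uint32_t *status, uint32_t max_spins)
{
    for (uint32_t i = 0; i < max_spins; i++) {
        if ((*status & STATUS_BUSY) == 0)
            return true;
        spin_hint();
    }
    return false;
}
```

On a false return the caller can fall back to sleeping or to an interrupt-driven wait instead of pinning the CPU, which also turns a wedged device into a reportable timeout in place of a hang.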

Why does Linux still use threaded IRQs in 2026?

Hard IRQ handlers run with interrupts disabled on the local CPU, so every other device that interrupts that CPU is kept waiting. Modern Linux pushes most work into kthreads via request_threaded_irq() so the hard handler just acknowledges the device and wakes a thread. Latency is a touch worse, but interrupt-disable windows shrink from milliseconds to microseconds, which keeps audio drivers and high-frequency NICs happy.
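
The split can be modelled in user space as two functions with very different amounts of work. This is a sketch only: `split_dev` and both handlers are hypothetical, the return codes are stand-ins for Linux’s IRQ_NONE and IRQ_WAKE_THREAD, and the "thread" is simply called directly where a kernel would schedule it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical return codes standing in for IRQ_NONE / IRQ_WAKE_THREAD. */
enum irq_result { IRQ_NOT_MINE, IRQ_WAKE };

struct split_dev {
    bool irq_asserted;       /* device is currently raising its interrupt */
    bool work_pending;       /* deferred work waiting for the thread */
    uint32_t bytes_copied;   /* total moved by the thread function */
};

/* Hard handler: acknowledge and defer. Nothing slow happens here. */
static enum irq_result hard_handler(struct split_dev *d)
{
    if (!d->irq_asserted)
        return IRQ_NOT_MINE;     /* shared line, some other device fired */
    d->irq_asserted = false;     /* acknowledge so the device stops asserting */
    d->work_pending = true;
    return IRQ_WAKE;
}

/* Thread function: does the slow work with interrupts enabled. */
static void thread_fn(struct split_dev *d, uint32_t nbytes)
{
    if (!d->work_pending)
        return;
    d->bytes_copied += nbytes;   /* stands in for the large copy */
    d->work_pending = false;
}
```

The hard handler touches three fields and returns, so the interrupt-disabled window stays short no matter how large `nbytes` is when the thread eventually runs.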

Hard Disk Drives — the canonical slow-and-mechanical I/O device.
Flash SSDs and the Flash Translation Layer — modern fast I/O that pushed driver design into doorbell queues.
File System Implementation — the block layer that sits on top of disk drivers.
Limited Direct Execution — the trap-and-handler mechanism that interrupts ride on.
Context Switching — what the kernel does when an interrupt wakes a different process than was running.