Swapping — Mechanisms — Operating Systems

What it is#

Swapping is the OS mechanism that lets the memory in use by processes exceed physical RAM. The kernel maintains swap space on disk (a swap partition or swap file) where it can spill pages whose contents are not currently needed, freeing their physical frames for other use. When the process accesses a page that has been swapped out, the hardware raises a page fault, the kernel reads the page back from disk, installs it in a physical frame, updates the PTE, and resumes the faulting instruction. From the process's point of view, nothing happened; the load just took microseconds to milliseconds instead of nanoseconds.

The same machinery underlies several adjacent features:

Demand paging — pages aren’t loaded until first touch. Executables mmap their text segment; pages fault in on demand.
Copy-on-write — fork shares pages read-only between parent and child; the first write traps and copies.
Memory-mapped files — the contents of a file are mapped into the address space and faulted in from the file system instead of swap.
Zero-fill on demand — bss and freshly-allocated anonymous pages point to a shared zero page until first write.

The unifying primitive is “the present bit + the page-fault handler.” Everything else is policy on top.

When to use it#

The kernel does swapping automatically once you enable a swap device. The user choices are around tuning and avoidance:

Servers with predictable working sets often run with swap off (or vm.swappiness = 0). A swap-in under load adds anywhere from ~100 µs (SSD) to ~10 ms (HDD) of latency to a single memory access; better to OOM-kill cleanly.
Desktops and laptops keep swap on, accept occasional pauses, and benefit from being able to suspend-to-disk (hibernate writes RAM to swap).
Memory-overcommitted hosts (cloud VMs, containers) lean on swap as a relief valve, accepting the latency hit so a single bad tenant doesn’t OOM the host.
mmap-heavy workloads (databases, media servers) rely on the same demand-paging path to load file pages on access.

Even with swap “off,” the OS still uses demand paging for file-backed pages — the mechanism never goes away, only the swap target does.

How it works#

The present bit and what triggers a fault#

Every PTE has a present bit (P=1 means the page is in physical memory). When the MMU walks the page table and finds P=0, or finds that the access violates the PTE's permission bits (for example, a write to a read-only page), it raises a page fault. The hardware passes the faulting virtual address (in CR2 on x86) and an error code describing the access type (read/write/exec, user/kernel).

The kernel’s fault handler classifies the fault:

page fault on virtual address V:

  1. find the VMA covering V (process's memory regions)
     - none      → SIGSEGV (invalid access)
     - has VMA   → continue

  2. check permissions against VMA
     - mismatch  → SIGSEGV
     - OK        → continue

  3. examine the PTE:
     - PTE absent + anonymous VMA       → zero-fill demand
     - PTE absent + file-backed VMA     → read page from file via page cache
     - PTE not present (P=0), swap entry stored
                                         → read from swap, install
     - PTE present + write but R/W=0 + COW marker
                                         → copy page, mark R/W, retry
     - PTE present + write but VMA RO    → SIGSEGV

After fix-up, the kernel updates the PTE (P=1, PFN set, permissions set) and returns. Hardware retries the faulting instruction with the now-valid translation.
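The classification above can be sketched as a small decision function. This is a toy model, not kernel code: the `VMA` and `PTE` classes, their field names, and the returned action strings are all assumptions made for illustration, and only write permission is modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VMA:
    """Hypothetical memory region: [start, end) with simplified attributes."""
    start: int
    end: int
    writable: bool
    file_backed: bool


@dataclass
class PTE:
    """Hypothetical page-table entry; real PTEs pack these into one word."""
    present: bool = False
    writable: bool = False
    swap_entry: Optional[int] = None  # set when the page lives in swap
    cow: bool = False                 # frame is shared copy-on-write


def classify_fault(addr: int, is_write: bool,
                   vmas: List[VMA], pte: Optional[PTE]) -> str:
    # Step 1: find the VMA covering the faulting address.
    vma = next((v for v in vmas if v.start <= addr < v.end), None)
    if vma is None:
        return "SIGSEGV"
    # Step 2: check the access against the VMA (write permission only here).
    if is_write and not vma.writable:
        return "SIGSEGV"
    # Step 3: examine the PTE.
    if pte is None or not pte.present:
        if pte is not None and pte.swap_entry is not None:
            return "swap-in"
        return "file-read" if vma.file_backed else "zero-fill"
    if is_write and not pte.writable and pte.cow:
        return "cow-copy"
    # Present and permitted: nothing to fix (e.g. another thread got there first).
    return "spurious"
```

For example, a write to an untouched anonymous region yields `"zero-fill"`, while the same write against a PTE carrying a swap entry yields `"swap-in"`.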

Storing the swap reference#

When the kernel swaps a page out, it doesn’t lose the information about where the page lives. Instead of a PFN, the PTE is repurposed to hold a swap entry — a (swap-device-id, slot-number) pair encoded into the PTE bits with P=0. The page-fault handler reads this back, fetches the page from the named slot, and installs it.

PTE before swap-out:    | PFN | P=1 | flags |
PTE after swap-out:     | swap-device-id | swap-slot | P=0 |
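A minimal sketch of that encoding, assuming a made-up layout: bit 0 is the present bit, the next 5 bits hold the swap device id, and the remaining bits hold the slot number. Real layouts are architecture-specific, so the constants and function names here are hypothetical.

```python
PRESENT_BIT = 1 << 0          # assumed position of the P bit
TYPE_SHIFT = 1                # assumed: device id starts at bit 1
TYPE_BITS = 5                 # assumed: up to 32 swap devices
OFFSET_SHIFT = TYPE_SHIFT + TYPE_BITS


def make_swap_pte(device_id: int, slot: int) -> int:
    """Pack (device_id, slot) into a non-present PTE value."""
    if not 0 <= device_id < (1 << TYPE_BITS):
        raise ValueError("device_id does not fit in the type field")
    if slot < 0:
        raise ValueError("slot must be non-negative")
    # P stays 0, so the MMU faults on any access to this entry.
    return (slot << OFFSET_SHIFT) | (device_id << TYPE_SHIFT)


def decode_swap_pte(pte: int) -> tuple:
    """Recover (device_id, slot) from a non-present PTE value."""
    if pte & PRESENT_BIT:
        raise ValueError("PTE is present; it holds a PFN, not a swap entry")
    device_id = (pte >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)
    slot = pte >> OFFSET_SHIFT
    return device_id, slot
```

In this sketch `(0, 0)` encodes to an all-zero value, which would be indistinguishable from an empty PTE; the sketch ignores that corner case.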

Demand paging an executable#

Running ./prog:

execve parses the ELF header, maps the text segment as a file-backed region with P=0 PTEs everywhere.
CPU starts executing at the program’s entry point; first instruction fault → kernel reads the page from disk via the page cache, installs the PTE, returns.
Each new region (code, data, .rodata) faults in lazily as the program touches it.

No “loader” preloads the binary; the page-fault handler is the loader.
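The same lazy path can be observed from user space with a file-backed mapping. The sketch below assumes a Unix-like system; `make_demo_file`, `fault_count`, and `touch_every_page` are hypothetical helper names. It maps a file, reads one byte per page, and reports how many page faults the process took while doing so. The count is noisy, because the interpreter itself faults and the kernel may map several pages per fault.

```python
import mmap
import os
import resource
import tempfile


def make_demo_file(n_pages: int) -> str:
    """Create a temporary file with one distinct letter per page."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(n_pages):
            f.write(bytes([ord("A") + i % 26]) * mmap.PAGESIZE)
    return path


def fault_count() -> int:
    """Minor plus major page faults taken by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt + usage.ru_majflt


def touch_every_page(path: str):
    """Map the file read-only and read the first byte of each page."""
    before = fault_count()
    with open(path, "rb") as f:
        # Creating the mapping reads nothing; pages arrive on first access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            first_bytes = bytes(m[off] for off in range(0, len(m), mmap.PAGESIZE))
    return first_bytes, fault_count() - before
```

Calling `touch_every_page(make_demo_file(3))` returns the first byte of each page together with the fault delta.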

Copy-on-write on `fork`#

When a process forks:

The kernel duplicates the parent’s page table, but marks every writable PTE as read-only in both parent and child.
The kernel sets a per-page reference count so it knows both processes share the frame.
On the first write to a page by either process, the MMU faults (the page is read-only).
The handler sees the COW marker, allocates a fresh physical frame, copies the page, installs the new PTE writable in the writer’s table. The other process keeps the original.

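Fork is cheap until you write. Forking a 4 GB process copies only page tables (about a million 4 KB PTEs, typically milliseconds to tens of milliseconds of work) rather than 4 GB of data, but a child that writes everything pays 4 GB of copies. Redis's BGSAVE relies heavily on this.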
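The isolation that COW provides is visible from user space. The sketch below assumes a POSIX system where `os.fork` is available; `cow_demo` is a hypothetical name. Parent and child share a private anonymous mapping after the fork, the child writes to it, and the parent's copy stays unchanged.

```python
import mmap
import os


def cow_demo(size: int = mmap.PAGESIZE * 4):
    """Return (parent_view, child_view) of the same page after the child writes."""
    # Private anonymous mapping: frames are shared copy-on-write across fork.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[:6] = b"parent"
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the first write to the page triggers the copy.
        os.close(read_fd)
        buf[:6] = b"child!"
        os.write(write_fd, buf[:6])
        os._exit(0)
    # Parent: collect what the child saw, then inspect our own page.
    os.close(write_fd)
    child_view = os.read(read_fd, 6)
    os.close(read_fd)
    os.waitpid(pid, 0)
    parent_view = bytes(buf[:6])
    buf.close()
    return parent_view, child_view
```

The child's write lands in a fresh frame, so the parent still reads `b"parent"` while the child reported `b"child!"`.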

Prefetching and clustering#

Reading a single page from swap is wasteful: the fixed cost of an I/O request dominates, so reading 8 or 16 contiguous pages costs nearly the same as reading one. The kernel clusters nearby swap slots into a single I/O when faulting and prefaults adjacent virtual pages opportunistically. The same trick applies to file-backed faults via the page-cache readahead path.
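One simple way to pick the cluster is an aligned window around the faulting slot. The sketch below is a simplification under that assumption; the function name and parameters are hypothetical, and the default of 3 mirrors Linux's `vm.page-cluster` default (2^3 = 8 pages).

```python
from typing import Optional


def swap_readahead_window(fault_slot: int, page_cluster: int = 3,
                          total_slots: Optional[int] = None) -> range:
    """Slots to read in one I/O: the aligned 2**page_cluster block holding fault_slot."""
    size = 1 << page_cluster
    start = fault_slot & ~(size - 1)   # round down to a multiple of size
    end = start + size
    if total_slots is not None:
        end = min(end, total_slots)    # do not read past the end of the swap area
    return range(start, end)
```

A fault on slot 21 with the default cluster size reads slots 16 through 23 in one request.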

When eviction actually happens#

Swap-out is driven by memory pressure, not by access. The kernel runs a background reclaimer (kswapd on Linux) that wakes when free memory falls below a watermark. It walks active and inactive page lists, picks victims via the replacement policy (see Page Replacement Policies), and writes dirty victims to swap. The application's page-fault rate stays low until the working set exceeds RAM; after that, each faulting access is roughly 1000× slower than RAM on SSD and 100,000× slower on HDD (RAM ~100 ns, SSD ~100 µs, HDD ~10 ms).
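The effect of even a small fault rate can be estimated with the usual effective-access-time formula, using the latency figures above. The function name is hypothetical, and the model ignores queuing and readahead.

```python
def effective_access_ns(fault_rate: float, ram_ns: float = 100.0,
                        fault_ns: float = 100_000.0) -> float:
    """Average cost of a memory access when a fraction fault_rate goes to swap.

    Defaults assume ~100 ns RAM and ~100 us SSD swap-in.
    """
    if not 0.0 <= fault_rate <= 1.0:
        raise ValueError("fault_rate must be between 0 and 1")
    return (1.0 - fault_rate) * ram_ns + fault_rate * fault_ns


# One access in a thousand faulting to SSD roughly doubles the average cost.
ssd_avg = effective_access_ns(0.001)
# The same rate against a ~10 ms HDD makes the average about 100x worse.
hdd_avg = effective_access_ns(0.001, fault_ns=10_000_000.0)
```

With these numbers the SSD case averages about 200 ns per access and the HDD case about 10 µs.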

Variants#

Swap partition vs. swap file#

A swap partition is a raw block device dedicated to swap. Faster (no file-system overhead). Inflexible (must be sized at install time).

A swap file is a regular file marked for swap use via mkswap and swapon. It allocates extents in the file system; performance is close to a partition on modern file systems (ext4, XFS), provided the file is fully allocated with no holes. Easier to resize.
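On Linux, `/proc/swaps` lists the active swap areas and says whether each is a partition or a file. The sketch below parses text in that format; the function name and the sample contents are made up for illustration.

```python
def parse_proc_swaps(text: str) -> list:
    """Parse /proc/swaps-style text into a list of dicts (sizes in KiB)."""
    entries = []
    for line in text.splitlines()[1:]:      # first line is the column header
        fields = line.split()
        if len(fields) != 5:
            continue                        # skip blank or malformed lines
        name, kind, size_kib, used_kib, priority = fields
        entries.append({
            "name": name,
            "type": kind,                   # "partition" or "file"
            "size_kib": int(size_kib),
            "used_kib": int(used_kib),
            "priority": int(priority),
        })
    return entries


# Illustrative sample, not output from a real machine.
SAMPLE = (
    "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
    "/dev/sda2                               partition\t2097148\t\t0\t\t-2\n"
    "/swapfile                               file\t\t1048572\t\t1024\t\t-3\n"
)
```

On a live system the same function can be fed the contents of `/proc/swaps` directly.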

zswap / zram (compressed swap)#

Instead of writing to disk, compress pages in RAM. zram is a compressed RAM block device used as swap — gives you “more RAM” at CPU cost. zswap is a write-back cache in front of disk swap — compressed pages go to RAM first, only get written to disk on RAM pressure. Both are common on memory-constrained devices (Android, ChromeOS) and cloud VMs.
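How much compressed swap helps depends on how compressible the pages are. The sketch below uses `zlib` purely as a stand-in compressor (an assumption; it is not what zram or zswap use) to compare an all-zero page with a page of random bytes.

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed page size for the illustration


def compressed_size(page: bytes) -> int:
    """Bytes needed to store one page after compression with the stand-in codec."""
    if len(page) != PAGE_SIZE:
        raise ValueError("expected exactly one page of data")
    return len(zlib.compress(page))


zero_page = bytes(PAGE_SIZE)          # highly compressible
random_page = os.urandom(PAGE_SIZE)   # effectively incompressible

zero_cost = compressed_size(zero_page)
random_cost = compressed_size(random_page)
```

The zero page shrinks to a few dozen bytes, while the random page does not shrink at all, which is why compressed swap pays off on some workloads and only burns CPU on others.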

Pageout daemons#

Linux runs kswapd per NUMA node. Direct reclaim happens in the allocating thread when watermarks aren’t met. The interaction between background and direct reclaim is the source of many “Linux feels stuttery” complaints when the system is near OOM.
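The split between background and direct reclaim can be pictured as a comparison of free memory against per-zone watermarks. The sketch below is a simplified model, assuming three thresholds named after Linux's min, low, and high watermarks; the function names and returned strings are hypothetical.

```python
def reclaim_action(free_pages: int, wm_min: int, wm_low: int) -> str:
    """What an allocation triggers, given current free memory (simplified)."""
    if free_pages < wm_min:
        # Too low for background reclaim to keep up: the allocating
        # thread reclaims pages itself and stalls while doing so.
        return "direct-reclaim"
    if free_pages < wm_low:
        # Below the low watermark: wake the background reclaimer.
        return "wake-kswapd"
    return "none"


def kswapd_should_sleep(free_pages: int, wm_high: int) -> bool:
    """Background reclaim keeps going until free memory reaches the high watermark."""
    return free_pages >= wm_high
```

The stutter described above corresponds to the `"direct-reclaim"` branch: the thread that asked for memory pays for freeing it.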

Hibernate#

A special case in which the entire RAM contents are written to swap as a snapshot and read back on resume. The code path differs from online swap, but it reuses the same swap storage.

Trade-offs#

Swap on (with backing storage) — graceful behaviour under memory spikes, cold pages spill to disk, OOM-kill is rare. Latency cost: a swap-in from SSD is ~100 µs, from HDD ~10 ms; a thrashing workload becomes I/O-bound. Suspend-to-disk works.

Swap off — predictable latency, no surprise stalls, easier capacity planning. Cost: any over-commit triggers OOM-kill instead of slowdown, and you can’t hibernate. Standard on production databases and latency-sensitive servers.

Other tensions:

Read-ahead aggressiveness. Bigger clusters amortise seek costs but waste I/O on random-access workloads. Tunable per device.
vm.swappiness. Higher values prefer swapping anonymous pages; lower values prefer dropping file cache. Default 60; databases often set 1 or 10.
COW granularity. Page-level COW is cheap on small writes; huge-page COW (2 MB) is expensive on small writes because the whole 2 MB page copies.
Sync vs. async writeback. Eager writeback keeps dirty pages low but consumes I/O. Lazy writeback defers but spikes latency under pressure.
OOM-killer vs. infinite swap. Without OOM-kill, a runaway process can write the entire working set to swap and trigger hours of thrash. Modern kernels invoke the OOM killer when reclaim fails, and oom_score_adj lets you bias which process is chosen.
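For the vm.swappiness knob in the list above, the current value can be read from procfs on Linux. A minimal sketch, assuming the standard `/proc/sys/vm/swappiness` path; the function name is hypothetical, and it returns `None` where the file is missing or unreadable (for example on non-Linux systems).

```python
from pathlib import Path
from typing import Optional


def read_swappiness(path: str = "/proc/sys/vm/swappiness") -> Optional[int]:
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None
```

Changing the value requires root and is usually done with sysctl rather than from application code.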

Why is a swap-in so much slower than expected?

The cost isn't just the disk read; it's the entire fault path. On a TLB miss, the walker finds P=0 and traps into the kernel. The kernel saves state, finds the VMA, decodes the swap entry, allocates a frame, issues a disk read into it, blocks the thread, and schedules someone else. When the read completes, the kernel updates the PTE, wakes the thread, and returns so the faulting instruction can retry. The actual disk I/O is one part. On SSD with ~100 µs reads, the OS path adds tens of µs of overhead. On HDD with ~10 ms reads, the OS path is noise.

Common pitfalls#

Confusing swap with file-backed page cache. Both go through the page-fault handler, but file pages are dropped (clean) and reloaded from the original file, not from swap. Disabling swap doesn’t disable demand-paging executables.
Assuming swap-out frees memory immediately. The reclaimer marks pages for writeout, but the write must complete before the frame is reusable. Heavy reclaim under I/O pressure stalls the whole machine.
Letting a database swap. Postgres, MySQL, MongoDB all maintain their own buffer pool tuned to physical RAM. Letting them swap to disk replaces their carefully ordered I/O with random page faults. Disable swap or set swappiness=1.
fork-on-large-RSS in long-lived processes. A fork over a 50 GB Redis instance copies only page tables, but if the child or parent then writes a lot, the COW path can produce millions of page faults (50 GB is about 13 million 4 KB pages). Use MADV_DONTFORK on regions the child does not need, or use threads instead.
Hot threads in COW-deferred regions. Code that touches a “shared” page in many threads after fork will serialise on the COW path until one thread does the copy.
Ignoring the order of reclaim. Linux generally prefers dropping clean file cache to swapping anonymous pages, but the preference is not absolute. If file cache pages look hot to the kernel, it may swap out your heap to defend a cache you don't need. Tuning swappiness and vfs_cache_pressure lets you bias the order.

Paging Fundamentals — the present bit is the trigger for swap-in.
Page Replacement Policies — what decides which pages to swap out.
Translation Lookaside Buffers (TLBs) — swap-out clears a live PTE, which requires TLB invalidation; swap-in installs a new one.
Multi-Level Page Tables — the tree the walker traverses to find P=0.
Linux Virtual Memory System — the production deployment of the swap machinery.