Linux Virtual Memory System — Operating Systems

What it is

The Linux virtual-memory subsystem is the kernel layer responsible for every memory operation in the system: maintaining per-process address spaces, walking and updating page tables, servicing page faults, managing the page cache that backs the file system, reclaiming pages under pressure, and enforcing security boundaries (KASLR, KPTI, NX, SMEP/SMAP). It ties together almost every concept in the memory-virtualization stack — paging, TLBs, swap, replacement, COW, huge pages, NUMA — into a single shipping implementation that runs on machines from Raspberry Pi to 192-core servers with terabytes of RAM.

The codebase is large (~100k lines across mm/) but the architecture has clean layers: the VMA layer (per-process regions), the page-table layer (the actual mappings), the page cache (file-backed data shared across processes), the reclaim subsystem (kswapd, direct reclaim, OOM-killer), and the per-zone / per-node allocators (buddy, slab). Understanding how these interact is what “knowing Linux VM” means in interviews and on-call rotations.

Architecture overview

                     user processes
                          │
                          │  load/store
                          ▼
                    ┌────────────┐
                    │    MMU     │  ← per-process page table (CR3)
                    │  + TLB     │
                    └─────┬──────┘
              fault       │   hit
                          │
       ┌──────────────────┴───────────────────┐
       │           kernel mm/                  │
       │                                       │
       │  VMA layer  → page-fault handler      │
       │  page tables (4 or 5 levels)          │
       │  page cache (file-backed)             │
       │  anon mem  (heap, stack, mmap-anon)   │
       │  reclaim   (kswapd, lru lists, oom)   │
       │  buddy + slab allocators              │
       │  swap subsystem                       │
       └──────────────────┬────────────────────┘
                          │  block I/O
                          ▼
                       storage

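Every memory access is a load or store issued by the CPU. The MMU translates the virtual address through the active page table: on a TLB hit the translation is immediate; on a TLB miss the hardware page walker reads the tables and fills the TLB; on a fault (no valid mapping, or a permission violation) the CPU traps into the kernel’s page-fault handler (do_page_fault, named exc_page_fault on recent x86 kernels), which dispatches to the higher-level subsystems.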

Address space layout

On x86_64, a 64-bit Linux process uses 48 bits of virtual address (57 with 5-level paging enabled). The space splits in half: the low half is user, the high half is kernel, and the non-canonical range between them is unusable.

0xFFFFFFFFFFFFFFFF  ┌────────────────────────────┐
                    │   kernel (high half)        │
                    │   - direct map of all RAM   │
                    │   - vmalloc area            │
                    │   - kernel text             │
                    │   - per-cpu / fixmap        │
0xFFFF800000000000  ├────────────────────────────┤  (non-canonical hole)
                    │     hole (~16 EB)           │
0x00007FFFFFFFFFFF  ├────────────────────────────┤
                    │   stack (grows down)        │
                    │   - randomised by ASLR      │
                    │   ...                       │
                    │   mmap area                 │
                    │   - shared libraries        │
                    │   - mmap'd files & anon     │
                    │   ...                       │
                    │   heap (grows up)           │
                    │   bss / data / text         │
0x0000000000400000  └────────────────────────────┘  (typical non-PIE text base; PIE binaries load at a randomised address)
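
The split in the diagram follows from the canonical-address rule: every bit above the top implemented virtual-address bit must be a copy of that bit. A minimal sketch of the check, assuming 48-bit addressing by default (the function names are illustrative, not kernel APIs):

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63 down to va_bits-1 are all equal (sign-extended)."""
    top = addr >> (va_bits - 1)                  # bit va_bits-1 and everything above
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def classify(addr: int, va_bits: int = 48) -> str:
    """Place a 64-bit virtual address in the user half, kernel half, or hole."""
    if not is_canonical(addr, va_bits):
        return "hole"
    return "kernel" if addr >> 63 else "user"


if __name__ == "__main__":
    for a in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(f"{a:#018x} -> {classify(a)}")
```

Passing va_bits=57 models 5-level paging, where the first address of the 48-bit hole becomes an ordinary user address.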

The kernel half is the same in every process: a single set of kernel mappings shared via the top half of the PML4 (the top-level table, which Linux calls the PGD). KPTI breaks this for Meltdown defence (see Security section). Per-process kernel state (kernel stacks, page tables, the file descriptor table) lives in the kernel half, in per-process structures.

Each region is represented in software by a struct vm_area_struct (VMA) in the process’s mm_struct. A VMA records [start, end), permissions, the backing file (or NULL for anonymous), and flags. Kernels before 6.1 kept VMAs in a red-black tree plus a linked list; 6.1 replaced both with a maple tree, which is what the fault path searches to find the VMA covering an address.
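
User space can see a process’s VMAs through /proc/<pid>/maps, one line per region. The sketch below parses that format into simple records and looks up the region containing an address with a binary search. It is a user-space illustration only: the Vma class, parse_maps, lookup_vma, and the sample text are hypothetical names and data, and the kernel’s own lookup uses its tree structures rather than a sorted list.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Vma:
    start: int   # inclusive
    end: int     # exclusive
    perms: str   # e.g. "r-xp"
    path: str    # backing file or pseudo-name; "" for plain anonymous memory


def parse_maps(text: str) -> List[Vma]:
    """Parse /proc/<pid>/maps text: 'start-end perms offset dev inode [path]'."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append(Vma(int(start_s, 16), int(end_s, 16), fields[1], path))
    return sorted(vmas, key=lambda v: v.start)


def lookup_vma(vmas: List[Vma], addr: int) -> Optional[Vma]:
    """Return the VMA containing addr, or None (which a fault would treat as invalid)."""
    starts = [v.start for v in vmas]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and vmas[i].start <= addr < vmas[i].end:
        return vmas[i]
    return None


# Made-up sample in the maps format; real output has many more lines.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
00e03000-00e24000 rw-p 00000000 00:00 0 [heap]
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    regions = parse_maps(SAMPLE)
    print(lookup_vma(regions, 0x00E03010))
```

On a Linux machine the same parser can be fed open("/proc/self/maps").read() instead of the sample.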

Page table structure

Linux on x86_64 uses a 4-level page table (PGD → PUD → PMD → PTE) by default, with a 5-level mode (PGD → P4D → PUD → PMD → PTE) on capable CPUs and kernels:

48-bit VA:  | PGD(9) | PUD(9) | PMD(9) | PTE(9) | offset(12) |

each level table = 4096 bytes = 512 × 8-byte entries
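
The bit layout above can be turned into a small worked example: shift the address right by 12 plus 9 bits per lower level, then mask to 9 bits. This is a sketch of the arithmetic only, with split_va as an illustrative name:

```python
from typing import List, Tuple

PAGE_SHIFT = 12       # 4 KB pages
BITS_PER_LEVEL = 9    # 512 entries per table


def split_va(va: int, levels: int = 4) -> Tuple[List[int], int]:
    """Return ([index at each level, top first], page offset) for a virtual address."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(levels):
        shift = PAGE_SHIFT + BITS_PER_LEVEL * (levels - 1 - level)
        indices.append((va >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


if __name__ == "__main__":
    names = ["PGD", "PUD", "PMD", "PTE"]
    idx, off = split_va(0x00007FFFFFFFF123)
    print(dict(zip(names, idx)), hex(off))
```

With levels=5 the same function yields the PGD, P4D, PUD, PMD, and PTE indices of a 57-bit address.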

mm/pgtable-generic.c, arch/x86/mm/, and include/linux/pgtable.h (include/asm-generic/pgtable.h before 5.8) are where this lives. Linux’s page-table API is architecture-neutral via helpers (pgd_offset, pud_alloc, pmd_alloc, pte_alloc_map); the same code services x86_64, ARM64, RISC-V, and POWER.

Huge pages are first-class: a PMD with _PAGE_PSE set is a 2 MB page; a PUD with _PAGE_PSE is a 1 GB page. Linux supports them in three flavours:

hugetlbfs: explicitly reserved at boot via hugepages=N (or at run time via vm.nr_hugepages); user maps via mmap(MAP_HUGETLB). No fragmentation, no migration.
Transparent Hugepages (THP): khugepaged scans for promotable 4 KB regions and collapses them into 2 MB pages opportunistically. On-fault THP also tries to allocate 2 MB at first touch.
Hugetlb pool with overcommit: the reserved pool may grow on demand, up to vm.nr_overcommit_hugepages extra pages, which puts it somewhere between the two.

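Page-table updates are protected by spinlocks: mm->page_table_lock for coarse operations, and split locks (one per PTE-table or PMD-table page) for fine-grained updates. Recent kernels (6.4 and later) combine RCU with per-VMA locks for VMA lookups in the fault path, so most faults avoid contending on the address-space-wide mmap_lock.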

Page cache and swap

The page cache is Linux’s unified buffer for file-backed memory. Every page read from a file goes into the page cache; subsequent reads find it there; writes go into the cache first and are flushed later by the kernel’s writeback workers (per-device flusher threads; the old pdflush daemon was removed in 2.6.32). The cache is indexed by (inode, page_offset), so multiple processes that mmap the same file share the underlying pages.

file → page cache pages → mapped into process VMAs as needed

process A's PTE  ──┐
                   ├──→ same physical frame
process B's PTE  ──┘
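
The sharing shown in the diagram can be observed from user space. The sketch below maps the same temporary file twice and writes through one mapping; the bytes become visible through the other because both are backed by the same page-cache page. For simplicity both mappings live in one process, whereas the text describes separate processes; shared_mapping_demo is an illustrative name.

```python
import mmap
import os
import tempfile


def shared_mapping_demo() -> bytes:
    """Write through one shared mapping, read the result through another."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)       # give the file one page of length
        a = mmap.mmap(fd, mmap.PAGESIZE)      # shared, writable mapping by default
        b = mmap.mmap(fd, mmap.PAGESIZE)      # second mapping of the same file page
        a[0:5] = b"hello"                     # store through mapping a
        seen = bytes(b[0:5])                  # load through mapping b
        a.close()
        b.close()
        return seen
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(shared_mapping_demo())
```

No explicit flush is needed between the write and the read, since neither mapping holds a private copy.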

The page cache and the anonymous page set together form the LRU lists used for reclaim. Linux maintains active and inactive lists, split by anonymous and file pages, per NUMA node and per memory cgroup (per zone before kernel 4.8) or, with MGLRU, multiple generations. New pages start on inactive; a second touch promotes to active; aging demotes active back to inactive; reclaim picks victims from the inactive tail.
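
The promote-on-second-touch rule is easier to see in a toy model. The class below is a deliberately simplified sketch of a two-list LRU, not the kernel’s algorithm: it ignores accessed bits, anon/file splitting, and generations, and the names TwoListLru, capacity, and max_active are hypothetical. It assumes capacity is at least 1.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class TwoListLru:
    """Toy active/inactive LRU: new pages enter inactive, a second touch promotes."""

    def __init__(self, capacity: int, max_active: int) -> None:
        self.capacity = capacity          # total pages tracked across both lists
        self.max_active = max_active      # cap on the active list
        self.inactive: OrderedDict = OrderedDict()   # oldest entry first
        self.active: OrderedDict = OrderedDict()     # oldest entry first

    def touch(self, page: Hashable) -> Optional[Hashable]:
        """Record an access; return the evicted page, if any."""
        if page in self.active:
            self.active.move_to_end(page)             # refresh position
            return None
        if page in self.inactive:
            del self.inactive[page]                   # second touch: promote
            self.active[page] = True
            if len(self.active) > self.max_active:    # aging: demote oldest active
                old, _ = self.active.popitem(last=False)
                self.inactive[old] = True
            return None
        evicted = None
        if len(self.active) + len(self.inactive) >= self.capacity:
            evicted = self._reclaim()
        self.inactive[page] = True                    # first touch: start inactive
        return evicted

    def _reclaim(self) -> Hashable:
        """Evict from the inactive tail, demoting an active page first if needed."""
        if not self.inactive:
            old, _ = self.active.popitem(last=False)
            self.inactive[old] = True
        victim, _ = self.inactive.popitem(last=False)
        return victim


if __name__ == "__main__":
    lru = TwoListLru(capacity=3, max_active=2)
    for p in ["a", "b", "a", "c", "d", "e"]:
        print(p, "evicts", lru.touch(p))
```

In this run page a is touched twice and survives, while the once-touched pages b and c are evicted by the later one-off accesses, which is the scan resistance the two lists are meant to provide.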

Swap uses the same mechanism described in Swapping — Mechanisms. Anonymous pages spill to a swap device on reclaim; file pages drop (clean) or write back (dirty). vm.swappiness (0–200, default 60) biases the choice. The OOM-killer fires when reclaim can’t find candidates fast enough; it picks the victim with the highest oom_score.

zswap and zram provide compressed in-memory swap, common on memory-tight devices and containers.

Operational characteristics

Page fault latency:
- Minor (allocate-on-demand, COW): ~1–5 µs.
- Major (read from disk): ~100 µs on SSD, ~10 ms on HDD.
Page-table walk cost: ~50–500 cycles depending on cache state. Walker caches accelerate sequential walks.
TLB shootdown on SMP: ~5–20 µs IPI latency per remote core. A munmap over many cores can stall for tens of µs.
Reclaim throughput: kswapd typically reclaims ~100k pages/s; direct reclaim is costlier for the application because it runs synchronously in the allocating thread, which stalls until pages are freed.
Memory overhead: page tables consume ~0.2–1% of mapped memory on typical workloads (more with sparse mappings, less with huge pages).
Knobs that matter: vm.swappiness, vm.dirty_ratio / vm.dirty_background_ratio, vm.min_free_kbytes, vm.zone_reclaim_mode, /sys/kernel/mm/transparent_hugepage/enabled.
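
The lower end of the page-table overhead figure can be checked with arithmetic: each 4 KB page needs one 8-byte PTE, so a dense mapping spends about 8/4096, or roughly 0.2%, on leaf tables, plus a small amount for the upper levels. The sketch below assumes a single contiguous, aligned, fully populated mapping on 4-level x86_64; page_table_bytes is an illustrative name, and real processes with sparse mappings will use more.

```python
PAGE = 4096                      # bytes per page-table page
ENTRY = 8                        # bytes per entry
ENTRIES = PAGE // ENTRY          # 512 entries per table

# Number of table levels needed for each leaf size on 4-level paging.
LEVELS_FOR_LEAF = {4096: 4, 2 << 20: 3, 1 << 30: 2}


def page_table_bytes(mapped_bytes: int, leaf_size: int = PAGE) -> int:
    """Page-table memory for one dense, aligned mapping of mapped_bytes."""
    entries = -(-mapped_bytes // leaf_size)      # ceiling division
    total = 0
    for _ in range(LEVELS_FOR_LEAF[leaf_size]):
        tables = -(-entries // ENTRIES)          # tables needed at this level
        total += tables * PAGE
        entries = tables                         # each table needs one parent entry
    return total


if __name__ == "__main__":
    gib = 1 << 30
    for leaf in (4096, 2 << 20):
        cost = page_table_bytes(gib, leaf)
        print(f"leaf={leaf:>8}: {cost} bytes ({100 * cost / gib:.3f}% of 1 GiB)")
```

Under the same assumptions, 100 GiB of dense 4 KB mappings needs roughly 200 MiB of page tables, while 2 MB leaves cut the cost for 1 GiB to three table pages.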

NUMA-awareness shows everywhere: /proc/zoneinfo shows per-node statistics; numactl controls placement; the reclaimer prefers local-node pages; AutoNUMA migrates hot pages toward the node that’s using them most.

Trade-offs and gotchas

Security features:

KASLR: randomises the kernel’s base address at boot so a leaked kernel pointer can’t be reused across reboots.
KPTI (Kernel Page-Table Isolation): post-Meltdown mitigation that gives every process two sets of page tables, one with the full kernel mapped (used in kernel mode) and one with only the minimal entry/exit code mapped (used in user mode). A speculative kernel access from user mode now finds no mapping for kernel data. Cost: an extra page-table switch on every kernel entry and exit (~5–30% syscall overhead on affected CPUs).
SMEP / SMAP: CPU features that prevent kernel-mode execution (SMEP) and kernel-mode data access (SMAP) of user-mode pages. Enabled by default on CPUs that support them; explicit stac/clac brackets allow controlled access.
NX bit: non-executable pages prevent code injection on data regions. The kernel marks stack and heap NX by default; mapping executable memory requires PROT_EXEC.
ASLR: randomises stack, heap, mmap base, and (with PIE) code base per process.
Address-space hardening: __user-tagged pointers, copy_to_user/copy_from_user accessors, get_user_pages for pinning.

Common gotchas:

THP latency spikes: khugepaged and on-fault THP can stall in compaction when memory is fragmented. Databases (Postgres, MongoDB, MySQL) often disable THP entirely.
Swappiness misconfiguration: swappiness=0 doesn’t disable swap; it just reduces anonymous-page eviction. For “truly no swap,” run swapoff -a or configure no swap device.
mmap over a file with a hole: reads of the hole return zeros, and writes silently allocate blocks (or raise SIGBUS if the filesystem has no space left). A footgun for memory-mapped databases.
Forking large processes: page-table duplication is O(mapped pages); a fork over 100 GB of mapped memory takes hundreds of ms even before any COW write.
Cgroup memory limits: memory.max enforces a per-cgroup ceiling. Crossing it triggers reclaim in the cgroup, then OOM-kill within the cgroup. Important to set on container hosts.
Direct I/O bypasses the page cache: O_DIRECT reads/writes go straight to the device, useful for databases that maintain their own buffer pool but disastrous if mixed with cached I/O on the same file.
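
The file-hole gotcha can be reproduced safely on the read side. The sketch below extends an empty temporary file with ftruncate, which sets the length without writing data, and then reads it through a read-only mapping. Whether the file is actually stored sparsely depends on the filesystem, which is an assumption here; the zeros are returned either way. The function name is illustrative.

```python
import mmap
import os
import tempfile


def hole_reads_as_zeros(size: int = 1 << 20) -> bool:
    """Map a file that was extended but never written and check its contents."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)                    # set length, write nothing
        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as m:
            return m[:] == bytes(size)            # every byte reads back as zero
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(hole_reads_as_zeros())
```

The write side is the risky half: a store into such a mapping makes the filesystem allocate a block at fault time, so an out-of-space condition surfaces as a signal instead of an error return.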

Buffered I/O (page cache): every byte goes through the cache; readahead, writeback, and mmap-style sharing happen automatically. Default for almost everything. Cost: double-buffering against user buffers; cache pressure interferes with anon allocations.

Direct I/O (O_DIRECT): the application’s buffer is the only copy; bypasses the page cache; the buffer address, file offset, and length generally must be aligned to the device’s logical block size. Used by databases (Oracle, MySQL InnoDB optionally) that manage their own buffer pool. Cost: no kernel-level readahead; you own caching.
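
The alignment rule for O_DIRECT comes down to three mask checks. The helper below is a sketch under an assumed block size of 512 bytes; the exact requirement varies by filesystem, device, and kernel version, and many devices need 4096. direct_io_aligned and its parameters are illustrative names.

```python
def direct_io_aligned(buf_addr: int, offset: int, length: int,
                      block_size: int = 512) -> bool:
    """True if buffer address, file offset, and length are all block-aligned."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1
    return (buf_addr & mask) == 0 and (offset & mask) == 0 and (length & mask) == 0


if __name__ == "__main__":
    print(direct_io_aligned(0x7F0000001000, 8192, 4096, block_size=4096))
    print(direct_io_aligned(0x7F0000001000, 100, 4096))
```

A request that fails this check would typically be rejected by the kernel rather than silently falling back to buffered I/O, which is why applications using O_DIRECT allocate their buffers with an aligned allocator.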

Why is `kswapd` always sitting at 0.1% CPU on idle servers?

kswapd is woken whenever free memory in a node dips below its low watermark, which happens even on a box that feels idle because the page cache keeps growing into free RAM. On a healthy box it does almost nothing: it reclaims a few clean file-cache pages, clears accessed bits, and updates LRU positions. That tiny background activity is why tools like top always show a little kswapd CPU time. If kswapd is sustained at high CPU, the system is under reclaim pressure and you have a problem.
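
The watermark behaviour can be summarised as a small decision function. This is a simplified sketch of the commonly described min/low/high scheme, not the kernel’s code: real decisions are made per zone and per allocation order, and reclaim_action with its return strings is hypothetical.

```python
def reclaim_action(free_pages: int, wmark_min: int, wmark_low: int,
                   wmark_high: int, kswapd_running: bool) -> str:
    """Decide what reclaim activity a given free-page count would trigger."""
    if not (wmark_min <= wmark_low <= wmark_high):
        raise ValueError("expected min <= low <= high")
    if free_pages < wmark_min:
        return "direct_reclaim"        # allocator reclaims in its own context
    if free_pages < wmark_low:
        return "wake_kswapd"           # background reclaim starts
    if kswapd_running and free_pages < wmark_high:
        return "kswapd_continues"      # keep going until the high watermark
    return "idle"


if __name__ == "__main__":
    for free in (500, 1500, 2500, 3500):
        print(free, reclaim_action(free, 1000, 2000, 3000, kswapd_running=True))
```

The gap between the low and high watermarks gives kswapd hysteresis, so it reclaims in short bursts and then sleeps, which matches the near-zero CPU usage seen on a healthy machine.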

Paging Fundamentals — the mechanism Linux’s VM implements.
Multi-Level Page Tables — the 4-level (or 5-level) walk Linux uses on x86_64.
Translation Lookaside Buffers (TLBs) — Linux’s TLB shootdown and ASID/PCID management.
Swapping — Mechanisms — the swap path Linux’s reclaim feeds into.
Page Replacement Policies — OPT, FIFO, LRU, CLOCK — Linux’s active/inactive and MGLRU reclaim algorithms.