Paging Fundamentals — Operating Systems

What it is#

Paging is the dominant memory-virtualization mechanism in every modern OS. The virtual address space of each process is chopped into fixed-size pages (typically 4 KB), the physical memory is chopped into matching page frames, and a per-process page table records the mapping from virtual page number to physical frame number. On every memory access, the MMU translates virtual to physical by indexing the page table with the page number and adding the offset.

Compared with the alternatives (base-and-bounds, segmentation), paging trades a more expensive translation step for two big wins: external fragmentation disappears (all pages are the same size, so any free frame can hold any page), and physical placement becomes arbitrary (the OS can scatter a process’s pages across DRAM and still hand the process a contiguous virtual address space).

When to use it#

Effectively every general-purpose OS uses paging — Linux, BSD, Windows, macOS, xv6. It’s the default unless you have an unusual constraint:

Embedded systems without an MMU — microcontrollers (Cortex-M, AVR) don’t have a page table walker, so they run a single address space shared by all code. RTOSes like FreeRTOS live here.
Hypervisors add a second layer of translation (nested / two-dimensional paging) so guests can run their own page tables while the hypervisor still controls physical memory.
Specialised kernels (unikernels, library OSes) sometimes simplify to a flat one-process address space, but they still page so the hardware MMU can enforce read-only / no-execute on parts of the image.

Reach for an understanding of paging whenever you need to reason about RSS vs. VSZ, page faults, copy-on-write, shared memory, mmap, or why a process with a 16 GB virtual size only consumes 200 MB of RAM.

How it works#

The translation step#

A virtual address is split into two pieces: the high bits are the virtual page number (VPN), the low bits are the offset within the page. For a 32-bit address space with 4 KB pages, that’s a 20-bit VPN and a 12-bit offset; for x86_64 (48 bits of virtual address in current implementations) it’s a 36-bit VPN and a 12-bit offset.

The MMU reads the page-table base register (CR3 on x86, TTBR0 on ARM) to find the current process’s page table. It indexes that table with the VPN, retrieves a page table entry (PTE), extracts the physical frame number (PFN), and concatenates it with the offset to form the physical address.

virtual address: | VPN (20 bits) | offset (12 bits) |
                       │
                       │  PTE = page_table[VPN]
                       ▼
                  | PFN (20 bits) | flags |
                       │
physical address: | PFN (20 bits) | offset (12 bits) |

What a PTE actually holds#

A typical x86 4 KB PTE is 64 bits. The interesting fields:

Bit	Name	Meaning
0	Present (P)	1 if the page is in physical memory; 0 triggers a page fault on access
1	Read/Write (R/W)	1 if writable; 0 makes write attempts fault
2	User/Supervisor (U/S)	1 if user mode can access; 0 restricts to kernel
3	Page-Write-Through (PWT)	Caching policy hint
4	Page Cache Disable (PCD)	Caching policy hint
5	Accessed (A)	Set by hardware on first reference; used by replacement policies
6	Dirty (D)	Set by hardware on first write; tells the OS the page must be written to swap
7	Page Size (PS)	1 for 2 MB / 1 GB huge pages, 0 for 4 KB
12–51	Physical Frame Number	Where the page actually is in DRAM
63	No-Execute (NX)	1 forbids execution; blocks code injection on data pages

The bits the OS cares about most are P (controls page faults), R/W and U/S and NX (protection), A and D (replacement and writeback bookkeeping). The hardware sets A and D for free as a side effect of every access; the OS reads them later to make decisions.

Valid vs. present#

It’s worth distinguishing two related notions:

Valid means the virtual page is part of the process’s address space — there’s a meaningful translation defined. An invalid access is a bug (segfault).
Present means the page’s data currently lives in physical memory. A valid-but-not-present page is one that has been swapped out, or hasn’t been faulted in yet, or is a zero-fill-on-demand page; the kernel intervenes and resolves it.

Many architectures conflate the two into a single “present” bit and use other PTE flags (or a side table) to disambiguate. The OS-level distinction is real: a fault on an invalid page kills the process; a fault on a not-present page is normal demand paging.

The cost without a cache#

Naive paging multiplies every memory access by two: one access to read the PTE from the page table, one to read the actual data. That’s catastrophic — programs would run at half the speed. Two mechanisms recover the performance:

TLBs (translation lookaside buffers) cache recent VPN→PFN translations. A TLB hit is ~1 cycle; a TLB miss is the cost of a page-table walk, which on multi-level tables can be 4+ memory accesses. See TLBs for the deep dive.
The page table itself is paged. Multi-level page tables (covered in Multi-Level Page Tables) make most of a sparse address space cost nothing — entire regions of the page table are simply absent.

What happens on a page fault#

1. CPU tries to load from virtual address V.
2. MMU walks page table; PTE for V has present=0.
3. Hardware raises a page-fault exception. Faulting address goes in CR2 (x86).
4. CPU traps to kernel's page-fault handler.
5. Handler classifies the fault:
   - Invalid address?           → SIGSEGV to the process
   - Permission violation?      → SIGSEGV
   - Page in swap?              → read it back, update PTE, retry
   - First touch of an anon page? → allocate zero page, update PTE
   - COW write?                 → copy page, update PTE rw=1, retry
   - File-backed page?          → page cache lookup; read from disk if miss
6. Hardware retries the faulting instruction.

The retry is what makes the whole thing transparent — the user instruction has no idea it was preempted.

Variants#

Page size#

The classic 4 KB page is a balance: small enough that internal fragmentation (the unused tail of a partially-used page) is bearable, large enough that TLB coverage and DMA setup aren’t dominated by per-page overhead. Modern hardware supports larger sizes:

Huge pages — 2 MB on x86, 16 MB / 64 MB on POWER. One TLB entry now covers 512× more memory; useful for DB buffer pools, JVMs, scientific compute.
Giant pages — 1 GB on x86. Used for hypervisor pinning and very large heaps.

The cost is internal fragmentation (a 2 MB page allocated to hold a 64 KB struct wastes ~1.94 MB) and slower migration. Linux supports both explicit huge pages (hugetlbfs) and transparent huge pages (THP, opportunistic coalescing).

Inverted page tables#

Instead of one table per process indexed by VPN, an inverted table has one entry per physical frame, holding (PID, VPN). Translation requires hashing on (PID, VPN) to find the right entry. Wins when the address space is enormous but mostly empty; the hash lookup adds latency. PowerPC and IA-64 used this; x86 never did.

Per-process vs. global pages#

The high half of the kernel’s virtual address space is the same in every process — the kernel doesn’t want to switch page tables every system call. PTEs for kernel pages are marked global so the TLB doesn’t flush them on context switch. The cost of that optimization is the Meltdown vulnerability (kernel pages were speculatively accessible to user code), which is why KPTI now separates user and kernel page tables on affected hardware.

Trade-offs#

Fixed-size pages (paging) — uniform unit, no external fragmentation, easy to share pages across processes, hardware support is universal. Cost: a process that wants 8200 bytes uses two whole pages; the page table is a per-process data structure that must be maintained on every mapping change.

Variable-size regions (segmentation) — natural fit for code / data / stack having different sizes, easier to grow a region in place. Cost: external fragmentation as regions are allocated and freed; compaction is expensive; hardware support is mostly historical (x86 segmentation is mostly disabled in 64-bit mode).

Other trade-offs:

Larger pages reduce TLB pressure but waste memory on small allocations. The right balance depends on workload.
Linear page tables are simple but a 32-bit process with 4 KB pages needs 2^20 = 1M PTEs * 4 bytes = 4 MB just for the table — multiplied by every process. Multi-level tables compress sparse spaces aggressively.
The cost of a page fault is enormous. A minor fault (just allocation) costs microseconds; a major fault (disk read) costs milliseconds. The page replacement policy decides how often you pay.

Why isn't the page table itself just in registers?

A 4 KB page covers a 20-bit VPN — 2^20 entries. Even one cache line per PTE would mean megabytes of fast storage, which is what the entire L1+L2 cache budget is across all data. The PTE storage has to live in DRAM, and the TLB is the small SRAM cache that holds the handful of translations currently being used. The TLB is the only “registers”-equivalent here, and it’s deliberately tiny (a few hundred entries on x86) to keep lookup at 1-cycle latency.

Common pitfalls#

Confusing the page table with the TLB. The TLB is a hardware cache of PTEs; the page table is the in-memory data structure the OS owns. Modifying a PTE doesn’t automatically update the TLB — the OS must explicitly invalidate (invlpg on x86, tlbi on ARM) or the stale translation persists.
Forgetting that fork shares pages until write. A fork that “took 1 ms” can turn into 30 seconds of slow page faults as the child writes and triggers COW. Long-lived parents (Redis) plan for this.
Touching guard pages. The stack grows downward into a deliberately-unmapped page; accessing it triggers a fault and the kernel either grows the stack or kills the process. A stack-buffer-overrun bug that lands past the guard page is what produces a segfault.
Mistaking RSS for memory used. Resident set size counts pages currently present in DRAM. A 10 GB mmap’d file with 100 MB resident has RSS=100 MB and VSZ=10 GB; the difference is on-demand pages.
Modifying page tables without TLB shootdown on SMP. On a multi-core system, the cores share page tables but each has its own TLB. Unmapping a page on core 0 must send an IPI to other cores so they invalidate their TLBs too. Skipping the shootdown produces use-after-free at the hardware level.

Address Spaces — what paging is virtualizing.
Address Translation — Base and Bounds — the predecessor mechanism.
Segmentation — the other classic mechanism, sometimes combined with paging.
Translation Lookaside Buffers (TLBs) — what makes paging fast.
Multi-Level Page Tables — how real page tables are organised.