Multi-Level Page Tables

Sparse address spaces, x86_64's 4-level walk, the time-space trade-off, inverted page tables, hashed alternatives.

Building Block Advanced
9 min read
page-table multi-level x86-64 sparse page-walk

What it is#

A multi-level page table is a tree-structured page-table layout where the virtual page number is split into multiple indices, each of which selects an entry in a successive level of the tree. The leaves are the real page-table entries; the inner nodes are pointers to subtrees. The trick is that any inner-node pointer can be null — entire subtrees of the address space simply don’t exist in the table.

This matters because real address spaces are extremely sparse. A 64-bit process advertises 2^48 bytes of virtual address; it typically uses a few hundred MB. A flat (linear) page table that mapped every possible page would consume 256 GB of memory just for the table — for every process. A multi-level table compresses this to “only the pages that are actually mapped pay for table storage,” typically a few KB to a few MB.

x86_64 uses a 4-level walk by default (PML4 → PDPT → PD → PT), each level indexed by 9 bits of VPN, with the bottom 12 bits as the page offset. Recent CPUs add a 5-level mode for 57-bit virtual addresses.

48-bit virtual address:
| PML4 (9) | PDPT (9) | PD (9) | PT (9) | offset (12) |
│ │ │ │
▼ ▼ ▼ ▼
PML4[i] → PDPT[j] → PD[k] → PT[l] → (PFN, flags)

When to use it#

Every general-purpose modern OS uses multi-level page tables. The decision is which depth and which alternatives:

  • 2-level (Intel 32-bit without PAE) — 1024 × 1024 entries, supports 4 GB of address space. Enough for 32-bit processes; both levels of the table are 4 KB.
  • 3-level (32-bit PAE, some ARM modes) — handles 36-bit physical / extended virtual at 4 KB pages.
  • 4-level (x86_64, AArch64 with 4 KB / 48 bits) — the modern default.
  • 5-level (x86_64 with 5-level paging, AArch64 with 4 KB / 52 bits) — needed when you want >256 TB of virtual address; turned on at boot if the CPU supports it and the kernel was built for it.
  • Inverted page tables (PowerPC, IA-64 historically) — alternative shape with one entry per physical frame; rarely a win.
  • Hashed page tables (PA-RISC, some research) — open addressing into a hash table keyed on (ASID, VPN).

You also choose this implicitly when you pick page size. Linux’s huge pages skip the bottom level (2 MB pages on x86_64) or two bottom levels (1 GB pages) — fewer levels per walk, larger TLB coverage.

How it works#

The walk on a miss#

A TLB miss in 64-bit x86 triggers a hardware page-table walk:

1. cr3 → physical address of PML4
2. pml4e = mem[cr3 + 8*PML4_index]
- if pml4e.present == 0: page fault
3. pdpte = mem[pml4e.next_table + 8*PDPT_index]
- if pdpte.PS == 1: this is a 1 GB page; assemble PFN, done
- if pdpte.present == 0: page fault
4. pde = mem[pdpte.next_table + 8*PD_index]
- if pde.PS == 1: this is a 2 MB page; assemble PFN, done
- if pde.present == 0: page fault
5. pte = mem[pde.next_table + 8*PT_index]
- if pte.present == 0: page fault
6. PFN = pte.PFN; assemble physical address; install in TLB; retry.

Four memory loads per walk in the leaf case. Each load itself can hit or miss in the cache hierarchy, so a worst-case walk is hundreds of cycles. The MMU has small paging-structure caches that cache PML4/PDPT/PD entries to skip levels when the upper bits of VPN are similar to a recent walk.

Sparseness#

Suppose a process only has [0, 1 MB) mapped at the bottom of the address space and a small stack near the top. The page table contains:

  • 2 PML4 entries (one for the low range, one for the stack range).
  • 2 PDPT pages (one under each PML4 entry).
  • 2 PD pages (one under each PDPT).
  • 2 PT pages (one under each PD, holding 256 PTEs for the 1 MB range plus a few stack PTEs).

Total: ~8 × 4 KB = 32 KB of page-table memory. A flat table would have required 2^36 × 8 = 512 GB. Multi-level compression is what makes 64-bit virtual addresses tractable.

Walk caches and the cost of a miss#

The MMU’s paging-structure caches typically hold a handful of recent (PML4/PDPT/PD) entries. On sequential accesses through a 1 GB range, the same PML4/PDPT/PD entries are reused for every leaf miss; only the leaf load actually reaches DRAM. A random-access walk over a sparsely mapped huge dataset can miss the structure caches too, paying the full four-load cost.

How the OS modifies a multi-level table#

To map a new virtual page:

  1. Walk down to the leaf; if any inner-level entry is null, allocate a new 4 KB page table and link it in.
  2. Set the leaf PTE with the desired PFN and flags.
  3. (No TLB invalidation needed if the page is being mapped for the first time — there’s no stale entry.)

To unmap:

  1. Walk to the leaf; clear the PTE.
  2. Invalidate the TLB on every core that may have cached it (TLB shootdown).
  3. Optionally walk back up and free any now-empty inner-level tables, with shootdowns at each step. Most kernels leave the inner pages allocated and reuse them.

Huge pages skip a level#

A PDE with the PS bit set means “this is a 2 MB page, not a pointer to a PT.” The walker stops here and assembles the PFN from this entry. One fewer memory load per miss and the TLB entry covers 512× more memory. 1 GB pages stop at the PDPT level (two fewer loads).

Variants#

Linear (flat) page tables#

One entry per virtual page, no tree. Simple, fast (one load per miss), wasteful on sparse spaces. Only viable when the address space is small (small embedded CPUs, 16-bit micros).

Inverted page tables#

One entry per physical frame(PID, VPN) → frame number. A hash on (PID, VPN) finds the entry. Wins when the address space is enormous but most pages are unmapped (the table size is proportional to physical RAM, not virtual range). Loses on lookup cost (hash + collision chain) and on sharing (one frame, one entry — you can’t have two virtual addresses pointing to the same frame except via aliasing). PowerPC and IA-64 used these; x86 never did.

Hashed page tables#

A bigger open-addressed hash table keyed on (VPN, ASID). PA-RISC and some research designs. Trade lookup time (hash + probe) for table size; works well when most of the address space is empty.

5-level paging on x86_64#

A PML5 table sits on top of the existing 4 levels. The 9 highest bits of the VA index into PML5. Each PML5 entry points to a PML4. Enabled at boot via a CPUID feature bit; the kernel must be built for it. Linux’s CONFIG_X86_5LEVEL controls this.

Reverse mapping (the OS’s side)#

Independent of the page-table tree itself, the OS keeps a struct page per physical frame plus reverse maps so it can find every PTE that maps a given frame (for COW and swap-out). The page table answers “VPN → PFN”; the reverse map answers “PFN → all VPNs.”

Trade-offs#

Multi-level (tree) page tables — sparse spaces are nearly free in storage, kernel can allocate / free subtrees lazily, well-matched to TLB + walker hardware. Cost: walks are O(levels) memory accesses; mitigated by walker caches.
Inverted / hashed page tables — storage proportional to physical memory, naturally bounded. Cost: lookups need hashing + collision chains, sharing is hard, hardware support is rare today.

Other recurring tensions:

  • Depth vs. miss cost. Each level adds a memory load to a TLB miss. Five-level paging supports more virtual address space at the cost of slightly slower misses.
  • Page-table memory vs. fragmentation. Smaller pages mean more PTEs and bigger leaf tables but less internal fragmentation. Huge pages flip the trade-off.
  • Lazy vs. eager subtree allocation. Lazy keeps the table small but takes a page-fault path on first access; eager pre-allocates inner-level pages for mapped ranges.
  • Per-process tables vs. shared tables. Each process needs its own leaves but the kernel half is often shared (a single set of kernel PTEs mapped into every process’s PML4). KPTI broke this for Meltdown defense — now there’s a separate kernel page table on affected hardware.
Why doesn't the table use the same page-size hierarchy everywhere?

It does — that’s the cute design choice. Each level’s table is exactly one 4 KB page (512 × 8 bytes), the kernel allocates these pages from the same buddy allocator as user pages, and the table itself can be paged out under extreme memory pressure (rarely done, but architecturally possible). The 9-bit-per-level geometry is the smallest power of two such that each level fits in one page. 5-level paging just adds another 4 KB on top.

Common pitfalls#

  • Underestimating page-table memory under fragmentation. A process that touches one byte per 4 KB across a 1 GB range owns 1 GB / 2 MB = 512 leaf tables = 2 MB of leaf PTEs. The bookkeeping adds up.
  • Forgetting TLB shootdown when modifying inner-level entries. Promoting / demoting between huge and base pages requires invalidating any TLB entry that covered the old mapping on every core.
  • Walking page tables in software without locks. The kernel’s walk_page_range must hold the appropriate page-table lock or coordinate with RCU; otherwise concurrent unmaps tear down intermediate-level tables under your feet.
  • Conflating PFN and physical address. PFN is the frame number (physical address >> 12). Mixing them off by a 4 KB factor is a classic kernel bug.
  • Thinking 5-level paging is free. It’s not — every TLB miss does one more memory load. Most workloads are fine but extreme TLB-miss-heavy workloads (graph traversal, huge sparse matrices) see measurable slowdown when 5-level is enabled.
  • Sharing leaf PTEs naively across processes. Doing so means a modification by one process is visible to the other — a feature for shared memory, a bug if unintended. Linux’s fork uses COW at the PTE level and marks shared pages as read-only.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.