Translation Lookaside Buffers (TLBs)

The cache that makes paging fast. Miss handling (HW vs SW), context-switch invalidation, ASIDs, and a real TLB entry's bits.

Building Block Intermediate
9 min read
tlb mmu page-walk asid address-translation

What it is#

A Translation Lookaside Buffer (TLB) is a small, fast hardware cache that holds recently used virtual-page-number to physical-frame-number translations. Every load and store the CPU issues goes through the TLB first; on a hit, translation completes in a single cycle, and the access proceeds to L1 cache with the physical address. On a miss, the MMU walks the in-memory page table — multiple memory accesses — and installs the resulting translation in the TLB before retrying.

Without a TLB, paging would turn every memory access into at least two (five or more with a 4-level table): one read per page-table level to find the PTE, then one for the actual data. Programs would run at a fraction of their speed. The TLB makes paging viable; you can think of it as the cache without which the whole virtual-memory abstraction would be too slow to ship.

Typical sizes today: 64 entries for the L1 dTLB, 1024–1536 entries for the L2 unified TLB, with separate hierarchies for instruction and data. A 64-entry L1 dTLB with 4 KB pages covers 64 * 4 KB = 256 KB of address space; with 2 MB huge pages, the same 64 entries cover 128 MB.
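The coverage arithmetic above is just entries multiplied by page size. A minimal sketch, where the helper name `tlb_reach` is made up for illustration:

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_size

# 64-entry L1 dTLB, as in the text above.
print(tlb_reach(64, 4 * KIB) // KIB, "KB with 4 KB pages")   # 256
print(tlb_reach(64, 2 * MIB) // MIB, "MB with 2 MB pages")   # 128
```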

When to use it#

You don’t “use” a TLB — every memory access on a paged machine uses one whether you want to or not. The decision is whether to optimise for the TLB:

  • Tight inner loops with locality of reference get TLB-resident pages for free.
  • Random-access workloads (graph traversals, hash-joins on huge tables) generate TLB misses on most accesses. Reducing pressure via huge pages (2 MB or 1 GB) is the standard fix.
  • Many small short-lived processes thrash the TLB on context switch. Reducing fork/exec rate or using a thread pool helps; ASIDs (see Context switch and ASIDs below) also help.
  • NUMA-aware workloads care because a page-table walk that reads PTEs from a remote node’s DRAM is slower than one served from local memory, so the cost of a miss isn’t constant.

If you’ve never thought about your TLB, you probably don’t need to. If you’re chasing the last 20% of perf on a database scan or a JIT, TLB profiling (perf stat -e dTLB-load-misses) is where you start.

How it works#

The lookup#

CPU issues load(va):
    vpn = va >> 12        (top bits)
    off = va & 0xFFF      (bottom 12 bits = offset)
    TLB lookup:
        if hit  → pfn = TLB[vpn].pfn
                  return load(pfn << 12 | off)
        if miss → walk page table to find pfn
                  install (vpn, pfn) into TLB (evict if full)
                  retry

A modern TLB is a small fully-associative or set-associative SRAM. An L1 lookup takes about one cycle. The walk on a miss costs O(levels) memory accesses: for x86_64’s 4-level table, four dependent loads, each of which may itself hit or miss in L1/L2/L3. The total therefore ranges from tens of cycles when every level hits in cache to hundreds when the loads go to DRAM.
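The lookup flow can be modelled in a few lines. This is a toy sketch, not how hardware is built: the class name `TinyTLB`, the LRU replacement policy, and the flat dict standing in for the page-table walk are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SHIFT = 12                       # 4 KB pages, as in the pseudocode
PAGE_MASK = (1 << PAGE_SHIFT) - 1

class TinyTLB:
    """Fully associative TLB model with LRU replacement (policy is assumed)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # dict vpn -> pfn, stands in for the walk
        self.entries = OrderedDict()      # vpn -> pfn, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn, off = va >> PAGE_SHIFT, va & PAGE_MASK
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            pfn = self.page_table[vpn]                # "walk"; KeyError models a page fault
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)      # evict the least recently used entry
            self.entries[vpn] = pfn
        return (self.entries[vpn] << PAGE_SHIFT) | off

tlb = TinyTLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
pa = tlb.translate(0x1ABC)            # vpn 1 maps to pfn 3
print(hex(pa), tlb.hits, tlb.misses)  # 0x3abc 0 1
```

A second access to any address in the same page (for example `tlb.translate(0x1004)`) is a hit; touching two other pages afterwards evicts vpn 1 because the model only holds two entries.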

What a TLB entry holds#

A real TLB entry contains more than just (vpn, pfn):

| Field | Purpose |
| --- | --- |
| VPN | The virtual page number being translated |
| PFN | The physical frame number |
| Valid | Entry is in use |
| Permissions (R/W/X, user/kernel) | Cached from the PTE; bad access faults without re-walking |
| Dirty | Cached from PTE; set on write, propagated back lazily |
| Global | If set, the entry survives context switches (kernel pages) |
| ASID (or PCID on x86) | Tags the entry with the owning address space |
| Page size | 4 KB / 2 MB / 1 GB; picks which entry hierarchy to consult |

Permissions cached in the TLB are why a freshly-marked-read-only PTE doesn’t take effect until you invalidate the TLB — the CPU keeps using the stale “writable” cached entry.
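The stale-permission effect is easy to reproduce in a toy model. In this sketch the `TLBEntry` layout, the dict-based page table, and the helper names `can_write` and `invalidate` are hypothetical; real entries are packed hardware state, not objects.

```python
from dataclasses import dataclass

@dataclass
class TLBEntry:
    vpn: int
    pfn: int
    writable: bool          # permission bit cached from the PTE
    valid: bool = True
    dirty: bool = False
    global_: bool = False   # survives context switches if set
    asid: int = 0
    page_size: int = 4096

page_table = {5: {"pfn": 42, "writable": True}}   # toy PTEs
tlb = {}                                          # vpn -> TLBEntry

def can_write(vpn):
    """Consult the TLB first; only a miss re-reads the PTE."""
    entry = tlb.get(vpn)
    if entry is None:
        pte = page_table[vpn]
        entry = tlb[vpn] = TLBEntry(vpn, pte["pfn"], pte["writable"])
    return entry.writable

def invalidate(vpn):
    """Model of a single-page invalidation."""
    tlb.pop(vpn, None)

before = can_write(5)                 # True, and now cached
page_table[5]["writable"] = False     # kernel marks the PTE read-only
stale = can_write(5)                  # still True: served from the stale entry
invalidate(5)
fresh = can_write(5)                  # False: the PTE is re-read after invalidation
print(before, stale, fresh)           # True True False
```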

Hardware-managed vs. software-managed misses#

Two design schools:

  • Hardware-managed TLB (x86, ARM, RISC-V) — the CPU contains a page-table walker that knows the table format. On a miss, it walks the table itself and installs the entry. The OS never sees the miss unless a fault occurs.
  • Software-managed TLB (MIPS, SPARC, Alpha) — on a miss, the CPU raises an exception; the kernel’s fast TLB-miss handler walks the (OS-chosen) page table format and installs the entry via a special instruction. More flexible (any table format), slower per miss (trap overhead).

x86 won partly because its hardware walker is fast and its page-table format is standardised.

Context switch and ASIDs#

Two processes have completely different page tables, so a translation cached for process A is meaningless for process B. The naive fix is to flush the entire TLB on every context switch, which is correct but expensive — every flush forces hundreds of cold misses on the next process.

The fix is the ASID (Address Space Identifier, called PCID on x86, ASID on ARM). Every TLB entry is tagged with the ASID of its owning process. A lookup hits only if both VPN and ASID match. On context switch, the kernel just writes the new ASID into a register — no flush needed; entries from other processes simply don’t match.

Tag widths are small: an x86 PCID is 12 bits (4096 simultaneous identifiers), and ARM ASIDs are 8 or 16 bits. The kernel recycles ASIDs when it runs out; recycling a tag does require flushing all entries with that tag.
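A minimal sketch of ASID tagging, assuming a TLB keyed on (ASID, VPN) pairs. The class `TaggedTLB` and its method names are invented for illustration, and global entries are left out.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with an address-space identifier."""

    def __init__(self):
        self.entries = {}        # (asid, vpn) -> pfn
        self.current_asid = 0

    def switch_to(self, asid):
        # Context switch: only the "current ASID" register changes, no flush.
        self.current_asid = asid

    def install(self, vpn, pfn):
        self.entries[(self.current_asid, vpn)] = pfn

    def lookup(self, vpn):
        # Hit only if both the VPN and the current ASID match; None is a miss.
        return self.entries.get((self.current_asid, vpn))

    def flush_asid(self, asid):
        # Needed when the kernel recycles a tag for a different process.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}

tlb = TaggedTLB()
tlb.switch_to(1)
tlb.install(0x10, 100)            # process A maps vpn 0x10
tlb.switch_to(2)
miss_for_b = tlb.lookup(0x10)     # None: A's entry does not match ASID 2
tlb.install(0x10, 200)            # process B maps the same vpn elsewhere
tlb.switch_to(1)
warm_for_a = tlb.lookup(0x10)     # 100: A's entry survived the switches
```

Calling `tlb.flush_asid(1)` before handing tag 1 to a new process removes A's entries while leaving B's intact.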

Multi-level walk and a paged page table#

The deep dive is in Multi-Level Page Tables, but the TLB perspective matters: a single TLB miss on x86_64 may produce 4 dependent memory accesses, each of which can itself miss in every cache level and go to DRAM. A worst-case miss is ~500 cycles. Page-table-walker caches (paging-structure caches) inside the MMU cache intermediate-level entries so that miss patterns sharing higher-level entries are faster.
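Each of the four accesses is indexed by a different 9-bit slice of the virtual address. A short sketch of that split for 48-bit x86_64 addresses with 4 KB pages; the function name `split_va` is made up here.

```python
def split_va(va):
    """Split a 48-bit x86_64 virtual address into four table indices and an offset."""
    offset = va & 0xFFF            # bits 11..0
    pt     = (va >> 12) & 0x1FF    # bits 20..12: page-table index
    pd     = (va >> 21) & 0x1FF    # bits 29..21: page-directory index
    pdpt   = (va >> 30) & 0x1FF    # bits 38..30: PDPT index
    pml4   = (va >> 39) & 0x1FF    # bits 47..39: PML4 index
    return pml4, pdpt, pd, pt, offset

va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_va(va))                # (1, 2, 3, 4, 5)
```

The walker reads one entry per index, in order from PML4 down to the page table, which is why the loads are dependent and cannot be issued in parallel.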

Huge pages#

If the OS installs a 2 MB page (a page-directory entry that maps a 2 MB-aligned region directly instead of pointing at a page table), the TLB entry covers 2 MB instead of 4 KB. 512× coverage per entry → 512× fewer misses on linear scans. The cost is internal fragmentation and slower migration. Linux supports both explicit hugepages (hugetlbfs) and transparent hugepages (THP, opportunistic).
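The 512× figure follows from counting distinct pages in a linear scan, since each new page costs at least one compulsory miss. A small sketch, assuming the scan starts at a page-aligned address; `pages_touched` is a hypothetical helper.

```python
def pages_touched(scan_bytes, page_size):
    """Distinct pages (and so compulsory TLB misses) in an aligned linear scan."""
    return -(-scan_bytes // page_size)    # ceiling division

GIB = 1 << 30
small = pages_touched(GIB, 4 << 10)       # 4 KB pages
huge = pages_touched(GIB, 2 << 20)        # 2 MB pages
print(small, huge, small // huge)         # 262144 512 512
```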

Variants#

Split vs. unified#

L1 is typically split into iTLB (instruction) and dTLB (data) so the CPU can look up code and data translations in parallel. L2 is usually unified — one large cache feeding both.

Set-associative vs. fully-associative#

L1 TLBs tend to be fully associative (so any entry can hold any VPN) at low entry counts. L2 is set-associative for capacity reasons. The choice is the usual cache trade-off: associativity vs. lookup cost.

Inclusive vs. exclusive multi-level#

Some CPUs maintain L1 entries as a subset of L2 (inclusive); others let an entry exist in only one (exclusive). Inclusive simplifies invalidation; exclusive maximises capacity.

Paging-structure caches#

Beyond the per-page TLB, the MMU also caches intermediate-level page-table entries (the PML4/PDPT/PD levels on x86_64). A miss in the leaf TLB may still hit in the structure cache, skipping 1–3 of the 4 memory accesses.
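To see how much a structure cache can save, here is a deliberately simplified model. It assumes an unbounded cache that only remembers the deepest intermediate level (the page-directory entry); real hardware has small, separate caches per level. The name `walk_cost` is invented for this sketch.

```python
def walk_cost(va, pde_cache):
    """Memory accesses needed for one leaf-TLB miss on a 4-level table.

    pde_cache is a set of address prefixes (PML4, PDPT and PD indices together)
    whose page-directory entry is already cached in the walker.
    """
    prefix = va >> 21              # everything above the page-table index
    if prefix in pde_cache:
        return 1                   # only the final page-table entry is read
    pde_cache.add(prefix)
    return 4                       # PML4, PDPT, PD and PT entries are all read

cache = set()
costs = [walk_cost(page * 4096, cache) for page in range(4)]
print(costs)                       # [4, 1, 1, 1]
```

The first miss in a 2 MB region pays for the full walk; later misses in the same region skip three of the four accesses. A miss in a different 2 MB region, such as `walk_cost(1 << 21, cache)`, pays the full cost again in this model.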

Trade-offs#

Hardware-managed TLB (x86, ARM) — fast misses (single-digit cycles when the walker hits its structure cache), fixed page-table format, no kernel involvement on misses. The cost is silicon area and the format lock-in.
Software-managed TLB (MIPS, Alpha) — flexible page-table format (the OS picks), simpler MMU silicon, but every miss is a trap into the kernel — hundreds of cycles minimum. Lost on general-purpose CPUs but still seen in some embedded designs.

Other tensions:

  • Capacity vs. lookup latency. A 1-cycle 64-entry L1 plus a 7-cycle 1500-entry L2 outperforms a single 1500-entry L1 because most lookups hit in L1.
  • Eager flush vs. ASID tagging. Tagging adds bits to every entry; flushing wastes the warm cache. ASIDs win on context-switch-heavy workloads.
  • Huge pages vs. 4 KB pages. Huge pages multiply coverage but waste memory on small allocations and complicate fork’s copy-on-write. THP’s defragmenter (khugepaged) trades CPU for opportunistic coalescing.
  • TLB shootdown cost on SMP. Unmapping a page on one core requires sending an IPI to every other core to invalidate their TLBs. Frequent shootdowns are a major bottleneck for memory-intensive multicore workloads.
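The capacity-versus-latency point can be checked with back-of-the-envelope arithmetic. The numbers below are assumptions chosen for illustration: a 95% L1 hit rate, and a single large TLB that would need the same 7 cycles as the L2 on every access. Page-walk cost is ignored because it would be roughly the same in both designs.

```python
def avg_lookup_cycles(l1_hit_rate, l1_cycles, l2_cycles):
    """Average cycles per translation that hits somewhere in a two-level TLB."""
    # Every lookup pays the L1 latency; only L1 misses also pay the L2 latency.
    return l1_cycles + (1.0 - l1_hit_rate) * l2_cycles

two_level = avg_lookup_cycles(l1_hit_rate=0.95, l1_cycles=1, l2_cycles=7)
single_big = 7.0   # assumed latency of one 1500-entry array, paid on every access

print(round(two_level, 2), single_big)   # 1.35 7.0
```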
Why doesn't the TLB just hold every translation?

With 4 KB pages, a 32-bit address has a 20-bit VPN: 2^20 = 1M possible translations per address space, and vastly more on 64-bit. Even one cache line per entry would dwarf the entire L1+L2 cache budget on a modern CPU. The TLB is deliberately tiny (64–1536 entries) because it has to deliver a 1-cycle lookup, which means SRAM in the critical path. A small fully-associative L1 plus a larger L2 plus the page-table walker already delivers a ~99% hit rate, which makes it the right design.
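The size of a hypothetical "hold everything" TLB is quick to estimate. This sketch assumes 64-byte cache lines and one line per entry, as in the paragraph above; `full_tlb_bytes` is an illustrative name.

```python
CACHE_LINE = 64     # bytes per entry (assumed line size)
MIB = 1 << 20
TIB = 1 << 40

def full_tlb_bytes(va_bits, page_shift=12, bytes_per_entry=CACHE_LINE):
    """Storage needed to hold one entry for every possible VPN."""
    vpn_bits = va_bits - page_shift
    return (1 << vpn_bits) * bytes_per_entry

print(full_tlb_bytes(32) // MIB, "MB for a 32-bit address space")   # 64
print(full_tlb_bytes(48) // TIB, "TB for a 48-bit address space")   # 4
```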

Common pitfalls#

  • Modifying a PTE without invalidating the TLB. The CPU keeps using the cached translation until you execute invlpg <addr> (x86) or tlbi vae1 <addr> (ARM). Stale entries silently bypass new permissions — a common kernel bug.
  • Forgetting TLB shootdown on SMP. Each core has its own TLB. Unmapping on core 0 must IPI cores 1..N so they invalidate too. Skipping the shootdown produces use-after-free at the hardware level.
  • Assuming context switch flushes the TLB. Without ASIDs, yes. With ASIDs (default on modern Linux), the entries stay; cross-process cache hits are still impossible because of the tag mismatch.
  • Profiling TLB misses on small workloads. A microbenchmark that fits in 256 KB will have ~100% TLB hit rate. Real-world TLB pressure shows up at multi-GB working sets or with random access patterns.
  • Enabling THP and getting worse latency. THP’s defragmenter (khugepaged) and on-fault collapsing can cause unpredictable stalls. Latency-sensitive workloads (databases) often disable THP and use explicit hugepages instead.
  • Mixing huge pages and fork. A fork over a 2 MB huge page does copy-on-write at 2 MB granularity — a single child write copies the whole 2 MB page. Big surprise on memory footprint.
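The shootdown pitfall can be illustrated with per-core TLBs in a toy model. Everything here is hypothetical scaffolding (the `Core` class, the `unmap` helper, the `shootdown` flag); a real kernel sends IPIs and waits for acknowledgements.

```python
class Core:
    """One CPU core with its own private TLB (vpn -> pfn)."""
    def __init__(self):
        self.tlb = {}

def unmap(page_table, cores, vpn, shootdown=True):
    """Remove a mapping on cores[0]; optionally invalidate the other cores too."""
    page_table.pop(vpn, None)
    # Without a shootdown, only the local core drops its cached entry.
    targets = cores if shootdown else cores[:1]
    for core in targets:
        core.tlb.pop(vpn, None)

cores = [Core(), Core()]
page_table = {8: 77}
for core in cores:
    core.tlb[8] = page_table[8]      # both cores have cached the translation

unmap(page_table, cores, 8, shootdown=False)
stale_pfn = cores[1].tlb.get(8)      # 77: core 1 can still reach the freed frame

unmap(page_table, cores, 8, shootdown=True)
after_shootdown = cores[1].tlb.get(8)  # None: every core has invalidated
```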