Page Replacement Policies — OPT, FIFO, LRU, CLOCK

The classic algorithms, why OPT is unachievable, how CLOCK approximates LRU cheaply, and the thrashing wall.

Building Block Intermediate
9 min read
page-replacement lru clock thrashing working-set

What it is#

A page replacement policy is the decision the kernel makes when it needs to free a physical frame and all frames are occupied: which page do I evict? Get it right and the evicted page is one that won’t be touched for a long time; get it wrong and the same page faults back in immediately, producing thrashing.

The policies tier from theoretically optimal to cheaply approximate:

  • OPT — evict the page that will be referenced furthest in the future. Provably optimal; requires an oracle.
  • FIFO — evict the page that was loaded first. Simple, fast, and wrong (a page in heavy use gets evicted just because it was loaded early).
  • LRU — evict the page that was used least recently. Excellent approximation of OPT for most workloads; expensive to implement exactly.
  • CLOCK (second-chance) — a cheap LRU approximation using the hardware “accessed” bit and a circular scan.
  • WS / working-set models — pick eviction candidates based on the recent reference set rather than per-page recency.

Real kernels (Linux, BSD, Windows) all use CLOCK variants with refinements (LRU-2, active/inactive lists, multi-generation queues).

When to use it#

You’re touching this every time the kernel reclaims memory — i.e., whenever physical RAM fills up. The user-facing controls are coarse:

  • Increase RAM. The most reliable way to make replacement irrelevant.
  • Tune working-set size. Cap the file cache (vm.vfs_cache_pressure), cap the anonymous heap (cgroup memory.max), pin hot pages (mlock, mlockall).
  • Bias the replacement algorithm. vm.swappiness shifts the eviction preference between anonymous and file-backed pages.
  • Disable the wrong things. Latency-sensitive servers turn off transparent huge pages, set swappiness=1, and disable khugepaged collapse — not because those features are broken but because their replacement-time behaviour is unpredictable.

When designing a userspace cache (memcached, Redis), you’re implementing the same problem on top of malloc — the LFU and TinyLFU variants in Caffeine are direct descendants.

How it works#

OPT — the bound#

OPT evicts the page whose next use is furthest in the future. It’s the theoretical optimum for any reference string. You can’t implement it (you’d need to know the future), but you can measure how close real algorithms get to it on a captured trace.

reference string: A B C D A B E A B C D E
frame count = 4
OPT decisions:
A B C D → loaded
A B E → evict the one used furthest in the future
C: next use at position 10
D: next use at position 11
→ evict D, load E

OPT gives the bound; everything else is an approximation.

FIFO — the cautionary tale#

Evict the page that’s been in memory longest. Implemented as a FIFO queue of allocated frames.

ref: A B C D A B E A B C D
FIFO (3 frames):
step frames miss?
A [A ] miss
B [A B ] miss
C [A B C] miss
D [B C D] miss (evict A)
A [C D A] miss (evict B)
B [D A B] miss (evict C)
...

FIFO suffers from Belady’s anomaly — adding a frame can increase miss count. A page in heavy use is evicted just because it was loaded early. Used historically; not in serious production use.

LRU — the gold standard approximation#

Evict the page that was used least recently. The intuition: recently used pages are likely to be used again (temporal locality).

Exact LRU requires moving a page to the head of an ordered list on every access. That’s untenable — every load and store cannot afford a linked-list-update side effect. The OS’s exact LRU is too expensive for the same reason exact LFU is.

But LRU as a target — the page most recently used least — is what every real kernel chases approximately.

CLOCK — LRU on the cheap#

The trick: don’t track exact recency. Instead, use the hardware accessed bit (A bit) that the MMU sets on every reference. Pages live in a circular list. A “hand” points at the next candidate:

pages laid out as a ring:
┌───────────────────┐
→ │ A: ref=1 │
├───────────────────┤
│ B: ref=0 ← hand │
├───────────────────┤
│ C: ref=1 │
├───────────────────┤
│ D: ref=0 │
└───────────────────┘
on eviction:
look at hand:
if ref=1: clear it (give a second chance), advance hand
if ref=0: evict this page, advance hand

A page with ref=1 is “recently used” — clear it and skip; if the hand comes back before another access, the bit will still be 0 and the page evicts. A page never referenced will be evicted on the first sweep.

Dramatically cheaper than exact LRU and, in practice, gets within a few percent of OPT on typical workloads. Linux’s reclaimer is a refinement of CLOCK with active/inactive lists.

Active / inactive lists#

Linux maintains two lists per zone: active (recently used) and inactive (candidates for eviction). When the reclaimer wakes, it scans the inactive list with a CLOCK-like algorithm. Pages with the accessed bit set get promoted back to active. Pages that survive the inactive scan unrefenced get reclaimed.

Pages move between lists based on access patterns:

  • A new page goes on the inactive list (so a one-time read doesn’t pollute active).
  • A second reference within the inactive lifetime promotes it to active.
  • Active pages age toward the back of the active list; if not re-referenced, demoted to inactive.

This is essentially a two-queue / LRU-2 approximation: a page must be referenced twice before earning “active” status, which filters out one-shot reads (like a cat large_file).

Multi-generation LRU (MGLRU)#

The newer Linux reclaim algorithm (mainlined in 6.1). Pages are grouped into generations based on age. Reclaim scans the oldest generation. Reference activity promotes pages to younger generations. The algorithm is simpler than the active/inactive scheme and shown to perform better on container workloads.

Variants#

Random / NRU (Not Recently Used)#

Evict any unreferenced page at random. Surprisingly competitive in some studies because random tends to spread eviction evenly. NRU buckets pages by (referenced, modified) bit pairs and evicts from the lowest-priority bucket.

LFU (Least Frequently Used)#

Track an access counter per page; evict lowest counter. Mathematically appealing but vulnerable to stale popularity: a once-hot page that’s now cold sits in cache forever. TinyLFU and W-TinyLFU (used in Caffeine) add aging and admission-window heuristics to fix this.

Working-set algorithm#

Define a process’s working set W(t, Δ) as the pages it referenced in the last Δ time units. Replace any page not in any process’s current working set. Adaptive to phase changes; expensive to track exactly.

Page-fault frequency (PFF)#

Measure each process’s fault rate. If too high, give it more frames (steal from a low-PFF process). If too low, take frames away. A feedback-loop variant of working-set.

Trade-offs#

LRU (exact or close approximation) — excellent hit rate on workloads with temporal locality, robust, well-understood. Cost: exact tracking is expensive; CLOCK approximations need the hardware accessed bit; sequential scans defeat LRU’s assumption.
FIFO / random — trivial to implement, no per-access bookkeeping. Cost: poor hit rate on workloads with reused hot pages; FIFO has Belady’s anomaly; rarely used outside teaching examples.

Other tensions:

  • Scan resistance. A cat big_file reads pages once and never again. Pure LRU promotes them all to the most-recently-used position, evicting actually-hot pages. Active/inactive lists and LRU-2 add the “must be referenced twice” rule that fixes this.
  • Aging speed. How fast should “accessed recently” decay? Too fast and you re-evict hot pages; too slow and stale pages linger.
  • Per-process vs. global pools. Global pools (one big pool, any process can evict) are simple but allow one greedy process to evict another’s hot pages. Per-process quotas (memcg) defend hot processes but waste memory if some are idle.
  • Anonymous vs. file pages. File pages are cheap to evict (drop, reload). Anonymous pages must be written to swap. vm.swappiness decides the preference; defaulting to file-cache reclaim first.
  • NUMA awareness. Evicting a local page is cheaper than a remote one (local DRAM latency is ~80 ns, remote ~140 ns). Linux’s reclaim is NUMA-aware via per-node lists.
What's the simplest production-grade policy?

CLOCK with two lists. Active list for pages referenced more than once recently; inactive list for new arrivals and demoted active pages. CLOCK-scan the inactive list to find eviction victims. This catches both temporal locality (hot pages stay active) and scan resistance (one-shot reads stay inactive and die quickly). Linux pre-MGLRU is essentially this with refinements; FreeBSD’s algorithm is similar; Windows uses a related scheme. Plain LRU and FIFO show up in textbooks and almost nowhere else.

Common pitfalls#

  • Assuming exact LRU is what real kernels run. They run approximations — CLOCK, two-list LRU, multi-generational LRU. Exact LRU is too expensive.
  • Thrashing without recognising it. When the working set exceeds RAM, every fault evicts a page that’s faulted back immediately. The system spends most of its CPU servicing faults; throughput collapses. The fix is to reduce the working set, not to “tune the replacement policy harder.”
  • Mistaking page-replacement for cache-replacement. Page replacement happens at the kernel layer over physical frames; cache replacement happens at the app layer (memcached, in-process LRU). Same algorithms, different scope.
  • Overweighting vm.swappiness. Setting it to 0 doesn’t disable swap; it disables anonymous-page eviction in favour of file-cache eviction. Under enough pressure, anonymous pages still go to swap.
  • Locking everything down. mlockall(MCL_CURRENT | MCL_FUTURE) pins every page in RAM. Eliminates faults, but if the app’s working set ever exceeds RAM, the OOM killer fires immediately. Useful for real-time apps and trading systems; dangerous as a “fix” for general apps.
  • Benchmarking with a hot file cache. First-run latency includes faults; subsequent runs benefit from cached pages. Clear caches (echo 3 > /proc/sys/vm/drop_caches) between runs for honest numbers.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.