Multi-Level Feedback Queue (MLFQ)
Adaptive priority queues, priority boosts to prevent starvation, accounting tricks to defeat gaming.
What it is#
MLFQ is a scheduler that learns each process’s behaviour and reacts without being told anything in advance. It maintains several priority queues (4–8 in real systems), each running RR internally. A new process enters the top queue; if it uses its full quantum it gets demoted to a lower queue; if it gives up the CPU early (blocks for I/O) it stays where it is. The result is that interactive, I/O-bound processes float to the top (and get quick response) and CPU-bound batch processes sink to the bottom (and get longer slices to amortise switch cost).
The design is the production answer to the problems of pure FIFO / SJF / RR. It approximates SJF without knowing job lengths (short bursts stay high-priority, long bursts get demoted). It guarantees response time for interactive jobs (they sit in the high-priority queue). It avoids starvation by periodically boosting everyone back to the top.
When to use it#
MLFQ is the historical default for general-purpose UNIX systems — Solaris’s TS class, FreeBSD’s ULE, the BSD 4.3 scheduler, NT’s priority-based scheduler. Linux used a multilevel-queue scheduler through 2.6 (the O(1) scheduler) before switching to CFS. You reach for MLFQ-style designs when:
- You don’t know job runtimes ahead of time but want SJF-like behaviour.
- You have a mixed workload of interactive and batch tasks and want both to feel reasonable.
- You want a scheduler that’s simple to implement (a small number of queues, one runqueue head per priority).
The classic counter-case is when you need fair-share semantics — “process X should get exactly 30% of CPU” — which MLFQ doesn’t promise. There, lottery or CFS-style schedulers fit better.
How it works#
The basic rules#
A canonical MLFQ has these rules:
- If priority(A) > priority(B), A runs. Higher-priority queues empty before lower ones get the CPU.
- If priority(A) = priority(B), A and B run in RR within that queue.
- A new process enters the top queue. Optimistic guess: it might be interactive.
- If a process uses its full quantum, demote it one level. It looked CPU-bound; give it a longer slice next time at a lower priority.
- If a process gives up the CPU before its quantum (blocks on I/O), keep it at the same priority. Or in some variants, promote it.
- Periodically boost every process to the top. Prevents starvation; resets bad guesses; handles processes that change behaviour over their lifetime.
Quanta grow with priority depth: the top queue might use a 10 ms quantum, the bottom a 200 ms quantum. Longer slices at the bottom amortise switch cost on batch workloads.
Why each rule exists#
- Top-queue start approximates SJF for new short jobs — they often finish in the top queue without ever being demoted. If they don’t finish, they get demoted and accepted as long jobs.
- Demotion on full quantum use is the discovery mechanism. The scheduler doesn’t know who’s CPU-bound; it observes.
- Stay-at-priority on I/O block keeps interactive jobs interactive. A text editor that blocks on every keystroke never gets demoted.
- Priority boost solves two problems at once: long-CPU starvation (a process stuck at the bottom queue gets occasional CPU) and behaviour change (a previously CPU-bound process that becomes interactive gets to refloat).
Defeating the gaming attack#
The naive rule “stay at the same priority if you block before the quantum ends” is exploitable. A clever process could usleep(1) just before its quantum expires, releasing the CPU, and stay at top priority forever — getting 99.9% of CPU as a top-priority “interactive” task.
The fix is time-based accounting: each process accumulates its total CPU usage at the current priority. When it crosses the quantum threshold (regardless of how many times it relinquished and was rescheduled), it gets demoted. The Solaris and BSD schedulers both implemented this. The rule becomes “once you’ve used N ms at this priority level, you’re demoted, no matter how you used it.”
A worked example#
Three jobs: A is interactive (does a 1 ms burst, blocks on I/O, repeats), B is CPU-bound (10 second compute), C arrives at t=5s with a 50 ms burst then exit.
queues: Q0 (top, quantum=10ms), Q1, Q2 (bottom, quantum=200ms)
t=0 : A and B in Q0 A bursts 1ms, blocks; B bursts 10ms, demoted to Q1t≈10ms : A wakes, runs 1ms in Q0, blocks; B in Q1 runs 20ms, demoted to Q2... A keeps cycling at Q0 (small bursts, fast response) B sits in Q2 getting 200ms slices when nobody else is runnablet=5s : C enters at Q0; runs 10ms, blocks (or completes 50ms across a few quanta) C finishes long before B (SJF-like behaviour without knowing C's length)t=10s : boost — A and B both back to Q0; B demotes again immediatelyWithout the scheduler being told anything, A gets snappy response, C finishes fast, B doesn’t starve, and gaming via short sleeps can’t elevate B’s effective priority.
Variants#
Number of queues and quanta#
Solaris’s TS scheduler had 60 priority levels with finely-tuned quanta. FreeBSD ULE has fewer but with explicit interactive / batch zones. The 4–5 queue version in the textbook is a teaching simplification; production systems used more queues for finer behavioural classification.
Adaptive quanta#
Some MLFQ variants scale the quantum continuously with priority instead of using a fixed table. The intuition is the same: high-priority short slices, low-priority long slices.
Process aging#
Linux’s pre-CFS O(1) scheduler used a “bonus / penalty” form of aging where each process’s effective priority shifted based on its sleep ratio (sleep time / runtime). High sleep ratio → bonus → priority floats up. CPU-hog → penalty → priority sinks. Same idea as MLFQ implemented via a continuous bonus instead of discrete queues.
Priority inheritance#
When a low-priority process holds a lock that a high-priority process needs, naive MLFQ can deadlock or stall the high-priority job arbitrarily long (“priority inversion”). The fix — priority inheritance — temporarily boosts the lock holder to the priority of whoever is waiting. Famously diagnosed on the Mars Pathfinder mission, where a low-priority “meteorological data” task held a lock that a high-priority bus management task needed.
Real-time MLFQ hybrids#
Many production schedulers reserve top-priority queues for real-time (FIFO or RR) tasks that aren’t subject to demotion or boost. Linux’s SCHED_FIFO and SCHED_RR are this — they sit above the normal CFS class, always run when runnable, and the kernel trusts the application not to monopolise.
Trade-offs#
nice and sleep behaviour drive vruntime. Replaced MLFQ in Linux 2.6.23. Specific tensions:
- Tuning brittleness. MLFQ’s behaviour depends sensitively on the boost interval and quantum table. Production tuning required workload-specific knobs that didn’t transfer between server / desktop / mobile.
- Fairness is implicit, not guaranteed. Two CPU-bound jobs both end up in the bottom queue and split RR, but two jobs of very different priorities don’t have a proportional guarantee — they have a strict priority ordering.
- Boost interval sensitivity. Too long and CPU-bound jobs feel starved; too short and the “discovery” of who’s batch is undone too often. CFS sidestepped this by not having discrete queues to discover.
Common pitfalls#
- Building MLFQ without anti-gaming accounting. A teaching MLFQ that uses “did you finish your quantum?” is exploitable as soon as a clever process learns to yield. The accumulated-CPU rule is non-optional.
- Forgetting the priority boost. Without it, a CPU-bound process that was demoted at boot can stay starved forever if interactive workload arrives later. The boost is the safety valve.
- Treating the top queue as real-time. It isn’t; it’s the “newly arrived or freshly boosted” queue. Real real-time tasks need a separate scheduling class above MLFQ.
- Ignoring priority inversion. If your MLFQ holds locks that cross priority levels, you must implement priority inheritance or your high-priority tasks will stall on low-priority lock holders.
- Choosing quanta without measuring switch cost. Quanta should be tens to hundreds of times the context-switch cost — too short and the lower queues spend all their CPU on switching, too long and the response-time benefit of the top queue disappears.
Why did Linux move from MLFQ to CFS?
Three reasons. First, MLFQ’s discrete queues made nice value semantics fuzzy — what exactly does “nice -10” mean in queue terms? Second, MLFQ behaviour drifted under bursty workloads in ways that were hard to explain to users (the GNOME-laggy-during-make-j problem). Third, Ingo Molnar’s CFS demonstrated that a single ordered structure (red-black tree on vruntime) could deliver clean fairness semantics with simpler tuning — one knob (sched_latency_ns) instead of a quantum table. The trade was less behavioural adaptation, more provable fairness. Linux took the trade.
Related building blocks#
- CPU Scheduling — FIFO, SJF, STCF, RR — the disciplines MLFQ refines.
- Lottery Scheduling and Linux CFS — the fair-share family that replaced MLFQ in Linux.
- Context Switching — the cost MLFQ tries to amortise with longer low-priority quanta.
- The Process Abstraction — what’s being placed into the queues.
- Multi-CPU Scheduling — how MLFQ extends across cores.