Lottery Scheduling and Linux CFS — Operating Systems

What it is#

Lottery scheduling and Linux’s CFS (Completely Fair Scheduler) are the fair-share family — schedulers that guarantee each process a proportional slice of CPU rather than discovering its needs adaptively. Lottery does it probabilistically: every process holds some number of “tickets,” and at each scheduling decision the scheduler draws a random ticket and runs that process. Over time, share converges to ticket count. CFS does it deterministically: each process accumulates virtual runtime (vruntime) weighted by its nice value, and the scheduler always picks the process with the lowest vruntime — keeping every process as close to its fair share as possible at every instant.

Both designs replace MLFQ’s adaptive discovery with explicit guarantees. If you say “process X has 30% of the system’s tickets” or “process X has nice=-5 relative to nice=0 siblings,” the scheduler will deliver that share over any reasonable time window.

When to use it#

Fair-share schedulers are the default on every modern Linux system (CFS since 2.6.23, replaced by EEVDF in 6.6+ but the same family). You reach for the design when:

You need proportional guarantees. A multi-tenant server where each tenant pays for a specific CPU share. A container runtime that enforces cgroup CPU shares. A scientific cluster where users have ratio-based quotas.
You want simple, explicit policy. nice and cgroup weights are direct knobs; you don’t have to reason about queue counts and boost intervals.
You want robustness across unfamiliar workloads. Fair-share doesn’t depend on classifying tasks as “interactive” vs “batch” — it just allocates time proportionally and lets sleep-vs-run patterns shape outcomes.

The counter-cases are real-time systems (where you want strict priority, not fairness) and embedded systems with a known workload (where MLFQ-style tuning can extract more throughput).

How it works#

Lottery scheduling#

Each process holds an integer ticket count proportional to its desired share. On each scheduling decision the scheduler:

Sums the tickets of all runnable processes (call it T).
Picks a random number r in [0, T).
Walks the runnable list, subtracting each process’s ticket count from r until r < 0. That process runs next.

Properties: share converges to ticket ratio by the law of large numbers; no starvation because every process with non-zero tickets has positive probability; easy composition because tickets aggregate (give a parent 100 tickets, it can distribute them among children as it sees fit — “currency conversion”). Costs: random-number generation per tick, O(n) walk per decision (or O(log n) with a tree). The probabilistic guarantee is loose at short time scales — a process with 1% of tickets might get 0% over a 100 ms window and 5% over the next.

Stride scheduling#

The deterministic version of lottery: each process has a “stride” inversely proportional to its tickets, and a “pass” counter that advances by its stride each time it runs. The scheduler picks the process with the lowest pass value. Same long-run share as lottery, no randomness, tighter short-term fairness. Stride is the conceptual ancestor of CFS.

CFS — virtual runtime and the red-black tree#

CFS gives each task a vruntime counter, measured in nanoseconds but scaled by an inverse weight derived from nice:

vruntime += delta_exec * NICE_0_LOAD / task->load.weight

A nice 0 task accumulates delta_exec of vruntime per delta_exec of real CPU time. A nice -5 task with ~3x weight accumulates only ~1/3 as much vruntime, so it appears “less run” and gets picked more. A nice +5 task accumulates ~3x faster and gets picked less.

The runnable set is a red-black tree keyed by vruntime. The leftmost node (lowest vruntime) is the next to run; insertions and removals are O(log n). The “leftmost” pointer is cached so picking is O(1).

Sleep and wakeup#

When a task sleeps (blocks on I/O) it leaves the tree; when it wakes it must rejoin. If we re-inserted it with its old vruntime, it would be far behind everyone and get to monopolise CPU until it “caught up” — exactly the bug that lets sleepers starve runners. The fix: on wakeup, set the task’s vruntime to max(its_old_vruntime, min_vruntime_in_tree - some_threshold). This gives sleepers a small bonus for snappier response (they appear behind, get to run soon) without letting them dominate.

Latency target and minimum granularity#

CFS exposes two main tunables: sched_latency_ns (default 6 ms) is the target time within which every runnable task should run at least once. sched_min_granularity_ns (default 0.75 ms) is the floor on any individual slice. With n runnable tasks, each slice is sched_latency / n, floored at min_granularity (which is what kicks in once there are too many tasks to give each one a meaningful share within the latency window).

Group scheduling and cgroups#

CFS extended to support hierarchies: cgroups are nodes in a tree of “scheduling entities,” and each entity has its own runqueue (red-black tree) of children. Picking a task means walking down the hierarchy, picking the leftmost entity at each level, recursing into its sub-tree, finally arriving at a task. This is what makes Linux’s container resource control real — a cgroup with cpu.shares=1024 gets twice the CPU of a sibling with cpu.shares=512, regardless of how many tasks are inside each.

Variants#

EEVDF — what came after CFS#

Linux 6.6 (Oct 2023) replaced CFS with EEVDF (Earliest Eligible Virtual Deadline First). It keeps the vruntime concept but adds per-task latency requests: a task can ask “I need to run within X ms,” and EEVDF schedules to satisfy the earliest such virtual deadline. Same fair-share guarantee, better latency control. The red-black tree is still the data structure.

Lottery composition (currency)#

The original lottery paper introduced “currency”: a parent process gets a pool of tickets in some currency and can mint sub-currency for its children. This is the conceptual root of nested cgroups — a parent’s share is partitioned among its children proportionally.

Stride vs. lottery for predictability#

Stride is preferred when you need deterministic short-term behaviour (financial scheduling, real-time-ish use cases). Lottery is preferred when randomisation has its own value (resistance to gaming, simpler composition). In practice neither shipped at scale; CFS subsumed both.

BFS / MuQSS#

Con Kolivas’s alternative Linux schedulers (BFS, then MuQSS) maintained a single global runqueue (BFS) or per-CPU queues with explicit deadlines (MuQSS) to optimise for desktop latency over throughput. Not merged into mainline but influential in the design space — they pushed CFS toward better latency tuning.

FreeBSD ULE#

A scheduler that’s neither pure MLFQ nor pure fair-share — it splits tasks into interactive and batch classes and uses per-CPU runqueues with explicit migration policy. Worth knowing as a counterpoint: not every modern scheduler is CFS-shaped.

Trade-offs#

Lottery — simple, composable, no starvation by construction, easy to reason about. Probabilistic guarantees only — short-window variance can be large. Random-number generation per tick costs cycles. Never shipped in a mainstream OS but the conceptual influence is everywhere (currency, ticket aggregation).

CFS / EEVDF — deterministic, O(log n) operations, explicit nice and cgroup semantics, group scheduling for free. More machinery (red-black tree, vruntime accounting, wakeup bonus heuristic). Shipped in Linux since 2007, runs on billions of machines.

Specific tensions:

Throughput vs. fairness. A perfectly fair scheduler must preempt frequently enough to enforce the share, which costs context switches. CFS tunes this via sched_latency_ns.
Interactivity vs. fairness. Pure fair-share would give a sleepy interactive task only as much CPU as its weight allows, even when no one else is runnable. The wakeup bonus in CFS is a fairness compromise to keep interactive feel snappy.
Tree depth. Group scheduling adds a tree level per cgroup nesting depth. Deeply nested containers add walk cost on every pick — measurable on very deep hierarchies, negligible for typical container setups.
Sensitivity to clock resolution. vruntime accounting requires the kernel to know exactly how long each task ran. Modern hardware timestamps make this cheap; on older hardware it was a cost centre.

Common pitfalls#

Assuming nice is linear. It isn’t — each step of nice multiplies the weight by ~1.25 (or divides by it). nice -5 vs nice 0 is ~3x, nice -10 vs nice 0 is ~9x. Reasoning about nice arithmetically gives wrong answers.
Expecting fair-share to deliver bounded latency. It doesn’t, by default — it delivers proportional share. If you need bounded latency (audio, real-time control), use SCHED_FIFO / SCHED_DEADLINE, not nice tuning.
Sleep-then-spin to game fairness. Doesn’t work on CFS the way it worked on MLFQ — vruntime accumulates over total CPU at every scheduling decision, and the wakeup bonus is capped. Trying to game CFS by yielding produces approximately fair behaviour anyway.
Cgroup cpu.shares=0. Doesn’t mean “no CPU” — it’s an invalid value in older versions, and in cgroup v2 the equivalent is cpu.weight=0 which means “use the lowest weight.” To actually deny CPU, use cpu.max (cgroup v2 quota), not weight.
Treating fair-share as fair across NUMA nodes. It isn’t — CFS runqueues are per-CPU, and the fair-share guarantee is local. Cross-node fairness depends on the load balancer, which has its own knobs.

Why did lottery influence so much without ever shipping?

Because the elegance of the composition (currency, ticket transfer, sub-allocation) is what cgroup hierarchies eventually delivered, and the proof of fairness is easy enough that it became a teaching staple. CFS pulled the determinism advantage of stride scheduling and the hierarchy advantage of lottery into one design. Lottery’s IP didn’t ship under that name; its ideas did.

CPU Scheduling — FIFO, SJF, STCF, RR — the disciplines fair-share moves beyond.
Multi-Level Feedback Queue (MLFQ) — the previous-generation default that CFS replaced.
Context Switching — the cost CFS pays to enforce its quantum target.
Multi-CPU Scheduling — how CFS extends to many cores.
The Process Abstraction — what’s being placed into the vruntime tree.