Multi-CPU Scheduling
Cache affinity, single-queue vs per-CPU queues, load balancing, and why scaling beyond one core is hard.
Summary#
Scheduling on a single CPU is a search-and-pick problem. Scheduling on N CPUs is a search-and-pick problem on each CPU plus a placement problem between them — which task runs on which core, when to migrate, how to share state between schedulers, how to keep caches warm, how to balance load. None of this is interesting on a desktop with 4 cores; all of it is the entire scheduler design on a server with 128 cores split across multiple NUMA nodes.
The two structural choices are single-queue (SQMS) — one shared runnable list protected by a lock — and multi-queue (MQMS) — per-CPU runqueues with explicit load balancing. SQMS is simple but scales catastrophically (the lock becomes the bottleneck past ~8 cores). MQMS scales linearly in steady state but pays for it in load-balancing complexity, cache-affinity heuristics, and occasional fairness drift between cores. Every production OS today uses MQMS with periodic load balancing — Linux CFS, FreeBSD ULE, Solaris, Windows.
Why it matters#
Modern servers have dozens to hundreds of cores. A scheduler that doesn’t scale becomes the bottleneck for the whole system. The lock on a single global runqueue can dominate a workload’s profile — vmstat-visible queue contention, threads spending more time waiting to be scheduled than running. Linux’s pre-2.6 scheduler had exactly this problem above ~8 CPUs, which is why the O(1) scheduler (per-CPU queues) was a big deal in 2002.
Cache affinity matters because moving a task from CPU 0 to CPU 16 isn’t free even if both are idle. The task’s working set lives in CPU 0’s L1 and L2 caches; on CPU 16 those caches are cold and the task takes microseconds-to-milliseconds to warm them. On a NUMA machine the data lives in CPU 0’s memory node; CPU 16 has to fetch it across an interconnect, paying latency on every miss until the task’s pages migrate. The scheduler that ignores affinity wastes the cache hierarchy.
How it works#
Single-queue multi-processor scheduling (SQMS)#
One global runqueue protected by a lock. Every CPU, when it needs to pick a task, takes the lock, picks from the queue, releases the lock, and runs. Simple to reason about — global fairness is automatic — and it gives load balance “for free” because every idle CPU naturally pulls from the same pool.
Failure modes:
- Lock contention. Past ~8 cores the lock holds for non-trivial fractions of total CPU time. Cache-line ping-pong on the lock word dominates.
- No cache affinity. A task that ran on CPU 0 last tick is just as likely to run on CPU 16 next tick. Caches stay perpetually cold.
- Cross-core bouncing. A task that wakes up gets dispatched to whatever CPU is fastest to grab the lock, which is rarely the CPU its data is on.
SQMS survives at small scale — embedded systems with 2–4 cores can still ship it — but no general-purpose OS uses it past that.
Multi-queue multi-processor scheduling (MQMS)#
Each CPU has its own runqueue. Tasks are assigned to a CPU at creation (usually the parent’s CPU, or one chosen by an idle balancer) and stay there until something migrates them. Scheduling on each CPU is then the same single-CPU problem, fast and lock-contention-free.
The new problem: load balance. If A and B are CPU-bound and both end up on CPU 0, while CPU 1 is idle, the scheduler must notice and migrate one of them. Two complementary mechanisms:
- Push migration. Periodically (every few ticks) a balancer checks per-CPU load and pushes tasks from over-loaded to under-loaded CPUs. Linux runs this in the
load_balancepath called from the scheduler tick. - Pull / idle migration. When a CPU goes idle (its runqueue empties), it actively looks at its neighbours’ queues and pulls a task. Cheaper than periodic push because it only fires when there’s slack.
Both mechanisms weigh migration cost against load imbalance — Linux’s migration_cost_ns tunable is the threshold below which a balancing decision skips the move.
Cache affinity heuristics#
A task that ran on CPU k recently has hot caches on k. The scheduler tracks last_cpu per task and prefers to dispatch it back to the same CPU on wakeup, even if that means slightly higher queue depth on k. The trade-off is captured by the migration_cost_ns and wake_up_idle_cpu knobs — short last-run-here windows favour affinity, longer ones favour empty-CPU dispatch.
NUMA-aware scheduling#
On a NUMA box, CPUs are grouped into nodes, each with its own local memory. Cross-node memory access is 2–3x slower than local. The scheduler tries to:
- Keep a task on the node where its memory lives. Linux tracks
numa_faultsper task per node and migrates the task toward its data, or migrates the data toward the task (AutoNUMA). - Balance within a node before across nodes. The load-balance hierarchy in Linux is core → socket → node → system; balancing is cheapest at the lowest level and only escalates when needed.
- Avoid waking on remote nodes. A wakeup that lands on a remote-node CPU pays interconnect latency until the working set migrates.
Linux’s scheduling domains#
Linux organises CPUs into a tree of scheduling domains matching the hardware topology — SMT siblings inside cores, cores inside sockets, sockets inside NUMA nodes. Load balance walks this tree, balancing first within SMT pairs (cheap), then within sockets (moderate), then across nodes (expensive). Each level has its own balance interval — more frequent at the bottom, less frequent at the top — so cheap rebalances happen often and expensive ones happen rarely.
Per-CPU runqueue locking#
Each CPU’s runqueue has its own spinlock. Most operations (enqueue local task, dequeue local task, pick next) take only the local lock. Load balance takes two locks (source and destination) in a defined order to avoid deadlock. Wakeups across CPUs are handled with IPIs (inter-processor interrupts) — the waking CPU posts a wakeup, then sends an IPI to the target CPU to make it re-examine its runqueue.
Variants and trade-offs#
Specific multi-CPU trade-offs:
- Migration cost vs. balance. Aggressive balancing keeps cores busy but pollutes caches. Conservative balancing keeps caches warm but lets cores idle. Linux’s
migration_cost_ns(default ~500 microseconds) is the empirically tuned threshold. - Per-CPU vs. per-NUMA-node queues. Per-CPU is what Linux ships, but on very large boxes with many cores per node, per-node queues with intra-node sharing have been studied. Trade depth of locking against migration frequency.
- SMT siblings. Two hyperthreads share L1 and L2 caches. Scheduling two CPU-bound tasks on sibling threads can be worse than scheduling them on separate physical cores. Linux’s SMT-aware balancing tries to keep one task per physical core unless there are more tasks than cores.
- Power efficiency. On mobile / laptop systems, packing tasks onto a small number of CPUs and idling the rest saves power. The opposite of pure load-balance. Linux’s
EAS(Energy-Aware Scheduling) addresses this on ARM big.LITTLE systems.
Why is the Linux scheduler still adding NUMA features in 2024?
Because NUMA topology keeps getting more complex. AMD’s Zen 4 has multiple CCDs per socket, each with its own L3. Intel’s Sapphire Rapids has multi-die packages. Cloud VMs sit on heterogeneous hosts where the guest’s view of “one CPU” can hide significant locality. Each generation requires the scheduler to learn the new topology and re-tune migration costs, scheduling-domain depth, and wakeup placement. Multi-CPU scheduling is one of the kernel areas that won’t be “done” any time this decade.
A subtler tension: fair-share semantics across cores. CFS guarantees fairness within a runqueue, but two CFS runqueues running side by side might give one task more total CPU than another with the same weight, simply because their respective queues had different load. Linux’s group-scheduling logic tries to enforce cgroup fairness globally; the gap between local fairness and global fairness is one of the recurring sources of scheduler bug reports.
When this is asked in interviews#
In senior and staff-level systems interviews, especially for infrastructure / kernel / cloud roles. The screen filter is “what’s the difference between SQMS and MQMS”; the strong answer threads scaling, cache affinity, NUMA, and migration policy.
Follow-ups:
- “Why doesn’t Linux just use a global runqueue with a lock?” Tests whether you understand lock-scaling. Mid-senior.
- “How does CFS preserve fairness across CPUs?” Tests whether you’ve thought about local-vs-global fairness. Senior.
- “Walk me through what happens when a thread wakes up on a NUMA box.” Tests wake-up placement, IPI mechanics, affinity heuristics. Senior to staff.
- “Design a scheduler for a 1000-core machine.” Tests judgement about hierarchy, lock granularity, balance intervals. Staff and above.
- “How would you debug a workload that’s slower on more cores than on fewer?” Tests practical scheduling fluency — usually involves cache pollution, false sharing, or NUMA misplacement. Senior to staff.
A separate context is performance engineering: profiling a workload that shows context-switch overhead high relative to user time and asking why. The answer usually involves migration cost, NUMA bouncing, or SMT mis-placement.
Related concepts#
- CPU Scheduling — FIFO, SJF, STCF, RR — the per-CPU pick problem.
- Multi-Level Feedback Queue (MLFQ) — how the queue structure extends across cores.
- Lottery Scheduling and Linux CFS — the fair-share family multi-CPU policies extend.
- Context Switching — the cost that migration pays twice (out of one CPU, into another).
- The Process Abstraction — the entity being placed and migrated.