Context Switching
Saving and restoring CPU state across processes — register sets, kernel stack, the cost paid every preemption.
What it is#
A context switch is the act of saving one process’s CPU state, switching to another process’s state, and resuming it as if it had been running all along. It is the mechanism that makes time-sharing real — without it, the OS could not multiplex one CPU across many processes. The “context” is the set of register values that define what the CPU is doing: general-purpose registers, instruction pointer, stack pointer, flags, segment / control registers, floating-point and vector register file, and the page-table root.
The switch happens in two flavors. A process switch also changes the address space — different page table, TLB flush (or ASID swap), different kernel stack. A thread switch within the same process keeps the address space and file table and only swaps registers and stack. The cost difference is large enough that schedulers prefer same-process switches when given a choice.
When to use it#
You don’t choose to context-switch; the scheduler does. But you choose designs that make switching more or less frequent, and that matters because each switch costs cycles and pollutes caches. Heavy switching shows up as a workload that is CPU-bound on paper but whose user code gets only a fraction, say 60%, of the cycles it should.
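As a rough back-of-the-envelope sketch, the lost share is just the switch rate times the cost per switch. The function name and the workload numbers below are illustrative assumptions, chosen so the result lands on the 60% figure mentioned above.

```python
def switch_overhead_fraction(switches_per_sec: float, cost_us: float) -> float:
    """Fraction of one CPU's time consumed by context switches."""
    lost_us_per_sec = switches_per_sec * cost_us
    return min(lost_us_per_sec / 1_000_000, 1.0)

# Hypothetical workload: 20,000 switches/s at 20 us each (direct + indirect cost).
lost = switch_overhead_fraction(20_000, 20.0)
print(f"lost to switching: {lost:.0%}, left for user code: {1 - lost:.0%}")
# prints: lost to switching: 40%, left for user code: 60%
```

The model ignores everything except switching, so treat it as a sanity check on measured numbers, not a prediction.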
Common situations where switching dominates:
- Lots of short-lived processes. Spawning ten thousand grep invocations through a shell pipeline pays for thousands of fork + exec + switches.
- Heavy threading with fine-grained locks. Each contended pthread_mutex_lock that blocks turns into a switch out and a switch back in.
- High interrupt rates. Every interrupt is at least a partial context switch (registers saved on the kernel stack), and if the handler wakes a higher-priority task you get a full switch.
The countermove is to reduce switches: thread pools instead of per-task threads, lock-free data structures where contention is high, batched I/O via io_uring, and coroutines / async runtimes that schedule cheaply in user space.
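A minimal sketch of the thread-pool countermove, using Python's standard concurrent.futures. The task count, pool size, and the handle function are assumptions for illustration; the point is only that many tasks are multiplexed onto a few long-lived threads instead of one thread per task.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

TASKS = 200
POOL_SIZE = 4  # assumed value; size it to the machine in practice

def handle(task_id: int) -> str:
    # Report which OS thread ran this task.
    return threading.current_thread().name

with ThreadPoolExecutor(max_workers=POOL_SIZE, thread_name_prefix="worker") as pool:
    names = list(pool.map(handle, range(TASKS)))

distinct_threads = set(names)
print(f"{TASKS} tasks ran on {len(distinct_threads)} threads")
```

With one thread per task the scheduler would have 200 runnable entities to rotate through; here it has at most four.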
How it works#
What gets saved#
When the kernel decides to switch from process A to process B (in response to a timer tick, a blocking syscall, or a wakeup of a higher-priority task), it:
1. Saves A’s user registers onto A’s kernel stack. The hardware saved the user-mode instruction pointer, stack pointer, and flags as part of the trap that brought us into kernel mode; the kernel saves the rest with explicit stores.
2. Saves A’s kernel stack pointer into A’s PCB. This is the one slot the kernel cares about: once we know where A’s kernel stack lives, we can reconstruct everything else from it later.
3. Saves the floating-point / vector register file. This was traditionally done lazily, only if the state was actually used; on x86, Linux uses xsave / xrstor to save and restore it.
4. Switches address space if needed: loads B’s cr3 (x86) or TTBR0 (ARM). This flushes the TLB unless the architecture supports tagged TLBs / ASIDs.
5. Loads B’s kernel stack pointer from B’s PCB.
6. Restores B’s registers from B’s kernel stack.
7. Returns from the trap, dropping back to user mode at the instruction in B that was running before it got switched out.
On Linux x86_64 the core of this is in __switch_to (the C part) and the assembly stub __switch_to_asm. The whole sequence is a few dozen instructions plus the cost of any cache and TLB misses that follow.
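The sequence above can be mimicked with a toy model. This is a sketch, not kernel code: Task, Cpu, and context_switch are hypothetical names, the "kernel stack" is a Python list, and the FPU step and the return from the trap are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Toy PCB holding only the fields the steps above mention."""
    name: str
    page_table_root: int                      # stands in for cr3 / TTBR0
    kernel_stack: list = field(default_factory=list)
    saved_ksp: int = 0                        # saved kernel stack pointer

@dataclass
class Cpu:
    regs: dict
    page_table_root: int
    tlb_flushes: int = 0

def context_switch(cpu: Cpu, prev: Task, nxt: Task) -> None:
    # Save prev's registers on its own kernel stack and remember the stack top.
    prev.kernel_stack.append(dict(cpu.regs))
    prev.saved_ksp = len(prev.kernel_stack)
    # Change address space only if the page-table root differs.
    if cpu.page_table_root != nxt.page_table_root:
        cpu.page_table_root = nxt.page_table_root
        cpu.tlb_flushes += 1                  # this toy model has no PCID / ASID
    # Pick up next's kernel stack and restore its registers from it.
    cpu.regs = nxt.kernel_stack.pop()

a = Task("A", page_table_root=0x1000)
b = Task("B", page_table_root=0x2000,
         kernel_stack=[{"rip": 0x400, "rsp": 0x7FF0}])
cpu = Cpu(regs={"rip": 0x100, "rsp": 0x7000}, page_table_root=a.page_table_root)

context_switch(cpu, a, b)   # A -> B
context_switch(cpu, b, a)   # B -> A: A resumes with the registers it had
print(cpu.regs, cpu.tlb_flushes)
```

After the round trip the CPU holds A's original register values again, and B's state is parked on B's kernel stack waiting for the next switch.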
Direct vs. indirect cost#
Direct cost is the cycles spent in the switch code itself: a few hundred to a few thousand cycles depending on whether the FPU state and address space change. Indirect cost is bigger: when B starts running, its instruction cache is cold (A’s code evicted B’s), its data cache is cold, its branch predictors are mis-trained, and its TLB entries may be gone. Measurements on x86 commonly put indirect cost at 5–50 microseconds for a CPU-bound workload, depending on cache footprint.
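One common way to get a feel for the cost is a pipe ping-pong: two threads hand a byte back and forth, so each round trip forces two blocking handoffs. The sketch below is an assumption-laden upper bound, not a benchmark: it includes pipe syscall and interpreter overhead, and on a multi-core machine the threads may sit on different cores, in which case it measures wakeup latency instead of a same-core switch. ROUNDS and echo are illustrative names.

```python
import os
import threading
import time

ROUNDS = 2_000  # assumed value; raise for steadier numbers

def echo(read_fd: int, write_fd: int, rounds: int) -> None:
    # Partner thread: block on read, then wake the main thread with a write.
    for _ in range(rounds):
        os.read(read_fd, 1)
        os.write(write_fd, b"x")

ping_r, ping_w = os.pipe()   # main -> partner
pong_r, pong_w = os.pipe()   # partner -> main

partner = threading.Thread(target=echo, args=(ping_r, pong_w, ROUNDS))
partner.start()

start = time.perf_counter()
for _ in range(ROUNDS):
    os.write(ping_w, b"x")   # wakes the partner
    os.read(pong_r, 1)       # blocks until the partner answers
elapsed = time.perf_counter() - start
partner.join()

for fd in (ping_r, ping_w, pong_r, pong_w):
    os.close(fd)

per_round_trip_us = elapsed / ROUNDS * 1e6
print(f"~{per_round_trip_us:.1f} us per round trip (two handoffs plus overhead)")
```

Pinning both threads to one core (for example with taskset on Linux) brings the number closer to an actual switch cost.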
The kernel stack handoff#
A subtle point: every thread has its own kernel stack (16 KB on current Linux x86_64; 8 KB on older kernels). The “switch” between A and B in the kernel is literally a mov of B’s kernel-stack pointer into rsp; after that instruction, returns inside the kernel are returning into B’s kernel stack frames, not A’s. The next time we return to user mode (iret or sysret), we go to B’s user code.
What survives the switch#
The PCB does, obviously. The page tables of both processes are unchanged. File descriptors, signal state, credentials — all live in the PCB and resume on the next switch back. What does not survive: the L1 / L2 caches (shared at the core, cold for the new context), the TLB (unless tagged), the branch predictor state, and any CPU-local microarchitectural caches the new process hasn’t warmed yet.
Variants#
Process switch vs. thread switch#
A thread switch within the same process skips step 4 (no cr3 reload, no TLB flush) and is correspondingly cheaper. Linux’s context_switch() path (through switch_mm) notices when the outgoing and incoming tasks share an mm_struct and skips the page-table reload; __switch_to then handles only the register and stack state.
Voluntary vs. involuntary#
A voluntary switch happens because the running process called a blocking syscall and the kernel needed to do something else. An involuntary switch happens because the scheduler decided to preempt: either the timer tick fired and the slice expired, or a higher-priority task woke up. pidstat -w and /proc/<pid>/status report the two separately (vmstat shows only the system-wide total); a process with very high involuntary switches is being preempted heavily and may benefit from priority tuning.
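A process can also read its own counters. The sketch below uses getrusage through Python's resource module (Unix only), where ru_nvcsw counts voluntary and ru_nivcsw involuntary switches. The helper name switch_counts and the sleep / busy-loop sizes are assumptions chosen to provoke each kind.

```python
import resource
import time

def switch_counts() -> tuple[int, int]:
    """Return (voluntary, involuntary) context switches for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw

vol_before, invol_before = switch_counts()

for _ in range(50):
    time.sleep(0.001)        # blocking call: expected to give up the CPU voluntarily

busy = sum(i * i for i in range(2_000_000))   # CPU-bound: only preemption interrupts it

vol_after, invol_after = switch_counts()
print("voluntary:  ", vol_after - vol_before)
print("involuntary:", invol_after - invol_before)
```

The exact numbers depend on the machine and its load; on an idle system the involuntary count may well stay at zero.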
Lazy FPU restore#
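Most processes don’t touch FPU / SSE / AVX registers very often. Under lazy restore, the kernel leaves the previous task’s FPU state in the registers on a switch and marks the FPU unavailable; the first floating-point or vector instruction the new task issues raises a #NM exception, and the handler saves the old state and loads the new one. This used to be a meaningful optimisation. Current x86 Linux kernels no longer rely on #NM: they save state eagerly with the xsave family and defer the restore until the task returns to user mode, partly because lazy switching was shown to leak register contents (the LazyFP vulnerability, CVE-2018-3665).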
Tagged TLB (PCID / ASID)#
x86 since Westmere supports PCID, a 12-bit tag on each TLB entry. When the kernel switches address spaces it changes the PCID instead of flushing the TLB; entries from both processes coexist. ARM has had ASIDs since ARMv6. This is the difference between “context switch flushes TLB” (old textbook answer) and “context switch tags TLB” (modern reality).
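A toy model makes the difference countable. This sketch assumes an unbounded TLB and two processes that alternate over the same eight pages; ToyTlb and run are hypothetical names, and real hardware has limited capacity and a limited number of tags.

```python
class ToyTlb:
    """Tiny TLB model; entries are keyed by (asid, page)."""

    def __init__(self, tagged: bool):
        self.tagged = tagged
        self.entries = set()
        self.current_asid = 0
        self.misses = 0

    def switch_to(self, asid: int) -> None:
        if not self.tagged:
            self.entries.clear()       # untagged: full flush on address-space switch
        self.current_asid = asid

    def access(self, page: int) -> None:
        key = (self.current_asid, page)
        if key not in self.entries:
            self.misses += 1           # would be a page-table walk
            self.entries.add(key)

def run(tagged: bool, rounds: int = 10, pages: int = 8) -> int:
    tlb = ToyTlb(tagged)
    for _ in range(rounds):
        for asid in (1, 2):            # two processes taking turns
            tlb.switch_to(asid)
            for page in range(pages):
                tlb.access(page)
    return tlb.misses

print("untagged misses:", run(tagged=False))   # prints: untagged misses: 160
print("tagged misses:  ", run(tagged=True))    # prints: tagged misses:   16
```

Without tags every one of the 20 switches refills all eight entries; with tags each process pays for its eight entries once.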
Cooperative context switch in user space#
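Goroutines, Node.js event-loop tasks, async / await in Python: these switch contexts entirely in user mode. No kernel trap, no register file save beyond what the language runtime needs. Such a switch costs nanoseconds instead of microseconds; the price is that a blocking syscall stalls the OS thread it runs on, and with it every task scheduled on that thread (the whole runtime for a single-threaded event loop), unless the call is wrapped in async machinery or the runtime hands the work to another thread, as Go’s does.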
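The idea can be shown with plain Python generators: each yield is a cooperative switch point, and a small loop plays the role of the scheduler. This is a minimal sketch with hypothetical names (worker, run_round_robin), not how any particular runtime is implemented.

```python
from collections import deque

def worker(name: str, steps: int, log: list):
    for i in range(steps):
        log.append(f"{name}{i}")
        yield                       # cooperative switch point: hand control back

def run_round_robin(tasks) -> None:
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)              # resume the task until its next yield
            ready.append(task)      # still runnable: back of the queue
        except StopIteration:
            pass                    # task finished, drop it

log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)   # prints: ['a0', 'b0', 'a1', 'b1', 'b2']
```

The "saved context" here is just the generator's frame, which the interpreter keeps on the heap; no kernel involvement is needed to switch between the two workers.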
Trade-offs#
The schedulers that win on real workloads (Linux’s CFS, the various MLFQ flavours, BSD’s priority queues) all try to adapt the switch rate: short slices for interactive tasks, long slices for batch ones. Knowing the trade-off makes “why does my latency-sensitive service’s tail latency spike when a make -j runs alongside it” diagnosable.
Specific tensions:
- Slice length. Linux’s sched_min_granularity_ns (0.75 ms base under CFS, scaled up with CPU count) is the floor on a slice; shorter floors mean more switches per second. Real-time scheduling classes ignore the floor and switch on every event.
- Migration vs. affinity. Moving a task between CPUs to balance load costs cache locality. The scheduler weighs migration_cost_ns against the imbalance before pulling.
- Address-space change cost. Without PCID, every process switch flushes the TLB. KPTI (the Meltdown mitigation) switches page tables on every kernel entry and exit, which on machines without PCID means a TLB flush on every syscall; this is why “Meltdown-mitigated, no PCID” boxes were measurably slower.
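The slice-length tension reduces to simple arithmetic for CPU-bound tasks that always use their full slice. The sketch below assumes a fixed 10 microsecond cost per switch (an illustrative figure) and a hypothetical helper round_robin_overhead.

```python
def round_robin_overhead(slice_ms: float, switch_cost_us: float) -> tuple[float, float]:
    """Return (switches per second, fraction of CPU lost to switching)."""
    period_us = slice_ms * 1000 + switch_cost_us   # one slice plus one switch
    switches_per_sec = 1_000_000 / period_us
    lost_fraction = switch_cost_us / period_us
    return switches_per_sec, lost_fraction

for slice_ms in (0.1, 1.0, 10.0):
    rate, lost = round_robin_overhead(slice_ms, switch_cost_us=10.0)
    print(f"slice {slice_ms:>4} ms: {rate:8.0f} switches/s, {lost:.2%} lost")
```

Under these assumptions, shrinking the slice tenfold costs roughly ten times the overhead, which is the price paid for the better responsiveness of short slices.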
Common pitfalls#
- Treating context-switch count as a throughput metric. It’s an overhead metric. Going from 10 K switches/s to 50 K switches/s on the same workload usually means contention or oversubscription, not “the system is doing more work.”
- Spinning in user space to avoid a switch. Sometimes correct (a short adaptive spin in pthread_mutex), often wrong (busy-looping on a flag wastes the cycles you were trying to save). Spin only when you expect the wait to be shorter than a switch round-trip (a few microseconds).
- Assuming a switch flushes the TLB. With PCID / ASID on modern CPUs, it doesn’t. Models that assume it does overestimate switch cost and underestimate cache pollution.
- Forgetting interrupt handlers are switches too. A high-interrupt-rate workload (busy NIC, busy SSD) is doing a context switch (partial, into the IRQ handler) tens of thousands of times per second per CPU. NAPI polling, interrupt coalescing, and IRQ affinity all exist to manage this.
- Conflating thread switch and process switch costs. A pthread switch on the same core is 1–2 microseconds including cache effects; a cross-core thread migration plus cache miss is 10–50 microseconds; a process switch with TLB flush can be 100+ microseconds on a hot workload. They are not the same number.
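The spinning pitfall above suggests a spin-then-block pattern: try briefly without sleeping, then fall back to a blocking acquire. The sketch below shows only the control flow; acquire_spin_then_block and the 5 microsecond budget are assumptions, and in CPython the interpreter lock makes real spinning unattractive, so this is not a performance recipe.

```python
import threading
import time

def acquire_spin_then_block(lock: threading.Lock, spin_budget_s: float = 5e-6) -> bool:
    """Return True if the lock was taken while spinning, False if we had to block."""
    deadline = time.perf_counter() + spin_budget_s
    while True:
        if lock.acquire(blocking=False):
            return True              # got it without sleeping
        if time.perf_counter() >= deadline:
            break                    # budget spent: stop burning cycles
    lock.acquire()                   # long wait: let the kernel switch us out
    return False

lock = threading.Lock()
spun = acquire_spin_then_block(lock)   # uncontended, so the first try succeeds
lock.release()
print("acquired while spinning:", spun)
```

The budget should stay below the cost of a switch round-trip; beyond that, blocking is the cheaper choice.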
Why don't modern CPUs have a hardware 'task gate' instruction like the 386 did?
Intel’s 386 had a hardware task-switch mechanism (TSS, task gates) that saved and restored everything in one instruction. It turned out to be slower than the software sequence Linux and others adopted, because the hardware sequence saved registers nobody cared about and didn’t compose with the lazy-FPU optimisation; 64-bit long mode dropped hardware task switching altogether. Modern OSes use the TSS mainly to tell the CPU which kernel stack to load when an interrupt or exception arrives from user mode; the actual switch is plain mov instructions in software.
Related building blocks#
- The Process Abstraction — the PCB whose state is being saved and restored.
- CPU Scheduling — FIFO, SJF, STCF, RR — the policies that decide which switch to make.
- Limited Direct Execution — the timer interrupt that triggers preemptive switches.
- Multi-Level Feedback Queue (MLFQ) — how switch frequency shifts by priority level.
- User Mode vs Kernel Mode — the boundary every switch crosses twice.