User Mode vs Kernel Mode
Hardware privilege levels, system calls, trap-and-emulate, and the cost of crossing the boundary.
Summary
The CPU has two (or more) hardware-enforced privilege levels: kernel mode can execute every instruction and touch every byte of memory; user mode is restricted to a subset of instructions and only the memory the kernel has mapped into the current address space. On x86 these are rings 0 and 3; on ARM they are EL1 and EL0. The distinction is enforced by the CPU itself, not by software — a user-mode program that tries cli or writes to a control register faults instantly.
A system call is the controlled way for user code to ask the kernel to do something it cannot do itself — read a file, allocate memory, open a socket. It is a trap: a synchronous, intentional jump into the kernel through a fixed entry point, which switches the CPU into kernel mode, saves user state, dispatches to a handler, and on return restores user state and drops back to user mode.
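As a minimal sketch of what "asking the kernel" looks like from user code, the snippet below pushes a few bytes through a pipe. It assumes a Unix-like system; the message is illustrative. Each `os.*` call here is a thin wrapper that ends in one trap into the kernel and one return.

```python
import os

# pipe(): the kernel allocates a buffer and hands back two file descriptors.
read_fd, write_fd = os.pipe()

# Each call below is one user -> kernel -> user round trip.
sent = os.write(write_fd, b"hello from user mode")   # write() syscall
data = os.read(read_fd, 100)                         # read() syscall, up to 100 bytes

os.close(read_fd)                                    # close() syscall
os.close(write_fd)

print(sent, data)
```

User code never touches the pipe buffer directly; it only names a descriptor and lets the kernel do the copy.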
Why it matters
Without two privilege levels, isolation is impossible. Any process could write the page tables, mask interrupts, halt the CPU, or read another process’s secrets. The whole edifice of multi-user systems, browsers running untrusted JavaScript, databases protecting their data files, and clouds isolating tenants rests on this one hardware feature.
The cost matters too. Every syscall, every page fault, every interrupt crosses the boundary, saves a chunk of state, and runs kernel code with caches and branch predictors disturbed. A modern syscall on x86 costs on the order of one to a few hundred cycles on the fast path; with Meltdown / Spectre mitigations it can exceed a thousand. That cost is why high-performance designs (io_uring, vDSO, kernel-bypass networking) work hard to avoid crossings rather than make them cheaper.
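One simple way to avoid crossings is to batch work into fewer calls. The sketch below is not io_uring (Python's standard library has no binding for it); it swaps in the much older `writev` call to illustrate the same idea on a Unix-like system. The helper names and the sample chunks are hypothetical.

```python
import os

chunks = [b"alpha\n", b"beta\n", b"gamma\n"]

def write_one_by_one(fd, parts):
    """One write() syscall per chunk: len(parts) boundary crossings."""
    for p in parts:
        os.write(fd, p)
    return len(parts)

def write_batched(fd, parts):
    """One writev() syscall for all chunks: a single boundary crossing."""
    os.writev(fd, parts)
    return 1

def run(writer):
    """Send the chunks through a pipe and return (crossings, bytes received)."""
    r, w = os.pipe()
    crossings = writer(w, chunks)
    os.close(w)
    data = os.read(r, 4096)
    os.close(r)
    return crossings, data

print(run(write_one_by_one))
print(run(write_batched))
```

Both variants deliver the same bytes; the batched one pays the fixed crossing cost once instead of once per chunk.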
How it works
What hardware enforces
On x86 the current privilege level (CPL) is held in the low two bits of the CS segment selector; other architectures keep the equivalent in a processor-state field (PSTATE.EL on AArch64). Privileged instructions (hlt, cli, lgdt, writes to cr3, port I/O) check CPL and fault with a general-protection exception if executed in user mode. The MMU consults page-table permission bits that include a user/supervisor flag; a user-mode access to a supervisor-only page raises a page fault. The hardware also enforces that CPL changes only through specific instructions and only through entry points the kernel set up.
Entering the kernel
Three things cause a transition into kernel mode:
- System call. User code executes syscall (on x86_64) or svc (on ARM). The CPU jumps to the address the kernel previously installed (MSR_LSTAR on x86) and runs the kernel’s syscall entry on the kernel stack; on x86_64 the syscall instruction does not switch stacks itself, so the entry code does that first. The user passes a syscall number and arguments in registers.
- Exception. A page fault, divide-by-zero, illegal instruction, or general-protection fault traps synchronously to a handler the kernel registered in the IDT (x86) or vector table (ARM).
- Interrupt. An asynchronous hardware signal from a device, timer, or another CPU. The handler runs in kernel mode regardless of what user code was doing.
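The "syscall number and arguments in registers" convention can be seen by bypassing the usual wrappers and invoking a call by number. This is a sketch for 64-bit Linux only: it uses libc's generic `syscall()` function through `ctypes`, and the `SYS_GETPID` table and `raw_getpid` helper are illustrative names. The numbers 39 (x86_64) and 172 (aarch64) are the Linux ABI values for getpid; on any other platform the function returns None.

```python
import ctypes
import os
import platform
import sys

# getpid's syscall number differs per architecture (Linux ABI values).
SYS_GETPID = {"x86_64": 39, "aarch64": 172}

def raw_getpid():
    """Invoke getpid by number via libc's syscall(); None if unsupported here."""
    nr = SYS_GETPID.get(platform.machine())
    if not sys.platform.startswith("linux") or nr is None or sys.maxsize <= 2**32:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc places the number and arguments in the registers the kernel expects,
    # then executes the trap instruction (syscall or svc).
    return libc.syscall(ctypes.c_long(nr))

print(raw_getpid(), os.getpid())
```

Where it is supported, the raw call and `os.getpid()` should agree, since both end at the same kernel handler.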
Inside the kernel
The handler validates arguments — copy_from_user on Linux is the canonical “untrusted-pointer-becomes-trusted-kernel-data” routine and has been the source of many CVEs. It then runs the requested work, possibly blocking on I/O (which yields to other processes), and writes a return value. Before returning, the kernel restores any saved user state, optionally re-checks pending signals, and executes sysret / eret to drop back to user mode at the instruction following the syscall.
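Argument validation is observable from user space: handing the kernel a pointer that is not mapped yields an error code rather than a crash. The sketch below assumes a Unix-like system with a libc reachable through `ctypes.CDLL(None)`; the helper name is hypothetical. A pipe is used as the target because its write path actually copies the user buffer.

```python
import ctypes
import errno
import os

libc = ctypes.CDLL(None, use_errno=True)
libc.write.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t

def write_from_bad_pointer():
    """Ask the kernel to write 16 bytes from an unmapped user address."""
    r, w = os.pipe()
    try:
        ctypes.set_errno(0)
        # Address 1 is not mapped in a normal process, so the kernel's
        # copy from user memory fails and the syscall reports an error.
        ret = libc.write(w, ctypes.c_void_p(1), 16)
        return ret, ctypes.get_errno()
    finally:
        os.close(r)
        os.close(w)

ret, err = write_from_bad_pointer()
print(ret, errno.errorcode.get(err))
```

The call fails with EFAULT while the process keeps running: the kernel treated the pointer as untrusted input and refused it.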
What “crossing the boundary” actually costs
On a modern x86 box, a no-op syscall round trip is roughly:
- 100–200 cycles for syscall + register save + entry dispatch + sysret.
- An additional ~1000 cycles when KPTI (Meltdown mitigation) flips page tables on entry and exit.
- TLB and branch-predictor pollution from kernel code, which slows the next few thousand user instructions.
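To put those cycle counts in wall-clock terms, here is a small conversion. The cycle figures come from the list above; the 3 GHz clock is an assumption, so substitute your own core frequency.

```python
CLOCK_HZ = 3.0e9            # assumed core frequency (3 GHz)

def cycles_to_ns(cycles, hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / hz * 1e9

FAST_PATH = (100, 200)      # cycles, from the text
KPTI_EXTRA = 1000           # cycles, from the text

fast_lo_ns = cycles_to_ns(FAST_PATH[0])
fast_hi_ns = cycles_to_ns(FAST_PATH[1])
kpti_hi_ns = cycles_to_ns(FAST_PATH[1] + KPTI_EXTRA)

# Upper bound on no-op syscalls per second on one core with KPTI enabled.
max_calls_per_sec = 1e9 / kpti_hi_ns

print(round(fast_lo_ns, 1), round(fast_hi_ns, 1), round(kpti_hi_ns, 1))
print(round(max_calls_per_sec))
```

Under these assumptions the fast path is a few tens of nanoseconds and the KPTI case is a few hundred, before counting the cache and predictor pollution that the third bullet describes.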
Most real syscalls do meaningful work and dwarf this; trivial calls like gettimeofday were so common that Linux moved them into the vDSO — a kernel-provided shared library mapped read-only into every user address space, so the call stays in user mode.
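A rough way to see the difference is to time a call that is usually served from the vDSO against one that always traps. This sketch assumes Linux and CPython, where `time.time()` is typically implemented with `clock_gettime` and `os.getppid()` is always a real syscall. Interpreter overhead dominates both numbers, and sandboxes or unusual clock sources can force the "vDSO" call into the kernel as well, so treat the output as indicative only.

```python
import os
import time

def ns_per_call(fn, n=200_000):
    """Average wall-clock nanoseconds per call of fn over n iterations."""
    start = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - start) / n

vdso_ns = ns_per_call(time.time)     # usually stays in user mode via the vDSO
trap_ns = ns_per_call(os.getppid)    # crosses into the kernel on every call

print(f"time.time():  {vdso_ns:.0f} ns/call")
print(f"os.getppid(): {trap_ns:.0f} ns/call")
```

For a cleaner measurement the same comparison would be written in C and pinned to one core; the Python version is only meant to show the shape of the experiment.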
Variants and trade-offs
Crossings also differ in how heavyweight they are:
- Syscall via instruction (syscall on x86_64): the fast path on modern hardware; uses dedicated MSRs to find the kernel entry.
- Syscall via interrupt (int 0x80 on 32-bit x86): older, slower; still works for legacy binaries.
- Page fault: same hardware mechanism but unintentional from the user’s perspective. Cost is similar to a syscall plus the page-handling work.
- vDSO and ASLR-friendly fast paths: the kernel maps small bits of trusted code into user space so frequent calls stay in user mode entirely.
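The page-fault variant can be observed without any explicit syscall in the hot loop. The sketch below, assuming a Unix-like system, maps anonymous memory and reads the process's minor-fault counter before and after touching it. The size is arbitrary, and the exact counts depend on kernel settings such as transparent huge pages, so no specific numbers are claimed.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
N_PAGES = 2048

def minor_faults():
    """Minor page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

# Anonymous mapping: address space is reserved, physical pages are not yet.
buf = mmap.mmap(-1, N_PAGES * PAGE)

before = minor_faults()
for i in range(N_PAGES):
    buf[i * PAGE] = 1        # first touch: the store traps into the kernel
mid = minor_faults()
for i in range(N_PAGES):
    buf[i * PAGE] = 2        # pages are now mapped: no trap expected
after = minor_faults()
buf.close()

first_touch = mid - before
second_touch = after - mid
print(first_touch, second_touch)
```

On a typical Linux system the first pass takes many faults and the second takes almost none, even though the user code is identical in both loops.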
Why does Linux still have so many syscalls when io_uring exists?
Because compatibility is the most important kernel feature. Every program that ever ran on Linux uses read, write, open — removing them is impossible. io_uring is an addition for workloads where the syscall cost matters (databases, high-throughput servers), not a replacement. The lesson is general: kernels accrete interfaces and very rarely retire them, which is why the syscall table grows monotonically.
A subtle trade-off: interrupts vs. polling. Interrupts let the CPU do useful work while waiting for a device, but each one costs a mode switch and cache disturbance. At very high I/O rates (10 Gb/s networking, NVMe SSDs) the per-interrupt overhead alone can consume a core; that is why DPDK and SPDK exist: they poll the device from user mode and skip the boundary crossings entirely.
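A back-of-the-envelope calculation shows why per-packet interrupts stop scaling. The 10 Gb/s link rate is from the text; the 3 GHz clock is an assumption, and the 20 bytes of per-frame wire overhead are the standard Ethernet preamble, start delimiter, and inter-frame gap.

```python
LINK_BPS = 10e9        # 10 Gb/s link, from the text
CLOCK_HZ = 3.0e9       # assumed 3 GHz core
WIRE_OVERHEAD = 20     # bytes per frame: preamble + SFD (8) + inter-frame gap (12)

def packets_per_second(frame_bytes):
    """Maximum frames per second at line rate for a given frame size."""
    return LINK_BPS / ((frame_bytes + WIRE_OVERHEAD) * 8)

def cycle_budget(frame_bytes):
    """CPU cycles available per frame if one core handles the whole link."""
    return CLOCK_HZ / packets_per_second(frame_bytes)

for size in (64, 1518):
    print(size, round(packets_per_second(size)), round(cycle_budget(size)))
```

With minimum-size frames a single core has only about 200 cycles per packet under these assumptions, which is less than one boundary crossing with KPTI enabled. Polling in batches amortizes that fixed cost over many packets.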
When this is asked in interviews
Constantly, in some form. The basic question — “what’s the difference between user mode and kernel mode” — is a screen filter; the bar is to mention hardware enforcement, privileged instructions, the MMU’s role, and the syscall mechanism as the controlled crossing. Stopping at “kernel mode can do more” is a junior answer.
The follow-ups go deep:
- “Walk through what happens on read(fd, buf, 100).” Tests whether you can name the trap, the argument copy, the file-system dispatch, the block-layer queue, the interrupt return, the wakeup, and the return path. Mid-level.
- “What does Meltdown attack and why was KPTI needed?” Tests whether you understand that the boundary is hardware-enforced for some things and was speculatively bypassable for memory reads. Senior.
- “How does vDSO work and which calls live there?” Tests whether you know the actual fast-path optimisations. Senior.
- “Could you build a kernel without rings?” Tests whether you can reason about software-only isolation (SFI, language-based isolation, eBPF verification). Staff and above.
A second context is OS-bypass and high-performance systems work — the bar is then to talk about specific costs in nanoseconds and to have opinions about io_uring, DPDK, and user-space drivers.
Related concepts