Limited Direct Execution — Operating Systems

Summary

Limited Direct Execution (LDE) is the foundational technique that makes a modern OS both fast and safe. Direct execution means the CPU runs user code natively, at hardware speed — no interpreter, no per-instruction check. Limited means the OS retains the ability to regain control when it needs to: when the user makes a system call, when the user does something illegal, or when a periodic timer fires. The whole design hinges on the hardware mechanisms that let the OS surrender the CPU to user code without surrendering its ability to take it back.

LDE answers two questions every kernel must answer: how do we let untrusted code run without paying interpreter overhead, and how do we make sure that code can never starve other processes or escape its sandbox? The answer is the trio of traps for syscalls and exceptions, the privilege boundary, and the timer interrupt for preemption.

Why it matters

If the OS interpreted every user instruction, performance would collapse: purely interpretive execution (as in early Java VMs before JIT compilation, or full software emulators) commonly ran roughly 10–100x slower than native code. So the design starts from “let the user run on bare metal” and then layers on the minimum machinery to keep that safe.

The machinery is non-obvious until you see it. A user program that wants to read a file cannot just call into the kernel — the kernel has to advertise where to enter and the hardware has to enforce that only those entry points work. A user program that goes into an infinite loop cannot lock up the machine — the timer interrupt fires periodically and gives the scheduler a chance to run. A user program that touches an unmapped page cannot crash the kernel — the page fault handler decides what to do (page in, kill the process, deliver a signal).

How it works

Direct execution: letting user code run

The kernel sets up the process: allocates a PCB, builds an address space, loads the binary, prepares a user stack, sets the program counter to the entry point. It then executes a special instruction (iret on x86, eret on ARM) that simultaneously switches to user mode and jumps to user code. From that moment the CPU is running user instructions natively at hardware speed.
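The setup steps above can be mirrored in a small simulation. This is only a sketch: the `PCB` fields, `create_process`, `return_to_user`, and the default stack address are hypothetical names and values chosen for illustration, not any real kernel's layout.

```python
from dataclasses import dataclass, field

USER_MODE, KERNEL_MODE = "user", "kernel"


@dataclass
class PCB:
    """Hypothetical process control block: the state the kernel tracks."""
    pid: int
    pc: int                      # program counter: next instruction to run
    sp: int                      # user stack pointer
    mode: str = KERNEL_MODE      # setup happens while still in kernel mode
    address_space: dict = field(default_factory=dict)  # region name -> contents


def create_process(pid, binary, entry_point, stack_top=0x7FFF_F000):
    """Allocate a PCB, build an address space, load the binary, prepare a stack."""
    pcb = PCB(pid=pid, pc=entry_point, sp=stack_top)
    pcb.address_space["text"] = bytes(binary)      # load the binary image
    pcb.address_space["stack"] = bytearray(4096)   # one page of user stack
    return pcb


def return_to_user(pcb):
    """Stand-in for iret/eret: drop privilege and jump in a single step."""
    pcb.mode = USER_MODE
    return pcb.pc   # address where the CPU starts fetching user instructions
```

The point of `return_to_user` is that the mode change and the jump happen together; if they were two separate steps, there would be a window in which either kernel code runs unprivileged or user code runs privileged.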

Limit one: the privilege boundary

The CPU is in user mode (ring 3 on x86). Privileged instructions fault. Memory not mapped into the user portion of the address space is inaccessible. If user code wants to do something useful it has to ask: it executes syscall, which jumps through a kernel-installed entry point and switches back to kernel mode. The kernel validates arguments, does the work, and returns. Crucially, the entry points are fixed at boot — user code cannot jump to arbitrary kernel addresses and pretend it asked nicely.
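A toy model makes the "fixed entry points" idea concrete. In the sketch below, the syscall numbers, the `proc` dictionary, and the handler names are all assumptions made up for the example (they are not Linux's numbering); the only thing user code can supply is a number and arguments, never a kernel address.

```python
import errno


def sys_getpid(proc):
    return proc["pid"]


def sys_write(proc, fd, data):
    # Validate arguments before doing any work: user input is untrusted.
    if fd not in proc["open_fds"]:
        return -errno.EBADF
    if not isinstance(data, bytes):
        return -errno.EFAULT
    proc["output"].append(data)
    return len(data)


# Built once, "at boot". User code never sees or modifies this table.
SYSCALL_TABLE = {0: sys_getpid, 1: sys_write}


def trap(proc, number, *args):
    """Model of the syscall instruction: enter the kernel via the table only."""
    handler = SYSCALL_TABLE.get(number)
    if handler is None:
        return -errno.ENOSYS          # unknown number: refuse, do not jump
    proc["mode"] = "kernel"
    try:
        return handler(proc, *args)
    finally:
        proc["mode"] = "user"         # always drop privilege on the way out
```

For example, `trap(proc, 99)` returns `-errno.ENOSYS` rather than running anything, which is the software analogue of the hardware refusing to enter the kernel anywhere except an advertised entry point.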

Limit two: exceptions

When user code does something illegal — divides by zero, dereferences a bad pointer, executes an undefined instruction — the CPU raises a synchronous exception. The hardware looks up the handler in the IDT (x86) or vector table (ARM) and dispatches into the kernel. Most exceptions in modern OSes are page faults, and the handler decides: was the page swapped out (page it in), is it copy-on-write (allocate a copy), or is the access truly invalid (deliver SIGSEGV or kill the process).
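The three-way decision the page fault handler makes can be written down as a small function. This is a simplified sketch: the page-table layout (a dict from virtual page number to `present`, `writable`, and `cow` flags) and the `Action` names are hypothetical, and a real handler consults far more state.

```python
from enum import Enum


class Action(Enum):
    PAGE_IN = "read the page back from swap, then retry the instruction"
    COPY_PAGE = "allocate a private copy, then retry the instruction"
    SIGSEGV = "deliver SIGSEGV (or kill the process)"
    RETRY = "nothing is wrong any more (spurious fault), just retry"


def handle_page_fault(page_table, vpn, is_write):
    """Decide what to do about a fault on virtual page number `vpn`."""
    entry = page_table.get(vpn)
    if entry is None:
        return Action.SIGSEGV            # no mapping at all: truly invalid
    if not entry["present"]:
        return Action.PAGE_IN            # mapped, but currently swapped out
    if is_write and not entry["writable"]:
        if entry["cow"]:
            return Action.COPY_PAGE      # first write to a shared COW page
        return Action.SIGSEGV            # write to a genuinely read-only page
    return Action.RETRY
```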

Limit three: the timer interrupt

This is the one that makes preemptive multitasking possible. The kernel programs a hardware timer (the local APIC on x86, the generic timer on ARM) to fire at a fixed interval — historically every 10 ms (HZ=100), now often 1 ms (HZ=1000) or tickless on idle CPUs. Each tick is an asynchronous interrupt: the CPU jumps into the kernel regardless of what user code was doing, the timer handler runs, the scheduler decides whether to keep the current process or switch, and on return the kernel either resumes the same process or context-switches to another.
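The same idea can be tried from user space, with a POSIX interval timer standing in for the hardware timer and a Python signal handler standing in for the kernel's tick handler. This is an analogy rather than how a kernel does it; `run_with_time_slice`, `Preempted`, and `spin_forever` are names invented for the sketch, and it assumes a Unix system and the main thread.

```python
import signal


class Preempted(Exception):
    """Raised by the timer handler to pull control out of the running task."""


def _on_tick(signum, frame):
    raise Preempted


def run_with_time_slice(task, slice_seconds=0.05):
    """Run task() but take control back after slice_seconds of wall time.

    Returns True if the task finished on its own, False if it was preempted.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_tick)
    signal.setitimer(signal.ITIMER_REAL, slice_seconds)   # arm a one-shot timer
    try:
        task()
        return True
    except Preempted:
        return False
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)           # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)        # restore old handler


def spin_forever():
    """A task that never yields and never makes a blocking call."""
    while True:
        pass
```

Calling `run_with_time_slice(spin_forever, 0.02)` returns `False` after about 20 ms even though the loop itself never gives up control, which is the property the timer interrupt buys the kernel. The sketch ignores the small race where the timer fires just as the task finishes.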

The control-flow diagram

boot ─────► kernel sets up first process
              │
              ▼
            iret ─────► USER MODE ─────► user code runs at full speed
                          │ │ │
                          │ │ └─ timer interrupt ─► kernel scheduler runs ─► back to some user process
                          │ └─── exception (page fault, divide-by-zero) ─► handler runs ─► resume or kill
                          └───── syscall ─► kernel handler runs ─► sysret back to user code

Each branch leaving USER MODE is a mode switch into the kernel; each path back uses the hardware’s “return from interrupt / syscall” instruction (iret or sysret on x86) to drop back into user mode safely.

Variants and trade-offs

Cooperative scheduling (no timer): Windows 3.x, classic Mac OS through Mac OS 9, and many green-thread and coroutine runtimes. Tasks run until they voluntarily yield (typically by making a call that blocks). Lower overhead, since there are no timer interrupts to handle. Fatal weakness: a single bug or infinite loop hangs the whole system.

Preemptive scheduling (timer-driven): every modern general-purpose OS, and time-sharing systems as far back as early UNIX. The timer interrupt guarantees the kernel runs periodically, so the system is robust against runaway processes. The price is an interrupt every tick (cheap on modern hardware) plus the scheduler’s bookkeeping. The tickless-idle work on Linux is about removing this cost on idle CPUs.
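The difference between the two policies shows up clearly in a deterministic toy scheduler. In this sketch, tasks are Python generators that emit `"work"` for one unit of computation and `"yield"` when they volunteer to give up the CPU; `simulate`, `polite`, `hog`, and the `max_steps` cap (needed so the cooperative run terminates at all) are assumptions made for the example.

```python
from collections import deque
from itertools import repeat


def polite(n):
    """Does n units of work, offering to yield the CPU after each one."""
    for _ in range(n):
        yield "work"
        yield "yield"


def hog():
    """Never yields voluntarily: the simulated infinite loop."""
    yield from repeat("work")


def simulate(tasks, quantum=None, max_steps=100):
    """Round-robin over generator tasks; return work units done per task name.

    quantum=None models cooperative scheduling. An integer quantum models a
    timer that forces a switch after that many consecutive work units.
    """
    done = {name: 0 for name in tasks}
    ready = deque(tasks.items())
    steps = 0
    while ready and steps < max_steps:
        name, gen = ready.popleft()
        ran, finished = 0, False
        while steps < max_steps:
            try:
                event = next(gen)
            except StopIteration:
                finished = True
                break
            steps += 1
            if event == "yield":
                break                      # voluntary switch
            done[name] += 1
            ran += 1
            if quantum is not None and ran >= quantum:
                break                      # "timer fired": forced switch
        if not finished:
            ready.append((name, gen))
    return done
```

With `simulate({"hog": hog(), "polite": polite(3)})` the polite task completes zero units of work because the hog never lets go; passing `quantum=5` lets it finish all three.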

Specific trade-offs and refinements:

Tick rate. Higher HZ means snappier response (shorter scheduling latency) but more interrupt overhead and worse power efficiency. Linux fixes HZ at build time (100, 250, 300, or 1000): desktop-oriented configurations tend toward 1000, server configurations toward 100 or 250. Tickless modes drop the periodic tick where it is unnecessary: CONFIG_NO_HZ_IDLE on idle CPUs and CONFIG_NO_HZ_FULL on CPUs running a single CPU-bound task.
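Syscall fast paths. Each crossing costs cycles. The vDSO removes that cost for a few read-only operations (gettimeofday, clock_gettime) by running them entirely in user space. io_uring amortizes it for high-throughput I/O: user code queues requests in ring buffers shared with the kernel and submits them in batches, and in its optional submission-polling mode a kernel thread polls the ring so no syscall is needed per request.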
Interrupt coalescing. At very high event rates (10 Gb NICs, NVMe), per-event interrupts saturate the CPU. The hardware coalesces them and the kernel processes batches in softirq / NAPI polling loops.
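The tick-rate trade-off is easy to put numbers on. The arithmetic below is a back-of-the-envelope sketch: the 2 µs per-tick handler cost used in the usage note is an assumed figure for illustration, not a measurement, and the function names are hypothetical.

```python
def tick_overhead(hz, cost_per_tick_us):
    """Fraction of one CPU spent servicing the periodic tick."""
    return hz * cost_per_tick_us / 1_000_000


def tick_period_ms(hz):
    """Time between ticks, i.e. the longest wait for the next scheduling point."""
    return 1000 / hz


def wakeups_per_hour(hz):
    """How often an otherwise idle CPU is woken if the tick is never stopped."""
    return hz * 3600
```

Under that assumed cost, `tick_overhead(1000, 2)` is 0.2% of a CPU against 0.02% at HZ=100, while `tick_period_ms` drops from 10 ms to 1 ms. The CPU-time cost is small either way; the `wakeups_per_hour` figure (3.6 million at HZ=1000) is what hurts power, because each wakeup pulls an idle CPU out of a low-power state.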

Why isn't preemption purely a kernel decision?

Because the kernel itself runs on the same CPU. To preempt a running process the kernel has to be running, which requires the hardware to have already taken the CPU away from the process. There is no way for the kernel to “decide to preempt” unless a trap, exception, or interrupt has handed it control first, and against a process that never traps, the timer is the only one guaranteed to arrive. LDE is therefore not a software algorithm; it is the cooperative dance between the OS and the CPU’s interrupt machinery. If the timer hardware breaks, the OS becomes cooperative whether it wanted to or not.

A subtle question is what to do during long kernel work. If a syscall takes 50 ms, is the kernel itself preemptible? That depends on the build: many Linux distribution kernels use voluntary preemption (CONFIG_PREEMPT_VOLUNTARY), where long kernel paths sprinkle explicit cond_resched() calls. Real-time configurations (Linux with CONFIG_PREEMPT_RT, RTOSes like VxWorks) make almost all kernel code preemptible at the cost of careful locking discipline.
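Voluntary preemption can be modeled as a loop that only looks at a "reschedule requested" flag at chosen points. The sketch below is a simulation, not kernel code: `long_kernel_loop`, `tick_at`, and `check_every` are hypothetical names, and the flag stands in for what a tick handler would set.

```python
def long_kernel_loop(n_items, tick_at, check_every):
    """Process n_items; timer ticks arrive at the iterations listed in tick_at.

    Returns the iterations at which the loop actually gave up the CPU.
    """
    need_resched = False
    yielded_at = []
    for i in range(n_items):
        if i in tick_at:
            need_resched = True      # the tick handler only sets a flag
        # ... one unit of kernel work on item i would go here ...
        at_check_point = i % check_every == check_every - 1
        if at_check_point and need_resched:
            need_resched = False     # the cond_resched() moment: run scheduler
            yielded_at.append(i)
    return yielded_at
```

With ticks at iterations 10 and 50 and a check every 16 iterations, the loop yields at 15 and 63, so the gap between "tick arrived" and "CPU released" is bounded by the spacing of the check points. Setting `check_every=1` approximates a fully preemptible kernel, where the yield happens at the tick itself.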

When this is asked in interviews

Often as the second question, after “user mode vs kernel mode.” The strong answer threads the three mechanisms — traps for syscalls, exceptions for faults, timer interrupts for preemption — and names the hardware they depend on. Bonus points for connecting it back to OS design goals: direct execution serves the “low overhead” goal, the limits serve the “protection / isolation” goal.

Follow-ups:

“What happens if you remove the timer interrupt?” Tests whether you understand preemption is hardware-driven. Foundational.
“How does the kernel know which page to swap in on a page fault?” Tests fault handler depth. Mid-level.
“Why does HZ=1000 cost more power than HZ=100?” Tests whether you’ve thought about idle and tickless. Mid to senior.
“Walk me through what a JIT VM (V8) and a kernel have in common.” Tests whether you see LDE as a general pattern. Senior.

A second context: hypervisor design. A hypervisor is “LDE for guest kernels” — same trio of mechanisms applied one level higher. Knowing that connection is a senior signal.