Process API — fork, exec, wait — Operating Systems

What it is

UNIX exposes process creation as three calls instead of one. fork() duplicates the calling process — the child gets a copy of the parent’s address space, file descriptors, and registers, and starts executing right after the fork call returns. exec() (the family: execv, execve, execvp, …) replaces the calling process’s address space with a fresh program loaded from disk while keeping the PID and most file descriptors. wait() (and waitpid, waitid) lets a parent block until a child terminates and reap its exit status.

The strange thing about fork is the return value: it returns twice. In the parent it returns the child’s PID; in the child it returns 0. That single fact is how the same code can take different branches in parent and child.
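
As a minimal sketch of that branching, the function below forks once and lets the parent report what the child did. The name fork_returns_twice and the exit code 7 are arbitrary choices for illustration; the sketch also handles the third possible return value, a negative number when fork fails and no child is created.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: forks once and returns, in the parent, the exit
// code chosen by the child. Returns -1 if fork or waitpid fails.
int fork_returns_twice(void) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;              // fork failed: no child was created
    }
    if (pid == 0) {
        // Child branch: fork returned 0 here.
        _exit(7);               // arbitrary code so the parent can observe it
    }
    // Parent branch: fork returned the child's PID (always > 0).
    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called from a program with no other children, the function should return 7: the child took the pid == 0 branch, and the parent took the other one and collected the result.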

When to use it

Any time a process needs to start another program, the canonical pattern is fork then exec in the child. Shells use it for every command (ls, cat, make); build systems use it to spawn compilers; web servers use it to spawn worker processes; sshd uses it on every incoming connection. The pattern is so ubiquitous that POSIX added posix_spawn as a single-call shortcut for embedded systems where fork’s memory cost is prohibitive — but on a real machine with copy-on-write, fork+exec is still the default.

Use wait whenever you fork: a child that has terminated but has not been reaped remains a zombie, holding a PID and a process-table entry until the parent waits for it or exits (at which point init adopts and reaps it). Long-running daemons that fork workers but never wait can eventually exhaust the system’s PID table.

How it works

fork

The kernel allocates a new PCB (process control block) for the child, gives it a fresh PID, and clones the parent’s state. The address space is the expensive part — on a modern system this is done with copy-on-write: both processes start sharing the parent’s physical pages, marked read-only, and the kernel copies a page only when one side actually writes to it. File descriptors are duplicated so the child inherits all open files and pipes, with the refcount bumped on each underlying file object.

The kernel then schedules both processes. Which one runs first is up to the scheduler — code that assumes parent-first or child-first is wrong.
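
The observable effect of the cloned address space can be sketched as follows: the child writes to a variable, and the parent’s copy is unaffected. The variable counter and the helper name are hypothetical. The parent waits for the child before reading, so the sketch does not depend on which process runs first.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;         // present in both address spaces after fork

// Hypothetical helper: returns the parent's view of `counter` after the
// child has modified its own copy, or -1 on error.
int parent_value_after_child_write(void) {
    counter = 1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        counter = 100;          // under copy-on-write, this write gets a private page
        _exit(counter == 100 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return counter;             // the child's write is not visible here
}
```

The function should return 1 in the parent even though the child saw 100, which is the isolation that copy-on-write preserves while deferring the actual copying.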

exec

exec is destructive: the calling process’s text, data, heap, and stack are discarded and replaced with the contents of the named binary. The kernel parses the ELF (or Mach-O on macOS) header, sets up the new address space, loads the program segments, maps the dynamic linker if the binary requests one, and jumps to the entry point (the dynamic linker’s, when there is one). What survives the exec: the PID, the parent, the working directory, the file descriptors (unless marked FD_CLOEXEC), and the environment (unless you passed a new one with execve or execle).

If exec succeeds it never returns, because the calling code is gone. If it fails (binary not found, permission denied, ELF corrupt), it returns -1 with errno set to the cause, and the original program continues.
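
A minimal sketch of handling both outcomes, assuming a hypothetical helper named run_and_wait: any code after execvp runs only when exec failed, so the child exits with 127, the same convention the shell example in this article uses.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs argv[0] (searched on PATH) and returns its exit
// code, 127 if exec failed in the child, or -1 on fork/wait errors.
int run_and_wait(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        // Reached only if execvp failed; errno says why (e.g. ENOENT).
        _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing an argv whose first element names a program that does not exist should yield 127, while something like { "sh", "-c", "exit 3", NULL } should yield 3, assuming sh is on PATH.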

wait

wait(&status) blocks until any child terminates, returning the child’s PID and writing its exit status into the integer pointed to by status. waitpid(pid, &status, options) lets you wait for a specific child, or poll with WNOHANG. The status word packs the exit code, the signal that killed the process (if any), and a “did it core-dump” bit — you decode it with the WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG macros.
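
The decoding macros can be sketched like this. The return encoding (exit code N as N, death by signal S as -S, anything else as -1000) is invented purely for this example, as are the names decode_status and child_outcome.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical encoding for this sketch: exit code N is returned as N,
// death by signal S as -S, anything else as -1000.
int decode_status(int status) {
    if (WIFEXITED(status))   return WEXITSTATUS(status);
    if (WIFSIGNALED(status)) return -WTERMSIG(status);
    return -1000;
}

// Forks a child that either exits with `code` or, if `sig` is nonzero,
// sends itself that signal first; returns the decoded status.
int child_outcome(int code, int sig) {
    pid_t pid = fork();
    if (pid < 0) return -1000;
    if (pid == 0) {
        if (sig != 0) {
            raise(sig);         // if the signal is fatal, _exit is never reached
        }
        _exit(code);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) return -1000;
    return decode_status(status);
}
```

With this encoding, child_outcome(42, 0) should give 42 and child_outcome(0, SIGKILL) should give -SIGKILL. Checking WIFEXITED before calling WEXITSTATUS matters, because the exit-code bits are meaningless when the child was killed by a signal.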

Putting it together — the shell

A shell that reads ls -l | wc does roughly:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) < 0) return 1;         // create the pipe
    pid_t child1 = fork();
    if (child1 < 0) return 1;            // fork failed
    if (child1 == 0) {                   // first child: ls
        dup2(fds[1], STDOUT_FILENO);     // redirect stdout to pipe write end
        close(fds[0]); close(fds[1]);    // originals are no longer needed
        execvp("ls", (char*[]){ "ls", "-l", NULL });
        _exit(127);                      // reached only if exec failed
    }
    pid_t child2 = fork();
    if (child2 == 0) {                   // second child: wc
        dup2(fds[0], STDIN_FILENO);      // redirect stdin from pipe read end
        close(fds[0]); close(fds[1]);
        execvp("wc", (char*[]){ "wc", NULL });
        _exit(127);
    }
    close(fds[0]); close(fds[1]);        // parent closes both ends so wc sees EOF
    waitpid(child1, NULL, 0);            // reap ls
    if (child2 > 0) waitpid(child2, NULL, 0);  // reap wc if it was created
    return 0;
}

dup2 is the third helper that makes this work — it duplicates a file descriptor onto a specific target number, which is how redirection is built. pipe creates a kernel buffer with two endpoints. _exit (not exit) is used in children to avoid running atexit handlers that the parent registered.
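
To isolate the pipe and dup2 mechanics from exec, here is a sketch in which the child writes to its redirected stdout directly and the parent reads until EOF. The helper capture_child_stdout and its parameters are hypothetical; a real shell would exec a program in the child instead of calling write.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: the child writes `msg` to its stdout, which has been
// redirected into a pipe; the parent reads until EOF (or until buf is full).
// Returns the number of bytes read, or -1 on error.
ssize_t capture_child_stdout(const char *msg, char *buf, size_t cap) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }
    if (pid == 0) {
        dup2(fds[1], STDOUT_FILENO);    // stdout now refers to the pipe
        close(fds[0]); close(fds[1]);   // originals are no longer needed
        if (write(STDOUT_FILENO, msg, strlen(msg)) < 0) _exit(1);
        _exit(0);
    }

    close(fds[1]);                      // drop the write end, or read never sees EOF
    size_t total = 0;
    ssize_t n;
    while (total < cap && (n = read(fds[0], buf + total, cap - total)) > 0) {
        total += (size_t)n;
    }
    close(fds[0]);
    waitpid(pid, NULL, 0);              // reap the child
    return (ssize_t)total;
}
```

The read loop ends on EOF only because every write end has been closed: the child’s copies disappear when it exits, and the parent closes its own before reading.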

Variants

`vfork`

A historic optimization: the child shares the parent’s address space (no copy at all) and the parent is suspended until the child exec’s or exits. Faster than fork on systems without copy-on-write. Largely obsolete on Linux because COW makes fork cheap; still useful on memory-constrained embedded systems. The trap with vfork is that the child must not touch any memory the parent cares about and must call exec or _exit quickly.

`clone` (Linux)

The Linux syscall underneath both fork and pthread_create. It takes flags that control which parts of the parent are shared vs. copied: CLONE_VM shares memory, CLONE_FILES shares the FD table, CLONE_SIGHAND shares signal handlers, CLONE_THREAD puts the child in the same thread group. Roughly, a thread is clone with everything shared; a process is clone with nothing shared.

`posix_spawn`

The single-call alternative — takes the program name, argv, env, and a “file actions” struct describing the FD setup. Designed for embedded systems where fork is too expensive (no MMU, no COW). Used on iOS and some BSDs as the preferred process-creation path.
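
A minimal sketch of the call, assuming a hypothetical helper spawn_shell_quiet that runs a command through sh with stdout sent to /dev/null. The file action stands in for the open and dup2 that a fork-based version would perform in the child before exec.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

// Hypothetical helper: runs `sh -c cmd` with stdout redirected to /dev/null
// and returns its exit code, or -1 on error.
int spawn_shell_quiet(const char *cmd) {
    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0) return -1;
    // Ask for /dev/null to be opened as fd 1 in the new process.
    if (posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "/dev/null",
                                         O_WRONLY, 0) != 0) {
        posix_spawn_file_actions_destroy(&fa);
        return -1;
    }

    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };
    pid_t pid;
    // posix_spawnp searches PATH and returns an errno value (not -1) on failure.
    int rc = posix_spawnp(&pid, "sh", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0) return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_shell_quiet("echo hi; exit 5") should return 5 with nothing printed. The child still has to be reaped with waitpid, exactly as after fork.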

`exec` family

execv, execvp, execve, execle, execlp, execl — they differ in whether argv is a list or a vector, whether PATH is searched for the binary, and whether environment is passed explicitly. execve is the underlying system call; the others are wrappers.
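
As an illustration of how the suffix letters map to behaviour, this sketch uses execve to hand the child an explicit one-variable environment. The variable GREETING, the helper name, and the assumption that a shell lives at /bin/sh are all specific to this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs /bin/sh with an environment containing only
// GREETING=hi and returns the shell's exit code (0 if the child saw it).
int run_with_fixed_env(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        char *const argv[] = { "sh", "-c", "[ \"$GREETING\" = hi ]", NULL };
        char *const envp[] = { "GREETING=hi", NULL };
        // "v": argv is a vector; "e": the environment is passed explicitly;
        // no "p": PATH is not searched, so the full path is required.
        execve("/bin/sh", argv, envp);
        _exit(127);             // reached only if execve failed
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The list-style equivalent that searches PATH and inherits the current environment would be execlp("sh", "sh", "-c", "...", (char *)NULL), with the arguments spelled out inline and terminated by a null pointer.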

Trade-offs

fork + exec — composable: the child can do arbitrary setup before becoming the new program. Standard since 1970s UNIX. Copy-on-write makes the apparent address-space copy cheap. Mature, well-understood, fits the everything-is-a-file UNIX philosophy.

posix_spawn — single call, no transient duplicate process. Faster on systems without COW (embedded, no-MMU). The “file actions” argument absorbs the setup that would otherwise happen between fork and exec — less flexible, more verbose for unusual cases.

Some specific tensions worth knowing:

Copy-on-write hides cost until it doesn’t. A fork on a 16 GB process appears instant — but if the child or parent then dirties many pages, the kernel must allocate physical memory for each, and a fork that “took zero memory” turns into one that needs gigabytes. Databases that fork to take snapshots (Redis BGSAVE) have to plan for this.
fork plus a multithreaded parent is treacherous. Only the calling thread survives the fork in the child; a lock that some other thread held at fork time stays locked in the child forever, because the thread that would release it does not exist there. The standard guidance is “fork early, before you start threads” or “use posix_spawn”.
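Signal handling across exec. Signals with a handler installed are reset to their default disposition on exec, since the handler code no longer exists; ignored signals stay ignored, and the signal mask (which signals are blocked) is preserved. Dispositions and the mask are both inherited across fork.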
File descriptor inheritance is mostly desirable (it’s what makes redirection work) but occasionally a hazard: a child can unintentionally inherit an open log file or a listening socket. Open such FDs with O_CLOEXEC, or set FD_CLOEXEC on them with fcntl, to close them automatically on exec.

Common pitfalls

Forgetting to wait. A long-running parent that forks children and never calls wait accumulates zombies. Either wait reliably, or set SIGCHLD handling to SIG_IGN (POSIX guarantees children are auto-reaped in that case) or to a handler that calls waitpid in a loop.
Using exit instead of _exit in the child. exit runs registered atexit handlers and flushes stdio buffers — both of which the parent may have set up and which now run twice. Always _exit (or _Exit) from a forked child that’s about to die without exec’ing.
Assuming an order between parent and child. Both are runnable after fork; either can run first. Code that depends on the child running first (or the parent running first) has a race.
Buffered stdio across fork. Both copies of the process inherit the same stdio buffers. A printf before fork that ended up only in the buffer (no newline, not flushed) gets emitted twice — once when each side flushes. Call fflush(stdout) before fork if you’ve been printing.
Leaking the wrong pipe end. In the shell pattern above, the parent and both children must close every FD they don’t need. If the parent leaves the write end of the pipe open, wc never sees EOF and blocks forever.
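
For the first pitfall, the “waitpid in a loop” idea can be sketched as a non-blocking reaper. The names spawn_exiting_child and reap_finished_children are hypothetical, and the sketch leaves out installing the SIGCHLD handler itself.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: forks a child that exits immediately with `code`.
// Returns the child's PID in the parent, or -1 if fork failed.
pid_t spawn_exiting_child(int code) {
    pid_t pid = fork();
    if (pid == 0) _exit(code);
    return pid;
}

// Reaps every child that has already terminated, without blocking.
// Returns how many children were reaped by this call.
int reap_finished_children(void) {
    int reaped = 0;
    for (;;) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);
        if (pid > 0) { reaped++; continue; }      // collected one zombie
        if (pid < 0 && errno == EINTR) continue;  // interrupted, try again
        break;  // 0: children remain but none has finished; -1: no children left
    }
    return reaped;
}
```

The loop matters because several children can exit before a single SIGCHLD is delivered, so one waitpid call per signal can leave zombies behind. If reap_finished_children is called from a handler, the handler should save and restore errno around it.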

Why is fork still alive given how much it complicates threading?

Several research papers have argued for fork’s retirement; the canonical one is Microsoft Research’s “A fork() in the road” (2019). The argument: fork was designed for single-threaded UNIX, doesn’t compose with multithreading, doesn’t compose with shared mmap regions, and leaks complexity into runtimes (Java’s Runtime.exec calls posix_spawn now). The counterargument: every Linux distribution, every shell, every container runtime, and a generation of muscle memory depend on it. It isn’t going anywhere.

The Process Abstraction — what a process actually is to the kernel.
Context Switching — what runs between two processes’ time slices.
CPU Scheduling — FIFO, SJF, STCF, RR — which child runs first after fork.
User Mode vs Kernel Mode — every one of these calls is a trap.
Limited Direct Execution — the mechanism that makes returning to user code after a syscall safe.