Threads and Shared State — Operating Systems

Summary

A thread is a separately schedulable stream of execution inside a single process. It has its own program counter, registers, and stack, but it shares the heap, code, and open file descriptors with every other thread in the same process. That sharing is the entire point — and it is also the entire problem. Two threads can touch the same memory without copying it, but the OS scheduler can interleave their instructions at any granularity it likes, and that interleaving is where every concurrency bug lives.

The whole zoo of synchronization primitives — locks, condition variables, semaphores, atomics, channels — exists to answer one wish: “let me declare which sequences of operations must run without interruption from another thread.” That’s it. Everything else is implementation detail.

Why it matters

Threads are how a single process exploits multiple cores, hides I/O latency, and structures concurrent work. A web server that serves one request per thread, an indexer that fans work across cores, a UI that keeps a worker thread for blocking calls — all of them rely on threads. The alternative — a separate process per concurrent task — is heavier (its own address space, no shared memory, more expensive context switches) and pushes coordination onto pipes or sockets.

The cost is that any non-trivial multithreaded program has bugs that are invisible in single-threaded testing. They appear under load, on different hardware, after a kernel upgrade, or never — until the customer hits them. An interviewer asking about threading is testing whether you understand why the bugs exist before asking which primitive to use against them.

How it works

Address space layout

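A process has one address space. Code and globals are shared. The heap (malloc/free) is shared. Open files are shared. What each thread gets to itself is a stack (set up by pthread_create; the default size is platform-dependent, commonly 8 MB on Linux, and adjustable with pthread_attr_setstacksize) and its register set (saved and restored on context switch). Thread-local storage (__thread in GCC/Clang, _Thread_local in C11, thread_local in C++) gives each thread a private slot for variables that look global but aren’t. A thread’s stack is private only by convention: it lives in the shared address space, so another thread that holds a pointer into it can still read or write it.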

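#include <stddef.h>       // NULL

int counter = 0;          // shared: one copy for the whole process
__thread int my_id = 0;   // per-thread copy (GCC/Clang; C11 spells it _Thread_local)

void* worker(void* arg) {
    (void)arg;            // unused in this fragment
    int local = 0;        // on this thread's stack only
    (void)local;
    counter++;            // unsynchronized read-modify-write: races with every other worker
    return NULL;
}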
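
The fragment above can be grown into a small runnable sketch. The names run_tls_demo, shared_hits, seen_ids, and NUM_THREADS are illustrative and not from the original, and the shared counter is guarded with a pthread mutex here so that the demo itself is race-free. It assumes a POSIX system and a compiler that supports __thread (GCC or Clang).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

static int shared_hits = 0;               // one copy, visible to every thread
static __thread int my_id = -1;           // one copy per thread
static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;
static int seen_ids[NUM_THREADS];         // slot i is written only by worker i

static void *worker(void *arg) {
    int index = *(int *)arg;              // 'index' lives on this thread's stack
    my_id = index + 100;                  // changes only this thread's copy
    pthread_mutex_lock(&hits_lock);
    shared_hits++;                        // shared, so it is updated under the lock
    pthread_mutex_unlock(&hits_lock);
    seen_ids[index] = my_id;              // record what this thread observed
    return NULL;
}

// Runs the workers, then returns the calling thread's own my_id.
int run_tls_demo(void) {
    pthread_t threads[NUM_THREADS];
    int indexes[NUM_THREADS];
    shared_hits = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        indexes[i] = i;
        int rc = pthread_create(&threads[i], NULL, worker, &indexes[i]);
        assert(rc == 0);                  // sketch only: real code would handle the error
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return my_id;                         // the workers never touched this copy
}
```

Called from a thread that has not written my_id, run_tls_demo() should return -1, because each worker overwrote only its own copy. shared_hits ends at NUM_THREADS because there is exactly one copy of it.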

The atomicity problem

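counter++ looks like one statement, but it is a read-modify-write. The compiler typically emits three instructions: load counter into a register, add one, store it back. (On x86 it may emit a single add-to-memory instruction, but without a lock prefix that is still not atomic across cores.) If two threads run counter++ concurrently and their steps interleave, both can load the same value, both increment, both store, and one increment vanishes. The C source looks innocent; the problem is that the scheduler can preempt a thread between any two of those instructions, and on a multicore machine the two threads can simply run at the same time.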

T1: load counter (=5)
T2: load counter (=5)
T1: add 1 (=6)
T2: add 1 (=6)
T1: store 6
T2: store 6
final counter = 6, expected 7
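
The trace above can be replayed deterministically in a single thread, which makes the lost update easy to inspect without depending on the scheduler. This is only a model: struct cpu, replay_lost_update, and replay_serial are hypothetical names, and the "register" is an ordinary struct field standing in for a real CPU register.

```c
// One thread's private state during counter++: the register holding the loaded value.
struct cpu {
    int reg;
};

static void load(struct cpu *t, const int *mem)  { t->reg = *mem; }
static void add_one(struct cpu *t)               { t->reg += 1; }
static void store(const struct cpu *t, int *mem) { *mem = t->reg; }

// Replays the interleaved trace: both threads load before either stores.
int replay_lost_update(int start) {
    int counter = start;
    struct cpu t1, t2;
    load(&t1, &counter);      // T1: load counter
    load(&t2, &counter);      // T2: load counter (same value as T1)
    add_one(&t1);             // T1: add 1
    add_one(&t2);             // T2: add 1
    store(&t1, &counter);     // T1: store
    store(&t2, &counter);     // T2: store the same value again; T1's update is lost
    return counter;
}

// Same six steps, but T1 finishes before T2 starts.
int replay_serial(int start) {
    int counter = start;
    struct cpu t1, t2;
    load(&t1, &counter);
    add_one(&t1);
    store(&t1, &counter);
    load(&t2, &counter);      // T2 now sees T1's result
    add_one(&t2);
    store(&t2, &counter);
    return counter;
}
```

Starting from 5, replay_lost_update returns 6 and replay_serial returns 7, matching the trace.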

What “atomic” actually means

An operation is atomic if no other thread can observe it partially done. Hardware gives you a small set of atomic operations directly: aligned word-sized loads and stores, plus read-modify-write instructions such as lock cmpxchg on x86. Anything wider or unaligned, such as a 64-bit store on a 32-bit machine, can tear. The OS and the language standard build on that hardware: atomic types make operations on a single variable indivisible, and mutexes extend the guarantee to arbitrary critical sections.
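
As a sketch of how a compare-and-swap instruction becomes a larger atomic update, the C11 function atomic_compare_exchange_weak can be wrapped in a retry loop. The example below keeps a running maximum across threads; atomic_store_max, high_water, run_max_demo, and the constants are illustrative names, not part of any library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_THREADS 4
#define OFFERS_PER_THREAD 1000

static atomic_int high_water;

// Atomically raises *target to 'candidate' if 'candidate' is larger.
static void atomic_store_max(atomic_int *target, int candidate) {
    int seen = atomic_load(target);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        // CAS failed (possibly spuriously): 'seen' now holds the current
        // value, so the loop condition is re-checked against it.
    }
}

static void *offer_values(void *arg) {
    int base = *(int *)arg;
    for (int k = 0; k < OFFERS_PER_THREAD; k++)
        atomic_store_max(&high_water, base + k);
    return NULL;
}

// Each thread offers a disjoint range; the largest value offered must win.
int run_max_demo(void) {
    pthread_t threads[NUM_THREADS];
    int bases[NUM_THREADS];
    atomic_store(&high_water, 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        bases[i] = i * OFFERS_PER_THREAD;
        int rc = pthread_create(&threads[i], NULL, offer_values, &bases[i]);
        assert(rc == 0);   // sketch only: real code would handle the error
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&high_water);
}
```

The largest value any thread offers is 3 * 1000 + 999 = 3999, so that is what run_max_demo() should return regardless of interleaving.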

Race conditions, critical sections, mutual exclusion

A race condition is any case where the correctness of the result depends on how the scheduler happens to interleave threads. A critical section is a region of code that touches shared state and must not run concurrently with any other code touching the same state, including another thread executing the same region. Mutual exclusion is the property the synchronization primitive provides: at most one thread is inside the critical section at a time.

The “wish” the textbook talks about is: “I want an atomic { ... } block.” Mainstream languages don’t give it to you (transactional memory came close but never shipped widely), so you simulate it with locks.
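
A minimal sketch of simulating that atomic block with a pthread mutex follows. The invariant is that money moved between two balances never changes their sum. The struct accounts type, the transfer and total helpers, and the constants are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define TRANSFERS 100000

struct accounts {
    pthread_mutex_t lock;   // guards both balances together
    long checking;
    long savings;
};

static struct accounts bank = { PTHREAD_MUTEX_INITIALIZER, 1000, 0 };

static void transfer(struct accounts *acct, long amount) {
    pthread_mutex_lock(&acct->lock);      // plays the role of "atomic {"
    acct->checking -= amount;             // between these two lines the sum is
    acct->savings += amount;              // wrong, but no other thread can look
    pthread_mutex_unlock(&acct->lock);    // plays the role of "}"
}

static long total(struct accounts *acct) {
    pthread_mutex_lock(&acct->lock);      // readers take the lock too
    long sum = acct->checking + acct->savings;
    pthread_mutex_unlock(&acct->lock);
    return sum;
}

static void *mover(void *arg) {
    (void)arg;
    for (int i = 0; i < TRANSFERS; i++)
        transfer(&bank, (i % 2) ? 1 : -1);   // alternate direction
    return NULL;
}

// Returns the combined balance after all threads finish.
long run_transfer_demo(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, mover, NULL);
        assert(rc == 0);   // sketch only: real code would handle the error
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return total(&bank);
}
```

Because every access to the two balances happens under the same lock, run_transfer_demo() should return the starting sum of 1000. The lock only works if every reader and writer agrees to take it; nothing in the language enforces that.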

Variants and trade-offs

Threads (shared memory): cheap to spawn, fast to coordinate, but every shared variable is a potential bug. Debugging is hard because the bug is in the interleaving, not the source. Default for CPU-bound work, latency-sensitive servers, and any program that needs to use multiple cores within one address space.

Processes (separate address space): expensive to spawn, expensive to communicate (pipes, sockets, shared-memory segments), but isolation is automatic, because one process can’t corrupt another’s memory. Default for sandboxing, fault isolation, and language runtimes where threads cannot run in parallel (CPython’s GIL lets only one thread execute Python bytecode at a time, so CPU-bound Python code usually scales across cores with processes instead).
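
The difference in sharing can be observed directly. In this sketch, which assumes a POSIX system, a thread and a forked child each increment the same global; only the thread’s write is visible to the caller afterwards. The function names value_after_thread_bump and value_after_child_bump are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 0;

static void *bump(void *arg) {
    (void)arg;
    value++;                 // same address space as the caller
    return NULL;
}

// A thread's write lands in the memory the caller reads.
int value_after_thread_bump(void) {
    value = 0;
    pthread_t thread;
    int rc = pthread_create(&thread, NULL, bump, NULL);
    assert(rc == 0);         // sketch only: real code would handle the error
    (void)rc;
    pthread_join(thread, NULL);   // join orders the thread's write before our read
    return value;
}

// A child process gets its own copy of the address space.
int value_after_child_bump(void) {
    value = 0;
    pid_t pid = fork();
    if (pid < 0)
        return -1;           // fork failed
    if (pid == 0) {
        value++;             // modifies the child's private copy only
        _exit(0);            // leave without running the parent's cleanup handlers
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return value;            // the parent's copy is untouched
}
```

value_after_thread_bump() should return 1 and value_after_child_bump() should return 0. To get the child’s result back, the processes would need an explicit channel such as a pipe or a shared-memory segment.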

Other axes worth knowing:

Kernel threads vs. user-space threads. Modern Linux uses 1:1 kernel threads via NPTL: every pthread_create results in one clone()-family syscall (clone3() on recent glibc). M:N user-space threading (green threads, fibers, coroutines) is making a comeback in language runtimes (Go goroutines, Java virtual threads) because kernel threads cost too much memory and scheduling overhead at very high concurrency.
Shared-everything vs. share-nothing. Shared-everything is the default thread model. Share-nothing (actor model, Erlang, Go channels) makes the data flow explicit and the synchronization implicit — fewer bugs, more boilerplate.
Preemptive vs. cooperative. Threads under an OS scheduler are preemptive — the kernel can switch you out at any instruction. Coroutines are cooperative — they only yield where you say await or yield. Cooperative is easier to reason about; preemptive is necessary when one task can’t be trusted to yield.
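
Even within pthreads, a share-nothing discipline is possible: give each worker a private slice of the input and collect results only after joining. The sketch below sums an array that way. The memory is still technically shared, so this is a convention rather than something the language enforces, and struct slice, sum_slice, parallel_sum, and NUM_WORKERS are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4

struct slice {
    const int *data;   // read-only view of this worker's part of the input
    size_t len;
    long sum;          // written by exactly one worker, read only after join
};

static void *sum_slice(void *arg) {
    struct slice *s = arg;
    long sum = 0;                       // accumulate on the worker's own stack
    for (size_t i = 0; i < s->len; i++)
        sum += s->data[i];
    s->sum = sum;
    return NULL;
}

long parallel_sum(const int *data, size_t len) {
    pthread_t threads[NUM_WORKERS];
    struct slice slices[NUM_WORKERS];
    size_t chunk = len / NUM_WORKERS;
    for (int i = 0; i < NUM_WORKERS; i++) {
        size_t start = (size_t)i * chunk;
        // The last worker also takes the remainder.
        size_t end = (i == NUM_WORKERS - 1) ? len : start + chunk;
        slices[i] = (struct slice){ data + start, end - start, 0 };
        int rc = pthread_create(&threads[i], NULL, sum_slice, &slices[i]);
        assert(rc == 0);   // sketch only: real code would handle the error
        (void)rc;
    }
    long total = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);   // join is the only synchronization point
        total += slices[i].sum;
    }
    return total;
}
```

No lock appears anywhere because no two threads ever write the same location, and the main thread reads each result only after the corresponding pthread_join.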

Why is a single 64-bit store sometimes not atomic?

On a 32-bit machine, a 64-bit store is typically two 32-bit stores under the hood. Another thread can read between them and see half-new, half-old. Even on 64-bit hardware, an unaligned 8-byte store that crosses a cache-line boundary can tear. The fix is to use std::atomic<int64_t> in C++ (or an _Atomic int64_t in C11). The atomic type guarantees suitable alignment, and the compiler picks an access that is indivisible on the target: a plain aligned store on x86-64, a wider instruction such as cmpxchg8b on 32-bit x86, or a lock-based fallback where the hardware offers nothing suitable. The “it’s just a number” intuition fails at the hardware level.
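
A small sketch of the atomic version: one thread alternates a 64-bit cell between all-zero and all-one bit patterns while another thread checks that it only ever sees one of the two. With _Atomic int64_t the count of mixed values should stay at zero. The names cell, count_torn_reads, and WRITES are illustrative, and on some 32-bit toolchains this may need linking against libatomic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WRITES 1000000

static _Atomic int64_t cell;
static atomic_bool done;

static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < WRITES; i++) {
        // Alternate between 0x0000...0 and 0xFFFF...F so a torn value is obvious.
        atomic_store(&cell, (i & 1) ? (int64_t)-1 : (int64_t)0);
    }
    atomic_store(&done, true);
    return NULL;
}

static void *reader(void *arg) {
    long *torn = arg;
    while (!atomic_load(&done)) {
        int64_t v = atomic_load(&cell);
        if (v != 0 && v != -1)
            (*torn)++;         // half-old, half-new value observed
    }
    return NULL;
}

// Returns how many torn values the reader saw.
long count_torn_reads(void) {
    pthread_t w, r;
    long torn = 0;
    atomic_store(&cell, 0);
    atomic_store(&done, false);
    int rc = pthread_create(&r, NULL, reader, &torn);
    assert(rc == 0);           // sketch only: real code would handle the error
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return torn;
}
```

atomic_is_lock_free(&cell) reports whether the target implements the type with a single indivisible instruction or falls back to an internal lock; either way the reader never sees a mixed value.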

When this is asked in interviews

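Almost always as the entry point to a concurrency question. “Two threads call counter++ a million times each; what’s counter at the end?” The right answer is that it depends on scheduling: at most 2,000,000, and in practice usually somewhere between 1,000,000 and 2,000,000. (Under the load/add/store model the theoretical minimum is actually 2, and strictly speaking an unsynchronized counter++ is a data race, which C and C++ treat as undefined behavior.) The strong follow-up is to explain why (load/add/store interleaving) and then to propose the fix: a mutex, an atomic counter, or __sync_fetch_and_add, a legacy GCC builtin whose modern spellings are __atomic_fetch_add and C11 atomic_fetch_add.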
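
The three fixes named above can be sketched side by side. Each variant runs two threads that increment a counter ITERATIONS times and should end at exactly twice that. The function names (count_with_mutex, count_with_atomic, count_with_builtin) are illustrative, and __sync_fetch_and_add assumes GCC or Clang.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define ITERATIONS 1000000

static long mutex_counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long atomic_counter;
static long builtin_counter;

static void *add_with_mutex(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        mutex_counter++;                          // load/add/store, but serialized
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

static void *add_with_atomic(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        atomic_fetch_add(&atomic_counter, 1);     // C11 atomic read-modify-write
    return NULL;
}

static void *add_with_builtin(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        __sync_fetch_and_add(&builtin_counter, 1);   // legacy GCC/Clang builtin
    return NULL;
}

// Runs the given worker on two threads and waits for both.
static void run_two(void *(*worker)(void *)) {
    pthread_t a, b;
    int rc = pthread_create(&a, NULL, worker, NULL);
    assert(rc == 0);   // sketch only: real code would handle the error
    rc = pthread_create(&b, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}

long count_with_mutex(void) {
    mutex_counter = 0;
    run_two(add_with_mutex);
    return mutex_counter;
}

long count_with_atomic(void) {
    atomic_store(&atomic_counter, 0);
    run_two(add_with_atomic);
    return atomic_load(&atomic_counter);
}

long count_with_builtin(void) {
    builtin_counter = 0;
    run_two(add_with_builtin);
    return builtin_counter;   // safe to read plainly: both threads are joined
}
```

For a single counter the atomic variants are usually the better fit, since they avoid lock overhead. The mutex becomes necessary once the critical section covers more than one variable.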

The follow-ups divide:

Foundational: “What’s a race condition? What’s a critical section?” Verbal, no code.
Mid-level: “Write the producer-consumer pattern. What primitives do you need?” Tests fluency with condition variables and mutexes.
Senior: “Why is volatile not enough for multi-threaded code?” Tests memory-model fluency: in C and C++, volatile only stops the compiler from eliding volatile accesses or reordering them relative to one another. It provides no atomicity and no ordering guarantee against CPU reordering or against surrounding non-volatile accesses. (Java’s volatile is different: it does give ordering and visibility guarantees.)
Staff: “Walk me through what std::memory_order_acquire actually means.” Tests release/acquire semantics and the C++ / Java memory model.
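
For the mid-level question, one common shape of the answer is a bounded queue guarded by a mutex and two condition variables. The sketch below is one way to write it, not the only one; struct queue, queue_push, queue_pop, run_producer_consumer, and the constants are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CAPACITY 8
#define ITEMS 1000

struct queue {
    int items[CAPACITY];
    size_t head, tail, count;
    pthread_mutex_t lock;        // guards every field above
    pthread_cond_t not_empty;    // signaled after a push
    pthread_cond_t not_full;     // signaled after a pop
};

static struct queue q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

static void queue_push(struct queue *qu, int item) {
    pthread_mutex_lock(&qu->lock);
    while (qu->count == CAPACITY)                 // loop: wakeups can be spurious
        pthread_cond_wait(&qu->not_full, &qu->lock);
    qu->items[qu->tail] = item;
    qu->tail = (qu->tail + 1) % CAPACITY;
    qu->count++;
    pthread_cond_signal(&qu->not_empty);
    pthread_mutex_unlock(&qu->lock);
}

static int queue_pop(struct queue *qu) {
    pthread_mutex_lock(&qu->lock);
    while (qu->count == 0)                        // wait releases the lock while blocked
        pthread_cond_wait(&qu->not_empty, &qu->lock);
    int item = qu->items[qu->head];
    qu->head = (qu->head + 1) % CAPACITY;
    qu->count--;
    pthread_cond_signal(&qu->not_full);
    pthread_mutex_unlock(&qu->lock);
    return item;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= ITEMS; i++)
        queue_push(&q, i);
    return NULL;
}

static void *consumer(void *arg) {
    long *sum = arg;
    for (int i = 0; i < ITEMS; i++)
        *sum += queue_pop(&q);
    return NULL;
}

// Returns the sum of everything the consumer received.
long run_producer_consumer(void) {
    pthread_t prod, cons;
    long sum = 0;
    int rc = pthread_create(&cons, NULL, consumer, &sum);
    assert(rc == 0);   // sketch only: real code would handle the error
    rc = pthread_create(&prod, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return sum;
}
```

The producer pushes 1 through 1000, so the consumer’s sum should be 1000 * 1001 / 2 = 500500. The two details interviewers tend to probe are the while loop around pthread_cond_wait (an if is a bug) and the fact that the predicate is always checked with the lock held.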