Concurrency Bugs — Deadlock, Atomicity, Order — Operating Systems

Summary#

Concurrency bugs split into two families. Non-deadlock bugs — atomicity violations and order violations — account for most real-world incidents according to the Lu et al. 2008 study of MySQL, Apache, Mozilla, and OpenOffice. Deadlock bugs are flashier (everything stops) but rarer (roughly 30% of concurrency bugs in that study). Both come from the same root: shared state plus uncontrolled scheduling. The fixes are mechanical once you can name the bug class.

The four conditions for deadlock — mutual exclusion, hold-and-wait, no preemption, circular wait — give you four levers to break it. Break any one and deadlock is impossible. The interview value is naming the levers and knowing which is cheap and which is expensive in real systems.

Why it matters#

Concurrency bugs are the hardest bugs to reproduce. They depend on scheduling, which depends on load, which depends on what else the machine is doing. “Worked in QA, broke in production” is the canonical story. They survive code review because the source code looks correct — the bug lives in the interleaving, not in any single line. And when they bite, they bite hard: corrupted state propagates silently, deadlocks freeze whole subsystems, and the debugger often perturbs timing enough to hide the bug.

The defensive value of categorising them is that the categories prescribe the fix. An atomicity violation needs a lock. An order violation needs a condition variable or a join. A deadlock needs you to break one of the four conditions. Knowing the family is half the cure.

How it works#

Atomicity violations#

The classic shape: a sequence of operations the programmer assumes runs together, but another thread interleaves. Example from MySQL InnoDB (paraphrased):

// Thread 1                       // Thread 2
if (thd->proc_info) {
    fputs(thd->proc_info, ...);   // thd->proc_info = NULL;
}

Thread 1 checks for null, then dereferences. If Thread 2 nulls the pointer between the two operations, Thread 1 dereferences null and crashes. The fix is to wrap both lines in a mutex — the programmer intended them to be atomic; the code didn’t enforce it.

The general pattern: any check-then-act sequence on shared state where the check is informative only if act immediately follows. Either lock both, or copy the value once and use the local copy.

Order violations#

Two operations need to happen in a particular order across threads, but nothing in the code enforces it. Example:

// Thread 1 (init)        // Thread 2 (worker)
state = make_state();      use_state(state);   // crash if state still NULL

Thread 2 may run first and dereference NULL. The fix is a condition variable or a join — Thread 2 must wait for the predicate state != NULL.

Order violations often come from “I’ll set this up, then spawn the worker” patterns where the worker is actually spawned earlier in the code than the setup completes. Or from listener-before-event subscriptions — the worker reads the queue before the producer’s first item arrives.

Deadlock and the four conditions#

For deadlock to occur, all four must hold simultaneously (Coffman 1971):

Mutual exclusion — the resources can only be held by one thread at a time.
Hold-and-wait — a thread holds a resource while waiting for another.
No preemption — resources can only be released voluntarily.
Circular wait — there’s a cycle of threads, each waiting for a resource held by the next.

Break any one and deadlock is impossible. The canonical AB-BA:

// Thread 1                   // Thread 2
pthread_mutex_lock(&a);        pthread_mutex_lock(&b);
pthread_mutex_lock(&b);        pthread_mutex_lock(&a);
// ...                         // ...

Thread 1 holds a and waits for b; Thread 2 holds b and waits for a. Circular wait satisfied; both threads block forever.

Live examples#

Linux ext3 directory locks (early 2000s) — circular locking on i_mutex between rename and unlink. Found by code review, fixed by establishing a lock ordering by inode number.
Java’s Vector and Hashtable — every method synchronizes, but composite operations (if (!v.contains(x)) v.add(x)) still have atomicity violations because the lock is dropped between calls.
Chrome’s GPU process (2014-era bugs) — order violation when a callback fires before the main thread finishes registering for it.

Prevention, avoidance, detection-and-recovery#

Prevention — design the code so a condition can never hold. Break mutual exclusion (use lock-free), hold-and-wait (acquire all locks at once with pthread_mutex_trylock and back off), no-preemption (some kernels can revoke locks), or circular wait (impose a total order on locks). The last one is the practical default.
Avoidance — at runtime, check whether granting a lock could lead to deadlock and refuse if so. Banker’s algorithm. Rarely used; requires knowing each thread’s max claim in advance.
Detection-and-recovery — let deadlock happen, detect it (cycle in the wait-for graph), pick a victim, abort it. Databases do this: every minute or so, run the cycle detector and kill the youngest transaction in any cycle. The user retries.

Variants and trade-offs#

Prevention (lock ordering) — design-time discipline; every lock has a level number; you only ever acquire higher-numbered locks while holding lower-numbered ones. Tools like Linux’s lockdep enforce it at runtime. Cost: every new lock added to the system must fit the ordering; refactors that change ordering break everywhere.

Detection-and-recovery (database transactions) — let conflicting transactions race; periodically detect cycles and abort the youngest. Cost: aborted transactions retry, which means application code must handle “I just lost the work I did.” Reasonable when transactions are short and idempotent; unreasonable when they’re long.

Other tensions:

Coarse vs. fine-grained locks. Coarse rarely deadlocks (one lock, no ordering) but throttles throughput. Fine-grained scales but multiplies the deadlock surface. Lock-free dodges deadlock entirely but trades it for the ABA problem and memory reclamation puzzles.
Static analysis vs. dynamic detection. Tools like lockdep, ThreadSanitizer (TSan), and Helgrind catch many bugs but at runtime cost (TSan slows down code 5-15x). Static tools (Coverity, Infer) catch some at compile time but miss interleavings only seen at runtime.
Compositionality. Functions that each take their own lock are safe in isolation; composed without a global ordering, they deadlock. APIs that take locks internally must document their order — or, better, accept the lock as a parameter so the caller controls ordering.

Other concurrency bug families#

Priority inversion — a low-priority thread holds a lock a high-priority thread needs. Fix: priority inheritance mutexes.
Lost wakeup / spurious wakeup — covered in Condition Variables; fix is the while-loop discipline.
Memory-ordering bugs — atomic operations correct on x86 (strong model) but broken on ARM (weaker model). Fix: explicit memory_order_acquire / memory_order_release.
Use-after-free in lock-free code — covered in Concurrent Data Structures; fix is hazard pointers or RCU.
Signal-handler races — a signal arrives mid-malloc; the handler calls printf; livelock. Fix: sigprocmask to defer signals on threads holding locks.

Why are atomicity violations more common than deadlocks in production?

Deadlocks are easy to notice (everything stops) and most projects build tools to detect them. Atomicity violations look like silent data corruption — a counter that drifts, a timestamp that goes backwards, a session that gets attributed to the wrong user. They’re easy to miss in testing because they require a specific interleaving, and easy to miss in production because the corruption may surface hours later, in an unrelated subsystem. The 2008 Lu et al. study found atomicity violations were ~50% of concurrency bugs in their corpus, deadlocks ~30%, order violations ~20%. The distribution holds up in modern codebases.

When this is asked in interviews#

A staple of any concurrency-heavy interview. Patterns:

Foundational — “Show me a deadlock in pseudo-code, then fix it.” Tests basic vocabulary.
Mid-level — “Two functions, each takes two locks. How do you make them deadlock-safe?” Answer: total order on lock addresses, acquire in increasing order. Or trylock + back-off.
Senior — “A counter that’s supposed to be monotonic is going backwards in production. What’s your debugging approach?” Tests whether you reach for atomicity-violation tools (TSan, perf, adding lock-discipline asserts) rather than guessing.
Staff — “Design a deadlock detector for a database.” Tests wait-for graph, cycle detection, victim selection, and trade-offs (false positives, detection latency, abort cost).

A second context: production incident reviews. “We had a 30-minute freeze; turns out two services held each other’s locks.” Tells you whether you can move from symptom (freeze) to root cause (circular wait) to fix (lock ordering or trylock-and-retry) in real time.