Locks and Spinlocks — Operating Systems

What it is#

A lock is the primitive that turns “this region of code must not be interleaved with itself” into a runtime check. It exposes two operations — lock() and unlock() — with the property that at most one thread can be inside the lock at any moment. Build it correctly and a counter++ between lock and unlock becomes effectively atomic, even though the underlying instructions are not.

The interesting thing is not the API; it’s the implementation. You can’t build a lock out of regular loads and stores — that brings you back to the original race. You need a hardware-supported atomic operation: test-and-set, compare-and-swap, fetch-and-add, or load-linked / store-conditional. Every lock in every OS is some flavour of these.

When to use it#

Whenever multiple threads (or a thread and an interrupt handler) touch the same writable state. Specific cases:

A shared counter, list, hash table, queue. Wrap mutating operations in a lock.
An invariant that spans two operations. “Decrement balance A and increment balance B” — both must happen or neither, so both go under one lock.
Lazy initialization. pthread_once or a double-checked lock around a one-time setup.
Coordinating with an interrupt handler in the kernel. Disable interrupts, then take a spinlock — never the other way around.

Don’t reach for a lock when:

A lock-free data structure or an atomic primitive does the job (a single counter is better as atomic<int>).
The “shared” state is actually thread-local and you just need to make that explicit (thread_local).
The synchronization you need is “wait until something is ready,” not “wait for exclusive access.” That’s a condition variable’s job.

How it works#

The naive (broken) attempt#

volatile int locked = 0;
void lock(void)   { while (locked) ; locked = 1; }
void unlock(void) { locked = 0; }

This loses immediately. Two threads can both read locked == 0, both set it to 1, both enter the critical section. The whole point is that “check it’s zero AND set it to one” must be one indivisible step.

Test-and-set#

Most hardware exposes some form of atomic “swap” — xchg on x86, swp on early ARM. Set the lock word to 1 atomically and return the old value:

typedef struct { int flag; } lock_t;

void lock(lock_t* L) {
    while (atomic_exchange(&L->flag, 1) == 1)
        ;  // spin until we win the swap
}
void unlock(lock_t* L) { atomic_store(&L->flag, 0); }

This is correct but pessimal — every waiter spins on a memory location, generating cache-coherence traffic that scales as O(N^2) in the number of contenders. It’s also unfair: the next acquirer is whichever core’s cache invalidation lands first, not whoever has been waiting longest.

Compare-and-swap#

int cas(int* addr, int expected, int desired);
// atomically: if (*addr == expected) { *addr = desired; return 1; } else return 0;

CAS is more general — it lets you build lock-free data structures, not just locks. To build a lock: while (!cas(&L->flag, 0, 1)) ; — equivalent to test-and-set in spirit but the universal primitive.

Fetch-and-add (ticket locks)#

A ticket lock cures the unfairness of test-and-set by handing out a queue number:

typedef struct { int ticket, turn; } ticket_lock_t;

void lock(ticket_lock_t* L) {
    int my = atomic_fetch_add(&L->ticket, 1);
    while (atomic_load(&L->turn) != my)
        ;  // wait until it's my turn
}
void unlock(ticket_lock_t* L) {
    atomic_fetch_add(&L->turn, 1);
}

Now threads acquire in arrival order — FIFO, starvation-free. The cache traffic is still O(N) per release because every waiter is spinning on turn. MCS and CLH locks fix that by giving each waiter its own cache line to spin on.

Two-phase locks (futex-backed mutexes)#

Spinning forever in user space wastes cycles. A modern pthread_mutex_t is a two-phase lock: spin for a few hundred cycles in case the holder is about to release, then call into the kernel (futex(FUTEX_WAIT)) and sleep. The kernel puts you on a wait queue keyed by the lock’s address; the unlocker calls futex(FUTEX_WAKE) to release one waiter.

lock(m):
  if cas(&m->flag, UNLOCKED, LOCKED) succeed: return
  spin briefly retrying CAS
  if still locked: futex_wait(&m->flag, LOCKED)   // syscall, sleep
  retry from top
unlock(m):
  store m->flag = UNLOCKED
  if waiters: futex_wake(&m->flag, 1)             // syscall

The brilliance is the uncontended path is a single CAS — no syscall, no scheduler involvement. Only contention pays the kernel cost. This is why a glibc mutex is ~25 ns uncontended versus ~1 µs for a Win32 critical section on older Windows.

Spinlocks in the kernel#

Inside the kernel, with interrupts disabled, you can’t sleep — there’s no other thread to give the CPU to, and yielding inside an interrupt handler is forbidden. Spinlocks are the only option. Linux uses ticket spinlocks (now qspinlock, a queued variant) for in-kernel mutual exclusion. The trade-off changes completely: in user space you have a scheduler that can run someone else; in the kernel you don’t.

Variants#

Recursive (re-entrant) locks#

Same thread can lock the same mutex multiple times; releases must match acquires. Useful when a function that holds the lock calls another function that needs it. Cost: hides design bugs (why does function B need this lock anyway?) and is slightly slower (must track owner thread and count).

Reader-writer locks#

Many readers OR one writer. Throughput win when reads dominate and the critical section is non-trivial. Cost: at high contention writer starvation is common; the lock itself is heavier (typically three words instead of one).

Priority-inheritance mutexes#

If a high-priority thread waits for a lock held by a low-priority thread, the holder is temporarily boosted to the waiter’s priority. Prevents priority inversion (famously bit Mars Pathfinder). Required for real-time use; pthread_mutexattr_setprotocol(PTHREAD_PRIO_INHERIT).

Robust mutexes#

If the holder thread dies (segfaults, gets killed), the next acquirer sees EOWNERDEAD and gets a chance to repair the protected state. Important for processes that share a mutex via shared memory.

Lock-free / wait-free structures#

Use CAS directly instead of building a lock. A lock-free queue, stack, or counter can scale better under contention because there’s no kernel sleep involved. Cost: extremely hard to get right (ABA bugs, memory reclamation), and the C++ memory model’s acquire/release semantics get involved.

Trade-offs#

Spinlock — uncontended cost is one CAS, no syscall. Wins when the critical section is short (single-digit microseconds) and the system has more cores than threads. Loses badly when a holder is descheduled — every waiter burns CPU until the OS schedules the holder back in. In user space, almost always wrong.

Sleep lock (futex mutex) — uncontended cost is also one CAS. Contended path costs a syscall and a context switch (~1-5 µs) but the waiter doesn’t burn CPU. Wins when threads outnumber cores, when critical sections vary in length, or when the holder may be preempted. The default for user space.

Other tensions:

Coarse vs. fine-grained locking. One big lock is correct and easy; many small locks scale but invite deadlock. Lock ordering rules (always acquire a before b) are how you make fine-grained correct.
Lock vs. lock-free. Lock-free avoids the kernel sleep cost but is dramatically harder to write and audit. For 95% of code, a mutex is better.
Mutex vs. RW lock. RW wins when read:write is >10:1 and reads are non-trivial. For tiny critical sections, the RW lock’s extra overhead dominates the parallelism gain.
Try-lock and timed-lock. pthread_mutex_trylock returns immediately if the lock is held; pthread_mutex_timedlock waits with a deadline. Both useful for deadlock-avoiding “back off and retry” patterns.

Common pitfalls#

Holding a lock across I/O or a syscall that can block. Blocks every other contender for the same lock until the I/O completes. Either release first, do the I/O, reacquire — or move the I/O outside the critical section entirely.
Forgetting to unlock on every return path. if (err) return -1; after lock() is a leak. Use RAII (C++), defer (Go), or pthread_cleanup_push (C).
Lock ordering inversion. Thread 1 locks a then b; thread 2 locks b then a. Classic AB-BA deadlock. Establish and enforce a total order on locks; tools like lockdep in the Linux kernel catch violations at runtime.
Signalling a CV outside the lock. Sometimes legal but breaks predicate ordering. Default: signal under the lock.
Using a spinlock in user space. The holder can be preempted. Every waiter burns a core. Use pthread_mutex_t unless you have measured proof a spinlock helps.
Holding a lock and forking. The child inherits the locked mutex with no owner; no one can ever unlock it. Use pthread_atfork to release locks before fork, or avoid threads-then-fork entirely.
Not initializing a mutex. pthread_mutex_t m; on the stack with no pthread_mutex_init(&m, NULL) gives you garbage state. PTHREAD_MUTEX_INITIALIZER only works for static storage.

What's the ABA problem in lock-free code?

Compare-and-swap checks “is the value still X?” before swapping. But “X” might mean a different X that happens to share its bits — for example, a pointer to a freed and reallocated node. Thread 1 reads pointer X. Thread 2 frees X, allocates again, gets the same address, and reinserts. Thread 1’s CAS succeeds against the new X but the internal state has changed. Fixes: tag the pointer with a version counter (double-word CAS), or use hazard pointers / epoch-based reclamation. ABA is the reason lock-free code looks easier than it is.

Threads and Shared State — what locks protect.
POSIX Threads API — the pthread_mutex_t surface.
Condition Variables — the wait-for-state companion.
Semaphores — counting generalisation.
Concurrency Bugs — Deadlock, Atomicity, Order — what goes wrong with locks.