Concurrent Data Structures
Concurrent counters, linked lists, queues, and hash tables — coarse-grained vs hand-over-hand vs lock-free, and where each one breaks under contention.
What it is#
A concurrent data structure is a regular data structure — counter, list, queue, hash table, tree — extended with enough synchronization that multiple threads can call its methods at the same time without corrupting it or losing updates. The interesting design space is how you synchronize: one big lock around the whole thing, fine-grained locks per element, lock-free with atomics, or sharded so threads rarely touch each other’s data.
The progression mirrors the lifecycle of a real codebase. Start with one lock because it’s obviously correct. Profile, find the lock is the bottleneck, split it. Keep splitting until the cost of acquiring multiple locks dominates, or until you reach for lock-free.
When to use it#
Whenever multiple threads call methods on the same in-memory data structure:
- Counter — shared metric, request ID generator, sequence number.
- Queue / channel — producer-consumer between threads, work-stealing task pool, log buffer.
- Linked list / skip list — ordered set, in-memory index, free list.
- Hash table — in-memory cache, symbol table, dedup set.
- Tree / B-tree — in-memory index, page-level locks in a database storage engine.
The shape of the workload dictates the strategy. If reads dominate writes 100:1, a RW lock or sharded structure pays off. If writes dominate, fine-grained or lock-free is the only thing that scales. If the structure is touched once per request, one lock is plenty — don’t optimise what isn’t a bottleneck.
How it works#
The counter — three flavours#
// Coarse: one mutex around an int. Correct, simple, doesn't scale.struct counter { int v; pthread_mutex_t m; };void inc(struct counter* c) { pthread_mutex_lock(&c->m); c->v++; pthread_mutex_unlock(&c->m);}// Atomic: hardware fetch-and-add. Lock-free, but cache-line ping-pongs across cores.atomic_int v;void inc(void) { atomic_fetch_add(&v, 1); }// Sloppy / per-CPU: each thread bumps a local counter; reader sums them up.__thread long local;long globals[NCPU]; // periodically rolled upvoid inc(void) { local++; if (local % 1024 == 0) flush(); }The sloppy counter is what Linux uses for per-cpu stats and what high-volume metrics systems do. Reads are approximate but cheap; writes don’t touch any shared cache line. The trade-off is read accuracy and the periodic flush cost.
Linked list — coarse vs. hand-over-hand#
Coarse: one mutex protects the whole list. Insert and delete each take it. Throughput is bounded by the single-lock acquire/release rate (~10M/s on a modern x86).
Hand-over-hand (lock-coupling): each node has its own lock. To walk the list, you lock node A, lock node B, unlock A, lock C, unlock B, … The walker holds at most two locks at a time. Concurrent walks can overtake each other.
T1 walking: [a*][b*][c ][d ][e ]T2 walking: [a ][b ][c*][d*][e ]In practice hand-over-hand often loses to a coarse lock — the per-node lock acquire/release overhead dominates the savings unless the list is long and the operations are slow. Lock-free lists (Harris-Michael) skip locks entirely with CAS, but deletion is subtle (you mark a node “logically deleted” by setting the low bit of its next pointer; a later traversal physically unlinks it).
Queue — Michael-Scott#
struct node { value v; atomic<node*> next; };struct queue { atomic<node*> head, tail; };
void enqueue(queue* q, value v) { node* n = new_node(v); while (true) { node* t = q->tail.load(); node* next = t->next.load(); if (t != q->tail.load()) continue; // tail moved, retry if (next == NULL) { if (cas(&t->next, NULL, n)) { // link new node cas(&q->tail, t, n); // swing tail return; } } else { cas(&q->tail, t, next); // help advance tail } }}The Michael-Scott queue is the canonical lock-free FIFO. The dance is: load tail, see whether tail still points to itself (if not, help advance it), then CAS in the new node. A consumer dequeueing does the symmetric dance on head. The reason it’s hard is memory reclamation — when do you free(node)? A concurrent reader might still hold a pointer. Solutions: hazard pointers, epoch-based reclamation, or RCU.
Hash table — striped locking#
Replace one mutex with N mutexes, each protecting a subset of buckets:
#define STRIPES 256struct ht { bucket_t* bins[CAP]; pthread_mutex_t locks[STRIPES];};
void put(ht* h, key_t k, val_t v) { int i = hash(k) % CAP; pthread_mutex_lock(&h->locks[i % STRIPES]); // ... insert into h->bins[i] ... pthread_mutex_unlock(&h->locks[i % STRIPES]);}Each lock protects 1/N of the buckets; collisions on the same lock only happen when keys hash to bins in the same stripe. java.util.concurrent’s ConcurrentHashMap was striped (16 stripes by default in Java 7) before moving to per-bucket CAS in Java 8.
The scaling cliff#
Once a data structure is correct, the real question is throughput as N (cores or threads) grows. The shapes:
- Coarse-locked: throughput climbs to a point, then plateaus — every operation serializes on the one lock.
- Atomic single word: scales up to ~8-16 cores, then plateaus or declines — the cache line becomes a contention point; every increment invalidates every other core’s copy.
- Sloppy / per-CPU: scales near-linearly with cores until the read aggregation becomes a bottleneck.
- Fine-grained / lock-free: scales well for low contention; bad cases (all threads hitting the same hash bucket, the same queue end) hit pathological retry loops.
Variants#
Per-CPU / sharded data structures#
Each CPU has its own copy; updates go to the local copy with no contention; reads aggregate across CPUs. Linux’s per-CPU counters, jemalloc’s per-thread arenas, and tcmalloc’s thread caches all use this. Cost: reads are O(N) and approximate.
Read-copy-update (RCU)#
Readers don’t synchronize at all — they just dereference. Writers copy the structure, mutate the copy, atomically swap the pointer, then wait for all pre-swap readers to finish before freeing the old copy. The Linux kernel uses RCU heavily for routing tables, file descriptor tables, and the directory cache. Reads are essentially free; writes are slow.
Lock-free vs. wait-free#
- Lock-free — at least one thread always makes progress, even under contention. Individual threads can be starved.
- Wait-free — every thread completes its operation in a bounded number of steps, regardless of contention. Theoretically beautiful, in practice expensive — most production code is “lock-free enough.”
Optimistic concurrency control#
Read, compute, then attempt to commit with CAS or version-check. If something else moved while you computed, retry. Used in modern B-tree variants (Bw-tree), in HTM-based code (Intel TSX), and in software transactional memory.
Trade-offs#
Other tensions:
- Throughput vs. latency variance. A coarse lock has predictable latency (the queue length is bounded by core count). A lock-free retry loop has long tails — under heavy contention, a thread can retry hundreds of times.
- Read-mostly vs. write-mostly. RW locks, RCU, copy-on-write trees win on read-mostly. Striped locks and lock-free win on write-heavy.
- Cache locality. A per-CPU structure keeps each thread on its own cache lines. A shared counter pings the line between cores on every update; at ~80 ns per ping, that’s the entire cost.
Common pitfalls#
- False sharing. Two
intcounters in the same struct land on the same cache line; threads incrementing them invalidate each other’s line on every write. Padding them to 64 bytes (the cache line size on x86) recovers most of the lost throughput. - Memory reclamation in lock-free code. “When is it safe to
free?” is the hardest question in lock-free programming. Get it wrong and you free a node a concurrent reader is still walking. Solutions: hazard pointers, epoch reclamation, RCU. Don’t write a lock-free structure without one. - Assuming
volatileis enough.volatileprevents compiler reordering of accesses but not CPU reordering or atomicity. Useatomic<T>(orstd::atomic_*in C,_Atomicin C11) with explicit memory orders. - ABA without versioning. Already covered in Locks and Spinlocks. Lock-free stacks and queues need tagged pointers or epoch reclamation to be safe.
- Striped locks with a bad hash. If keys cluster, all hits land on the same stripe and you’re back to coarse locking. Choose the stripe count and hash function so collisions are rare.
- Atomic operations on misaligned data. An
atomic<int64_t>crossing a cache line is undefined behaviour on x86 and a partial-store stall on most other architectures. The compiler aligns these for you; manual placement (packed structs, mmap-backed memory) can break the guarantee.
Why does a single atomic counter stop scaling around 16 cores?
Every atomic_fetch_add on the same cache line forces the line into Modified state on the executing core and Invalidated on every other core’s L1. Each core then re-fetches it on its next increment. The MESI coherence traffic grows with the number of contenders; at some point the line spends most of its time in flight on the interconnect rather than in any L1. The fix is to either eliminate the sharing (per-CPU counter) or to allow approximate reads (sloppy counter). Hardware can’t make a single contended word scale — physics intervenes.
Related building blocks#
- Locks and Spinlocks — the synchronization primitive these are built on.
- Threads and Shared State — the model that motivates them.
- POSIX Threads API — the surface you build these against.
- Condition Variables — used by blocking concurrent queues.
- Concurrency Bugs — Deadlock, Atomicity, Order — the failure modes specific to concurrent structures.