Concurrent Data Structures

Concurrent counters, linked lists, queues, and hash tables — coarse-grained vs hand-over-hand vs lock-free, and where each one breaks under contention.

Building Block Intermediate
9 min read
concurrent-data-structures lock-free scaling contention

What it is#

A concurrent data structure is a regular data structure — counter, list, queue, hash table, tree — extended with enough synchronization that multiple threads can call its methods at the same time without corrupting it or losing updates. The interesting design space is how you synchronize: one big lock around the whole thing, fine-grained locks per element, lock-free with atomics, or sharded so threads rarely touch each other’s data.

The progression mirrors the lifecycle of a real codebase. Start with one lock because it’s obviously correct. Profile, find the lock is the bottleneck, split it. Keep splitting until the cost of acquiring multiple locks dominates, or until you reach for lock-free.

When to use it#

Whenever multiple threads call methods on the same in-memory data structure:

  • Counter — shared metric, request ID generator, sequence number.
  • Queue / channel — producer-consumer between threads, work-stealing task pool, log buffer.
  • Linked list / skip list — ordered set, in-memory index, free list.
  • Hash table — in-memory cache, symbol table, dedup set.
  • Tree / B-tree — in-memory index, page-level locks in a database storage engine.

The shape of the workload dictates the strategy. If reads dominate writes 100:1, a RW lock or sharded structure pays off. If writes dominate, fine-grained or lock-free is the only thing that scales. If the structure is touched once per request, one lock is plenty — don’t optimise what isn’t a bottleneck.

How it works#

The counter — three flavours#

// Coarse: one mutex around an int. Correct, simple, doesn't scale.
struct counter { int v; pthread_mutex_t m; };
void inc(struct counter* c) {
pthread_mutex_lock(&c->m);
c->v++;
pthread_mutex_unlock(&c->m);
}
// Atomic: hardware fetch-and-add. Lock-free, but cache-line ping-pongs across cores.
atomic_int v;
void inc(void) { atomic_fetch_add(&v, 1); }
// Sloppy / per-CPU: each thread bumps a local counter; reader sums them up.
__thread long local;
long globals[NCPU]; // periodically rolled up
void inc(void) { local++; if (local % 1024 == 0) flush(); }

The sloppy counter is what Linux uses for per-cpu stats and what high-volume metrics systems do. Reads are approximate but cheap; writes don’t touch any shared cache line. The trade-off is read accuracy and the periodic flush cost.

Linked list — coarse vs. hand-over-hand#

Coarse: one mutex protects the whole list. Insert and delete each take it. Throughput is bounded by the single-lock acquire/release rate (~10M/s on a modern x86).

Hand-over-hand (lock-coupling): each node has its own lock. To walk the list, you lock node A, lock node B, unlock A, lock C, unlock B, … The walker holds at most two locks at a time. Concurrent walks can overtake each other.

T1 walking: [a*][b*][c ][d ][e ]
T2 walking: [a ][b ][c*][d*][e ]

In practice hand-over-hand often loses to a coarse lock — the per-node lock acquire/release overhead dominates the savings unless the list is long and the operations are slow. Lock-free lists (Harris-Michael) skip locks entirely with CAS, but deletion is subtle (you mark a node “logically deleted” by setting the low bit of its next pointer; a later traversal physically unlinks it).

Queue — Michael-Scott#

struct node { value v; atomic<node*> next; };
struct queue { atomic<node*> head, tail; };
void enqueue(queue* q, value v) {
node* n = new_node(v);
while (true) {
node* t = q->tail.load();
node* next = t->next.load();
if (t != q->tail.load()) continue; // tail moved, retry
if (next == NULL) {
if (cas(&t->next, NULL, n)) { // link new node
cas(&q->tail, t, n); // swing tail
return;
}
} else {
cas(&q->tail, t, next); // help advance tail
}
}
}

The Michael-Scott queue is the canonical lock-free FIFO. The dance is: load tail, see whether tail still points to itself (if not, help advance it), then CAS in the new node. A consumer dequeueing does the symmetric dance on head. The reason it’s hard is memory reclamation — when do you free(node)? A concurrent reader might still hold a pointer. Solutions: hazard pointers, epoch-based reclamation, or RCU.

Hash table — striped locking#

Replace one mutex with N mutexes, each protecting a subset of buckets:

#define STRIPES 256
struct ht {
bucket_t* bins[CAP];
pthread_mutex_t locks[STRIPES];
};
void put(ht* h, key_t k, val_t v) {
int i = hash(k) % CAP;
pthread_mutex_lock(&h->locks[i % STRIPES]);
// ... insert into h->bins[i] ...
pthread_mutex_unlock(&h->locks[i % STRIPES]);
}

Each lock protects 1/N of the buckets; collisions on the same lock only happen when keys hash to bins in the same stripe. java.util.concurrent’s ConcurrentHashMap was striped (16 stripes by default in Java 7) before moving to per-bucket CAS in Java 8.

The scaling cliff#

Once a data structure is correct, the real question is throughput as N (cores or threads) grows. The shapes:

  • Coarse-locked: throughput climbs to a point, then plateaus — every operation serializes on the one lock.
  • Atomic single word: scales up to ~8-16 cores, then plateaus or declines — the cache line becomes a contention point; every increment invalidates every other core’s copy.
  • Sloppy / per-CPU: scales near-linearly with cores until the read aggregation becomes a bottleneck.
  • Fine-grained / lock-free: scales well for low contention; bad cases (all threads hitting the same hash bucket, the same queue end) hit pathological retry loops.

Variants#

Per-CPU / sharded data structures#

Each CPU has its own copy; updates go to the local copy with no contention; reads aggregate across CPUs. Linux’s per-CPU counters, jemalloc’s per-thread arenas, and tcmalloc’s thread caches all use this. Cost: reads are O(N) and approximate.

Read-copy-update (RCU)#

Readers don’t synchronize at all — they just dereference. Writers copy the structure, mutate the copy, atomically swap the pointer, then wait for all pre-swap readers to finish before freeing the old copy. The Linux kernel uses RCU heavily for routing tables, file descriptor tables, and the directory cache. Reads are essentially free; writes are slow.

Lock-free vs. wait-free#

  • Lock-free — at least one thread always makes progress, even under contention. Individual threads can be starved.
  • Wait-free — every thread completes its operation in a bounded number of steps, regardless of contention. Theoretically beautiful, in practice expensive — most production code is “lock-free enough.”

Optimistic concurrency control#

Read, compute, then attempt to commit with CAS or version-check. If something else moved while you computed, retry. Used in modern B-tree variants (Bw-tree), in HTM-based code (Intel TSX), and in software transactional memory.

Trade-offs#

Lock-based — straightforward to reason about (every method has an obvious critical section), easy to debug, easy to extend. Cost: contention serializes, deadlock is always one bad ordering away, blocking waiters waste latency.
Lock-free (CAS-based) — no blocking waiters, no deadlock, often higher throughput at high core counts. Cost: extremely hard to write correctly (ABA, memory reclamation, memory ordering); reasoning about a 30-line CAS loop is harder than 300 lines of locked code; benchmarks lie.

Other tensions:

  • Throughput vs. latency variance. A coarse lock has predictable latency (the queue length is bounded by core count). A lock-free retry loop has long tails — under heavy contention, a thread can retry hundreds of times.
  • Read-mostly vs. write-mostly. RW locks, RCU, copy-on-write trees win on read-mostly. Striped locks and lock-free win on write-heavy.
  • Cache locality. A per-CPU structure keeps each thread on its own cache lines. A shared counter pings the line between cores on every update; at ~80 ns per ping, that’s the entire cost.

Common pitfalls#

  • False sharing. Two int counters in the same struct land on the same cache line; threads incrementing them invalidate each other’s line on every write. Padding them to 64 bytes (the cache line size on x86) recovers most of the lost throughput.
  • Memory reclamation in lock-free code. “When is it safe to free?” is the hardest question in lock-free programming. Get it wrong and you free a node a concurrent reader is still walking. Solutions: hazard pointers, epoch reclamation, RCU. Don’t write a lock-free structure without one.
  • Assuming volatile is enough. volatile prevents compiler reordering of accesses but not CPU reordering or atomicity. Use atomic<T> (or std::atomic_* in C, _Atomic in C11) with explicit memory orders.
  • ABA without versioning. Already covered in Locks and Spinlocks. Lock-free stacks and queues need tagged pointers or epoch reclamation to be safe.
  • Striped locks with a bad hash. If keys cluster, all hits land on the same stripe and you’re back to coarse locking. Choose the stripe count and hash function so collisions are rare.
  • Atomic operations on misaligned data. An atomic<int64_t> crossing a cache line is undefined behaviour on x86 and a partial-store stall on most other architectures. The compiler aligns these for you; manual placement (packed structs, mmap-backed memory) can break the guarantee.
Why does a single atomic counter stop scaling around 16 cores?

Every atomic_fetch_add on the same cache line forces the line into Modified state on the executing core and Invalidated on every other core’s L1. Each core then re-fetches it on its next increment. The MESI coherence traffic grows with the number of contenders; at some point the line spends most of its time in flight on the interconnect rather than in any L1. The fix is to either eliminate the sharing (per-CPU counter) or to allow approximate reads (sloppy counter). Hardware can’t make a single contended word scale — physics intervenes.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.