Ticketmaster (Flash Sale)

Inventory reservation under coordinated burst load: waiting rooms, holds, atomic seat allocation, and the bot-vs-fan arms race.

Companies this resembles: Ticketmaster · AXS · See Tickets · Eventbrite

Step 1 — Clarify Requirements

Functional

  • A user browses upcoming events; picks an event; sees a seat map; selects seats; checks out within a hold window; pays; receives a ticket.
  • A seat can be sold exactly once. Ever. Double-booking is the worst possible failure.
  • An on-sale moment (“Taylor Swift Eras Tour, sale at 10:00 ET sharp”) collapses a year of event browsing into ~10 minutes of stampede.
  • Waiting room: between “queue opens” and “you get a turn”, show me my position and an honest ETA.
  • Hold: once I have seats in my cart, they’re mine for 8 minutes; if I don’t pay, they return to the pool.
  • Out of scope: dynamic pricing, secondary-market resale, accessibility-seat handling rules, event-day check-in scanning.

Non-functional

  • 99.99% availability for the browse path year-round.
  • 99.9% is acceptable for the on-sale path during the spike; degradation is OK as long as there are no double-bookings.
  • p99 “click → hold confirmed” under 2 s during peak. The user is staring at a spinner; any longer and cart abandonment climbs.
  • Single biggest sale we plan for: 10 M users in the waiting room, 50 K seats available. The mismatch is the whole problem.
  • Strong consistency on seat ownership. Weak consistency on every other axis (waiting-room position, displayed inventory) is fine.

Step 2 — Capacity Estimation

A “normal” day:

  • ~50 K events listed, ~1 M tickets purchased/day → ~12 purchases/sec average, peaks of a few hundred per second on a Friday afternoon sale.

A flash sale (the design driver):

  • Waiting room population: 10 M users in queue for a 50 K-seat tour leg.
  • Arrival rate: those 10 M show up within ~5 minutes (waiting-room “doors open” + a coordinated email blast + push notification). That is ~33 K requests/sec sustained for the first 5 minutes, with sharp peaks at the second the doors open.
  • Active hold rate: at most 50 K simultaneous holds (one per seat). The remaining 9.95 M users are in queue, not actively reserving.
  • Reservation attempts per second (the actual contention): each hold cycle is roughly 5 minutes (3-minute payment + abandonment + re-queue). Through a sale’s first hour we cycle each seat ~10–12 times before it sticks. That’s ~50 K × 10 / 3600 ≈ 140 reservation attempts/sec: a small number, but every attempt touches the same hot inventory.
  • Storage: a 50 K-seat venue has ~50 K seat rows. Even a year of tours is under 100 M rows in the inventory store. Tiny.
  • Payment QPS: bounded by seat capacity, not waiting-room size. ~10–20 payments/sec sustained, well within any payment gateway.

The whole system is shaped by one ratio: arrival rate ÷ throughput capacity, which sets how deep the queue gets. Everything we build exists to keep that ratio from blowing up the actual seat-allocation engine.
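The estimates above can be re-derived in a few lines. This is only a back-of-envelope check; every input is taken from the bullets above, and the choice of 10 cycles per seat is the low end of the stated 10–12 range.

```python
# Back-of-envelope check of the flash-sale estimates; all inputs come from the text.
WAITING_ROOM_USERS = 10_000_000
ARRIVAL_WINDOW_S = 5 * 60      # users arrive within ~5 minutes
SEATS = 50_000
CYCLES_PER_SEAT = 10           # low end of the 10-12 hold cycles per seat
FIRST_HOUR_S = 3600

arrival_rps = WAITING_ROOM_USERS / ARRIVAL_WINDOW_S
reservation_attempts_per_s = SEATS * CYCLES_PER_SEAT / FIRST_HOUR_S
users_per_seat = WAITING_ROOM_USERS / SEATS

print(f"arrival rate:         {arrival_rps:,.0f} req/s")
print(f"reservation attempts: {reservation_attempts_per_s:,.0f} /s")
print(f"users per seat:       {users_per_seat:,.0f}")
```

The gap between the first two numbers (tens of thousands of arrivals per second against roughly 140 reservation attempts per second) is what the waiting room exists to absorb.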

Step 3 — System Interface

GET /events/:id
  Returns: event metadata (name, venue, dates, sections)

GET /events/:id/seatmap
  Returns: section layouts, availability summary (count, not per-seat; too hot)

POST /queue/join
  Body: { event_id, user_id }
  Returns: { queue_token, position_estimate, eta_seconds }

GET /queue/status?token=...
  Returns: { position, eta_seconds, ready: false }
  On readiness: { ready: true, session_token (expires in 8 min) }

POST /reserve
  Header: session_token
  Body: { event_id, seat_ids: [...] }
  Returns: { hold_id, expires_at } or { conflict: [seat_id, ...] }

POST /cart/:hold_id/checkout
  Body: { payment_token, ... }
  Returns: { order_id } or { hold_expired: true }

DELETE /cart/:hold_id (explicit release)

The waiting room is a strict gate: without session_token, /reserve returns 403. The session token is a short-lived JWT signed by the queue service.
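As a rough illustration of that gate, here is a minimal sketch of issuing and checking an HS256-signed token using only the Python standard library. The claim names (`sub`, `evt`, `exp`), the shared `SECRET`, and the function names are assumptions made for the example; the source only says the token is a short-lived JWT signed by the queue service.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"   # assumption: shared HMAC key, for illustration only
SESSION_TTL_S = 8 * 60         # session token expires in 8 minutes

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def _sign(signing_input: str) -> str:
    return _b64(hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest())

def issue_session_token(user_id, event_id, now=None):
    """Queue service side: mint a token once the user reaches the front."""
    now = int(time.time()) if now is None else int(now)
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "evt": event_id, "exp": now + SESSION_TTL_S}
    payload = _b64(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(header + '.' + payload)}"

def authorize_reserve(token, event_id, now=None):
    """Gate in front of /reserve: returns 200 to proceed, 403 otherwise."""
    now = int(time.time()) if now is None else int(now)
    try:
        header, payload, sig = token.split(".")
        if not hmac.compare_digest(sig, _sign(header + "." + payload)):
            return 403
        claims = json.loads(_unb64(payload))
    except (ValueError, AttributeError, TypeError):
        return 403             # missing, malformed, or undecodable token
    if claims.get("evt") != event_id or claims.get("exp", 0) <= now:
        return 403             # wrong event or expired session
    return 200
```

Because verification needs only the key, the application tier can reject unadmitted traffic without calling back into the queue service.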

Step 4 — High-Level Design

[10M users]
     │
     ▼
CDN edge (rate-limit, bot scoring) ──→ static seatmap, event metadata
     │
     ▼
waiting room service ──→ admission counter (Redis, sliding window)
     │                        │
     │                        ▼
     │                 session token issued
     ▼
application tier (stateless) ──→ reservation service ──→ inventory shard (per event)
     │                                                        │
     ├──→ hold store (Redis)                                  └─ durable inventory DB
     │    (TTL-based)
     ▼
payment service ──→ external PSP
     │
     ▼
order service ──→ ticket issuance, email, wallet pass

The critical isolation: each event gets its own inventory shard. A Taylor Swift sale doesn’t share infrastructure with a small-club show. Hot events are pinned to dedicated capacity; cold events share a pool. This is the only way to bound blast radius — a single hot event can fully consume its shard without affecting the rest of the catalog.

Step 5 — Data Model

Events (read-heavy, mostly static):

table events
  event_id      uuid PK
  artist        string
  venue_id      uuid
  starts_at     timestamp
  on_sale_at    timestamp
  status        enum(upcoming, on_sale, sold_out, ended)
  seat_map_url  string       // static asset on CDN

Inventory (sharded by event_id, partitioned within event by section):

table seats
  event_id      uuid PK
  seat_id       string CK    // e.g. "A-12-15"
  section       string
  row           string
  number        int
  price_tier    string
  status        enum(available, held, sold)
  hold_id       uuid?
  held_until    timestamp?
  sold_to       uuid?
  sold_at       timestamp?

A single row per seat. The status transition is the consistency-critical operation. The held_until field is what enables TTL-based hold expiry.

Holds (Redis primary, durable log secondary):

hold:{hold_id} → { user_id, event_id, seat_ids, expires_at } TTL 8 min
user_active_hold:{user_id} → hold_id (enforce one hold per user)

Waiting room (Redis sorted set per event):

queue:{event_id} → ZSET { user_id : enqueue_timestamp }
queue:{event_id}:counter → monotonic admission counter

Orders (durable, append-only):

table orders
  order_id      uuid PK
  user_id       uuid
  event_id      uuid
  seat_ids      list
  total         money
  payment_ref   string
  status        enum(pending, paid, failed, refunded)
  created_at    timestamp

Step 6 — Detailed Design

The waiting room

When on_sale_at - 30 min arrives, the system opens the queue. Each user submits to /queue/join; the service appends to a sorted set with their request timestamp. They get a queue_token (an opaque, signed identifier of their position).

The waiting room admits users to the active shopping path at a controlled rate. The admission rate is not a constant; it’s a function of current cart-flow throughput:

admission_rate = (seat_availability_estimate / avg_session_minutes) × safety_factor

For a 50 K-seat sale with 5-minute average sessions, the baseline is 50 K / 5 min = ~10 K admissions per minute, or ~167/sec. The safety factor (~0.7) trims that to roughly 117/sec and keeps us from over-admitting and creating frustration when users arrive at an already-empty seatmap.
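Evaluating the admission formula literally gives a feel for how the rate moves as inventory drains. The function name and the per-minute unit are choices made for this sketch; the inputs (50 K seats, 5-minute sessions, a 0.7 safety factor) come from the text.

```python
def admission_rate_per_min(seat_availability_estimate, avg_session_minutes, safety_factor=0.7):
    """admission_rate = (seat_availability_estimate / avg_session_minutes) * safety_factor"""
    if avg_session_minutes <= 0:
        raise ValueError("avg_session_minutes must be positive")
    return seat_availability_estimate / avg_session_minutes * safety_factor

# Recompute as the sale progresses: fewer seats left means slower admission.
for seats_left in (50_000, 25_000, 5_000):
    per_min = admission_rate_per_min(seats_left, avg_session_minutes=5)
    print(f"{seats_left:>6} seats left -> {per_min:,.0f} admissions/min ({per_min / 60:,.1f}/sec)")
```

Re-evaluating the rate on a short interval is what makes it a function of current throughput rather than a constant.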

Position display is intentionally approximate. We never tell a user “you are #4 837 261 in line” — that’s information warfare. We tell them “your wait is approximately 12 minutes” with a coarse-grained bucket, and we always under-promise a bit. The exact position is derivable from the sorted set but never surfaced raw.
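One way to turn a raw position into a coarse, slightly pessimistic estimate is to pad the raw wait and round it up to a bucket boundary. The 5-minute bucket and the 1.2 padding factor below are illustrative assumptions, not values from the source.

```python
import math

def displayed_eta_minutes(position, admissions_per_sec, bucket_min=5, padding=1.2):
    """Coarse ETA for the waiting-room page.

    bucket_min and padding are hypothetical tuning knobs: the wait is padded
    (under-promise) and rounded up to the next bucket (never show raw position).
    """
    if admissions_per_sec <= 0:
        raise ValueError("admissions_per_sec must be positive")
    raw_minutes = position / admissions_per_sec / 60
    padded = raw_minutes * padding
    return max(bucket_min, math.ceil(padded / bucket_min) * bucket_min)

print(displayed_eta_minutes(70_000, admissions_per_sec=117))  # padded and bucketed upward
```

Two users a few hundred places apart see the same number, which leaks much less about queue internals than an exact rank.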

  • Hard FIFO queue: strict ordering by enqueue timestamp. Fair, deterministic, easy to explain. Subject to bot priming: bots that hit the join endpoint at millisecond zero gain a guaranteed first-N slot.
  • Stochastic admission: admit from a window of “early” users randomly. Less explicit fairness, but bot advantage drops sharply because being first doesn’t help if entry is sampled.

We use a hybrid: bucket by 5-second tranches and sample within bucket.
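The hybrid can be sketched as: group users by 5-second enqueue tranche, keep tranches in time order, and shuffle within each tranche. The function name and the `(user_id, enqueue_timestamp)` input shape are assumptions for this example.

```python
import random
from collections import defaultdict

TRANCHE_S = 5  # tranche width from the text

def admission_order(enqueued, seed=None):
    """enqueued: iterable of (user_id, enqueue_timestamp_seconds).

    Returns user_ids in admission order: earlier tranches first,
    random order inside each tranche.
    """
    rng = random.Random(seed)
    tranches = defaultdict(list)
    for user_id, ts in enqueued:
        tranches[int(ts // TRANCHE_S)].append(user_id)
    order = []
    for tranche in sorted(tranches):
        members = tranches[tranche]
        rng.shuffle(members)       # arriving at millisecond zero buys nothing within the tranche
        order.extend(members)
    return order

arrivals = [("bot", 0.001), ("fan1", 1.2), ("fan2", 4.9), ("late", 5.1), ("later", 12.0)]
print(admission_order(arrivals))
```

Here the bot that joined at 1 ms competes on equal terms with fans who joined up to 5 seconds later, while someone who joined after that boundary still waits behind the whole first tranche.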

Atomic seat reservation

The reservation service receives POST /reserve { seat_ids: ['A-12-15', 'A-12-16'] } with a valid session token. It must atomically transition each seat from available to held or return a conflict listing the seats that were already gone.

Two-phase implementation:

  1. Optimistic phase (Redis): a Lua script that checks seat:{event}:{seat_id}:status, sets to held, sets held_until = now + 480 s, sets hold_id. The script runs atomically across all seats requested; if any fails, none change.
  2. Durable phase (async): the hold is journaled to the inventory DB within ~200 ms. If the durable write fails, the Redis state is rolled back via a compensating Lua script.

Why Redis first: the Lua script runs in microseconds on a single instance and handles thousands of attempts/sec on one core. The DB write is durability insurance, not the critical path.

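-- KEYS = seat:{event}:A-12-15, seat:{event}:A-12-16
-- ARGV = hold_id, user_id, ttl_seconds
-- Redis 5+ replicates script effects, so reading TIME before writing is safe.
local now = tonumber(redis.call('TIME')[1])   -- server clock, whole seconds
local held_until = now + tonumber(ARGV[3])

-- Pass 1: collect every requested seat that is no longer available.
local conflicts = {}
for _, k in ipairs(KEYS) do
  if redis.call('HGET', k, 'status') ~= 'available' then
    conflicts[#conflicts + 1] = k
  end
end
if #conflicts > 0 then
  return { 'conflict', unpack(conflicts) }    -- nothing was written, nothing to roll back
end

-- Pass 2: every seat is free, so mark them all held. The script is atomic;
-- no other command can run between the two passes.
for _, k in ipairs(KEYS) do
  redis.call('HSET', k, 'status', 'held', 'hold_id', ARGV[1], 'held_until', held_until)
end
return 'ok'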

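The transition available → held is single-writer on the Redis primary. Redis runs a script to completion before serving any other command on that shard, so no other client can observe or modify the seats mid-script. Two requests for the same seat are therefore serialized: the first wins and the second sees a conflict. A double-allocation can’t happen.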
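The all-or-nothing contract can be modeled in memory to see the serialization argument at work. This is not Redis code: `SeatShard` is a hypothetical stand-in in which a lock plays the role of Redis's single command thread, and `race` simply fires many buyers at the same pair of seats.

```python
import threading

class SeatShard:
    """In-memory stand-in for one inventory shard (assumption for illustration)."""

    def __init__(self, seat_ids):
        self._lock = threading.Lock()   # models Redis executing one script at a time
        self.seats = {s: {"status": "available"} for s in seat_ids}

    def reserve(self, seat_ids, hold_id, now, ttl_s=480):
        with self._lock:
            conflicts = [s for s in seat_ids if self.seats[s]["status"] != "available"]
            if conflicts:
                return {"conflict": conflicts}       # nothing changed
            for s in seat_ids:
                self.seats[s] = {"status": "held", "hold_id": hold_id,
                                 "held_until": now + ttl_s}
            return {"hold_id": hold_id, "expires_at": now + ttl_s}

def race(n_buyers=50):
    """Fire n_buyers concurrent attempts at the same two seats."""
    shard = SeatShard(["A-12-15", "A-12-16"])
    results = []

    def buyer(i):
        results.append(shard.reserve(["A-12-15", "A-12-16"], f"hold-{i}", now=0))

    threads = [threading.Thread(target=buyer, args=(i,)) for i in range(n_buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shard, results

shard, results = race()
winners = [r for r in results if "hold_id" in r]
print(len(winners), "winner;", len(results) - len(winners), "conflicts")
```

However the threads interleave, exactly one attempt gets both seats and every other attempt receives the full conflict list.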

Inventory sharding

A 50 K-seat venue with 33 K rps arriving simultaneously is too hot for a single Redis instance. We shard the inventory inside the event by section:

shard 1: sections A-G (10 K seats)
shard 2: sections H-N (10 K seats)
shard 3: sections O-T (10 K seats)
shard 4: sections U-Z (10 K seats)
shard 5: sections AA-GG (10 K seats)

Each shard handles ~7 K rps independently. Reservations that span sections on different shards (rare, since most users pick within a section) require a two-shard transaction; we use a saga pattern with explicit rollback rather than 2PC.
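A minimal sketch of the section-to-shard mapping and the saga with explicit rollback follows. The `Shard` class is a toy in-memory stand-in, and parsing the section from the first `-`-separated field of a seat ID such as "A-12-15" is an assumption based on the example ID in the data model.

```python
SHARD_RANGES = [("A", "G"), ("H", "N"), ("O", "T"), ("U", "Z"), ("AA", "GG")]

def shard_for_section(section):
    """Map a section label to its shard number (1-based), per the layout above."""
    for idx, (lo, hi) in enumerate(SHARD_RANGES, start=1):
        if len(section) == len(lo) and lo <= section <= hi:
            return idx
    raise ValueError(f"unknown section: {section!r}")

class Shard:
    """Toy shard: seat_id -> 'available' or the hold_id that owns it."""
    def __init__(self, seat_ids):
        self.status = {s: "available" for s in seat_ids}

    def reserve(self, seat_ids, hold_id):
        if any(self.status[s] != "available" for s in seat_ids):
            return False
        for s in seat_ids:
            self.status[s] = hold_id
        return True

    def release(self, seat_ids, hold_id):
        for s in seat_ids:
            if self.status[s] == hold_id:      # only undo our own hold
                self.status[s] = "available"

def reserve_across_shards(shards, seat_ids, hold_id):
    """Saga: reserve shard by shard; on any failure, compensate the earlier steps."""
    by_shard = {}
    for seat in seat_ids:
        by_shard.setdefault(shard_for_section(seat.split("-")[0]), []).append(seat)
    done = []
    for idx in sorted(by_shard):
        if shards[idx].reserve(by_shard[idx], hold_id):
            done.append(idx)
        else:
            for prev in reversed(done):        # explicit rollback, newest first
                shards[prev].release(by_shard[prev], hold_id)
            return False
    return True
```

Unlike 2PC, the seats on the first shard are briefly visible as held before the second shard answers; the compensation step is what restores them if the saga fails.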

Hold expiry

A hold has a held_until field set at reservation time. Three mechanisms enforce it:

  1. Redis TTL on the hold:{hold_id} key — automatically gone after 8 min.
  2. Scheduled scanner every 30 s checks for status='held' AND held_until < now in the durable inventory and resets to available (covers any Redis evictions).
  3. Read-time validation: any read of a held seat that finds held_until < now lazy-evicts the hold.

A reconciliation job between Redis and durable storage runs every minute to repair any drift.
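The scanner pass and the read-time check share the same rule: a seat that is held with held_until in the past goes back to available. Below is a small sketch over an in-memory dict of seat rows; the function names are hypothetical and the dict stands in for the durable inventory table.

```python
def _release(seat):
    seat.update(status="available", hold_id=None, held_until=None)

def expire_holds(seats, now):
    """Scanner pass: reset every expired hold. Returns the released seat_ids."""
    released = []
    for seat_id, seat in seats.items():
        if seat["status"] == "held" and seat["held_until"] < now:
            _release(seat)
            released.append(seat_id)
    return released

def read_seat(seats, seat_id, now):
    """Read-time validation: lazily evict an expired hold before returning the row."""
    seat = seats[seat_id]
    if seat["status"] == "held" and seat["held_until"] < now:
        _release(seat)
    return seat

seats = {
    "A-1-1": {"status": "held", "hold_id": "h1", "held_until": 100},
    "A-1-2": {"status": "held", "hold_id": "h2", "held_until": 900},
    "A-1-3": {"status": "sold", "hold_id": None, "held_until": None},
}
print(expire_holds(seats, now=500))   # only the hold that expired at t=100 is released
```

Sold seats are never touched by either path, so a late scanner run cannot undo a completed purchase.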

Checkout

Once the user submits payment, the hold transitions to sold. This must be irreversible from the user’s perspective: a payment confirmation followed by “actually, your seats are gone” is the worst UX outcome.

POST /cart/:hold_id/checkout
  load hold from Redis (must exist, not expired)
  call payment service (synchronous, idempotent on hold_id)
  on payment.success:
    Lua: for each seat in hold, transition held→sold (idempotent on hold_id)
    durable write: insert order row, update seats
    emit ticket-issuance event
    return order_id
  on payment.failure:
    keep hold alive (let user retry); or release on user action

The payment call is the longest piece of the checkout path (~1 s). It’s idempotent on hold_id so client retries don’t double-charge. If the payment succeeds but our subsequent durable write fails (rare but possible), we have a refund-safety job that reconciles payment-events against order-events nightly.
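The idempotency property can be illustrated with a toy flow in which the hold_id doubles as the payment idempotency key. `FakePSP` and the dict-based stores are stand-ins invented for this sketch; a real PSP client and its idempotency mechanism will differ.

```python
class FakePSP:
    """Toy payment provider that deduplicates on an idempotency key (assumption)."""
    def __init__(self):
        self.charges = {}   # idempotency_key -> charge record
        self.calls = 0

    def charge(self, idempotency_key, amount_cents):
        self.calls += 1
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "ref": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]

def checkout(holds, seats, orders, psp, hold_id, amount_cents, now):
    if hold_id in orders:                              # replayed request: same answer
        return {"order_id": orders[hold_id]["order_id"]}
    hold = holds.get(hold_id)
    if hold is None or hold["expires_at"] <= now:
        return {"hold_expired": True}
    charge = psp.charge(idempotency_key=hold_id, amount_cents=amount_cents)
    for seat_id in hold["seat_ids"]:
        if seats[seat_id].get("hold_id") == hold_id:   # held -> sold, safe to repeat
            seats[seat_id]["status"] = "sold"
    orders[hold_id] = {"order_id": f"ord_{hold_id}",
                       "payment_ref": charge["ref"], "status": "paid"}
    return {"order_id": orders[hold_id]["order_id"]}
```

A client that retries after a timeout, or a server that crashed between the charge and the order write, ends up with one charge and one order for the hold.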

Bot defenses

This is the arms race that makes Ticketmaster Ticketmaster.

  • CDN-level rate limiting keyed on IP and on browser-fingerprint. The first wave of bots dies here; sophisticated ones use residential-IP rotation and pass.
  • Proof-of-work or invisible CAPTCHA at queue-join time. Adds 1–3 s of CPU work the bot must spend per request. Real users barely notice; bots burn meaningful compute.
  • Behavior scoring: mouse-movement entropy, time-on-page before clicking, prior-event history. Low-score sessions face heavier challenges (visible CAPTCHA, longer admission delay).
  • Account aging: an account created 30 minutes ago that joins 15 different sale queues has a low score by definition.
  • Per-account purchase limits: 4 tickets per account per event, with backend dedup across payment cards / shipping addresses to catch obvious sybils.

None of these stop a determined adversary; all of them raise the cost enough that the bot economics get marginal. The honest framing is: we make scalping less profitable, not impossible.
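To make the proof-of-work item concrete, here is a hashcash-style sketch: the client must find a nonce whose SHA-256 hash has a required number of leading zero bits, while the server verifies with a single hash. The challenge format and the difficulty value are assumptions; a real deployment would tune difficulty toward the 1–3 s of client CPU mentioned above.

```python
import hashlib
import itertools

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits in a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def _digest(challenge: str, nonce: int) -> bytes:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()

def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce. Expected work is about 2**difficulty_bits hashes."""
    for nonce in itertools.count():
        if leading_zero_bits(_digest(challenge, nonce)) >= difficulty_bits:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash, regardless of how long the client searched."""
    return leading_zero_bits(_digest(challenge, nonce)) >= difficulty_bits

# Small difficulty so the demo finishes quickly.
nonce = solve("queue-join:user-123:event-456", difficulty_bits=12)
print(nonce, verify("queue-join:user-123:event-456", nonce, 12))
```

The asymmetry is the point: each extra difficulty bit doubles the solver's expected cost while verification stays at one hash, so the server can raise the price for low-score sessions cheaply.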

Latency budget

queue admission → /reserve: sub-100 ms (cached session token)
/reserve roundtrip:
  edge / TLS:                        30 ms
  application tier hop:               5 ms
  reservation service → Redis Lua:    2 ms
  durable journal (async):           not on critical path
  response back:                     30 ms
  total:                             ~70–120 ms p99

Far inside the 2 s budget; the spare time is what swallows network jitter on slow mobile connections.

Step 7 — Evaluation & Trade-offs

Bottleneck #1: the hot inventory shard. A single section is a contention hotspot when its first row is the most-coveted real estate. Even with intra-event sharding, “section A row 1” is a single key. Mitigations: lazy availability display (don’t tell users a specific row is available; tell them “front section available”) so 100 K users don’t all hammer the same key on the same millisecond. The reservation service serializes contention but throughput is bounded by Redis single-thread perf (~100 K ops/sec per primary, headroom from there with multiple shards).

Bottleneck #2: payment-gateway throughput. External PSPs (Stripe, Adyen, regional processors) have per-merchant rate limits in the low thousands per second. A sold-out 50 K-seat sale completes in ~30 min, so at most ~28 payments/sec even at one seat per order, and fewer when orders carry several seats. That is comfortable. A truly extreme event (multi-venue tour going on sale globally) can exceed PSP limits; we negotiate burst quotas in advance and have a payment-queue fallback that delays charging while keeping the hold alive.

Bottleneck #3: waiting-room state size. 10 M ZSET entries per event in Redis is fine; 50 simultaneous mega-events with 10 M each is 500 M entries. We allocate dedicated queue-cluster capacity per major event rather than mixing.

Bottleneck #4: bot adversaries. Not a technical bottleneck so much as an economic one. Every defensive measure (CAPTCHA, IP scoring, account-aging) trades user friction for bot resistance. The honest evaluation: we will never win the arms race outright. The goal is making the bot economics worse than ticket prices, so any individual scalper’s margin is thin. Verified-fan presales (allowlists of known accounts) are the only mechanism that genuinely changes the shape, at the cost of audit complexity.

Alternative I’d push back on: real-time per-seat availability rendered on the seatmap during a flash sale. Customers ask for it; UX designers prototype it. At 10 M concurrent viewers on the same seatmap, the broadcast cost (every seat-status change pushed to every viewer) is unbounded. The hybrid we use — show aggregate availability (“12 left in Section A”) and reveal individual seats only after the user enters the active selection flow — is the right trade. It frustrates a small fraction of power-users; the majority get a faster, more reliable experience.

What breaks first at 10× scale (a 500 K-seat global tour going on sale in one moment): the waiting-room admission service. The current per-event sorted set hits memory and write-throughput limits beyond ~50 M entries. The fix is to shard the queue itself by region and run a distributed admission protocol — but that complicates the fairness story (does a US user have any chance against an EU user with the same enqueue time?). The product-policy answer probably becomes “regional allocations” rather than “global free-for-all.”

Companies this resembles

Ticketmaster (the canonical), AXS, See Tickets, DICE, and Eventbrite (lighter end of the same shape). The queue technology also overlaps heavily with high-demand product launches (Supreme, Nintendo Switch restocks, GPU drops) and IPO retail subscription windows.

  • Rate Limiter — the front-line defense layer for the queue-join endpoint.
  • Distributed Cache — Redis is the linchpin of both the holds plane and the waiting room.
  • Payment System — checkout depends on this design directly; idempotency and reconciliation are co-designed.