Ticketmaster (Flash Sale) — System Design

Step 1 — Clarify Requirements

Functional

A user browses upcoming events; picks an event; sees a seat map; selects seats; checks out within a hold window; pays; receives a ticket.
A seat can be sold exactly once. Ever. Double-booking is the worst possible failure.
An on-sale moment (“Taylor Swift Eras Tour, sale at 10:00 ET sharp”) collapses a year of event browsing into ~10 minutes of stampede.
Waiting room: between “queue opens” and “you get a turn”, show me my position and an honest ETA.
Hold: once I have seats in my cart, they’re mine for 8 minutes; if I don’t pay, they return to the pool.
Out of scope: dynamic pricing, secondary-market resale, accessibility-seat handling rules, event-day check-in scanning.

Non-functional

99.99% availability for the browse path year-round.
99.9% acceptable for the on-sale path during the spike — degradation is OK as long as no double-bookings.
p99 “click → hold confirmed” under 2 s during peak. The user is staring at a spinner; any longer than that and cart abandonment gets worse.
Single biggest sale we plan for: 10 M users in the waiting room, 50 K seats available. The mismatch is the whole problem.
Strong consistency on seat ownership. Weak consistency on every other axis (waiting-room position, displayed inventory) is fine.

Step 2 — Capacity Estimation

A “normal” day:

~50 K events listed, ~1 M tickets purchased/day → ~12 purchases/sec average, peaks of a few hundred per second on a Friday afternoon sale.

A flash sale (the design driver):

Waiting room population: 10 M users in queue for a 50 K-seat tour leg.
Arrival rate: those 10 M show up within ~5 minutes (waiting-room “doors open” + a coordinated email blast + push notification). That is 10 M / 300 s ≈ 33 K requests/sec sustained for the first 5 minutes, with sharp peaks at the second of doors-opening.
Active hold rate: at most 50 K simultaneous holds (one per seat). The remaining 9.95 M users are in queue, not actively reserving.
Reservation attempts per second (the actual contention): each hold cycle is roughly 5 minutes (3-minute payment + abandonment + re-queue). Through a sale’s first hour we cycle each seat ~10–12 times before it sticks. That’s ~50 K × 10 / 3600 ≈ 140 reservation attempts/sec, a modest rate, but every attempt touches the same hot inventory.
Storage: a 50 K-seat venue has ~50 K seat rows. Even a year of tours is under 100 M rows in the inventory store. Tiny.
Payment QPS: bounded by seat capacity, not waiting-room size. ~10–20 payments/sec sustained, well within any payment gateway.

The whole system is shaped by one ratio: arrival rate ÷ throughput capacity. Whenever that ratio exceeds 1, the excess accumulates as queue depth. Everything we build exists to keep that backlog in the waiting room and away from the actual seat-allocation engine.
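
The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the variable names are made up for illustration, and the payment figure assumes one payment per seat and a sell-out in about an hour.

```python
# Back-of-envelope numbers for the flash-sale scenario described above.
WAITING_ROOM_USERS = 10_000_000
SEATS = 50_000
ARRIVAL_WINDOW_S = 5 * 60      # users arrive within ~5 minutes
CYCLES_PER_SEAT = 10           # each seat is held ~10 times before it sells
FIRST_HOUR_S = 3600

arrival_rps = WAITING_ROOM_USERS / ARRIVAL_WINDOW_S
reservation_attempts_per_s = SEATS * CYCLES_PER_SEAT / FIRST_HOUR_S
# Assumption: one payment per seat, sale sells out in about an hour.
payments_per_s = SEATS / FIRST_HOUR_S
users_per_seat = WAITING_ROOM_USERS / SEATS
# How many times faster users arrive than seats can be contended for.
arrival_to_reservation_ratio = arrival_rps / reservation_attempts_per_s

if __name__ == "__main__":
    print(f"arrival rate:          {arrival_rps:,.0f} req/s")
    print(f"reservation attempts:  {reservation_attempts_per_s:,.0f} /s")
    print(f"payments:              {payments_per_s:,.1f} /s")
    print(f"users per seat:        {users_per_seat:,.0f}")
    print(f"arrival / reservation: {arrival_to_reservation_ratio:,.0f}x")
```

The gap between the first two figures (roughly 240×) is what the waiting room has to absorb.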

Step 3 — System Interface

GET   /events/:id
        Returns: event metadata (name, venue, dates, sections)

GET   /events/:id/seatmap
        Returns: section layouts, availability summary (count, not per-seat — too hot)

POST  /queue/join
        Body: { event_id, user_id }
        Returns: { queue_token, position_estimate, eta_seconds }

GET   /queue/status?token=...
        Returns: { position, eta_seconds, ready: false }
        On readiness: { ready: true, session_token (expires in 8 min) }

POST  /reserve
        Header: session_token
        Body: { event_id, seat_ids: [...] }
        Returns: { hold_id, expires_at }   or   { conflict: [seat_id, ...] }

POST  /cart/:hold_id/checkout
        Body: { payment_token, ... }
        Returns: { order_id }   or   { hold_expired: true }

DELETE /cart/:hold_id           (explicit release)

The waiting room is a strict gate: without session_token, /reserve returns 403. The session token is a short-lived JWT signed by the queue service.
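
A minimal sketch of that gate, using only the Python standard library to build an HS256-style signed token. The claim names (`sub`, `evt`, `exp`), the binding of the token to one event, the secret, and the function names are all assumptions for illustration; a production service would more likely use a maintained JWT library and a managed key.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"   # hypothetical; a real key would come from a secret store
SESSION_TTL_S = 8 * 60         # session token lifetime from the interface above


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    return _b64(hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest())


def issue_session_token(user_id, event_id, now=None):
    """Called by the queue service when a user's turn comes up."""
    now = int(time.time()) if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "evt": event_id, "exp": now + SESSION_TTL_S}
    payload = _b64(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(header + '.' + payload)}"


def authorize_reserve(token, event_id, now=None):
    """Return the HTTP status /reserve would answer with: 200 or 403."""
    now = int(time.time()) if now is None else now
    try:
        header, payload, sig = token.split(".")
        if not hmac.compare_digest(sig, _sign(header + "." + payload)):
            return 403                      # signature mismatch
        claims = json.loads(_unb64(payload))
    except (ValueError, AttributeError, TypeError):
        return 403                          # missing or malformed token
    if claims.get("evt") != event_id or claims.get("exp", 0) <= now:
        return 403                          # wrong event or expired
    return 200
```

Because the token is self-verifying, the reservation path does not need a round trip to the queue service on every request.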

Step 4 — High-Level Design

   [10M users]
        │
        ▼
   CDN edge (rate-limit, bot scoring) ──→ static seatmap, event metadata
        │
        ▼
   waiting room service ──→ admission counter (Redis, sliding window)
        │                              │
        │                              ▼
        │                       session token issued
        ▼
   application tier (stateless) ──→ reservation service ──→ inventory shard (per event)
                                         │                          │
                                         ├──→ hold store (Redis)    └─ durable inventory DB
                                         │     (TTL-based)
                                         ▼
                                   payment service ──→ external PSP
                                         │
                                         ▼
                                   order service ──→ ticket issuance, email, wallet pass

The critical isolation: each event gets its own inventory shard. A Taylor Swift sale doesn’t share infrastructure with a small-club show. Hot events are pinned to dedicated capacity; cold events share a pool. This is the only way to bound blast radius — a single hot event can fully consume its shard without affecting the rest of the catalog.

Step 5 — Data Model

Events (read-heavy, mostly static):

table events
  event_id     uuid     PK
  artist       string
  venue_id     uuid
  starts_at    timestamp
  on_sale_at   timestamp
  status       enum(upcoming, on_sale, sold_out, ended)
  seat_map_url string             // static asset on CDN

Inventory (sharded by event_id, partitioned within event by section):

table seats
  event_id     uuid     PK
  seat_id      string   CK         // e.g. "A-12-15"
  section      string
  row          string
  number       int
  price_tier   string
  status       enum(available, held, sold)
  hold_id      uuid?
  held_until   timestamp?
  sold_to      uuid?
  sold_at      timestamp?

A single row per seat. The status transition is the consistency-critical operation. The held_until field is what enables TTL-based hold expiry.
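
One way to make the status transition explicit is a small state machine. The sketch below assumes the only legal moves are the three implied by this design (reserve, release or expire, and checkout) and that `sold` is terminal; the names are illustrative.

```python
from enum import Enum


class SeatStatus(Enum):
    AVAILABLE = "available"
    HELD = "held"
    SOLD = "sold"


# Assumed transition table; 'sold' has no outgoing edge because a seat sells once.
ALLOWED = {
    (SeatStatus.AVAILABLE, SeatStatus.HELD),   # reserve
    (SeatStatus.HELD, SeatStatus.AVAILABLE),   # hold expired or was released
    (SeatStatus.HELD, SeatStatus.SOLD),        # checkout succeeded
}


def transition(current: SeatStatus, target: SeatStatus) -> SeatStatus:
    """Return the new status, or raise if the move is not in the table."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal seat transition {current.value} -> {target.value}")
    return target
```

Rejecting `available → sold` outright means a seat can only be sold through a hold, which keeps checkout tied to a specific `hold_id`.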

Holds (Redis primary, durable log secondary):

hold:{hold_id}              → { user_id, event_id, seat_ids, expires_at }     TTL 8 min
user_active_hold:{user_id}  → hold_id   (enforce one hold per user)

Waiting room (Redis sorted set per event):

queue:{event_id}            → ZSET { user_id : enqueue_timestamp }
queue:{event_id}:counter    → monotonic admission counter

Orders (durable, append-only):

table orders
  order_id     uuid     PK
  user_id      uuid
  event_id     uuid
  seat_ids     list
  total        money
  payment_ref  string
  status       enum(pending, paid, failed, refunded)
  created_at   timestamp

Step 6 — Detailed Design

The waiting room

When on_sale_at - 30 min arrives, the system opens the queue. Each user submits to /queue/join; the service appends to a sorted set with their request timestamp. They get a queue_token (an opaque, signed identifier of their position).

The waiting room admits users to the active shopping path at a controlled rate. The admission rate is not a constant; it’s a function of current cart-flow throughput:

admission_rate = (seat_availability_estimate / avg_session_minutes) × safety_factor

For a 50 K-seat sale with 5-minute average sessions, the baseline is 50 K / 5 min = ~10 K admissions per minute, or ~167/sec; applying the safety factor (~0.7) brings that to roughly 117 admissions/sec. The rate falls as the availability estimate shrinks. The safety factor keeps us from over-admitting and creating frustration when users arrive at an already-empty seatmap.
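
A minimal sketch of the formula, assuming it is re-evaluated periodically as the availability estimate changes. The function name and the per-second conversion are illustrative choices.

```python
def admission_rate_per_sec(seats_available, avg_session_minutes, safety_factor=0.7):
    """Users to admit per second: (seats / session minutes) x safety factor, per second."""
    if avg_session_minutes <= 0:
        raise ValueError("avg_session_minutes must be positive")
    per_minute = seats_available / avg_session_minutes * safety_factor
    return per_minute / 60.0


# Full inventory at doors-open: 50 K seats, 5-minute sessions, safety factor 0.7.
rate_at_open = admission_rate_per_sec(50_000, 5)
# Late in the sale, with a tenth of the seats still unsold.
rate_near_end = admission_rate_per_sec(5_000, 5)

if __name__ == "__main__":
    print(f"at open:  {rate_at_open:.0f} admissions/sec")
    print(f"near end: {rate_near_end:.0f} admissions/sec")
```

With these inputs the formula evaluates to about 117 admissions/sec at open and about 12/sec when 10% of seats remain.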

Position display is intentionally approximate. We never tell a user “you are #4 837 261 in line” — that’s information warfare. We tell them “your wait is approximately 12 minutes” with a coarse-grained bucket, and we always under-promise a bit. The exact position is derivable from the sorted set but never surfaced raw.
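
A sketch of how an exact position could be turned into a coarse, padded message. The bucket boundaries and the 20% padding are hypothetical values chosen for illustration, not figures from this design.

```python
BUCKET_MINUTES = (2, 5, 10, 15, 30, 60, 120)   # hypothetical display buckets
PADDING = 1.2                                   # hypothetical: overstate the wait by 20%


def displayed_wait(position, admissions_per_sec):
    """Map an exact queue position to a coarse wait message that errs on the long side."""
    if admissions_per_sec <= 0:
        return "wait time unavailable"
    raw_minutes = position / admissions_per_sec / 60.0
    padded = raw_minutes * PADDING
    for bucket in BUCKET_MINUTES:
        if padded <= bucket:
            return f"approximately {bucket} minutes"
    return "more than 2 hours"
```

Rounding up to the next bucket after padding means the displayed wait is never shorter than the raw estimate, so users who get in early are pleasantly surprised rather than misled.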

Two ordering policies are on the table:

Hard FIFO queue: strict ordering by enqueue timestamp. Fair, deterministic, easy to explain. Subject to bot priming: bots that hit the join endpoint at millisecond zero gain a guaranteed first-N slot.

Stochastic admission: admit randomly from a window of “early” users. Less explicit fairness, but bot advantage drops sharply because being first doesn’t help if entry is sampled. We use a hybrid: bucket by 5-second tranches and sample within bucket.
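
The hybrid can be sketched as follows: group joiners into 5-second tranches by enqueue time, keep the tranches in order, and shuffle inside each one. The function signature and the in-memory list of `(user_id, enqueue_ts)` pairs are assumptions; the real queue lives in the Redis sorted set.

```python
import random
from collections import defaultdict

TRANCHE_SECONDS = 5


def admission_order(entries, queue_open_ts, seed=None):
    """entries: iterable of (user_id, enqueue_ts). Returns user_ids in admission order."""
    rng = random.Random(seed)
    tranches = defaultdict(list)
    for user_id, ts in entries:
        tranches[int((ts - queue_open_ts) // TRANCHE_SECONDS)].append(user_id)
    order = []
    for index in sorted(tranches):      # earlier tranches are still admitted first
        members = tranches[index]
        rng.shuffle(members)            # order inside a tranche is random
        order.extend(members)
    return order
```

A bot that joins at millisecond zero lands in the same tranche as a human who joined four seconds later, so it gets no guaranteed head start within that window.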

Atomic seat reservation

The reservation service receives POST /reserve { seat_ids: ['A-12-15', 'A-12-16'] } with a valid session token. It must atomically transition each seat from available to held or return a conflict listing the seats that were already gone.

Two-phase implementation:

Optimistic phase (Redis): a Lua script that checks seat:{event}:{seat_id}:status, sets to held, sets held_until = now + 480 s, sets hold_id. The script runs atomically across all seats requested; if any fails, none change.
Durable phase (async): the hold is journaled to the inventory DB within ~200 ms. If the durable write fails, the Redis state is rolled back via a compensating Lua script.

Why Redis first: the Lua script runs in microseconds on a single instance and handles thousands of attempts/sec on one core. The DB write is durability insurance, not the critical path.

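-- KEYS: seat:{event}:A-12-15, seat:{event}:A-12-16   (all on the same shard)
-- ARGV: hold_id, user_id, ttl_seconds

-- Use the server clock so client clock skew cannot stretch a hold.
-- TIME is non-deterministic; this relies on effects-based script replication.
local now = tonumber(redis.call('TIME')[1])
local held_until = now + tonumber(ARGV[3])

-- Pass 1: collect every requested seat that is not available.
local conflicts = {}
for _, k in ipairs(KEYS) do
  if redis.call('HGET', k, 'status') ~= 'available' then
    conflicts[#conflicts + 1] = k
  end
end
if #conflicts > 0 then
  -- Nothing has been written yet, so there is nothing to roll back.
  return { 'conflict', unpack(conflicts) }
end

-- Pass 2: every seat is available; mark them all held.
for _, k in ipairs(KEYS) do
  redis.call('HSET', k, 'status', 'held', 'hold_id', ARGV[1], 'held_until', held_until)
end
return 'ok'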

The transition available → held is single-writer on the Redis primary. Redis runs each Lua script to completion before executing any other command on that shard, so no other client can observe or modify the seats mid-script, and two requests can never both see the same seat as available. A double-allocation can’t happen.
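
To make the all-or-nothing semantics concrete, here is a small in-memory Python model. It is not the Redis implementation: the `SeatStore` class is hypothetical, and a lock stands in for the atomicity that Redis provides by running a script to completion.

```python
import threading


class SeatStore:
    """In-memory stand-in for one inventory shard; the lock plays the role of script atomicity."""

    def __init__(self, seat_ids):
        self._lock = threading.Lock()
        self._seats = {s: {"status": "available"} for s in seat_ids}

    def reserve(self, seat_ids, hold_id, ttl_seconds, now):
        """Hold every requested seat, or hold none and report the conflicting ones."""
        with self._lock:
            conflicts = [s for s in seat_ids
                         if self._seats.get(s, {}).get("status") != "available"]
            if conflicts:
                return ("conflict", conflicts)      # nothing was modified
            for s in seat_ids:
                self._seats[s] = {"status": "held", "hold_id": hold_id,
                                  "held_until": now + ttl_seconds}
            return ("ok", [])

    def status(self, seat_id):
        return self._seats[seat_id]["status"]
```

If two requests overlap on a single seat, the later one gets a conflict for that seat and its other seats stay available for someone else.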

Inventory sharding

A 50 K-seat venue with 33 K rps arriving simultaneously is too hot for a single Redis instance. We shard the inventory inside the event by section:

shard 1: sections A-G    (10 K seats)
shard 2: sections H-N    (10 K seats)
shard 3: sections O-T    (10 K seats)
shard 4: sections U-Z    (10 K seats)
shard 5: sections AA-GG  (10 K seats)

Each shard handles ~7 K rps independently. Cross-section reservations (rare — most users pick within a section) require a two-shard transaction; we use a saga pattern with explicit rollback rather than 2PC.
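
A sketch of section-to-shard routing and the saga for a cross-shard reservation. The `Shard` class is an in-memory stand-in, and the assumption that a seat id starts with its section label (as in "A-12-15") comes from the example ids in the data model; everything else is illustrative.

```python
def shard_for_section(section: str) -> int:
    """Map a section label to the shard number in the layout above."""
    if len(section) == 1:
        for low, high, shard in (("A", "G", 1), ("H", "N", 2), ("O", "T", 3), ("U", "Z", 4)):
            if low <= section <= high:
                return shard
    elif len(section) == 2 and section[0] == section[1] and "A" <= section[0] <= "G":
        return 5                                    # AA-GG
    raise ValueError(f"unknown section {section!r}")


def group_by_shard(seat_ids):
    """Group seat ids by owning shard, taking the section from the id prefix."""
    groups = {}
    for seat_id in seat_ids:
        section = seat_id.split("-")[0]
        groups.setdefault(shard_for_section(section), []).append(seat_id)
    return groups


class Shard:
    """In-memory stand-in for one inventory shard."""

    def __init__(self, seats):
        self.available = set(seats)
        self.held = {}

    def reserve(self, seats, hold_id):
        if not set(seats) <= self.available:
            return False
        self.available -= set(seats)
        self.held[hold_id] = set(seats)
        return True

    def release(self, hold_id):
        self.available |= self.held.pop(hold_id, set())


def reserve_across_shards(shards, seats_by_shard, hold_id):
    """Saga: reserve shard by shard; on failure, undo the shards already reserved."""
    done = []
    for shard_id in sorted(seats_by_shard):
        if shards[shard_id].reserve(seats_by_shard[shard_id], hold_id):
            done.append(shard_id)
        else:
            for prev in reversed(done):
                shards[prev].release(hold_id)       # compensating action
            return False
    return True
```

Unlike 2PC, no shard blocks waiting for a coordinator; the cost is that seats on an earlier shard are briefly held and then released when a later shard refuses.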

Hold expiry

A hold has a held_until field set at reservation time. Three mechanisms enforce it:

Redis TTL on the hold:{hold_id} key — automatically gone after 8 min.
Scheduled scanner every 30 s checks for status='held' AND held_until < now in the durable inventory and resets to available (covers any Redis evictions).
Read-time validation: any read of a held seat that finds held_until < now lazy-evicts the hold.

The reconciliation job between Redis and durable storage runs every minute to repair any drift.
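
The scanner and the read-time check apply the same rule, which can be sketched over an in-memory dict of seat rows. The dict layout mirrors the seats table; the function names are illustrative.

```python
def _expired(seat, now):
    return seat["status"] == "held" and seat["held_until"] < now


def _release(seat):
    seat.update(status="available", hold_id=None, held_until=None)


def read_seat(seat, now):
    """Read-time validation: a lapsed hold is cleaned up before the status is returned."""
    if _expired(seat, now):
        _release(seat)
    return seat["status"]


def sweep_expired(seats, now):
    """Scheduled scanner: reset every seat whose hold has lapsed; returns the freed seat ids."""
    freed = []
    for seat_id, seat in seats.items():
        if _expired(seat, now):
            _release(seat)
            freed.append(seat_id)
    return freed
```

Because both paths test `held_until < now` and never touch `sold` rows, running them redundantly is harmless: whichever sees the expired hold first frees it, and the other finds nothing to do.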

Checkout

Once the user submits payment, the hold transitions to sold. This must be irreversible from the user’s perspective: a payment confirmation followed by “actually, your seats are gone” is the worst UX outcome.

POST /cart/:hold_id/checkout
  load hold from Redis (must exist, not expired)
  call payment service (synchronous, idempotent on hold_id)
  on payment.success:
    Lua: for each seat in hold, transition held→sold (idempotent on hold_id)
    durable write: insert order row, update seats
    emit ticket-issuance event
    return order_id
  on payment.failure:
    keep hold alive (let user retry); or release on user action

The payment call is the longest piece of the checkout path (~1 s). It’s idempotent on hold_id so client retries don’t double-charge. If the payment succeeds but our subsequent durable write fails (rare but possible), we have a refund-safety job that reconciles payment-events against order-events nightly.
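
A sketch of the checkout flow with idempotency keyed on `hold_id`. `FakePaymentService` is a stand-in for the PSP client, the plain dicts stand in for Redis and the order table, and failure handling is omitted; all names are hypothetical.

```python
class FakePaymentService:
    """Stand-in PSP client: charging the same idempotency key twice returns the first result."""

    def __init__(self):
        self.charges = {}

    def charge(self, idempotency_key, amount):
        if idempotency_key not in self.charges:
            ref = f"pay_{len(self.charges) + 1}"
            self.charges[idempotency_key] = {"ref": ref, "amount": amount}
        return self.charges[idempotency_key]


def checkout(hold_id, holds, seats, orders, payments, amount, now):
    """Charge once per hold, flip held seats to sold, and record one order per hold."""
    if hold_id in orders:                   # retry after success: same order comes back
        return {"order_id": orders[hold_id]["order_id"]}
    hold = holds.get(hold_id)
    if hold is None or hold["expires_at"] <= now:
        return {"hold_expired": True}
    charge = payments.charge(idempotency_key=hold_id, amount=amount)
    for seat_id in hold["seat_ids"]:
        # held -> sold only for seats still owned by this hold (idempotent on hold_id)
        if seats[seat_id].get("hold_id") == hold_id:
            seats[seat_id]["status"] = "sold"
    orders[hold_id] = {"order_id": f"order_{hold_id}", "payment_ref": charge["ref"]}
    return {"order_id": orders[hold_id]["order_id"]}
```

A client that times out and retries gets the same `order_id` back and the card is charged once, which is the property the reconciliation job then only has to verify rather than enforce.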

Bot defenses

This is the arms race that makes Ticketmaster Ticketmaster.

CDN-level rate limiting keyed on IP and on browser-fingerprint. The first wave of bots dies here; sophisticated ones use residential-IP rotation and pass.
Proof-of-work or invisible CAPTCHA at queue-join time. Adds 1–3 s of CPU work the bot must spend per request. Real users barely notice; bots burn meaningful compute.
Behavior scoring: mouse-movement entropy, time-on-page before clicking, prior-event history. Low-score sessions face heavier challenges (visible CAPTCHA, longer admission delay).
Account aging: an account created 30 minutes ago that joins 15 different sale queues has a low score by definition.
Per-account purchase limits: 4 tickets per account per event, with backend dedup across payment cards / shipping addresses to catch obvious sybils.

None of these stop a determined adversary; all of them raise the cost enough that the bot economics get marginal. The honest framing is: we make scalping less profitable, not impossible.

Latency budget

queue admission → /reserve:               sub-100 ms (cached session token)
/reserve roundtrip:
  edge / TLS:                          30 ms
  application tier hop:                 5 ms
  reservation service → Redis Lua:      2 ms
  durable journal (async):              not on critical path
  response back:                       30 ms
                                 total: ~70–120 ms p99

Far inside the 2 s budget; the spare time is what swallows network jitter on slow mobile connections.

Step 7 — Evaluation & Trade-offs

Bottleneck #1: the hot inventory shard. A single section is a contention hotspot when its first row is the most-coveted real estate. Even with intra-event sharding, “section A row 1” is a single key. Mitigations: lazy availability display (don’t tell users a specific row is available; tell them “front section available”) so 100 K users don’t all hammer the same key on the same millisecond. The reservation service serializes contention but throughput is bounded by Redis single-thread perf (~100 K ops/sec per primary, headroom from there with multiple shards).

Bottleneck #2: payment-gateway throughput. External PSPs (Stripe, Adyen, regional processors) have per-merchant rate limits in the low thousands per second. If a 50 K-seat sale sells out in ~30 min, that is 50 K / 1 800 s ≈ 28 payments/sec at one payment per seat, which is comfortable (and only about double the sustained estimate from Step 2). A truly extreme event (multi-venue tour going on sale globally) can exceed PSP limits; we negotiate burst quotas in advance and have a payment-queue fallback that delays charging while keeping the hold alive.

Bottleneck #3: waiting-room state size. 10 M ZSET entries per event in Redis is fine; 50 simultaneous mega-events with 10 M each is 500 M entries. We allocate dedicated queue-cluster capacity per major event rather than mixing.

Bottleneck #4: bot adversaries. Not a technical bottleneck so much as an economic one. Every defensive measure (CAPTCHA, IP scoring, account-aging) trades user friction for bot resistance. The honest evaluation: we will never win the arms race outright. The goal is making the bot economics worse than ticket prices, so any individual scalper’s margin is thin. Verified-fan presales (allowlists of known accounts) are the only mechanism that genuinely changes the shape, at the cost of audit complexity.

Alternative I’d push back on: real-time per-seat availability rendered on the seatmap during a flash sale. Customers ask for it; UX designers prototype it. The broadcast cost scales as (seat-status changes per second) × (concurrent viewers): at ~140 changes/sec (the reservation-attempt rate from Step 2) and 10 M viewers on the same seatmap, that is on the order of a billion pushed updates per second. The hybrid we use is the right trade: show aggregate availability (“12 left in Section A”) and reveal individual seats only after the user enters the active selection flow. It frustrates a small fraction of power-users; the majority get a faster, more reliable experience.

What breaks first at 10× scale (a 500 K-seat global tour going on sale in one moment): the waiting-room admission service. The current per-event sorted set hits memory and write-throughput limits beyond ~50 M entries. The fix is to shard the queue itself by region and run a distributed admission protocol — but that complicates the fairness story (does a US user have any chance against an EU user with the same enqueue time?). The product-policy answer probably becomes “regional allocations” rather than “global free-for-all.”

Companies this resembles

Ticketmaster (the canonical example), AXS, See Tickets, DICE, and Eventbrite (the lighter end of the same shape). The queue technology also overlaps heavily with high-demand product launches (Supreme, Nintendo Switch restocks, GPU drops) and IPO retail subscription windows.

Related designs

Rate Limiter — the front-line defense layer for the queue-join endpoint.
Distributed Cache — Redis is the linchpin of both the holds plane and the waiting room.
Payment System — checkout depends on this design directly; idempotency and reconciliation are co-designed.