Spotify Wrapped (Batch Analytics) — System Design

Step 1 — Clarify Requirements

Functional

Once a year (late November / early December), every user gets a personalized retrospective: top songs, top artists, top genres, total minutes listened, “music personality” classifications, comparisons to last year.
Output is consumed by a mobile experience (animated story-style cards) and a web shareable summary.
The full result for ~600 M users must be ready and pre-warmed for the simultaneous launch moment.
Shareable cards include some derived statements (“you were in the top 0.5% of X listeners”) that require global rank computation, not just per-user aggregation.
Out of scope: per-user algorithmic playlists (continuous, separate pipeline), live “currently listening” features, podcast deep-stats (separate Wrapped surface).

Non-functional

Result correctness is non-negotiable. If we publish “your top artist is Bad Bunny” it had better actually be your top artist for the year. No probabilistic shortcuts on the main numbers.
Launch is a single coordinated moment. The mobile app pre-fetches results in the days before; the launch flips a feature flag and the experience becomes available simultaneously to all users in a region.
Total run cost matters because the workload is once-a-year. We can rent peak capacity rather than provision it permanently.
p99 “open Wrapped → first card renders” under 2 s on the day. The data is pre-baked; this is a CDN/cache problem at view time.
Privacy: outputs must be derivable only from this account’s listening history, with one exception (the global rank statements, which use only aggregates).

Step 2 — Capacity Estimation

The shape of the data:

Users: 600 M MAU.
Plays per user per year: heavy long-tail. Median ~1 500 plays/year, p90 ~10 000, p99 ~50 000, max in the hundreds of thousands.
Total plays in a year: ~600 M × 5 000 (the mean, pulled well above the median by the long tail) ≈ 3 trillion play events.
Per-play event size on the lake: ~200 B compressed (Parquet, including user_id, track_id, ts, ms_played, source, device hash) → ~600 TB compressed/year of play data in the lake.
Compute footprint: a single-pass Spark job to group-and-aggregate by user across 3 T rows is computationally tractable on a large cluster. Roughly ~50 000 vCPU-hours for the main aggregation pass, plus another 50 000 for derived statistics and packaging.
Output size: ~50 KB of JSON per user × 600 M = ~30 TB of pre-baked result documents.
Cluster sizing: at, say, 5 000 cores for 24 hours, we finish in a day. We typically run the pipeline 3–4 times in November (dry runs + final) so total spend is ~5 cluster-days of large compute.
Serving QPS at launch: 600 M users, ~20% open Wrapped within the first 24 h (~120 M opens, only ~1.4 K rps if spread evenly). The burst is concentrated: the peak is roughly 100 K rps within the first hour after launch.

These numbers say: this isn’t a streaming problem, it’s a batch problem with a sharp serving spike. We’re sizing for two extremes, an offline crunch over hundreds of terabytes and a CDN-shaped read peak, connected by a static document store.
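The estimates above can be re-derived in a few lines. This is a minimal sketch that only uses the round numbers quoted in this step; the variable names are illustrative and none of the inputs are measurements.

```python
# Back-of-envelope check of the Step 2 estimates.
# Every input is a round number quoted in the text, not a measurement.
USERS = 600_000_000
AVG_PLAYS_PER_USER = 5_000      # mean, well above the ~1 500 median
BYTES_PER_PLAY = 200            # compressed Parquet row
DOC_BYTES = 50_000              # ~50 KB of JSON per user
CORES, HOURS = 5_000, 24
OPEN_PCT_24H = 20               # percent of users opening Wrapped in 24 h

total_plays = USERS * AVG_PLAYS_PER_USER
lake_tb = total_plays * BYTES_PER_PLAY / 1e12
output_tb = USERS * DOC_BYTES / 1e12
core_hours = CORES * HOURS
opens_24h = USERS * OPEN_PCT_24H // 100
avg_rps_24h = opens_24h / 86_400

print(f"plays per year      : {total_plays:.1e}")
print(f"lake size           : {lake_tb:.0f} TB")
print(f"output documents    : {output_tb:.0f} TB")
print(f"core-hours per run  : {core_hours:,}")
print(f"average rps over 24h: {avg_rps_24h:,.0f}")
```

One cluster-day at 5 000 cores is 120 000 core-hours, which covers the ~100 000 vCPU-hours estimated for aggregation plus packaging. The even-spread request rate is two orders of magnitude below the quoted 100 K rps peak, which is why the serving side is sized for the burst rather than the average.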

Step 3 — System Interface

The visible API is small; almost everything else is internal.

GET   /wrapped/:user_id
        Returns: full pre-baked JSON document (cached on CDN, immutable until next year)

GET   /wrapped/:user_id/story.png    (and .mp4, .json variants)
        Returns: pre-rendered shareable card

POST  /wrapped/:user_id/share
        Body: { target: 'instagram' | 'twitter' | ... }
        Returns: { share_url, expires_at }

# Internal (between pipeline stages)
data lake:  plays/year=2026/month=*/day=*/part-*.parquet
output:     wrapped/2026/user_id={hash_prefix}/{user_id}.json

The user-facing GET is keyed by user_id and immutable for the year: the URL never changes, the body is fixed at publish time. This is what makes the read path a CDN problem rather than a real-time computation problem.

Step 4 — High-Level Design

event sources                      ┌── orchestrator (Airflow / equivalent) ──┐
  apps  ─→ play-event ingest ─→ Kafka ─→ stream tee ─→ S3-like lake (Parquet, partitioned by date)
                                        (continuous, year-round)              │
                                                                              ▼
                                                          ┌──── Stage 1: user-grouped aggregation
                                                          │     (Spark, repartition by user_id)
                                                          │
                                                          ├──── Stage 2: per-user enrichment
                                                          │     (join with artist/track catalog,
                                                          │      compute genres, similarity classes)
                                                          │
                                                          ├──── Stage 3: global rank computation
                                                          │     (percentile cuts per attribute)
                                                          │
                                                          ├──── Stage 4: result packaging
                                                          │     (assemble final JSON per user,
                                                          │      pre-render share cards)
                                                          │
                                                          ▼
                                              output store (object store, sharded)
                                                          │
                                                          ▼
                                                  CDN warmup + feature-flag flip
                                                          │
                                                          ▼
                                                    clients fetch via CDN

The pipeline is strictly batch. Year-round, plays flow into the lake; in November, the four staged jobs run, producing per-user documents. Once everything is materialized, the CDN gets pre-warmed and the launch toggle flips.

Step 5 — Data Model

Source play events (append-only, partitioned by day in the lake):

play_event
  user_id        uuid
  track_id       uuid
  artist_id      uuid
  album_id       uuid
  ts             timestamp
  ms_played      int            // listened duration, not track duration
  source         enum(playlist, search, radio, library, autoplay, share, ...)
  device_class   string
  country        string

Stage-1 output (per-user-grouped, partitioned by hash(user_id) % N):

user_summary
  user_id
  total_ms
  total_plays
  unique_tracks
  unique_artists
  top_tracks   list<{track_id, plays, ms}>
  top_artists  list<{artist_id, plays, ms}>
  top_genres   map<genre, ms>
  monthly_breakdown map<month, {plays, ms, top_artist}>
  listening_hour_histogram   list<24>

Stage-2 enrichment (broadcast joins with relatively small catalogs):

artist_catalog: artist_id → { name, image_url, primary_genre, ... }
track_catalog:  track_id  → { title, duration_ms, isrc, ... }
genre_taxonomy: genre     → parent genre, hue, vibe-cluster

These catalogs are megabytes, not terabytes — broadcast them to every executor rather than shuffling.
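The broadcast idea can be illustrated without Spark. In this minimal sketch the catalog is a plain dict handed to every partition, so the large side is enriched by local lookups and never shuffled. The catalog contents, the row shape, and the function name enrich_partition are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical, tiny catalog. In the pipeline this would be the pinned
# artist_catalog snapshot, small enough to copy to every executor.
ARTIST_CATALOG: Dict[str, dict] = {
    "a1": {"name": "Artist One", "primary_genre": "pop"},
    "a2": {"name": "Artist Two", "primary_genre": "jazz"},
}


def enrich_partition(rows: Iterable[dict], catalog: Dict[str, dict]) -> Iterator[dict]:
    """Map-side join: each row is enriched by a local dict lookup."""
    for row in rows:
        meta = catalog.get(row["artist_id"])
        if meta is None:
            # Unknown ID (for example, pulled from the catalog): keep the row.
            yield {**row, "name": None, "primary_genre": None}
        else:
            yield {**row, **meta}


# Two "partitions" of the large side; each gets the same catalog copy.
partitions: List[List[dict]] = [
    [{"user_id": "u1", "artist_id": "a1", "plays": 10}],
    [{"user_id": "u2", "artist_id": "a2", "plays": 3},
     {"user_id": "u3", "artist_id": "zz", "plays": 1}],
]
enriched = [list(enrich_partition(p, ARTIST_CATALOG)) for p in partitions]
print(enriched[0][0]["name"], enriched[1][0]["primary_genre"])
```

Keeping rows with unknown IDs, rather than dropping them as an inner join would, is a design choice of this sketch; it keeps the totals stable when a track or artist has been removed from the catalog.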

Final per-user document (the served artifact):

{
  "user_id": "...",
  "year": 2026,
  "total_minutes": 41523,
  "top_artists": [{ "id": "...", "name": "...", "plays": 1240 }, ...],
  "top_tracks":  [...],
  "top_genres":  [...],
  "music_personality": {
    "label": "Eclectic Explorer",
    "description": "...",
    "axes": { "discovery": 0.82, "consistency": 0.31, ... }
  },
  "global_rank_blurbs": [
    "You were in the top 0.5% of Bad Bunny listeners.",
    "You discovered Charli XCX before 92% of fans."
  ],
  "shareable_card_urls": { ... }
}

Step 6 — Detailed Design

Stage 1: user-grouped aggregation

The lake holds plays partitioned by day. Stage 1 reads the year’s partitions, repartitions by user_id, and computes per-user aggregates. This is the largest shuffle in the pipeline.

The shuffle key is hash(user_id) % N for a chosen N (say 10 000 partitions). Each partition holds ~60 K users and the play records for those users only. Within a partition, a single executor scans the data and emits one row per user.

Hot-user skew is a real concern: a heavy-listener bot account or shared family account can have 200 K plays/year while the median user has 1 500. For the top-K-by-time tracks computation we avoid a single groupByKey(user_id), which would put every raw play for one extreme user on one executor. Instead we use a two-stage aggregation in which track_id acts as a natural salt: first reduce on the composite key (user_id, track_id), which spreads a heavy user’s plays across partitions, then drop track_id from the key and re-aggregate per user:

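plays.map(p => ((p.user_id, p.track_id), (1L, p.ms_played.toLong)))
     .reduceByKey { case ((n1, ms1), (n2, ms2)) => (n1 + n2, ms1 + ms2) }  // local combine, then shuffle on (user, track)
     .map { case ((u, t), v) => (u, (t, v)) }       // drop track_id from the key
     .groupByKey()                                  // now small per user: one row per distinct track
     .map(per_user_topk)                            // per_user_topk: user-supplied top-K over (track, (plays, ms))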

The local combine step shrinks plays-of-the-same-track from N rows to 1 row before the network shuffle, which makes the per-user grouping cheap regardless of skew.

Single-pass groupBy: read all plays, group by user, compute all aggregates. Simple to reason about, but skew kills you: one user with 200 K plays bottlenecks the executor holding their partition.

Two-stage aggregate: partial-aggregate by (user, track) and (user, artist) locally before the user-level shuffle. Much smaller shuffle volume and far less sensitive to skew, since a heavy user contributes one row per distinct track rather than one row per play. The extra pass costs ~20% more CPU and avoids 90% of the OOM risk.
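The two-stage aggregate can be sketched in plain Python at toy scale. The first function collapses plays to one row per (user, track); the second regroups by user and keeps the top K by listened time. The function names and the tie-breaking rule (ms, then play count, then track_id) are assumptions of this sketch.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Play = Tuple[str, str, int]                 # (user_id, track_id, ms_played)
Partial = Dict[Tuple[str, str], Tuple[int, int]]


def partial_aggregate(plays: Iterable[Play]) -> Partial:
    """Stage A: collapse plays to one (plays, ms) pair per (user, track)."""
    acc: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])
    for user_id, track_id, ms in plays:
        cell = acc[(user_id, track_id)]
        cell[0] += 1
        cell[1] += ms
    return {key: (cell[0], cell[1]) for key, cell in acc.items()}


def top_k_per_user(partials: Partial, k: int) -> Dict[str, List[Tuple[str, int, int]]]:
    """Stage B: regroup by user and keep the k tracks with the most ms."""
    by_user: Dict[str, List[Tuple[str, int, int]]] = defaultdict(list)
    for (user_id, track_id), (count, ms) in partials.items():
        by_user[user_id].append((track_id, count, ms))
    return {
        user: heapq.nlargest(k, rows, key=lambda r: (r[2], r[1], r[0]))
        for user, rows in by_user.items()
    }


plays = [
    ("u1", "t1", 180_000), ("u1", "t1", 200_000), ("u1", "t2", 30_000),
    ("u1", "t3", 240_000), ("u2", "t1", 10_000),
]
partials = partial_aggregate(plays)
top = top_k_per_user(partials, k=2)
print(top["u1"])
```

After stage A the per-user group size is bounded by the number of distinct tracks a user played, not by their raw play count, which is what takes the pressure off the user-level shuffle.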

Stage 2: enrichment

The per-user output of Stage 1 references track/artist IDs but no human-readable data. Stage 2 joins against the catalogs (broadcast joins because catalogs are small) and computes derived attributes: dominant genres, “vibe clusters,” top-artist-by-month, listening time-of-day patterns.

This is also where the “music personality” classifier runs. It’s a clustering model trained offline that maps the user’s audio-feature distribution (tempo, energy, valence aggregates) to one of, say, 16 labels. The model itself is tiny; the work is just applying it.
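Applying such a model can be as simple as a nearest-centroid lookup. The sketch below assumes the offline clustering produced one centroid per label over three features scaled to 0..1; the centroid values and every label other than "Eclectic Explorer" are hypothetical, and the text does not say which model family is actually used.

```python
import math
from typing import Dict, Tuple

Features = Tuple[float, float, float]   # (tempo, energy, valence), each scaled to 0..1

# Hypothetical centroids. A real model would ship ~16 of these, fitted offline.
CENTROIDS: Dict[str, Features] = {
    "Eclectic Explorer": (0.5, 0.5, 0.5),
    "High-Energy Loyalist": (0.8, 0.9, 0.7),
    "Late-Night Mellow": (0.3, 0.2, 0.4),
}


def classify(features: Features) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(CENTROIDS, key=lambda label: math.dist(features, CENTROIDS[label]))


print(classify((0.82, 0.88, 0.65)))
```

The per-user cost is a handful of distance computations, which matches the point that the model is tiny and the work is only in applying it 600 M times.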

Stage 3: global rank computation

Some statements (“top 0.5% of Bad Bunny listeners”) require knowing where each user sits in the global distribution. Naïvely sorting all 600 M users by every attribute is expensive. We use approximate quantile sketches (T-Digest or Q-Digest) computed once per attribute over all users, then look each user up against the sketch:

for each attribute (e.g. ms_played_for_artist_X):
   build T-Digest over user values (one pass, mergeable across partitions)
   for each user:
      rank = digest.cdf(user_value)
      if rank > 0.99: emit blurb "top 1% of artist X listeners"

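The sketches are mergeable, so each executor builds a local digest and the driver combines them. Total memory: at most a few MB per attribute regardless of user count, although per-artist attributes multiply that by the number of artists covered. Accuracy needs care at the tail: an absolute rank error of ~0.5% would be as large as the claim itself for a “top 0.5%” blurb. T-Digest helps because it concentrates its resolution near the extremes of the distribution, so the error around the 99th percentile and above is much smaller than in the middle. For users who land right at a threshold, the safe options are an exact recount or rounding the claim to the more conservative bucket.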
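The build, merge, and lookup flow can be shown with an exact stand-in. The class below is not a T-Digest: it stores every value, so it only works at toy scale, but it exposes the same shape of interface (merge and cdf) that the pseudocode above relies on. The class name, the strict "below" definition of cdf, and the 0.99 threshold handling are assumptions of this sketch.

```python
import bisect
import heapq
from typing import Iterable, List, Optional


class ExactCdf:
    """Exact, mergeable stand-in for a quantile sketch (NOT a T-Digest)."""

    def __init__(self, values: Iterable[float] = ()) -> None:
        self.sorted_values: List[float] = sorted(values)

    def merge(self, other: "ExactCdf") -> "ExactCdf":
        """Combine two local structures, as the driver would merge digests."""
        merged = ExactCdf()
        merged.sorted_values = list(heapq.merge(self.sorted_values, other.sorted_values))
        return merged

    def cdf(self, x: float) -> float:
        """Fraction of stored values strictly below x."""
        if not self.sorted_values:
            return 0.0
        return bisect.bisect_left(self.sorted_values, x) / len(self.sorted_values)


def rank_blurb(digest: ExactCdf, user_value: float, artist: str) -> Optional[str]:
    # Emit a blurb only when at least 99% of listeners are below this user.
    if digest.cdf(user_value) >= 0.99:
        return f"You were in the top 1% of {artist} listeners."
    return None


# Two "executors" each build a local structure; the "driver" merges them.
part_a = ExactCdf(range(0, 500))
part_b = ExactCdf(range(500, 1000))
combined = part_a.merge(part_b)
print(rank_blurb(combined, 995, "Bad Bunny"))
```

Swapping in a real sketch changes the memory profile and makes cdf approximate, which is exactly where the borderline cases near a threshold need the extra care.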

Stage 4: result packaging

The final per-user JSON is assembled here. Each row is keyed by user_id and written to wrapped/2026/user_id={hash_prefix}/{user_id}.json in the output store. The hash_prefix sharding (first 2 hex characters of the user_id hash) gives us 256 directories with ~2.3 M files each (600 M / 256): bounded directory size and predictable lookup.
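A minimal sketch of the key layout follows. The text does not name the hash function, so SHA-256 is an assumption here, as is the helper name wrapped_key; any stable hash with a hex encoding would give the same 256-way split.

```python
import hashlib


def wrapped_key(user_id: str, year: int = 2026) -> str:
    """Object key for a user's document.

    SHA-256 is an assumed hash choice. The first two hex characters
    give 16 * 16 = 256 possible prefixes.
    """
    prefix = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:2]
    return f"wrapped/{year}/user_id={prefix}/{user_id}.json"


USERS = 600_000_000
files_per_prefix = USERS / 256

print(wrapped_key("example-user"))
print(f"{files_per_prefix:,.0f} files per prefix")
```

Because the prefix is derived from the user_id alone, the serving layer can compute the key from the request without any index lookup.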

Pre-rendering of shareable cards happens here too. For each user, we render PNG/MP4 variants for the major social formats. This is the heaviest per-user CPU step but trivially parallel: spawn it as a downstream job that reads the JSON and emits images.

Ingestion (year-round)

Plays come in as a continuous Kafka stream from clients. A stream tee writes them to the lake as hourly Parquet files under the daily partitions. We don’t bother with exactly-once semantics here: duplicates in the play log are de-duped at Stage 1 on the (user_id, track_id, ts, ms_played) tuple, and the annual pipeline absorbs the cost.

Late-arriving events (a mobile client offline for a week, syncing on Dec 1 of the wrapped window) get a hard cutoff: we run Stage 1 with a 7-day late-event buffer, then freeze. Anything later is acknowledged in the lake but not reflected in this year’s Wrapped.
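The de-duplication and the late-event cutoff can be sketched together. This assumes each record carries an arrival timestamp (arrived_at, a hypothetical field that is not in the play_event schema above) and reads the 7-day buffer as "the play happened inside the window and reached the lake no later than 7 days after the window closed".

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple


class Play(NamedTuple):
    user_id: str
    track_id: str
    ts: datetime            # event time: when the play happened
    ms_played: int
    arrived_at: datetime    # hypothetical field: when the lake received it


def stage1_input(plays: Iterable[Play], window_end: datetime,
                 late_buffer: timedelta = timedelta(days=7)) -> List[Play]:
    """Drop out-of-window plays, too-late arrivals, and exact duplicates."""
    freeze_at = window_end + late_buffer
    seen = set()
    kept: List[Play] = []
    for p in plays:
        if p.ts >= window_end or p.arrived_at > freeze_at:
            continue
        key = (p.user_id, p.track_id, p.ts, p.ms_played)   # de-dup tuple from the text
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept


def utc(month: int, day: int) -> datetime:
    return datetime(2026, month, day, tzinfo=timezone.utc)


window_end = utc(12, 1)
events = [
    Play("u1", "t1", utc(11, 20), 1000, utc(11, 20)),   # normal
    Play("u1", "t1", utc(11, 20), 1000, utc(11, 21)),   # duplicate delivery
    Play("u1", "t2", utc(11, 25), 2000, utc(12, 5)),    # late, within buffer
    Play("u1", "t3", utc(11, 25), 3000, utc(12, 10)),   # late, past the freeze
    Play("u1", "t4", utc(12, 2), 4000, utc(12, 2)),     # after the window
]
kept = stage1_input(events, window_end)
print([p.track_id for p in kept])
```

The arrival time is deliberately left out of the de-dup key, so a retried delivery of the same play collapses to one record.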

Serving path

Once Stage 4 completes:

The output store now holds 600 M documents.
A warmup job pushes the documents into a CDN — actually a multi-tier cache (regional edges + a hot middle tier). The CDN uses cache-key = user_id, and TTL = 1 year.
We pre-fetch each user’s document to the device opportunistically: the app, in the days before launch, makes a background fetch and stores the encrypted result locally. The launch is just a feature-flag flip on the rendering layer.
The launch moment: the feature flag flips, the experience becomes visible. Reads hit the pre-warmed cache, peak at ~100 K rps, and are served from the CDN edge in tens of milliseconds (20–80 ms regionally), because the documents are static and immutable.

What if Stage 1 fails halfway?

Idempotence is built in. Each stage writes to a versioned output directory (output/v3/user_summary/...) and the next stage reads from the latest-completed version. A failure mid-stage is restartable from the last checkpoint. We always do dry runs in early November against a sampled population (~1% of users) before the full run, and we keep last year’s pipeline frozen on hand as a contingency.
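One way to implement "read from the latest-completed version" is a completion marker per stage directory. The sketch below assumes an output/vN/stage layout like the path quoted above and a Hadoop/Spark-style _SUCCESS marker file; the marker convention and the function names are assumptions, not something the text specifies.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

SUCCESS_MARKER = "_SUCCESS"   # assumed completion marker, written last


def latest_completed(output_root: Path, stage: str) -> Optional[Path]:
    """Return the stage directory of the highest vN that finished."""
    best_version, best_path = -1, None
    for child in output_root.iterdir():
        match = re.fullmatch(r"v(\d+)", child.name)
        if match is None:
            continue
        stage_dir = child / stage
        version = int(match.group(1))
        if (stage_dir / SUCCESS_MARKER).exists() and version > best_version:
            best_version, best_path = version, stage_dir
    return best_path


def demo() -> str:
    """Build v1..v3 where v3 crashed mid-write, then resolve the input."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "output"
        for version, finished in [(1, True), (2, True), (3, False)]:
            stage_dir = root / f"v{version}" / "user_summary"
            stage_dir.mkdir(parents=True)
            if finished:
                (stage_dir / SUCCESS_MARKER).touch()
        chosen = latest_completed(root, "user_summary")
        return chosen.parent.name if chosen else ""


print(demo())
```

A half-written v3 is simply invisible to the next stage, so a rerun can write v4 (or overwrite v3) without any cleanup being on the critical path.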

Latency budget at view time (target p99 2 s)

client open Wrapped → CDN hit:        20–80 ms (regional)
JSON download (~50 KB):                50–150 ms
client-side animation engine boot:    200–400 ms
first frame:                          ~500 ms total

The interesting work was done weeks ago. View time is just file delivery.
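Summing the ranges in the budget shows how much slack sits under the 2 s target. The sketch only re-adds the numbers listed above; the dictionary keys are illustrative names.

```python
# (low_ms, high_ms) per step, copied from the budget above.
BUDGET_MS = {
    "cdn_hit": (20, 80),
    "json_download": (50, 150),
    "animation_engine_boot": (200, 400),
}
TARGET_P99_MS = 2_000

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())
headroom_ms = TARGET_P99_MS - worst_case

print(f"first frame: {best_case}-{worst_case} ms, headroom {headroom_ms} ms")
```

The listed steps add up to 270–630 ms, consistent with the ~500 ms figure, and leave well over a second for slow networks and low-end devices before the p99 target is at risk.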

Step 7 — Evaluation & Trade-offs

Bottleneck #1: hot-user skew in Stage 1. Already addressed with the two-stage (salt-like) aggregation; the secondary risk is that the strategy has to be tuned per attribute. Fixed-size aggregates such as the 24-bucket listening-hour histogram combine cheaply on the map side. The high-cardinality aggregates (distinct tracks and artists, and the top-K lists built from them) need separate care, because a power user can still contribute tens of thousands of distinct keys after the local combine.

Bottleneck #2: pipeline data freshness vs cost. We freeze on December 1 to give downstream stages headroom. Plays from Dec 1 to launch day aren’t included. We surface this honestly: “Wrapped reflects your listening through [date].” Trying to include the last week of data would either delay launch or require a costly incremental re-run, neither worth it.

Bottleneck #3: launch-day CDN load. 100 K rps on a static document set is well within CDN headroom, but only because we pre-warmed. A cold CDN at launch would send that whole burst to origin: about 5 GB/s (~40 Gbps) for the 50 KB JSON documents alone, and far more once the pre-rendered PNG/MP4 cards are counted. The pre-warm job copies all 30 TB to edges in the 48 h before launch; cost is non-trivial but a known line item.
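The bandwidth implied by these figures is easy to check. The sketch uses only numbers quoted in this document (100 K rps, ~50 KB per document, 30 TB, 48 h) and covers the JSON documents only; the PNG/MP4 cards are larger and their sizes are not given, so they are left out.

```python
PEAK_RPS = 100_000
DOC_BYTES = 50_000              # ~50 KB JSON document
CORPUS_BYTES = 30 * 10**12      # ~30 TB of documents
PREWARM_SECONDS = 48 * 3600

# Origin load if every launch-hour request missed the cache.
peak_bytes_per_s = PEAK_RPS * DOC_BYTES
peak_gbps = peak_bytes_per_s * 8 / 1e9

# Sustained rate needed to push one full copy of the corpus in 48 h.
prewarm_bytes_per_s = CORPUS_BYTES / PREWARM_SECONDS
prewarm_gbps = prewarm_bytes_per_s * 8 / 1e9

print(f"cold-cache origin load: {peak_gbps:.0f} Gbps")
print(f"pre-warm rate per copy: {prewarm_gbps:.2f} Gbps")
```

For the JSON alone, a fully cold launch works out to roughly 40 Gbps at origin, while pre-warming needs under 2 Gbps sustained per full copy of the corpus. Spreading the transfer over two days is what turns a sharp origin spike into a modest background copy.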

Bottleneck #4: catalog drift. Artists rename, tracks get pulled from the platform, albums get re-released with new IDs. The catalog snapshot used in Stage 2 has to be coherent across the year of plays. We pin a catalog version at pipeline start; track_id changes mid-year get rectified by a separate track_id_aliases table that the pipeline applies. The honest failure mode: in a small number of edge cases a user’s top track will show an old album cover. We accept that.

Alternative I’d push back on: running this incrementally year-round so that “Wrapped” is computed continuously and just snapshotted in December. Tempting because it would spread the cost evenly and make any-day-of-year retrospective possible. The reality: maintaining an always-fresh aggregate over 3 T plays is much more expensive than a once-a-year batch, because the always-on system has to handle continuous writes with low latency and we’d lose the bulk-compute economics of renting cluster time. The product also benefits from the seasonal moment — making it always-available dilutes the launch event.

What breaks first at 10× scale (6 B users, 30 T plays): Stage 1 shuffle data volume. We’d move to a streaming pre-aggregation that maintains rolling per-user summaries continuously, then the November batch becomes a finalization pass on already-aggregated state. The streaming pre-aggregation is a much more complex system but is the only viable shape at that scale.

Companies this resembles

Spotify Wrapped (canonical), Apple Music Replay, YouTube Music Recap, Strava Year in Sport, Reddit Recap, Duolingo Year in Review. The general pattern — pre-baked annual retrospective driven by a single big batch job — has become a recurring product genre.

ML Data Infrastructure — the lake, the Parquet conventions, and the broadcast catalogs all show up there.
Distributed Task Scheduler — Airflow-class orchestration of the four-stage DAG.
Blob Store — the substrate for both the lake and the output document store.