Spotify Wrapped (Batch Analytics)

Annual per-user retrospective from a year of plays. Massively-parallel batch pipeline, per-user partitioning, and the once-a-year cost shape.

System Intermediate
12 min read
batch analytics mapreduce partitioning
Companies this resembles: Spotify Wrapped · Apple Replay · YouTube Music Recap · Strava Year in Sport

Step 1 — Clarify Requirements#

Functional

  • Once a year (late November / early December), every user gets a personalized retrospective: top songs, top artists, top genres, total minutes listened, “music personality” classifications, comparisons to last year.
  • Output is consumed by a mobile experience (animated story-style cards) and a web shareable summary.
  • The full result for ~600 M users must be ready and pre-warmed for the simultaneous launch moment.
  • Shareable cards include some derived statements (“you were in the top 0.5% of X listeners”) that require global rank computation, not just per-user aggregation.
  • Out of scope: per-user algorithmic playlists (continuous, separate pipeline), live “currently listening” features, podcast deep-stats (separate Wrapped surface).

Non-functional

  • Result correctness is non-negotiable. If we publish “your top artist is Bad Bunny” it had better actually be your top artist for the year. No probabilistic shortcuts on the main numbers.
  • Launch is a single coordinated moment. The mobile app pre-fetches results in the days before; the launch flips a feature flag and the experience becomes available simultaneously to all users in a region.
  • Total run cost matters because the workload is once-a-year. We can rent peak capacity rather than provision it permanently.
  • p99 “open Wrapped → first card renders” under 2 s on the day. The data is pre-baked; this is a CDN/cache problem at view time.
  • Privacy: outputs must be derivable only from this account’s listening history, with one exception (the global rank statements, which use only aggregates).

Step 2 — Capacity Estimation#

The shape of the data:

  • Users: 600 M MAU.
  • Plays per user per year: heavy long-tail. Median ~1 500 plays/year, p90 ~10 000, p99 ~50 000, max in the hundreds of thousands.
  • Total plays in a year: roughly ~600 M × 5 000 average ≈ 3 trillion play events.
  • Per-play event size on the lake: ~200 B compressed (Parquet, including user_id, track_id, ts, ms_played, source, device hash) → ~600 TB compressed/year of play data in the lake.
  • Compute footprint: a single-pass Spark job to group-and-aggregate by user across 3 T rows is computationally tractable on a large cluster. Roughly ~50 000 vCPU-hours for the main aggregation pass, plus another 50 000 for derived statistics and packaging.
  • Output size: ~50 KB of JSON per user × 600 M = ~30 TB of pre-baked result documents.
  • Cluster sizing: at, say, 5 000 cores for 24 hours, we finish in a day. We typically run the pipeline 3–4 times in November (dry runs + final) so total spend is ~5 cluster-days of large compute.
  • Serving QPS at launch: 600 M users, ~20% open Wrapped within the first 24 h. Concentrated burst: peak roughly ~100 K rps for the first hour after launch.

These numbers say: this isn’t a streaming problem, it’s a batch problem with a sharp serving spike. We’re sizing for two extremes — terabyte-scale offline crunch and a CDN-shaped read peak — connected by a static document store.

Step 3 — System Interface#

The visible API is small; almost everything else is internal.

GET /wrapped/:user_id
Returns: full pre-baked JSON document (cached on CDN, immutable until next year)
GET /wrapped/:user_id/story.png (and .mp4, .json variants)
Returns: pre-rendered shareable card
POST /wrapped/:user_id/share
Body: { target: 'instagram' | 'twitter' | ... }
Returns: { share_url, expires_at }
# Internal (between pipeline stages)
data lake: plays/year=2026/month=*/day=*/part-*.parquet
output: wrapped/2026/user_id={hash_prefix}/{user_id}.json

The user-facing GET is content-addressed and immutable: the URL never changes, the body is fixed at publish time. This is what makes the read path a CDN problem rather than a real-time computation problem.

Step 4 — High-Level Design#

event sources ┌── orchestrator (Airflow / equivalent) ──┐
apps ─→ play-event ingest ─→ Kafka ─→ stream tee ─→ S3-like lake (Parquet, partitioned by date)
(continuous, year-round) │
┌──── Stage 1: user-grouped aggregation
│ (Spark, repartition by user_id)
├──── Stage 2: per-user enrichment
│ (join with artist/track catalog,
│ compute genres, similarity classes)
├──── Stage 3: global rank computation
│ (percentile cuts per attribute)
├──── Stage 4: result packaging
│ (assemble final JSON per user,
│ pre-render share cards)
output store (object store, sharded)
CDN warmup + feature-flag flip
clients fetch via CDN

The pipeline is strictly batch. Year-round, plays flow into the lake; in November, the four staged jobs run, producing per-user documents. Once everything is materialized, the CDN gets pre-warmed and the launch toggle flips.

Step 5 — Data Model#

Source play events (append-only, partitioned by day in the lake):

play_event
user_id uuid
track_id uuid
artist_id uuid
album_id uuid
ts timestamp
ms_played int // listened duration, not track duration
source enum(playlist, search, radio, library, autoplay, share, ...)
device_class string
country string

Stage-1 output (per-user-grouped, partitioned by hash(user_id) % N):

user_summary
user_id
total_ms
total_plays
unique_tracks
unique_artists
top_tracks list<{track_id, plays, ms}>
top_artists list<{artist_id, plays, ms}>
top_genres map<genre, ms>
monthly_breakdown map<month, {plays, ms, top_artist}>
listening_hour_histogram list<24>

Stage-2 enrichment (broadcast joins with relatively small catalogs):

artist_catalog: artist_id → { name, image_url, primary_genre, ... }
track_catalog: track_id → { title, duration_ms, isrc, ... }
genre_taxonomy: genre → parent genre, hue, vibe-cluster

These catalogs are megabytes, not terabytes — broadcast them to every executor rather than shuffling.

Final per-user document (the served artifact):

{
"user_id": "...",
"year": 2026,
"total_minutes": 41523,
"top_artists": [{ "id": "...", "name": "...", "plays": 1240 }, ...],
"top_tracks": [...],
"top_genres": [...],
"music_personality": {
"label": "Eclectic Explorer",
"description": "...",
"axes": { "discovery": 0.82, "consistency": 0.31, ... }
},
"global_rank_blurbs": [
"You were in the top 0.5% of Bad Bunny listeners.",
"You discovered Charli XCX before 92% of fans."
],
"shareable_card_urls": { ... }
}

Step 6 — Detailed Design#

Stage 1: user-grouped aggregation#

The lake holds plays partitioned by day. Stage 1 reads the year’s partitions, repartitions by user_id, and computes per-user aggregates. This is the largest shuffle in the pipeline.

The shuffle key is hash(user_id) % N for a chosen N (say 10 000 partitions). Each partition holds ~60 K users and the play records for those users only. Within a partition, a single executor scans the data and emits one row per user.

Hot-user skew is a real concern: a heavy-listener bot account or shared family account can have 200 K plays/year while the median user has 1 500. We use salted aggregation for the top-K-by-time tracks computation: rather than a single groupByKey(user_id) which would put all plays for one extreme user on one executor, we partial-aggregate with a salt and then re-aggregate without it:

plays.map(p => ((p.user_id, p.track_id), (1, p.ms_played)))
.reduceByKey(combine) // local combine
.map(((u, t), v) => (u, (t, v)))
.groupByKey() // now small per user
.map(per_user_topk)

The local combine step shrinks plays-of-the-same-track from N rows to 1 row before the network shuffle, which makes the per-user grouping cheap regardless of skew.

Single-pass groupBy — read all plays, group by user, compute all aggregates. Simple to reason about, but skew kills you: one user with 200 K plays bottlenecks the executor holding their partition.
Two-stage aggregate — partial-aggregate by (user, track) and (user, artist) locally before the user-level shuffle. Much smaller shuffle volume, skew-immune. The extra pass costs ~20% more CPU and avoids 90% of the OOM risk.

Stage 2: enrichment#

The per-user output of Stage 1 references track/artist IDs but no human-readable data. Stage 2 joins against the catalogs (broadcast joins because catalogs are small) and computes derived attributes: dominant genres, “vibe clusters,” top-artist-by-month, listening time-of-day patterns.

This is also where the “music personality” classifier runs. It’s a clustering model trained offline that maps the user’s audio-feature distribution (tempo, energy, valence aggregates) to one of, say, 16 labels. The model itself is tiny; the work is just applying it.

Stage 3: global rank computation#

Some statements (“top 0.5% of Bad Bunny listeners”) require knowing where each user sits in the global distribution. Naïvely sorting all 600 M users by every attribute is expensive. We use approximate quantile sketches (T-Digest or Q-Digest) computed once per attribute over all users, then look each user up against the sketch:

for each attribute (e.g. ms_played_for_artist_X):
build T-Digest over user values (one pass, mergeable across partitions)
for each user:
rank = digest.cdf(user_value)
if rank > 0.99: emit blurb "top 1% of artist X listeners"

The sketches are mergeable, so each executor builds a local digest and the driver combines them. Total memory: a few MB per attribute regardless of user count. The accuracy is within ~0.5% of true quantiles — fine for “top 0.5%” claims.

Stage 4: result packaging#

The final per-user JSON is assembled here. Each row is keyed by user_id and written to wrapped/2026/user_id={hash_prefix}/{user_id}.json in the output store. The hash_prefix sharding (first 2 characters of the user_id hash) gives us ~256 directories with ~2.3 M files each — bounded directory size and predictable lookup.

Pre-rendering of shareable cards happens here too. For each user, we render PNG/MP4 variants for the major social formats. This is the heaviest per-user CPU step but trivially parallel: spawn it as a downstream job that reads the JSON and emits images.

Ingestion (year-round)#

Plays come in as a continuous Kafka stream from clients. A stream tee writes them to the lake in 1-hour Parquet partitions. We don’t bother with exactly-once semantics here — duplicates in the play log are de-duped at Stage 1 by (user_id, track_id, ts, ms_played) tuple. The annual pipeline absorbs the cost.

Late-arriving events (a mobile client offline for a week, syncing on Dec 1 of the wrapped window) get a hard cutoff: we run Stage 1 with a 7-day late-event buffer, then freeze. Anything later is acknowledged in the lake but not reflected in this year’s Wrapped.

Serving path#

Once Stage 4 completes:

  1. The output store now holds 600 M documents.
  2. A warmup job pushes the documents into a CDN — actually a multi-tier cache (regional edges + a hot middle tier). The CDN uses cache-key = user_id, and TTL = 1 year.
  3. We pre-fetch each user’s document to the device opportunistically: the app, in the days before launch, makes a background fetch and stores the encrypted result locally. The launch is just a feature-flag flip on the rendering layer.
  4. The launch moment: the feature flag flips, the experience becomes visible. Reads from cold cache on launch peak at ~100 K rps and serve in under 50 ms from the CDN — the documents are static and content-addressed.

What if Stage 1 fails halfway?#

Idempotence is built in. Each stage writes to a versioned output directory (output/v3/user_summary/...) and the next stage reads from the latest-completed version. A failure mid-stage is restartable from the last checkpoint. We always do dry runs in early November against a sampled population (~1% of users) before the full run, and we keep last year’s pipeline frozen on hand as a contingency.

Latency budget at view time (target p99 2 s)#

client open Wrapped → CDN hit: 20–80 ms (regional)
JSON download (~50 KB): 50–150 ms
client-side animation engine boot: 200–400 ms
first frame: ~500 ms total

The interesting work was done weeks ago. View time is just file delivery.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: hot-user skew in Stage 1. Already addressed with salted aggregation; the secondary risk is that the salting strategy itself has to be tuned per-attribute. The top-K computations are skew-tolerant; the histogram computations (listening-hour distribution) need separate care because they cardinality-explode on power users.

Bottleneck #2: pipeline data freshness vs cost. We freeze on December 1 to give downstream stages headroom. Users played from Dec 1 to launch day aren’t included. We surface this honestly: “Wrapped reflects your listening through [date].” Trying to include the last week of data would either delay launch or require a costly incremental re-run, neither worth it.

Bottleneck #3: launch-day CDN load. 100 K rps on a static document set is well within CDN headroom — but only because we pre-warmed. A cold CDN at launch would push origin traffic to multi-Tbps levels. The pre-warm job copies all 30 TB to edges in the 48 h before launch; cost is non-trivial but a known line item.

Bottleneck #4: catalog drift. Artists rename, tracks get pulled from the platform, albums get re-released with new IDs. The catalog snapshot used in Stage 2 has to be coherent across the year of plays. We pin a catalog version at pipeline start; track_id changes mid-year get rectified by a separate track_id_aliases table that the pipeline applies. The honest failure mode: in a small number of edge cases a user’s top track will show an old album cover. We accept that.

Alternative I’d push back on: running this incrementally year-round so that “Wrapped” is computed continuously and just snapshotted in December. Tempting because it would spread the cost evenly and make any-day-of-year retrospective possible. The reality: maintaining an always-fresh aggregate over 3 T plays is much more expensive than a once-a-year batch, because the always-on system has to handle continuous writes with low latency and we’d lose the bulk-compute economics of renting cluster time. The product also benefits from the seasonal moment — making it always-available dilutes the launch event.

What breaks first at 10× scale (6 B users, 30 T plays): Stage 1 shuffle data volume. We’d move to a streaming pre-aggregation that maintains rolling per-user summaries continuously, then the November batch becomes a finalization pass on already-aggregated state. The streaming pre-aggregation is a much more complex system but is the only viable shape at that scale.

Companies this resembles#

Spotify Wrapped (canonical), Apple Music Replay, YouTube Music Recap, Strava Year in Sport, Reddit Recap, Duolingo Year in Review. The general pattern — pre-baked annual retrospective driven by a single big batch job — has become a recurring product genre.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.