Spotify Wrapped (Batch Analytics)
Annual per-user retrospective from a year of plays. Massively-parallel batch pipeline, per-user partitioning, and the once-a-year cost shape.
Step 1 — Clarify Requirements#
Functional
- Once a year (late November / early December), every user gets a personalized retrospective: top songs, top artists, top genres, total minutes listened, “music personality” classifications, comparisons to last year.
- Output is consumed by a mobile experience (animated story-style cards) and a web shareable summary.
- The full result for ~600 M users must be ready and pre-warmed for the simultaneous launch moment.
- Shareable cards include some derived statements (“you were in the top 0.5% of X listeners”) that require global rank computation, not just per-user aggregation.
- Out of scope: per-user algorithmic playlists (continuous, separate pipeline), live “currently listening” features, podcast deep-stats (separate Wrapped surface).
Non-functional
- Result correctness is non-negotiable. If we publish “your top artist is Bad Bunny” it had better actually be your top artist for the year. No probabilistic shortcuts on the main numbers.
- Launch is a single coordinated moment. The mobile app pre-fetches results in the days before; the launch flips a feature flag and the experience becomes available simultaneously to all users in a region.
- Total run cost matters because the workload is once-a-year. We can rent peak capacity rather than provision it permanently.
- p99 “open Wrapped → first card renders” under 2 s on the day. The data is pre-baked; this is a CDN/cache problem at view time.
- Privacy: outputs must be derivable only from this account’s listening history, with one exception (the global rank statements, which use only aggregates).
Step 2 — Capacity Estimation#
The shape of the data:
- Users: 600 M MAU.
- Plays per user per year: heavy long-tail. Median ~1 500 plays/year, p90 ~10 000, p99 ~50 000, max in the hundreds of thousands.
- Total plays in a year: roughly ~600 M × 5 000 average ≈ 3 trillion play events.
- Per-play event size on the lake: ~200 B compressed (Parquet, including user_id, track_id, ts, ms_played, source, device hash) → ~600 TB compressed/year of play data in the lake.
- Compute footprint: a single-pass Spark job to group-and-aggregate by user across 3 T rows is computationally tractable on a large cluster. Roughly ~50 000 vCPU-hours for the main aggregation pass, plus another 50 000 for derived statistics and packaging.
- Output size: ~50 KB of JSON per user × 600 M = ~30 TB of pre-baked result documents.
- Cluster sizing: at, say, 5 000 cores for 24 hours, we finish in a day. We typically run the pipeline 3–4 times in November (dry runs + final) so total spend is ~5 cluster-days of large compute.
- Serving QPS at launch: 600 M users, ~20% open Wrapped within the first 24 h. Concentrated burst: peak roughly ~100 K rps for the first hour after launch.
These numbers say: this isn’t a streaming problem, it’s a batch problem with a sharp serving spike. We’re sizing for two extremes — terabyte-scale offline crunch and a CDN-shaped read peak — connected by a static document store.
Step 3 — System Interface#
The visible API is small; almost everything else is internal.
GET /wrapped/:user_id Returns: full pre-baked JSON document (cached on CDN, immutable until next year)
GET /wrapped/:user_id/story.png (and .mp4, .json variants) Returns: pre-rendered shareable card
POST /wrapped/:user_id/share Body: { target: 'instagram' | 'twitter' | ... } Returns: { share_url, expires_at }
# Internal (between pipeline stages)data lake: plays/year=2026/month=*/day=*/part-*.parquetoutput: wrapped/2026/user_id={hash_prefix}/{user_id}.jsonThe user-facing GET is content-addressed and immutable: the URL never changes, the body is fixed at publish time. This is what makes the read path a CDN problem rather than a real-time computation problem.
Step 4 — High-Level Design#
event sources ┌── orchestrator (Airflow / equivalent) ──┐ apps ─→ play-event ingest ─→ Kafka ─→ stream tee ─→ S3-like lake (Parquet, partitioned by date) (continuous, year-round) │ ▼ ┌──── Stage 1: user-grouped aggregation │ (Spark, repartition by user_id) │ ├──── Stage 2: per-user enrichment │ (join with artist/track catalog, │ compute genres, similarity classes) │ ├──── Stage 3: global rank computation │ (percentile cuts per attribute) │ ├──── Stage 4: result packaging │ (assemble final JSON per user, │ pre-render share cards) │ ▼ output store (object store, sharded) │ ▼ CDN warmup + feature-flag flip │ ▼ clients fetch via CDNThe pipeline is strictly batch. Year-round, plays flow into the lake; in November, the four staged jobs run, producing per-user documents. Once everything is materialized, the CDN gets pre-warmed and the launch toggle flips.
Step 5 — Data Model#
Source play events (append-only, partitioned by day in the lake):
play_event user_id uuid track_id uuid artist_id uuid album_id uuid ts timestamp ms_played int // listened duration, not track duration source enum(playlist, search, radio, library, autoplay, share, ...) device_class string country stringStage-1 output (per-user-grouped, partitioned by hash(user_id) % N):
user_summary user_id total_ms total_plays unique_tracks unique_artists top_tracks list<{track_id, plays, ms}> top_artists list<{artist_id, plays, ms}> top_genres map<genre, ms> monthly_breakdown map<month, {plays, ms, top_artist}> listening_hour_histogram list<24>Stage-2 enrichment (broadcast joins with relatively small catalogs):
artist_catalog: artist_id → { name, image_url, primary_genre, ... }track_catalog: track_id → { title, duration_ms, isrc, ... }genre_taxonomy: genre → parent genre, hue, vibe-clusterThese catalogs are megabytes, not terabytes — broadcast them to every executor rather than shuffling.
Final per-user document (the served artifact):
{ "user_id": "...", "year": 2026, "total_minutes": 41523, "top_artists": [{ "id": "...", "name": "...", "plays": 1240 }, ...], "top_tracks": [...], "top_genres": [...], "music_personality": { "label": "Eclectic Explorer", "description": "...", "axes": { "discovery": 0.82, "consistency": 0.31, ... } }, "global_rank_blurbs": [ "You were in the top 0.5% of Bad Bunny listeners.", "You discovered Charli XCX before 92% of fans." ], "shareable_card_urls": { ... }}Step 6 — Detailed Design#
Stage 1: user-grouped aggregation#
The lake holds plays partitioned by day. Stage 1 reads the year’s partitions, repartitions by user_id, and computes per-user aggregates. This is the largest shuffle in the pipeline.
The shuffle key is hash(user_id) % N for a chosen N (say 10 000 partitions). Each partition holds ~60 K users and the play records for those users only. Within a partition, a single executor scans the data and emits one row per user.
Hot-user skew is a real concern: a heavy-listener bot account or shared family account can have 200 K plays/year while the median user has 1 500. We use salted aggregation for the top-K-by-time tracks computation: rather than a single groupByKey(user_id) which would put all plays for one extreme user on one executor, we partial-aggregate with a salt and then re-aggregate without it:
plays.map(p => ((p.user_id, p.track_id), (1, p.ms_played))) .reduceByKey(combine) // local combine .map(((u, t), v) => (u, (t, v))) .groupByKey() // now small per user .map(per_user_topk)The local combine step shrinks plays-of-the-same-track from N rows to 1 row before the network shuffle, which makes the per-user grouping cheap regardless of skew.
Stage 2: enrichment#
The per-user output of Stage 1 references track/artist IDs but no human-readable data. Stage 2 joins against the catalogs (broadcast joins because catalogs are small) and computes derived attributes: dominant genres, “vibe clusters,” top-artist-by-month, listening time-of-day patterns.
This is also where the “music personality” classifier runs. It’s a clustering model trained offline that maps the user’s audio-feature distribution (tempo, energy, valence aggregates) to one of, say, 16 labels. The model itself is tiny; the work is just applying it.
Stage 3: global rank computation#
Some statements (“top 0.5% of Bad Bunny listeners”) require knowing where each user sits in the global distribution. Naïvely sorting all 600 M users by every attribute is expensive. We use approximate quantile sketches (T-Digest or Q-Digest) computed once per attribute over all users, then look each user up against the sketch:
for each attribute (e.g. ms_played_for_artist_X): build T-Digest over user values (one pass, mergeable across partitions) for each user: rank = digest.cdf(user_value) if rank > 0.99: emit blurb "top 1% of artist X listeners"The sketches are mergeable, so each executor builds a local digest and the driver combines them. Total memory: a few MB per attribute regardless of user count. The accuracy is within ~0.5% of true quantiles — fine for “top 0.5%” claims.
Stage 4: result packaging#
The final per-user JSON is assembled here. Each row is keyed by user_id and written to wrapped/2026/user_id={hash_prefix}/{user_id}.json in the output store. The hash_prefix sharding (first 2 characters of the user_id hash) gives us ~256 directories with ~2.3 M files each — bounded directory size and predictable lookup.
Pre-rendering of shareable cards happens here too. For each user, we render PNG/MP4 variants for the major social formats. This is the heaviest per-user CPU step but trivially parallel: spawn it as a downstream job that reads the JSON and emits images.
Ingestion (year-round)#
Plays come in as a continuous Kafka stream from clients. A stream tee writes them to the lake in 1-hour Parquet partitions. We don’t bother with exactly-once semantics here — duplicates in the play log are de-duped at Stage 1 by (user_id, track_id, ts, ms_played) tuple. The annual pipeline absorbs the cost.
Late-arriving events (a mobile client offline for a week, syncing on Dec 1 of the wrapped window) get a hard cutoff: we run Stage 1 with a 7-day late-event buffer, then freeze. Anything later is acknowledged in the lake but not reflected in this year’s Wrapped.
Serving path#
Once Stage 4 completes:
- The output store now holds 600 M documents.
- A warmup job pushes the documents into a CDN — actually a multi-tier cache (regional edges + a hot middle tier). The CDN uses cache-key =
user_id, and TTL = 1 year. - We pre-fetch each user’s document to the device opportunistically: the app, in the days before launch, makes a background fetch and stores the encrypted result locally. The launch is just a feature-flag flip on the rendering layer.
- The launch moment: the feature flag flips, the experience becomes visible. Reads from cold cache on launch peak at ~100 K rps and serve in under 50 ms from the CDN — the documents are static and content-addressed.
What if Stage 1 fails halfway?#
Idempotence is built in. Each stage writes to a versioned output directory (output/v3/user_summary/...) and the next stage reads from the latest-completed version. A failure mid-stage is restartable from the last checkpoint. We always do dry runs in early November against a sampled population (~1% of users) before the full run, and we keep last year’s pipeline frozen on hand as a contingency.
Latency budget at view time (target p99 2 s)#
client open Wrapped → CDN hit: 20–80 ms (regional)JSON download (~50 KB): 50–150 msclient-side animation engine boot: 200–400 msfirst frame: ~500 ms totalThe interesting work was done weeks ago. View time is just file delivery.
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: hot-user skew in Stage 1. Already addressed with salted aggregation; the secondary risk is that the salting strategy itself has to be tuned per-attribute. The top-K computations are skew-tolerant; the histogram computations (listening-hour distribution) need separate care because they cardinality-explode on power users.
Bottleneck #2: pipeline data freshness vs cost. We freeze on December 1 to give downstream stages headroom. Users played from Dec 1 to launch day aren’t included. We surface this honestly: “Wrapped reflects your listening through [date].” Trying to include the last week of data would either delay launch or require a costly incremental re-run, neither worth it.
Bottleneck #3: launch-day CDN load. 100 K rps on a static document set is well within CDN headroom — but only because we pre-warmed. A cold CDN at launch would push origin traffic to multi-Tbps levels. The pre-warm job copies all 30 TB to edges in the 48 h before launch; cost is non-trivial but a known line item.
Bottleneck #4: catalog drift. Artists rename, tracks get pulled from the platform, albums get re-released with new IDs. The catalog snapshot used in Stage 2 has to be coherent across the year of plays. We pin a catalog version at pipeline start; track_id changes mid-year get rectified by a separate track_id_aliases table that the pipeline applies. The honest failure mode: in a small number of edge cases a user’s top track will show an old album cover. We accept that.
Alternative I’d push back on: running this incrementally year-round so that “Wrapped” is computed continuously and just snapshotted in December. Tempting because it would spread the cost evenly and make any-day-of-year retrospective possible. The reality: maintaining an always-fresh aggregate over 3 T plays is much more expensive than a once-a-year batch, because the always-on system has to handle continuous writes with low latency and we’d lose the bulk-compute economics of renting cluster time. The product also benefits from the seasonal moment — making it always-available dilutes the launch event.
What breaks first at 10× scale (6 B users, 30 T plays): Stage 1 shuffle data volume. We’d move to a streaming pre-aggregation that maintains rolling per-user summaries continuously, then the November batch becomes a finalization pass on already-aggregated state. The streaming pre-aggregation is a much more complex system but is the only viable shape at that scale.
Companies this resembles#
Spotify Wrapped (canonical), Apple Music Replay, YouTube Music Recap, Strava Year in Sport, Reddit Recap, Duolingo Year in Review. The general pattern — pre-baked annual retrospective driven by a single big batch job — has become a recurring product genre.
Related systems#
- ML Data Infrastructure — the lake, the Parquet conventions, and the broadcast catalogs all show up there.
- Distributed Task Scheduler — Airflow-class orchestration of the four-stage DAG.
- Blob Store — the substrate for both the lake and the output document store.