YouTube
Video upload, encoding pipeline, CDN-backed delivery, watch-history, recommendations.
Step 1 — Clarify Requirements#
Functional
- A creator uploads a video file (a few MB to many GB) → it becomes playable on every device at the right quality.
- A viewer searches or browses, hits play, and watches with sub-second startup.
- Adaptive bitrate streaming: quality adjusts to bandwidth in real time.
- View counts, likes, watch history, simple recommendations.
- Out of scope here: monetization, copyright detection (Content ID), live streaming, comments moderation.
Non-functional
- 99.99% availability for the watch path. The upload and transcoding pipeline can tolerate minutes of delay.
- p99 video startup (time-to-first-frame) under 1 second within a region.
- Global reach — viewers in any continent must be no more than ~50 ms from an edge cache.
- Eventual consistency on view counts and recommendations; durability is non-negotiable for the master video.
Step 2 — Capacity Estimation#
Anchor on 2 B monthly active viewers, 500 hours uploaded per minute (canonical YouTube-scale figures).
- Upload throughput: 500 hours/min × 60 min × 100 MB/hour (1080p compressed) ≈ 3 PB/day of raw uploads.
- Transcoded storage: each upload fans out to 6-8 renditions (144p, 240p, 360p, 480p, 720p, 1080p, 4K) plus audio + thumbnails. With dedup and tiered storage, total amplification ≈ 3-4×. Net storage growth: 9-12 PB/day, ~4 EB/year.
- Watch throughput: 1 B hours watched per day → roughly 5 Tbps of egress at peak (averages over 24h, peaks 2-3× higher).
- Watch QPS (manifest requests + segment fetches): 5 Tbps / 200 KB segments ≈ 3 M segment requests/sec globally.
- Read:write ratio in storage events: watching is ~1000× the bandwidth of uploading.
These numbers tell us the design: the upload + transcode pipeline is large but bounded; the watch path is a CDN-and-cache problem at planetary scale, and storage is dominated by long-tail content that rarely plays.
Step 3 — System Interface#
POST /uploads (resumable upload, returns upload_id)PUT /uploads/:upload_id/chunk?offset=N (chunked, idempotent on offset)POST /uploads/:upload_id/finalize (kicks off transcoding)
GET /videos/:video_id (metadata + manifest URL)GET /watch/:video_id/manifest.m3u8 (HLS / DASH manifest, signed)GET /watch/:video_id/seg_<bitrate>_<n>.ts (segment file, edge-cached)
POST /videos/:video_id/view (heartbeat, async)POST /videos/:video_id/likeUploads use resumable, chunked PUTs so a creator on flaky wifi can resume after an interruption. The manifest URL is signed with a short TTL (5 minutes) so view tokens cannot be rehosted indefinitely.
Step 4 — High-Level Design#
creator viewer │ │ ▼ ▼ Upload service ──→ Raw video blob store (S3-class) ──→ CDN (edge POPs) │ │ ▲ │ ▼ │ └→ metadata DB Transcoding fleet (GPU + CPU workers) │ (queue-driven, per-rendition fanout) │ │ │ ▼ │ Transcoded blob store ──── replication ────┘ │ ▼ Manifest service ─────→ viewers' API gatewayTwo largely independent pipelines:
- Ingest (cold): upload → durable raw store → enqueue transcoding job → fan out to one job per rendition → write transcoded outputs → publish manifest → invalidate stale CDN entries.
- Watch (hot): viewer hits CDN edge → segment served from edge cache → cache misses fall back to regional cache → regional misses fall back to origin (rare).
The CDN is the load-bearing piece for cost and latency. ~95% of bytes served must come from the edge, or the bandwidth bill alone becomes unmanageable.
Step 5 — Data Model#
Metadata (relational, sharded by video_id)
table videos video_id uuid PK uploader_id uuid title string description string duration_s int status enum(uploading, transcoding, ready, blocked) visibility enum(public, unlisted, private) created_at timestamp renditions list<{bitrate, codec, blob_uri, manifest_uri}>Watch events (wide-column, partition by video_id, time bucket as cluster key)
table watch_events video_id uuid PK bucket_5min timestamp CK viewer_id uuid watched_seconds int geo stringCounters for view-count, like-count are sharded (see /system-design/sharded-counters) — naively writing one row per view becomes a hot key for any trending video.
Blob stores are content-addressed when possible; the manifest references blobs by URI rather than by row.
Step 6 — Detailed Design#
The transcoding pipeline#
A single 4K upload of 60 minutes produces ~10 derived assets. Naively, that’s a 10-hour serial job. The fix is to chunk the master into ~10-second segments and transcode in parallel:
master.mp4 → segment_000.mp4 segment_001.mp4 ... segment_359.mp4 (60 min / 10 s) │ │ │ worker_a worker_b worker_x (one job per (segment, rendition) pair) ▼ transcoded segments → assembled into HLS / DASH playlistsEach (segment, rendition) job is independent and fits on one GPU box. With a fleet of 5,000 workers, a 1-hour upload completes the highest-bitrate ladder in 2-3 minutes. The job queue is partitioned by upload to keep ordering inside one upload.
Adaptive bitrate (HLS / DASH)#
The manifest enumerates renditions by bitrate. The player measures download time of each segment and steps up or down between bitrates to keep the buffer healthy. Segment length (~6-10 seconds) trades off two things:
YouTube settled on ~5-10 second segments as a balance. Live streaming uses shorter segments at the cost of higher overhead.
CDN strategy#
The CDN has three tiers:
- Edge POPs (hundreds globally) — serve the long tail of recently-played videos out of NVMe.
- Regional caches (~20 globally) — back the edges; absorb cache misses.
- Origin storage — the durable, replicated blob store; only ever touched on a regional miss.
A new upload is prewarmed only into regional caches (not edges) until it gets traction. A video that hits ~1,000 views starts replicating into edges via demand-based pull. This avoids pre-pushing 9 PB/day of mostly-unwatched content to thousands of POPs.
Watch-history and recommendations#
Watch events stream into Kafka, fan out to:
- Recent-watch service (Redis, last 200 videos per user) for the “Continue watching” rail.
- Offline training pipelines (Spark / Flink → feature store → recommender model) — see
/system-design/ml-data-infrastructure. - Engagement counters (sharded counters → batch aggregation into the videos table).
The recommender itself is a two-stage system: a cheap candidate generator (vector ANN over a user embedding) followed by a heavier ranker (gradient-boosted trees or a neural ranker). Recommendations are pre-computed for each user every few hours and cached.
Hot-video protection#
A viral video can absorb 50% of regional cache traffic. Mitigations:
- Edge caches use a LFU-with-decay policy so a rising video evicts other content faster.
- Trending videos get explicit prewarm instructions to all edges in the region.
- The manifest itself has long TTL (1 hour); only the segment URLs rotate (signed).
Latency budget (target p99 1 s startup)#
DNS + TLS (warm CDN connection): 30 msManifest fetch from edge: 50 msFirst segment fetch (low bitrate): 200 msPlayer decode + first frame: 100 msNetwork jitter buffer: 100-300 ms total: ~500-700 ms typical, ~1 s p99The trick to staying under 1 second is starting at the lowest bitrate and stepping up after the buffer fills. Players that try to start at the user’s negotiated bitrate add another 500-1500 ms.
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: CDN egress cost. This is the dominant operational cost of the system, larger than storage or compute combined. Aggressive edge cache hit rates (target 95%+) and per-ISP peering agreements are where the cost battle lives.
Bottleneck #2: cold-start of the long tail. A video uploaded 3 years ago and then watched once today produces a regional cache miss and possibly an origin fetch — a 200 ms+ first-segment latency. Acceptable, but if it becomes common (e.g., a tweet linking to an obscure 2018 video that goes viral), it can momentarily stampede origin. Single-flight at the regional cache fixes the synchronous version; demand-prefetch handles the gradual version.
Bottleneck #3: transcoding queue depth during spikes. New Year’s Eve, election night, major sports — upload rates spike 5×. The fleet needs autoscaling on queue depth plus a graceful degradation strategy: defer the 4K rendition for ~30 minutes if the queue exceeds X, so 1080p (good enough for most viewers) ships immediately.
Alternative I’d push back on: transcoding once into a “master” rendition and re-transcoding on demand per viewer. Looks like a storage win, but per-view CPU cost dwarfs storage cost at YouTube scale — and watch latency becomes unbounded. Pre-transcode every rendition that has a meaningful audience; only sample exotic resolutions on demand.
What breaks first at 10× scale: object-storage metadata. A single bucket can hold trillions of objects, but the metadata layer (key listing, lifecycle scans) becomes the bottleneck long before the actual byte-storage layer. Shard by upload date and by uploader_id so per-bucket key counts stay under operational limits.
Companies this resembles#
YouTube, TikTok, Vimeo, Netflix VOD (technically much smaller catalog, far higher per-asset traffic), Twitch VODs.
Related systems#
- Content Delivery Network — the substrate this design depends on entirely.
- Blob Store — origin storage for raw and transcoded assets.
- Distributed Task Scheduler — the transcoding job queue.
- Instagram — same media-storage shape at a different aspect ratio.