YouTube

Video upload, encoding pipeline, CDN-backed delivery, watch-history, recommendations.

Companies this resembles: YouTube · TikTok · Vimeo · Twitch VODs

Step 1 — Clarify Requirements

Functional

  • A creator uploads a video file (a few MB to many GB) → it becomes playable on every device at the right quality.
  • A viewer searches or browses, hits play, and watches with sub-second startup.
  • Adaptive bitrate streaming: quality adjusts to bandwidth in real time.
  • View counts, likes, watch history, simple recommendations.
  • Out of scope here: monetization, copyright detection (Content ID), live streaming, comment moderation.

Non-functional

  • 99.99% availability for the watch path. The upload and transcoding pipeline can tolerate minutes of delay.
  • p99 video startup (time-to-first-frame) under 1 second within a region.
  • Global reach: viewers on any continent must be no more than ~50 ms from an edge cache.
  • Eventual consistency on view counts and recommendations; durability is non-negotiable for the master video.

Step 2 — Capacity Estimation

Anchor on 2 B monthly active viewers, 500 hours uploaded per minute (canonical YouTube-scale figures).

  • Upload throughput: 500 hours/min × 1,440 min/day = 720,000 hours/day; at ~4 GB per hour of 1080p source (about 9 Mbps) that is ≈ 3 PB/day of raw uploads.
  • Transcoded storage: each upload fans out to 6-8 renditions (144p, 240p, 360p, 480p, 720p, 1080p, 4K) plus audio + thumbnails. With dedup and tiered storage, total amplification ≈ 3-4×. Net storage growth: 9-12 PB/day, ~4 EB/year.
  • Watch throughput: 1 B hours watched per day ≈ 42 M concurrent streams on average; at an assumed blended bitrate of ~2.5 Mbps that is roughly 100 Tbps of average egress, with peaks 2-3× higher.
  • Watch QPS (manifest requests + segment fetches): 100 Tbps / ~3 MB segments (10 s at 2.5 Mbps) ≈ 4 M segment requests/sec globally.
  • Read:write ratio in bandwidth terms: 3 PB/day of uploads is ≈ 280 Gbps, so watching is several hundred times the bandwidth of uploading.

These numbers tell us the design: the upload + transcode pipeline is large but bounded; the watch path is a CDN-and-cache problem at planetary scale, and storage is dominated by long-tail content that rarely plays.
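
The estimate above can be rerun as a small script. This is a back-of-envelope sketch, not a measurement: the upload rate, watch hours, and amplification range come from the text, while `gb_per_uploaded_hour` and `mean_playback_mbps` are assumptions chosen for illustration. Every result scales linearly with those two inputs, so change them to test a different scenario.

```python
SECONDS_PER_DAY = 86_400


def estimate(
    upload_hours_per_min=500,   # anchor figure from the text
    gb_per_uploaded_hour=4.0,   # assumption: roughly 9 Mbps 1080p source
    amplification=(3, 4),       # storage fan-out range from the text
    watch_hours_per_day=1e9,    # anchor figure from the text
    mean_playback_mbps=2.5,     # assumption: blended bitrate across devices
):
    """Return rough daily ingest, storage growth, and egress figures."""
    upload_hours_per_day = upload_hours_per_min * 60 * 24
    raw_pb_per_day = upload_hours_per_day * gb_per_uploaded_hour / 1e6  # GB -> PB
    stored_pb_per_day = tuple(raw_pb_per_day * a for a in amplification)

    # Hours watched per day, spread evenly, gives the average concurrency.
    avg_concurrent_streams = watch_hours_per_day * 3600 / SECONDS_PER_DAY
    avg_egress_tbps = avg_concurrent_streams * mean_playback_mbps / 1e6  # Mbps -> Tbps

    upload_gbps = raw_pb_per_day * 1e15 * 8 / SECONDS_PER_DAY / 1e9
    return {
        "upload_hours_per_day": upload_hours_per_day,
        "raw_pb_per_day": raw_pb_per_day,
        "stored_pb_per_day": stored_pb_per_day,
        "avg_concurrent_streams": avg_concurrent_streams,
        "avg_egress_tbps": avg_egress_tbps,
        "watch_to_upload_ratio": avg_egress_tbps * 1000 / upload_gbps,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value}")
```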

Step 3 — System Interface

POST /uploads (resumable upload, returns upload_id)
PUT /uploads/:upload_id/chunk?offset=N (chunked, idempotent on offset)
POST /uploads/:upload_id/finalize (kicks off transcoding)
GET /videos/:video_id (metadata + manifest URL)
GET /watch/:video_id/manifest.m3u8 (HLS manifest, signed; DASH clients use an equivalent .mpd)
GET /watch/:video_id/seg_<bitrate>_<n>.ts (segment file, edge-cached)
POST /videos/:video_id/view (heartbeat, async)
POST /videos/:video_id/like

Uploads use resumable, chunked PUTs so a creator on flaky wifi can resume after an interruption. The manifest URL is signed with a short TTL (5 minutes) so view tokens cannot be rehosted indefinitely.
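
One common way to implement a short-lived signed URL is an HMAC over the path and an expiry timestamp. The sketch below assumes that approach; the text does not say which signing scheme is used. `SECRET`, `make_token`, `signed_url`, and `verify` are hypothetical names, and a real deployment would load the key from a secret store and rotate it.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-only-secret"  # hypothetical key, for illustration only


def _signature(path, expires):
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_token(path, ttl_s=300, now=None):
    """Return (expires, signature) for a path; the default TTL is 5 minutes."""
    issued_at = int(time.time() if now is None else now)
    expires = issued_at + ttl_s
    return expires, _signature(path, expires)


def signed_url(path, ttl_s=300, now=None):
    expires, sig = make_token(path, ttl_s, now)
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify(path, expires, sig, now=None):
    """Reject expired tokens and tokens issued for a different path."""
    current = int(time.time() if now is None else now)
    if current > int(expires):
        return False
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(_signature(path, int(expires)), sig)
```

Because the expiry is part of the signed message, a client cannot extend a token by editing the `expires` query parameter.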

Step 4 — High-Level Design

creator                                                viewer
   │                                                      │
   ▼                                                      ▼
Upload service ──→ Raw video blob store (S3-class) ──→ CDN (edge POPs)
   │                          │                               ▲
   │                          ▼                               │
   └→ metadata DB  Transcoding fleet (GPU + CPU workers)      │
                   (queue-driven, per-rendition fanout)       │
                              │                               │
                              ▼                               │
                   Transcoded blob store ──── replication ────┘
Manifest service ─────→ viewers' API gateway

Two largely independent pipelines:

  • Ingest (cold): upload → durable raw store → enqueue transcoding job → fan out to one job per rendition → write transcoded outputs → publish manifest → invalidate stale CDN entries.
  • Watch (hot): viewer hits CDN edge → segment served from edge cache → cache misses fall back to regional cache → regional misses fall back to origin (rare).

The CDN is the load-bearing piece for cost and latency. ~95% of bytes served must come from the edge, or the bandwidth bill alone becomes unmanageable.
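
The watch-path fallback (edge, then regional, then origin) is a read-through cache chain. The sketch below is a toy model of that chain, with hypothetical `Tier` and `Origin` classes and no eviction or TTLs, meant only to show how a miss at one tier fills every tier on the way back.

```python
class Origin:
    """Stand-in for the durable blob store; counts how often it is read."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.blobs[key]


class Tier:
    """A cache tier that reads through to its parent on a miss."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = parent
        self.store = {}  # unbounded here; a real tier would evict
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.parent.get(key)  # fall back to the next tier
        self.store[key] = value       # fill this tier on the way back
        return value


origin = Origin({"vid1/seg_0": b"segment-bytes"})
regional = Tier("regional", origin)
edge_a = Tier("edge-a", regional)
edge_b = Tier("edge-b", regional)

edge_a.get("vid1/seg_0")  # misses edge and regional, reads origin once
edge_b.get("vid1/seg_0")  # misses its own edge, hits the shared regional tier
edge_a.get("vid1/seg_0")  # served from edge-a
```

In this model the origin is read once for three requests across two edges, which is the effect the regional tier exists to produce.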

Step 5 — Data Model

Metadata (relational, sharded by video_id)

table videos
  video_id      uuid        PK
  uploader_id   uuid
  title         string
  description   string
  duration_s    int
  status        enum(uploading, transcoding, ready, blocked)
  visibility    enum(public, unlisted, private)
  created_at    timestamp
  renditions    list<{bitrate, codec, blob_uri, manifest_uri}>

Watch events (wide-column, partition by video_id, time bucket as cluster key)

table watch_events
  video_id         uuid       PK
  bucket_5min      timestamp  CK
  viewer_id        uuid
  watched_seconds  int
  geo              string

Counters for view count and like count are sharded (see /system-design/sharded-counters): naively incrementing one counter row per video turns that row into a hot key for any trending video.
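
As a rough illustration of the sharded-counter idea, the in-memory sketch below spreads increments for one video across several sub-counters and sums them on read. `ShardedCounter` is a hypothetical class; in a real system each (key, shard) pair would be a separate row or key in the datastore, and the shard count would be tuned per video.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards=16, rng=None):
        self.num_shards = num_shards
        self.shards = {}                  # (key, shard index) -> partial count
        self.rng = rng or random.Random()

    def incr(self, key, n=1):
        # A random shard spreads writes so no single row absorbs every view.
        shard = self.rng.randrange(self.num_shards)
        self.shards[(key, shard)] = self.shards.get((key, shard), 0) + n

    def total(self, key):
        # Reads pay for the fan-out: one lookup per shard.
        return sum(self.shards.get((key, s), 0) for s in range(self.num_shards))
```

Writes get cheaper and reads get more expensive, which suits view counts: they are written constantly and can be read from a periodically aggregated value.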

Blob stores are content-addressed when possible; the manifest references blobs by URI rather than by row.
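
Content addressing means the storage key is derived from the bytes themselves, so identical uploads map to the same object. A minimal sketch, assuming SHA-256 as the hash and a hypothetical `blobs/<aa>/<bb>/<digest>` key layout:

```python
import hashlib


def blob_key(data: bytes) -> str:
    """Derive a storage key from content; identical bytes give identical keys."""
    digest = hashlib.sha256(data).hexdigest()
    # The two short prefixes spread keys across directories or partitions.
    return f"blobs/{digest[:2]}/{digest[2:4]}/{digest}"
```

For multi-gigabyte masters the hash would be computed incrementally while the chunks arrive rather than over one in-memory buffer.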

Step 6 — Detailed Design

The transcoding pipeline

A single 4K upload of 60 minutes produces ~10 derived assets. Naively, that’s a 10-hour serial job. The fix is to chunk the master into ~10-second segments and transcode in parallel:

master.mp4 → segment_000.mp4   segment_001.mp4   ...   segment_359.mp4   (60 min / 10 s)
                    │                 │                       │
                worker_a          worker_b                worker_x
             (one job per (segment, rendition) pair)
transcoded segments → assembled into HLS / DASH playlists

Each (segment, rendition) job is independent and fits on one GPU box. With a fleet of 5,000 workers, a 1-hour upload (on the order of 3,600 segment-rendition jobs) completes the full bitrate ladder in 2-3 minutes. The job queue is partitioned by upload to keep ordering inside one upload.
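
The fan-out can be sketched as a job planner that emits one job per (segment, rendition) pair. The rendition list below uses the seven video resolutions named in the capacity section, so the job count is lower than the "~10 derived assets" figure, which also counts audio and thumbnails. `plan_jobs` and `wall_clock_estimate_s` are hypothetical helpers, and the per-job duration passed to the estimate is an assumption.

```python
import math
from itertools import product

RENDITIONS = ["144p", "240p", "360p", "480p", "720p", "1080p", "2160p"]


def plan_jobs(duration_s, segment_s=10, renditions=RENDITIONS):
    """Return one independent transcode job per (segment, rendition) pair."""
    n_segments = math.ceil(duration_s / segment_s)
    return [
        {
            "segment": i,
            "start_s": i * segment_s,
            "end_s": min((i + 1) * segment_s, duration_s),  # last one may be short
            "rendition": r,
        }
        for i, r in product(range(n_segments), renditions)
    ]


def wall_clock_estimate_s(n_jobs, workers, job_s):
    """Idealized finish time: jobs run in waves of `workers` at a time."""
    return math.ceil(n_jobs / workers) * job_s


jobs = plan_jobs(duration_s=3600)
estimate_s = wall_clock_estimate_s(len(jobs), workers=5000, job_s=60)
```

The estimate ignores queueing, segment splitting, and playlist assembly, so it is a lower bound rather than a prediction.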

Adaptive bitrate (HLS / DASH)

The manifest enumerates renditions by bitrate. The player measures the download time of each segment and steps up or down between bitrates to keep the buffer healthy. Segment length trades off two things:

  • Shorter segments (2-4 s): lower startup latency and finer-grained bitrate switching, but more manifest entries and more request overhead (one HTTP request per segment).
  • Longer segments (10-30 s): higher CDN cache efficiency and fewer requests, but slower bitrate switching when bandwidth changes and slower startup.

YouTube settled on ~5-10 second segments as a balance. Live streaming uses shorter segments at the cost of higher overhead.
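
A simple throughput-based switching rule illustrates the player side. This is a sketch of one common heuristic, not the algorithm any particular player uses: the bitrate ladder, the 0.8 safety factor, and the 5-second low-buffer threshold are all assumed values.

```python
LADDER_KBPS = [235, 750, 1750, 3000, 5800]  # hypothetical ladder, ascending


def estimate_throughput_kbps(segment_bytes, download_s):
    """Throughput observed while downloading the last segment."""
    return segment_bytes * 8 / 1000 / download_s


def next_bitrate(current_kbps, throughput_kbps, buffer_s,
                 safety=0.8, low_buffer_s=5.0, ladder=LADDER_KBPS):
    """Pick the bitrate for the next segment; current_kbps must be on the ladder."""
    # Highest rung that fits within a safety fraction of measured throughput.
    affordable = [b for b in ladder if b <= throughput_kbps * safety]
    target = affordable[-1] if affordable else ladder[0]

    if buffer_s < low_buffer_s:
        # Buffer is nearly empty: never step up, drop to the target if lower.
        return min(target, current_kbps)
    if target > current_kbps:
        # Step up one rung at a time to avoid oscillating between extremes.
        return ladder[ladder.index(current_kbps) + 1]
    return target  # hold, or step down immediately
```

Stepping down immediately but stepping up gradually is a deliberate asymmetry: a stall is far more visible to the viewer than a few seconds at a lower resolution.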

CDN strategy

The CDN has three tiers:

  1. Edge POPs (hundreds globally) — serve the hot set of recently played videos out of NVMe.
  2. Regional caches (~20 globally) — back the edges; absorb cache misses.
  3. Origin storage — the durable, replicated blob store; only ever touched on a regional miss.

A new upload is prewarmed only into regional caches (not edges) until it gets traction. A video that hits ~1,000 views starts replicating into edges via demand-based pull. This avoids pre-pushing 9 PB/day of mostly-unwatched content to hundreds of POPs.
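
The placement rule can be expressed as a small policy object. This sketch only tracks which tiers a video is eligible for; `PlacementPolicy` is a hypothetical name, and the view counter is a plain in-process `Counter` standing in for the real engagement counters.

```python
from collections import Counter

EDGE_PROMOTION_VIEWS = 1_000  # threshold quoted in the text


class PlacementPolicy:
    def __init__(self, threshold=EDGE_PROMOTION_VIEWS):
        self.threshold = threshold
        self.views = Counter()
        self.edge_eligible = set()

    def on_upload(self, video_id):
        # New uploads are prewarmed into regional caches only.
        return ["regional"]

    def on_view(self, video_id):
        self.views[video_id] += 1
        if self.views[video_id] >= self.threshold:
            # From here on, edges may keep segments they pull on demand.
            self.edge_eligible.add(video_id)
        return self.tiers(video_id)

    def tiers(self, video_id):
        if video_id in self.edge_eligible:
            return ["regional", "edge"]
        return ["regional"]
```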

Watch-history and recommendations

Watch events stream into Kafka, fan out to:

  • Recent-watch service (Redis, last 200 videos per user) for the “Continue watching” rail.
  • Offline training pipelines (Spark / Flink → feature store → recommender model) — see /system-design/ml-data-infrastructure.
  • Engagement counters (sharded counters → batch aggregation into the videos table).

The recommender itself is a two-stage system: a cheap candidate generator (approximate nearest-neighbor search over video embeddings, queried with the user's embedding) followed by a heavier ranker (gradient-boosted trees or a neural ranker). Recommendations are pre-computed for each user every few hours and cached.
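
The two-stage shape can be shown with a toy example. Here a brute-force dot-product search stands in for the ANN index, and a hand-weighted linear score stands in for the learned ranker; the embeddings, the `ctr` and `freshness` features, and the weights are all made up for illustration.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec, video_vecs, k):
    """Stage 1: cheap recall. Brute force here; an ANN index in practice."""
    return heapq.nlargest(k, video_vecs, key=lambda vid: dot(user_vec, video_vecs[vid]))


def rank(user_vec, candidates, video_vecs, features, n):
    """Stage 2: heavier scoring over the short candidate list only."""
    def score(vid):
        f = features[vid]
        return 0.6 * dot(user_vec, video_vecs[vid]) + 0.3 * f["ctr"] + 0.1 * f["freshness"]
    return sorted(candidates, key=score, reverse=True)[:n]


user = [1.0, 0.0]
videos = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.0, 1.0], "d": [0.5, 0.5]}
features = {
    "a": {"ctr": 0.1, "freshness": 0.0},
    "b": {"ctr": 0.9, "freshness": 1.0},
    "c": {"ctr": 0.9, "freshness": 1.0},
    "d": {"ctr": 0.5, "freshness": 0.5},
}
shortlist = generate_candidates(user, videos, k=3)
recommended = rank(user, shortlist, videos, features, n=2)
```

Video "c" has strong engagement features but never reaches the ranker because stage 1 filters it out, which is the cost of keeping the expensive model off the full catalog.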

Hot-video protection

A viral video can absorb 50% of regional cache traffic. Mitigations:

  • Edge caches use an LFU-with-decay policy so a rising video evicts other content faster.
  • Trending videos get explicit prewarm instructions to all edges in the region.
  • The manifest itself has a long cache TTL (1 hour); only the signed segment URLs rotate.
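
One way to realize LFU with decay is to store a score per key that halves every fixed interval, so old popularity fades and a recently rising video outranks a formerly hot one. The sketch below assumes exponential decay with a configurable half-life; `DecayingLFU` is a hypothetical class, and the linear scan on eviction is for clarity, not performance.

```python
class DecayingLFU:
    def __init__(self, capacity, half_life_s=600.0):
        self.capacity = capacity
        self.half_life_s = half_life_s
        self.entries = {}  # key -> (score, timestamp of last update)

    def _decayed(self, key, now):
        score, ts = self.entries[key]
        # The score halves for every half_life_s elapsed since the last update.
        return score * 0.5 ** ((now - ts) / self.half_life_s)

    def touch(self, key, now):
        """Record an access at time `now`; return the evicted key, if any."""
        if key in self.entries:
            self.entries[key] = (self._decayed(key, now) + 1.0, now)
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            # Evict the key with the lowest decayed score (O(n) scan).
            evicted = min(self.entries, key=lambda k: self._decayed(k, now))
            del self.entries[evicted]
        self.entries[key] = (1.0, now)
        return evicted


cache = DecayingLFU(capacity=2, half_life_s=10.0)
for _ in range(5):
    cache.touch("old-hit", now=0)    # popular long ago
for _ in range(2):
    cache.touch("rising", now=95)    # fewer plays, but recent
evicted = cache.touch("new", now=100)
```

With plain LFU the five-play video would survive and the two-play video would be evicted; with decay the outcome flips.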

Latency budget (target p99 1 s startup)

DNS + TLS (warm CDN connection): 30 ms
Manifest fetch from edge: 50 ms
First segment fetch (low bitrate): 200 ms
Player decode + first frame: 100 ms
Network jitter buffer: 100-300 ms
total: ~500-700 ms typical, ~1 s p99

The trick to staying under 1 second is starting at the lowest bitrate and stepping up after the buffer fills. Players that try to start at the user’s negotiated bitrate add another 500-1500 ms.

Step 7 — Evaluation & Trade-offs

Bottleneck #1: CDN egress cost. This is the dominant operational cost of the system, larger than storage and compute combined. Aggressive edge cache hit rates (target 95%+) and per-ISP peering agreements are where the cost battle lives.

Bottleneck #2: cold-start of the long tail. A video uploaded 3 years ago and then watched once today produces a regional cache miss and possibly an origin fetch — a 200 ms+ first-segment latency. Acceptable, but if it becomes common (e.g., a tweet linking to an obscure 2018 video that goes viral), it can momentarily stampede origin. Single-flight at the regional cache fixes the synchronous version; demand-prefetch handles the gradual version.
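
Single-flight (request coalescing) means that concurrent misses for the same key share one upstream fetch. A minimal thread-based sketch follows; `SingleFlight` and `demo` are hypothetical names, and the 0.2-second sleep simulates a slow origin read.

```python
import threading
import time


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared call record

    def do(self, key, fetch):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = call
        if leader:
            try:
                call["value"] = fetch()  # only the leader hits the origin
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call["done"].set()
        else:
            call["done"].wait()          # followers wait for the leader's result
        if call["error"] is not None:
            raise call["error"]
        return call["value"]


def demo(n_requests=8):
    flight = SingleFlight()
    origin_calls, results = [], []
    barrier = threading.Barrier(n_requests)

    def fetch():
        origin_calls.append(1)
        time.sleep(0.2)                  # simulated slow origin read
        return b"segment-bytes"

    def worker():
        barrier.wait()                   # release all requests together
        results.append(flight.do("vid/seg_0", fetch))

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(origin_calls), results
```

This version coalesces only requests that overlap in time; once the leader finishes, the next miss fetches again unless the result has been cached.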

Bottleneck #3: transcoding queue depth during spikes. New Year’s Eve, election night, major sports — upload rates spike 5×. The fleet needs autoscaling on queue depth plus a graceful degradation strategy: defer the 4K rendition for ~30 minutes if the queue exceeds X, so 1080p (good enough for most viewers) ships immediately.
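
The deferral rule can be sketched as a queue that delays low-priority renditions when it is backed up. The threshold "X" is left unspecified in the text, so `defer_threshold` below is a caller-supplied parameter; `TranscodeQueue` is a hypothetical class and time is passed in explicitly to keep the example deterministic.

```python
import heapq
import itertools


class TranscodeQueue:
    def __init__(self, defer_threshold, deferrable=("2160p",), defer_s=1800):
        self.defer_threshold = defer_threshold
        self.deferrable = set(deferrable)
        self.defer_s = defer_s
        self._heap = []                # (available_at, seq, job)
        self._seq = itertools.count()  # tie-breaker preserves FIFO order

    def submit(self, job, now):
        """Enqueue a job; return the delay applied to it in seconds."""
        overloaded = len(self._heap) > self.defer_threshold
        deferred = overloaded and job["rendition"] in self.deferrable
        delay = self.defer_s if deferred else 0
        heapq.heappush(self._heap, (now + delay, next(self._seq), job))
        return delay

    def pop_ready(self, now):
        """Return the next job whose availability time has passed, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None
```

A worker calls `pop_ready(now)` in a loop, so deferred 4K jobs simply stay invisible until their delay elapses while 1080p and lower renditions drain first.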

Alternative I’d push back on: transcoding once into a “master” rendition and re-transcoding on demand per viewer. It looks like a storage win, but per-view CPU cost dwarfs storage cost at YouTube scale, and watch latency becomes unbounded. Pre-transcode every rendition that has a meaningful audience; transcode only exotic resolutions on demand.

What breaks first at 10× scale: object-storage metadata. A single bucket can hold trillions of objects, but the metadata layer (key listing, lifecycle scans) becomes the bottleneck long before the actual byte-storage layer. Shard by upload date and by uploader_id so per-bucket key counts stay under operational limits.
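
A sharding scheme along those lines might derive the bucket name from the upload month and a hash of the uploader. The naming pattern, the monthly granularity, and the 64-way uploader split below are all assumptions for illustration; `bucket_for` is a hypothetical helper.

```python
import hashlib
from datetime import date


def bucket_for(uploader_id: str, upload_date: date, uploader_shards: int = 64) -> str:
    """Map an upload to a bucket by month and by a stable hash of the uploader."""
    digest = hashlib.sha256(uploader_id.encode()).digest()
    # A stable hash (unlike Python's built-in hash()) gives the same shard
    # on every process and every run.
    shard = int.from_bytes(digest[:4], "big") % uploader_shards
    return f"videos-{upload_date:%Y-%m}-u{shard:02d}"
```

Date sharding bounds how many keys any one bucket ever accumulates, and the uploader hash keeps a single prolific channel from dominating a month's bucket.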

Companies this resembles

YouTube, TikTok, Vimeo, Netflix VOD (technically much smaller catalog, far higher per-asset traffic), Twitch VODs.