YouTube — System Design · Engineering Playbook

Step 1 — Clarify Requirements#

Functional

A creator uploads a video file (a few MB to many GB) → it becomes playable on every device at the right quality.
A viewer searches or browses, hits play, and watches with sub-second startup.
Adaptive bitrate streaming: quality adjusts to bandwidth in real time.
View counts, likes, watch history, simple recommendations.
Out of scope here: monetization, copyright detection (Content ID), live streaming, comments moderation.

Non-functional

99.99% availability for the watch path. The upload and transcoding pipeline can tolerate minutes of delay.
p99 video startup (time-to-first-frame) under 1 second within a region.
Global reach — viewers in any continent must be no more than ~50 ms from an edge cache.
Eventual consistency on view counts and recommendations; durability is non-negotiable for the master video.

Step 2 — Capacity Estimation#

Anchor on 2 B monthly active viewers, 500 hours uploaded per minute (canonical YouTube-scale figures).

Upload throughput: 500 hours/min × 60 min × 100 MB/hour (1080p compressed) ≈ 3 PB/day of raw uploads.
Transcoded storage: each upload fans out to 6-8 renditions (144p, 240p, 360p, 480p, 720p, 1080p, 4K) plus audio + thumbnails. With dedup and tiered storage, total amplification ≈ 3-4×. Net storage growth: 9-12 PB/day, ~4 EB/year.
Watch throughput: 1 B hours watched per day → roughly 5 Tbps of egress at peak (averages over 24h, peaks 2-3× higher).
Watch QPS (manifest requests + segment fetches): 5 Tbps / 200 KB segments ≈ 3 M segment requests/sec globally.
Read:write ratio in storage events: watching is ~1000× the bandwidth of uploading.

These numbers tell us the design: the upload + transcode pipeline is large but bounded; the watch path is a CDN-and-cache problem at planetary scale, and storage is dominated by long-tail content that rarely plays.

Step 3 — System Interface#

POST  /uploads                              (resumable upload, returns upload_id)
PUT   /uploads/:upload_id/chunk?offset=N    (chunked, idempotent on offset)
POST  /uploads/:upload_id/finalize          (kicks off transcoding)

GET   /videos/:video_id                     (metadata + manifest URL)
GET   /watch/:video_id/manifest.m3u8        (HLS / DASH manifest, signed)
GET   /watch/:video_id/seg_<bitrate>_<n>.ts (segment file, edge-cached)

POST  /videos/:video_id/view                (heartbeat, async)
POST  /videos/:video_id/like

Uploads use resumable, chunked PUTs so a creator on flaky wifi can resume after an interruption. The manifest URL is signed with a short TTL (5 minutes) so view tokens cannot be rehosted indefinitely.

Step 4 — High-Level Design#

  creator                                                      viewer
     │                                                            │
     ▼                                                            ▼
  Upload service ──→ Raw video blob store (S3-class) ──→ CDN (edge POPs)
     │                       │                                    ▲
     │                       ▼                                    │
     └→ metadata DB    Transcoding fleet (GPU + CPU workers)      │
                       (queue-driven, per-rendition fanout)       │
                             │                                    │
                             ▼                                    │
                       Transcoded blob store ──── replication ────┘
                             │
                             ▼
                       Manifest service ─────→ viewers' API gateway

Two largely independent pipelines:

Ingest (cold): upload → durable raw store → enqueue transcoding job → fan out to one job per rendition → write transcoded outputs → publish manifest → invalidate stale CDN entries.
Watch (hot): viewer hits CDN edge → segment served from edge cache → cache misses fall back to regional cache → regional misses fall back to origin (rare).

The CDN is the load-bearing piece for cost and latency. ~95% of bytes served must come from the edge, or the bandwidth bill alone becomes unmanageable.

Step 5 — Data Model#

Metadata (relational, sharded by video_id)

table videos
  video_id        uuid    PK
  uploader_id     uuid
  title           string
  description     string
  duration_s      int
  status          enum(uploading, transcoding, ready, blocked)
  visibility      enum(public, unlisted, private)
  created_at      timestamp
  renditions      list<{bitrate, codec, blob_uri, manifest_uri}>

Watch events (wide-column, partition by video_id, time bucket as cluster key)

table watch_events
  video_id        uuid     PK
  bucket_5min     timestamp CK
  viewer_id       uuid
  watched_seconds int
  geo             string

Counters for view-count, like-count are sharded (see /system-design/sharded-counters) — naively writing one row per view becomes a hot key for any trending video.

Blob stores are content-addressed when possible; the manifest references blobs by URI rather than by row.

Step 6 — Detailed Design#

The transcoding pipeline#

A single 4K upload of 60 minutes produces ~10 derived assets. Naively, that’s a 10-hour serial job. The fix is to chunk the master into ~10-second segments and transcode in parallel:

master.mp4 → segment_000.mp4 segment_001.mp4 ... segment_359.mp4   (60 min / 10 s)
                  │                │                    │
            worker_a          worker_b               worker_x
            (one job per (segment, rendition) pair)
                  ▼
            transcoded segments → assembled into HLS / DASH playlists

Each (segment, rendition) job is independent and fits on one GPU box. With a fleet of 5,000 workers, a 1-hour upload completes the highest-bitrate ladder in 2-3 minutes. The job queue is partitioned by upload to keep ordering inside one upload.

Adaptive bitrate (HLS / DASH)#

The manifest enumerates renditions by bitrate. The player measures download time of each segment and steps up or down between bitrates to keep the buffer healthy. Segment length (~6-10 seconds) trades off two things:

Shorter segments (2-4 s) — lower startup latency, finer-grained bitrate switching, but more manifest entries and more request overhead (HTTP setup per segment).

Longer segments (10-30 s) — higher CDN cache efficiency, fewer requests, but slower to switch bitrate when bandwidth changes and slower live-equivalent startup.

YouTube settled on ~5-10 second segments as a balance. Live streaming uses shorter segments at the cost of higher overhead.

CDN strategy#

The CDN has three tiers:

Edge POPs (hundreds globally) — serve the long tail of recently-played videos out of NVMe.
Regional caches (~20 globally) — back the edges; absorb cache misses.
Origin storage — the durable, replicated blob store; only ever touched on a regional miss.

A new upload is prewarmed only into regional caches (not edges) until it gets traction. A video that hits ~1,000 views starts replicating into edges via demand-based pull. This avoids pre-pushing 9 PB/day of mostly-unwatched content to thousands of POPs.

Watch-history and recommendations#

Watch events stream into Kafka, fan out to:

Recent-watch service (Redis, last 200 videos per user) for the “Continue watching” rail.
Offline training pipelines (Spark / Flink → feature store → recommender model) — see /system-design/ml-data-infrastructure.
Engagement counters (sharded counters → batch aggregation into the videos table).

The recommender itself is a two-stage system: a cheap candidate generator (vector ANN over a user embedding) followed by a heavier ranker (gradient-boosted trees or a neural ranker). Recommendations are pre-computed for each user every few hours and cached.

Hot-video protection#

A viral video can absorb 50% of regional cache traffic. Mitigations:

Edge caches use a LFU-with-decay policy so a rising video evicts other content faster.
Trending videos get explicit prewarm instructions to all edges in the region.
The manifest itself has long TTL (1 hour); only the segment URLs rotate (signed).

Latency budget (target p99 1 s startup)#

DNS + TLS (warm CDN connection):   30 ms
Manifest fetch from edge:          50 ms
First segment fetch (low bitrate): 200 ms
Player decode + first frame:       100 ms
Network jitter buffer:             100-300 ms
                          total:  ~500-700 ms typical, ~1 s p99

The trick to staying under 1 second is starting at the lowest bitrate and stepping up after the buffer fills. Players that try to start at the user’s negotiated bitrate add another 500-1500 ms.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: CDN egress cost. This is the dominant operational cost of the system, larger than storage or compute combined. Aggressive edge cache hit rates (target 95%+) and per-ISP peering agreements are where the cost battle lives.

Bottleneck #2: cold-start of the long tail. A video uploaded 3 years ago and then watched once today produces a regional cache miss and possibly an origin fetch — a 200 ms+ first-segment latency. Acceptable, but if it becomes common (e.g., a tweet linking to an obscure 2018 video that goes viral), it can momentarily stampede origin. Single-flight at the regional cache fixes the synchronous version; demand-prefetch handles the gradual version.

Bottleneck #3: transcoding queue depth during spikes. New Year’s Eve, election night, major sports — upload rates spike 5×. The fleet needs autoscaling on queue depth plus a graceful degradation strategy: defer the 4K rendition for ~30 minutes if the queue exceeds X, so 1080p (good enough for most viewers) ships immediately.

Alternative I’d push back on: transcoding once into a “master” rendition and re-transcoding on demand per viewer. Looks like a storage win, but per-view CPU cost dwarfs storage cost at YouTube scale — and watch latency becomes unbounded. Pre-transcode every rendition that has a meaningful audience; only sample exotic resolutions on demand.

What breaks first at 10× scale: object-storage metadata. A single bucket can hold trillions of objects, but the metadata layer (key listing, lifecycle scans) becomes the bottleneck long before the actual byte-storage layer. Shard by upload date and by uploader_id so per-bucket key counts stay under operational limits.

Companies this resembles#

YouTube, TikTok, Vimeo, Netflix VOD (technically much smaller catalog, far higher per-asset traffic), Twitch VODs.

Content Delivery Network — the substrate this design depends on entirely.
Blob Store — origin storage for raw and transcoded assets.
Distributed Task Scheduler — the transcoding job queue.
Instagram — same media-storage shape at a different aspect ratio.