Design the YouTube Streaming API

Upload pipeline, transcode, ABR manifests, CDN, recommendation. The biggest video service in the world, from the API's side.

System Advanced
20 min read
api-design video-streaming abr
Companies this resembles: YouTube · Google

Context#

YouTube is the canonical large-scale video question and an Advanced-tier prompt for a reason: most candidates conflate the HLD round with the API round and try to design every box on the page. The trick is to acknowledge the overlap up front and cut hard.

This writeup is an API-design round, not an HLD round. That means:

  • The transcode pipeline is a black box behind an async webhook. We don’t design the encoder fleet.
  • The CDN is a black box that returns a hostname; we don’t design eviction, peering, or fill paths.
  • The recommendation engine is a black box that returns a list of video IDs; we don’t design candidate generation, ranking, or the embedding store.
  • Monetisation, Content-ID copyright matching, comment moderation — all out of scope. The interviewer will respect a clear bound far more than a vague attempt to cover everything.

What remains is still a rich API surface:

  • A three-sub-API design — Upload, Playback, Recommendation — each with its own latency profile and contract.
  • A resumable multipart upload protocol that has to survive the user closing the laptop with 4 GB left to send.
  • A playback API that hands back signed URLs to ABR manifests (HLS / DASH) without leaking the underlying CDN topology.
  • A transcode-completion webhook: when transcoding finishes, a separate event is posted to the creator’s endpoint, decoupling the synchronous upload from the async work.
  • A per-region CDN selection the API has to make at request time using the client’s IP.

The interviewer’s hidden objectives, roughly in order:

  • Can you decompose a giant product into three clean sub-APIs without panic-merging them?
  • Can you produce a resumable-upload protocol that survives partial failures?
  • Can you separate the synchronous upload contract from the async transcode contract with a clear webhook seam?
  • Can you give a sensible playback latency budget (sub-2-second start-to-first-frame)?
  • Can you talk about CDN selection as an API concern without designing the CDN itself?

Requirements (functional and non-functional)#

Functional — in scope:

  • Initiate, perform, and complete a resumable upload of a video file up to 256 GB.
  • Trigger an async transcode on upload completion; deliver result via webhook to the uploader’s server (or app push to the mobile client).
  • Return a playback session for an authenticated viewer: ABR manifest URL, DRM token if required, geo-routed CDN host.
  • Return a per-channel recommendation list keyed on viewer + channel context.
  • Expose video metadata read endpoints (title, description, duration, available qualities, captions).
  • Range-request friendly download is explicitly not in scope for end users — the only legitimate read path is the ABR playback manifest.

Functional — out of scope:

  • Monetisation, ads-decisioning, partner programs.
  • Content-ID copyright matching (a separate background pipeline; not API-exposed).
  • Comment moderation API (separate workbook: see Comment Service API).
  • Live streaming. VOD only in this round.
  • The transcode pipeline’s internals (encoder fleet, codec selection, the ABR ladder construction).
  • The recommendation model itself; we expose a thin endpoint over a black-box ranker.

Non-functional:

  • Upload: support files up to 256 GB; resumable across network drops; chunks up to 64 MB.
  • Transcode latency: not user-facing; SLA is “transcode completes within 4x video duration p95”. Webhook fires within 5 s of completion.
  • Playback start: time-to-first-frame <= 2 s p95 from the moment the manifest URL is requested.
  • Manifest latency: manifest fetch <= 100 ms p95 (it’s a small text playlist for HLS or an XML document for DASH, fetched once per session).
  • Recommendation latency: <= 200 ms p95 (it’s a hot read at app open).
  • Throughput: 500k playback session opens per second globally; 50k uploads per second.
  • Availability: 99.99% on playback; 99.9% on upload (uploads are resumable so a brief outage is recoverable).
  • Durability: 11-nines on the original video bytes; transcode outputs are regeneratable.

Use case diagram#

          ┌─────────────────┐
          │  Creator (UA)   │
          └────────┬────────┘
    ┌──────────────┼──────────────┐
    ▼              ▼              ▼
[initiate   [upload parts]    [complete
 upload]                       upload]
    │              │              │
    └──────────────┴──────────────┘
          ┌─────────────────┐
          │   Upload API    │──── webhook ───► creator's server
          └─────────────────┘

          ┌─────────────────┐
          │   Viewer (UA)   │
          └────────┬────────┘
    ┌──────────────┼──────────────┐
    ▼              ▼              ▼
[get playback  [get rec.     [get video
 session]       list]         metadata]
    │              │              │
    └──────────────┴──────────────┘
          ┌─────────────────┐
          │   Playback /    │
          │  Recommend API  │
          └─────────────────┘

Two actors. Creator’s surface is the Upload API. Viewer’s surface is Playback + Recommendation + Metadata. The webhook from Upload back to the creator’s server is the seam between the sync upload contract and the async transcode work.

Class diagram#

┌──────────────────────────┐
│      UploadService       │
├──────────────────────────┤
│ initiateUpload(req)      │
│ putPart(session, idx, b) │
│ completeUpload(session)  │
│ cancelUpload(session)    │
└──────────────┬───────────┘
               │ creates
┌──────────────────────────┐
│      UploadSession       │
├──────────────────────────┤
│ id : UUID                │
│ video_id : str           │
│ owner_id : str           │
│ total_bytes : int        │
│ part_size : int (=64MB)  │
│ parts_received : [int]   │
│ state : enum             │
│ expires_at : timestamp   │
└──────────────────────────┘

┌──────────────────────────┐         ┌─────────────────────┐
│     PlaybackService      │         │   PlaybackSession   │
├──────────────────────────┤         ├─────────────────────┤
│ openSession(vid, viewer) │ returns │ manifest_url : str  │
│ heartbeat(session)       │────────►│ drm_token? : str    │
│ closeSession(session)    │         │ cdn_host : str      │
└──────────────────────────┘         │ ttl_seconds : int   │
                                     │ session_id : UUID   │
                                     └─────────────────────┘

┌──────────────────────────┐         ┌─────────────────────┐
│  RecommendationService   │         │    VideoSummary     │
├──────────────────────────┤         ├─────────────────────┤
│ forChannel(ch, viewer)   │ returns │ id, title           │
│ forHomepage(viewer)      │────────►│ duration_seconds    │
│ relatedTo(video_id)      │         │ thumbnail_url       │
└──────────────────────────┘         │ channel_id          │
                                     └─────────────────────┘

Three services, each owning one sub-API. The UploadSession is the only resource with non-trivial state (the others are read-mostly). PlaybackSession is short-lived (TTL on the order of an hour) and the manifest_url is signed so it can’t be re-used out of context.

Sequence diagram (key flows)#

Flow 1: resumable upload.

Creator UploadAPI BlobStore TranscodeQueue
│ POST /videos:initiateUpload │ │
│ { file_size, mime, title } │ │
│─────────────────────────────►│ │
│ 201 + session, part_size │ │
│◄─────────────────────────────│ │
│ │ │
│ PUT /uploads/{s}/parts/0 │ │
│ (64MB chunk) │ │
│─────────────────────────────►│ stage part 0 │
│ │─────────────────►│ (no)
│ 204 + etag │ │
│◄─────────────────────────────│ │
│ │ │
│ ... PUT part 1, 2, ... N-1 ... │
│ │ │
│ POST /uploads/{s}:complete │ │
│─────────────────────────────►│ finalize blob │
│ │ enqueue transcode│
│ │─────────────────►│
│ 202 Accepted + video_id │ │
│◄─────────────────────────────│ │
│ │
│ (some time later) │
│ ◄── transcode done│
│ POST {creator_webhook_url} │ │
│ { video_id, status: ready } │ │
│◄─────────────────────────────│ │

Note the 202 Accepted on completion — the bytes are durable but the video is not yet playable. The webhook is the contract that says “now it is”.
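The part loop in Flow 1 can be planned up front on the client. The following is a minimal sketch, assuming the 64 MB part size from the requirements is 64 MiB (which matches the 67108864 figure in the schema excerpt); plan_parts is a hypothetical helper name, not part of the API.

```python
from typing import Iterator, Tuple

PART_SIZE = 64 * 1024 * 1024  # assumed 64 MiB, matching part_size_bytes = 67108864


def plan_parts(total_bytes: int, part_size: int = PART_SIZE) -> Iterator[Tuple[int, int, int]]:
    """Yield (idx, offset, length) for every part of the file, in upload order."""
    if total_bytes < 1 or part_size < 1:
        raise ValueError("total_bytes and part_size must be positive")
    # Integer ceiling division avoids float rounding on very large files.
    total_parts = (total_bytes + part_size - 1) // part_size
    for idx in range(total_parts):
        offset = idx * part_size
        # Every part is full-size except possibly the last one.
        length = min(part_size, total_bytes - offset)
        yield idx, offset, length


if __name__ == "__main__":
    largest = 256 * 1024 ** 3  # the 256 GB ceiling, read as 256 GiB
    print(sum(1 for _ in plan_parts(largest)))  # number of PUT calls for the largest file
```

At the 256 GB ceiling this works out to 4096 parts, which is why the session has to track parts_received rather than a single byte offset.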

Flow 2: viewer playback.

Viewer PlaybackAPI SignedURL CDN
│ POST /videos/{id}/playback │ │
│ (auth: bearer token) │ │
│─────────────────────────────►│ │
│ geo-pick CDN from client IP │ │
│ mint signed manifest URL │ │
│ ────────────────────────────►│ │
│ signed URL │ │
│◄─────────────────────────────│ │
│ 200 + manifest_url, ttl │ │
│◄─────────────────────────────│ │
│ │ │
│ GET {manifest_url} │
│ ────────────────────────────────────────────────►│
│ HLS / DASH manifest │
│◄────────────────────────────────────────────────│
│ │ │
│ GET segment .ts / .m4s ... │
│ ────────────────────────────────────────────────►│
│ ABR client picks rendition per bandwidth │

The PlaybackAPI does the CDN selection — choosing among iad.cdn.youtube.example, fra.cdn.youtube.example, sin.cdn.youtube.example etc. based on the client IP’s GeoIP lookup. The signed URL ties the manifest to that specific CDN host plus the session ID, so it can’t be replayed from another region.
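One way to bind the manifest URL to a CDN host and session is an HMAC over those fields plus an expiry. This is a sketch under stated assumptions, not the actual scheme: the key, the /v/{video_id}/master.m3u8 path layout, and the sid / exp / sig query parameter names are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical secret shared with the CDN edge


def _mac(host: str, path: str, session_id: str, expires_at: int) -> str:
    # The host is part of the signed message, so moving the URL to another
    # regional CDN host invalidates the signature.
    msg = "\n".join([host, path, session_id, str(expires_at)]).encode()
    return hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()


def sign_manifest_url(cdn_host: str, video_id: str, session_id: str,
                      now: int, ttl_seconds: int = 3600) -> str:
    path = f"/v/{video_id}/master.m3u8"  # hypothetical manifest path layout
    expires_at = now + ttl_seconds
    query = urlencode({
        "sid": session_id,
        "exp": expires_at,
        "sig": _mac(cdn_host, path, session_id, expires_at),
    })
    return f"https://{cdn_host}{path}?{query}"


def verify_manifest_url(url: str, now: int) -> bool:
    """What the CDN edge would check before serving the manifest."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        sid, exp, sig = query["sid"][0], int(query["exp"][0]), query["sig"][0]
    except (KeyError, ValueError):
        return False
    if now >= exp:
        return False  # TTL elapsed
    expected = _mac(parts.netloc, parts.path, sid, exp)
    return hmac.compare_digest(expected.encode(), sig.encode())
```

Passing the clock in as `now` keeps the sketch deterministic; a real service would read it from the system clock on both sides and allow for some skew.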

Activity diagram (for non-trivial state)#

The UploadSession state machine is the part of this design with real lifecycle logic. Everything else is request/response.

   [initiateUpload]
  ┌────────────────┐
  │    PENDING     │── 24h idle ─► EXPIRED
  └────────┬───────┘
           │ first putPart succeeds
  ┌────────────────┐
  │  IN_PROGRESS   │── 7d idle ─► EXPIRED
  └────────┬───────┘
           │ completeUpload
  ┌────────────────┐
  │   FINALIZING   │── blob fail ─► FAILED
  └────────┬───────┘
           │ blob ok, transcode enqueued
  ┌────────────────┐
  │  TRANSCODING   │── transcode fail ─► FAILED
  └────────┬───────┘
           │ transcode ok
  ┌────────────────┐
  │     READY      │
  └────────────────┘

  [from any state, except READY]
           │ cancelUpload
   ┌──────────────┐
   │  CANCELLED   │
   └──────────────┘

A few invariants the API enforces around these states:

  • putPart is rejected with 409 Conflict unless state is PENDING or IN_PROGRESS.
  • completeUpload is idempotent — calling it twice on a FINALIZING / TRANSCODING / READY session returns the same video_id (with current status). This matters because clients retry the completion call when the response gets lost.
  • The transition from TRANSCODING to READY is what fires the webhook. The webhook payload includes the final video_id, the available renditions (e.g. [240p, 360p, 480p, 720p, 1080p, 1440p, 2160p]), and the manifest URL template.
  • A FAILED state is non-terminal in the sense that the creator can re-trigger transcode (a different operation), but the upload itself can’t be revived.
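The first two invariants can be sketched as guards on an in-memory session object. This is an illustration only: the Conflict exception stands in for the 409 response, the class layout is hypothetical, and the sketch does not model expiry, blob finalization, or whether every part has arrived before completion.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    FINALIZING = "FINALIZING"
    TRANSCODING = "TRANSCODING"
    READY = "READY"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class UploadSession:
    def __init__(self, video_id: str) -> None:
        self.video_id = video_id
        self.state = State.PENDING
        self.parts_received = set()

    def put_part(self, idx: int) -> None:
        if self.state not in (State.PENDING, State.IN_PROGRESS):
            raise Conflict(f"putPart not allowed in {self.state.value}")
        self.parts_received.add(idx)  # re-sending the same idx is a no-op
        self.state = State.IN_PROGRESS

    def complete(self) -> dict:
        # Idempotent replay: a retried completion returns the same video_id
        # together with whatever the current status is.
        if self.state in (State.FINALIZING, State.TRANSCODING, State.READY):
            return {"video_id": self.video_id, "status": self.state.value}
        if self.state is not State.IN_PROGRESS:
            raise Conflict(f"completeUpload not allowed in {self.state.value}")
        self.state = State.FINALIZING
        return {"video_id": self.video_id, "status": self.state.value}

    def cancel(self) -> None:
        if self.state is State.READY:
            raise Conflict("a READY video cannot be cancelled")
        self.state = State.CANCELLED
```

Keeping the idempotency check ahead of the state guard is the design choice that matters: a retried completion must never be answered with a 409.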

API implementation#

Endpoint catalogue#

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /v1/videos:initiateUpload | Begin a resumable upload; returns session |
| PUT | /v1/uploads/{session}/parts/{idx} | Upload one part (up to 64 MB) |
| GET | /v1/uploads/{session} | Query session state (resume-from inspection) |
| POST | /v1/uploads/{session}:complete | Finalize; triggers async transcode |
| DELETE | /v1/uploads/{session} | Cancel an in-progress upload |
| GET | /v1/videos/{id} | Video metadata (title, duration, thumbnails) |
| POST | /v1/videos/{id}/playback | Open a playback session; returns manifest URL |
| GET | /v1/channels/{id}/recommendations | Per-channel recs for an authenticated viewer |
| GET | /v1/recommendations/home | Homepage recs for an authenticated viewer |
| GET | /v1/videos/{id}/related | Related videos for an open watch session |

Ten endpoints across three sub-APIs. The :complete and :initiateUpload paths use Google’s verb-suffix convention (the colon-verb form) — appropriate here since these aren’t pure REST operations on a videos collection; they are state transitions.

OpenAPI schema (excerpt)#

OpenAPI 3.1 — YouTube Streaming API (core endpoints)
paths:
  /v1/videos:initiateUpload:
    post:
      operationId: initiateUpload
      security: [{ bearerAuth: [upload.write] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [file_size_bytes, mime_type, title]
              properties:
                file_size_bytes:
                  type: integer
                  minimum: 1
                  maximum: 274877906944   # 256 GiB
                mime_type:
                  type: string
                  enum: [video/mp4, video/quicktime, video/webm, video/x-matroska]
                title: { type: string, maxLength: 100 }
                description: { type: string, maxLength: 5000 }
                # OpenAPI 3.1 expresses nullability through the type array.
                webhook_url: { type: [string, 'null'], format: uri }
      responses:
        '201':
          description: Upload session created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/UploadSession'
        '400': { description: Invalid request }
        '413': { description: File too large }
  /v1/uploads/{session}/parts/{idx}:
    put:
      operationId: uploadPart
      security: [{ bearerAuth: [upload.write] }]
      parameters:
        - { name: session, in: path, required: true, schema: { type: string, format: uuid } }
        - { name: idx, in: path, required: true, schema: { type: integer, minimum: 0 } }
        - name: Content-Length
          in: header
          required: true
          schema: { type: integer, maximum: 67108864 }   # 64 MiB
      requestBody:
        required: true
        content:
          application/octet-stream:
            schema: { type: string, format: binary }
      responses:
        '204':
          description: Part stored
          headers:
            ETag: { schema: { type: string } }
        '409': { description: Session not in PENDING or IN_PROGRESS state }
        '416': { description: Part index out of range }
  /v1/videos/{id}/playback:
    post:
      operationId: openPlayback
      security: [{ bearerAuth: [playback.read] }]
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                preferred_format:
                  type: string
                  enum: [hls, dash, auto]
                  default: auto
                client_capabilities:
                  type: object
                  properties:
                    max_resolution: { type: string, enum: [480p, 720p, 1080p, 1440p, 2160p] }
                    codecs: { type: array, items: { type: string } }
      responses:
        '200':
          description: Playback session opened
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PlaybackSession'
        '404': { description: Video not found or not ready }
        '451': { description: Unavailable in viewer's region }
components:
  schemas:
    UploadSession:
      type: object
      required: [id, part_size_bytes, expires_at, parts_url_template]
      properties:
        id: { type: string, format: uuid }
        video_id: { type: string }
        part_size_bytes: { type: integer, default: 67108864 }
        total_parts: { type: integer }
        expires_at: { type: string, format: date-time }
        parts_url_template:
          type: string
          example: "https://api.youtube.example/v1/uploads/{session}/parts/{idx}"
        state:
          type: string
          enum: [PENDING, IN_PROGRESS, FINALIZING, TRANSCODING, READY, FAILED, CANCELLED]
    PlaybackSession:
      type: object
      required: [session_id, manifest_url, ttl_seconds]
      properties:
        session_id: { type: string, format: uuid }
        manifest_url: { type: string, format: uri }
        manifest_format: { type: string, enum: [hls, dash] }
        cdn_host: { type: string }
        ttl_seconds: { type: integer, example: 3600 }
        drm_token: { type: [string, 'null'] }

Webhook contract for transcode completion#

When the transcode pipeline finishes a video, the Upload API posts a small JSON document to the webhook_url the creator supplied at upload time. The shape is intentionally minimal:

Webhook POST body
{
  "event": "video.ready",
  "video_id": "v_8h2N9c0qK",
  "session_id": "01HFY3...",
  "renditions": ["240p", "360p", "480p", "720p", "1080p"],
  "manifest_url_template": "https://api.youtube.example/v1/videos/v_8h2N9c0qK/playback",
  "ts": "2026-05-30T17:42:11Z",
  "signature": "sha256=..."
}

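The signature is HMAC-SHA256 over the body, computed with the signature field itself excluded, using a secret shared with the creator at app registration. This is the same shared-secret pattern as Stripe and GitHub webhooks, although both of those carry the signature in a request header rather than in the body. Delivery is at-least-once with exponential backoff up to 24 hours. A video.failed event with an error code is the alternative terminal event.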
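On the creator's side, a receiver has to verify the signature and tolerate duplicate deliveries. The sketch below is one possible shape and rests on assumptions the contract does not spell out: that the HMAC covers a canonical JSON encoding (sorted keys, compact separators) of the payload without its signature field, and that (event, video_id) is a sufficient dedupe key. WebhookReceiver is a hypothetical name.

```python
import hashlib
import hmac
import json


def _canonical(payload: dict) -> bytes:
    # Assumed canonical form: every field except "signature", keys sorted.
    unsigned = {k: v for k, v in payload.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode()


def sign(payload: dict, secret: bytes) -> str:
    digest = hmac.new(secret, _canonical(payload), hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify(payload: dict, secret: bytes) -> bool:
    supplied = str(payload.get("signature", ""))
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sign(payload, secret).encode(), supplied.encode())


class WebhookReceiver:
    """Verifies each delivery and drops repeats of an already-processed event."""

    def __init__(self, secret: bytes) -> None:
        self.secret = secret
        self.seen = set()  # in-memory only; a real receiver would persist this

    def handle(self, payload: dict) -> str:
        if not verify(payload, self.secret):
            return "rejected"
        key = (payload["event"], payload["video_id"])
        if key in self.seen:
            return "duplicate"  # at-least-once delivery means this is normal
        self.seen.add(key)
        return "processed"
```

Returning "duplicate" rather than an error matters in practice: the sender should see a success response for a repeat, otherwise it keeps retrying for the full backoff window.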

Client samples#

The end-to-end “initiate, upload parts, complete” flow, shown here in Python.

Resumable upload — Python
import os

import requests

API = "https://api.youtube.example"
TOKEN = "Bearer eyJhbGciOi..."


def initiate(file_path, title):
    """Open an upload session sized to the file on disk."""
    size = os.path.getsize(file_path)
    resp = requests.post(
        f"{API}/v1/videos:initiateUpload",
        json={"file_size_bytes": size, "mime_type": "video/mp4", "title": title},
        headers={"Authorization": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def upload_part(session_id, idx, chunk):
    """PUT one part and return the raw response; the caller checks the status."""
    return requests.put(
        f"{API}/v1/uploads/{session_id}/parts/{idx}",
        data=chunk,
        headers={"Authorization": TOKEN, "Content-Type": "application/octet-stream"},
        timeout=120,
    )


def complete(session_id):
    """Finalize the upload; the server answers 202 and transcodes asynchronously."""
    resp = requests.post(
        f"{API}/v1/uploads/{session_id}:complete",
        headers={"Authorization": TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def upload_video(file_path, title):
    session = initiate(file_path, title)
    sid = session["id"]
    part_size = session["part_size_bytes"]
    with open(file_path, "rb") as f:
        idx = 0
        while True:
            # Read one part at a time so memory use stays at part_size.
            chunk = f.read(part_size)
            if not chunk:
                break
            r = upload_part(sid, idx, chunk)
            r.raise_for_status()
            idx += 1
    return complete(sid)
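The sample above always starts from part 0. A resuming client would first read the session state and send only what is missing. A minimal sketch, assuming GET /v1/uploads/{session} returns total_parts, part_size_bytes, and a parts_received list (the last appears in the class diagram but not in the schema excerpt); missing_parts and resume_upload are hypothetical helper names, and put_part is supplied by the caller so the sketch makes no network calls itself.

```python
from typing import Callable, Iterable, List


def missing_parts(total_parts: int, parts_received: Iterable[int]) -> List[int]:
    """Indices the server has not stored yet, in ascending order."""
    received = set(parts_received)
    return [idx for idx in range(total_parts) if idx not in received]


def resume_upload(file_path: str, session: dict,
                  put_part: Callable[[int, bytes], None]) -> List[int]:
    """Re-send only the missing parts; returns the indices that were sent.

    `session` is the decoded body of the session-state query, and
    `put_part(idx, chunk)` wraps the PUT call (for example, upload_part
    with the session id already bound).
    """
    part_size = session["part_size_bytes"]
    todo = missing_parts(session["total_parts"], session.get("parts_received", []))
    with open(file_path, "rb") as f:
        for idx in todo:
            # Parts are fixed-size, so the byte offset follows from the index.
            f.seek(idx * part_size)
            put_part(idx, f.read(part_size))
    return todo
```

Because putPart is idempotent on (session, idx), re-sending a part the server already has is harmless; skipping it just saves bandwidth.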

Latency budget — playback start#

Playback start (POST /v1/videos/{id}/playback through first decoded frame) breaks down as:

| Phase | Budget | Notes |
| --- | --- | --- |
| TLS / HTTP setup | 30 ms | Warm connection from app start |
| Auth + entitlement check | 15 ms | JWT verify + per-region availability check |
| GeoIP lookup + CDN pick | 5 ms | In-process MaxMind-shaped DB |
| Sign manifest URL | 5 ms | HMAC over (video_id, viewer_id, cdn_host, ttl) |
| Manifest fetch (separate request) | 100 ms | From CDN edge, cached |
| Player decision (rendition pick) | 50 ms | Client-side |
| First segment fetch | 300 ms | 2-second segment from edge |
| Decode + paint | 50 ms | Hardware decoder warm-up |
| Total | 555 ms | Well under the 2 s p95 target |

The 2 s target has 1.4 s of margin for the long tail (cold CDN cache miss on the first segment, retry of a dropped TCP connection, DNS lookup on a fresh app install).
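The arithmetic behind the total and the margin can be checked directly. The phase names and numbers below are copied from the budget table; the variable names are illustrative.

```python
# Per-phase playback-start budget in milliseconds, as listed in the table.
PLAYBACK_BUDGET_MS = {
    "TLS / HTTP setup": 30,
    "Auth + entitlement check": 15,
    "GeoIP lookup + CDN pick": 5,
    "Sign manifest URL": 5,
    "Manifest fetch": 100,
    "Player decision": 50,
    "First segment fetch": 300,
    "Decode + paint": 50,
}
TARGET_MS = 2000  # the 2 s p95 time-to-first-frame target

# The first four phases happen inside the playback API call itself.
API_PHASES = list(PLAYBACK_BUDGET_MS)[:4]

total_ms = sum(PLAYBACK_BUDGET_MS.values())
api_ms = sum(PLAYBACK_BUDGET_MS[name] for name in API_PHASES)
margin_ms = TARGET_MS - total_ms

if __name__ == "__main__":
    print(f"total={total_ms} ms, api share={api_ms} ms, margin={margin_ms} ms")
    # prints: total=555 ms, api share=55 ms, margin=1445 ms
```

The split is worth stating in the round: only 55 ms of the budget is under the API's direct control, and the first segment fetch alone is more than half of the total.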

Recommendation latency budget#

| Phase | Budget |
| --- | --- |
| Auth + cache key build | 5 ms |
| Cache lookup (per-viewer top-N) | 5 ms |
| Black-box ranker RPC | 120 ms p95 |
| Hydrate VideoSummary for top 50 | 40 ms |
| Serialize + transport | 20 ms |
| Total | 190 ms |

The ranker is the dominant cost and is out of the API’s control — the API contract is forChannel(channel_id, viewer_id) -> [video_id, score, reason_tag][]. If the model team needs more time per call, they have to push it into batch precomputation, not extend the synchronous budget.
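One way for the API to hold the synchronous budget is to put a hard deadline on the ranker call and fall back to a precomputed list when it is missed. This is a sketch, not the design's stated mechanism: the fallback list, the thread-pool approach, and the name recommend_with_deadline are assumptions, and the ranker is modelled as a plain callable.

```python
import concurrent.futures as cf
from typing import Callable, List, Sequence

# Shared worker pool for ranker calls; the size is an arbitrary choice for the sketch.
_POOL = cf.ThreadPoolExecutor(max_workers=8)


def recommend_with_deadline(
    ranker: Callable[[str, str], List[str]],
    fallback: Sequence[str],
    channel_id: str,
    viewer_id: str,
    deadline_s: float = 0.120,  # the ranker's 120 ms share of the budget
) -> List[str]:
    """Return the ranker's result, or the precomputed fallback if it is late."""
    future = _POOL.submit(ranker, channel_id, viewer_id)
    try:
        return future.result(timeout=deadline_s)
    except cf.TimeoutError:
        # Best effort only: cancel() cannot interrupt a call that is already running.
        future.cancel()
        return list(fallback)
```

The late call still occupies a worker until it returns, so a production version would also need load shedding; the point here is only that the caller's latency is bounded by the deadline rather than by the ranker.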

Trade-offs and extensions#

| Decision | Why | Cost if requirements change |
| --- | --- | --- |
| Resumable multipart (vs single PUT) | Files up to 256 GB; cannot retry from scratch on drop | More client logic, more session state to track |
| 64 MB part size | Tradeoff between TCP-window-fill and retry cost | Smaller parts on mobile would mean more round-trips |
| Webhook for transcode completion | Decouples sync upload from async work | Webhook delivery is at-least-once, creators must dedupe |
| Signed manifest URL with TTL | Prevents URL sharing / hot-linking | Re-signing on session refresh adds 1 RTT |
| CDN selection in PlaybackAPI | API can route around regional CDN issues | API now owns the GeoIP database refresh |
| Separate HLS + DASH manifests | Apple devices want HLS; everything else DASH | Cost of running two transcode output formats |
| :complete is idempotent | Network drops mid-completion are routine | Server must store the result keyed on session id |
| Recommendation as a thin pass-through | Decouples API from ranker iteration | API can’t add cross-cutting concerns (e.g. blocklist) without becoming smart |
| No range-request download endpoint | Discourages scraping, simplifies licensing | Power users who want offline cannot DIY; needs a dedicated mobile API |

Likely follow-up extensions and how the API absorbs them:

  • Live streaming. A new sub-API with POST /v1/streams:create, PUT /v1/streams/{id}/ingest, manifest pointing to a moving window. Different latency profile (sub-3 s glass-to-glass), different player. Cleanest as a sibling API, not a flag on the VOD path.
  • DVR for live. Live + lookback within a 4-hour window. The manifest grows; the API contract is unchanged.
  • Subtitles / captions. A GET /v1/videos/{id}/captions endpoint returning a list of {lang, url, format} entries. WebVTT + TTML.
  • Watch history sync. PlaybackService already has heartbeat(session) in the class diagram, which would surface as something like POST /v1/videos/{id}/playback/heartbeat; a richer write side would persist watch progress to a separate user-data service.
  • Family content / age gating. A claim on the JWT (age_band) gates /playback. The API responds 451 Unavailable for Legal Reasons rather than hiding the video.
  • Adaptive ABR ladder. Today the renditions are a fixed [240p…2160p] ladder; per-content ladders (“only encode what’s worth encoding”) need a wider transcode-result schema but no playback API change.

Mock interview follow-ups#

  • “What happens if the creator’s webhook endpoint is down when transcode completes?” — At-least-once retry with exponential backoff over 24 hours; the creator can also poll GET /v1/uploads/{session} or GET /v1/videos/{id} to learn state. The creator dashboard does the same.
  • “How does the client know to use HLS vs DASH?” — Client passes preferred_format (or auto); server picks based on User-Agent heuristics if auto. iOS / tvOS always get HLS by App Store policy.
  • “How do you prevent someone from leeching the manifest URL?” — HMAC signature over (video_id, viewer_id_hash, cdn_host, expires_at); TTL of 1 hour; CDN refuses signature mismatch. Premium content uses Widevine / FairPlay DRM tokens on top.
  • “How does the upload survive a network drop?” — Client polls GET /v1/uploads/{session} to learn parts_received; resumes from the first missing part. Each putPart is independently retriable since they’re idempotent on (session, idx).
  • “What’s the recommendation endpoint’s cache strategy?” — Per-viewer cached for 5 minutes for homepage; not cached for relatedTo(video_id) since context shifts every video. Cache key is (viewer_id, surface, locale, device_class).
  • “How do you handle a viewer in a region where the video is geo-blocked?” The /playback endpoint returns 451. The video metadata endpoint also reflects unavailability so the client doesn’t even try to play.
  • “What about hot-linked thumbnails?” — Thumbnails use a separate signed-URL scheme; same HMAC pattern, longer TTL. The CDN sees the signature and rejects unauthorised requests at the edge.
  • “At 10x scale, what breaks first?” — The synchronous part of the recommendation endpoint. We’d move from on-request ranking to precomputed per-user candidate sets with online re-ranking on a smaller candidate pool, keeping the API contract identical.
  • “How does the API contract evolve when we add a new codec (e.g. AV1)?” — Additive only. The client_capabilities.codecs field already exists; the server returns an AV1-bearing manifest when both client capability and codec availability match. Old clients keep getting H.264 / VP9 manifests unchanged.
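The webhook answer above relies on exponential backoff bounded by a 24-hour window. A minimal sketch of such a schedule follows; the 1 s base, the doubling factor, and the 1-hour cap per delay are hypothetical parameters (the design only fixes the 24-hour window), and jitter is left out for clarity.

```python
from typing import List


def backoff_schedule(
    base_s: float = 1.0,
    factor: float = 2.0,
    cap_s: float = 3600.0,
    window_s: float = 24 * 3600.0,
) -> List[float]:
    """Delays (in seconds) to wait before each retry.

    Each delay grows by `factor` until it reaches `cap_s`; the schedule stops
    once the next wait would push the cumulative time past `window_s`.
    """
    delays: List[float] = []
    elapsed = 0.0
    delay = base_s
    while elapsed + delay <= window_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap_s)
    return delays


if __name__ == "__main__":
    schedule = backoff_schedule()
    print(len(schedule), "retries over", sum(schedule), "seconds")
```

Capping the per-retry delay keeps retries flowing late in the window, so an endpoint that recovers after several hours is not left waiting most of a day for the next attempt.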

Single /videos endpoint covering upload and playback. Common candidate mistake. Upload is a write-heavy state machine with multi-GB request bodies and a multi-second SLA. Playback is a read-heavy session open that sits inside a 2 s time-to-first-frame budget, with only about 55 ms of that available to the API call itself. They share nothing operationally; merging them makes one budget impossible to honour.

Three sub-APIs with explicit seams. Upload owns the resumable protocol and the session state machine; the webhook is the only thing it pushes downstream. Playback owns the signed-URL + CDN pick. Recommendation is a thin pass-through. Each can scale, cache, fail, and version independently.
