Design the Zoom API

Meetings, participants, signalling, recording, webhooks. The video-conferencing API behind a pandemic-era infrastructure.

System Advanced
20 min read
api-design webrtc webhooks
Companies this resembles: Zoom

Context#

Zoom is the canonical “video conferencing API” question and an Advanced-tier prompt because the surface invites scope creep into territory that has nothing to do with API design. The actual interesting media work — encoding, jitter buffer, congestion control, SFU forwarding — happens inside the media-layer architecture, not at the API boundary.

This writeup is an API-design round, not an HLD round. That means:

  • The SFU (Selective Forwarding Unit) that fans out RTP packets between participants is a black box. We do not design it.
  • The media transport (UDP / WebRTC / SRTP) is a black box. The API hands clients credentials to connect to the media plane; the rest is out of scope.
  • The jitter buffer, FEC, simulcast, SVC are media-layer concerns, not API concerns.
  • Whiteboarding, breakout rooms, polls, captions — out of scope for one round. They’re sibling APIs that share the meeting object.

What remains is the right altitude for an API-design round:

  • REST surface for scheduling, participant management, recording control, integration management.
  • WebSocket signalling for the connection-establishment dance and live meeting events.
  • Webhooks as the asynchronous notification channel for off-meeting events (started / ended / recording.completed).
  • OAuth 2 as the third-party-integration auth model, with an admin-managed app marketplace.
  • A clean boundary between what the API does (resource management, signalling handshake, lifecycle webhooks) and what the SFU does (forward packets).

The interviewer’s hidden objectives, roughly in order:

  • Can you draw a clean seam between the API plane and the media plane?
  • Can you design a scheduled-meeting object with the right level of state (scheduled / live / ended) without sliding into the SFU?
  • Can you treat webhooks as a first-class API surface, with signed payloads, retry semantics, and an event taxonomy?
  • Can you reason about OAuth scopes for an app marketplace (meeting.write, recording.read, webhook.write)?
  • Can you handle the recording lifecycle (start, stop, processing, available) with a webhook contract that doesn’t tie callers to a polling pattern?

Requirements (functional and non-functional)#

Functional — in scope:

  • Schedule a meeting with start time, duration, host, optional passcode, optional waiting room.
  • Update / cancel a scheduled meeting.
  • List meetings for an authenticated user or workspace.
  • Get meeting details, including the live-state if running.
  • Join token issuance — credentials a client uses to authenticate to the SFU.
  • List participants of a live or recently-ended meeting.
  • Remove or mute a participant (host action via API).
  • Start / stop recording (cloud recording), get the result URL on completion.
  • Webhooks for meeting.started, meeting.ended, participant.joined, participant.left, recording.completed, recording.failed.
  • OAuth 2 for third-party apps; scopes per API category.

Functional — out of scope:

  • The media-plane architecture (SFU, codecs, jitter buffer, congestion control).
  • Phone-bridge / SIP / H.323 endpoints.
  • Whiteboarding, polls, breakout rooms, in-meeting chat, captions, Zoom Rooms.
  • Webinar-specific features (registration funnels, Q&A panels, practice sessions).
  • Marketplace billing for paid apps.
  • Voice-call recording transcription (the recording webhook gives you a media URL; transcription is a sibling service).

Non-functional:

  • Meeting CRUD latency: <= 200 ms p95 for schedule / read / update.
  • Join-token issuance: <= 100 ms p95 (it gates the participant joining).
  • Participant list freshness: <= 5 s after a join/leave event.
  • Webhook delivery: at-least-once, with exponential backoff over 24 hours; first delivery attempt within 5 s of the underlying event.
  • Throughput: 50k concurrent live meetings; 500k API calls/sec at peak. Most calls are read-mostly (clients polling meeting state).
  • Availability: 99.95% on the management surface; the media plane has a separate SLO managed by the SFU service.
  • Webhook delivery: 99.9% within 24 hours.

Use case diagram#

               ┌─────────────────────┐
               │     Host (user)     │
               └──────────┬──────────┘
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
   [schedule]        [start/end]   [record start/stop]
        │                 │                 │
        └─────────────────┴─────────────────┘
               ┌─────────────────────┐
               │      Zoom API       │
               └──────────┬──────────┘
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
   [WS signalling]   [webhooks]         [REST]
          │               │               │
          ▼               ▼               ▼
     ┌─────────┐    ┌──────────┐    ┌──────────┐
     │ Client  │    │ 3rd-party│    │ Client   │
     │ (joins  │    │ app      │    │ (mgmt UI)│
     │ media   │    │          │    │          │
     │ plane)  │    │          │    │          │
     └─────────┘    └──────────┘    └──────────┘
               ┌─────────────────────┐
               │  Media plane (SFU)  │ ◄── out of API scope
               └─────────────────────┘

Three surfaces (REST, WebSocket, webhooks). One actor (host or attendee). The media plane sits beneath the line — clients connect to it directly using credentials minted by the API.

Class diagram#

┌──────────────────────────┐
│      MeetingService      │
├──────────────────────────┤
│ scheduleMeeting(req)     │
│ updateMeeting(id, req)   │
│ cancelMeeting(id)        │
│ getMeeting(id)           │
│ listMeetings(filters)    │
│ issueJoinToken(id, uid)  │
└──────────────┬───────────┘
┌──────────────────────────┐
│         Meeting          │
├──────────────────────────┤
│ id : str                 │
│ host_id : str            │
│ topic : str              │
│ start_time : timestamp   │
│ duration_minutes : int   │
│ passcode? : str          │
│ waiting_room : bool      │
│ state : enum             │
│ join_url : str           │
└──────────────────────────┘
┌──────────────────────────┐         ┌─────────────────────┐
│    ParticipantService    │         │     Participant     │
├──────────────────────────┤ returns ├─────────────────────┤
│ listParticipants(mtg_id) │────────►│ user_id / guest_id  │
│ removeParticipant(id,pid)│         │ display_name        │
│ muteParticipant(id, pid) │         │ joined_at           │
│ updateRole(id, pid, r)   │         │ left_at?            │
└──────────────────────────┘         │ role : host|co|att  │
                                     │ audio / video state │
                                     └─────────────────────┘
┌──────────────────────────┐         ┌─────────────────────┐
│     RecordingService     │         │      Recording      │
├──────────────────────────┤ returns ├─────────────────────┤
│ startRecording(mtg_id)   │────────►│ id, meeting_id      │
│ stopRecording(mtg_id)    │         │ state : enum        │
│ getRecording(rec_id)     │         │ files[]: { url,     │
│ listRecordings(mtg_id)   │         │   type, size, ts }  │
└──────────────────────────┘         └─────────────────────┘
┌──────────────────────────┐         ┌─────────────────────┐
│      WebhookService      │         │     WebhookSub      │
├──────────────────────────┤         ├─────────────────────┤
│ subscribe(req)           │         │ id, url             │
│ unsubscribe(id)          │         │ events[]            │
│ listSubs()               │         │ secret              │
│ rotateSecret(id)         │         │ active : bool       │
└──────────────────────────┘         └─────────────────────┘

Four services. Meeting is the central resource; everything else is keyed off meeting_id. The WebhookSub is a config-resource — created by the marketplace app on installation, lives in the app’s settings.

Sequence diagram (key flows)#

Flow 1: scheduling and joining.

Host                         ZoomAPI                            Attendee                       SFU
│ POST /v1/meetings             │                                   │                           │
│ { topic, start_time, ... }    │                                   │                           │
│──────────────────────────────►│                                   │                           │
│ 201 + meeting { id, join_url} │                                   │                           │
│◄──────────────────────────────│                                   │                           │
│ (out-of-band: share join_url) │                                   │                           │
│ ─────────────────────────────────────────────────────────────────►│                           │
│                               │                                   │                           │
│                               │                                   │ GET join_url              │
│                               │                                   │ (browser/app)             │
│                               │ POST /v1/meetings/{id}/joinToken  │                           │
│                               │◄──────────────────────────────────│                           │
│                               │ 200 + { jwt, sfu_url }            │                           │
│                               │──────────────────────────────────►│                           │
│                               │                                   │                           │
│                               │                                   │ WS to sfu_url with jwt    │
│                               │                                   │──────────────────────────►│
│                               │                                   │ media flows in SFU plane  │
│                               │                                   │──────────────────────────►│

The API’s last involvement before the meeting starts is issuing a joinToken — a short-lived JWT (5-min TTL) the attendee presents to the SFU. The SFU validates the token against a public key the API rotates daily.
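As a rough illustration of the token half of this seam, the sketch below mints and checks a join token with a 5-minute expiry. It is not the writeup's implementation: the claim names (mtg, sub), the function names, and the signing key are hypothetical, and HMAC-SHA256 stands in for the asymmetric signature only so the example runs on the Python standard library. With a public-key scheme as described above, the SFU would hold just the verification key.

```python
import base64
import hashlib
import hmac
import json

JOIN_TOKEN_TTL_SECONDS = 300  # the 5-minute TTL described above


def _b64url(data: bytes) -> str:
    # JWT uses unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that _b64url stripped.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_join_token(key: bytes, meeting_id: str, participant_id: str, now: int) -> str:
    """API side: build header.claims.signature with a short expiry."""
    header = {"alg": "HS256", "typ": "JWT"}  # assumption: stand-in for RS256/ES256
    claims = {
        "mtg": meeting_id,       # hypothetical claim name
        "sub": participant_id,
        "iat": now,
        "exp": now + JOIN_TOKEN_TTL_SECONDS,
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_join_token(key: bytes, token: str, meeting_id: str, now: int) -> dict:
    """SFU side: check signature, expiry, and that the token names this meeting."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(
        key, f"{header_b64}.{claims_b64}".encode("utf-8"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(claims_b64))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    if claims["mtg"] != meeting_id:
        raise ValueError("token is for a different meeting")
    return claims
```

Passing `now` explicitly keeps the expiry check deterministic in tests; a real service would read the clock and allow a small skew.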

Flow 2: recording lifecycle.

Host (or app)                          ZoomAPI                     SFU              RecordingPipeline
│ POST /v1/meetings/{id}/recordings:start │                         │                       │
│────────────────────────────────────────►│                         │                       │
│                                         │ signal SFU to capture   │                       │
│                                         │────────────────────────►│                       │
│                                         │                         │ tap media stream      │
│ 202 Accepted + rec_id                   │                         │                       │
│◄────────────────────────────────────────│                         │                       │
│ ... meeting continues ...               │                         │                       │
│ POST /v1/meetings/{id}/recordings:stop  │                         │                       │
│────────────────────────────────────────►│                         │                       │
│                                         │────────────────────────►│                       │
│ 202 Accepted                            │                         │                       │
│◄────────────────────────────────────────│                         │                       │
│                                         │                         │ hand off raw media    │
│                                         │                         │──────────────────────►│
│                                         │                         │                       │ transcode, mux, store
│                                         │                         │                       │ (async, minutes)
│                                         │◄── ── ── ── ── ── ── ── ── ── ── ── ── ── ── ───│
│ POST {webhook_url}                      │                         │                       │
│ { event: recording.completed,           │                         │                       │
│   rec_id, files[]: [...] }              │                         │                       │
│◄────────────────────────────────────────│                         │                       │

The API call is fast (202 in tens of ms); the actual recording-processing happens out-of-band and the webhook is the signal of completion. The webhook subscriber gets file URLs valid for 7 days; longer-lived links require a fresh GET /v1/recordings/{id}.
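A subscriber that stores these links needs to notice when they are about to lapse. The sketch below is one way to do that, assuming each file entry carries an `expires_at` field in the ISO 8601 form shown in the webhook payload later in this writeup; the helper names and the 5-minute safety margin are illustrative choices, not part of the API.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    # Older Python versions of fromisoformat() reject a trailing "Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def files_needing_refresh(files, now, safety_margin=timedelta(minutes=5)):
    """Return the file entries whose signed URL expires within the margin.

    The caller would then re-fetch GET /v1/recordings/{id} for fresh links.
    """
    return [f for f in files if parse_ts(f["expires_at"]) - now <= safety_margin]


if __name__ == "__main__":
    files = [{"type": "video", "expires_at": "2026-06-06T17:42:11Z"}]
    now = datetime(2026, 6, 6, 17, 40, tzinfo=timezone.utc)
    print([f["type"] for f in files_needing_refresh(files, now)])
```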

Flow 3: participant management.

Host                                          ZoomAPI       SignallingGateway       Participant
│ POST /v1/meetings/{id}/participants/{pid}/mute │                  │                    │
│───────────────────────────────────────────────►│                  │                    │
│                                                │ signal mute      │                    │
│                                                │──────────────────────────────────────►│
│                                                │                  │ participant client │
│                                                │                  │ sets mic muted     │
│                                                │◄──────────────────────────────────────│ ACK
│ 200 OK                                         │                  │                    │
│◄───────────────────────────────────────────────│                  │                    │

Mute is enforced client-side via the SDK with a server-vouched flag. The API doesn’t have direct media-plane control; it issues an authoritative state change that flows through the signalling channel.

Activity diagram (for non-trivial state)#

The Meeting state machine is the structure that justifies most of the design choices:

 [scheduleMeeting]
┌────────────────┐
│   SCHEDULED    │── cancelMeeting ──► CANCELLED
└────────┬───────┘
         │ first joinToken issued + redeemed
┌────────────────┐
│      LIVE      │── all participants leave or
│                │   host ends ──► ENDED
└────────┬───────┘
         ▼ (host triggers)
┌────────────────┐
│  RECORDING_ON  │ ── stopRecording or LIVE→ENDED
└────────┬───────┘
┌────────────────┐
│     ENDED      │
└────────┬───────┘
         │ recording-processing job done
┌────────────────┐
│ RECORDING_READY│ (signalled via webhook)
└────────────────┘

Invariants the API enforces:

  • A SCHEDULED meeting can be edited; a LIVE meeting cannot have its start_time changed (only duration_minutes can extend).
  • CANCELLED is terminal; cancelled meetings keep their id (so external bookings remain valid lookups) but joinToken issuance returns 410 Gone.
  • RECORDING_ON is an overlay state on top of LIVE, not a separate state. The webhook taxonomy reflects this — recording.started and recording.stopped are independent of meeting.started / meeting.ended.
  • A meeting can ship multiple Recording objects if recording is started and stopped multiple times during the meeting.
  • The recording-processing step is the only state transition that happens after the meeting is ENDED — typically 0.5x to 2x the meeting duration in wall-clock time.
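The invariants above can be sketched as a small transition table with recording as an overlay flag. This is a minimal illustration, not the service's actual model: the class and method names are hypothetical, `Conflict` stands in for an HTTP 409 response, and passcode and waiting-room checks are left out.

```python
from enum import Enum


class MeetingState(Enum):
    SCHEDULED = "SCHEDULED"
    LIVE = "LIVE"
    ENDED = "ENDED"
    CANCELLED = "CANCELLED"


class Conflict(Exception):
    """Stand-in for an HTTP 409 response."""


# ENDED and CANCELLED are terminal: no outgoing transitions.
_ALLOWED = {
    MeetingState.SCHEDULED: {MeetingState.LIVE, MeetingState.CANCELLED},
    MeetingState.LIVE: {MeetingState.ENDED},
    MeetingState.ENDED: set(),
    MeetingState.CANCELLED: set(),
}


class Meeting:
    def __init__(self, meeting_id: str):
        self.id = meeting_id
        self.state = MeetingState.SCHEDULED
        self.recording_on = False   # overlay on LIVE, not a state of its own
        self.recording_ids = []     # one entry per start/stop cycle

    def transition(self, target: MeetingState) -> None:
        if target not in _ALLOWED[self.state]:
            raise Conflict(f"{self.state.name} -> {target.name} is not allowed")
        if target is MeetingState.ENDED:
            self.recording_on = False  # LIVE -> ENDED implicitly stops recording
        self.state = target

    def can_change_start_time(self) -> bool:
        return self.state is MeetingState.SCHEDULED

    def join_token_gone(self) -> bool:
        # Cancelled meetings keep their id, but joinToken answers 410 Gone.
        return self.state is MeetingState.CANCELLED

    def start_recording(self, recording_id: str) -> None:
        if self.state is not MeetingState.LIVE or self.recording_on:
            raise Conflict("meeting not LIVE or already recording")
        self.recording_on = True
        self.recording_ids.append(recording_id)

    def stop_recording(self) -> None:
        if not self.recording_on:
            raise Conflict("no recording in progress")
        self.recording_on = False
```

Keeping recording as a flag rather than an enum member is what lets one meeting accumulate several Recording objects without widening the state enum.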

API implementation#

Endpoint catalogue — REST surface#

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /v1/meetings | Schedule a new meeting |
| GET | /v1/meetings | List meetings (filter by host, time window) |
| GET | /v1/meetings/{id} | Read meeting details |
| PATCH | /v1/meetings/{id} | Update a scheduled meeting |
| DELETE | /v1/meetings/{id} | Cancel a meeting |
| POST | /v1/meetings/{id}/joinToken | Issue a short-lived join JWT |
| GET | /v1/meetings/{id}/participants | List participants (live or post-meeting) |
| DELETE | /v1/meetings/{id}/participants/{pid} | Remove a participant |
| POST | /v1/meetings/{id}/participants/{pid}/mute | Force-mute |
| POST | /v1/meetings/{id}/recordings:start | Begin cloud recording |
| POST | /v1/meetings/{id}/recordings:stop | Stop cloud recording |
| GET | /v1/meetings/{id}/recordings | List recordings for a meeting |
| GET | /v1/recordings/{rec_id} | Read a recording (includes signed file URLs) |
| POST | /v1/webhooks | Subscribe to events |
| DELETE | /v1/webhooks/{sub_id} | Unsubscribe |
| GET | /v1/webhooks | List the caller’s subscriptions |

Endpoint catalogue — webhook events#

| Event | Payload key fields |
| --- | --- |
| meeting.started | meeting_id, host_id, started_at |
| meeting.ended | meeting_id, ended_at, duration_seconds, reason |
| participant.joined | meeting_id, participant_id, display_name, joined_at |
| participant.left | meeting_id, participant_id, left_at |
| recording.started | meeting_id, recording_id, started_at |
| recording.stopped | meeting_id, recording_id, stopped_at |
| recording.completed | recording_id, files[], duration_seconds |
| recording.failed | recording_id, error_code, message |

OpenAPI schema (excerpt)#

OpenAPI 3.1 — Zoom API (core endpoints)
paths:
  /v1/meetings:
    post:
      operationId: scheduleMeeting
      security: [{ oauth2: [meeting.write] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [topic, start_time, duration_minutes]
              properties:
                topic: { type: string, maxLength: 200 }
                start_time: { type: string, format: date-time }
                duration_minutes: { type: integer, minimum: 1, maximum: 1440 }
                timezone: { type: string, example: 'America/Los_Angeles' }
                # OpenAPI 3.1 expresses nullability as a type union, not `nullable: true`
                passcode: { type: [string, 'null'], minLength: 1, maxLength: 10 }
                waiting_room: { type: boolean, default: true }
                settings:
                  type: object
                  properties:
                    auto_recording:
                      type: string
                      enum: [none, local, cloud]
                      default: none
                    mute_upon_entry: { type: boolean, default: true }
                    allow_join_before_host: { type: boolean, default: false }
      responses:
        '201':
          description: Meeting scheduled
          content:
            application/json:
              schema: { $ref: '#/components/schemas/Meeting' }
  /v1/meetings/{id}/joinToken:
    post:
      operationId: issueJoinToken
      security: [{ oauth2: [meeting.join] }]
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                display_name: { type: string }
                guest: { type: boolean, default: false }
                passcode: { type: [string, 'null'] }
      responses:
        '200':
          description: Join JWT + SFU endpoint
          content:
            application/json:
              schema:
                type: object
                required: [jwt, sfu_url, expires_in]
                properties:
                  jwt: { type: string }
                  sfu_url: { type: string, format: uri }
                  expires_in: { type: integer, example: 300 }
        '403': { description: Passcode mismatch or not yet joinable }
        '410': { description: Meeting cancelled }
  /v1/meetings/{id}/recordings:start:
    post:
      operationId: startRecording
      security: [{ oauth2: [recording.write] }]
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      responses:
        '202':
          description: Recording requested
          content:
            application/json:
              schema:
                type: object
                properties:
                  recording_id: { type: string }
                  state: { type: string, example: 'recording' }
        '409': { description: Meeting not LIVE or already recording }
  /v1/webhooks:
    post:
      operationId: subscribeWebhook
      security: [{ oauth2: [webhook.write] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [url, events]
              properties:
                url: { type: string, format: uri }
                events:
                  type: array
                  items:
                    type: string
                    enum:
                      - meeting.started
                      - meeting.ended
                      - participant.joined
                      - participant.left
                      - recording.started
                      - recording.stopped
                      - recording.completed
                      - recording.failed
      responses:
        '201':
          description: Subscription created (secret returned once)
          content:
            application/json:
              schema:
                type: object
                properties:
                  id: { type: string }
                  secret:
                    type: string
                    description: HMAC-SHA256 secret; not retrievable later
                  url: { type: string }
                  events: { type: array, items: { type: string } }
components:
  schemas:
    Meeting:
      type: object
      required: [id, host_id, topic, start_time, duration_minutes, state, join_url]
      properties:
        id: { type: string }
        host_id: { type: string }
        topic: { type: string }
        start_time: { type: string, format: date-time }
        duration_minutes: { type: integer }
        passcode: { type: [string, 'null'] }
        waiting_room: { type: boolean }
        join_url: { type: string, format: uri }
        state:
          type: string
          enum: [SCHEDULED, LIVE, ENDED, CANCELLED]
        settings:
          type: object
          additionalProperties: true
  securitySchemes:
    oauth2:
      type: oauth2
      flows:
        authorizationCode:
          authorizationUrl: https://api.zoom.example/oauth/authorize
          tokenUrl: https://api.zoom.example/oauth/token
          scopes:
            meeting.read: Read meeting metadata
            meeting.write: Schedule/update meetings
            meeting.join: Issue join tokens
            recording.read: Read recordings
            recording.write: Control recording
            webhook.write: Manage webhook subscriptions

Webhook signature contract#

Every webhook POST carries two headers and a signed body:

POST to subscriber URL
POST /your-callback HTTP/1.1
Content-Type: application/json
X-Zoom-Timestamp: 1780162931
X-Zoom-Signature: t=1780162931,v1=5c4f8d7e1a...

{
  "event": "recording.completed",
  "ts": "2026-05-30T17:42:11Z",
  "data": {
    "recording_id": "rec_8h2N9c0qK",
    "meeting_id": "mtg_91j20kdoF",
    "duration_seconds": 1842,
    "files": [
      { "type": "video", "url": "https://...mp4", "size_bytes": 482000000, "expires_at": "2026-06-06T17:42:11Z" },
      { "type": "audio", "url": "https://...m4a", "size_bytes": 28000000, "expires_at": "2026-06-06T17:42:11Z" },
      { "type": "transcript", "url": "https://...vtt", "size_bytes": 124000, "expires_at": "2026-06-06T17:42:11Z" }
    ]
  }
}

Signature scheme: v1 = HMAC-SHA256(secret, "v1:" + timestamp + ":" + raw_body). Subscribers reject events older than 5 minutes (replay defence). At-least-once delivery with retry backoff at 1s, 5s, 30s, 5min, 30min, 6h, 24h.
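On the subscriber side, the scheme above could be checked roughly as follows. This is a sketch under stated assumptions: the signature is taken to be hex-encoded (the sample header looks like hex, but the encoding is not specified here), and the function names are hypothetical. The header parsing follows the `t=...,v1=...` shape shown in the sample request.

```python
import hashlib
import hmac

MAX_AGE_SECONDS = 300  # reject events older than 5 minutes


def sign_payload(secret: bytes, timestamp: str, raw_body: bytes) -> str:
    # v1 = HMAC-SHA256(secret, "v1:" + timestamp + ":" + raw_body)
    message = b"v1:" + timestamp.encode("utf-8") + b":" + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()  # hex is an assumption


def verify_webhook(secret: bytes, signature_header: str, raw_body: bytes, now: int) -> bool:
    """Return True only for a fresh, correctly signed delivery."""
    try:
        fields = dict(item.split("=", 1) for item in signature_header.split(","))
        timestamp = fields["t"]
        received = fields["v1"]
        age = now - int(timestamp)
    except (KeyError, ValueError):
        return False  # malformed header
    if abs(age) > MAX_AGE_SECONDS:
        return False  # replay defence (also rejects timestamps far in the future)
    expected = sign_payload(secret, timestamp, raw_body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The check has to run over the raw request bytes, before any JSON parsing or re-serialisation, because the HMAC covers the body exactly as it was sent.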

Client samples — three languages#

The schedule-then-join-token flow in Python, Go, and Node.

Schedule + issue join token — Python
import requests

API = "https://api.zoom.example"
TOKEN = "Bearer eyJhbGciOi..."


def schedule_meeting(topic, start_time_iso, duration_minutes):
    # POST /v1/meetings; returns the created Meeting object.
    body = {
        "topic": topic,
        "start_time": start_time_iso,
        "duration_minutes": duration_minutes,
        "timezone": "America/Los_Angeles",
        "waiting_room": True,
        "settings": {"auto_recording": "cloud", "mute_upon_entry": True},
    }
    resp = requests.post(
        f"{API}/v1/meetings",
        json=body,
        headers={"Authorization": TOKEN},
        timeout=2,
    )
    resp.raise_for_status()
    return resp.json()


def issue_join_token(meeting_id, display_name, passcode=None):
    # POST /v1/meetings/{id}/joinToken; returns { jwt, sfu_url, expires_in }.
    body = {"display_name": display_name, "guest": False}
    if passcode:
        body["passcode"] = passcode
    resp = requests.post(
        f"{API}/v1/meetings/{meeting_id}/joinToken",
        json=body,
        headers={"Authorization": TOKEN},
        timeout=1,
    )
    resp.raise_for_status()  # 403 and 410 surface here as requests.HTTPError
    return resp.json()


mtg = schedule_meeting("Weekly Sync", "2026-06-01T18:00:00Z", 30)
tok = issue_join_token(mtg["id"], "Suraj")
print(tok["sfu_url"], tok["expires_in"], "s")

Latency budget — join-token issuance#

The 100 ms p95 budget on POST /v1/meetings/{id}/joinToken (gating user-perceived “joining”):

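| Phase | Budget |
| --- | --- |
| TLS / HTTP | 0 ms (warm) |
| OAuth scope check | 10 ms |
| Meeting state lookup | 15 ms |
| Passcode / waiting-room policy check | 5 ms |
| SFU selection (regional homing) | 10 ms |
| JWT mint (RS256 or ES256, so the SFU can verify with a public key) | 5 ms |
| Serialize + transport | 10 ms |
| Margin | 45 ms |
| Total | 100 ms |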

The downstream SFU connection takes another 500-1500 ms (ICE candidate gathering, DTLS handshake), but that’s media-plane time, not API time.
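The budget arithmetic is easy to keep honest in code. The snippet below restates the per-phase numbers from the table and derives the margin; the dictionary keys and the `over_budget` helper are illustrative names, not part of any real tooling.

```python
JOIN_TOKEN_P95_BUDGET_MS = 100

# Per-phase allocations from the latency table (margin excluded).
PHASE_BUDGET_MS = {
    "tls_http_warm": 0,
    "oauth_scope_check": 10,
    "meeting_state_lookup": 15,
    "policy_check": 5,
    "sfu_selection": 10,
    "jwt_mint": 5,
    "serialize_transport": 10,
}


def margin_ms(phases=PHASE_BUDGET_MS, total=JOIN_TOKEN_P95_BUDGET_MS):
    """Headroom left after all allocated phases."""
    return total - sum(phases.values())


def over_budget(measured_ms, phases=PHASE_BUDGET_MS):
    """Map each phase that exceeds its allocation to the overshoot in ms."""
    return {
        name: measured_ms[name] - allowed
        for name, allowed in phases.items()
        if measured_ms.get(name, 0) > allowed
    }


if __name__ == "__main__":
    print(margin_ms())
    print(over_budget({"meeting_state_lookup": 22, "jwt_mint": 4}))
```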

Trade-offs and extensions#

| Decision | Why | Cost if requirements change |
| --- | --- | --- |
| Clean API / SFU seam via join JWT | API stays stateless about media; SFU is independent | Have to keep two systems in sync on key rotation |
| Webhooks for off-meeting events | Decouples callers from polling | Subscribers must implement retry-safe handlers |
| Verb-suffixes for state-change ops (:start, :stop) | Honest about non-CRUD semantics | Diverges from pure REST aesthetic |
| OAuth 2 with granular scopes | Enables a third-party app marketplace | More scopes to document; consent screens get busy |
| Per-meeting recording lifecycle | Multiple recordings per meeting | UI complexity around “which recording?” |
| Cloud recording default off | Trust + cost defaults | Customer-side default is “always on” via app config |
| 5-minute join-token TTL | Limits replay if leaked | Re-issuance needed on long pre-meeting waiting periods |
| Webhooks signed with HMAC-SHA256 + timestamp | Replay defence + integrity | Subscriber-side bookkeeping (recent ts cache) |
| At-least-once delivery | Tolerates subscriber outages | Subscribers must dedupe by event_id (we send it) |
| Passcode in plaintext POST body | Standard for low-entropy meeting passcodes | Cannot be hashed at rest; rotated per meeting |
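The at-least-once row implies some bookkeeping on the subscriber side. A minimal in-memory sketch of dedupe by event_id follows; the class and function names are hypothetical, the 48-hour retention window is an assumption chosen to outlast the retry window, and a production handler would keep this state in a shared store rather than in process memory.

```python
from collections import OrderedDict

RETENTION_SECONDS = 48 * 3600  # assumption: comfortably longer than the retry window


class SeenEvents:
    """Remembers recently processed event ids, oldest first."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self._retention = retention_seconds
        self._first_seen = OrderedDict()  # event_id -> time it was processed

    def _evict_expired(self, now):
        while self._first_seen:
            event_id, seen_at = next(iter(self._first_seen.items()))
            if now - seen_at <= self._retention:
                break
            del self._first_seen[event_id]

    def contains(self, event_id, now):
        self._evict_expired(now)
        return event_id in self._first_seen

    def add(self, event_id, now):
        self._first_seen[event_id] = now


def handle_delivery(event, seen, now, process):
    """Process an event once; acknowledge repeats without reprocessing."""
    event_id = event["event_id"]
    if seen.contains(event_id, now):
        return "duplicate"
    process(event)            # if this raises, the id is not recorded,
    seen.add(event_id, now)   # so the sender's retry gets another chance
    return "processed"
```

Both outcomes would map to a 2xx response, since a duplicate is a successful delivery from the sender's point of view.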

Likely follow-up extensions and how the API absorbs them:

  • Phone bridge / SIP joins. A separate joinToken issuance path with a phone-number + meeting-id + PIN code triple. Doesn’t change the SFU contract; the SIP gateway sits beside the SFU and presents itself as one more participant.
  • Breakout rooms. A new sub-resource POST /v1/meetings/{id}/breakouts returning an array of sub-rooms each with their own join URL. The SFU treats each breakout as a sibling room. Webhook taxonomy extends with breakout.opened / breakout.closed.
  • Live transcription. A WebSocket subscription per meeting that emits transcribed text frames. New surface, new contract. Reuses OAuth scope shape.
  • Webinar mode. A flag on Meeting that flips it into webinar semantics (panelists vs attendees, raise-hand queue, Q&A). Same Meeting resource, additional fields.
  • Meeting templates. POST /v1/meeting_templates + start_from_template field on the schedule endpoint. Save customers from re-keying settings.

Mock interview follow-ups#

  • “Why is joinToken a separate endpoint, not a field on the meeting object?” — Two reasons. (1) Tokens are short-lived (5 min); embedding them in the meeting object is wrong for resource freshness. (2) Token issuance requires a synchronous policy check (passcode, waiting-room state) that’s a write-shaped operation, not a read.
  • “How does the API handle a meeting that goes long?” The duration_minutes field on the meeting object is advisory; the API does not auto-end. A long-running meeting fires meeting.ended when the last participant leaves or the host clicks End. Customers who want hard caps configure a tenant-level policy.
  • “What happens if a webhook subscriber is down for hours?” — Exponential backoff over 24 hours. After 24 hours of failures we mark the subscription degraded and surface a dashboard warning; after 7 days of continuous failure we disable it and email the app owner.
  • “How does a third-party app authenticate to our API?” — OAuth 2 Authorization Code flow with PKCE for user-context apps; OAuth 2 Client Credentials for server-to-server apps. Scopes are per-API-category. Refresh tokens rotate on use.
  • “How do you support a meeting with 1000 participants?” — The API contract doesn’t change; the SFU’s job is to scale forwarding. Participant-list endpoint paginates (page_size: 50, cursor); the join-token endpoint stays sub-100 ms. The interesting scaling work is below the seam.
  • “What’s the deal with the start_url vs join_url?” The start_url carries a host token (longer-lived JWT, broader scopes) that the meeting host uses to start the session. The join_url triggers the joinToken flow for attendees. Two different paths because hosts have authority and need a distinct credential path.
  • “How do recordings handle large meetings?” — Recording is multiplexed at the SFU; processing is async. A 4-hour meeting at 1080p produces ~12 GB of video that the recording pipeline transcodes into adaptive renditions and stores. The webhook fires when the lowest-bitrate rendition is ready; higher renditions arrive minutes later.
  • “What if a participant leaves and rejoins?” — Two participant.joined events with two different participant_id values. The API does not unify them; that’s a downstream-customer reporting concern.
  • “At 10x scale, what breaks first?” — The participant-list endpoint’s freshness. We’d push participant-state via a server-sent events stream GET /v1/meetings/{id}/participants/stream rather than 5-second polling. The schedule / read / write surface scales horizontally; webhook delivery scales by sharding subscribers across delivery workers.
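The webhook answers above (the backoff schedule, then degraded after 24 hours and disabled after 7 days) can be sketched as two small functions. The interpretation that each listed interval is the wait before the next retry is an assumption, as are the function names and the status strings.

```python
# Waits before retry 1..7, from the signature-contract section.
RETRY_DELAYS_SECONDS = [1, 5, 30, 5 * 60, 30 * 60, 6 * 3600, 24 * 3600]

DEGRADED_AFTER_SECONDS = 24 * 3600
DISABLED_AFTER_SECONDS = 7 * 24 * 3600


def next_retry_delay(failed_attempts):
    """Seconds to wait after the Nth consecutive failure, or None when exhausted."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be >= 1")
    index = failed_attempts - 1
    if index >= len(RETRY_DELAYS_SECONDS):
        return None  # give up on this event; the subscription itself may stay active
    return RETRY_DELAYS_SECONDS[index]


def subscription_status(failing_since, now):
    """Classify a subscription by how long it has been failing continuously."""
    if failing_since is None:
        return "active"
    elapsed = now - failing_since
    if elapsed >= DISABLED_AFTER_SECONDS:
        return "disabled"   # and email the app owner
    if elapsed >= DEGRADED_AFTER_SECONDS:
        return "degraded"   # surface a dashboard warning
    return "active"
```

Separating per-event retry state from per-subscription health means one poisoned event cannot disable a subscriber that is otherwise answering.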

API and media plane fused, with the API directly managing RTP streams. Tempting because it looks “simpler”, but the API surface becomes a leaky abstraction over the SFU. Every congestion-control tweak shows up in the API. Mobile SDKs can’t evolve independently. The web SDK has to mirror every server change.

API and media plane separated by a join JWT. The API is a stateless REST + webhook surface; the SFU is a media-routing fleet that the join token unlocks. Each can ship independently; the contract between them is a signed JWT and a public key rotation schedule. This is how every major real-time-media API ships.
