Estimating API Latency — Back-of-Envelope

Processing time + network time + queueing. The numbers every engineer should know (memory, SSD, datacenter RTT, cross-continent RTT).

Concept Intermediate
12 min read
latency performance back-of-envelope capacity-planning

Summary#

Every API request’s latency decomposes into three parts:

  • Processing time — what the server does. CPU work, memory access, disk reads, dependent service calls.
  • Network time — what the wire does. DNS, TCP, TLS, the round-trip to the server, the round-trip back.
  • Queueing time — what the server is doing for other people. The time the request waits in queue (kernel accept queue, application thread pool, downstream service queue) before processing starts.

A back-of-envelope latency estimate is the sum of these three. The senior move in an API-design interview is to estimate each one explicitly rather than hand-wave at the answer. “Roughly 50ms” is a junior answer; “30ms processing (one indexed DB read + serialise) + 0.5ms intra-DC network + 10ms queueing under expected load ≈ 40ms” is a senior answer.
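The decomposition is simple enough to write down as a tiny data structure. The sketch below is only an illustration: the `LatencyEstimate` class and its field names are hypothetical, and the inputs are the figures quoted in the "senior answer" above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyEstimate:
    """One back-of-envelope estimate; every value is in milliseconds."""
    processing_ms: float
    network_ms: float
    queueing_ms: float

    @property
    def total_ms(self) -> float:
        # The estimate is just the sum of the three components.
        return self.processing_ms + self.network_ms + self.queueing_ms

    def describe(self) -> str:
        return (
            f"{self.processing_ms:g}ms processing + "
            f"{self.network_ms:g}ms network + "
            f"{self.queueing_ms:g}ms queueing = ~{self.total_ms:g}ms"
        )


# Figures from the quoted senior answer.
estimate = LatencyEstimate(processing_ms=30, network_ms=0.5, queueing_ms=10)
print(estimate.describe())
```

Keeping the three terms separate, rather than storing only the total, is what lets you say which term to attack when the sum misses the SLA.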

The numbers you need to do this, which every infrastructure engineer should be able to recite, were popularised by Jeff Dean’s 2007 talk. Updated for 2026 storage and network reality, they cluster into a handful of orders of magnitude. Once you have the orders of magnitude in your head, every API latency estimate is arithmetic.

Why it matters#

Three reasons the latency mental model is core to API design:

  • “Can the API hit this SLA?” is the question behind every design choice. Synchronous or async? Cached or fresh? One call or many? Each decision adds or removes ms in known amounts. Without the numbers, the choice is guesswork.
  • The interviewer will ask “what’s the latency of your design?” They want to see you can decompose it. Hand-waving means you don’t know whether your design is fast or slow.
  • Order-of-magnitude reasoning catches design bugs early. A design with three sequential cross-region calls at ~50ms each is 3 × 50ms = 150ms of pure network, far over a 100ms SLA. Naming the number forces the redesign (parallelise, batch, move closer).

The senior signal in an interview: “Memory access is ~100ns. SSD read is ~100μs. Datacenter RTT is ~0.5ms. Cross-continent RTT is ~80ms. If I add a cross-continent call, I just added 80ms to my p50.”

How it works#

The numbers — order of magnitude reference#

Updated for 2026 hardware and a typical cloud deployment:

Operation                                   Time       Note
─────────────────────────────────────────────────────────────────────────────────────────────────
L1 cache reference                          ~1 ns      A few cycles on a modern multi-GHz core
Branch mispredict                           ~3 ns      Per the textbook
L2 cache reference                          ~4 ns      On-die, usually private to the core
Mutex lock/unlock                           ~17 ns     Uncontended
Main memory (RAM) reference                 ~100 ns    DRAM access
Compress 1 KB with zstd                     ~1-3 μs    Modern compression
Send 1 KB over 10 Gbps network (intra-DC)   ~1 μs      Wire-time only, no propagation (8,000 bits at 10 Gbps is 0.8 μs)
SSD random read                             ~100 μs    NVMe SSD, ~10x faster than 2010 SATA SSD
Read 1 MB sequentially from SSD             ~500 μs    NVMe sequential
Read 1 MB sequentially from RAM             ~50 μs     DRAM streaming
Same-AZ network round-trip                  ~0.5 ms    Same datacenter, same rack-cluster
Cross-AZ network round-trip (same region)   ~1-2 ms    Different availability zones
Cross-region (US-East to US-West)           ~70 ms     Continental scale
Cross-continent (US to EU)                  ~80 ms     NYC → London ~75ms
US to Asia                                  ~150 ms    NYC → Singapore ~220ms; SF → Tokyo ~110ms
HDD random seek (legacy)                    ~10 ms     Spinning disk; rare in 2026 production
TCP three-way handshake                     1 × RTT    One round-trip extra
TLS 1.3 handshake (new)                     1 × RTT    One extra; full TLS 1.2 is 2 × RTT
TLS 1.3 resumed (0-RTT)                     0 × RTT    Session resumption

The clusters to keep in your head:

  • Nanoseconds: CPU, cache, RAM. Free at API scale.
  • Microseconds: Local disk (SSD), in-process work. Negligible per request.
  • Sub-millisecond: Same-DC network. Cheap.
  • Single-digit milliseconds: Cross-AZ network, busy disk, cache-miss DB query. Noticeable.
  • Tens of milliseconds: Cross-region and US-to-EU network, cold DB queries, S3 reads. Add up fast.
  • Hundreds of milliseconds: US-to-Asia network, cold-start serverless, image processing. Budget destroyers.
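To make the clusters concrete, here is a minimal sketch that sorts a few of the reference numbers into the buckets listed above. The `cluster` function and its boundary values are assumptions chosen for illustration (in particular the 250 μs cut between "microseconds" and "sub-millisecond"); the article only names the buckets, not exact thresholds.

```python
NS, US, MS = 1e-9, 1e-6, 1e-3  # unit helpers, in seconds

# A few reference latencies from the table, expressed in seconds.
REFERENCE = {
    "RAM reference": 100 * NS,
    "SSD random read": 100 * US,
    "Same-AZ round-trip": 0.5 * MS,
    "Cross-AZ round-trip": 1.5 * MS,
    "Cross-region round-trip": 70 * MS,
    "US to Asia round-trip": 150 * MS,
}

# (upper bound in seconds, label). The bounds are illustrative assumptions.
BUCKETS = [
    (1 * US, "nanoseconds"),
    (250 * US, "microseconds"),
    (1 * MS, "sub-millisecond"),
    (10 * MS, "single-digit milliseconds"),
    (100 * MS, "tens of milliseconds"),
    (1.0, "hundreds of milliseconds"),
]


def cluster(seconds: float) -> str:
    """Return the order-of-magnitude bucket for a latency given in seconds."""
    for upper_bound, label in BUCKETS:
        if seconds < upper_bound:
            return label
    return "seconds or more"


for name, seconds in REFERENCE.items():
    print(f"{name:26s} {cluster(seconds)}")
```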

Processing time — what the server does#

For a typical CRUD endpoint:

Receive request                 ~10 μs (kernel + framework parsing)
Parse JSON                      ~10-100 μs (depends on body size)
Authenticate (validate token)   ~100 μs (HMAC) or ~1ms (DB lookup)
Authorize (RBAC check)          ~100 μs (cached) or ~1ms (DB lookup)
Main query (DB indexed read)    ~1-5 ms (Postgres, MySQL with index)
Serialise response              ~10-100 μs (JSON encode)
Send response                   ~10 μs (kernel write)
─────────────────────────────────────────────
Total                           ~1-7 ms (warm path, no cross-service)
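The total is just the column sums of the low and high ends. As a rough check, the sketch below adds up the ranges from the breakdown above; the `STEPS_MS` name is hypothetical and the values are the article's figures converted to milliseconds. The sums land at roughly 1.2 ms and 7.2 ms, the same ballpark as the quoted total.

```python
# (low, high) estimate in milliseconds for each step of the CRUD endpoint.
STEPS_MS = {
    "receive request": (0.01, 0.01),
    "parse JSON": (0.01, 0.1),
    "authenticate": (0.1, 1.0),      # HMAC check vs DB lookup
    "authorize": (0.1, 1.0),         # cached vs DB lookup
    "main query": (1.0, 5.0),
    "serialise response": (0.01, 0.1),
    "send response": (0.01, 0.01),
}

low = sum(lo for lo, _ in STEPS_MS.values())
high = sum(hi for _, hi in STEPS_MS.values())

# The step with the largest worst case is where optimisation pays off first.
dominant = max(STEPS_MS, key=lambda step: STEPS_MS[step][1])

print(f"warm path: ~{low:.1f}-{high:.1f} ms, dominated by '{dominant}'")
```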

For a more complex endpoint (joining three services):

Parse + auth + authz ~1 ms
Call user-service ~5 ms (1 RTT + processing)
Call order-service ~5 ms
Call inventory-service ~5 ms
Aggregate + serialise ~1 ms
─────────────────────────────────────────────
Total (sequential) ~17 ms
Total (parallel) ~7 ms (max of 5 + overhead)

Sequential vs parallel is the biggest processing-time lever. Three 5ms calls in sequence is 15ms; three 5ms calls in parallel is 5ms. The senior move on any multi-service endpoint is to identify which calls have no inter-dependencies and parallelise them.
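A small `asyncio` sketch shows the effect. The `call_service` coroutine is a hypothetical stand-in that sleeps instead of doing real I/O, and the sleeps are scaled up to 50 ms each (rather than 5 ms) so the difference stands out above timer noise.

```python
import asyncio
import time


async def call_service(name: str, latency_s: float) -> str:
    """Stand-in for a downstream call; sleeps instead of doing real I/O."""
    await asyncio.sleep(latency_s)
    return name


async def sequential(calls):
    # Each call starts only after the previous one has finished.
    return [await call_service(name, latency) for name, latency in calls]


async def parallel(calls):
    # All calls start together; total time is roughly the slowest call.
    return await asyncio.gather(
        *(call_service(name, latency) for name, latency in calls)
    )


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


CALLS = [("user-service", 0.05), ("order-service", 0.05), ("inventory-service", 0.05)]

seq_result, seq_elapsed = timed(sequential(CALLS))
par_result, par_elapsed = timed(parallel(CALLS))
print(f"sequential: {seq_elapsed * 1000:.0f} ms, parallel: {par_elapsed * 1000:.0f} ms")
```

This only works when the calls are independent. If the order call needs the user record first, that pair stays sequential and only the remaining calls can be fanned out.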

Network time — what the wire does#

The wire’s contribution decomposes by leg:

Client → API gateway ~30-100 ms (consumer internet)
Gateway → API server (same DC) ~0.5 ms
API server → DB (same DC) ~0.5 ms
API server → cache (same DC) ~0.5 ms
API server → cross-DC service ~1-5 ms (intra-region)
API server → other-region service ~70-150 ms

The user-to-server hop dominates for most APIs. A user on cellular in Mumbai talking to a US-East data center pays ~250ms round-trip. The same user talking to a Mumbai PoP pays ~20ms. Where the API runs matters more than how fast it computes.
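The round-trip count matters as much as the round-trip time. The sketch below multiplies the RTT by the number of round trips a first request needs, using the handshake costs from the reference table (TCP 1 RTT, TLS 1.3 1 RTT, TLS 1.2 2 RTT, resumed TLS 1.3 0 RTT). The `ROUND_TRIPS` table, the function name, and the 5 ms of server time are assumptions for illustration, and the model ignores refinements such as TCP Fast Open or QUIC.

```python
# Round trips before the response arrives:
# TCP handshake (1) + TLS handshake (0, 1 or 2) + the request itself (1).
ROUND_TRIPS = {
    "warm": 1,         # connection reused: request/response only
    "tls13_0rtt": 2,   # TCP + resumed TLS 1.3 (0-RTT) + request
    "tls13_cold": 3,   # TCP + full TLS 1.3 + request
    "tls12_cold": 4,   # TCP + full TLS 1.2 + request
}


def request_latency_ms(rtt_ms: float, server_ms: float, connection: str = "warm") -> float:
    """Estimate end-to-end latency for one request over the given connection state."""
    return ROUND_TRIPS[connection] * rtt_ms + server_ms


SERVER_MS = 5  # assumed server-side time, for illustration only

for label, rtt in [("Mumbai user -> US-East", 250), ("Mumbai user -> Mumbai PoP", 20)]:
    warm = request_latency_ms(rtt, SERVER_MS, "warm")
    cold = request_latency_ms(rtt, SERVER_MS, "tls13_cold")
    print(f"{label}: warm ~{warm:.0f} ms, cold TLS 1.3 ~{cold:.0f} ms")
```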

Latency-by-region cheat (median):

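User location    US-East    US-West    EU-West    AP-South         AP-Northeast
───────────────────────────────────────────────────────────────────────────────
New York         5ms        70ms       80ms       220ms            200ms
London           75ms       140ms      5ms        110ms            230ms
Singapore        220ms      170ms      170ms      50ms (Mumbai)    70ms
Tokyo            200ms      110ms      230ms      100ms            5ms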

This is why edge runtimes (Cloudflare Workers, Vercel Edge, Lambda@Edge) matter — they pull the network leg from “user → distant origin” to “user → nearby PoP”.
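The cheat table can also drive a quick placement decision. The following sketch encodes the table as a dictionary and picks either the nearest region for one user location or the single region with the best worst-case RTT across several. The function names are hypothetical and the numbers are the medians from the table, so treat the output as a rough guide rather than a measurement.

```python
# Median RTT in milliseconds, copied from the latency-by-region table.
RTT_MS = {
    "New York":  {"US-East": 5,   "US-West": 70,  "EU-West": 80,  "AP-South": 220, "AP-Northeast": 200},
    "London":    {"US-East": 75,  "US-West": 140, "EU-West": 5,   "AP-South": 110, "AP-Northeast": 230},
    "Singapore": {"US-East": 220, "US-West": 170, "EU-West": 170, "AP-South": 50,  "AP-Northeast": 70},
    "Tokyo":     {"US-East": 200, "US-West": 110, "EU-West": 230, "AP-South": 100, "AP-Northeast": 5},
}


def nearest_region(location: str) -> str:
    """Region with the lowest RTT for a single user location."""
    rtts = RTT_MS[location]
    return min(rtts, key=rtts.get)


def best_single_region(locations: list[str]) -> tuple[str, int]:
    """Region that minimises the worst-case RTT across all given locations."""
    regions = RTT_MS[locations[0]].keys()
    worst = {r: max(RTT_MS[loc][r] for loc in locations) for r in regions}
    best = min(worst, key=worst.get)
    return best, worst[best]


print(nearest_region("Singapore"))
print(best_single_region(["New York", "London", "Singapore", "Tokyo"]))
```

With a single origin, the best worst case in this table is still 170 ms, which is the gap that per-user edge placement closes.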

Queueing time — what the server is doing for others#

The often-forgotten third term. Even if processing is 5ms and network is 1ms, the request can sit in queue for 50ms under load. Two queues to think about:

  • Application queue. If the server has 100 worker threads and 200 requests in flight, half the requests are queued. A request joining the back of that queue waits roughly (in-flight / workers - 1) × average-processing-time, which here is about one processing time. Little’s Law (L = λW) is the formal version.
  • Downstream queue. A call to a saturated database waits in the database’s connection pool queue.
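The application-queue rule of thumb and Little's Law both fit in a few lines. This is a sketch under simplifying assumptions (identical workers, steady average processing time); the function names are hypothetical.

```python
def backlog_wait_ms(in_flight: int, workers: int, avg_processing_ms: float) -> float:
    """Rough wait for a request joining the back of the application queue.

    Equivalent to (in_flight / workers - 1) * avg_processing_ms when the
    pool is saturated, and zero when a worker is free.
    """
    queued = max(in_flight - workers, 0)
    return (queued / workers) * avg_processing_ms


def littles_law_in_flight(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return arrival_rate_per_s * avg_latency_s


# 200 requests in flight on 100 workers, 5 ms average processing time.
print(backlog_wait_ms(200, 100, 5.0))

# 2,000 requests/s at 50 ms average latency keeps about 100 requests in flight.
print(littles_law_in_flight(2000, 0.05))
```

Little's Law is useful in the other direction too: given a worker count and a target latency, it bounds the arrival rate the pool can absorb before a queue forms.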

Queueing time grows non-linearly with utilisation. The classic M/M/1 queue model: wait time is service_time × utilisation / (1 - utilisation). At 50% utilisation, wait = processing time. At 90%, wait = 9 × processing time. At 99%, wait = 99 × processing time. This is why p99 latency spikes when capacity gets tight, while p50 looks fine.

Utilisation Queue wait (as multiple of service time)
─────────────────────────────────────────────────────
50% 1×
80% 4×
90% 9×
95% 19×
99% 99×
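The table follows directly from the M/M/1 formula quoted above. A minimal sketch that reproduces it, assuming a single-server queue with random arrivals and service times (the function name is hypothetical):

```python
def mm1_queue_wait(service_time_ms: float, utilisation: float) -> float:
    """Mean time spent waiting in an M/M/1 queue, excluding service itself."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1); at 1.0 the queue never drains")
    return service_time_ms * utilisation / (1 - utilisation)


# With a service time of 1, the result is the wait as a multiple of service time.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{rho:.0%}  {mm1_queue_wait(1.0, rho):.0f}x")
```

Real servers with many workers behave better than M/M/1 at moderate load, so treat these multiples as a pessimistic guide to the shape of the curve rather than a prediction.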

A senior latency estimate for an SLA needs to specify the load level. “5ms at p50 under 50% utilisation” is honest; “5ms” without qualification glosses over the queue.

Putting it together — three worked estimates#

Example 1: Profile read, warm cache, same region.

Network: client ↔ gateway                ~50 ms (consumer internet RTT, covers request and response)
Network: gateway → API                   ~0.5 ms
Processing: parse + auth + serialise     ~1 ms
Processing: Redis cache read             ~1 ms (network 0.5ms + work 0.5ms)
Queueing (assume 50% util)               ~1 ms
─────────────────────────────────────────────
Total (p50)                              ~53.5 ms
Server-side (excluding the client hop)   ~3.5 ms (gateway → API → response)

Example 2: Order checkout, three-service fanout, same region.

Client ↔ gateway (RTT, request + response)   ~50 ms
Gateway to API (warm conn)                   ~0.5 ms
Auth + authz                                 ~1 ms
Parallel fan-out:
- Inventory service                          ~5 ms
- Payment service                            ~30 ms (calls external)
- Order service                              ~3 ms
Total fan-out (max)                          ~30 ms
Aggregate + serialise                        ~1 ms
Queueing (50% util)                          ~3 ms
─────────────────────────────────────────────
Total (p50)                                  ~86 ms

Example 3: Cross-continent — user in Tokyo, API in US-East.

Client ↔ gateway (RTT, request + response)   ~200 ms ← the killer
Same as Example 1 internally                 ~3.5 ms
─────────────────────────────────────────────
Total (p50)                                  ~204 ms
With edge runtime (PoP in Tokyo)             ~25 ms client RTT + ~3.5 ms work = ~28.5 ms

The third example shows the edge-runtime payoff: the same logic at the edge is roughly 7x faster end-to-end because the network hop collapses.

Where to spend optimisation effort#

Given the latency breakdown, where does each ms of optimisation come from?

Lever                                     Typical win                   Effort
──────────────────────────────────────────────────────────────────────────────────────────
Add an index to a slow query              10-100ms                      Low
Parallelise sequential fan-out            (N - 1) × per-call time       Medium
Add a Redis cache (in front of DB)        1-5ms per cache hit           Low
Move to edge runtime (cross-continent)    100-300ms                     Medium-high
Switch to HTTP/2 or HTTP/3                50-100ms (handshake)          Infrastructure
Pre-compute and cache hot reads           Up to full processing time    Variable
Reduce response payload (sparse fields)   10-50ms (bandwidth)           Low
Use a faster JSON encoder                 100μs-1ms                     Low
Switch DB from network to in-process      ~1ms per call                 High (architecture)

The order matters. Add an index before you switch JSON encoders. Parallelise before you cache. Move to edge before you optimise serialisation.

Variants and trade-offs#

Compute-bound API. Latency dominated by CPU and downstream calls. Optimise by parallelising, caching, indexing, reducing fan-out. The network is the small part; the work is the big part.

Network-bound API. Latency dominated by the user-to-server hop. Optimise by edge placement, HTTP/2 or HTTP/3, fewer round-trips, smaller payloads. The work is the small part; the wire is the big part.

Most consumer APIs are network-bound for cold connections and queue-bound under load. Internal microservice meshes are usually compute-bound (network is intra-DC and free).
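Classifying an API this way amounts to finding the largest term in the breakdown. A minimal sketch, with hypothetical function and label names and made-up example numbers:

```python
def dominant_component(processing_ms: float, network_ms: float, queueing_ms: float):
    """Return (label, share of total) for the largest latency component."""
    parts = {
        "compute-bound": processing_ms,
        "network-bound": network_ms,
        "queue-bound": queueing_ms,
    }
    label = max(parts, key=parts.get)
    return label, parts[label] / sum(parts.values())


# Illustrative numbers only: a cold consumer request vs an internal mesh call.
print(dominant_component(processing_ms=5, network_ms=150, queueing_ms=1))
print(dominant_component(processing_ms=7, network_ms=0.5, queueing_ms=1))
```

The share is worth reporting alongside the label: a component at 95% of the total deserves all the attention, while one at 40% means the next-largest term matters too.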

Latency budget        Achievable approach
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
< 10ms end-to-end     Same-DC client, warm cache, no downstream calls. Realistic only for intra-DC service calls.
< 50ms end-to-end     Edge runtime + Redis cache, no DB call. Read-heavy public API.
< 100ms p50           Standard same-region API, one or two indexed DB reads, modest fan-out. The default SLA.
< 200ms p99           Above plus queue headroom (target 50-70% utilisation). Where most APIs operate.
< 500ms global p99    Reasonable for cross-continent without edge; aggressive with edge.
> 1s                  Either a slow API or a slow client connection. Profile to find out which.

When this is asked in interviews#

Latency estimation comes up in every system-design and API-design interview:

  • The standalone “what’s the latency?” question. Decompose: processing + network + queueing. Cite the numbers. Don’t wave.
  • The “can you hit this SLA?” follow-up. Build the budget from the parts. Subtract each cost; what’s left for processing? Is it feasible?
  • The “p50 vs p99” question. Queueing time is non-linear in utilisation. p50 might be 50ms; p99 at 95% utilisation could be 200ms with the same processing time.
  • The “how do you bring it down?” follow-up. Indexes, parallelisation, caching, edge placement. Order by effort vs win.
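The "can you hit this SLA?" follow-up is a subtraction. Here is a sketch of that budget check; the function name and the cost figures are illustrative assumptions, not measurements.

```python
def remaining_budget_ms(sla_ms: float, fixed_costs_ms: dict[str, float]) -> float:
    """Subtract the costs you cannot avoid; what is left is the processing budget."""
    return sla_ms - sum(fixed_costs_ms.values())


# Illustrative fixed costs for a same-region consumer API with a 100 ms SLA.
costs = {
    "client round-trip": 50,
    "gateway to API": 0.5,
    "queueing at 50% utilisation": 3,
}
left = remaining_budget_ms(100, costs)
print(f"{left:g} ms left for processing")

# A negative budget means the design cannot meet the SLA as drawn,
# for example three sequential 50 ms calls against a 100 ms target.
overrun = remaining_budget_ms(100, {"call 1": 50, "call 2": 50, "call 3": 50})
print(f"{overrun:g} ms")
```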

Specific points to make:

  • Recite the numbers. Memory ~100ns, SSD ~100μs, same-DC RTT ~0.5ms, cross-region ~70ms, cross-continent ~80-150ms. Reference Jeff Dean’s 2007 talk as the canonical source.
  • Decompose into processing + network + queueing. Don’t skip queueing — it’s the source of most p99 surprises.
  • Acknowledge that user-to-API network usually dominates. This is why edge runtimes matter.
  • Specify the load level. p50 at 50% utilisation; p99 at design capacity. The same processing time gives very different latencies at different loads.
  • Use round numbers. Back-of-envelope is approximate; “~50ms” beats “47.3ms” every time.

The strongest one-liner: “Latency is processing plus network plus queueing. Memory is 100ns, SSD is 100μs, datacenter RTT is half a ms, cross-continent is 80ms. Add the parts; if it doesn’t fit the SLA, find which part is the dominant cost and move on that one.”
