Estimating API Latency — Back-of-Envelope
Processing time + network time + queueing. The numbers every engineer should know (memory, SSD, datacenter RTT, cross-continent RTT).
Summary
Every API request’s latency decomposes into three parts:
- Processing time — what the server does. CPU work, memory access, disk reads, dependent service calls.
- Network time — what the wire does. DNS, TCP, TLS, the round-trip to the server, the round-trip back.
- Queueing time — what the server is doing for other people. The time the request waits in queue (kernel accept queue, application thread pool, downstream service queue) before processing starts.
A back-of-envelope latency estimate is the sum of these three. The senior move in an API-design interview is to estimate each one explicitly rather than wave at the answer. “Roughly 50ms” is a junior answer; “30ms processing (one indexed DB read + serialise) + 0.5ms intra-DC network + 10ms queueing under expected load = 40-50ms” is a senior answer.
The numbers you need to do this, which every infrastructure engineer should be able to recite, were popularised by Jeff Dean’s 2007 talk. Updated for 2026 storage and network reality, they cluster into a handful of orders of magnitude. Once you have the orders of magnitude in your head, every API latency estimate is arithmetic.
Why it matters
Three reasons the latency mental model is core to API design:
- “Can the API hit this SLA?” is the question behind every design choice. Synchronous or async? Cached or fresh? One call or many? Each decision adds or removes ms in known amounts. Without the numbers, the choice is guesswork.
- The interviewer will ask “what’s the latency of your design?” They want to see you can decompose it. Hand-waving means you don’t know whether your design is fast or slow.
- Order-of-magnitude reasoning catches design bugs early. A design with three sequential cross-region calls at ~50ms each spends 3 × 50ms = 150ms on pure network, far over a 100ms SLA. Naming the number forces the redesign (parallelise, batch, move closer).
The senior signal in an interview: “Memory access is ~100ns. SSD read is ~100μs. Datacenter RTT is ~0.5ms. Cross-continent RTT is ~80ms. If I add a cross-continent call, I just added 80ms to my p50.”
How it works
The numbers: order of magnitude reference
Updated for 2026 hardware and a typical cloud deployment:
| Operation | Time | Note |
|---|---|---|
| L1 cache reference | ~1 ns | A few cycles on a modern multi-GHz core |
| Branch mispredict | ~3 ns | Pipeline flush, on the order of 10-20 cycles |
| L2 cache reference | ~4 ns | Private per-core cache on most modern CPUs |
| Mutex lock/unlock | ~17 ns | Uncontended |
| Main memory (RAM) reference | ~100 ns | DRAM access |
| Compress 1 KB with zstd | ~1-3 μs | Modern compression |
| Send 1 KB over 10 Gbps network (intra-DC) | ~1 μs | Wire-time only, no propagation (about 8,000 bits ÷ 10 Gbps ≈ 0.8 μs) |
| SSD random read | ~100 μs | NVMe SSD, ~10x faster than 2010 SATA SSD |
| Read 1 MB sequentially from RAM | ~50 μs | DRAM streaming |
| Read 1 MB sequentially from SSD | ~500 μs | NVMe sequential |
| Same-AZ network round-trip | ~0.5 ms | Same datacenter, same rack-cluster |
| Cross-AZ network round-trip (same region) | ~1-2 ms | Different availability zones |
| Cross-region (US-East to US-West) | ~70 ms | Continental scale |
| Cross-continent (US to EU) | ~80 ms | NYC → London ~75ms |
| US to Asia | ~150 ms | NYC → Singapore ~220ms; SF → Tokyo ~110ms |
| HDD random seek (legacy) | ~10 ms | Spinning disk — rare in 2026 production |
| TCP three-way handshake | 1 × RTT | One round-trip extra |
| TLS 1.3 handshake (new) | 1 × RTT | One extra; full TLS 1.2 is 2 × RTT |
| TLS 1.3 resumed (0-RTT) | 0 × RTT | Session resumption |
The clusters to keep in your head:
- Nanoseconds: CPU, cache, RAM. Free at API scale.
- Microseconds: Local disk (SSD), in-process work. Negligible per request.
- Sub-millisecond: Same-DC network. Cheap.
- Single-digit milliseconds: Cross-AZ network, busy disk, cache-miss DB query. Noticeable.
- Tens of milliseconds: Cross-region network, cold DB queries, S3 reads. Add up fast.
- Hundreds of milliseconds: Cross-continent network, cold-start serverless, image processing. Budget destroyers.
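The clusters can be written down as a small lookup. This is a minimal sketch: the dictionary keys and the exact cluster boundaries (for example, treating anything from 500 μs up to 1 ms as "sub-millisecond") are assumptions chosen to match the table above, not standard definitions.

```python
# Order-of-magnitude latencies from the reference table, in nanoseconds.
# Dictionary keys and cluster thresholds are illustrative assumptions.
NS = 1
US = 1_000 * NS
MS = 1_000 * US

LATENCY_NS = {
    "ram_reference": 100 * NS,
    "ssd_random_read": 100 * US,
    "same_az_rtt": 500 * US,
    "cross_az_rtt": 2 * MS,
    "cross_region_rtt": 70 * MS,
    "us_to_asia_rtt": 150 * MS,
}

# (exclusive upper bound in ns, cluster name), checked in order.
CLUSTERS = [
    (1 * US, "nanoseconds"),
    (500 * US, "microseconds"),
    (1 * MS, "sub-millisecond"),
    (10 * MS, "single-digit milliseconds"),
    (100 * MS, "tens of milliseconds"),
]


def cluster(duration_ns: float) -> str:
    """Return the name of the cluster a duration falls into."""
    for upper_bound, name in CLUSTERS:
        if duration_ns < upper_bound:
            return name
    return "hundreds of milliseconds"


if __name__ == "__main__":
    for operation, duration in LATENCY_NS.items():
        print(f"{operation:>18}: {cluster(duration)}")
    # Number of RAM references that fit inside one same-AZ round trip.
    print(LATENCY_NS["same_az_rtt"] // LATENCY_NS["ram_reference"])
```

The last line makes the "free at API scale" point concrete: thousands of memory references fit inside a single same-datacenter round trip.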
Processing time: what the server does
For a typical CRUD endpoint:
| Step | Time | Note |
|---|---|---|
| Receive request | ~10 μs | Kernel + framework parsing |
| Parse JSON | ~10-100 μs | Depends on body size |
| Authenticate (validate token) | ~100 μs or ~1 ms | HMAC check, or DB lookup |
| Authorize (RBAC check) | ~100 μs or ~1 ms | Cached, or DB lookup |
| Main query (DB indexed read) | ~1-5 ms | Postgres, MySQL with index |
| Serialise response | ~10-100 μs | JSON encode |
| Send response | ~10 μs | Kernel write |
| Total | ~2-7 ms | Warm path, no cross-service calls |

For a more complex endpoint (joining three services):

| Step | Time | Note |
|---|---|---|
| Parse + auth + authz | ~1 ms | |
| Call user-service | ~5 ms | 1 RTT + processing |
| Call order-service | ~5 ms | |
| Call inventory-service | ~5 ms | |
| Aggregate + serialise | ~1 ms | |
| Total (sequential) | ~17 ms | |
| Total (parallel) | ~7 ms | Max of 5 + overhead |

Sequential vs parallel is the biggest processing-time lever. Three 5ms calls in sequence is 15ms; three 5ms calls in parallel is 5ms. The senior move on any multi-service endpoint is to identify which calls have no inter-dependencies and parallelise them.
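The sequential-versus-parallel arithmetic can be sketched as two small functions. The function names are illustrative, and the model assumes the parallel calls are fully independent and that fan-out itself adds no overhead, which real systems only approximate.

```python
def sequential_ms(overhead_ms: float, call_ms: list[float]) -> float:
    """Calls run one after another: their latencies add up."""
    return overhead_ms + sum(call_ms)


def parallel_ms(overhead_ms: float, call_ms: list[float]) -> float:
    """Independent calls run concurrently: the slowest one sets the pace."""
    return overhead_ms + max(call_ms, default=0.0)


if __name__ == "__main__":
    # Three-service endpoint from the text: 1 ms parse/auth + 1 ms aggregate.
    overhead = 1.0 + 1.0
    calls = [5.0, 5.0, 5.0]
    print(sequential_ms(overhead, calls))  # 17.0
    print(parallel_ms(overhead, calls))    # 7.0
```

When the calls are uneven, the parallel total tracks the slowest call, so speeding up a fast dependency buys nothing until the slowest one is addressed.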
Network time: what the wire does
The wire’s contribution decomposes by leg:
| Leg | Time | Note |
|---|---|---|
| Client → API gateway | ~30-100 ms | Consumer internet |
| Gateway → API server | ~0.5 ms | Same DC |
| API server → DB | ~0.5 ms | Same DC |
| API server → cache | ~0.5 ms | Same DC |
| API server → cross-DC service | ~1-5 ms | Intra-region |
| API server → other-region service | ~70-150 ms | |

The user-to-server hop dominates for most APIs. A user on cellular in Mumbai talking to a US-East data center pays ~250ms round-trip. The same user talking to a Mumbai PoP pays ~20ms. Where the API runs matters more than how fast it computes.
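The handshake rows of the reference table multiply the user-to-server round trip on a cold connection. The sketch below assumes a new TCP connection (1 RTT) plus a full TLS 1.3 handshake (1 RTT) before the request itself (1 RTT); it ignores DNS and packet loss, and `server_ms` is a made-up placeholder for server-side time.

```python
TCP_HANDSHAKE_RTTS = 1    # from the reference table
TLS13_HANDSHAKE_RTTS = 1  # full TLS 1.3 handshake, no resumption


def request_time_ms(rtt_ms: float, server_ms: float, cold: bool) -> float:
    """Estimate one request: handshake round trips (if cold) + 1 RTT + server time."""
    handshake_rtts = TCP_HANDSHAKE_RTTS + TLS13_HANDSHAKE_RTTS if cold else 0
    return (handshake_rtts + 1) * rtt_ms + server_ms


if __name__ == "__main__":
    server_ms = 5.0  # assumed server-side time, for illustration only
    for label, rtt in [("Mumbai -> US-East", 250.0), ("Mumbai -> Mumbai PoP", 20.0)]:
        cold = request_time_ms(rtt, server_ms, cold=True)
        warm = request_time_ms(rtt, server_ms, cold=False)
        print(f"{label}: cold {cold:.0f} ms, warm {warm:.0f} ms")
```

Under these assumptions a cold connection triples the network leg, which is why connection reuse and nearby termination both matter for long-haul users.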
Latency-by-region cheat (median):
| User location | US-East | US-West | EU-West | AP-South | AP-Northeast |
|---|---|---|---|---|---|
| New York | 5ms | 70ms | 80ms | 220ms | 200ms |
| London | 75ms | 140ms | 5ms | 110ms | 230ms |
| Singapore | 220ms | 170ms | 170ms | 50ms (Mumbai) | 70ms |
| Tokyo | 200ms | 110ms | 230ms | 100ms | 5ms |
This is why edge runtimes (Cloudflare Workers, Vercel Edge, Lambda@Edge) matter — they pull the network leg from “user → distant origin” to “user → nearby PoP”.
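The cheat table can drive a quick placement estimate. The sketch below copies the table values as given; the helper names are hypothetical, and picking a single region by worst-case user latency is just one possible criterion.

```python
# Median latencies (ms) copied from the cheat table above.
RTT_MS = {
    "New York": {"US-East": 5, "US-West": 70, "EU-West": 80, "AP-South": 220, "AP-Northeast": 200},
    "London": {"US-East": 75, "US-West": 140, "EU-West": 5, "AP-South": 110, "AP-Northeast": 230},
    "Singapore": {"US-East": 220, "US-West": 170, "EU-West": 170, "AP-South": 50, "AP-Northeast": 70},
    "Tokyo": {"US-East": 200, "US-West": 110, "EU-West": 230, "AP-South": 100, "AP-Northeast": 5},
}


def nearest_region(user_location: str) -> str:
    """Region with the lowest latency for one user location."""
    regions = RTT_MS[user_location]
    return min(regions, key=regions.get)


def worst_case_ms(region: str) -> int:
    """Highest latency any listed user location sees to this region."""
    return max(row[region] for row in RTT_MS.values())


def best_single_region() -> str:
    """Single region that minimises the worst-case user latency."""
    regions = next(iter(RTT_MS.values())).keys()
    return min(regions, key=worst_case_ms)


if __name__ == "__main__":
    for user in RTT_MS:
        print(user, "->", nearest_region(user))
    print("best single region:", best_single_region(), worst_case_ms(best_single_region()), "ms")
```

Even the best single region leaves some users well over 100 ms away, while per-user nearest placement is bounded by the closest PoP.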
Queueing time: what the server is doing for others
The often-forgotten third term. Even if processing is 5ms and network is 1ms, the request can sit in queue for 50ms under load. Two queues to think about:
- Application queue. If the server has 100 worker threads and 200 requests in flight, half the requests are queued. The expected wait is roughly (in-flight / workers - 1) × average-processing-time. Little’s Law (L = λW) is the formal version.
- Downstream queue. A call to a saturated database waits in the database’s connection pool queue.
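Both queue estimates above fit in a few lines. This is a rough sketch: the function names are illustrative, the wait formula is the rule of thumb from the text rather than an exact queueing result, and the 2,000 requests/second figure is an assumed load.

```python
def requests_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return arrival_rate_per_s * mean_latency_s


def app_queue_wait_ms(in_flight: float, workers: int, service_ms: float) -> float:
    """Rule of thumb from the text: (in-flight / workers - 1) x service time.

    Floored at zero, since a server with spare workers has no queue.
    """
    return max(0.0, in_flight / workers - 1) * service_ms


if __name__ == "__main__":
    # Assumed load: 2,000 requests/second at 50 ms mean latency.
    print(requests_in_flight(2000, 0.050))
    # The text's example: 200 in flight, 100 workers, 5 ms of processing each.
    print(app_queue_wait_ms(200, 100, 5.0))
```

Little's Law also works in reverse: with a fixed worker pool, the sustainable request rate is bounded by workers divided by mean latency.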
Queueing time grows non-linearly with utilisation. The classic M/M/1 queue model: wait time is service_time × utilisation / (1 - utilisation). At 50% utilisation, wait = processing time. At 90%, wait = 9 × processing time. At 99%, wait = 99 × processing time. This is why p99 latency spikes when capacity gets tight, while p50 looks fine.
| Utilisation | Queue wait (as multiple of service time) |
|---|---|
| 50% | 1× |
| 80% | 4× |
| 90% | 9× |
| 95% | 19× |
| 99% | 99× |

A senior latency estimate for an SLA needs to specify the load level. “5ms at p50 under 50% utilisation” is honest; “5ms” without qualification glosses over the queue.
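The utilisation table follows directly from the M/M/1 formula quoted above. The sketch below reproduces it and adds the inverse, which answers "how hot can the server run for a given wait budget?". It inherits the M/M/1 assumptions (a single server, random arrivals, exponentially distributed service times), so real systems will deviate; the function names are illustrative.

```python
def mm1_wait_multiple(utilisation: float) -> float:
    """Mean M/M/1 queue wait as a multiple of service time: rho / (1 - rho)."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return utilisation / (1.0 - utilisation)


def max_utilisation(wait_multiple: float) -> float:
    """Inverse of the formula: highest utilisation with wait <= wait_multiple x service time."""
    if wait_multiple < 0.0:
        raise ValueError("wait_multiple must be non-negative")
    return wait_multiple / (1.0 + wait_multiple)


if __name__ == "__main__":
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        print(f"{rho:.0%}: {mm1_wait_multiple(rho):.0f}x service time")
    # Utilisation ceiling if queueing may cost at most 2x the service time.
    print(f"{max_utilisation(2.0):.0%}")
```

The inverse shows why utilisation targets well below 100% are common: keeping mean wait under twice the service time already caps utilisation at about two thirds in this model.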
Putting it together: three worked estimates
Example 1: Profile read, warm cache, same region.

| Component | Time | Note |
|---|---|---|
| Network: client ↔ gateway round trip | ~50 ms | Consumer internet RTT; covers the request and the response |
| Network: gateway → API | ~0.5 ms | |
| Processing: parse + auth + serialise | ~1 ms | |
| Processing: Redis cache read | ~1 ms | Network 0.5ms + work 0.5ms |
| Queueing | ~1 ms | Assume 50% utilisation |
| Total (p50) | ~54 ms | |
| Server-side only (excluding the client hop) | ~3.5 ms | Gateway → API → response |

Example 2: Order checkout, three-service fanout, same region.

| Component | Time | Note |
|---|---|---|
| Client ↔ gateway round trip | ~50 ms | |
| Gateway to API | ~0.5 ms | Warm connection |
| Auth + authz | ~1 ms | |
| Fan-out: inventory service | ~5 ms | In parallel |
| Fan-out: payment service | ~30 ms | In parallel; calls an external provider |
| Fan-out: order service | ~3 ms | In parallel |
| Total fan-out | ~30 ms | Max of the three parallel calls |
| Aggregate + serialise | ~1 ms | |
| Queueing | ~3 ms | 50% utilisation |
| Total (p50) | ~86 ms | RTT + 0.5 + 1 + fan-out max + 1 + 3 |

Example 3: Cross-continent, user in Tokyo, API in US-East.

| Component | Time | Note |
|---|---|---|
| Client ↔ gateway round trip | ~200 ms | The killer |
| Same as Example 1 internally | ~3.5 ms | |
| Total (p50) | ~204 ms | |
| With edge runtime (PoP in Tokyo) | ~28.5 ms | ~25 ms client RTT + ~3.5 ms work |

The third example shows the edge-runtime payoff: the same logic at the edge is roughly 7x faster end-to-end because the network hop collapses.
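An estimate like these can be kept as a list of labelled legs, so the total and the dominant component fall out mechanically. The sketch below uses only the server-side legs of Example 1; the `Leg` type and helper names are hypothetical, and the Redis read is split into its network and work halves as the example describes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Leg:
    name: str
    kind: str  # "network", "processing" or "queueing"
    ms: float


def breakdown(legs: list[Leg]) -> dict[str, float]:
    """Sum the estimate per component (network / processing / queueing)."""
    totals: dict[str, float] = defaultdict(float)
    for leg in legs:
        totals[leg.kind] += leg.ms
    return dict(totals)


def dominant(legs: list[Leg]) -> str:
    """The component contributing the most time: the first place to optimise."""
    totals = breakdown(legs)
    return max(totals, key=totals.get)


# Server-side legs of Example 1 (client hop deliberately left out).
SERVER_SIDE = [
    Leg("gateway -> API", "network", 0.5),
    Leg("parse + auth + serialise", "processing", 1.0),
    Leg("Redis network hop", "network", 0.5),
    Leg("Redis work", "processing", 0.5),
    Leg("queueing at 50% utilisation", "queueing", 1.0),
]

if __name__ == "__main__":
    print(breakdown(SERVER_SIDE))
    print(sum(leg.ms for leg in SERVER_SIDE), "ms total; dominant:", dominant(SERVER_SIDE))
```

Appending a client-hop leg to the list shows immediately how the dominant component flips from processing to network once the user's connection is included.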
Where to spend optimisation effort
Given the latency breakdown, where does each ms of optimisation come from?
| Lever | Typical win | Effort |
|---|---|---|
| Add an index to a slow query | 10-100ms | Low |
| Parallelise sequential fan-out | Up to (N - 1) × per-call time | Medium |
| Add a Redis cache (in front of DB) | 1-5ms per cache hit | Low |
| Move to edge runtime (cross-continent) | 100-300ms | Medium-high |
| Switch to HTTP/2 or HTTP/3 | 50-100ms (handshake) | Infrastructure |
| Pre-compute and cache hot reads | Up to full processing time | Variable |
| Reduce response payload (sparse fields) | 10-50ms (bandwidth) | Low |
| Use a faster JSON encoder | 100μs-1ms | Low |
| Switch DB from network to in-process | ~1ms per call | High (architecture) |
The order matters. Add an index before you switch JSON encoders. Parallelise before you cache. Move to edge before you optimise serialisation.
Variants and trade-offs
Compute-bound API. Latency dominated by CPU and downstream calls. Optimise by parallelising, caching, indexing, reducing fan-out. The network is the small part; the work is the big part.
Network-bound API. Latency dominated by the user-to-server hop. Optimise by edge placement, HTTP/2 or HTTP/3, fewer round-trips, smaller payloads. The work is the small part; the wire is the big part.
Most consumer APIs are network-bound for cold connections and queue-bound under load. Internal microservice meshes are usually compute-bound (network is intra-DC and nearly free).
| Latency budget | Achievable approach |
|---|---|
| < 10ms end-to-end | Same-DC client, warm cache, no downstream calls. Realistic only for intra-DC service calls. |
| < 50ms end-to-end | Edge runtime + Redis cache, no DB call. Read-heavy public API. |
| < 100ms p50 | Standard same-region API, one or two indexed DB reads, modest fan-out. The default SLA. |
| < 200ms p99 | Above plus queue headroom (target 50-70% utilisation). Where most APIs operate. |
| < 500ms global p99 | Reasonable for cross-continent without edge; aggressive with edge. |
| > 1s | Either a slow API or a slow client connection. Profile to find out which. |
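Checking a design against one of these budgets is a subtraction: take the SLA, remove the network and queueing costs, and see what is left for processing. A minimal sketch, with hypothetical function names and assumed inputs (a 50 ms client round trip, 0.5 ms intra-DC hop, 3 ms of queueing):

```python
def processing_budget_ms(sla_ms: float, network_ms: float, queueing_ms: float) -> float:
    """Time left for server-side work after network and queueing are paid."""
    return sla_ms - network_ms - queueing_ms


def fits(sla_ms: float, network_ms: float, queueing_ms: float, processing_ms: float) -> bool:
    """True if the estimated processing time fits inside the remaining budget."""
    return processing_ms <= processing_budget_ms(sla_ms, network_ms, queueing_ms)


if __name__ == "__main__":
    # Assumed same-region design against a 100 ms budget.
    print(processing_budget_ms(100.0, 50.0 + 0.5, 3.0))
    print(fits(100.0, 50.0 + 0.5, 3.0, processing_ms=32.0))
    # Three sequential 50 ms network calls: the budget is negative before any work is done.
    print(processing_budget_ms(100.0, 3 * 50.0, 0.0))
```

A negative budget means no amount of processing optimisation can rescue the design; the network or queueing term has to change first.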
When this is asked in interviews
Latency estimation comes up in every system-design and API-design interview:
- The standalone “what’s the latency?” question. Decompose: processing + network + queueing. Cite the numbers. Don’t wave.
- The “can you hit this SLA?” follow-up. Build the budget from the parts. Subtract each cost; what’s left for processing? Is it feasible?
- The “p50 vs p99” question. Queueing time is non-linear in utilisation. p50 might be 50ms; p99 at 95% utilisation could be 200ms with the same processing time.
- The “how do you bring it down?” follow-up. Indexes, parallelisation, caching, edge placement. Order by effort vs win.
Specific points to make:
- Recite the numbers. Memory ~100ns, SSD ~100μs, same-DC RTT ~0.5ms, cross-region ~70ms, cross-continent ~80-150ms. Reference Jeff Dean’s 2007 talk as the canonical source.
- Decompose into processing + network + queueing. Don’t skip queueing — it’s the source of most p99 surprises.
- Acknowledge that user-to-API network usually dominates. This is why edge runtimes matter.
- Specify the load level. p50 at 50% utilisation; p99 at design capacity. The same processing time gives very different latencies at different loads.
- Use round numbers. Back-of-envelope is approximate; “~50ms” beats “47.3ms” every time.
The strongest one-liner: “Latency is processing plus network plus queueing. Memory is 100ns, SSD is 100μs, datacenter RTT is half a ms, cross-continent is 80ms. Add the parts; if it doesn’t fit the SLA, find which part is the dominant cost and move on that one.”
Related concepts
- Latency and Throughput — the two-dimensional view; latency-vs-throughput trade-off.
- Speeding Up Web Page Loading — applying the latency budget to a full page-load path.
- Caching at Different Layers — where to spend cache money to reduce processing time.
- The Evolution of HTTP — 1.1, 2, 3 — handshake-time wins from HTTP/2 multiplexing and HTTP/3 0-RTT.
- The API-Design Walk-through — the seven-step recipe; latency estimation is step seven.