Estimating API Latency — Back-of-Envelope — API Design

Summary#

Every API request’s latency decomposes into three parts:

Processing time — what the server does. CPU work, memory access, disk reads, dependent service calls.
Network time — what the wire does. DNS, TCP, TLS, the round-trip to the server, the round-trip back.
Queueing time — what the server is doing for other people. The time the request waits in queue (kernel accept queue, application thread pool, downstream service queue) before processing starts.

A back-of-envelope latency estimate is the sum of these three. The senior move in an API-design interview is to estimate each one explicitly rather than wave at the answer. “Roughly 50ms” is a junior answer; “30ms processing (one indexed DB read + serialise) + 0.5ms intra-DC network + 10ms queueing under expected load = 40-50ms” is a senior answer.

The numbers you need for this, the ones every infrastructure engineer should be able to recite, were popularised by Jeff Dean’s 2007 talk. Updated for 2026 storage and network reality, they cluster into a handful of orders of magnitude. Once you have the orders of magnitude in your head, every API latency estimate is arithmetic.

Why it matters#

Three reasons the latency mental model is core to API design:

“Can the API hit this SLA?” is the question behind every design choice. Synchronous or async? Cached or fresh? One call or many? Each decision adds or removes ms in known amounts. Without the numbers, the choice is guesswork.
The interviewer will ask “what’s the latency of your design?” They want to see you can decompose it. Hand-waving means you don’t know whether your design is fast or slow.
Order-of-magnitude reasoning catches design bugs early. A design with three sequential cross-region calls is 3 × 50ms = 150ms of pure network, far over a 100ms SLA. Naming the number forces the redesign (parallelise, batch, move closer).

The senior signal in an interview: “Memory access is ~100ns. SSD read is ~100μs. Datacenter RTT is ~0.5ms. Cross-continent RTT is ~80ms. If I add a cross-continent call, I just added 80ms to my p50.”

How it works#

The numbers — order of magnitude reference#

Updated for 2026 hardware and a typical cloud deployment:

Operation	Time	Note
L1 cache reference	~1 ns	A few cycles on a modern multi-GHz core
Branch mispredict	~3 ns	Pipeline flush and refill
L2 cache reference	~4 ns	On-die, typically private to each core
Mutex lock/unlock	~17 ns	Uncontended
Main memory (RAM) reference	~100 ns	DRAM access
Compress 1 KB with zstd	~1-3 μs	Modern compression
Send 1 KB over 10 Gbps network (intra-DC)	~1 μs	Wire-time only, no propagation (about 8,000 bits ÷ 10 Gbps ≈ 0.8 μs)
SSD random read	~100 μs	NVMe SSD, ~10x faster than 2010 SATA SSD
Read 1 MB sequentially from SSD	~500 μs	NVMe sequential
Read 1 MB sequentially from RAM	~50 μs	DRAM streaming
Same-AZ network round-trip	~0.5 ms	Same datacenter, same rack-cluster
Cross-AZ network round-trip (same region)	~1-2 ms	Different availability zones
Cross-region (US-East to US-West)	~70 ms	Continental scale
Cross-continent (US to EU)	~80 ms	NYC → London ~75ms
US to Asia	~150 ms	NYC → Singapore ~220ms; SF → Tokyo ~110ms
HDD random seek (legacy)	~10 ms	Spinning disk — rare in 2026 production
TCP three-way handshake	1 × RTT	One round-trip extra
TLS 1.3 handshake (new)	1 × RTT	One extra; full TLS 1.2 is 2 × RTT
TLS 1.3 resumed (0-RTT)	0 × RTT	Session resumption
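
The handshake rows are easiest to apply as a multiplier on RTT. The sketch below turns them into the setup cost of a brand-new connection, before the first request byte is sent. The function name and the version labels are illustrative assumptions; the round-trip counts are the ones in the table.

```python
# Round-trips spent on connection setup, taken from the table above.
# The names used here are illustrative, not from any library.
TCP_HANDSHAKE_RTTS = 1
TLS_HANDSHAKE_RTTS = {
    "tls1.2": 2,          # full TLS 1.2 handshake
    "tls1.3": 1,          # new TLS 1.3 session
    "tls1.3-resumed": 0,  # 0-RTT session resumption
}


def connection_setup_ms(rtt_ms: float, tls: str = "tls1.3") -> float:
    """Estimate setup time for a new TCP + TLS connection at a given RTT."""
    return rtt_ms * (TCP_HANDSHAKE_RTTS + TLS_HANDSHAKE_RTTS[tls])


if __name__ == "__main__":
    for rtt_ms in (0.5, 80, 200):
        print(f"RTT {rtt_ms} ms: new TLS 1.3 connection costs "
              f"{connection_setup_ms(rtt_ms)} ms before the first request")
```

At intra-DC RTTs the setup cost is noise; at cross-continent RTTs it is larger than most processing budgets, which is the argument for keeping connections warm.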

The clusters to keep in your head:

Nanoseconds: CPU, cache, RAM. Free at API scale.
Microseconds: Local disk (SSD), in-process work. Negligible per request.
Sub-millisecond: Same-DC network. Cheap.
Single-digit milliseconds: Cross-AZ network, busy disk, cache-miss DB query. Noticeable.
Tens of milliseconds: Cross-region and US-to-EU network, cold DB queries, S3 reads. Add up fast.
Hundreds of milliseconds: Long-haul intercontinental network (US to Asia), cold-start serverless, image processing. Budget destroyers.

Processing time — what the server does#

For a typical CRUD endpoint:

   Receive request                       ~10 μs   (kernel + framework parsing)
   Parse JSON                            ~10-100 μs  (depends on body size)
   Authenticate (validate token)         ~100 μs  (HMAC) or ~1ms (DB lookup)
   Authorize (RBAC check)                ~100 μs  (cached) or ~1ms (DB lookup)
   Main query (DB indexed read)          ~1-5 ms  (Postgres, MySQL with index)
   Serialise response                    ~10-100 μs (JSON encode)
   Send response                         ~10 μs  (kernel write)
   ─────────────────────────────────────────────
   Total                                 ~2-7 ms (warm path, no cross-service)
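
A minimal sketch of the same addition, carrying a low and a high estimate for each step. The step names and the (low, high) pairs are copied from the breakdown above; the dictionary and function names are illustrative.

```python
# (low, high) estimates in milliseconds, copied from the CRUD breakdown above.
CRUD_STEPS_MS = {
    "receive request": (0.01, 0.01),
    "parse JSON": (0.01, 0.1),
    "authenticate": (0.1, 1.0),
    "authorize": (0.1, 1.0),
    "main query": (1.0, 5.0),
    "serialise response": (0.01, 0.1),
    "send response": (0.01, 0.01),
}


def total_range_ms(steps: dict) -> tuple:
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range_ms(CRUD_STEPS_MS)
    print(f"warm path: {low:.2f} to {high:.2f} ms")
    # Largest contributor at the high end of each range.
    print("dominant step:", max(CRUD_STEPS_MS, key=lambda s: CRUD_STEPS_MS[s][1]))
```

The sum stays in the low single-digit milliseconds, and the indexed read is the only step large enough to move it.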

For a more complex endpoint (joining three services):

   Parse + auth + authz                  ~1 ms
   Call user-service                     ~5 ms  (1 RTT + processing)
   Call order-service                    ~5 ms
   Call inventory-service                ~5 ms
   Aggregate + serialise                 ~1 ms
   ─────────────────────────────────────────────
   Total (sequential)                    ~17 ms
   Total (parallel)                      ~7 ms  (max of 5 + overhead)

Sequential vs parallel is the biggest processing-time lever. Three 5ms calls in sequence is 15ms; three 5ms calls in parallel is 5ms. The senior move on any multi-service endpoint is to identify which calls have no inter-dependencies and parallelise them.
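
One way to see the difference is to simulate it. The sketch below uses `asyncio.sleep` as a stand-in for the three downstream calls; `call_service` and the service names are hypothetical, and the simulated latencies are set to 50 ms (ten times the figures above) so that scheduler noise does not swamp the measurement.

```python
import asyncio
import time

# Simulated latencies in seconds: 10x the ~5 ms figures in the text.
CALLS = [
    ("user-service", 0.05),
    ("order-service", 0.05),
    ("inventory-service", 0.05),
]


async def call_service(name: str, latency_s: float) -> str:
    """Hypothetical downstream call: sleeps instead of doing real I/O."""
    await asyncio.sleep(latency_s)
    return name


async def fetch_sequential(calls):
    """Await each call in turn: total time is roughly the sum."""
    results = []
    for name, latency_s in calls:
        results.append(await call_service(name, latency_s))
    return results


async def fetch_parallel(calls):
    """Start all calls together: total time is roughly the slowest one."""
    return list(await asyncio.gather(
        *(call_service(name, latency_s) for name, latency_s in calls)
    ))


def timed(fetch, calls):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch(calls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for fetch in (fetch_sequential, fetch_parallel):
        _, elapsed = timed(fetch, CALLS)
        print(f"{fetch.__name__}: {elapsed * 1000:.0f} ms")
```

`asyncio.gather` only helps when the calls are independent; if the order call needs the user record first, that pair stays sequential and only the remaining call can overlap.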

Network time — what the wire does#

The wire’s contribution decomposes by leg:

   Client → API gateway                  ~30-100 ms  (consumer internet)
   Gateway → API server (same DC)        ~0.5 ms
   API server → DB (same DC)             ~0.5 ms
   API server → cache (same DC)          ~0.5 ms
   API server → cross-DC service         ~1-5 ms     (intra-region)
   API server → other-region service     ~70-150 ms

The user-to-server hop dominates for most APIs. A user on cellular in Mumbai talking to a US-East data center pays ~250ms round-trip. The same user talking to a Mumbai PoP pays ~20ms. Where the API runs matters more than how fast it computes.

Latency-by-region cheat (median):

User location	US-East	US-West	EU-West	AP-South	AP-Northeast
New York	5ms	70ms	80ms	220ms	200ms
London	75ms	140ms	5ms	110ms	230ms
Singapore	220ms	170ms	170ms	50ms (Mumbai)	70ms
Tokyo	200ms	110ms	230ms	100ms	5ms

This is why edge runtimes (Cloudflare Workers, Vercel Edge, Lambda@Edge) matter — they pull the network leg from “user → distant origin” to “user → nearby PoP”.
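
The cheat sheet can be used directly to compare deployment footprints. The sketch below copies the table into a dictionary and asks, for a given set of regions, which one each user city would reach and what the worst median RTT is. The function names are illustrative, and the figures are the rough medians from the table, not measurements.

```python
# Median RTTs in milliseconds, copied from the cheat sheet above.
REGIONS = ("US-East", "US-West", "EU-West", "AP-South", "AP-Northeast")
RTT_MS = {
    "New York": dict(zip(REGIONS, (5, 70, 80, 220, 200))),
    "London": dict(zip(REGIONS, (75, 140, 5, 110, 230))),
    "Singapore": dict(zip(REGIONS, (220, 170, 170, 50, 70))),
    "Tokyo": dict(zip(REGIONS, (200, 110, 230, 100, 5))),
}


def nearest_region(user_city: str, deployed: list) -> str:
    """Pick the deployed region with the lowest median RTT for this city."""
    return min(deployed, key=lambda region: RTT_MS[user_city][region])


def worst_case_rtt_ms(deployed: list) -> int:
    """Worst median RTT across all cities, each routed to its nearest region."""
    return max(
        RTT_MS[city][nearest_region(city, deployed)] for city in RTT_MS
    )


if __name__ == "__main__":
    for deployed in (["US-East"], ["US-East", "EU-West", "AP-Northeast"]):
        print(deployed, "-> worst median RTT", worst_case_rtt_ms(deployed), "ms")
```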

Queueing time — what the server is doing for others#

The often-forgotten third term. Even if processing is 5ms and network is 1ms, the request can sit in queue for 50ms under load. Two queues to think about:

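Application queue. If the server has 100 worker threads and 200 requests in flight, 100 are being served and the other 100 (half) are queued. A newly arriving request waits roughly ((in-flight / workers) − 1) × average processing time, which here is one full processing time. Little’s Law (L = λW, where L is the number of requests in the system, λ the arrival rate, and W the time each request spends in the system) is the formal version.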
Downstream queue. A call to a saturated database waits in the database’s connection pool queue.
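
Both rules of thumb for the application queue fit in a few lines. This is a rough sketch, assuming a fixed-size worker pool and treating the wait formula as the approximation it is; the function names are illustrative.

```python
def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return arrival_rate_per_s * time_in_system_s


def pool_wait_ms(in_flight_requests: float, workers: int,
                 avg_processing_ms: float) -> float:
    """Approximate wait for a newly arriving request in a fixed worker pool.

    Uses (in-flight / workers - 1) x average processing time, floored at
    zero because a pool with idle workers has no queue.
    """
    backlog = max(0.0, in_flight_requests / workers - 1)
    return backlog * avg_processing_ms


if __name__ == "__main__":
    # 2,000 requests/s at 50 ms each keeps about 100 requests in flight.
    print("in flight:", in_flight(2000, 0.05))
    # 200 in flight on 100 workers: wait about one processing time.
    print("wait:", pool_wait_ms(200, 100, 5), "ms")
```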

Queueing time grows non-linearly with utilisation. The classic M/M/1 queue model: wait time is service_time × utilisation / (1 - utilisation). At 50% utilisation, wait = processing time. At 90%, wait = 9 × processing time. At 99%, wait = 99 × processing time. This is why p99 latency spikes when capacity gets tight, while p50 looks fine.

   Utilisation    Queue wait (as multiple of service time)
   ─────────────────────────────────────────────────────
   50%            1×
   80%            4×
   90%            9×
   95%            19×
   99%            99×
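
The table follows directly from the M/M/1 formula. The sketch below assumes a single server with random arrivals and service times, which is what M/M/1 models; real thread pools behave somewhat better, so treat the output as a pessimistic guide. The function name is illustrative.

```python
def mm1_queue_wait(service_time_ms: float, utilisation: float) -> float:
    """Mean time waiting in queue for an M/M/1 system.

    wait = service_time * utilisation / (1 - utilisation)
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1); the queue is unstable at 1")
    return service_time_ms * utilisation / (1 - utilisation)


if __name__ == "__main__":
    service_time_ms = 5  # assumed processing time for illustration
    for utilisation in (0.5, 0.8, 0.9, 0.95, 0.99):
        wait = mm1_queue_wait(service_time_ms, utilisation)
        print(f"{utilisation:.0%} utilisation: ~{wait:.0f} ms queueing "
              f"on top of {service_time_ms} ms of processing")
```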

A senior latency estimate for an SLA needs to specify the load level. “5ms at p50 under 50% utilisation” is honest; “5ms” without qualification glosses over the queue.

Putting it together — three worked estimates#

Example 1: Profile read, warm cache, same region.

   Network: client → gateway              ~50 ms  (consumer internet, one way)
   Network: gateway → API                 ~0.5 ms
   Processing: parse + auth + serialise   ~1 ms
   Processing: Redis cache read           ~1 ms   (network 0.5ms + work 0.5ms)
   Queueing (assume 50% util)             ~1 ms
   Network: response back to client       ~50 ms  (one way)
   ─────────────────────────────────────────────
   Total (p50)                            ~103 ms
   Server-side (excluding the client hop) ~3.5 ms  (gateway → API → response)

Example 2: Order checkout, three-service fanout, same region.

   Client to gateway (one way)            ~50 ms
   Gateway to API (warm conn)             ~0.5 ms
   Auth + authz                           ~1 ms
   Parallel fan-out:
     - Inventory service                  ~5 ms
     - Payment service                    ~30 ms   (calls external)
     - Order service                      ~3 ms
   Total fan-out (max)                    ~30 ms
   Aggregate + serialise                  ~1 ms
   Queueing (50% util)                    ~3 ms
   Response back to client                ~50 ms
   ─────────────────────────────────────────────
   Total (p50)                            ~135 ms

Example 3: Cross-continent — user in Tokyo, API in US-East.

   Client to gateway (one way)            ~100 ms   ← the killer (~200 ms RTT)
   Same as Example 1 internally           ~3.5 ms
   Response back (one way)                ~100 ms
   ─────────────────────────────────────────────
   Total (p50)                            ~204 ms
   With edge runtime (PoP in Tokyo)       ~25 ms client RTT + ~3.5 ms work = ~28.5 ms

The third example shows the edge-runtime payoff: the same logic at the edge is roughly 7x faster end-to-end because the network hop collapses.
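
The first two estimates can be reproduced with a small helper that keeps each term labelled, so the dominant cost stays visible. This is one possible sketch; the `LatencyEstimate` class is hypothetical, and the numbers are the ones from Examples 1 and 2.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyEstimate:
    """Accumulates labelled latency terms in milliseconds."""
    name: str
    steps: list = field(default_factory=list)

    def add(self, label: str, ms: float) -> "LatencyEstimate":
        self.steps.append((label, ms))
        return self

    def add_parallel(self, label: str, branches_ms: dict) -> "LatencyEstimate":
        # A parallel fan-out costs as much as its slowest branch.
        return self.add(label, max(branches_ms.values()))

    def total_ms(self) -> float:
        return sum(ms for _, ms in self.steps)


profile_read = (
    LatencyEstimate("Example 1: profile read")
    .add("client to gateway", 50)
    .add("gateway to API", 0.5)
    .add("parse + auth + serialise", 1)
    .add("Redis cache read", 1)
    .add("queueing at 50% utilisation", 1)
    .add("response back to client", 50)
)

checkout = (
    LatencyEstimate("Example 2: order checkout")
    .add("client to gateway", 50)
    .add("gateway to API", 0.5)
    .add("auth + authz", 1)
    .add_parallel("fan-out", {"inventory": 5, "payment": 30, "order": 3})
    .add("aggregate + serialise", 1)
    .add("queueing at 50% utilisation", 3)
    .add("response back to client", 50)
)

if __name__ == "__main__":
    for estimate in (profile_read, checkout):
        label, ms = max(estimate.steps, key=lambda step: step[1])
        print(f"{estimate.name}: ~{estimate.total_ms()} ms, "
              f"largest term is '{label}' at {ms} ms")
```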

Where to spend optimisation effort#

Given the latency breakdown, where does each ms of optimisation come from?

Lever	Typical win	Effort
Add an index to a slow query	10-100ms	Low
Parallelise sequential fan-out	(N − 1) × per-call time	Medium
Add a Redis cache (in front of DB)	1-5ms per cache hit	Low
Move to edge runtime (cross-continent)	100-300ms	Medium-high
Switch to HTTP/2 or HTTP/3	50-100ms (handshake)	Infrastructure
Pre-compute and cache hot reads	Up to full processing time	Variable
Reduce response payload (sparse fields)	10-50ms (bandwidth)	Low
Use a faster JSON encoder	100μs-1ms	Low
Switch DB from network to in-process	~1ms per call	High (architecture)

The order matters. Add an index before you switch JSON encoders. Parallelise before you cache. Move to edge before you optimise serialisation.

Variants and trade-offs#

Compute-bound API. Latency dominated by CPU and downstream calls. Optimise by parallelising, caching, indexing, reducing fan-out. The network is the small part; the work is the big part.

Network-bound API. Latency dominated by the user-to-server hop. Optimise by edge placement, HTTP/2 or HTTP/3, fewer round-trips, smaller payloads. The work is the small part; the wire is the big part.

Most consumer APIs are network-bound for cold connections and queue-bound under load. Internal microservice meshes are usually compute-bound (network is intra-DC and free).

Latency budget	Achievable approach
`< 10ms` end-to-end	Same-DC client, warm cache, no downstream calls. Realistic only for intra-DC service calls.
`< 50ms` end-to-end	Edge runtime + Redis cache, no DB call. Read-heavy public API.
`< 100ms` p50	Standard same-region API, one or two indexed DB reads, modest fan-out. The default SLA.
`< 200ms` p99	Above plus queue headroom (target 50-70% utilisation). Where most APIs operate.
`< 500ms` global p99	Reasonable for cross-continent without edge; comfortable with edge.
`> 1s`	Either a slow API or a slow client connection. Profile to find out which.

When this is asked in interviews#

Latency estimation comes up in most system-design and API-design interviews:

The standalone “what’s the latency?” question. Decompose: processing + network + queueing. Cite the numbers. Don’t wave.
The “can you hit this SLA?” follow-up. Build the budget from the parts. Subtract each cost; what’s left for processing? Is it feasible?
The “p50 vs p99” question. Queueing time is non-linear in utilisation. p50 might be 50ms; p99 at 95% utilisation could be 200ms with the same processing time.
The “how do you bring it down?” follow-up. Indexes, parallelisation, caching, edge placement. Order by effort vs win.
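
The “can you hit this SLA?” subtraction can be sketched as a one-line budget check. The function name and the cost labels are hypothetical; the figures reuse the 100ms SLA and the three 50ms remote calls from earlier in the article.

```python
def processing_budget_ms(sla_ms: float, fixed_costs_ms: dict) -> float:
    """Return what is left of the SLA after subtracting the fixed costs.

    A negative result means the design cannot meet the SLA as drawn.
    """
    return sla_ms - sum(fixed_costs_ms.values())


# Three 50 ms remote calls made one after another.
sequential_design = {
    "remote call 1": 50,
    "remote call 2": 50,
    "remote call 3": 50,
}

# The same three calls issued in parallel cost one call's worth of time.
parallel_design = {"three remote calls in parallel": 50}

if __name__ == "__main__":
    for name, costs in (("sequential", sequential_design),
                        ("parallel", parallel_design)):
        left = processing_budget_ms(100, costs)
        verdict = "feasible" if left > 0 else "over budget"
        print(f"{name}: {left} ms left for processing ({verdict})")
```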

Specific points to make:

Recite the numbers. Memory ~100ns, SSD ~100μs, same-DC RTT ~0.5ms, cross-region ~70ms, cross-continent ~80-150ms. Reference Jeff Dean’s 2007 talk as the canonical source.
Decompose into processing + network + queueing. Don’t skip queueing — it’s the source of most p99 surprises.
Acknowledge that user-to-API network usually dominates. This is why edge runtimes matter.
Specify the load level. p50 at 50% utilisation; p99 at design capacity. The same processing time gives very different latencies at different loads.
Use round numbers. Back-of-envelope is approximate; “~50ms” beats “47.3ms” every time.

The strongest one-liner: “Latency is processing plus network plus queueing. Memory is 100ns, SSD is 100μs, datacenter RTT is half a ms, cross-continent is 80ms. Add the parts; if it doesn’t fit the SLA, find which part is the dominant cost and move on that one.”

Latency and Throughput — the two-dimensional view; latency-vs-throughput trade-off.
Speeding Up Web Page Loading — applying the latency budget to a full page-load path.
Caching at Different Layers — where to spend cache money to reduce processing time.
The Evolution of HTTP — 1.1, 2, 3 — handshake-time wins from HTTP/2 multiplexing and HTTP/3 0-RTT.
The API-Design Walk-through — the seven-step recipe; latency estimation is step seven.