Latency and Throughput — API Design

Summary

Performance has two dimensions, and they are not the same. Latency is how long a single operation takes, measured from request to response. Throughput is how many operations the system can complete per unit time. A system can be fast on one and slow on the other. An API that returns in 20 ms but only handles 50 requests per second is a low-latency, low-throughput system. An API that handles 100,000 requests per second but each one takes 800 ms is a high-throughput, high-latency system. Most production APIs need both.

The trade between them is real and physical. Adding batching, buffering, or pipelining typically raises throughput at the cost of latency — the batch waits to fill. Adding parallelism, prefetching, or speculative execution typically lowers latency at the cost of throughput — the speculation wastes capacity on requests that did not need it. Compression trades CPU for bandwidth, which moves both dimensions. Caching is one of the rare interventions that improves both — on a hit. On a miss it does neither.
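The batching half of this trade can be put into numbers. The sketch below assumes evenly spaced arrivals and a batch that is dispatched only when it is full, with no flush timeout; `batch_wait_ms` is an illustrative helper, not part of any library.

```python
def batch_wait_ms(arrival_rate_per_s, batch_size):
    """Return (average, worst_case) extra wait in ms caused by waiting for a batch to fill.

    Assumes evenly spaced arrivals and dispatch-on-full with no timeout.
    """
    gap_ms = 1000.0 / arrival_rate_per_s      # time between consecutive arrivals
    worst = (batch_size - 1) * gap_ms         # the first request waits for all the others
    average = worst / 2                       # waits are spread evenly between 0 and worst
    return average, worst


if __name__ == "__main__":
    avg, worst = batch_wait_ms(arrival_rate_per_s=1000, batch_size=50)
    print(f"average added wait {avg} ms, worst case {worst} ms")
```

At 1,000 requests per second, a batch of 50 adds about 24.5 ms on average and 49 ms in the worst case. Real batchers usually add a flush timeout so that low traffic does not make the wait unbounded.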

The numbers you carry in your head matter. Memory access is ~100 ns. SSD read is ~100 μs. A datacenter round-trip is ~0.5 ms. Cross-continent round-trip is ~80 ms. Human perception of “instant” tops out at ~100 ms. Interactive 60 fps gives you ~16.7 ms per frame. An API designer who knows these reference points to within an order of magnitude can sanity-check any latency budget on a whiteboard.

Why it matters

Three reasons every API design has to confront both dimensions.

Users perceive latency, not throughput. A user who clicks a button and waits 2 seconds for the response perceives the API as slow regardless of whether the backend is handling 10 or 10 million requests per second elsewhere. Latency is the user-facing variable; throughput is the cost-of-goods variable. Optimising for one without the other is a common product mistake.
The two dimensions trade against each other. Designs that improve throughput (batching, async pipelining, queueing) frequently worsen the latency of any individual request. Designs that improve latency (parallel fan-out, speculative reads, aggressive caching) frequently worsen throughput by wasting work. The discipline is to name which dimension matters more for this API and design accordingly.
Tail latency is the truth; p50 lies. A service whose median request takes 30 ms can still have a p99 of 800 ms, and if your API call fans out to ten downstream services and waits for all of them, the slowest of the ten defines your latency. The p50 number is a marketing number; the p99 and p99.9 numbers are the engineering numbers. Jeff Dean and Luiz André Barroso’s “The Tail at Scale” paper made this canon.

The corollary is that you cannot design an API contract well without a latency budget. “How fast does this need to be?” is a clarifying question, not a stretch goal.
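The fan-out claim in the third reason can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the downstream calls are independent and that the parent waits for all of them; both function names are illustrative.

```python
def p_any_slower(fan_out, percentile):
    """Probability that at least one of fan_out independent calls
    is slower than the given per-call percentile (for example 99)."""
    return 1 - (percentile / 100) ** fan_out


def child_percentile_for_parent(fan_out, parent_percentile):
    """Per-call percentile that corresponds to a given percentile of the parent,
    when the parent's latency is the maximum of fan_out independent calls."""
    return 100 * (parent_percentile / 100) ** (1 / fan_out)


if __name__ == "__main__":
    print(f"{p_any_slower(10, 99):.3f}")                 # share of parents hit by a child's p99
    print(f"{child_percentile_for_parent(10, 50):.1f}")  # child percentile behind the parent's p50
    print(f"{child_percentile_for_parent(10, 99):.2f}")  # child percentile behind the parent's p99
```

With a fan-out of 10, a little under one parent request in ten waits on a child's p99 or worse, and the parent's own p99 is set by roughly the child's p99.9.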

How it works

Both dimensions decompose into physical components you can budget against.

Latency budget

Latency is additive along the call path. A typical browser-to-API call breaks down into:

Component	Typical contribution
DNS resolution (first call)	10–100 ms
TCP handshake	1× RTT
TLS handshake	1–2× RTT
HTTP request transit	0.5× RTT
Server processing	Variable
Downstream fan-out	Variable
HTTP response transit	0.5× RTT
Client deserialisation	1–10 ms

For an in-region API call (RTT ~5 ms), the network alone costs ~20 ms on a cold connection once DNS has resolved: three RTTs of handshakes (one TCP, up to two TLS) plus one RTT for the request and response. A first-call DNS lookup adds another 10–100 ms on top. For a cross-continent call (RTT ~80 ms), the same four round-trips cost ~320 ms, of which ~240 ms is handshake time spent before the request is even sent. This is why connection reuse (HTTP keep-alive, HTTP/2 multiplexing) matters so much, and why HTTP/3 + QUIC bundle the TLS handshake into the transport (0-RTT or 1-RTT establishment).
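The network part of the budget can be sketched as a small calculator built from the table above. It assumes one RTT for the TCP handshake, a configurable one or two RTTs for TLS, and one RTT for request plus response transit; `network_ms` and its parameters are illustrative names, and server time is deliberately left out.

```python
def network_ms(rtt_ms, tls_rtts=2, dns_ms=0.0, warm=False):
    """Estimate network-only latency for one HTTPS call.

    warm=True models a reused connection: no TCP or TLS handshake is paid.
    """
    handshake_rtts = 0 if warm else 1 + tls_rtts   # TCP (1 RTT) + TLS (1 or 2 RTTs)
    transit_rtts = 1                               # 0.5 RTT request + 0.5 RTT response
    return dns_ms + (handshake_rtts + transit_rtts) * rtt_ms


if __name__ == "__main__":
    print(network_ms(5))              # in-region, cold connection
    print(network_ms(80))             # cross-continent, cold connection
    print(network_ms(80, warm=True))  # cross-continent, reused connection
```

The gap between the cold and warm cross-continent figures (320 ms against 80 ms under these assumptions) is the saving that connection reuse buys.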

The order-of-magnitude numbers an API designer should memorise:

L1 cache: ~1 ns
L2 cache: ~4 ns
Main memory: ~100 ns
SSD random read: ~100 μs
Spinning disk seek: ~10 ms
Same-datacenter RTT: ~0.5 ms
Same-region RTT: ~5 ms
Cross-continent RTT: ~80 ms
Cross-Pacific RTT: ~150 ms

These numbers move slowly — they have been within a factor of 2 of these values for over a decade — so memorising them once pays for years.

Throughput budget

Throughput is bounded by whichever resource saturates first. The candidates are:

CPU — serialisation, hashing, compression, JSON parsing.
Memory bandwidth — copying buffers, GC pressure.
Network bandwidth — bytes per second between client and server, or between server and downstream.
Disk I/O — IOPS or sequential bandwidth.
Concurrent connections — the OS, the load balancer, or the connection pool runs out.
Lock contention — a critical section serialises requests that would otherwise parallelise.

A single instance of a modern HTTP server (Go, Rust, async Python) can typically handle thousands to tens of thousands of requests per second on commodity hardware. Past that, throughput scales with horizontal replication, which itself runs into coordination costs (state sharing, cache invalidation, queue depth).
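The "first resource to saturate" rule can be sketched as a minimum over per-resource ceilings. Every capacity and per-request cost below is a made-up illustrative figure, not a measurement, and `throughput_ceiling` is a hypothetical helper.

```python
def throughput_ceiling(capacity, cost_per_request):
    """Return (requests_per_second, limiting_resource).

    capacity:         units of each resource available per second
    cost_per_request: units of each resource one request consumes
    """
    ceilings = {
        name: capacity[name] / cost
        for name, cost in cost_per_request.items()
        if cost > 0
    }
    limiting = min(ceilings, key=ceilings.get)   # the resource that saturates first
    return ceilings[limiting], limiting


# Assumed instance: 8 cores (8000 CPU-ms per second), a 1 Gbps NIC, 50k disk IOPS.
capacity = {"cpu_ms": 8_000, "network_bytes": 125_000_000, "disk_iops": 50_000}
# Assumed request: 0.5 ms of CPU, a 20 kB response, 2 disk operations.
cost = {"cpu_ms": 0.5, "network_bytes": 20_000, "disk_iops": 2}

if __name__ == "__main__":
    print(throughput_ceiling(capacity, cost))
```

With these figures the network saturates first, at 6,250 requests per second, even though the CPU could serve 16,000. Shrinking the payload would therefore raise the ceiling, while adding cores would not.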

Human perception ceilings

The numbers worth designing against:

~100 ms — the threshold below which a response feels “instant” to a user. Below this, the user does not notice latency.
~16.7 ms — one frame at 60 fps. Below this, an interactive UI stays smooth.
~1 s — the threshold above which the user’s flow of thought breaks; they start to wonder if it’s working.
~10 s — the threshold above which the user gives up and switches tasks.

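The 100 ms, 1 s, and 10 s limits come from Jakob Nielsen’s 1993 book Usability Engineering, which built on earlier response-time research, and have been restated countless times since; the 16.7 ms figure is simply 1/60 of a second. They have not changed because human perception has not changed.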

The Bandwidth-Delay Product

The amount of data “in flight” on a connection at any moment is bandwidth times round-trip time. A 1 Gbps link with a 100 ms RTT can hold 1 Gbps × 0.1 s = 12.5 MB of unacknowledged data in transit at once. If your TCP window is smaller than the BDP, you leave bandwidth on the floor. This is why long-haul, high-bandwidth links need tuned TCP window sizes — and why a single TCP connection over a satellite link rarely fills the link.
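The BDP arithmetic, and the throughput cap that an undersized window imposes, can be sketched as follows. The function names are illustrative; the 65,535-byte figure is the largest TCP window available without window scaling.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that can be in flight on the link at once."""
    return bandwidth_bps * rtt_s / 8          # bits in flight, converted to bytes


def window_limited_bps(window_bytes, rtt_s):
    """Best-case throughput of one connection that may have at most
    window_bytes unacknowledged per round-trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    print(bdp_bytes(1_000_000_000, 0.1))      # 1 Gbps link, 100 ms RTT
    print(window_limited_bps(65_535, 0.1))    # unscaled TCP window over the same RTT
```

Under these assumptions a single connection with an unscaled window tops out at roughly 5 Mbit/s on a link that could carry 1 Gbit/s, which is the "bandwidth on the floor" in concrete terms.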

Tail latency

If the median request takes 30 ms and the p99 takes 300 ms, what does a fan-out call to 10 downstream services look like? The parent waits for the slowest child, so, assuming independent calls, the chance that at least one of the 10 lands beyond its own p99 is 1 − 0.99^10 ≈ 9.6%. Roughly one parent request in ten therefore takes 300 ms or more: the parent’s p90 is about the child’s p99, the parent’s p99 is about the child’s p99.9, and even the parent’s median sits near the child’s p93. The arithmetic is brutal: fan-out amplifies the tail. The mitigations (hedged requests, request cancellation, tail-tolerant aggregation) are applied by the calling system, but the API contract must accommodate them (idempotency, cancellation semantics, partial-result tolerance).

Variants and trade-offs

The latency-vs-throughput trade shows up in concrete API-design decisions.

Optimise for latency. Smaller payloads, fewer round-trips, parallel fan-out, aggressive caching, dedicated read replicas, geo-distributed edges. Suitable for interactive APIs where the user is waiting (search, autocomplete, page-load APIs, real-time dashboards). Cost: more infrastructure per request, more cache invalidation complexity, potentially redundant work.

Optimise for throughput. Batch endpoints, async queues, bulk operations, request coalescing, server-side aggregation, lower-tier hardware. Suitable for backend pipelines, analytics ingestion, write-heavy logging, ETL APIs. Cost: higher per-request latency, batching delays, harder debugging when a single record fails inside a batch.

Technique	Latency	Throughput	Notes
HTTP keep-alive	Better	Better	Reuses TCP/TLS handshake cost
HTTP/2 multiplexing	Better	Better	Concurrent streams on one connection
HTTP/3 (QUIC)	Better	Roughly same	Removes transport-level head-of-line blocking
Compression (gzip, br)	Worse on small, better on large	Better	CPU cost vs bandwidth saving
Caching (hit)	Better	Better	The one free lunch
Caching (miss)	Worse	Worse	Plus extra cache layer cost
Batching	Worse	Better	Wait-for-batch latency
Pagination	Better per page	Same total	Spreads work over many calls
Streaming / SSE	Better TTFB	Same	First byte arrives early
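The two caching rows combine into one expected-latency figure once a hit ratio is known. The sketch below is a weighted average under assumed numbers; `effective_latency_ms` and all of its inputs are illustrative.

```python
def effective_latency_ms(hit_ratio, hit_ms, origin_ms, lookup_overhead_ms=0.0):
    """Expected latency of a read that goes through a cache.

    A miss pays the failed cache lookup and then the full origin call.
    """
    miss_ms = lookup_overhead_ms + origin_ms
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms


if __name__ == "__main__":
    # Assumed figures: 1 ms cache hit, 50 ms origin call, 1 ms wasted lookup on a miss.
    for ratio in (0.0, 0.5, 0.9, 0.99):
        print(ratio, round(effective_latency_ms(ratio, 1, 50, 1), 2))
```

At a hit ratio of zero the result is worse than calling the origin directly, which is the "Caching (miss)" row; the benefit appears only as the hit ratio climbs.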

The skill is naming, on each API, which side of the trade matters more — and being able to justify it. “This is a user-facing search endpoint, latency under 200 ms is the spec, we will accept lower throughput per instance and scale horizontally” is a senior answer.

When this is asked in interviews

The latency/throughput axis comes up in three predictable moments during an API-design round.

The first is the capacity-estimation step, where you produce a back-of-the-envelope number — X requests per second, Y ms per request, Z GB of data per day. The interviewer is looking for whether you can hold the order-of-magnitude numbers above in your head and apply them without a calculator.

The second is the fan-out discussion, when your design calls four downstream services. The interviewer asks “what is your tail latency?” and the senior signal is naming p99 explicitly, talking about hedged requests, and showing how the contract supports partial results or cancellation.

The third is the caching question, which is really a latency/throughput question in disguise. “Where would you cache?” — at the CDN, at the gateway, at the service, in-process. Each layer trades latency improvement for invalidation complexity. A candidate who answers “cache everywhere” without thinking about staleness is signalling shallow understanding; the strong answer picks one or two layers and justifies them against the read/write ratio and the freshness budget.
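The capacity-estimation step from the first of these moments is simple enough to write down. This is a back-of-the-envelope sketch; `capacity_estimate`, the peak factor, and the per-instance throughput are all assumptions to be replaced with figures for the API in question.

```python
import math

SECONDS_PER_DAY = 86_400


def capacity_estimate(avg_rps, peak_factor, bytes_per_request, per_instance_rps):
    """Turn an average request rate into daily volume and a peak instance count."""
    peak_rps = avg_rps * peak_factor
    requests_per_day = avg_rps * SECONDS_PER_DAY
    return {
        "requests_per_day": requests_per_day,
        "gb_per_day": requests_per_day * bytes_per_request / 1e9,
        "peak_rps": peak_rps,
        "instances_at_peak": math.ceil(peak_rps / per_instance_rps),
    }


if __name__ == "__main__":
    # Assumed: 2,000 rps on average, 3x peak, 5 kB per request, 4,000 rps per instance.
    print(capacity_estimate(2_000, 3, 5_000, 4_000))
```

The instance count here covers throughput only; redundancy and headroom for failover would be added on top.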

The phrase to leave the interviewer with: “I had a latency budget and a throughput budget and I designed for both.” If you cannot recite both numbers for your API, you have not finished the design.

Estimating API Latency — Back-of-Envelope — the operational technique for tracing a request through the stack and budgeting each hop.
Caching at Different Layers — the one intervention that improves both dimensions on hits.
The Narrow Waist of the Internet — the IP substrate whose physics set the latency floors.
The Evolution of HTTP — 1.1, 2, 3 — the protocol changes that reshape the latency budget by removing handshakes and head-of-line blocking.
What Is API Design? — the contract framing that has to respect the performance budget.