Rate Limiting
Token bucket, leaky bucket, fixed window, sliding window. Per-user vs per-IP, the 429 contract, and the burst question.
What it is
Rate limiting is the server-side discipline of capping how many requests a given caller can make in a given time window. The cap exists for several reasons at once: protecting backend capacity from runaway clients, ensuring fair allocation across tenants, preventing abuse from credential-stuffing or scraping bots, and shedding load gracefully when the system is overwhelmed.
The mechanics live at the gateway or middleware layer — the request comes in, the limiter checks “has this caller exceeded their budget?”, and either passes the request through or rejects it with 429 Too Many Requests. The hard problems are not at the reject step; they’re in how the budget is counted (the algorithm), what the caller is keyed by (user, IP, API key, endpoint), and what the response tells the client so the client can back off correctly.
Four algorithms cover the vast majority of production use: token bucket (the most common; permits bounded bursts while enforcing a long-run rate), leaky bucket (similar but enforces a strict outflow), fixed window (simplest; allows boundary-effect spikes), and sliding window (smoother; more state). The token bucket is the default for HTTP APIs; the others have their niches.
The 429 contract is part of the design, not an afterthought. A well-behaved API returns 429 with response headers that tell the client how to back off: Retry-After (how long to wait), X-RateLimit-Limit (the cap), X-RateLimit-Remaining (current budget), and X-RateLimit-Reset (when the budget resets). A client that respects these headers degrades gracefully; a client that doesn’t gets stuck in a retry storm.
When to use it
Rate limiting is mandatory for:
- Any public API. Without limits, a single buggy client or hostile script can exhaust capacity and degrade service for everyone else. This is non-negotiable.
- Multi-tenant APIs. Per-tenant rate limits are how you stop one customer’s bad day from becoming everyone’s bad day. “Noisy neighbour” is the failure mode rate limits prevent.
- Auth endpoints. Login, password reset, and 2FA-code endpoints get aggressive rate limits to defeat credential stuffing and brute force. Often these limits are tighter than the rest of the API by an order of magnitude.
- Anything backed by expensive resources. AI inference endpoints, image processing, full-text search — anything where each request has measurable cost.
Rate limiting is less load-bearing when:
- The API is internal and the callers are all known. A capacity limit at the service-mesh level may suffice; per-caller rate limits add operational complexity.
- The workload is trusted bulk traffic. Bulk import endpoints expect bursts; instead of rate-limiting, queue the work and process asynchronously.
How it works
Token bucket — the default algorithm
The token bucket is the most common choice for HTTP APIs because it allows controlled bursts while enforcing a long-run rate.
The model: each caller has a virtual bucket with capacity B tokens. Tokens are added at rate R per second (up to the cap). Each request consumes one token; if no token is available, the request is rejected.
```text
Refill rate: R tokens/sec
        │
        ▼
   ┌─────────┐
   │ ▓▓▓▓▓░░ │   capacity B = 100, current = 60
   └─────────┘
        │
        ▼
 1 token per request
 (or N for cost-weighted requests)
```

For example, R = 100/s, B = 1000: the caller can burst up to 1,000 requests immediately if the bucket is full, but can sustain only 100/s on average. After a burst, the bucket drains; refill takes 10 seconds.
Many major HTTP APIs, GitHub, Stripe, Slack, and AWS among them, are commonly described as using this algorithm or a close variant. The two knobs, sustained rate R and burst capacity B, let you tune for the workload.
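The model above can be sketched in a few lines of Python. This is a minimal, single-process illustration rather than a production limiter: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are all hypothetical names chosen for the example, and the refill is computed lazily on each call instead of by a background timer.

```python
import time


class TokenBucket:
    """Token bucket with capacity B and refill rate R tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # R: tokens added per second
        self.capacity = float(capacity)  # B: maximum tokens the bucket can hold
        self.clock = clock               # injectable so the sketch is easy to test
        self.tokens = float(capacity)    # start full, so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; return True when the request passes."""
        now = self.clock()
        # Lazy refill: credit the tokens earned since the last call, up to the cap.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with the numbers from the text: R = 100/s, B = 1000.
fake_now = [0.0]
bucket = TokenBucket(rate=100, capacity=1000, clock=lambda: fake_now[0])
burst = sum(bucket.allow() for _ in range(1200))  # 1000 pass, 200 are rejected
fake_now[0] += 10.0                               # ten seconds later the bucket is full again
```

Passing `cost` greater than 1 gives the cost-weighted behaviour mentioned in the diagram. A real deployment would also need per-caller buckets and shared state across replicas.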
Leaky bucket — strict outflow
The leaky bucket is the dual of the token bucket: instead of tokens accumulating up to a cap, requests accumulate up to a cap, and they “leak out” (i.e., are processed) at a fixed rate.
```text
   ┌─────────┐
   │░░░░░░░░░│  ← requests arrive
   │░░░░░░░░░│
   │░░░░░░░░░│
   └────┬────┘
        │  outflow rate R (fixed)
        ▼
 processed by backend
```

The behavioural difference: a leaky bucket enforces a strict outflow rate; a token bucket allows bursts. If a workload has spikes and that’s tolerable, use token bucket. If downstream capacity is fixed (a database that can handle exactly 1,000 qps and not one more), use leaky bucket.
Leaky bucket is more common at the network layer (router buffers) than in HTTP rate limiters.
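To make the contrast with the token bucket concrete, here is a rough sketch of the queue-style leaky bucket described above. The names `LeakyBucket`, `offer`, and `leak` are hypothetical, and the sketch assumes some worker calls `leak()` on a regular tick to hand requests to the backend. Credit does not build up while the queue is empty, which is what keeps the outflow strict after an idle period.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at a fixed rate."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)        # R: requests released per second
        self.capacity = int(capacity)  # maximum number of queued requests
        self.clock = clock
        self.queue = deque()
        self.last_leak = clock()

    def offer(self, request):
        """Queue a request; return False (reject) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # No outflow credit accrues while the bucket sits empty.
            self.last_leak = self.clock()
        self.queue.append(request)
        return True

    def leak(self):
        """Return the requests allowed to flow out since the last release."""
        now = self.clock()
        quota = int((now - self.last_leak) * self.rate)
        if quota == 0:
            return []
        # Advance by the time actually spent, keeping any fractional remainder.
        self.last_leak += quota / self.rate
        released = []
        while quota > 0 and self.queue:
            released.append(self.queue.popleft())
            quota -= 1
        return released
```

A caller that bursts simply fills the queue; the backend still sees at most R requests per second, and anything beyond the queue capacity is rejected at `offer`.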
Fixed window — the simplest
Count requests in discrete time windows: “100 requests per minute” means resetting the counter at every minute boundary.
```text
  minute 0:00   minute 0:01   minute 0:02
┌─────────────┬─────────────┬─────────────┐
│  count = 0  │  count = 0  │  count = 0  │
└─────────────┴─────────────┴─────────────┘
              reset         reset
```

Trivial to implement (one counter per caller, reset on a cron). The problem: boundary effects. A caller can send 100 requests in the last second of minute 0 and 100 more in the first second of minute 1, which is 200 requests in a two-second span, despite the “100/minute” limit.
Useful for coarse limits where the boundary effect doesn’t matter (daily quotas), or for situations where simplicity outweighs precision.
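The boundary effect is easy to reproduce. The sketch below is an illustrative in-memory fixed-window counter (the class name `FixedWindowLimiter` and its parameters are made up for this example); instead of a cron reset it derives the window index from the clock, which has the same effect.

```python
import time


class FixedWindowLimiter:
    """One counter per caller, reset whenever a new clock-aligned window starts."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # caller -> (window index, count in that window)

    def allow(self, caller):
        index = int(self.clock() // self.window)
        stored_index, count = self.counts.get(caller, (index, 0))
        if stored_index != index:
            count = 0  # a new window started: the counter resets
        if count >= self.limit:
            self.counts[caller] = (index, count)
            return False
        self.counts[caller] = (index, count + 1)
        return True


# Boundary effect: 100 requests just before the minute mark, 100 just after.
fake_now = [59.5]
limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: fake_now[0])
before = sum(limiter.allow("alice") for _ in range(150))  # 100 allowed
fake_now[0] = 60.5
after = sum(limiter.allow("alice") for _ in range(150))   # 100 more allowed one second later
```

Both bursts pass because each falls in a different window, even though they are only one second apart.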
Sliding window — the smoothest
Maintain a rolling count over the last N seconds, so the limit applies to any N-second span rather than to clock-aligned windows.
A common approximation: keep the count for the current window and the prior window, and compute a weighted blend based on elapsed time.
```text
weighted_count = prior_count * (1 - elapsed_fraction) + current_count
```

For example, 30% of the way into the current minute (elapsed_fraction = 0.3), the weighted count is 0.7 * prior_minute_count + current_minute_count.
This largely removes the boundary effect of fixed windows at modest extra state (two counters instead of one). Cloudflare and Stripe both use approximations of this for their per-second limits.
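A possible in-memory version of the two-counter approximation is sketched below. `SlidingWindowCounter` and its fields are hypothetical names, and the sketch assumes the prior window's requests were spread evenly, which is the assumption the weighted blend itself makes.

```python
import time


class SlidingWindowCounter:
    """Approximate rolling count using the current and prior window counters."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.state = {}  # caller -> (window index, current count, prior count)

    def allow(self, caller):
        now = self.clock()
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        stored_index, current, prior = self.state.get(caller, (index, 0, 0))
        if index == stored_index + 1:
            prior, current = current, 0  # rolled over into the next window
        elif index != stored_index:
            prior, current = 0, 0        # idle for more than a full window
        weighted_count = prior * (1 - elapsed_fraction) + current
        allowed = weighted_count + 1 <= self.limit
        if allowed:
            current += 1
        self.state[caller] = (index, current, prior)
        return allowed


# 100 requests fill minute 0; a quarter of the way into minute 1 only 25 more fit.
fake_now = [30.0]
limiter = SlidingWindowCounter(limit=100, window_seconds=60, clock=lambda: fake_now[0])
first = sum(limiter.allow("alice") for _ in range(150))  # 100 allowed
fake_now[0] = 75.0                                       # elapsed_fraction = 0.25
second = sum(limiter.allow("alice") for _ in range(50))  # 25 allowed
```

At 15 seconds into the new minute the prior window still counts for 75 requests, so the caller gets 25, where a fixed window would have handed out a fresh 100.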
Keying — who counts as a caller?
The caller is the thing the limit applies to. Choices, roughly in order of trustworthiness:
| Key | Use when | Notes and risks |
|---|---|---|
| API key | Authenticated APIs | Best signal of identity |
| User ID | After authentication, when the same user has multiple keys | Requires resolving the key to a user |
| Account / organisation | Multi-tenant SaaS where a single account spans many users | Requires hierarchical resolution |
| IP address | Anonymous endpoints; login pre-auth | Spoofable; NAT collisions; mobile carriers share IPs |
| (IP + User-Agent + cookie) | Fingerprinting anonymous abuse | Fragile; many false positives |
Best practice: layer multiple limits. Per-API-key for normal traffic; per-IP as a fallback for unauthenticated endpoints; per-account as an aggregate cap so a single account with many API keys can’t escape the limit by sharding.
The IP key is the weakest because of NAT. A corporate office or a mobile carrier may have thousands of users behind one IP; limiting “100 requests per minute per IP” effectively rate-limits the entire office. Avoid for authenticated endpoints.
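The layering advice can be illustrated with a small sketch. Everything here is hypothetical: the `Caller` type, the key prefixes, and the three limits are placeholder values, and the counters ignore time entirely (a real limiter would pair each key with one of the algorithms above). The point is only that a request must pass every applicable key, and that nothing is counted unless all of them pass.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder budgets for the sketch, not recommendations.
IP_LIMIT = 60
KEY_LIMIT = 600
ACCOUNT_LIMIT = 1000


@dataclass
class Caller:
    ip: str
    api_key: Optional[str] = None
    account_id: Optional[str] = None


def limit_keys(caller):
    """Return (key, limit) pairs; the request must pass every one of them."""
    if caller.api_key is None:
        # Unauthenticated traffic: the IP is the only signal available.
        return [(f"ip:{caller.ip}", IP_LIMIT)]
    keys = [(f"key:{caller.api_key}", KEY_LIMIT)]
    if caller.account_id is not None:
        # Aggregate cap so many keys on one account cannot shard around the limit.
        keys.append((f"acct:{caller.account_id}", ACCOUNT_LIMIT))
    return keys


def allow(caller, counts):
    """Check all keys first, then count the request against each of them."""
    keys = limit_keys(caller)
    if any(counts.get(key, 0) >= limit for key, limit in keys):
        return False
    for key, _ in keys:
        counts[key] = counts.get(key, 0) + 1
    return True
```

With these numbers, two API keys on the same account can each spend up to 600 requests, but together they stop at the account cap of 1,000.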
The 429 contract
When a request is rejected, the server returns 429 Too Many Requests with response headers that tell the client how to back off:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1717070400
Content-Type: application/json

{
  "error": {
    "code": "rate_limited",
    "message": "Too many requests. Retry after 30 seconds.",
    "retry_after_seconds": 30
  }
}
```

The four rate-limit headers (only Retry-After is in the HTTP RFC; the X-RateLimit-* family is a de-facto convention):
- Retry-After: seconds to wait before retrying, or an HTTP-date. The RFC-blessed header.
- X-RateLimit-Limit: the cap for this window.
- X-RateLimit-Remaining: how many requests are left (often 0 in a 429, but informative on non-429 responses too).
- X-RateLimit-Reset: Unix timestamp when the bucket refills (or when the window resets).
The IETF is standardising these as RateLimit (without the X-) per draft-ietf-httpapi-ratelimit-headers. Most production APIs still use the X- prefix.
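On the server side, these headers can be derived from whatever state the limiter already tracks. The helper below is a sketch under assumptions: the function name `rate_limit_headers` and its parameters are invented for illustration, and it assumes the limiter can report the cap, the remaining budget, and the reset time as a Unix timestamp.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now_epoch, rejected=False):
    """Build the X-RateLimit-* headers, plus Retry-After on a rejected request."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if rejected:
        # Round up so the client never retries a fraction of a second too early.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now_epoch)))
    return headers


# Reproduces the header values of the example response above.
headers = rate_limit_headers(
    limit=5000,
    remaining=0,
    reset_epoch=1717070400,
    now_epoch=1717070370,
    rejected=True,
)
```

Sending the X-RateLimit-* headers on successful responses as well lets clients slow down before they hit the limit.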
Client behaviour — respecting the headers
A well-written client respects 429 with backoff:
```python
import random
import time

import requests


def call_with_rate_limit(url, headers=None, max_attempts=5):
    for attempt in range(max_attempts):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp

        # Server told us how long to wait
        retry_after = resp.headers.get("Retry-After")
        try:
            wait = float(retry_after)
        except (TypeError, ValueError):
            # Header missing or an HTTP-date:
            # fall back to exponential backoff with jitter
            wait = (2 ** attempt) + random.random()

        # Cap the wait so we don't hang forever
        wait = min(wait, 60)
        time.sleep(wait)

    raise RuntimeError(f"rate-limited after {max_attempts} attempts")
```

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"strconv"
	"time"
)

func callWithRateLimit(url string, maxAttempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		resp.Body.Close()

		// Use Retry-After when it is a number of seconds
		wait := time.Duration(0)
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		// Otherwise fall back to exponential backoff with jitter
		if wait == 0 {
			base := time.Duration(1<<uint(attempt)) * time.Second
			wait = base + time.Duration(rand.Intn(1000))*time.Millisecond
		}
		// Cap the wait
		if wait > 60*time.Second {
			wait = 60 * time.Second
		}
		time.Sleep(wait)
	}
	return nil, fmt.Errorf("rate-limited after %d attempts", maxAttempts)
}
```

```javascript
async function callWithRateLimit(url, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const resp = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    if (resp.status !== 429) {
      if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
      return resp;
    }

    const retryAfter = resp.headers.get("retry-after");
    let waitMs = Number(retryAfter) * 1000;
    if (!retryAfter || !Number.isFinite(waitMs)) {
      // Header missing or an HTTP-date: exponential backoff with jitter
      waitMs = (2 ** attempt) * 1000 + Math.random() * 1000;
    }
    waitMs = Math.min(waitMs, 60_000);
    await new Promise((r) => setTimeout(r, waitMs));
  }
  throw new Error(`rate-limited after ${maxAttempts} attempts`);
}
```

Two non-obvious points in the client code:
- Prefer Retry-After over locally-computed backoff. The server knows exactly when the bucket refills. Trust it.
- Cap the wait. A naive client following a malicious server’s Retry-After: 86400 would hang for a day. Cap at a reasonable upper bound (60s is typical).
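Because Retry-After may carry either a number of seconds or an HTTP-date, a client that wants to honour both forms needs a small parsing step. The helper below is one way to sketch it using only the Python standard library; the name `retry_after_seconds` and the `cap` default are assumptions made for this example, and the function returns `None` when the value is unusable so the caller can fall back to its own backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, cap=60.0):
    """Return a capped wait in seconds for a Retry-After value, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        # delta-seconds form, e.g. "30"
        seconds = float(value)
    else:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return min(max(seconds, 0.0), cap)
```

A date in the past yields a wait of zero, and an oversized value such as 86400 is clamped to the cap.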
Implementation — Redis is the workhorse
Most production rate limiters use Redis. The token-bucket implementation is a single Lua script that atomically reads the current bucket state, computes the refill, decrements if there’s a token, and stores the new state. This avoids the read-modify-write race that plagues naive implementations.
For very high throughput, sharded Redis or per-region rate limiters reduce hot keys. Cloudflare’s rate limiter uses a hierarchical local-then-global approach: each edge maintains a local counter that gets reconciled across regions every few seconds.
Variants
| Variant | Mechanism | When it fits |
|---|---|---|
| Token bucket | Bucket of capacity B, refilled at rate R | HTTP APIs; the default |
| Leaky bucket | Queue with strict outflow rate | When downstream has hard capacity (databases, queues) |
| Fixed window | Counter resets at window boundary | Daily/monthly quotas; simple use cases |
| Sliding window log | List of timestamps, count those in last N seconds | Precise but memory-heavy |
| Sliding window counter | Two counters with weighted blend | Best precision-to-cost ratio for sub-minute windows |
| Concurrent connection limit | Cap simultaneous open connections per caller | WebSocket / long-poll APIs |
| Cost-weighted | Different endpoints cost different numbers of tokens | Mixed-cost APIs (cheap reads, expensive writes) |
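Of the variants in the table, the sliding window log is the only one not described earlier, so a short sketch may help. The class name `SlidingWindowLog` is hypothetical, and the code keeps one timestamp per accepted request, which is where the "precise but memory-heavy" trade-off comes from.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Exact rolling limit: remember the timestamp of every accepted request."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = defaultdict(deque)  # caller -> timestamps of accepted requests

    def allow(self, caller):
        now = self.clock()
        timestamps = self.log[caller]
        # Drop entries that have aged out of the last `window` seconds.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

Memory grows with the limit itself: a cap of 5,000 requests per window means up to 5,000 stored timestamps per caller, against two integers for the sliding window counter.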
Trade-offs
What rate limiting gives you:
- Protection against runaway clients. Buggy clients in a retry loop don’t take down the service.
- Fair allocation across tenants. No single noisy neighbour starves the others.
- Abuse mitigation. Credential stuffing, scraping, and dictionary attacks become economically unattractive.
- Predictable backend load. With per-key caps and a known key population, you can reason about peak load.
What rate limiting costs you:
- Operational complexity. Choosing keys, algorithms, limits per endpoint, and tuning over time is real work.
- Latency on every request. The limiter is in the critical path. A poorly-implemented limiter adds 5-10ms per request.
- State management. Per-caller state must live somewhere (Redis, in-memory with eventual consistency, or a dedicated rate-limit service).
- Calibration risk. Limits too tight break legitimate customers; too loose let abuse through. Tuning is ongoing.
- The “fair share” question is genuinely hard. Per-account, per-key, per-endpoint, per-IP — every choice has edge cases.
Common pitfalls
- Treating 429 as a hard failure. Clients that don’t retry on 429 break under any limit; clients that retry without backoff cause retry storms. Both extremes are wrong.
- Per-IP limits on authenticated endpoints. NAT collapses thousands of users into one IP. Authenticate first, then key by user.
- No Retry-After header. Forces clients to guess. Always include it.
- One global limit for everything. Different endpoints have different costs and risk profiles. Login should be 10x tighter than read endpoints.
- Counting in memory without coordination. A multi-replica service with local counters lets a caller multiply their effective limit by the number of replicas they hit. Use a shared store.
- Forgetting per-endpoint limits. A caller within their global limit can still hammer a single expensive endpoint. Endpoint-level caps are necessary for the heavy ones.
- No telemetry. Without “what % of requests are 429’d per account?” dashboards, you find out about a bad limit when a customer complains.
- Soft-limit mode missing. When tuning a new limit, ship it in “warn but don’t reject” mode first, observe, then enforce.
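The last two pitfalls, missing telemetry and missing soft-limit mode, can be addressed by the same small decision step. The sketch below is illustrative only: `decide`, `would_reject`, and the logger name are hypothetical, and `within_limit` stands for the verdict of whichever limiter algorithm is in use.

```python
import logging
from collections import Counter

logger = logging.getLogger("ratelimit")
would_reject = Counter()  # caller -> requests the limit rejected or would have rejected


def decide(caller, within_limit, enforce):
    """Return True if the request should proceed."""
    if within_limit:
        return True
    # Count every over-limit request, enforced or not, so dashboards can show it.
    would_reject[caller] += 1
    if enforce:
        return False
    # Soft mode: warn but do not reject.
    logger.warning("soft limit exceeded for %s; allowing (not enforced yet)", caller)
    return True
```

Running with `enforce=False` for a while shows which callers a new limit would hit before any of them receives a 429.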
Related building blocks
- The Circuit Breaker Pattern — the upstream half of the resilience pair; rate limiting protects you, circuit breakers protect your callers from your failures.
- Managing Retries — the client-side contract that pairs with the 429 response.
- API Monitoring — observing the 429 rate per caller and per endpoint is how you tune limits over time.
- Caching at Different Layers — caching reduces the request count to the origin, which reduces pressure on the rate limiter.
- HTTP — The Foundational Protocol for APIs — the Retry-After header and 429 status code definitions.