Rate Limiting

Token bucket, leaky bucket, fixed window, sliding window. Per-user vs per-IP, the 429 contract, and the burst question.

Building Block Intermediate
12 min read
rate-limiting throttling http capacity reliability

What it is

Rate limiting is the server-side discipline of capping how many requests a given caller can make in a given time window. The cap exists for several reasons at once: protecting backend capacity from runaway clients, ensuring fair allocation across tenants, preventing abuse from credential-stuffing or scraping bots, and shedding load gracefully when the system is overwhelmed.

The mechanics live at the gateway or middleware layer — the request comes in, the limiter checks “has this caller exceeded their budget?”, and either passes the request through or rejects it with 429 Too Many Requests. The hard problems are not at the reject step; they’re in how the budget is counted (the algorithm), what the caller is keyed by (user, IP, API key, endpoint), and what the response tells the client so the client can back off correctly.

Four algorithms cover the large majority of production use: token bucket (the most common; allows controlled bursts), leaky bucket (similar but enforces a strict outflow), fixed window (simplest; allows boundary-effect spikes), and sliding window (smoother; more state). The token bucket is the default for HTTP APIs; the others have their niches.

The 429 contract is part of the design, not an afterthought. A well-behaved API returns 429 with response headers that tell the client where it stands: Retry-After (how long to wait), X-RateLimit-Limit (the cap), X-RateLimit-Remaining (current budget), and X-RateLimit-Reset (when the budget refills). A client that respects these headers degrades gracefully; a client that ignores them gets stuck in a retry storm.

When to use it

Rate limiting is mandatory for:

  • Any public API. Without limits, a single buggy client or hostile script can exhaust capacity and degrade service for everyone else. This is non-negotiable.
  • Multi-tenant APIs. Per-tenant rate limits are how you stop one customer’s bad day from becoming everyone’s bad day. “Noisy neighbour” is the failure mode rate limits prevent.
  • Auth endpoints. Login, password reset, and 2FA-code endpoints get aggressive rate limits to defeat credential stuffing and brute force. Often these limits are tighter than the rest of the API by an order of magnitude.
  • Anything backed by expensive resources. AI inference endpoints, image processing, full-text search — anything where each request has measurable cost.

Rate limiting is less load-bearing when:

  • The API is internal and the callers are all known. A capacity limit at the service-mesh level may suffice; per-caller rate limits add operational complexity.
  • The workload is trusted bulk traffic. Bulk import endpoints expect bursts; instead of rate-limiting, queue the work and process asynchronously.

How it works

Token bucket: the default algorithm

The token bucket is the most common choice for HTTP APIs because it allows controlled bursts while enforcing a long-run rate.

The model: each caller has a virtual bucket with capacity B tokens. Tokens are added at rate R per second (up to the cap). Each request consumes one token; if no token is available, the request is rejected.

Refill rate: R tokens/sec
┌─────────┐
│ ▓▓▓▓▓░░ │ capacity B = 100, current = 60
└─────────┘
1 token per request (or N for cost-weighted requests)

For example, R = 100/s, B = 1000: the caller can burst up to 1,000 requests immediately if the bucket is full, but can sustain only 100/s on average. After a burst, the bucket drains; refill takes 10 seconds.

Stripe and AWS API Gateway both document token-bucket limiters, and many other major HTTP APIs use the same algorithm or a close variant. The two knobs, sustained rate R and burst capacity B, let you tune for the workload.
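The token-bucket model fits in a few lines. What follows is a minimal in-memory sketch, not a production limiter: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are illustrative choices, and it tracks one caller in one process with no locking.

```python
import time


class TokenBucket:
    """Single-caller token bucket: capacity B, refilled at R tokens/sec."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # R: tokens added per second
        self.capacity = capacity    # B: maximum tokens the bucket holds
        self._clock = clock         # injectable so tests can control time
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; return True if allowed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Lazy refill: credit the tokens earned since the last call,
        # never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `TokenBucket(rate=100, capacity=1000)`, a full bucket admits 1,000 back-to-back calls, then roughly 100 more for every second that passes. Refilling lazily on each call avoids a background timer per caller; passing `cost` greater than 1 gives the cost-weighted behaviour.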

Leaky bucket: strict outflow

The leaky bucket is the dual of the token bucket: instead of tokens accumulating up to a cap, requests accumulate up to a cap, and they “leak out” (i.e., are processed) at a fixed rate.

    ┌─────────┐
    │░░░░░░░░░│ ← requests arrive
    │░░░░░░░░░│
    │░░░░░░░░░│
    └────┬────┘
         │ outflow rate R (fixed)
    processed by backend

The behavioural difference: a leaky bucket enforces a strict outflow rate; a token bucket allows bursts. If a workload has spikes and that’s tolerable, use token bucket. If downstream capacity is fixed (a database that can handle exactly 1,000 qps and not one more), use leaky bucket.

Leaky bucket is more common at the network layer (router buffers) than in HTTP rate limiters.
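As a rough illustration of the queue form described above, here is a minimal single-process sketch. The names `LeakyBucket`, `offer`, and `drain` are hypothetical, and the sketch assumes a worker calls `drain()` at least once per release interval; if it is called less often, the releases that have come due are returned together.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded FIFO queue that releases requests at a fixed rate."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # R: requests released per second
        self.capacity = capacity    # maximum number of queued requests
        self._clock = clock
        self._queue = deque()
        self._next_release = clock()

    def offer(self, request):
        """Queue a request; return False (reject) if the bucket is full."""
        if len(self._queue) >= self.capacity:
            return False
        if not self._queue:
            # Idle time must not bank release slots into a later burst.
            self._next_release = max(self._next_release, self._clock())
        self._queue.append(request)
        return True

    def drain(self):
        """Return the requests whose release slot has arrived, oldest first."""
        now = self._clock()
        interval = 1.0 / self.rate
        released = []
        while self._queue and self._next_release <= now:
            released.append(self._queue.popleft())
            self._next_release += interval
        return released
```

The contrast with a token bucket is that nothing here can burst downstream: even when the queue fills instantly, requests leave one release slot at a time.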

Fixed window: the simplest

Count requests in discrete time windows: “100 requests per minute” means resetting the counter at every minute boundary.

      minute 0:00   minute 0:01   minute 0:02
    ┌─────────────┬─────────────┬─────────────┐
    │  count = 0  │  count = 0  │  count = 0  │
    └─────────────┴─────────────┴─────────────┘
                reset         reset

Trivial to implement: one counter per caller, reset (or left to expire) at each window boundary. The problem is boundary effects. A caller can send 100 requests in the last second of minute 0 and 100 more in the first second of minute 1, which is 200 requests in a two-second span despite the “100/minute” limit.

Useful for coarse limits where the boundary effect doesn’t matter (daily quotas), or for situations where simplicity outweighs precision.
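The boundary effect is easy to reproduce. The sketch below is a minimal in-memory fixed-window counter; `FixedWindowLimiter` and its `allow(key, now)` signature are hypothetical names, and the caller passes the current time in seconds explicitly so the behaviour at the boundary can be shown deterministically.

```python
class FixedWindowLimiter:
    """One counter per caller, implicitly reset when the window index changes."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._counts = {}  # key -> (window index, count in that window)

    def allow(self, key, now):
        index = int(now // self.window)  # which window `now` falls in
        last_index, count = self._counts.get(key, (index, 0))
        if last_index != index:
            count = 0                    # new window: start counting again
        allowed = count < self.limit
        if allowed:
            count += 1
        self._counts[key] = (index, count)
        return allowed


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 150 attempts just before the minute boundary, 150 just after it.
before = sum(limiter.allow("caller-1", now=59.5) for _ in range(150))
after = sum(limiter.allow("caller-1", now=60.5) for _ in range(150))
print(before, after)
```

Both bursts are admitted in full (100 each), so 200 requests pass within about one second despite the nominal 100-per-minute limit.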

Sliding window: the smoothest

Maintain a rolling count over the last N seconds, so the limit applies to any N-second span rather than only to spans aligned on fixed boundaries.

A common approximation: keep the count for the current window and the prior window, and compute a weighted blend based on elapsed time.

weighted_count = prior_count * (1 - elapsed_fraction) + current_count

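If the current minute is 30% elapsed (elapsed_fraction = 0.3), the weighted count is 0.7 * prior_minute_count + current_minute_count. For example, with 80 requests in the prior minute and 40 so far in this one, that is 0.7 * 80 + 40 = 96.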

This removes most of the boundary effect of fixed windows at modest extra state (two counters instead of one). It is an approximation, because it assumes the requests in the prior window were evenly spread across it. Cloudflare has described using this approach in its rate limiter.
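The weighted blend can be sketched as follows. This is an illustrative in-memory version, with the hypothetical names `SlidingWindowCounter` and `allow(key, now)`; it applies the `weighted_count` formula given earlier in this section and rejects once that value reaches the limit.

```python
class SlidingWindowCounter:
    """Two counters per caller, blended by how far into the window we are."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._state = {}  # key -> (window index, current count, prior count)

    def allow(self, key, now):
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        last_index, current, prior = self._state.get(key, (index, 0, 0))
        if index == last_index + 1:
            prior, current = current, 0   # rolled into the next window
        elif index != last_index:
            prior, current = 0, 0         # idle for a full window or more
        weighted_count = prior * (1 - elapsed_fraction) + current
        allowed = weighted_count < self.limit
        if allowed:
            current += 1
        self._state[key] = (index, current, prior)
        return allowed
```

With a limit of 100 per 60 seconds, a burst of 100 at t = 59.5 is admitted, but at t = 60.5 the prior window still counts for about 99.2, so only one more request gets through before the limiter rejects. The same pair of bursts passes a fixed-window counter in full.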

Keying: who counts as a caller?

The caller is the thing the limit applies to. Choices, roughly in order of trustworthiness:

| Key | Use when | Risk / notes |
| --- | --- | --- |
| API key | Authenticated APIs | Best signal of identity |
| User ID | After authentication, when the same user has multiple keys | Requires resolving the key to a user |
| Account / organisation | Multi-tenant SaaS where a single account spans many users | Requires hierarchical resolution |
| IP address | Anonymous endpoints; login pre-auth | Spoofable; NAT collisions; mobile carriers share IPs |
| IP + User-Agent + cookie | Fingerprinting anonymous abuse | Fragile; many false positives |

Best practice: layer multiple limits. Per-API-key for normal traffic; per-IP as a fallback for unauthenticated endpoints; per-account as an aggregate cap so a single account with many API keys can’t escape the limit by sharding.

The IP key is a weak signal because of NAT. A corporate office or a mobile carrier may have thousands of users behind one IP; limiting “100 requests per minute per IP” effectively rate-limits the entire office. Avoid IP keys for authenticated endpoints.
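One way to express the layering is a small function that decides which buckets a request must pass. Everything named here is hypothetical: the `Caller` fields, the key prefixes, and the per-minute numbers are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Placeholder limits (requests per minute); tune these for the real workload.
ANON_PER_IP = 30
PER_API_KEY = 600
PER_ACCOUNT = 3000


@dataclass(frozen=True)
class Caller:
    client_ip: str
    api_key: Optional[str] = None
    account_id: Optional[str] = None


def limit_keys(caller: Caller) -> List[Tuple[str, int]]:
    """Return (bucket key, limit) pairs a request must pass, narrowest first."""
    if caller.api_key is None:
        # Unauthenticated: the IP is the only identity available.
        return [(f"ip:{caller.client_ip}", ANON_PER_IP)]
    keys = [(f"key:{caller.api_key}", PER_API_KEY)]
    if caller.account_id is not None:
        # Aggregate cap so many keys on one account cannot shard the limit.
        keys.append((f"acct:{caller.account_id}", PER_ACCOUNT))
    return keys
```

A request is admitted only if every returned bucket has budget. Note that authenticated callers get no IP bucket at all, which keeps an office behind one NAT address from sharing a limit.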

The 429 contract

When a request is rejected, the server returns 429 Too Many Requests with response headers that tell the client how to back off:

    HTTP/1.1 429 Too Many Requests
    Retry-After: 30
    X-RateLimit-Limit: 5000
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1717070400
    Content-Type: application/json

    {
      "error": {
        "code": "rate_limited",
        "message": "Too many requests. Retry after 30 seconds.",
        "retry_after_seconds": 30
      }
    }

The four headers (only Retry-After is defined by the HTTP specification, RFC 9110; the X-RateLimit-* family is a de-facto convention):

  • Retry-After — seconds to wait before retrying, or an HTTP-date. The RFC-blessed header.
  • X-RateLimit-Limit — the cap for this window.
  • X-RateLimit-Remaining — how many requests are left (often 0 in a 429, but informative on non-429 responses too).
  • X-RateLimit-Reset — Unix timestamp when the bucket refills (or when the window resets).

The IETF is standardising these as RateLimit (without the X-) per draft-ietf-httpapi-ratelimit-headers. Most production APIs still use the X- prefix.
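On the server side, the rejection can be assembled from the limit and the reset time. The helper below is a framework-neutral sketch: the function name `too_many_requests` and its parameters are assumptions, and it returns a plain `(status, headers, body)` tuple that a real handler would adapt to its web framework.

```python
import json
import math


def too_many_requests(limit, reset_epoch, now_epoch):
    """Build a 429 response as (status, headers, body).

    `reset_epoch` and `now_epoch` are Unix timestamps in seconds.
    """
    # Round up and wait at least one second, so clients never retry early.
    retry_after = max(1, math.ceil(reset_epoch - now_epoch))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": {
            "code": "rate_limited",
            "message": f"Too many requests. Retry after {retry_after} seconds.",
            "retry_after_seconds": retry_after,
        }
    })
    return 429, headers, body
```

Deriving Retry-After and X-RateLimit-Reset from the same reset time keeps the two from disagreeing, which matters for clients that read one header but not the other.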

Client behaviour: respecting the headers

A well-written client respects 429 with backoff:

Rate-limit-aware HTTP client — Python
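    import random
    import time
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    import requests


    def call_with_rate_limit(url, headers=None, max_attempts=5):
        for attempt in range(max_attempts):
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp

            # Server told us how long to wait (delta-seconds or an HTTP-date)
            retry_after = resp.headers.get("Retry-After")
            wait = None
            if retry_after:
                try:
                    wait = float(retry_after)
                except ValueError:
                    try:
                        retry_at = parsedate_to_datetime(retry_after)
                        wait = (retry_at - datetime.now(timezone.utc)).total_seconds()
                    except (TypeError, ValueError):
                        wait = None  # unparseable header; use backoff below
            if wait is None:
                # Fall back to exponential backoff with jitter
                wait = (2 ** attempt) + random.random()

            # Clamp: never negative, and capped so we don't hang forever
            wait = max(0.0, min(wait, 60))
            time.sleep(wait)

        raise RuntimeError(f"rate-limited after {max_attempts} attempts")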

Two non-obvious points in the client code:

  • Prefer Retry-After over locally-computed backoff. The server knows exactly when the bucket refills. Trust it.
  • Cap the wait. A naive client following a malicious server’s Retry-After: 86400 would hang for a day. Cap at a reasonable upper bound (60s is typical).

Implementation: Redis is the workhorse

Most production rate limiters use Redis. The token-bucket implementation is a single Lua script that atomically reads the current bucket state, computes the refill, decrements if there’s a token, and stores the new state. This avoids the read-modify-write race that plagues naive implementations.
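A sketch of that script and a thin Python wrapper is shown below. It is written under several assumptions and has not been tuned for production: the key layout (`ratelimit:<caller>` as a hash with `tokens` and `ts` fields), the class name `RedisTokenBucket`, and the choice to pass the application server's clock into the script are all illustrative. The wrapper expects a client object that exposes `register_script`, as redis-py's `Redis` client does.

```python
import time

# Runs atomically inside Redis: read state, refill, maybe consume, write back.
TOKEN_BUCKET_LUA = """
local rate     = tonumber(ARGV[1])   -- tokens per second (R)
local capacity = tonumber(ARGV[2])   -- bucket size (B)
local now      = tonumber(ARGV[3])   -- caller-supplied time, in seconds
local cost     = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1])
local ts     = tonumber(state[2])
if tokens == nil or ts == nil then   -- first request: start with a full bucket
  tokens = capacity
  ts = now
end

tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Let idle buckets expire once they would have refilled anyway.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) * 2)
return {allowed, tostring(tokens)}
"""


class RedisTokenBucket:
    def __init__(self, client, rate, capacity):
        # register_script returns a callable that runs the Lua source server-side.
        self._script = client.register_script(TOKEN_BUCKET_LUA)
        self.rate = rate
        self.capacity = capacity

    def allow(self, caller, cost=1, now=None):
        now = time.time() if now is None else now
        allowed, _tokens = self._script(
            keys=[f"ratelimit:{caller}"],
            args=[self.rate, self.capacity, now, cost],
        )
        return int(allowed) == 1
```

Because the whole read-refill-decrement-write sequence runs as one script, two replicas checking the same caller cannot both spend the last token. Using the application clock means skew between servers shows up as small refill errors; reading the time inside Redis is an alternative.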

For very high throughput, sharded Redis or per-region rate limiters reduce hot keys. Cloudflare’s rate limiter uses a hierarchical local-then-global approach: each edge maintains a local counter that gets reconciled across regions every few seconds.

Variants

| Variant | Mechanism | When it fits |
| --- | --- | --- |
| Token bucket | Bucket of capacity B, refilled at rate R | HTTP APIs; the default |
| Leaky bucket | Queue with strict outflow rate | When downstream has hard capacity (databases, queues) |
| Fixed window | Counter resets at window boundary | Daily/monthly quotas; simple use cases |
| Sliding window log | List of timestamps, count those in last N seconds | Precise but memory-heavy |
| Sliding window counter | Two counters with weighted blend | Best precision-to-cost ratio for sub-minute windows |
| Concurrent connection limit | Cap simultaneous open connections per caller | WebSocket / long-poll APIs |
| Cost-weighted | Different endpoints cost different numbers of tokens | Mixed-cost APIs (cheap reads, expensive writes) |
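The sliding window log in the table is the one variant not described earlier, so a brief sketch may help. The names `SlidingWindowLog` and `allow(key, now)` are hypothetical, and the state lives in process memory; the point is to show why it is precise and why it is memory-heavy.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Keep one timestamp per admitted request; count those still in the window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key, now):
        log = self._logs[key]
        cutoff = now - self.window
        # Evict timestamps that have aged out of the window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

Memory grows with the limit (one timestamp per admitted request per caller), which is the cost the sliding window counter avoids by keeping only two numbers.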

Trade-offs

What rate limiting gives you:

  • Protection against runaway clients. Buggy clients in a retry loop don’t take down the service.
  • Fair allocation across tenants. No single noisy neighbour starves the others.
  • Abuse mitigation. Credential stuffing, scraping, and dictionary attacks become economically unattractive.
  • Predictable backend load. With per-key caps and a known key population, you can reason about peak load.

What rate limiting costs you:

  • Operational complexity. Choosing keys, algorithms, limits per endpoint, and tuning over time is real work.
  • Latency on every request. The limiter is in the critical path. A poorly-implemented limiter adds 5-10ms per request.
  • State management. Per-caller state must live somewhere (Redis, in-memory with eventual consistency, or a dedicated rate-limit service).
  • Calibration risk. Limits too tight break legitimate customers; too loose let abuse through. Tuning is ongoing.
  • The “fair share” question is genuinely hard. Per-account, per-key, per-endpoint, per-IP — every choice has edge cases.

Common pitfalls

  • Treating 429 as a hard failure. Clients that don’t retry on 429 break under any limit; clients that retry without backoff cause retry storms. Both extremes are wrong.
  • Per-IP limits on authenticated endpoints. NAT collapses thousands of users into one IP. Authenticate first, then key by user.
  • No Retry-After header. Forces clients to guess. Always include it.
  • One global limit for everything. Different endpoints have different costs and risk profiles. Login should be 10x tighter than read endpoints.
  • Counting in memory without coordination. A multi-replica service with local counters lets a caller multiply their effective limit by the number of replicas they hit. Use a shared store.
  • Forgetting per-endpoint limits. A caller within their global limit can still hammer a single expensive endpoint. Endpoint-level caps are necessary for the heavy ones.
  • No telemetry. Without “what % of requests are 429’d per account?” dashboards, you find out about a bad limit when a customer complains.
  • Soft-limit mode missing. When tuning a new limit, ship it in “warn but don’t reject” mode first, observe, then enforce.
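The last two pitfalls, missing telemetry and missing soft-limit mode, can be addressed with one thin wrapper. The sketch below is illustrative only: `ShadowLimiter`, its `enforce` flag, and `over_limit_ratio` are hypothetical names, and `check` stands in for whatever limiter is being tuned (any callable that takes a key and returns True when the caller is within budget).

```python
from collections import Counter


class ShadowLimiter:
    """Wrap a limiter so it can run in warn-only mode while recording stats."""

    def __init__(self, check, enforce=False):
        self._check = check      # callable(key) -> True if within the limit
        self.enforce = enforce   # False: record over-limit calls but admit them
        self.stats = Counter()

    def allow(self, key):
        within = self._check(key)
        self.stats[("total", key)] += 1
        if not within:
            self.stats[("over_limit", key)] += 1
        # In warn-only mode every request is admitted.
        return within or not self.enforce

    def over_limit_ratio(self, key):
        """Fraction of this key's requests that were (or would be) rejected."""
        total = self.stats[("total", key)]
        return self.stats[("over_limit", key)] / total if total else 0.0
```

Shipping with `enforce=False` and watching `over_limit_ratio` per account shows who a new limit would hit before it rejects anything; flipping the flag to `True` turns on enforcement without changing the limiter underneath.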