Rate Limiter — System Design · Engineering Playbook

Use cases

Rate limiters protect a service from over-consumption — whether the source is malicious (DDoS, credential stuffing), greedy (a single misconfigured client looping), or noisy (a viral event sending normal traffic at abnormal volume). Every public API endpoint should sit behind one. The three canonical placements:

At the edge / CDN — coarse, IP-based, before the request consumes any compute.
At the API gateway — per-user / per-API-key, with quota.
Inline at a service — fine-grained, per-business-operation (e.g. “writes to this account at most 5/sec”).

Functional requirements

Allow up to N requests per time window T per identity (user, IP, API key).
Return a clear signal when limited: HTTP 429, Retry-After header, X-RateLimit-* response headers.
Configurable limits — different tiers (free vs paid), different endpoints, different windows.
Optional per-route overrides.
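One way to model tiered limits with per-route overrides is a two-level lookup: a route override wins, otherwise the tier default applies. The sketch below is illustrative only; `Limit`, `LimitConfig`, the tier names and the numbers in the usage note are assumptions, not part of any real product's configuration.

```cpp
#include <map>
#include <string>
#include <utility>

// N requests per window of T seconds (hypothetical type).
struct Limit { long maxRequests; long windowSec; };

class LimitConfig {
public:
    void setTierDefault(const std::string& tier, Limit l) { tierDefaults_[tier] = l; }

    void setRouteOverride(const std::string& tier, const std::string& route, Limit l) {
        overrides_[std::make_pair(tier, route)] = l;
    }

    // A route override wins; otherwise fall back to the tier default.
    Limit resolve(const std::string& tier, const std::string& route) const {
        auto it = overrides_.find(std::make_pair(tier, route));
        if (it != overrides_.end()) return it->second;
        return tierDefaults_.at(tier);  // throws std::out_of_range for an unknown tier
    }

private:
    std::map<std::string, Limit> tierDefaults_;
    std::map<std::pair<std::string, std::string>, Limit> overrides_;
};
```

For example, a "free" tier might default to 60 requests per 60 seconds while a sensitive route such as "/login" is overridden to 5 per 60 seconds.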

Non-functional requirements

Latency: the check itself must be sub-millisecond at p99. A rate limiter that adds 50ms is worse than the abuse it prevents.
Availability: 99.99%. If the limiter fails open you allow abuse; fail closed and you reject legitimate traffic. Both are bad — make the choice deliberately.
Throughput: matches the traffic of the protected service, so often millions of QPS globally.
Consistency: weak consistency across regions is acceptable; briefly over-allowing by around 20% within a 1-second window is rarely what the threat model cares about.

High-level design

client ──> edge / CDN ──> API gateway ──┬──> rate-limit check (Redis)
                                        │
                                        └──> service ──> DB

The rate-limit check is a synchronous read-modify-write on a counter keyed by (identity, route, window). The update must be atomic (for example a Redis INCR or a Lua script) so that concurrent gateways do not lose increments. Redis (or a similar low-latency KV) is the canonical store; in-memory per-instance counters work too if you can tolerate per-instance drift.
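As a rough illustration of the (identity, route, window) key, the sketch below derives a window index by integer division and joins the parts into a string. The "rl:" prefix and the ":" separator are assumptions made for the example, not a convention the store requires.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical key layout: "rl:<identity>:<route>:<window index>".
// windowSec is the window length T in seconds.
std::string rateLimitKey(const std::string& identity,
                         const std::string& route,
                         std::int64_t nowSec,
                         std::int64_t windowSec) {
    assert(windowSec > 0);
    // Integer division maps every timestamp in the same window to one index.
    const std::int64_t windowIndex = nowSec / windowSec;
    return "rl:" + identity + ":" + route + ":" + std::to_string(windowIndex);
}
```

Because the window index is part of the key, each window gets a fresh counter and old keys can simply expire.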

Detailed design

Algorithms: pick one per limiter

Token bucket. Each identity has a bucket of tokens that refills at a steady rate r up to a max B. Each request costs one token. If the bucket is empty, reject. Great for bursts: a brand-new client can consume B tokens immediately if needed.

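#include <algorithm>  // std::min

// Per-identity state. long long keeps epoch milliseconds safe where long is 32-bit.
struct Bucket { double tokens; long long lastRefillMs; };

// r = refill rate in tokens per second, B = bucket capacity.
bool allow(Bucket& b, long long nowMs, double r, double B) {
    // Refill lazily from the elapsed time, capped at B.
    b.tokens = std::min(B, b.tokens + (nowMs - b.lastRefillMs) * r / 1000.0);
    b.lastRefillMs = nowMs;
    if (b.tokens >= 1.0) { b.tokens -= 1.0; return true; }
    return false;
}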

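Leaky bucket. Same shape, but requests fill the bucket and it “leaks” at a fixed rate. Run as a queue, it releases held requests at that rate, smoothing bursts into a steady stream. Run as a meter that rejects on overflow, it mirrors a token bucket, and with capacity 1 it matches a token bucket with B = 1 (no burst allowance). Used in network traffic shaping.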
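A minimal sketch of the leaky bucket in its meter form, assuming each request adds one unit of "water" and the bucket drains continuously. The names `LeakyBucket`, `allowLeaky`, `leakPerSec` and `capacity` are illustrative.

```cpp
#include <algorithm>
#include <cassert>

// `level` is how full the bucket currently is.
struct LeakyBucket { double level; long long lastLeakMs; };

// leakPerSec = drain rate, capacity = maximum level before overflow.
bool allowLeaky(LeakyBucket& b, long long nowMs, double leakPerSec, double capacity) {
    assert(leakPerSec >= 0.0 && capacity >= 1.0);
    // Drain for the elapsed time, never below empty.
    b.level = std::max(0.0, b.level - (nowMs - b.lastLeakMs) * leakPerSec / 1000.0);
    b.lastLeakMs = nowMs;
    if (b.level + 1.0 <= capacity) { b.level += 1.0; return true; }
    return false;  // admitting this request would overflow the bucket
}
```

With capacity 1 and a leak rate of 1 per second, a second request arriving 500 ms after the first is rejected, while one arriving a full second later is admitted.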

Fixed window counter. Per-(identity, route, current-window), increment. Reject when count > N. Cheapest implementation; vulnerable to a 2× burst at window boundaries (a client fires N requests at 0:59 and N more at 1:00).
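The fixed window counter can be sketched as below, with an in-memory map standing in for the shared store (an assumption for the example; a real deployment would typically use an atomic increment in a KV store). `FixedWindowLimiter` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

class FixedWindowLimiter {
public:
    FixedWindowLimiter(std::int64_t limit, std::int64_t windowSec)
        : limit_(limit), windowSec_(windowSec) {
        assert(windowSec > 0);
    }

    bool allow(const std::string& identity, std::int64_t nowSec) {
        const std::int64_t window = nowSec / windowSec_;
        Entry& e = counters_[identity];
        if (e.window != window) { e.window = window; e.count = 0; }  // new window: reset
        if (e.count >= limit_) return false;
        ++e.count;
        return true;
    }

private:
    struct Entry { std::int64_t window = -1; std::int64_t count = 0; };
    std::int64_t limit_;
    std::int64_t windowSec_;
    std::unordered_map<std::string, Entry> counters_;
};
```

With a limit of 2 per 60 seconds, two requests at second 59 and two more at second 60 are all admitted: four requests within about one second, which is the boundary burst described above.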

Sliding window log. Keep timestamps of every request in the last T seconds; reject when length > N. Most accurate; storage cost scales with rate.
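A minimal sliding window log for a single identity might look like the following. As a simplifying assumption, this sketch records only accepted requests, so rejected attempts do not extend the penalty; `SlidingLog` and `allowLog` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Timestamps (ms) of accepted requests, oldest first.
struct SlidingLog { std::deque<long long> stamps; };

// Allow at most `limit` requests in any trailing window of `windowMs`.
bool allowLog(SlidingLog& log, long long nowMs, long long windowMs, std::size_t limit) {
    assert(windowMs > 0);
    // Evict timestamps outside the window (nowMs - windowMs, nowMs].
    while (!log.stamps.empty() && log.stamps.front() <= nowMs - windowMs)
        log.stamps.pop_front();
    if (log.stamps.size() >= limit) return false;
    log.stamps.push_back(nowMs);
    return true;
}
```

Eviction happens lazily on each check, so memory per identity is bounded by the limit rather than by the raw request rate in this variant.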

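Sliding window counter. Estimates the count in the trailing window by adding the current fixed window's count to the previous window's count, weighted by how much of that previous window still overlaps the trailing window. Strikes the practical sweet spot: O(1) storage per identity, no boundary spike, and only a small approximation error, which comes from assuming the previous window's requests were evenly spread.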
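The weighting can be sketched as follows for one identity. The struct and function names are assumptions, and the estimate treats the previous window's requests as evenly distributed, which is the usual simplification for this technique.

```cpp
#include <cassert>
#include <cstdint>

struct WindowCounter {
    std::int64_t windowIndex = -1;  // index of the current fixed window
    std::int64_t current = 0;       // requests counted in the current window
    std::int64_t previous = 0;      // requests counted in the window before it
};

bool allowSliding(WindowCounter& c, std::int64_t nowMs,
                  std::int64_t windowMs, std::int64_t limit) {
    assert(windowMs > 0);
    const std::int64_t idx = nowMs / windowMs;
    if (idx != c.windowIndex) {
        // Roll forward; the old count carries over only if the windows are adjacent.
        c.previous = (idx == c.windowIndex + 1) ? c.current : 0;
        c.current = 0;
        c.windowIndex = idx;
    }
    // Fraction of the current window already elapsed, in [0, 1).
    const double elapsed =
        static_cast<double>(nowMs % windowMs) / static_cast<double>(windowMs);
    // Weight the previous window by the share of it still inside the trailing window.
    const double estimate = c.previous * (1.0 - elapsed) + c.current;
    if (estimate + 1.0 > static_cast<double>(limit)) return false;
    ++c.current;
    return true;
}
```

With a limit of 10 per second, a client that used all 10 requests late in one window is still blocked at the start of the next, and regains roughly half its quota halfway through it.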

Distributed coordination

A single Redis instance is the simplest backend: every gateway in every region routes its rate-limit checks there, which adds a cross-region round trip and a single point of failure. High-throughput deployments shard Redis by identity (hash(user_id) % S for S shards) so that all checks for one identity land on the same shard; latency-sensitive ones also run Redis in each region so the check stays local. Cross-shard limits (e.g. “5 calls/sec across all of user.create_*”) require pre-aggregating, accepting drift, or running a small consensus protocol.
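Shard selection by identity can be sketched as below. The example uses FNV-1a rather than std::hash because every gateway must compute the same shard for the same user, and std::hash output is implementation-defined; the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1a, 64-bit: a simple hash that yields the same value on every gateway.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char ch : s) {
        h ^= ch;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Maps a user to one of `shardCount` shards.
std::size_t shardFor(const std::string& userId, std::size_t shardCount) {
    assert(shardCount > 0);
    return static_cast<std::size_t>(fnv1a64(userId) % shardCount);
}
```

Plain modulo remaps most identities whenever the shard count changes; consistent hashing is the usual alternative when shards are added or removed often.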

For very high traffic, the common approach is a local + global hybrid: each gateway maintains a local counter and asynchronously syncs to a global Redis every few milliseconds. Local checks take nanoseconds to microseconds because they touch only process memory; global drift is bounded by the sync interval and the number of gateways. Slack and Cloudflare have both published variants of this design.
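A single-threaded sketch of the hybrid idea follows, under heavy simplifying assumptions: `GlobalCounter` is a stand-in for one shared counter in Redis, there is no window reset, and `sync()` would be driven by a timer in practice. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the shared store (e.g. one counter in Redis).
struct GlobalCounter { std::int64_t total = 0; };

class HybridLimiter {
public:
    HybridLimiter(GlobalCounter& global, std::int64_t limit)
        : global_(global), limit_(limit) {
        assert(limit >= 0);
    }

    // Hot path: touches only local memory.
    bool allow() {
        if (lastSeenGlobal_ + pending_ >= limit_) return false;
        ++pending_;
        return true;
    }

    // Run once per sync interval: push local increments, then pull the global total.
    void sync() {
        global_.total += pending_;
        pending_ = 0;
        lastSeenGlobal_ = global_.total;
    }

private:
    GlobalCounter& global_;
    std::int64_t limit_;
    std::int64_t pending_ = 0;         // accepted locally, not yet pushed
    std::int64_t lastSeenGlobal_ = 0;  // global total as of the last sync
};
```

Two gateways sharing a limit of 3 can each admit requests before their first sync, so the global total briefly exceeds the limit; that overshoot is the drift the sync interval bounds.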

Quota replenishment

Token bucket replenishes implicitly via the tokens + elapsed * rate formula — no background job needed. Leaky bucket is the same. Fixed window resets at wall-clock boundaries (you can use Redis EXPIRE to auto-clear). Sliding window log requires periodic eviction of old timestamps.

What to return

HTTP/1.1 429 Too Many Requests
Retry-After: 12
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1715800000

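The Retry-After header is critical: clients should back off, not retry immediately. Do not assume they honor it automatically. Plain fetch and axios do not retry on their own, so respecting Retry-After usually takes retry middleware or explicit client code, and gRPC clients rely on their own retry configuration rather than HTTP headers. The X-RateLimit-* names are a widely used convention rather than a formal standard.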
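As an illustration, the response above could be assembled as follows. The sketch assumes a token-bucket limiter, so Retry-After is the time until one full token is available; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>

// Seconds until a bucket holding `tokens` refills to one token at r tokens/sec.
long retryAfterSeconds(double tokens, double r) {
    assert(r > 0.0);
    const double deficit = std::max(0.0, 1.0 - tokens);
    return static_cast<long>(std::ceil(deficit / r));
}

// Builds the status line and headers of a 429 response.
std::string tooManyRequests(long limit, long retryAfterSec, long long resetEpochSec) {
    std::string out = "HTTP/1.1 429 Too Many Requests\r\n";
    out += "Retry-After: " + std::to_string(retryAfterSec) + "\r\n";
    out += "X-RateLimit-Limit: " + std::to_string(limit) + "\r\n";
    out += "X-RateLimit-Remaining: 0\r\n";
    out += "X-RateLimit-Reset: " + std::to_string(resetEpochSec) + "\r\n\r\n";
    return out;
}
```

Rounding Retry-After up rather than down avoids telling a client to retry a moment before a token is actually available.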

Trade-offs

Fail closed — when Redis is unreachable, reject all requests. Protects downstream completely. Risks degrading the entire site if the limiter dies.

Fail open: when Redis is unreachable, allow all requests. The site keeps working but is unprotected for the duration of the outage. Many public APIs choose this.
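Making the choice deliberate can be as simple as an explicit policy value consulted only when the backend gives no answer. The enum and function names below are assumptions for the sketch.

```cpp
// Outcome of asking the rate-limit store.
enum class BackendResult { Allowed, Denied, Unreachable };

// What to do when the store cannot be reached.
enum class FailureMode { Open, Closed };

bool admit(BackendResult result, FailureMode mode) {
    switch (result) {
        case BackendResult::Allowed:     return true;
        case BackendResult::Denied:      return false;
        case BackendResult::Unreachable: return mode == FailureMode::Open;
    }
    return false;  // not reached for valid enum values
}
```

Keeping the mode a per-route setting lets a login endpoint fail closed while read-only endpoints fail open.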

Other dials:

Strict vs lenient: counting failed requests against the limit deters credential-stuffing but punishes flaky clients. The right answer depends on the endpoint.
Per-IP vs per-user: IP limiting is necessary at the edge (you don’t know users yet) but penalizes legitimate clients that share one address behind NAT. Combine both.
Burst capacity: bigger B is friendlier to legitimate spike traffic but lets attackers stockpile.

Real-world examples

Stripe documents a default per-account limit on the order of 100 requests per second in live mode (lower in test mode) and has described its limiters as token-bucket based.
GitHub API uses fixed-window per-hour limits (5000/hr for authenticated, 60/hr for anonymous) and surfaces remaining quota in response headers.
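AWS API Gateway applies token-bucket throttling (a steady-state rate plus a burst size) and supports per-API-key limits and quotas through usage plans; AWS describes this throttling as best-effort rather than a hard ceiling.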
Cloudflare runs sliding-window-style limits at the edge with rule-based per-route configuration.

Distributed Cache — the storage substrate for most rate limiters.
Load Balancers — where edge rate-limiting sits in the stack.
Pub-Sub — used to broadcast quota updates across gateways.