Rate Limiting — API Design · Engineering Playbook

What it is

Rate limiting is the server-side discipline of capping how many requests a given caller can make in a given time window. The cap exists for several reasons at once: protecting backend capacity from runaway clients, ensuring fair allocation across tenants, preventing abuse from credential-stuffing or scraping bots, and shedding load gracefully when the system is overwhelmed.

The mechanics live at the gateway or middleware layer — the request comes in, the limiter checks “has this caller exceeded their budget?”, and either passes the request through or rejects it with 429 Too Many Requests. The hard problems are not at the reject step; they’re in how the budget is counted (the algorithm), what the caller is keyed by (user, IP, API key, endpoint), and what the response tells the client so the client can back off correctly.

Four algorithms cover the large majority of production use: token bucket (the most common; allows controlled bursts), leaky bucket (similar but enforces a strict outflow), fixed window (simplest; allows boundary-effect spikes), and sliding window (smoother; more state). The token bucket is the default for HTTP APIs; the others have their niches.

The 429 contract is part of the design, not an afterthought. A well-behaved API returns 429 with Retry-After (how long to wait) and, by convention, X-RateLimit-Limit (the cap), X-RateLimit-Remaining (current budget), and X-RateLimit-Reset (when the budget refills). A client that respects these headers degrades gracefully; a client that ignores them ends up in a retry storm.

When to use it

Rate limiting is mandatory for:

Any public API. Without limits, a single buggy client or hostile script can exhaust capacity and degrade service for everyone else. This is non-negotiable.
Multi-tenant APIs. Per-tenant rate limits are how you stop one customer’s bad day from becoming everyone’s bad day. “Noisy neighbour” is the failure mode rate limits prevent.
Auth endpoints. Login, password reset, and 2FA-code endpoints get aggressive rate limits to defeat credential stuffing and brute force. Often these limits are tighter than the rest of the API by an order of magnitude.
Anything backed by expensive resources. AI inference endpoints, image processing, full-text search — anything where each request has measurable cost.

Rate limiting is less load-bearing when:

The API is internal and the callers are all known. A capacity limit at the service-mesh level may suffice; per-caller rate limits add operational complexity.
The workload is trusted bulk traffic. Bulk import endpoints expect bursts; instead of rate-limiting, queue the work and process asynchronously.

How it works

Token bucket — the default algorithm

The token bucket is the most common choice for HTTP APIs because it allows controlled bursts while enforcing a long-run rate.

The model: each caller has a virtual bucket with capacity B tokens. Tokens are added at rate R per second (up to the cap). Each request consumes one token; if no token is available, the request is rejected.

   Refill rate: R tokens/sec
       │
       ▼
   ┌─────────┐
   │ ▓▓▓▓▓░░ │  capacity B = 100, current = 60
   └─────────┘
       │
       ▼
   1 token per request (or N for cost-weighted requests)

For example, with R = 100/s and B = 1000, the caller can burst up to 1,000 requests immediately if the bucket is full, but can sustain only 100/s on average. A full burst empties the bucket, and refilling it from empty takes B / R = 10 seconds if the caller stays idle.

Stripe and AWS API Gateway both document token-bucket limiters, and many other major HTTP APIs use the same model or a close variant. The two knobs, sustained rate R and burst capacity B, let you tune for the workload.
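The model above can be sketched in a few lines. This is a minimal in-memory illustration for a single caller, not a production limiter: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example, and the code is not thread-safe.

```python
import time


class TokenBucket:
    """In-memory token bucket for one caller (a sketch, not thread-safe)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # R: tokens added per second
        self.capacity = capacity        # B: maximum tokens the bucket can hold
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # assumption: a new caller starts full
        self.updated = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; return True when allowed."""
        now = self.clock()
        # Lazy refill: credit the tokens earned since the last call, up to B.
        earned = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=100` and `capacity=1000`, a full bucket admits 1,000 back-to-back calls, then roughly 100 more for each second that passes. Passing `cost` greater than 1 gives the cost-weighted behaviour shown in the diagram.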

Leaky bucket — strict outflow

The leaky bucket is the dual of the token bucket: instead of tokens accumulating up to a cap, requests accumulate up to a cap, and they “leak out” (i.e., are processed) at a fixed rate.

   ┌─────────┐
   │░░░░░░░░░│ ← requests arrive
   │░░░░░░░░░│
   │░░░░░░░░░│
   └────┬────┘
        │ outflow rate R (fixed)
        ▼
   processed by backend

The behavioural difference: a leaky bucket enforces a strict outflow rate; a token bucket allows bursts. If a workload has spikes and that’s tolerable, use token bucket. If downstream capacity is fixed (a database that can handle exactly 1,000 qps and not one more), use leaky bucket.

Leaky bucket is more common at the network layer (router buffers) than in HTTP rate limiters.
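As a rough illustration of the queue-based form described above, the sketch below holds requests in a bounded queue and releases at most one per `1 / rate` seconds. The names `LeakyBucket`, `offer`, and `leak` are hypothetical, and the sketch assumes a worker that polls `leak()` frequently.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate (a sketch, not thread-safe)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.interval = 1.0 / rate      # minimum seconds between releases
        self.capacity = capacity        # maximum number of queued requests
        self.clock = clock              # injectable so tests can control time
        self.queue = deque()
        self.next_release = clock()     # earliest time the next request may leave

    def offer(self, request):
        """Queue a request; return False (reject) when the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        """Return the next request if its slot has arrived, else None."""
        now = self.clock()
        if not self.queue or now < self.next_release:
            return None
        # Pacing from "now" means idle time never builds up burst credit.
        self.next_release = now + self.interval
        return self.queue.popleft()
```

Unlike the token bucket, nothing here rewards an idle caller: however long the queue sat empty, the backend never sees two releases closer together than the interval.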

Fixed window — the simplest

Count requests in discrete time windows: “100 requests per minute” means resetting the counter at every minute boundary.

   minute 0:00     minute 0:01     minute 0:02
   ┌─────────────┬─────────────┬─────────────┐
   │  count = 0  │  count = 0  │  count = 0  │
   └─────────────┴─────────────┴─────────────┘
       reset           reset

Trivial to implement (one counter per caller, reset on a cron). The problem: boundary effects. A caller can send 100 requests in the last second of minute 0 and 100 more in the first second of minute 1 — 200 requests in a two-second span, despite the “100/minute” limit.

Useful for coarse limits where the boundary effect doesn’t matter (daily quotas), or for situations where simplicity outweighs precision.
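A small sketch makes the boundary effect concrete. The class name `FixedWindowLimiter` and its parameters are assumptions for illustration. The sketch derives the window from the timestamp instead of running a reset job, and it never evicts old counters, which a real implementation would need to do.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Per-caller fixed-window counter (a sketch; old windows are never evicted)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                  # injectable so tests can control time
        self.counts = defaultdict(int)      # (caller, window index) -> count

    def allow(self, caller):
        # Integer division maps every timestamp in a window to the same index,
        # so each new window implicitly starts from zero.
        index = int(self.clock() // self.window)
        key = (caller, index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

With `limit=100` and `window_seconds=60`, a caller can be admitted 100 times at t = 59s and another 100 times at t = 60s, which is the two-second, 200-request spike described above.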

Sliding window — the smoothest

Maintain a rolling count over the last N seconds. Requests from the previous fixed window are weighted by how much of that window still overlaps the rolling one.

A common approximation: keep the count for the current window and the prior window, and compute a weighted blend based on elapsed time.

   weighted_count = prior_count * (1 - elapsed_fraction) + current_count

If elapsed_fraction = 0.3 into the current minute, count 0.7 * prior_minute_count + current_minute_count.

This largely removes the boundary effect of fixed windows at modest extra state (two counters instead of one). It is an approximation, because it assumes requests in the prior window were evenly spread. Cloudflare has described using this approach in its rate limiter.
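The weighted blend can be sketched as follows, for a single caller. The class name `SlidingWindowCounter` and its fields are assumptions for the example; the arithmetic follows the `weighted_count` formula above.

```python
import time


class SlidingWindowCounter:
    """Two-counter sliding-window approximation for one caller (a sketch)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock      # injectable so tests can control time
        self.index = None       # index of the current fixed window
        self.current = 0        # requests counted in the current window
        self.prior = 0          # requests counted in the window before it

    def allow(self):
        now = self.clock()
        index = int(now // self.window)
        if index != self.index:
            if self.index is not None and index == self.index + 1:
                self.prior = self.current   # rolled into the adjacent window
            else:
                self.prior = 0              # a gap: the old count is stale
            self.current = 0
            self.index = index
        elapsed_fraction = (now % self.window) / self.window
        weighted = self.prior * (1 - elapsed_fraction) + self.current
        if weighted >= self.limit:
            return False
        self.current += 1
        return True
```

With a limit of 100 per 60 seconds, a caller who spent the whole budget in the prior window is rejected right at the boundary and is admitted 50 times once the new window is half elapsed.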

Keying — who counts as a caller?

The caller is the thing the limit applies to. Choices, roughly in order of trustworthiness:

Key	Use when	Notes / risk
API key	Authenticated APIs	Best signal of identity
User ID	After authentication, when the same user has multiple keys	Requires resolving the key to a user
Account / organisation	Multi-tenant SaaS where a single account spans many users	Requires hierarchical resolution
IP address	Anonymous endpoints; login pre-auth	Spoofable; NAT collisions; mobile carriers share IPs
(IP + User-Agent + cookie)	Fingerprinting anonymous abuse	Fragile; many false positives

Best practice: layer multiple limits. Per-API-key for normal traffic; per-IP as a fallback for unauthenticated endpoints; per-account as an aggregate cap so a single account with many API keys can’t escape the limit by sharding.

The IP key is the weakest because of NAT. A corporate office or a mobile carrier may have thousands of users behind one IP; limiting “100 requests per minute per IP” effectively rate-limits the entire office. Avoid for authenticated endpoints.
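One way to express the layering is a function that returns every key a request must pass, as sketched below. The `Request` shape and the `limit_keys` helper are hypothetical; a real gateway would resolve the API key to its account before this step. A request is admitted only if each returned key still has budget in its own limiter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    ip: str
    api_key: Optional[str] = None       # None for anonymous traffic
    account_id: Optional[str] = None    # owner of the API key, if resolved


def limit_keys(req):
    """Return the (scope, key) pairs whose limits this request must pass."""
    if req.api_key is None:
        # Anonymous traffic: the IP is the only signal available.
        return [("ip", req.ip)]
    keys = [("api_key", req.api_key)]
    if req.account_id is not None:
        # Aggregate cap so many keys on one account share a single budget.
        keys.append(("account", req.account_id))
    return keys
```

Authenticated requests deliberately skip the per-IP scope here, in line with the NAT caveat above.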

The 429 contract

When a request is rejected, the server returns 429 Too Many Requests with response headers that tell the client how to back off:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1717070400
Content-Type: application/json

{
  "error": {
    "code": "rate_limited",
    "message": "Too many requests. Retry after 30 seconds.",
    "retry_after_seconds": 30
  }
}

The four headers (only Retry-After is defined in the HTTP specification, RFC 9110; the X-RateLimit-* family is a de-facto convention):

Retry-After — seconds to wait before retrying, or an HTTP-date. The RFC-blessed header.
X-RateLimit-Limit — the cap for this window.
X-RateLimit-Remaining — how many requests are left (often 0 in a 429, but informative on non-429 responses too).
X-RateLimit-Reset — Unix timestamp when the bucket refills (or when the window resets).

The IETF is standardising these as RateLimit (without the X-) per draft-ietf-httpapi-ratelimit-headers. Most production APIs still use the X- prefix.
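On the server side, the headers can be derived from the limiter's state. The helper below is a sketch under assumptions: the function name and its parameters are hypothetical, and it assumes the limiter can report a reset time as a Unix timestamp.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now_epoch):
    """Build the X-RateLimit-* headers, plus Retry-After when exhausted."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Round up so a client that waits exactly this long is not early.
        wait = max(1, math.ceil(reset_epoch - now_epoch))
        headers["Retry-After"] = str(wait)
    return headers
```

Sending the three X-RateLimit-* headers on successful responses as well lets clients slow down before they ever see a 429.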

Client behaviour — respecting the headers

A well-written client respects 429 with backoff. The same logic is shown in Python, Go, and JavaScript:

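import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

import requests

def parse_retry_after(value):
    """Return the Retry-After delay in seconds, or None if it cannot be parsed."""
    try:
        return max(0.0, float(value))  # delta-seconds form, e.g. "30"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())

def call_with_rate_limit(url, headers=None, max_attempts=5):
    for attempt in range(max_attempts):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp

        # Prefer the server's Retry-After (seconds or HTTP-date)
        retry_after = resp.headers.get("Retry-After")
        wait = parse_retry_after(retry_after) if retry_after else None
        if wait is None:
            # Fall back to exponential backoff with jitter
            wait = (2 ** attempt) + random.random()

        # Cap the wait so we don't hang forever
        wait = min(wait, 60)
        time.sleep(wait)

    raise RuntimeError(f"rate-limited after {max_attempts} attempts")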

package main

import (
    "fmt"
    "math/rand"
    "net/http"
    "strconv"
    "time"
)

func callWithRateLimit(url string, maxAttempts int) (*http.Response, error) {
    client := &http.Client{Timeout: 10 * time.Second}
    for attempt := 0; attempt < maxAttempts; attempt++ {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests {
            return resp, nil
        }
        resp.Body.Close()

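        // Retry-After is either delta-seconds or an HTTP-date.
        wait := time.Duration(0)
        if ra := resp.Header.Get("Retry-After"); ra != "" {
            if secs, err := strconv.Atoi(ra); err == nil {
                wait = time.Duration(secs) * time.Second
            } else if t, err := http.ParseTime(ra); err == nil {
                wait = time.Until(t)
            }
        }
        if wait <= 0 {
            // Missing, unparseable, or already-past value: exponential backoff with jitter.
            base := time.Duration(1<<uint(attempt)) * time.Second
            wait = base + time.Duration(rand.Intn(1000))*time.Millisecond
        }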
        if wait > 60*time.Second {
            wait = 60 * time.Second
        }
        time.Sleep(wait)
    }
    return nil, fmt.Errorf("rate-limited after %d attempts", maxAttempts)
}

async function callWithRateLimit(url, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const resp = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    if (resp.status !== 429) {
      if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
      return resp;
    }

    // Retry-After is either delta-seconds or an HTTP-date
    const retryAfter = resp.headers.get("retry-after");
    let waitMs = NaN;
    if (retryAfter) {
      const secs = Number(retryAfter);
      waitMs = Number.isNaN(secs)
        ? Date.parse(retryAfter) - Date.now()
        : secs * 1000;
    }
    if (!Number.isFinite(waitMs) || waitMs < 0) {
      // Missing or unparseable header: exponential backoff with jitter
      waitMs = (2 ** attempt) * 1000 + Math.random() * 1000;
    }
    waitMs = Math.min(waitMs, 60_000);
    await new Promise((r) => setTimeout(r, waitMs));
  }
  throw new Error(`rate-limited after ${maxAttempts} attempts`);
}

Two non-obvious points in the client code:

Prefer Retry-After over locally-computed backoff. The server knows exactly when the bucket refills. Trust it.
Cap the wait. A naive client following a malicious server’s Retry-After: 86400 would hang for a day. Cap at a reasonable upper bound (60s is typical).

Implementation — Redis is the workhorse

Most production rate limiters use Redis. The token-bucket implementation is a single Lua script that atomically reads the current bucket state, computes the refill, decrements if there’s a token, and stores the new state. This avoids the read-modify-write race that plagues naive implementations.

For very high throughput, sharded Redis or per-region rate limiters reduce hot keys. Cloudflare’s rate limiter uses a hierarchical local-then-global approach: each edge maintains a local counter that gets reconciled across regions every few seconds.

Variants

Variant	Mechanism	When it fits
Token bucket	Bucket of capacity `B`, refilled at rate `R`	HTTP APIs; the default
Leaky bucket	Queue with strict outflow rate	When downstream has hard capacity (databases, queues)
Fixed window	Counter resets at window boundary	Daily/monthly quotas; simple use cases
Sliding window log	List of timestamps, count those in last N seconds	Precise but memory-heavy
Sliding window counter	Two counters with weighted blend	Best precision-to-cost ratio for sub-minute windows
Concurrent connection limit	Cap simultaneous open connections per caller	WebSocket / long-poll APIs
Cost-weighted	Different endpoints cost different numbers of tokens	Mixed-cost APIs (cheap reads, expensive writes)
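The sliding window log is the one variant in the table not covered earlier, and a short sketch shows where its memory cost comes from. The class name `SlidingWindowLog` is an assumption for illustration; the sketch tracks a single caller and stores one timestamp per admitted request.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Exact sliding window for one caller: one stored timestamp per request."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock          # injectable so tests can control time
        self.stamps = deque()       # timestamps of admitted requests, oldest first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window
        # Drop timestamps that have aged out of the rolling window.
        while self.stamps and self.stamps[0] <= cutoff:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True
```

State grows to `limit` timestamps per caller, compared with two integers for the sliding window counter, which is why the log form is usually reserved for low limits.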

Trade-offs

What rate limiting gives you:

Protection against runaway clients. Buggy clients in a retry loop don’t take down the service.
Fair allocation across tenants. No single noisy neighbour starves the others.
Abuse mitigation. Credential stuffing, scraping, and dictionary attacks become economically unattractive.
Predictable backend load. With per-key caps and a known key population, you can reason about peak load.

What rate limiting costs you:

Operational complexity. Choosing keys, algorithms, limits per endpoint, and tuning over time is real work.
Latency on every request. The limiter is in the critical path. A poorly-implemented limiter adds 5-10ms per request.
State management. Per-caller state must live somewhere (Redis, in-memory with eventual consistency, or a dedicated rate-limit service).
Calibration risk. Limits too tight break legitimate customers; too loose let abuse through. Tuning is ongoing.
The “fair share” question is genuinely hard. Per-account, per-key, per-endpoint, per-IP — every choice has edge cases.

Common pitfalls

Treating 429 as a hard failure. Clients that don’t retry on 429 break under any limit; clients that retry without backoff cause retry storms. Both extremes are wrong.
Per-IP limits on authenticated endpoints. NAT collapses thousands of users into one IP. Authenticate first, then key by user.
No Retry-After header. Forces clients to guess. Always include it.
One global limit for everything. Different endpoints have different costs and risk profiles. Login should be 10x tighter than read endpoints.
Counting in memory without coordination. A multi-replica service with local counters lets a caller multiply their effective limit by the number of replicas they hit. Use a shared store.
Forgetting per-endpoint limits. A caller within their global limit can still hammer a single expensive endpoint. Endpoint-level caps are necessary for the heavy ones.
No telemetry. Without “what % of requests are 429’d per account?” dashboards, you find out about a bad limit when a customer complains.
Soft-limit mode missing. When tuning a new limit, ship it in “warn but don’t reject” mode first, observe, then enforce.
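The soft-limit idea from the last pitfall can be as small as one flag around the limiter's verdict. The sketch below is illustrative only: `check_limit`, its parameters, and the `ratelimit` logger name are assumptions, and the log line stands in for whatever telemetry the service already emits.

```python
import logging

logger = logging.getLogger("ratelimit")


def check_limit(allowed, caller, enforce):
    """Turn a limiter verdict into a pass/reject decision.

    allowed: verdict from any limiter (True means within budget)
    enforce: False runs the limit in soft mode (log, but do not reject)
    Returns True when the request should proceed.
    """
    if allowed:
        return True
    # Record every would-be rejection so the limit can be tuned from data.
    logger.warning("rate limit exceeded caller=%s enforced=%s", caller, enforce)
    return not enforce
```

Running with `enforce=False` first shows which callers a new limit would have rejected before any of them receives a 429.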

The Circuit Breaker Pattern — the upstream half of the resilience pair; rate limiting protects you, circuit breakers protect your callers from your failures.
Managing Retries — the client-side contract that pairs with the 429 response.
API Monitoring — observing the 429 rate per caller and per endpoint is how you tune limits over time.
Caching at Different Layers — caching reduces the request count to the origin, which reduces pressure on the rate limiter.
HTTP — The Foundational Protocol for APIs — the Retry-After header and 429 status code definitions.