Managing Retries — API Design · Engineering Playbook

What it is

Retries are the client-side discipline of reissuing a failed request in the hope that the failure was transient. Done right, retries hide the dropped packets, brief congestion spikes, and momentary partial failures that every distributed system has — the request that would have failed at 200ms succeeds at 800ms after one retry, and the user never sees the seam.

Done wrong, retries are the mechanism by which a recovering service gets killed for a second time. A service that goes down for 30 seconds and starts coming back up faces every client’s accumulated retry queue all at once — a self-inflicted denial of service that prevents recovery. The 2015 AWS DynamoDB outage and the 2020 Cloudflare Workers KV incident both included a “retry-storm extended the outage by 20+ minutes” finding.

The vocabulary that separates safe retries from dangerous ones has four parts: exponential backoff (each retry waits longer than the last), jitter (randomise the wait so clients don’t synchronise), retry budgets (cap the cost — max attempts, max time, max queue size), and idempotency (the prerequisite — retrying a non-idempotent operation creates duplicates).

The senior signal in any interview that touches retries: retries are not a substitute for reliability. They’re a tool for masking transient failure. If a service is failing 50% of requests, retrying won’t help — the right answer is a circuit breaker, not more retries.

When to use it

Retries are appropriate for:

Network-level transient failures. TCP resets, DNS hiccups, brief TLS handshake failures. These resolve quickly; a retry hides them.
Server-side transient failures. 503 Service Unavailable with Retry-After, occasional 502 Bad Gateway from a load balancer’s brief view of a restarting backend, 504 Gateway Timeout for a stuck request.
Rate-limit responses (429). With Retry-After honored.
Idempotent operations. Reads, PUTs, DELETEs, and POSTs with an idempotency key. See idempotency-in-api-design.

Retries are inappropriate for:

Client errors (4xx other than 429). A 400 Bad Request won’t become 200 OK on retry; the request is malformed. A 403 Forbidden won’t grant access on retry. A 404 Not Found won’t find the resource on retry. Don’t retry these.
Non-idempotent operations without an idempotency key. Charging a card, sending an email, creating an order. A retry creates a duplicate.
Persistent failures. If 5 retries don’t help, retry 6 won’t either. Move to the fail path.

How it works

Exponential backoff

The wait between retries doubles (or scales by some factor) on each attempt:

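delay = base * factor^attempt

Here attempt is the zero-based index of the attempt that just failed. With base = 100ms and factor = 2:

Attempt 0: try immediately
Retry 1 (after attempt 0 fails): wait 100ms
Retry 2 (after attempt 1 fails): wait 200ms
Retry 3 (after attempt 2 fails): wait 400ms
Retry 4 (after attempt 3 fails): wait 800ms
Retry 5 (after attempt 4 fails): wait 1600ms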

The rationale: a brief blip (100ms) clears on the first retry. A medium failure (1-2s) clears by the fourth or fifth, since the waits alone add up to 1.5s and then 3.1s by that point. A sustained outage takes the client past the budget without hammering the server.
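As a minimal sketch, the schedule above can be computed directly. The function names `backoff_delays` and `cumulative_waits` are illustrative, not from any library, and the attempt index is assumed to be zero-based.

```python
from itertools import accumulate


def backoff_delays(base_ms=100, factor=2, retries=5):
    """Wait before each retry, using delay = base * factor^attempt.

    attempt is the zero-based index of the attempt that just failed.
    """
    return [base_ms * factor ** attempt for attempt in range(retries)]


def cumulative_waits(delays):
    """Total time spent waiting by the time each retry is sent."""
    return list(accumulate(delays))


delays = backoff_delays()
print(delays)                    # [100, 200, 400, 800, 1600]
print(cumulative_waits(delays))  # [100, 300, 700, 1500, 3100]
```

The cumulative list is the useful one when sizing a time budget: it shows how long a client has been waiting, excluding request time, before each retry goes out.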

Jitter: the bit everyone forgets

Without jitter, every client retries on the same schedule. If 1,000 clients all see a failure at the same instant, they all retry at exactly 100ms, then 200ms, then 400ms — a synchronised thundering herd hitting the recovering service.

The fix is jitter: randomise the wait. Three common variants:

full jitter:           wait = rand(0, base * factor^attempt)
equal jitter:          wait = base * factor^attempt / 2 + rand(0, base * factor^attempt / 2)
decorrelated jitter:   wait = rand(base, last_wait * 3)   # capped at max

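AWS’s “Exponential Backoff And Jitter” blog post (2015) compared the three in simulation. Equal jitter came out worst. Full jitter and decorrelated jitter were close, with full jitter doing less total client work at the cost of slightly more completion time. Many libraries default to full jitter today.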

   Without jitter:         With full jitter:
       |                       :        :
   ----▼-----▼-----▼-----  ----▼-----:▼--▼-----:--▼:--
       all clients              spread out
       retry together
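The three jitter formulas translate almost line for line into code. This is a sketch under assumptions: the function names are illustrative, times are in seconds, and the `cap` argument on the first two variants is an addition (the formulas above only mention a cap for decorrelated jitter).

```python
import random


def full_jitter(base, factor, attempt, cap):
    """wait = rand(0, base * factor^attempt), with the ceiling capped."""
    return random.uniform(0, min(cap, base * factor ** attempt))


def equal_jitter(base, factor, attempt, cap):
    """Half of the (capped) backoff is fixed, the other half is random."""
    half = min(cap, base * factor ** attempt) / 2
    return half + random.uniform(0, half)


def decorrelated_jitter(base, last_wait, cap):
    """wait = rand(base, last_wait * 3), capped at cap.

    Pass last_wait = base for the first retry.
    """
    return min(cap, random.uniform(base, last_wait * 3))


wait = 0.1
for attempt in range(5):
    wait = decorrelated_jitter(0.1, wait, cap=30.0)
    print(attempt,
          round(full_jitter(0.1, 2, attempt, 30.0), 3),
          round(equal_jitter(0.1, 2, attempt, 30.0), 3),
          round(wait, 3))
```

Full and equal jitter are stateless (they only need the attempt number), while decorrelated jitter has to carry the previous wait from one retry to the next.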

Retry budgets: the cost cap

A budget caps how many retries the system as a whole will spend. Three forms:

Per-call attempts cap. Maximum retry count for a single call, often 3-5. Beyond this, fail the call.
Per-call time budget. Maximum total elapsed time including retries, often 30s. If the budget runs out mid-backoff, abandon.
Per-target retry quota. Across all calls to the same target, cap retries as a fraction of total requests (e.g., “at most 10% of requests can be retries”). Prevents retry-storm during incidents.
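The first two forms can be combined in a single loop. The sketch below is illustrative only: `TransientError`, `retry_with_budget`, and the default values are assumptions, and the `sleep` and `clock` parameters exist so the loop can be exercised without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_budget(operation, max_attempts=4, time_budget=30.0,
                      base=0.1, cap=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or a per-call budget runs out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + time_budget
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Attempts cap: the final failure is surfaced to the caller.
            if attempt == max_attempts - 1:
                raise
            # Full-jitter backoff for the next wait.
            wait = random.uniform(0, min(cap, base * 2 ** attempt))
            # Time budget: never start a wait that would overrun the deadline.
            if clock() + wait > deadline:
                raise
            sleep(wait)
```

Checking the deadline before sleeping, rather than after, means the caller gets the failure as soon as it is clear the budget cannot cover another attempt.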

The third form is what Google’s SRE Book calls a “retry budget” specifically and what gRPC implements as RetryThrottlingPolicy. The idea: when the system is healthy, retries are cheap and rare. When the system is degraded, the retry rate spikes and starts to consume the budget. When the budget is exhausted, retries are disabled until the underlying success rate recovers.

This is the discipline that prevents a recovering service from being re-killed.
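A per-target budget can be sketched as a small token counter, loosely modelled on the token scheme used by gRPC retry throttling: failures drain tokens, successes refill them slowly, and retries are allowed only while the counter stays above half of its maximum. The class name, method names, and default values here are assumptions for illustration, and the sketch is not thread-safe.

```python
class RetryThrottle:
    """Illustrative per-target retry budget (one instance per backend)."""

    def __init__(self, max_tokens=10.0, token_ratio=0.1):
        self.max_tokens = max_tokens
        self.token_ratio = token_ratio  # tokens regained per success
        self.tokens = max_tokens

    def record_success(self):
        self.tokens = min(self.max_tokens, self.tokens + self.token_ratio)

    def record_failure(self):
        self.tokens = max(0.0, self.tokens - 1.0)

    def can_retry(self):
        # Retries stay disabled until successes push the count back
        # above the halfway mark.
        return self.tokens > self.max_tokens / 2


throttle = RetryThrottle(max_tokens=10, token_ratio=0.5)
for _ in range(5):
    throttle.record_failure()
print(throttle.can_retry())  # False: 5 tokens is not above the halfway mark
throttle.record_success()
print(throttle.can_retry())  # True: 5.5 tokens
```

With a token ratio of 0.1, roughly ten successes are needed to pay for one failure, which is what keeps the retry rate a small fraction of total traffic.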

The retry-storm anti-pattern

Without budgets, retries form a positive-feedback loop during partial failures:

   Backend at 80% capacity, 20% of requests time out
   Each timeout triggers 3 retries → effective request rate = 1.0 + 0.2*3 = 1.6x
   Now backend is at 0.8 * 1.6 ≈ 130% capacity → 60% timeout
   Each timeout triggers 3 retries → effective rate = 1.0 + 0.6*3 = 2.8x
   Now backend is at 0.8 * 2.8 ≈ 225% capacity → 100% timeout
   Service is now in a self-sustaining outage.
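The arithmetic in this walkthrough can be reproduced with a few lines. This is a rough model, not a simulation: the timeout fractions (20%, then 60%) are taken from the walkthrough as inputs rather than derived, the function name is illustrative, and the walkthrough rounds the resulting utilisation figures.

```python
def effective_load(baseline_utilisation, timeout_fraction, retries_per_timeout=3):
    """Return (request multiplier, resulting utilisation).

    Assumes every timed-out request is retried retries_per_timeout times.
    """
    multiplier = 1.0 + timeout_fraction * retries_per_timeout
    return multiplier, baseline_utilisation * multiplier


for timeout_fraction in (0.2, 0.6):
    multiplier, utilisation = effective_load(0.8, timeout_fraction)
    print(f"{timeout_fraction:.0%} timeouts -> {multiplier:.2f}x requests, "
          f"{utilisation:.0%} of capacity")
# 20% timeouts -> 1.60x requests, 128% of capacity
# 60% timeouts -> 2.80x requests, 224% of capacity
```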

The cure has three parts:

Cap attempts per call. 3-5, not unlimited.
Honour Retry-After. When the server says “wait 30 seconds”, wait 30 seconds.
Coordinate via a retry budget. When too high a fraction of requests are retries, stop retrying.
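Honouring Retry-After means handling both forms the header can take: a number of seconds or an HTTP date. A minimal sketch using only the standard library follows; the function name and the `max_wait` ceiling are assumptions, and a caller that gets `None` back would fall back to its own computed backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, max_wait=300.0):
    """Convert a Retry-After header value to seconds, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        seconds = float(value)                 # delay-seconds form
    else:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None                        # unparseable: use own backoff
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    # A date in the past means "retry now"; max_wait is an assumed ceiling.
    return min(max(seconds, 0.0), max_wait)


print(retry_after_seconds("30"))  # 30.0
```

Whether to cap the server's hint at all is a policy choice. The cap here only guards against a misconfigured server asking for an unreasonably long wait.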

Idempotency is the prerequisite

Retries change the question from “did the request succeed?” to “did the server execute the operation, and if so, exactly once?” For reads (GET), the answer is obvious — reading twice is the same as reading once. For non-idempotent writes (POST to create), retrying a request that already succeeded creates a duplicate.

The two solutions, in order of preference:

Use idempotent verbs where possible. PUT /users/42 is idempotent by definition; DELETE /orders/42 is idempotent. Both can be retried freely.
Use idempotency keys for non-idempotent operations. A client-generated Idempotency-Key header lets the server deduplicate retries. See idempotency-in-api-design.

A retry policy without one of these is a duplicate-write factory waiting for the first network hiccup.
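To make the deduplication concrete, here is a toy in-memory server that remembers the result for each idempotency key. `ChargeService` and its fields are hypothetical; a real implementation would need persistent storage, key expiry, and handling for concurrent requests that share a key.

```python
import uuid


class ChargeService:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # every charge actually executed

    def create_charge(self, idempotency_key, amount_cents):
        # A key seen before means this is a retry: replay the stored result
        # instead of executing the operation again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge = {"id": len(self.charges) + 1, "amount_cents": amount_cents}
        self.charges.append(charge)
        self._results[idempotency_key] = charge
        return charge


service = ChargeService()
key = str(uuid.uuid4())                    # generated once per logical operation
first = service.create_charge(key, 1999)
retry = service.create_charge(key, 1999)   # e.g. the client timed out and retried
print(first == retry, len(service.charges))  # True 1
```

The key is generated once by the client and reused unchanged on every retry. Generating a fresh key per attempt would defeat the deduplication.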

Which responses to retry

A canonical decision table:

Outcome	Retry?	Notes
Connection refused	yes	Service not up; transient if restarting
DNS lookup failure	yes	Transient
TCP reset	yes	Brief network issue
TLS handshake failure	yes	Often transient (load balancer warming up)
Request timeout (client side)	conditionally	Only if idempotent — request may have succeeded
503 Service Unavailable	yes	Honor `Retry-After`
502 Bad Gateway	yes	Load balancer can’t reach a backend; brief
504 Gateway Timeout	yes	Upstream slow; backoff and retry
429 Too Many Requests	yes	Honor `Retry-After`
500 Internal Server Error	conditionally	Idempotent only; usually transient
4xx (other than 429)	no	Client bug; won’t fix on retry
Successful response	no

The “conditionally” rows are where idempotency does its work. A 500 on a POST /charges without an idempotency key is dangerous to retry; a 500 on the same call with an idempotency key is safe.
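The decision table can be encoded as a small predicate. This is a sketch: the error labels such as `"connection_refused"` are hypothetical strings standing in for whatever the HTTP client actually raises, and the set of idempotent methods is an assumption based on standard HTTP semantics.

```python
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}
ALWAYS_RETRY_STATUS = {429, 502, 503, 504}
# Hypothetical labels for the network-level rows of the table.
SAFE_NETWORK_ERRORS = {"connection_refused", "dns_failure",
                       "tcp_reset", "tls_handshake"}


def should_retry(method, *, status=None, error=None, has_idempotency_key=False):
    """Return True if the table says this outcome is worth retrying."""
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    if error is not None:
        if error in SAFE_NETWORK_ERRORS:
            return True
        if error == "timeout":
            return safe_to_repeat    # the request may already have succeeded
        return False
    if status in ALWAYS_RETRY_STATUS:
        return True
    if status == 500:
        return safe_to_repeat        # the "conditionally" row
    return False                     # success and other 4xx: never retry


print(should_retry("POST", status=500))                            # False
print(should_retry("POST", status=500, has_idempotency_key=True))  # True
```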

Three-language example

A retry wrapper with exponential backoff, full jitter, and Retry-After handling:

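# Python (requests)
import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}
MAX_ATTEMPTS = 5
BASE_DELAY = 0.1   # seconds
MAX_DELAY = 30.0   # seconds

def call_with_retry(method, url, idempotency_key=None, **kwargs):
    # Copy so the caller's headers dict is never mutated.
    headers = dict(kwargs.pop("headers", None) or {})
    if idempotency_key:
        headers["Idempotency-Key"] = idempotency_key
    # Read timeouts and 500s may mean the operation already ran, so they are
    # retried only when repeating the request is safe.
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or bool(idempotency_key)

    for attempt in range(MAX_ATTEMPTS):
        last_attempt = attempt == MAX_ATTEMPTS - 1
        try:
            resp = requests.request(
                method, url, headers=headers, timeout=10, **kwargs,
            )
        except (requests.ConnectionError, requests.Timeout) as exc:
            ambiguous = isinstance(exc, requests.ReadTimeout)
            if last_attempt or (ambiguous and not safe_to_repeat):
                raise
            _sleep(attempt, retry_after=None)
            continue

        if resp.status_code not in RETRYABLE_STATUS or last_attempt:
            return resp
        if resp.status_code == 500 and not safe_to_repeat:
            return resp
        _sleep(attempt, resp.headers.get("Retry-After"))

def _sleep(attempt, retry_after):
    wait = None
    if retry_after is not None:
        try:
            # Seconds form only; an HTTP-date value falls back to backoff.
            wait = max(float(retry_after), 0.0)
        except ValueError:
            wait = None
    if wait is None:
        # Full jitter exponential backoff
        ceiling = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
        wait = random.uniform(0, ceiling)
    time.sleep(wait)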

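// Go (net/http)
package main

import (
    "context"
    "errors"
    "math"
    "math/rand"
    "net/http"
    "net/url"
    "strconv"
    "time"
)

var retryable = map[int]bool{429: true, 500: true, 502: true, 503: true, 504: true}

// An empty method means GET for client requests.
var idempotent = map[string]bool{"": true, "GET": true, "HEAD": true, "OPTIONS": true, "PUT": true, "DELETE": true}

const (
    maxAttempts = 5
    baseDelay   = 100 * time.Millisecond
    maxDelay    = 30 * time.Second
)

// One shared client so connections are reused across calls.
var client = &http.Client{Timeout: 10 * time.Second}

func callWithRetry(ctx context.Context, req *http.Request, idempotencyKey string) (*http.Response, error) {
    if idempotencyKey != "" {
        req.Header.Set("Idempotency-Key", idempotencyKey)
    }
    // Timeouts and 500s may mean the operation already ran, so they are
    // retried only when repeating the request is safe.
    safeToRepeat := idempotencyKey != "" || idempotent[req.Method]

    for attempt := 0; ; attempt++ {
        last := attempt == maxAttempts-1
        attemptReq := req.Clone(ctx)
        // Clone shares the body reader, so get a fresh one for each attempt.
        // Requests whose GetBody is nil cannot replay a body.
        if req.GetBody != nil {
            body, err := req.GetBody()
            if err != nil {
                return nil, err
            }
            attemptReq.Body = body
        }

        resp, err := client.Do(attemptReq)
        if err != nil {
            var ue *url.Error
            ambiguous := errors.As(err, &ue) && ue.Timeout()
            if last || (ambiguous && !safeToRepeat) {
                return nil, err
            }
            if err := sleep(ctx, attempt, ""); err != nil {
                return nil, err
            }
            continue
        }
        if !retryable[resp.StatusCode] || last ||
            (resp.StatusCode == http.StatusInternalServerError && !safeToRepeat) {
            return resp, nil // caller closes the body
        }
        retryAfter := resp.Header.Get("Retry-After")
        resp.Body.Close()
        if err := sleep(ctx, attempt, retryAfter); err != nil {
            return nil, err
        }
    }
}

// sleep waits for Retry-After (seconds form) or a full-jitter backoff,
// and returns early if ctx is cancelled.
func sleep(ctx context.Context, attempt int, retryAfter string) error {
    ceiling := time.Duration(math.Min(
        float64(baseDelay)*math.Pow(2, float64(attempt)),
        float64(maxDelay),
    ))
    wait := time.Duration(rand.Int63n(int64(ceiling) + 1))
    if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
        wait = time.Duration(secs) * time.Second
    }
    timer := time.NewTimer(wait)
    defer timer.Stop()
    select {
    case <-timer.C:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}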

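// JavaScript (fetch)
const RETRYABLE = new Set([429, 500, 502, 503, 504]);
const IDEMPOTENT = new Set(["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]);
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 30_000;

async function callWithRetry(url, options = {}, idempotencyKey) {
  // Headers accepts plain objects, arrays, and Headers instances.
  const headers = new Headers(options.headers);
  if (idempotencyKey) headers.set("Idempotency-Key", idempotencyKey);
  // Timeouts and 500s may mean the operation already ran, so they are
  // retried only when repeating the request is safe.
  const method = (options.method || "GET").toUpperCase();
  const safeToRepeat = Boolean(idempotencyKey) || IDEMPOTENT.has(method);

  for (let attempt = 0; ; attempt++) {
    const last = attempt === MAX_ATTEMPTS - 1;
    let resp;
    try {
      resp = await fetch(url, {
        ...options,
        headers,
        signal: AbortSignal.timeout(10_000),
      });
    } catch (err) {
      const ambiguous = err.name === "TimeoutError";
      if (last || (ambiguous && !safeToRepeat)) throw err;
      await sleep(attempt, null);
      continue;
    }

    if (!RETRYABLE.has(resp.status) || last) return resp;
    if (resp.status === 500 && !safeToRepeat) return resp;
    const retryAfter = resp.headers.get("retry-after");
    await resp.body?.cancel(); // discard the unread body before retrying
    await sleep(attempt, retryAfter);
  }
}

function sleep(attempt, retryAfter) {
  // Seconds form only; an HTTP-date value parses to NaN and falls back to backoff.
  const seconds = retryAfter ? Number(retryAfter) : NaN;
  let ms;
  if (Number.isFinite(seconds) && seconds >= 0) {
    ms = seconds * 1000;
  } else {
    // Full jitter
    const ceiling = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
    ms = Math.random() * ceiling;
  }
  return new Promise((r) => setTimeout(r, ms));
}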

Three details across the implementations:

The Retry-After header takes priority over computed backoff. The server knows when it’s ready; trust it.
The retry list is restrictive — only specific 5xx codes and 429. Don’t retry 4xx (other than 429); they won’t change.
The idempotency key is sent on every retry, identical. That’s how the server dedupes.

Where retries should live, and where they shouldn’t

A common mistake: retries at every layer. If the SDK, the proxy, and the gateway each make up to 5 attempts, a single user click becomes up to 5 × 5 × 5 = 125 backend requests under failure.

The right pattern: retry at one layer, ideally the highest layer that has full context. Often that’s the SDK or the application code. Lower layers (gateway, proxy) should not retry unless they’re the only one — and if so, they should be aware of being the only one.

Some teams adopt the convention: client SDK retries; service-to-service calls do not retry (the upstream caller is expected to retry instead). This prevents amplification through the service graph.
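The amplification is multiplicative, which a one-line calculation makes visible. The function name is illustrative, and the model assumes the worst case where every layer exhausts all of its attempts.

```python
from math import prod


def worst_case_backend_requests(attempts_per_layer):
    """Backend requests from one user action if every layer uses all attempts."""
    return prod(attempts_per_layer)


print(worst_case_backend_requests([5, 5, 5]))  # 125: SDK, proxy, and gateway all retry
print(worst_case_backend_requests([5, 1, 1]))  # 5: only the SDK retries
```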

Variants

Variant	Mechanism	When it fits
Fixed delay	Constant wait between retries	Rare; only when load on the dependency is irrelevant
Linear backoff	`delay = base * attempt`	Niche; usually pick exponential instead
Exponential backoff (no jitter)	`delay = base * 2^attempt`	Single-client tools; unsafe for many clients
Exponential + full jitter	`wait = rand(0, base * 2^attempt)`	The default for production
Decorrelated jitter	`wait = rand(base, prev * 3)`	Each wait builds on the previous one; close to full jitter in the AWS comparison
Adaptive (e.g., gRPC retry throttle)	Disable retries when ratio of retries to requests exceeds a threshold	Service meshes, retry budgets

Trade-offs

What good retry policies give you:

Masked transient failures. The 1% of requests that hit a TCP reset never surface to users.
Higher effective success rate. A 99% backend with one retry looks like 99.99% to clients, provided failures are independent (1 - 0.01² = 0.9999).
Operational margin. Brief incidents (a pod restart, a brief network blip) don’t cause user-visible failure.
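The success-rate claim follows from simple probability, sketched below. The function name is illustrative, and the model assumes each attempt fails independently, which holds for random transient faults but not during an outage.

```python
def success_with_retries(per_attempt_success, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - per_attempt_success) ** attempts


for attempts in (1, 2, 3):
    print(attempts, f"{success_with_retries(0.99, attempts):.4%}")
# 1 99.0000%
# 2 99.9900%
# 3 99.9999%
```

When failures are correlated, as in a degraded backend, the real gain is much smaller than this model suggests, which is why retries mask transient faults but not sustained ones.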

What good retry policies cost you:

Higher peak load. Retries on a degraded backend amplify load exactly when you can least afford it.
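Latency variance. A retried request pays for the failed attempt, the backoff, and the second attempt (for example 100ms + backoff + 100ms), so it is noticeably slower than one that succeeds first time.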
Duplicate writes if idempotency isn’t perfect.
Operational subtlety. Retries are easy to write and hard to tune. The default settings are wrong for most workloads.

Common pitfalls

Retrying without backoff. A tight retry loop is a denial-of-service attack against your own dependency.
Retrying without jitter. Synchronised retries from many clients form a thundering herd.
Retrying non-idempotent operations without an idempotency key. Each retry under timeout creates a duplicate.
Retrying 4xx errors. A 400 won’t become 200; you’re spending budget on a bug.
Ignoring Retry-After. The server’s hint is more accurate than your computed backoff. Use it.
Unbounded retries. “Retry until success” is a great way to hang a thread forever during an outage.
Retries at every layer. Five attempts at each of three layers (SDK × gateway × proxy) is up to 125x amplification. Pick one layer.
No retry budget. A retry-storm during a partial outage extends the outage.
Not distinguishing connection errors from response errors. A connection refused is safe to retry; a 500 on a non-idempotent POST is not.

The Role of Idempotency in API Design — the prerequisite; retries are unsafe without idempotency.
The Circuit Breaker Pattern — the complement; retries handle transient failures, circuit breakers handle sustained ones.
Rate Limiting — the server-side counterpart to client-side retries; 429 + Retry-After is the contract.
API Monitoring — retry rate, retry budget consumption, and Retry-After header counts are core metrics to dashboard.
What Causes API Failures — A Taxonomy — retry storms are a named failure mode; this is where to study them.