Caching at Different Layers — API Design

What it is#

Caching is the discipline of storing the result of an expensive operation so that subsequent identical operations can return the stored result instead of recomputing. In API design, caches sit at five different layers — browser, CDN, gateway, application, database — and each layer has its own protocol, its own invalidation story, and its own failure mode.

The headline question is not “should we cache?” but “where in the stack should the cache live, and what invalidates it?” A response cached in the browser for one user has a different lifetime than a response cached at the CDN for all users. A SQL query cached in the application’s memory has a different invalidation contract than a value cached in Redis shared across replicas. Getting the layer wrong is the source of most cache-related bugs.

Phil Karlton’s line, “There are only two hard things in Computer Science: cache invalidation and naming things”, is not really a joke. The invalidation problem dominates the design of any non-trivial cache. The question “when does the cached value stop being correct, and how does every layer find out?” has no clean answer; every cache design picks a trade-off between staleness tolerance, invalidation cost, and operational complexity.

The HTTP caching model (RFC 9111, with validators and conditional requests defined in RFC 9110) is the foundation everything else builds on: Cache-Control, ETag, Last-Modified, Vary, If-None-Match. These headers compose to give a layered cache where each layer (browser → CDN → gateway → origin) can independently decide what to store and how to revalidate. Understanding them precisely is the difference between a cache that works and a cache that silently serves stale data.

When to use it#

Cache when:

Reads dominate writes. The classic cache-aside win: read-heavy APIs cache responses, the cache catches >90% of requests, the origin sees only the misses. Newsfeeds, product catalogues, public profiles.
The data has tolerance for some staleness. “Last updated 30 seconds ago” is fine for a leaderboard, intolerable for a bank balance. Match the TTL to the tolerance.
The computation is expensive. Aggregations, full-text searches, AI inference, third-party API calls: anything where each miss costs measurable money or latency.
The same query repeats. Top 100 products by region. Recently viewed for the logged-in user. Patterns where keying is obvious.

Don’t cache when:

Every request is unique. Personalised dashboards with no overlap don’t benefit from a shared cache. Per-user caching can still help, but the win is smaller.
Stale data is dangerous. Inventory levels during checkout, fraud-score lookups, anything where a moment-old value causes real harm.
The compute is already cheap. A cache adds a network hop and an invalidation surface. If the underlying operation is sub-millisecond, caching may add net latency.
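
The "already cheap" case can be checked with simple arithmetic. The sketch below is a minimal model, assuming every request checks the cache first and a miss pays for both the failed lookup and the origin call. The function name and the latency figures are illustrative assumptions, not measurements.

```python
def expected_latency_ms(hit_rate: float, cache_ms: float, origin_ms: float) -> float:
    """Average latency when every request checks the cache first.

    A hit costs cache_ms; a miss costs cache_ms (the failed lookup) plus origin_ms.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_ms + (1.0 - hit_rate) * origin_ms


# Expensive origin (40 ms), networked cache (1 ms), 90% hits:
# roughly 5 ms on average instead of 40 ms, a clear win.
slow_origin = expected_latency_ms(0.90, cache_ms=1.0, origin_ms=40.0)

# Cheap origin (0.5 ms), same cache and hit rate:
# roughly 1.05 ms on average, slower than calling the origin directly.
cheap_origin = expected_latency_ms(0.90, cache_ms=1.0, origin_ms=0.5)
```

Under this model a cache only pays off when the hit rate times the origin cost exceeds the cost of the cache lookup itself.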

How it works#

The five layers#

A typical request to a modern API passes through up to five cache layers on the way to the database:

   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
   │ Browser  │→ │   CDN    │→ │ Gateway  │→ │   App    │→ │    DB    │
   │  cache   │  │  (edge)  │  │  cache   │  │  cache   │  │  cache   │
   └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘
                                                                │
                                                                ▼
                                                            origin DB

Each layer has a different scope, lifetime, and invalidation mechanism:

Layer	Scope	Typical TTL	Invalidation
Browser	Single user / device	seconds to minutes	`Cache-Control: max-age`, `ETag` revalidation
CDN (edge)	Global, all users	minutes to days	TTL expiry + explicit purge (URL or surrogate key)
API gateway	Per-region or global	seconds to minutes	TTL + explicit invalidation API
App (in-process)	Single replica	seconds	TTL + eviction (LRU)
App (shared, Redis)	Cluster-wide	seconds to hours	TTL + explicit DEL on write
DB query cache	Single DB instance	seconds	Often automatic, sometimes off

Browser caching — HTTP `Cache-Control` is everything#

The browser cache is the layer closest to the user; the right headers can save a round-trip entirely.

HTTP/1.1 200 OK
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-9f2a"
Last-Modified: Thu, 28 May 2026 14:22:18 GMT
Content-Type: application/json

The Cache-Control directives that matter:

public / private: public allows shared caches (CDN, gateway) to store the response. private restricts storage to private caches such as the browser (use it for personalised content).
max-age=N: the response is fresh for N seconds. After that it is stale and must be revalidated before reuse, unless one of the stale-* extensions below applies.
no-cache: store, but always revalidate before reuse. Different from no-store.
no-store: don’t store at all. The strongest setting; for sensitive data.
stale-while-revalidate=N: after expiry, serve the stale value for up to N more seconds while fetching the new one in the background. A latency win for content that tolerates moderate staleness.
stale-if-error=N: if the origin returns a 5xx or is unreachable, serve the stale value for up to N more seconds. A reliability win.
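
How these directives combine into a reuse decision can be sketched in a few lines. This is a simplified model, not a full RFC 9111 implementation: it ignores s-maxage, must-revalidate, quoted directive arguments that contain commas, and stale-if-error. The function names and the returned labels are assumptions made for illustration.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[int]]:
    """Parse a Cache-Control value into {directive: seconds or None}."""
    directives: Dict[str, Optional[int]] = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if not name:
            continue
        value = value.strip().strip('"')
        directives[name.strip().lower()] = int(value) if value.isdigit() else None
    return directives


def classify(header: str, age_seconds: int) -> str:
    """Decide what a cache may do with a stored response of the given age."""
    d = parse_cache_control(header)
    if "no-store" in d:
        return "do-not-store"
    if "no-cache" in d:
        return "revalidate"
    max_age = d.get("max-age") or 0
    if age_seconds < max_age:
        return "fresh"
    swr = d.get("stale-while-revalidate") or 0
    if age_seconds < max_age + swr:
        # Serve the stale copy now, refresh it in the background.
        return "serve-stale-and-refresh"
    return "revalidate"


header = "public, max-age=300, stale-while-revalidate=60"
decisions = [classify(header, age) for age in (100, 330, 400)]
```

With the example header above, a response aged 100 seconds is fresh, one aged 330 seconds falls in the stale-while-revalidate window, and one aged 400 seconds has to be revalidated.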

The ETag header is a content fingerprint. On revalidation, the client sends If-None-Match: "v3-9f2a"; the server returns 304 Not Modified (with no body) if the ETag still matches, saving the response body bytes.

Last-Modified is the older, weaker form — a timestamp. Use ETag when you can.
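
The If-None-Match check is slightly more involved than a string comparison: the header may carry several ETags, a `*` wildcard, or weak validators prefixed with `W/`. The sketch below assumes the weak comparison that RFC 9110 specifies for If-None-Match and, as a simplification, splits the list on commas (which would mis-handle an ETag that itself contains a comma). The function name is hypothetical.

```python
from typing import Optional


def _opaque(tag: str) -> str:
    """Strip whitespace and the weak-validator prefix, keeping the quoted value."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def etag_matches(if_none_match: Optional[str], current_etag: str) -> bool:
    """Return True when the server may answer 304 Not Modified."""
    if not if_none_match:
        return False
    if if_none_match.strip() == "*":
        return True  # "*" matches any current representation
    current = _opaque(current_etag)
    return any(_opaque(candidate) == current for candidate in if_none_match.split(","))


# Example: the client holds a weak copy of the current version plus an older one.
not_modified = etag_matches('W/"v3-9f2a", "v2-11aa"', '"v3-9f2a"')
```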

CDN caching — the global layer#

CDNs (Cloudflare, Fastly, CloudFront, Akamai) intercept requests at edge POPs and cache responses. The cache key is typically the URL plus a subset of headers (Vary controls which headers split the cache).

The Vary header is load-bearing and dangerous:

Vary: Accept-Encoding, Accept-Language

This tells the cache that responses differ by these request headers — the cache must split entries by them. Get it wrong (Vary: User-Agent, for example) and every browser version gets its own cache entry; the hit rate collapses.
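
One way to picture what Vary does is to build the cache key by hand. This is a minimal sketch of the idea, not how any particular CDN derives its keys; the function name and the key layout are assumptions.

```python
from typing import Mapping


def cache_key(method: str, url: str, request_headers: Mapping[str, str], vary: str) -> str:
    """Build a cache key from the method, the URL, and the request headers named in Vary."""
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    parts = [method.upper(), url]
    for name in names:
        # A missing header is keyed as an empty value.
        parts.append(f"{name}={lowered.get(name, '').strip()}")
    return "|".join(parts)


vary = "Accept-Encoding, Accept-Language"
english = cache_key("GET", "/v1/products", {"Accept-Language": "en", "User-Agent": "A"}, vary)
german = cache_key("GET", "/v1/products", {"Accept-Language": "de", "User-Agent": "B"}, vary)
```

Here the English and German requests get separate entries, while the differing User-Agent values have no effect because User-Agent is not listed in Vary. Adding it would multiply the number of entries by the number of distinct User-Agent strings.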

Surrogate keys (Fastly) or cache tags (Cloudflare) let you group cache entries for bulk invalidation. A response for /products/42 might be tagged product-42 and category-electronics; purging the product-42 tag invalidates every URL associated with it (product page, category listing, search result snippet). This is how the CDN-purge-on-write pattern stays manageable at scale.
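
The tag mechanism can be modelled as a reverse index from tag to URLs. The class below is an in-memory sketch of that idea under assumed names; real CDNs expose it through response headers and a purge API rather than a Python object.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class TaggedCache:
    """Toy cache whose entries can be purged in bulk by tag."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}
        self._urls_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def put(self, url: str, body: str, tags: Iterable[str]) -> None:
        self._entries[url] = body
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def get(self, url: str) -> Optional[str]:
        return self._entries.get(url)

    def purge_tag(self, tag: str) -> int:
        """Remove every entry carrying the tag; return how many were removed."""
        removed = 0
        for url in self._urls_by_tag.pop(tag, set()):
            if url in self._entries:
                del self._entries[url]
                removed += 1
        return removed


cache = TaggedCache()
cache.put("/products/42", "product page", ["product-42", "category-electronics"])
cache.put("/categories/electronics", "listing", ["category-electronics", "product-42"])
cache.put("/products/7", "other product", ["product-7", "category-electronics"])
purged = cache.purge_tag("product-42")  # removes the product page and the listing
```

The write path only needs to know the tag of the thing that changed, not every URL that happens to render it.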

API gateway caching#

API gateways (Kong, AWS API Gateway, Apigee) can cache responses by URL+method+headers, much like a CDN but inside your perimeter. Useful when:

Some responses are expensive to compute but cheap to serve.
The cache should be inside the auth boundary (don’t expose internal responses on the public CDN).
You want per-API-key or per-account cache keys (CDNs don’t do this well).

The gateway cache is typically smaller (per-region rather than global) and has a tighter TTL than the CDN. It’s less common than the other layers but valuable for specific workloads.

Application caching#

Inside the application, caching takes two forms:

In-process (LRU). A bounded LRU cache (Python’s functools.lru_cache, the lru package in Go’s golang/groupcache, Node’s lru-cache) lives in the application’s memory. Fast (no network), but per-replica (not shared), evaporates on restart, and consumes RAM.
Out-of-process (Redis, Memcached). A shared cache is visible to every replica, survives application restarts, and can hold gigabytes. Trades network latency (sub-millisecond on a hot LAN) for the win of “one copy per cluster, not per replica”.

The pattern most production APIs use:

   Request → in-process LRU → Redis → database
   (5μs)        (sub-1ms)   (sub-2ms)   (5-50ms)

Hot keys land in the in-process cache; warm keys in Redis; cold misses go to the database.
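
The lookup order above can be sketched as follows. This is an illustration under stated assumptions: a plain dict stands in for Redis, `TinyLRU` and `tiered_get` are hypothetical names, and a stored value of None is treated as a miss.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional, Tuple


class TinyLRU:
    """Bounded in-process cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used key


def tiered_get(
    key: str,
    local: TinyLRU,
    shared: Dict[str, Any],  # stand-in for Redis
    load_from_db: Callable[[str], Any],
) -> Tuple[Any, str]:
    """Return (value, tier that answered): local first, then shared, then the database."""
    value = local.get(key)
    if value is not None:
        return value, "local"
    value = shared.get(key)
    if value is not None:
        local.set(key, value)  # promote a warm key to the hot tier
        return value, "shared"
    value = load_from_db(key)
    shared[key] = value
    local.set(key, value)
    return value, "db"
```

A second replica with an empty local cache would be answered by the shared tier, which is the point of keeping one copy per cluster.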

Cache strategies — aside, write-through, write-behind#

How writes interact with the cache determines the failure modes:

Cache-aside (lazy). Reads check cache, fall back to DB on miss, populate cache. Writes go directly to DB and either invalidate the cache or rely on TTL. Pros: simple, robust to cache failure. Cons: cold-start misses; race conditions on concurrent writes.

Write-through. Each write updates the cache and the DB together as one synchronous operation (a true shared transaction is rarely available). Reads always check cache first. Pros: cache is always warm and current. Cons: write latency includes both writes; cache failure can wedge writes.

A third pattern, write-behind (write-back), queues writes to the DB asynchronously after acknowledging the client. Pro: fast writes. Con: durability risk if the cache fails before the queue drains. Used in specialised systems (some metrics pipelines); rare for general API design.

The cache-aside pattern is the default for almost every API: it’s simple, the failure modes are well-understood, and it’s robust to cache outages (a Redis crash means everything falls through to the DB — slow, but correct).
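
A minimal cache-aside sketch, assuming dict-backed stand-ins for both the cache and the database and an injectable clock so that expiry can be demonstrated without sleeping. The class and method names are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Read: cache, then DB on miss, then populate. Write: DB first, then invalidate."""

    def __init__(
        self,
        db: Dict[str, Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.db = db
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def read(self, key: str) -> Any:
        entry = self._cache.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # hit, still within its TTL
        value = self.db[key]  # miss or expired: fall back to the database
        self._cache[key] = (self.clock() + self.ttl, value)
        return value

    def write(self, key: str, value: Any) -> None:
        self.db[key] = value
        # Best-effort invalidation; the TTL is the safety net if this step is skipped.
        self._cache.pop(key, None)
```

If a writer bypasses `write` and changes the database directly, readers keep seeing the old value until the TTL runs out, which is exactly the staleness window the TTL is meant to bound.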

The cache-invalidation problem#

The hard part. Three approaches, none clean:

TTL-based. The cache entry expires after N seconds; no explicit invalidation. Simple, but a read can return stale data for up to N seconds after a write. Acceptable when staleness is tolerable.
Explicit invalidation on write. When the data changes, delete the cache entry. Better consistency, but every write must know every cache key that touches the data — fragile as the system grows.
Versioned keys. The cache key embeds a version (user:42:v17). When the data changes, the version bumps; old entries become unreachable and eventually expire. Strong consistency as long as readers obtain the current version from an authoritative source; more storage churn.
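
The versioned-key approach can be sketched like this. It assumes the version counter lives somewhere authoritative (here just a dict, in practice typically the database row or a dedicated counter); the class and method names are hypothetical.

```python
from typing import Any, Dict, Optional


class VersionedCache:
    """Cache whose keys embed a per-entity version, e.g. "user:42:v17"."""

    def __init__(self) -> None:
        self._versions: Dict[str, int] = {}  # stand-in for an authoritative version store
        self._cache: Dict[str, Any] = {}

    def key_for(self, entity: str) -> str:
        return f"{entity}:v{self._versions.get(entity, 1)}"

    def get(self, entity: str) -> Optional[Any]:
        return self._cache.get(self.key_for(entity))

    def put(self, entity: str, value: Any) -> None:
        self._cache[self.key_for(entity)] = value

    def bump(self, entity: str) -> None:
        """Call on every write: old entries become unreachable, not deleted."""
        self._versions[entity] = self._versions.get(entity, 1) + 1

    def stored_keys(self) -> int:
        return len(self._cache)


vc = VersionedCache()
vc.put("user:42", {"name": "Ada"})
vc.bump("user:42")          # the data changed
after_bump = vc.get("user:42")  # miss: the reader now looks up "user:42:v2"
```

The orphaned `user:42:v1` entry still occupies memory until it expires or is evicted, which is the storage churn mentioned above.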

A common production pattern is TTL plus best-effort explicit invalidation: write to the DB, then DEL the key from Redis, and rely on the TTL as a safety net if the DEL fails or if a stale entry slipped in through a race.

Setting cache headers in code — three-language example#

A typical pattern: serve a public list endpoint with browser+CDN caching, a private user endpoint with browser-only, and a never-cached write endpoint.

Python (FastAPI):

import hashlib
import json
from fastapi import FastAPI, Request, Response

app = FastAPI()

# fetch_products(), current_user() and create_new_order() are application
# helpers defined elsewhere.

PRODUCTS_CACHE_CONTROL = "public, max-age=60, stale-while-revalidate=30"


@app.get("/v1/products")
def list_products(request: Request):
    products = fetch_products()  # cheap if cached, expensive if not
    # Serialise once so the ETag is computed over the exact bytes we send.
    body = json.dumps({"data": products}, sort_keys=True)
    etag = f'"{hashlib.sha1(body.encode()).hexdigest()[:12]}"'

    # A 304 should carry the same caching headers the 200 would have had.
    headers = {
        "Cache-Control": PRODUCTS_CACHE_CONTROL,
        "ETag": etag,
        "Vary": "Accept-Encoding",
    }

    # Honour If-None-Match for conditional GET (exact match only; a fuller
    # version would also handle "*", ETag lists, and weak validators).
    if request.headers.get("If-None-Match") == etag:
        return Response(status_code=304, headers=headers)

    return Response(content=body, media_type="application/json", headers=headers)


@app.get("/v1/me")
def get_me(response: Response):
    # private: browsers may store this, shared caches (CDN, gateway) may not.
    response.headers["Cache-Control"] = "private, max-age=30"
    return {"user": current_user()}


@app.post("/v1/orders")
def create_order(response: Response):
    # no-store: the response to a write is never cached anywhere.
    response.headers["Cache-Control"] = "no-store"
    return {"order": create_new_order()}

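Go (net/http):

package main

import (
    "crypto/sha1"
    "encoding/hex"
    "encoding/json"
    "net/http"
)

// fetchProducts, currentUser and newOrder are application helpers defined elsewhere.

func listProducts(w http.ResponseWriter, r *http.Request) {
    products := fetchProducts()
    body, err := json.Marshal(map[string]any{"data": products})
    if err != nil {
        http.Error(w, "failed to encode response", http.StatusInternalServerError)
        return
    }

    // ETag: first 6 bytes (12 hex characters) of the SHA-1 of the body.
    sum := sha1.Sum(body)
    etag := `"` + hex.EncodeToString(sum[:6]) + `"`

    // Set caching headers first: a 304 should carry them too.
    w.Header().Set("Cache-Control", "public, max-age=60, stale-while-revalidate=30")
    w.Header().Set("ETag", etag)
    w.Header().Set("Vary", "Accept-Encoding")

    // Conditional GET (exact match only).
    if r.Header.Get("If-None-Match") == etag {
        w.WriteHeader(http.StatusNotModified)
        return
    }

    w.Header().Set("Content-Type", "application/json")
    w.Write(body)
}

func getMe(w http.ResponseWriter, r *http.Request) {
    // private: browsers may store this, shared caches may not.
    w.Header().Set("Cache-Control", "private, max-age=30")
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(currentUser())
}

func createOrder(w http.ResponseWriter, r *http.Request) {
    // no-store: the response to a write is never cached anywhere.
    w.Header().Set("Cache-Control", "no-store")
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(newOrder())
}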

Node.js (Express):

import express from "express";
import crypto from "node:crypto";

const app = express();

// fetchProducts, currentUser and newOrder are application helpers defined elsewhere.

app.get("/v1/products", (req, res) => {
  const products = fetchProducts();
  const body = JSON.stringify({ data: products });
  const etag = `"${crypto
    .createHash("sha1")
    .update(body)
    .digest("hex")
    .slice(0, 12)}"`;

  // Set caching headers first: a 304 should carry them too.
  res.set({
    "Cache-Control": "public, max-age=60, stale-while-revalidate=30",
    "ETag": etag,
    "Vary": "Accept-Encoding",
  });

  // Conditional GET (exact match only).
  if (req.get("If-None-Match") === etag) {
    return res.status(304).end();
  }

  res.set("Content-Type", "application/json");
  res.send(body);
});

app.get("/v1/me", (req, res) => {
  // private: browsers may store this, shared caches may not.
  res.set("Cache-Control", "private, max-age=30");
  res.json({ user: currentUser() });
});

app.post("/v1/orders", (req, res) => {
  // no-store: the response to a write is never cached anywhere.
  res.set("Cache-Control", "no-store");
  res.json({ order: newOrder() });
});

Three points across all three implementations:

Vary on Accept-Encoding so the cache stores gzipped and uncompressed versions separately.
private on user-specific endpoints prevents the CDN from serving one user’s data to another.
no-store on writes so the response is never cached anywhere.

Variants#

Variant	Mechanism	When it fits
HTTP cache (browser + CDN)	`Cache-Control`, `ETag`, `Vary` headers	Public GET responses; the foundational pattern
In-process LRU	Bounded memory cache, single replica	Hot keys, fast lookups; lost on replica restart
Shared cache (Redis, Memcached)	Out-of-process key-value store	Cluster-wide consistency; survives restarts
Read-through cache	Library transparently fills on miss	Simpler app code; tighter coupling to the cache
Cache-aside	App reads cache, falls back to DB, populates on miss	The most common; explicit control
Write-through	Writes hit cache and DB together	Strong consistency between cache and DB
Cache tags / surrogate keys	Group entries for bulk purge	CDN-heavy workloads with complex invalidation

Trade-offs#

What caching gives you:

Lower origin load. A 95% cache hit rate means the origin sees 5% of the request volume. Capacity wins are dramatic.
Lower latency. An edge-cache hit typically completes in tens of milliseconds or less, often an order of magnitude faster than a round-trip to a distant origin.
Cost savings. Bandwidth, compute, database queries — all reduce proportionally to hit rate.
Burst absorption. A traffic spike that would crater the origin is absorbed at the edge.

What caching costs you:

Staleness. The fundamental trade-off; tighter TTLs = fresher data + more origin load.
Invalidation complexity. Knowing what to invalidate when something changes is the hardest part of any large system.
Operational surface. Cache outages, hot-key issues, cache stampedes (everything misses at once), thundering herd on TTL expiry.
Consistency anomalies. A user updates their profile and refreshes — their browser cache shows the old version. Five minutes pass. Now it shows the new version. The user is confused.
Per-user vs shared cache decisions. Easy to get wrong; leaks of one user’s data to another are catastrophic.

Common pitfalls#

Cache-Control: public on personalised data. A CDN serves one user’s profile to every other user hitting the same URL. Use private for anything user-specific.
No Vary header on responses that vary by request header. Compressed and uncompressed versions get mixed up; localised responses serve the wrong language.
Vary: User-Agent. Splits the cache into millions of useless entries. Don’t.
No invalidation on write. The cache stays stale until TTL; users see old data after their own updates.
TTL too long. Stale data persists for hours after an update.
TTL too short. Cache provides little win; origin still under load.
No cache stampede protection. When a hot key expires, every concurrent request misses and hammers the origin simultaneously. Use request coalescing (Go’s singleflight package), a cache.set(key, "pending") sentinel, or staggered (jittered) TTLs.
No fallback for cache outage. When Redis goes down, every request falls through to a database that can’t handle the load, and the cache outage becomes a full outage.
Caching error responses. A transient 500 gets cached for 5 minutes; every user sees the error long after the origin recovered. Cache only 2xx by default, and cache other statuses such as 404 or 301 selectively and with short TTLs.
Forgetting that POST/PUT/DELETE responses are not cacheable by default. Don’t try to cache them; the protocol won’t let you safely.
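
The stampede pitfall above has two common mitigations that fit in a short sketch: coalescing concurrent misses for the same key into one origin call, and adding jitter to TTLs so that keys written together do not expire together. The code below is a thread-based illustration under assumed names (`SingleFlight`, `jittered_ttl`); it is not a port of any specific library.

```python
import random
import threading
from typing import Any, Callable, Dict


class _Call:
    """State shared between the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Any = None


class SingleFlight:
    """Run fn once per key at a time; concurrent callers share the result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.value = fn()  # only the leader hits the origin
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()  # followers wait for the leader's result
        if call.error is not None:
            raise call.error
        return call.value


def jittered_ttl(base_seconds: float, spread: float = 0.1) -> float:
    """Return the base TTL varied by up to +/- spread (10% by default)."""
    return base_seconds * random.uniform(1.0 - spread, 1.0 + spread)
```

A request handler would wrap its cache-miss path in `flight.do(key, load_from_origin)` and store the result with `jittered_ttl(60)` instead of a fixed 60 seconds.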

Related#

Rate Limiting: caches reduce origin load, which reduces pressure on rate limits. A well-cached API can afford looser limits per caller because most calls never touch the origin.
API Monitoring: cache hit rate is one of the first dashboards to build; a drop in hit rate signals either a code change or a stampede.
The Circuit Breaker Pattern: stale-if-error is the HTTP-layer analogue of a circuit-breaker fallback; serve from cache when the origin is down.
HTTP — The Foundational Protocol for APIs: the Cache-Control, ETag, and Vary headers are the substrate every layer depends on.
Speeding Up Web Page Loading: browser caching is one of the highest-leverage page-load optimisations; this writeup is its API-design twin.