API Monitoring — API Design · Engineering Playbook

What it is#

API monitoring is the operational discipline of knowing what your API is doing right now, and what it did 30 seconds ago when the alert fired. It rests on three pillars (logs, metrics, and traces), each answering a different question. On top of them sits a small set of high-signal indicators that fit on a single dashboard: the four golden signals (latency, traffic, errors, saturation) and the RED method (Rate, Errors, Duration).

The pillars are not interchangeable:

Logs answer “what exactly happened with this request?” — structured event records with a unique request_id per call. Searchable; expensive in volume; the post-mortem substrate.
Metrics answer “how is the system trending?” — numeric time series aggregated across all requests. Cheap; the alerting substrate.
Traces answer “where did this slow request spend its time?” — a directed acyclic graph of spans across services. Sampled; the debugging substrate.

The on-call engineer at 3am does not have time to read logs. They need a dashboard that says, in five seconds: request rate is normal, error rate is spiking on POST /v1/charges, p99 latency is fine, saturation is at 60%. From there, they drill into traces for the slow path, then logs for the smoking gun. Build the dashboard first, then the alerts, then the deeper instrumentation.

The Google SRE book formalised this with the four golden signals: latency (how long things take), traffic (how many things are happening), errors (how many are failing), saturation (how full the system is). Brendan Gregg’s USE method (Utilisation, Saturation, Errors) covers the infrastructure side. The Weaveworks RED method (Rate, Errors, Duration) covers the service side. The three frameworks overlap heavily on the same handful of indicators; pick one vocabulary and stick to it.

When to use it#

API monitoring is mandatory for any production service. The question is not “should we have monitoring?” but “what should be on the first dashboard?”

Reach harder for monitoring when:

The service is on the critical path. Payment APIs, auth, anything that holds up user-facing flows. A 5-minute outage here is a 5-minute revenue outage.
There are multiple downstream dependencies. Distributed traces become the only feasible debugging tool; logs alone don’t show the cross-service picture.
The team is on rotation. New on-call engineers need the dashboard to teach them where pain comes from.
You operate in a regulated industry. Audit logs are mandatory; tracing requirements often follow.

You can be lighter with monitoring when:

The service is an internal tool used by a known group. A status page and basic alerts may suffice.
The service runs as a single replica and logs are enough. Tracing infrastructure has overhead; if grep on one log file works, that’s fine.

But “lighter” never means “none”. Even a side project benefits from /metrics and a Prometheus scrape.

How it works#

The three pillars#

Logs — structured JSON, request_id correlation#

Modern logging is structured: every log line is a JSON object with named fields. Plain-text logs that look like [2026-05-30 14:22:18] ERROR: charge failed for user 42 are searchable only by free-text grep; they don’t aggregate.

{
  "ts": "2026-05-30T14:22:18.847Z",
  "level": "error",
  "service": "charges-api",
  "request_id": "req_a3f9c2",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "user_id": "u_42",
  "endpoint": "POST /v1/charges",
  "status": 500,
  "duration_ms": 1842,
  "error": "downstream timeout: card-network",
  "downstream_endpoint": "POST https://card.network/charge"
}

The single most useful field is request_id: a unique identifier (often a UUID) generated at the edge (gateway or ingress) and propagated through every service the request touches. With request_id in every log line, finding the full story of a failed request is one query: request_id="req_a3f9c2".

The trace_id field hooks the log into the distributed trace; clicking from a log entry to its trace (and vice versa) is the workflow that makes both useful.
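A minimal sketch of request_id correlation in structured logs, using only the Python standard library. The names JsonFormatter, request_id_var, build_logger, and handle_request are illustrative assumptions, not part of any library; a real service would typically set the ID in middleware from an incoming header.

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional

# Holds the current request's ID; each request context sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": "charges-api",
            "request_id": request_id_var.get(),
            "message": record.getMessage(),
        }
        # Extra structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("charges-api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    # Reuse the ID minted at the edge if present; otherwise create one here.
    rid = incoming_id or f"req_{uuid.uuid4().hex[:12]}"
    token = request_id_var.set(rid)
    try:
        logger.error("charge failed",
                     extra={"fields": {"status": 500, "duration_ms": 1842}})
    finally:
        request_id_var.reset(token)
    return rid


buffer = io.StringIO()
log = build_logger(buffer)
handle_request(log, incoming_id="req_a3f9c2")
entry = json.loads(buffer.getvalue())
print(entry["request_id"], entry["status"])
```

Because the ID lives in a context variable, any code called while handling the request logs with the same request_id without having to pass it through every function signature.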

Metrics — counters, gauges, histograms#

Metrics are numeric time series aggregated by labels. Three primitive types:

Counters are monotonically increasing values: http_requests_total{method="POST", endpoint="/v1/charges", status="500"}. Apply rate() over a time window to get “errors per second”.
Gauges — point-in-time values: db_pool_connections_open, circuit_breaker_state. Can go up or down.
Histograms — distributions of values, bucketed: http_request_duration_seconds_bucket{le="0.1"}. Used to compute percentiles (p50, p95, p99).
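To make the counter-to-rate step concrete, here is a simplified sketch of how a per-second rate can be derived from raw counter samples, including the reset that happens when a process restarts. The function name counter_rate and the sample values are assumptions for illustration; Prometheus's own rate() additionally extrapolates to the window boundaries, which this sketch does not attempt.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went backwards: the process restarted and the
            # counter began again from zero, so count the new value as-is.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Scrapes every 15 seconds; the process restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 160), (30, 220), (45, 10), (60, 70)]
per_second = counter_rate(scrapes)
print(round(per_second, 2))
```

This is why counters are stored as ever-increasing totals rather than pre-computed rates: the rate can be recomputed over any window, and a restart costs at most one scrape interval of accuracy.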

The Prometheus text format is the de facto standard exposition format; almost every modern service exposes /metrics in it. Grafana visualises it. Hosted observability platforms (Datadog, New Relic, Google Cloud Monitoring) can ingest it.

The cardinality trap: every unique combination of labels creates a separate time series. http_requests_total{user_id="u_42"} with millions of users explodes cardinality and bankrupts the metrics store. Keep labels bounded — endpoint and status good, user_id and request_id bad. Those high-cardinality dimensions belong in logs, not metrics.
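A back-of-the-envelope sketch of why one unbounded label is enough to cause trouble. The label counts below are made-up assumptions, and the product is a worst-case upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


bounded = max_series({"method": 5, "endpoint": 40, "status": 12})
unbounded = max_series(
    {"method": 5, "endpoint": 40, "status": 12, "user_id": 2_000_000}
)

print(f"bounded labels:   {bounded:,} series")
print(f"with user_id:     {unbounded:,} series")
```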

Traces — OpenTelemetry, spans, parent-child#

A trace represents one request’s journey through a distributed system. It’s a tree (technically a DAG) of spans — each span is a single operation in a single service with a start time, duration, and parent.

   trace_id = 0af7651916cd43dd8448eb211c80319c
   └── span: POST /v1/charges                  [gateway]      18ms
       ├── span: validate token                [auth]          4ms
       └── span: charges.create                [charges-api]  14ms
           ├── span: db.insert charges         [postgres]      2ms
           └── span: card_network.charge       [card-network]  8ms
               └── span: external HTTP POST    [http-client]   7ms

OpenTelemetry (OTel) is the cross-language standard. Its instrumentation libraries can auto-instrument most HTTP clients and servers, database drivers, and message queues; manual instrumentation adds spans for business logic. Backends: Jaeger, Zipkin, Tempo, Honeycomb, Datadog APM.

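Tracing is sampled: capturing every trace at high QPS is prohibitive. Two common approaches are head-based sampling (decide at the start of each request, for example keeping 1%) and tail-based sampling (decide after the trace completes, for example keeping every error trace and a sample of the rest). The OTel SDKs ship head-based samplers; tail-based sampling needs the whole trace and is usually done downstream, for example in the OpenTelemetry Collector.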
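The two sampling decisions can be sketched in a few lines. This is an illustration of the idea only, not the OpenTelemetry implementation: head_sample and tail_keep are hypothetical helpers, and spans are represented as plain dicts with an optional "error" key.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work is done.

    Treating the low 64 bits of the ID as a uniform number makes the
    decision deterministic, so every service that sees the same trace_id
    reaches the same answer.
    """
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * (1 << 64)


def tail_keep(spans: list[dict], trace_id_hex: str, ratio: float) -> bool:
    """Tail-based: decide once the trace is complete.

    Every trace containing an error span is kept; the rest are sampled
    at the given ratio.
    """
    if any(span.get("error") for span in spans):
        return True
    return head_sample(trace_id_hex, ratio)


trace_id = "0af7651916cd43dd8448eb211c80319c"
ok_trace = [{"name": "charges.create"}]
failed_trace = [{"name": "charges.create"},
                {"name": "card_network.charge", "error": True}]

print(head_sample(trace_id, 0.01))
print(tail_keep(ok_trace, trace_id, 0.01))
print(tail_keep(failed_trace, trace_id, 0.01))
```

The trade-off is visible in the signatures: the head-based decision needs only the ID and is cheap, while the tail-based decision needs every span of the trace to be buffered somewhere first.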

The four golden signals#

Google SRE’s distillation. For any service, alert on:

Signal	What it measures	Typical metric
Latency	How long requests take, especially slow ones	p50, p95, p99 of `http_request_duration_seconds`
Traffic	How much demand the service is handling	`rate(http_requests_total[1m])`
Errors	How many requests are failing	`rate(http_requests_total{status=~"5.."}[1m])`
Saturation	How full the system is	CPU%, memory%, queue depth, connection pool usage

The on-call dashboard has these four at the top, big. Everything else is supplementary detail you drill into.

Latency: percentiles, not averages. The average is dominated by the fast majority; the p99 is what the slow minority of users experiences. Alert on p99 above the SLO target.
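A small worked example of how the mean hides the tail, using invented latencies in which 1.5% of requests hit a slow path. The numbers are assumptions chosen for illustration, not measurements.

```python
import statistics

# 985 fast requests at 40 ms and 15 slow ones at 2500 ms.
latencies_ms = [40] * 985 + [2500] * 15

mean = statistics.fmean(latencies_ms)
# quantiles(n=100) returns the 99 percentile cut points; index 49 is p50
# and index 98 is p99.
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p99 = cuts[49], cuts[98]

print(f"mean={mean:.1f}ms p50={p50:.0f}ms p99={p99:.0f}ms")
```

The mean of roughly 77 ms looks healthy against a 500 ms SLO, while the p99 shows that the slowest requests take 2.5 seconds.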

Errors: separate 4xx and 5xx. 4xx (client errors) is mostly business as usual; 5xx (server errors) is what hurts. Alert on 5xx rate, not 4xx.

Saturation: the leading indicator. Saturation rises before errors do. CPU pegged at 95%, DB connection pool at 90% utilised, message queue at 80% of capacity — all signal trouble coming. Alert before saturation maxes out.

The RED method — for services specifically#

A simpler model from Weaveworks for instrumenting microservices:

Rate — requests per second
Errors — error rate (5xx percentage)
Duration — latency distribution

RED is the service-side projection of the four golden signals. For service-mesh metrics (Istio, Linkerd), RED comes essentially for free — the mesh observes every request.
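As a sketch of what the three RED numbers mean, the following computes them from a window of raw request records. In practice these come from metrics rather than per-request records; the RequestRecord type and red_summary function are hypothetical names used only for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int
    duration_s: float


def red_summary(records: list[RequestRecord], window_s: float) -> dict:
    total = len(records)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_s": 0.0}
    errors = sum(1 for r in records if r.status >= 500)
    durations = sorted(r.duration_s for r in records)
    # Nearest-rank p99: the value at or below which 99% of requests fall.
    p99_index = max(0, math.ceil(total * 99 / 100) - 1)
    return {
        "rate": total / window_s,          # Rate: requests per second
        "error_ratio": errors / total,     # Errors: share of 5xx responses
        "p99_s": durations[p99_index],     # Duration: tail latency
    }


# A 10-second window: 980 fast successes and 20 slow server errors.
window = ([RequestRecord(200, 0.05)] * 980) + ([RequestRecord(500, 1.2)] * 20)
summary = red_summary(window, window_s=10)
print(summary)
```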

The on-call 5-second test#

The first dashboard must answer four questions in five seconds:

   ┌─────────────────────────────────────────────────────────────┐
   │  charges-api status (last 10 minutes)                       │
   ├─────────────────────────────────────────────────────────────┤
   │                                                              │
   │   Rate    1,243 req/s     ─────────╮       (normal: ~1200) │
   │                                     ╲                       │
   │   Errors  0.4%            ─╮       ╱  ALERT                 │
   │           (5xx)            ╲_____╱                          │
   │                                                              │
   │   p99     312ms           ────────────  (SLO: 500ms)        │
   │           latency                                            │
   │                                                              │
   │   Saturation  CPU 62%   DB pool 71%   queue 30%             │
   │                                                              │
   └─────────────────────────────────────────────────────────────┘

Below the fold: per-endpoint breakdown of the same four metrics, then downstream dependency health, then recent deployments, then traces by error class. The first screen is rate / errors / latency / saturation, every time.

Instrumenting an endpoint — three-language example#

A minimum-viable instrumentation: count requests by status, time the duration, and emit a trace span.

Python (FastAPI):

import time

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # auto-instruments every route
tracer = trace.get_tracer(__name__)

REQUESTS = Counter(
    "http_requests_total", "HTTP requests",
    ["method", "endpoint", "status"],
)
DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request duration",
    ["method", "endpoint"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)

app.mount("/metrics", make_asgi_app())  # Prometheus scrape endpoint

@app.middleware("http")
async def observe(request: Request, call_next):
    start = time.perf_counter()
    status = "500"  # reported if the handler raises before a response exists
    try:
        response = await call_next(request)
        status = str(response.status_code)
        return response
    finally:
        # Label with the matched route template (e.g. /v1/charges/{id}), not
        # the raw URL path, so path parameters and unmatched URLs stay bounded.
        # FastAPI places the matched route in the ASGI scope during routing.
        route = request.scope.get("route")
        endpoint = getattr(route, "path", "unmatched")
        REQUESTS.labels(request.method, endpoint, status).inc()
        DURATION.labels(request.method, endpoint).observe(time.perf_counter() - start)

@app.post("/v1/charges")
async def create_charge(charge: dict):
    with tracer.start_as_current_span("charges.create") as span:
        span.set_attribute("charge.amount", charge["amount"])
        span.set_attribute("charge.currency", charge["currency"])
        # persist_and_authorise is the application's own coroutine (not shown).
        return await persist_and_authorise(charge)

Go (net/http):

package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
)

var (
    requests = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "HTTP requests",
    }, []string{"method", "endpoint", "status"})

    duration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration",
        Buckets: []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
    }, []string{"method", "endpoint"})

    tracer = otel.Tracer("charges-api")
)

func observe(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rr := &recordingResponse{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rr, r)
        elapsed := time.Since(start).Seconds()

        // r.Pattern (Go 1.23+) is the ServeMux pattern that matched, which
        // keeps the label bounded. It is empty when no route matched.
        endpoint := r.Pattern
        if endpoint == "" {
            endpoint = "unmatched"
        }
        // Use the numeric code ("500"), not http.StatusText, so that
        // queries such as status=~"5.." match.
        requests.WithLabelValues(r.Method, endpoint, strconv.Itoa(rr.status)).Inc()
        duration.WithLabelValues(r.Method, endpoint).Observe(elapsed)
    })
}

// recordingResponse captures the status code written by the handler.
type recordingResponse struct {
    http.ResponseWriter
    status int
}

func (r *recordingResponse) WriteHeader(s int) { r.status = s; r.ResponseWriter.WriteHeader(s) }

func createCharge(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "charges.create")
    defer span.End()
    span.SetAttributes(/* ... charge attributes ... */)
    // persistAndAuthorise is the application's own function (not shown).
    persistAndAuthorise(ctx, /* ... */)
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/v1/charges", createCharge)
    mux.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", observe(mux)))
}

Node.js (Express):

import express from "express";
import client from "prom-client";
import { trace } from "@opentelemetry/api";

const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated
const tracer = trace.getTracer("charges-api");

const requests = new client.Counter({
  name: "http_requests_total",
  help: "HTTP requests",
  labelNames: ["method", "endpoint", "status"],
});

const duration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration",
  labelNames: ["method", "endpoint"],
  buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on("finish", () => {
    const elapsed = Number(process.hrtime.bigint() - start) / 1e9;
    // req.route is set only when a route matched. Fall back to a fixed label
    // rather than the raw path so unmatched URLs cannot inflate cardinality.
    const endpoint = req.route?.path ?? "unmatched";
    requests.labels(req.method, endpoint, String(res.statusCode)).inc();
    duration.labels(req.method, endpoint).observe(elapsed);
  });
  next();
});

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

app.post("/v1/charges", async (req, res) => {
  await tracer.startActiveSpan("charges.create", async (span) => {
    try {
      span.setAttribute("charge.amount", req.body.amount);
      span.setAttribute("charge.currency", req.body.currency);
      // persistAndAuthorise is the application's own function (not shown).
      const result = await persistAndAuthorise(req.body);
      res.json(result);
    } catch (err) {
      span.recordException(err);
      res.status(500).json({ error: "charge failed" });
    } finally {
      span.end(); // always end the span, even when the call fails
    }
  });
});

app.listen(8080);

Three details to notice across implementations:

Endpoint label uses the route pattern, not the URL. POST /v1/charges/:id becomes one label, not millions. Avoid the cardinality explosion.
Histogram buckets match SLO thresholds. If the SLO is “p99 under 500ms”, make sure 0.5 is a bucket edge so the percentile interpolation is accurate.
The metrics endpoint exposes everything Prometheus needs. No separate “publish” call.
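The point about bucket edges is easier to see with numbers. The sketch below estimates a percentile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank, which is the general approach Prometheus's histogram_quantile takes. The function here is a simplified stand-in written for this example, and the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound_seconds, cumulative_count), sorted by bound,
    with the last bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Nothing to interpolate against; report the last finite edge.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


inf = float("inf")
# 1000 requests, 990 of which finished within 0.5 s.
with_slo_edge = [(0.1, 900), (0.25, 960), (0.5, 990), (1.0, 998), (inf, 1000)]
# Same traffic, but recorded without a 0.5 s bucket edge.
without_slo_edge = [(0.1, 900), (0.25, 960), (1.0, 998), (inf, 1000)]

p99_exact = histogram_quantile(0.99, with_slo_edge)
p99_coarse = histogram_quantile(0.99, without_slo_edge)
print(round(p99_exact, 3), round(p99_coarse, 3))
```

With the 0.5 edge present the estimate lands on 0.5 s. Without it, the same traffic is estimated at about 0.84 s, because the interpolation has to assume requests are spread evenly between 0.25 s and 1 s, and the service would appear to breach a 500 ms SLO that it actually meets.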

Alerting — symptom-based, not cause-based#

The Google SRE book’s other big idea: alert on symptoms, not causes. “p99 latency is breaching SLO” is a symptom — page someone. “CPU is at 95%” is a cause — make it a dashboard, not a page.

The reason: symptoms are user-facing; causes are infrastructure-facing. A symptom alert tells you the user is having a bad time and you need to fix it now. A cause alert tells you a component is hot but the user might not even notice. Pages should be for symptoms; dashboards should be for causes.

Concrete rules of thumb:

Page on: error rate spike, latency SLO breach, complete service outage.
Ticket on (Slack notification, not pager): saturation high, queue depth growing, retry rate increasing.
Dashboard only: everything else.
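The rules of thumb above amount to a small routing table. A sketch, assuming alert kinds are identified by simple string names (the names below are invented for illustration and would map onto whatever labels the alerting system uses):

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wakes the on-call engineer
    TICKET = "ticket"        # chat notification, handled in working hours
    DASHBOARD = "dashboard"  # visible, but notifies nobody


# Symptoms: users are being affected right now.
SYMPTOMS = {"error_rate_spike", "latency_slo_breach", "service_outage"}
# Early warnings: likely causes of future symptoms.
EARLY_WARNINGS = {"saturation_high", "queue_depth_growing", "retry_rate_increasing"}


def route_alert(kind: str) -> Route:
    if kind in SYMPTOMS:
        return Route.PAGE
    if kind in EARLY_WARNINGS:
        return Route.TICKET
    return Route.DASHBOARD


for kind in ("latency_slo_breach", "saturation_high", "gc_pause_long"):
    print(kind, "->", route_alert(kind).value)
```

Defaulting unknown kinds to the dashboard keeps the pager quiet unless someone has deliberately classified a signal as a symptom.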

Aim for fewer than one page per on-call shift on average. Anything more is alert fatigue; the on-call stops trusting the pager.

Variants#

Variant	Mechanism	When it fits
Logs (structured JSON)	Per-request event records	Post-mortem; specific request investigation
Metrics (Prometheus / OpenMetrics)	Aggregated time series	Alerts; dashboards; SLO tracking
Traces (OpenTelemetry)	Span trees across services	Slow-request debugging; cross-service flows
Real User Monitoring (RUM)	Client-side performance, in-browser	User experience as actually felt
Synthetic monitoring	Periodic scripted requests	Endpoint availability when traffic is low
Log-derived metrics	Counts from log queries	Bridging the pillars; expensive
Profiling (continuous)	CPU / memory profiles in production	Performance regressions; root-cause for saturation

Trade-offs#

What good monitoring gives you:

Fast root-cause. A trace narrows a 30-minute investigation to 30 seconds.
Trust during incidents. “The dashboard says everything is fine” or “the dashboard says exactly where the problem is” — both end the question fast.
SLO accountability. Numeric tracking against latency / error budgets.
Capacity planning. Saturation trends predict the next bottleneck weeks out.

What good monitoring costs you:

Storage and bandwidth. Logs and traces are voluminous; metrics with high cardinality explode.
Per-request overhead. Tracing adds latency (sub-millisecond per span; adds up). Logging is non-trivial for very chatty services.
Operational surface. A monitoring stack (Prometheus, Grafana, Loki, Tempo, Alertmanager) is itself a service that can fail.
Alert maintenance. Bad thresholds wake people for nothing; missing alerts hide real issues. Tuning is forever.

Common pitfalls#

Logging unstructured strings. Free-text logs aggregate to nothing. Move to JSON early.
No request_id propagation. Each log line becomes an island; correlating one request across services is impossible.
Averages instead of percentiles. The average request time misses the tail; p99 is what users feel.
High-cardinality metric labels. user_id as a label kills the metrics store.
Alerting on causes, not symptoms. “CPU is 95%” pages someone when no user is affected. Page on symptoms.
No saturation metrics. You see the explosion (errors rise) instead of the build-up (saturation rises 10 minutes earlier).
Sampling 100% of traces. Trace storage and ingestion get expensive fast. Sample instead: a low head-based rate (for example 1%) for routine traffic, or tail-based sampling that keeps every error trace.
The “we’ll add monitoring after we ship” anti-pattern. Monitoring is part of shipping. Build it in.
Pages that don’t link to runbooks. When the pager goes off at 3am, the on-call needs the runbook in one click, not a hunt.

Rate Limiting — the 429 rate, per-account, is one of the most useful dashboards for tuning limits.
The Circuit Breaker Pattern — breaker state transitions are high-signal events; monitor every one.
Managing Retries — retry rate as a fraction of total requests is a leading indicator of dependency degradation.
What Causes API Failures — A Taxonomy — monitoring is the discipline that catches each failure mode before it cascades.
Caching at Different Layers — cache hit rate is one of the first dashboards to build; a hit-rate drop signals deeper trouble.