Distributed Monitoring
Metrics, logs, traces — the three pillars and the data structures that scale each.
Use cases#
Monitoring is the answer to “is my system healthy, and if not, why?” Three orthogonal data shapes serve three questions:
- Metrics answer “what’s the rate / latency / error count over time?” — pre-aggregated, cheap, the basis of dashboards and alerts.
- Logs answer “what exactly happened in this request?” — high-cardinality, expensive, the basis of forensic debugging.
- Traces answer “where in the call graph did the latency come from?” — per-request, sampled, the basis of microservice debugging.
A mature observability stack runs all three. A startup typically starts with logs, adds metrics when bills hurt, adds tracing when microservices proliferate.
Functional requirements#
- Ingest metrics, logs, and traces from every service with low-overhead client libraries.
- Aggregate across instances, regions, and timeframes.
- Persist for a configurable retention window (7-90 days hot, longer cold).
- Query via dashboards (Grafana, Datadog UI) and alert rules (Alertmanager, PagerDuty).
- Correlate across the three pillars — click a slow trace, jump to the log line that emitted it.
Non-functional requirements#
- Ingest throughput: a thousand-host production system emits 10M+ metric data points/min, 1M+ log lines/sec, 10k+ traces/sec. The monitoring system must absorb this without bottlenecking the application.
- Query latency: dashboard panels p95 under 2 s; ad-hoc log search p95 under 10 s.
- Cardinality: metric label cardinality is the single biggest cost driver — see warning below.
- Availability: 99.95% minimum. Monitoring must be more reliable than what it monitors; otherwise an outage of monitoring tools masks the real outage.
High-level design#
app collector storage query ───── ───────── ─────── ───── metrics ──> Prometheus scrape ───> Prometheus TSDB ──┐ logs ──> Fluent Bit / Vector ──> Loki / OpenSearch ┼──> Grafana / API traces ──> OTel SDK ──────────> Tempo / Jaeger ─────┘ │ long-term: S3 / GCSThree pipelines, three storage shapes, one query layer. OpenTelemetry (OTel) is the open standard that unifies the client side; the backends still differ because the storage shapes differ.
Detailed design#
Metrics: time-series databases#
A metric is (name, labels, timestamp, value) — e.g. http_requests_total{method="GET",route="/api/v1/users",status="200"} = 4271 @ 2024-05-16T10:00:00Z. Stored as time-series; each unique (name, labels) combination is one series.
Prometheus, VictoriaMetrics, M3, and InfluxDB use specialized time-series databases (TSDBs) with:
- Chunked storage: 2-hour blocks of
(timestamp, value)per series, compressed with delta-delta + XOR encoding (Gorilla compression — ~1.4 bytes per data point). - Inverted index on labels to answer
sum(rate(http_requests_total{method="GET"}[5m])) by (route)efficiently. - Retention tiers: keep raw data for 15 days, rolled-up 5-min aggregates for 60 days, hourly aggregates for a year.
Logs: indexed text or just text#
Two camps:
Full-text indexed (Elasticsearch, OpenSearch, Splunk) ✓ Powerful search — query any field, regex match ✓ Slow ingestion ceiling (~100k events/sec/node typical) ✗ Expensive — indexes can be 5-10× the raw log size
Label-indexed only (Loki, ClickHouse) ✓ Cheap — store raw logs in object storage, index only labels ✓ Ingestion scales horizontally ✗ Full-text search is a linear scan over time-windowed shardsLoki’s bet is that you almost always know which service and roughly when — those are the indexed labels — and you’ll happily scan a few minutes of logs from one service to find the line. Saves 80-90% on cost vs Elasticsearch.
Traces: spans and sampling#
A trace is a DAG of spans; each span is (trace_id, span_id, parent_id, service, op, start_ts, duration, attributes). The client SDK auto-instruments HTTP/gRPC calls and propagates trace_id via headers (traceparent).
Tracing volume is fundamentally unmanageable at full rate — every request generates dozens of spans. Sampling strategies:
- Head-based: decide at the trace start (e.g. 1% random). Simple, predictable cost.
- Tail-based: collect all spans for a trace, decide after completion (e.g. keep all traces > 1 s, keep all error traces, keep 1% otherwise). Best signal-per-byte, requires buffering at the collector.
Honeycomb pioneered tail-sampling for traces; Datadog and Jaeger now both support it.
The three pillars converge#
Modern systems (Honeycomb, Lightstep, recent Datadog) treat all three as wide events — high-cardinality structured records — and aggregate on the fly to derive metrics, with full retention for logs and traces. The thesis: pre-aggregated metrics lose the dimension you need most when debugging.
Alerts#
rules: - alert: HighErrorRate expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 for: 5m labels: severity: pageAlert design principles:
- Alert on symptoms (user-visible errors, latency SLO breaches), not causes (CPU at 90%). CPU being high isn’t a problem; it’s the system using what you paid for.
- Use the multi-window / multi-burn-rate pattern from Google SRE: alert when a 5-min burn AND a 1-hour burn both exceed the SLO budget — catches both fast and slow regressions, fewer false positives.
- Avoid alert fatigue: every paging alert should require human action. Everything else is a ticket.
USE and RED#
Two complementary frameworks:
- USE (utilization, saturation, errors) — for resources (CPU, disk, network). Brendan Gregg.
- RED (rate, errors, duration) — for request-driven services. Tom Wilkie.
Together they cover most production dashboards.
Trade-offs#
/metrics; central scraper pulls every 15 s. Easy service discovery, scraper sees what it scrapes. Hard to monitor short-lived jobs and clients behind NAT. Other tensions:
- High-cardinality logs vs aggregated metrics — logs preserve every dimension but cost 10-100× more per data point. Metrics aggregate aggressively but lose dimensions you didn’t pre-declare.
- Self-hosted vs managed — Prometheus + Grafana + Loki on your own infra is open and cheap until you hit cardinality issues. Datadog and Honeycomb are expensive but operationally turnkey.
- Per-service silos vs unified backend — easier to start per-service but ad-hoc cross-service queries become painful. The OpenTelemetry-on-one-backend pattern wins for orgs with > 20 services.
Real-world examples#
- Prometheus + Grafana — the open-source default. Kubernetes operators run Prometheus per cluster, federate to a central one.
- Datadog — agent-based, monolithic UI for metrics + logs + APM. Powers most fintech and YC-stage observability.
- Honeycomb — wide-events philosophy; pioneered tail-sampling and “BubbleUp” (find the dimension that distinguishes slow requests).
- Netflix Atlas — in-house dimensional TSDB, handles 2B+ metric values/min.
- Uber M3 — Cassandra-backed TSDB, scaled to 9B+ active time series.
- Grafana Loki + Tempo + Mimir — the cost-optimized open stack; Loki indexes labels, Tempo stores traces in object storage, Mimir is horizontally-scaled Prometheus.
Related building blocks#
- Distributed Logging — the log pipeline that feeds the log pillar.
- Server-Side Error Monitoring — error capture sits alongside metrics and logs.
- Distributed Cache — TSDBs cache hot label-indexes aggressively.
- Distributed Search — Elasticsearch is both a search engine and a log store, with the same inverted index.