Distributed Monitoring

Metrics, logs, traces — the three pillars and the data structures that scale each.

Building Block Intermediate
6 min read
observability metrics tracing
Companies this resembles: Prometheus · Datadog · Honeycomb · Grafana · New Relic

Use cases#

Monitoring is the answer to “is my system healthy, and if not, why?” Three orthogonal data shapes serve three questions:

  • Metrics answer “what’s the rate / latency / error count over time?” — pre-aggregated, cheap, the basis of dashboards and alerts.
  • Logs answer “what exactly happened in this request?” — high-cardinality, expensive, the basis of forensic debugging.
  • Traces answer “where in the call graph did the latency come from?” — per-request, sampled, the basis of microservice debugging.

A mature observability stack runs all three. A startup typically starts with logs, adds metrics when bills hurt, adds tracing when microservices proliferate.

Functional requirements#

  • Ingest metrics, logs, and traces from every service with low-overhead client libraries.
  • Aggregate across instances, regions, and timeframes.
  • Persist for a configurable retention window (7-90 days hot, longer cold).
  • Query via dashboards (Grafana, Datadog UI) and alert rules (Alertmanager, PagerDuty).
  • Correlate across the three pillars — click a slow trace, jump to the log line that emitted it.

Non-functional requirements#

  • Ingest throughput: a thousand-host production system emits 10M+ metric data points/min, 1M+ log lines/sec, 10k+ traces/sec. The monitoring system must absorb this without bottlenecking the application.
  • Query latency: dashboard panels p95 under 2 s; ad-hoc log search p95 under 10 s.
  • Cardinality: metric label cardinality is the single biggest cost driver — see warning below.
  • Availability: 99.95% minimum. Monitoring must be more reliable than what it monitors; otherwise an outage of monitoring tools masks the real outage.

High-level design#

app collector storage query
───── ───────── ─────── ─────
metrics ──> Prometheus scrape ───> Prometheus TSDB ──┐
logs ──> Fluent Bit / Vector ──> Loki / OpenSearch ┼──> Grafana / API
traces ──> OTel SDK ──────────> Tempo / Jaeger ─────┘
long-term: S3 / GCS

Three pipelines, three storage shapes, one query layer. OpenTelemetry (OTel) is the open standard that unifies the client side; the backends still differ because the storage shapes differ.

Detailed design#

Metrics: time-series databases#

A metric is (name, labels, timestamp, value) — e.g. http_requests_total{method="GET",route="/api/v1/users",status="200"} = 4271 @ 2024-05-16T10:00:00Z. Stored as time-series; each unique (name, labels) combination is one series.

Prometheus, VictoriaMetrics, M3, and InfluxDB use specialized time-series databases (TSDBs) with:

  • Chunked storage: 2-hour blocks of (timestamp, value) per series, compressed with delta-delta + XOR encoding (Gorilla compression — ~1.4 bytes per data point).
  • Inverted index on labels to answer sum(rate(http_requests_total{method="GET"}[5m])) by (route) efficiently.
  • Retention tiers: keep raw data for 15 days, rolled-up 5-min aggregates for 60 days, hourly aggregates for a year.

Logs: indexed text or just text#

Two camps:

Full-text indexed (Elasticsearch, OpenSearch, Splunk)
✓ Powerful search — query any field, regex match
✓ Slow ingestion ceiling (~100k events/sec/node typical)
✗ Expensive — indexes can be 5-10× the raw log size
Label-indexed only (Loki, ClickHouse)
✓ Cheap — store raw logs in object storage, index only labels
✓ Ingestion scales horizontally
✗ Full-text search is a linear scan over time-windowed shards

Loki’s bet is that you almost always know which service and roughly when — those are the indexed labels — and you’ll happily scan a few minutes of logs from one service to find the line. Saves 80-90% on cost vs Elasticsearch.

Traces: spans and sampling#

A trace is a DAG of spans; each span is (trace_id, span_id, parent_id, service, op, start_ts, duration, attributes). The client SDK auto-instruments HTTP/gRPC calls and propagates trace_id via headers (traceparent).

Tracing volume is fundamentally unmanageable at full rate — every request generates dozens of spans. Sampling strategies:

  • Head-based: decide at the trace start (e.g. 1% random). Simple, predictable cost.
  • Tail-based: collect all spans for a trace, decide after completion (e.g. keep all traces > 1 s, keep all error traces, keep 1% otherwise). Best signal-per-byte, requires buffering at the collector.

Honeycomb pioneered tail-sampling for traces; Datadog and Jaeger now both support it.

The three pillars converge#

Modern systems (Honeycomb, Lightstep, recent Datadog) treat all three as wide events — high-cardinality structured records — and aggregate on the fly to derive metrics, with full retention for logs and traces. The thesis: pre-aggregated metrics lose the dimension you need most when debugging.

Alerts#

rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
for: 5m
labels:
severity: page

Alert design principles:

  • Alert on symptoms (user-visible errors, latency SLO breaches), not causes (CPU at 90%). CPU being high isn’t a problem; it’s the system using what you paid for.
  • Use the multi-window / multi-burn-rate pattern from Google SRE: alert when a 5-min burn AND a 1-hour burn both exceed the SLO budget — catches both fast and slow regressions, fewer false positives.
  • Avoid alert fatigue: every paging alert should require human action. Everything else is a ticket.

USE and RED#

Two complementary frameworks:

  • USE (utilization, saturation, errors) — for resources (CPU, disk, network). Brendan Gregg.
  • RED (rate, errors, duration) — for request-driven services. Tom Wilkie.

Together they cover most production dashboards.

Trade-offs#

Pull-based metrics (Prometheus) — service exposes /metrics; central scraper pulls every 15 s. Easy service discovery, scraper sees what it scrapes. Hard to monitor short-lived jobs and clients behind NAT.
Push-based metrics (StatsD, OTLP) — app pushes to an aggregator. Works for batch jobs and ephemeral pods. Costs an aggregator tier; back-pressure during outages can lose data.

Other tensions:

  • High-cardinality logs vs aggregated metrics — logs preserve every dimension but cost 10-100× more per data point. Metrics aggregate aggressively but lose dimensions you didn’t pre-declare.
  • Self-hosted vs managed — Prometheus + Grafana + Loki on your own infra is open and cheap until you hit cardinality issues. Datadog and Honeycomb are expensive but operationally turnkey.
  • Per-service silos vs unified backend — easier to start per-service but ad-hoc cross-service queries become painful. The OpenTelemetry-on-one-backend pattern wins for orgs with > 20 services.

Real-world examples#

  • Prometheus + Grafana — the open-source default. Kubernetes operators run Prometheus per cluster, federate to a central one.
  • Datadog — agent-based, monolithic UI for metrics + logs + APM. Powers most fintech and YC-stage observability.
  • Honeycomb — wide-events philosophy; pioneered tail-sampling and “BubbleUp” (find the dimension that distinguishes slow requests).
  • Netflix Atlas — in-house dimensional TSDB, handles 2B+ metric values/min.
  • Uber M3 — Cassandra-backed TSDB, scaled to 9B+ active time series.
  • Grafana Loki + Tempo + Mimir — the cost-optimized open stack; Loki indexes labels, Tempo stores traces in object storage, Mimir is horizontally-scaled Prometheus.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.