Distributed Monitoring — System Design

Use cases#

Monitoring is the answer to “is my system healthy, and if not, why?” Three orthogonal data shapes serve three questions:

Metrics answer “what’s the rate / latency / error count over time?” — pre-aggregated, cheap, the basis of dashboards and alerts.
Logs answer “what exactly happened in this request?” — high-cardinality, expensive, the basis of forensic debugging.
Traces answer “where in the call graph did the latency come from?” — per-request, sampled, the basis of microservice debugging.

A mature observability stack runs all three. A startup typically starts with logs, adds metrics when bills hurt, adds tracing when microservices proliferate.

Functional requirements#

Ingest metrics, logs, and traces from every service with low-overhead client libraries.
Aggregate across instances, regions, and timeframes.
Persist for a configurable retention window (7-90 days hot, longer cold).
Query via dashboards (Grafana, Datadog UI) and alert rules (Alertmanager, PagerDuty).
Correlate across the three pillars — click a slow trace, jump to the log line that emitted it.

Non-functional requirements#

Ingest throughput: a thousand-host production system emits 10M+ metric data points/min, 1M+ log lines/sec, 10k+ traces/sec. The monitoring system must absorb this without bottlenecking the application.
Query latency: dashboard panels p95 under 2 s; ad-hoc log search p95 under 10 s.
Cardinality: metric label cardinality is the single biggest cost driver — see warning below.
Availability: 99.95% minimum. Monitoring must be more reliable than what it monitors; otherwise an outage of monitoring tools masks the real outage.

High-level design#

   app                 collector                  storage              query
  ─────                ─────────                  ───────              ─────
  metrics ──> Prometheus scrape ───> Prometheus TSDB ──┐
  logs    ──> Fluent Bit / Vector ──> Loki / OpenSearch ┼──> Grafana / API
  traces  ──> OTel SDK ──────────> Tempo / Jaeger ─────┘
                                        │
                              long-term: S3 / GCS

Three pipelines, three storage shapes, one query layer. OpenTelemetry (OTel) is the open standard that unifies the client side; the backends still differ because the storage shapes differ.

Detailed design#

Metrics: time-series databases#

A metric is (name, labels, timestamp, value) — e.g. http_requests_total{method="GET",route="/api/v1/users",status="200"} = 4271 @ 2024-05-16T10:00:00Z. Stored as time-series; each unique (name, labels) combination is one series.

Prometheus, VictoriaMetrics, M3, and InfluxDB use specialized time-series databases (TSDBs) with:

Chunked storage: 2-hour blocks of (timestamp, value) per series, compressed with delta-delta + XOR encoding (Gorilla compression — ~1.4 bytes per data point).
Inverted index on labels to answer sum(rate(http_requests_total{method="GET"}[5m])) by (route) efficiently.
Retention tiers: keep raw data for 15 days, rolled-up 5-min aggregates for 60 days, hourly aggregates for a year.

Logs: indexed text or just text#

Two camps:

Full-text indexed (Elasticsearch, OpenSearch, Splunk)
  ✓  Powerful search — query any field, regex match
  ✓  Slow ingestion ceiling (~100k events/sec/node typical)
  ✗  Expensive — indexes can be 5-10× the raw log size

Label-indexed only (Loki, ClickHouse)
  ✓  Cheap — store raw logs in object storage, index only labels
  ✓  Ingestion scales horizontally
  ✗  Full-text search is a linear scan over time-windowed shards

Loki’s bet is that you almost always know which service and roughly when — those are the indexed labels — and you’ll happily scan a few minutes of logs from one service to find the line. Saves 80-90% on cost vs Elasticsearch.

Traces: spans and sampling#

A trace is a DAG of spans; each span is (trace_id, span_id, parent_id, service, op, start_ts, duration, attributes). The client SDK auto-instruments HTTP/gRPC calls and propagates trace_id via headers (traceparent).

Tracing volume is fundamentally unmanageable at full rate — every request generates dozens of spans. Sampling strategies:

Head-based: decide at the trace start (e.g. 1% random). Simple, predictable cost.
Tail-based: collect all spans for a trace, decide after completion (e.g. keep all traces > 1 s, keep all error traces, keep 1% otherwise). Best signal-per-byte, requires buffering at the collector.

Honeycomb pioneered tail-sampling for traces; Datadog and Jaeger now both support it.

The three pillars converge#

Modern systems (Honeycomb, Lightstep, recent Datadog) treat all three as wide events — high-cardinality structured records — and aggregate on the fly to derive metrics, with full retention for logs and traces. The thesis: pre-aggregated metrics lose the dimension you need most when debugging.

Alerts#

rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
    for: 5m
    labels:
      severity: page

Alert design principles:

Alert on symptoms (user-visible errors, latency SLO breaches), not causes (CPU at 90%). CPU being high isn’t a problem; it’s the system using what you paid for.
Use the multi-window / multi-burn-rate pattern from Google SRE: alert when a 5-min burn AND a 1-hour burn both exceed the SLO budget — catches both fast and slow regressions, fewer false positives.
Avoid alert fatigue: every paging alert should require human action. Everything else is a ticket.

USE and RED#

Two complementary frameworks:

USE (utilization, saturation, errors) — for resources (CPU, disk, network). Brendan Gregg.
RED (rate, errors, duration) — for request-driven services. Tom Wilkie.

Together they cover most production dashboards.

Trade-offs#

Pull-based metrics (Prometheus) — service exposes /metrics; central scraper pulls every 15 s. Easy service discovery, scraper sees what it scrapes. Hard to monitor short-lived jobs and clients behind NAT.

Push-based metrics (StatsD, OTLP) — app pushes to an aggregator. Works for batch jobs and ephemeral pods. Costs an aggregator tier; back-pressure during outages can lose data.

Other tensions:

High-cardinality logs vs aggregated metrics — logs preserve every dimension but cost 10-100× more per data point. Metrics aggregate aggressively but lose dimensions you didn’t pre-declare.
Self-hosted vs managed — Prometheus + Grafana + Loki on your own infra is open and cheap until you hit cardinality issues. Datadog and Honeycomb are expensive but operationally turnkey.
Per-service silos vs unified backend — easier to start per-service but ad-hoc cross-service queries become painful. The OpenTelemetry-on-one-backend pattern wins for orgs with > 20 services.

Real-world examples#

Prometheus + Grafana — the open-source default. Kubernetes operators run Prometheus per cluster, federate to a central one.
Datadog — agent-based, monolithic UI for metrics + logs + APM. Powers most fintech and YC-stage observability.
Honeycomb — wide-events philosophy; pioneered tail-sampling and “BubbleUp” (find the dimension that distinguishes slow requests).
Netflix Atlas — in-house dimensional TSDB, handles 2B+ metric values/min.
Uber M3 — Cassandra-backed TSDB, scaled to 9B+ active time series.
Grafana Loki + Tempo + Mimir — the cost-optimized open stack; Loki indexes labels, Tempo stores traces in object storage, Mimir is horizontally-scaled Prometheus.

Distributed Logging — the log pipeline that feeds the log pillar.
Server-Side Error Monitoring — error capture sits alongside metrics and logs.
Distributed Cache — TSDBs cache hot label-indexes aggressively.
Distributed Search — Elasticsearch is both a search engine and a log store, with the same inverted index.