Distributed Logging — System Design

Use cases#

A distributed logging system collects, transports, indexes, stores, and queries log events from every service in your fleet. Distinct from Distributed Monitoring (metrics) in that logs preserve full per-event detail rather than pre-aggregating. The dominant use cases:

Production debugging — “what did request 87a3 do?” The log line is the ground truth.
Audit and compliance — security events, access logs, payment events; SOC 2, HIPAA, PCI require log retention with integrity.
Postmortem reconstruction — sequence of events leading to an outage. Stitching together logs across services is often the only way to figure out cascade failures.
Real-time alerting on patterns — alert if any service logs OutOfMemoryError 5 times in 1 min.
Analytics on application events — feeding logs into BigQuery / Snowflake for product analytics.

Functional requirements#

Each service emits structured (JSON, logfmt) or unstructured logs.
A collector tails files / stdout / journald per host and ships to a central pipeline.
The pipeline enriches (add host, region, service, version), filters (drop noisy lines), and routes to one or more sinks.
Storage layer holds logs for a retention window with searchable indexing.
Query layer lets engineers grep / filter / aggregate by service, time, severity, custom fields.

Non-functional requirements#

Ingest throughput: a thousand-host system emits 100k-1M log lines/sec. The pipeline must absorb without dropping.
Storage: at ~500 bytes/line and 1M lines/sec, that’s 43 TB/day raw. Compression and tiering bring it down 5-20×.
Query latency: ad-hoc query p95 under 10 s on a day’s worth of logs.
Reliability: must survive collector restart, broker outage, pipeline backpressure without losing logs. The on-disk buffer is non-negotiable.
End-to-end delay: production trail under 30 s from log line emit to searchable. Debug-friendly.

High-level design#

   host                           pipeline                            storage / query
  ─────────                  ──────────────                      ─────────────────────
  app stdout/file ─┐
  syslog          ─┼──> Fluent Bit / Vector / Filebeat (per host)
  containers      ─┘    │ tail, parse, enrich, buffer to disk
                        ▼
                   Kafka / Kinesis / Pulsar (durable transport)
                        │
                        ▼
                   Processors (parse, route, redact, sample)
                        │
            ┌───────────┼────────────┬──────────────┐
            ▼           ▼            ▼              ▼
        Loki         OpenSearch   ClickHouse    S3 (cold storage)
        (hot tier)   (hot tier)   (analytics)   (archive)
                        │
                        ▼
                   Grafana / Kibana / SQL

Three logical stages: collect (per-host agent), transport (durable queue), store + query (indexed backend + archive).

Detailed design#

Structured logging#

A log line carrying a free-form string is hard to query; a log line carrying structured fields is queryable forever:

{
  "ts":"2024-05-16T10:00:00Z",
  "level":"error",
  "service":"checkout",
  "trace_id":"abc123",
  "user_id":42,
  "duration_ms":342,
  "route":"/api/v1/orders",
  "msg":"failed to charge card",
  "err":"card_declined"
}

Every modern logging library emits structured logs (zap, slog, winston, pino, structlog). The pipeline preserves the structure end-to-end so queries become service:checkout AND level:error AND err:card_declined.

Collection: tail and ship#

Per-host agents (Fluent Bit, Vector, Filebeat, Datadog Agent, Promtail) tail log files and / or stdout. Modern containers route stdout to a per-container file on the host; the agent tails them all. Kubernetes’ canonical pattern: a DaemonSet runs one agent per node.

Critical features:

On-disk buffer with checkpoint, so a pipeline outage doesn’t lose logs.
Backpressure — slow down or drop low-priority logs when downstream is congested.
At-least-once delivery to the transport — the application has already committed the event by writing it to disk.

Transport: durable queue#

Kafka, Kinesis, or Pulsar between the collector and the storage tier serve three purposes:

Buffering — absorbs ingestion spikes without overloading storage.
Multi-consumer fan-out — the same log stream feeds the hot search index, the cold S3 archive, and any number of downstream analytics jobs.
Replay — re-process the last hour of logs after fixing a parser bug.

For low-volume systems (small startup), the agent can ship directly to the backend; Kafka is justified once you’re in the 50k+ events/sec range.

Storage and indexing#

Two camps, paralleling the metrics-side debate:

Fully-indexed (Elasticsearch / OpenSearch / Splunk) — every field is searchable, aggregations are fast, query language is rich. Storage cost is 3-5× the raw log size; ingest tops out at ~100k events/sec/node.

Label-indexed (Loki, ClickHouse) — only a small set of labels (service, host, level) are indexed. Raw logs sit in object storage and are scanned by time-and-label. ~80% cost reduction; full-text search is a linear scan over a small window.

The Loki bet: most queries narrow by service and time first, so label-indexing wins. The OpenSearch / Splunk bet: free-form ad-hoc queries dominate, so full indexing pays for itself.

ClickHouse occupies a third position — columnar storage with native SQL — and is gaining ground (Cloudflare’s logging stack runs on it).

Retention tiering#

Hot       (0-7 days)    — indexed, fast queries. Loki, OpenSearch. SSDs.
Warm      (7-30 days)   — searchable but slower; bigger blocks, less compute.
Cold      (30+ days)    — archived raw logs in S3. Query via Athena / restore.
Frozen    (compliance)  — write-once, sealed, decades.

The 80/20: 95% of queries hit the last 24 hours. Spending hot-storage money on year-old logs is the most common cost mistake.

Sampling and filtering#

Not every log line is worth keeping forever. Strategies:

Drop noisy lines at the collector — health-check logs, debug-level lines in prod.
Sample by trace — keep all spans for a sampled trace; drop unsampled ones. Same heuristics as Distributed Monitoring traces.
Rate-limit per service — a runaway service that logs 100k lines/sec must not blow up the whole pipeline.
PII redaction — regex-strip credit cards, SSNs, emails. Defense-in-depth (SDK + pipeline + storage).

Correlation with traces and metrics#

Logs become exponentially more useful when correlated:

Trace ID injection — every log line in a request carries the trace_id. Click a slow trace → jump to its logs.
Service-name standardization — same label scheme across logs, metrics, traces.
Common dashboards — Grafana panels show metric, logs, and traces side-by-side.

This is the OpenTelemetry pitch and the modern observability default.

Trade-offs#

Other axes:

Push vs pull collection — push (agent → broker) is the dominant pattern; pull (Prometheus-style scrape) doesn’t really apply to logs since they’re discrete events, not gauges.
JSON vs binary log format — JSON is universal but verbose; binary (Protobuf, MessagePack) is 3-5× smaller. Most pipelines use JSON for portability and rely on compression.
Self-hosted vs SaaS — Loki + Grafana + S3 is cheap to run but operationally significant. Datadog / Splunk / Sumo Logic are turnkey but bill per-GB at high markup.
Append-only vs indexed-mutable — append-only is faster ingestion; indexed-mutable supports update / delete (rarely needed for logs except for GDPR right-to-erasure).

Real-world examples#

Loki + Grafana + Promtail — the open low-cost stack. Cloudflare, GitLab, and many cost-conscious teams run this.
OpenSearch / Elasticsearch + Kibana + Filebeat — the rich-query stack. AWS, Netflix Mantis, and most enterprise log analytics.
Splunk — the enterprise default for the last 20 years; powerful query language (SPL), expensive.
Datadog Logs — agent + hosted backend + correlation with metrics and APM. Heavy at YC and fintech shops.
Cloudflare — internally migrated to ClickHouse; ingests trillions of events/day across edge fleet.
Vector (Datadog open source) — the fastest collector in the field; written in Rust.
Honeycomb — collapses logs and traces into wide events; the unification thesis.

Distributed Monitoring — sibling pillar; logs and metrics often share the same ingestion path.
Distributed Messaging Queue — Kafka is the canonical transport between collectors and storage.
Blob Store — cold log archive lives here.
Distributed Search — Elasticsearch is both a search engine and a log store.