Distributed Logging
Log shipping, structured fields, aggregation, retention tiers, search-on-logs vs metrics.
Use cases#
A distributed logging system collects, transports, indexes, stores, and queries log events from every service in your fleet. It differs from Distributed Monitoring (metrics) in that logs preserve full per-event detail rather than pre-aggregating. The dominant use cases:
- Production debugging — “what did request 87a3 do?” The log line is the ground truth.
- Audit and compliance — security events, access logs, payment events; SOC 2, HIPAA, PCI require log retention with integrity.
- Postmortem reconstruction — sequence of events leading to an outage. Stitching together logs across services is often the only way to figure out cascade failures.
- Real-time alerting on patterns — alert if any service logs OutOfMemoryError 5 times in 1 min.
- Analytics on application events — feeding logs into BigQuery / Snowflake for product analytics.
Functional requirements#
- Each service emits structured (JSON, logfmt) or unstructured logs.
- A collector tails files / stdout / journald per host and ships to a central pipeline.
- The pipeline enriches (add host, region, service, version), filters (drop noisy lines), and routes to one or more sinks.
- Storage layer holds logs for a retention window with searchable indexing.
- Query layer lets engineers grep / filter / aggregate by service, time, severity, custom fields.
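The enrich, filter, and route step can be sketched in a few lines of Python. This is a minimal illustration rather than any particular agent's configuration: the metadata values, the `/healthz` route, the `audit` flag, the sink names, and the routing rule are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, object]

# Hypothetical per-host metadata; a real agent would read it from its environment.
HOST_METADATA: Event = {
    "host": "web-17",
    "region": "us-east-1",
    "service": "checkout",
    "version": "1.4.2",
}


def enrich(event: Event, metadata: Event) -> Event:
    """Add host-level fields; fields already present on the event win."""
    return {**metadata, **event}


def keep(event: Event) -> bool:
    """Filter: drop debug lines and health-check noise (assumed policy)."""
    if event.get("level") == "debug":
        return False
    if event.get("route") == "/healthz":
        return False
    return True


def route(event: Event) -> List[str]:
    """Route: every event goes to the hot index and the archive;
    events flagged as audit events also go to a separate audit sink."""
    sinks = ["hot_index", "archive"]
    if event.get("audit") is True:
        sinks.append("audit_store")
    return sinks


def process(event: Event, metadata: Event) -> Optional[Tuple[Event, List[str]]]:
    """Run one event through enrich -> filter -> route.
    Returns None when the event is dropped."""
    enriched = enrich(event, metadata)
    if not keep(enriched):
        return None
    return enriched, route(enriched)
```

Calling `process({"level": "error", "msg": "failed to charge card"}, HOST_METADATA)` returns the enriched event together with the list of sinks it should be written to, while a debug line returns `None`.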
Non-functional requirements#
- Ingest throughput: a thousand-host system emits 100k-1M log lines/sec. The pipeline must absorb that load without dropping events.
- Storage: at ~500 bytes/line and 1M lines/sec, that’s 43 TB/day raw. Compression and tiering bring it down 5-20×.
- Query latency: ad-hoc query p95 under 10 s on a day’s worth of logs.
- Reliability: must survive collector restart, broker outage, pipeline backpressure without losing logs. The on-disk buffer is non-negotiable.
- End-to-end delay: under 30 s from the moment a log line is emitted until it is searchable, so engineers can debug against live production traffic.
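The storage figure above follows from simple arithmetic, worked through here as a small Python sketch. The line size, ingest rate, and the 5× and 20× reduction factors come from the requirements; the function names are made up for the example.

```python
BYTES_PER_LINE = 500        # average line size quoted above
LINES_PER_SEC = 1_000_000   # upper end of the ingest range
SECONDS_PER_DAY = 86_400


def raw_bytes_per_day(bytes_per_line: int, lines_per_sec: int) -> int:
    """Uncompressed volume produced in one day."""
    return bytes_per_line * lines_per_sec * SECONDS_PER_DAY


def stored_tb_per_day(raw_bytes: int, reduction_factor: float) -> float:
    """Volume actually stored after compression and tiering, in TB."""
    return raw_bytes / reduction_factor / 1e12


raw = raw_bytes_per_day(BYTES_PER_LINE, LINES_PER_SEC)
print(f"raw: {raw / 1e12:.1f} TB/day")  # raw: 43.2 TB/day
for factor in (5, 20):
    print(f"{factor}x reduction: {stored_tb_per_day(raw, factor):.2f} TB/day")
```

At the quoted reduction range this leaves roughly 2 to 9 TB/day to store, which is the number that should drive the retention budget.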
High-level design#
host                pipeline                        storage / query
─────────           ──────────────                  ─────────────────────
app stdout/file ─┐
syslog          ─┼──> Fluent Bit / Vector / Filebeat (per host)
containers      ─┘  │ tail, parse, enrich, buffer to disk
                    ▼
            Kafka / Kinesis / Pulsar (durable transport)
                    │
                    ▼
            Processors (parse, route, redact, sample)
                    │
        ┌───────────┼────────────┬──────────────┐
        ▼           ▼            ▼              ▼
      Loki     OpenSearch   ClickHouse  S3 (cold storage)
   (hot tier)  (hot tier)   (analytics)     (archive)
        │
        ▼
Grafana / Kibana / SQL

Three logical stages: collect (per-host agent), transport (durable queue), store + query (indexed backend + archive).
Detailed design#
Structured logging#
A log line carrying a free-form string is hard to query; a log line carrying structured fields is queryable forever:
{
  "ts": "2024-05-16T10:00:00Z",
  "level": "error",
  "service": "checkout",
  "trace_id": "abc123",
  "user_id": 42,
  "duration_ms": 342,
  "route": "/api/v1/orders",
  "msg": "failed to charge card",
  "err": "card_declined"
}

Every modern logging library emits structured logs (zap, slog, winston, pino, structlog). The pipeline preserves the structure end-to-end so queries become service:checkout AND level:error AND err:card_declined.
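For services that have no structured logging library available, Python's standard `logging` module can emit the same shape. The sketch below is one way to do it, not a recommended library: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under `extra={"fields": ...}` are assumptions made for the example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}} (assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout for the agent to tail."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.error(
    "failed to charge card",
    extra={"fields": {"trace_id": "abc123", "user_id": 42, "err": "card_declined"}},
)
```

Keeping the human-readable message in `msg` and everything queryable in separate keys is what lets the downstream query filter on `err` without parsing text.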
Collection: tail and ship#
Per-host agents (Fluent Bit, Vector, Filebeat, Datadog Agent, Promtail) tail log files and / or stdout. Modern containers route stdout to a per-container file on the host; the agent tails them all. Kubernetes’ canonical pattern: a DaemonSet runs one agent per node.
Critical features:
- On-disk buffer with checkpoint, so a pipeline outage doesn’t lose logs.
- Backpressure — slow down or drop low-priority logs when downstream is congested.
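- At-least-once delivery to the transport — the application has already committed the event by writing it to disk, so the agent retries until the transport acknowledges it. Duplicates are acceptable; silent loss is not.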
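The checkpoint and at-least-once behavior can be illustrated with a deliberately small tailer. This is a sketch under stated assumptions, not how Fluent Bit or Vector are implemented: the function names and the JSON checkpoint file are invented for the example, `send` stands in for the call to the transport, and log rotation and truncation are not handled.

```python
import json
import os
from typing import Callable, List


def load_offset(checkpoint_path: str) -> int:
    """Return the last acknowledged byte offset, or 0 if none was saved."""
    try:
        with open(checkpoint_path) as f:
            return int(json.load(f)["offset"])
    except (FileNotFoundError, ValueError, KeyError):
        return 0


def save_offset(checkpoint_path: str, offset: int) -> None:
    """Persist the offset atomically: write a temp file, fsync, then rename."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)


def ship_new_lines(
    log_path: str,
    checkpoint_path: str,
    send: Callable[[List[bytes]], None],
    max_batch: int = 1000,
) -> int:
    """Ship complete lines written since the checkpoint; return how many.

    The checkpoint only advances after send() returns, so a failed send is
    retried on the next call. That gives at-least-once delivery: a crash
    between send() and save_offset() re-sends the batch.
    """
    offset = load_offset(checkpoint_path)
    batch: List[bytes] = []
    with open(log_path, "rb") as f:
        f.seek(offset)
        while len(batch) < max_batch:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a line the app is still writing
            batch.append(line.rstrip(b"\n"))
            offset = f.tell()
    if batch:
        send(batch)  # raises on failure, leaving the checkpoint untouched
        save_offset(checkpoint_path, offset)
    return len(batch)
```

Because the log file itself is the buffer, a downstream outage costs nothing but delay as long as the file is not rotated away before the agent catches up.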
Transport: durable queue#
A durable queue (Kafka, Kinesis, or Pulsar) between the collector and the storage tier serves three purposes:
- Buffering — absorbs ingestion spikes without overloading storage.
- Multi-consumer fan-out — the same log stream feeds the hot search index, the cold S3 archive, and any number of downstream analytics jobs.
- Replay — re-process the last hour of logs after fixing a parser bug.
For low-volume systems (small startup), the agent can ship directly to the backend; Kafka is justified once you’re in the 50k+ events/sec range.
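Fan-out and replay both fall out of one property: the queue keeps the events and each consumer group tracks its own read position. The toy class below models that for a single partition in memory. It is an illustration of the idea only, with invented names (`PartitionLog`, `poll`, `seek`), and does not use or mirror the Kafka client API.

```python
from typing import Dict, List


class PartitionLog:
    """In-memory stand-in for one partition of a durable queue."""

    def __init__(self) -> None:
        self._events: List[dict] = []
        self._next_offset: Dict[str, int] = {}  # consumer group -> next offset

    def append(self, event: dict) -> int:
        """Append an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group: str, max_events: int = 100) -> List[dict]:
        """Return the next batch for this group and advance only its offset."""
        start = self._next_offset.get(group, 0)
        batch = self._events[start:start + max_events]
        self._next_offset[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip) a group, e.g. to replay after a parser fix."""
        if not 0 <= offset <= len(self._events):
            raise ValueError("offset out of range")
        self._next_offset[group] = offset


log = PartitionLog()
for i in range(3):
    log.append({"seq": i})

indexed = log.poll("hot-indexer")    # fan-out: each group sees every event
archived = log.poll("s3-archiver")

log.seek("hot-indexer", 1)           # replay: re-read from offset 1
replayed = log.poll("hot-indexer")
```

Rewinding `hot-indexer` does not disturb `s3-archiver`, which is why a parser bug in one sink can be fixed and replayed without touching the others.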
Storage and indexing#
Two camps, paralleling the metrics-side debate:
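- Index everything (OpenSearch / Elasticsearch, Splunk) — the full content of every line is indexed, so free-form ad-hoc search is fast, at a higher storage and compute cost.
- Index only labels (Loki) — only a small set of labels (service, host, level) is indexed. Raw logs sit in object storage and are scanned by time-and-label. ~80% cost reduction; full-text search is a linear scan over a small window.

The Loki bet: most queries narrow by service and time first, so label-indexing wins. The OpenSearch / Splunk bet: free-form ad-hoc queries dominate, so full indexing pays for itself.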
ClickHouse occupies a third position — columnar storage with native SQL — and is gaining ground (Cloudflare’s logging stack runs on it).
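The difference between label-only indexing and full-text indexing is easiest to see side by side. The two toy stores below are a sketch of the idea under simplifying assumptions (in-memory, whitespace tokenization, no time dimension); the class names are invented and neither reflects the actual internals of Loki or OpenSearch.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple


class LabelIndexedStore:
    """Index only the label set; scan raw lines inside matching streams."""

    def __init__(self) -> None:
        self._streams: Dict[FrozenSet[Tuple[str, str]], List[str]] = defaultdict(list)

    def append(self, labels: Dict[str, str], line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: Dict[str, str], needle: str) -> List[str]:
        wanted = set(selector.items())
        hits: List[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: label lookup
                hits.extend(l for l in lines if needle in l)  # costly: linear scan
        return hits


class FullTextStore:
    """Index every token of every line in an inverted index."""

    def __init__(self) -> None:
        self._lines: List[str] = []
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def append(self, line: str) -> None:
        doc_id = len(self._lines)
        self._lines.append(line)
        for token in line.lower().split():
            self._postings[token].add(doc_id)  # index grows with every token

    def query(self, *tokens: str) -> List[str]:
        ids = None
        for token in tokens:
            posting = self._postings.get(token.lower(), set())
            ids = posting if ids is None else ids & posting
        return [self._lines[i] for i in sorted(ids or ())]


labels = LabelIndexedStore()
labels.append({"service": "checkout", "level": "error"}, "failed to charge card")
labels.append({"service": "search", "level": "error"}, "failed to reach shard")

fulltext = FullTextStore()
fulltext.append("failed to charge card")
fulltext.append("failed to reach shard")
```

The label store does almost no work at write time and pays at query time; the full-text store does the opposite, which is the cost trade described above.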
Retention tiering#
- Hot (0-7 days) — indexed, fast queries. Loki, OpenSearch. SSDs.
- Warm (7-30 days) — searchable but slower; bigger blocks, less compute.
- Cold (30+ days) — archived raw logs in S3. Query via Athena / restore.
- Frozen (compliance) — write-once, sealed, decades.

The 80/20: 95% of queries hit the last 24 hours. Spending hot-storage money on year-old logs is the most common cost mistake.
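A lifecycle job that applies these tiers only needs the age of a log block and a compliance flag. The sketch below uses the 7-day and 30-day boundaries from the list; the function name and the rule that compliance-held logs move to the frozen tier once they leave the warm window are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=30)


def tier_for(event_time: datetime, now: datetime, compliance_hold: bool = False) -> str:
    """Pick the storage tier for a log block by its age.

    Assumed policy: once a block ages out of the warm window it goes to
    "frozen" if it is under a compliance hold, otherwise to "cold".
    """
    age = now - event_time
    if age < HOT_WINDOW:
        return "hot"
    if age < WARM_WINDOW:
        return "warm"
    return "frozen" if compliance_hold else "cold"


now = datetime(2024, 5, 16, tzinfo=timezone.utc)
print(tier_for(now - timedelta(hours=3), now))   # hot
print(tier_for(now - timedelta(days=12), now))   # warm
print(tier_for(now - timedelta(days=90), now))   # cold
```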
Sampling and filtering#
Not every log line is worth keeping forever. Strategies:
- Drop noisy lines at the collector — health-check logs, debug-level lines in prod.
- Sample by trace — keep every log line belonging to a sampled trace; drop lines from unsampled ones. Same heuristics as Distributed Monitoring traces.
- Rate-limit per service — a runaway service that logs 100k lines/sec must not blow up the whole pipeline.
- PII redaction — regex-strip credit cards, SSNs, emails. Defense-in-depth (SDK + pipeline + storage).
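Three of the strategies above are small enough to sketch directly. Everything here is illustrative: the function and class names are invented, the regular expressions are deliberately naive and will miss or over-match real-world formats, and the hash-based sampler is one common way to get a consistent per-trace decision rather than a prescribed one.

```python
import hashlib
import re


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: every line of a trace gets the same decision,
    on every host, because the decision depends only on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


class TokenBucket:
    """Per-service rate limit: `rate` lines/sec sustained, `burst` at once."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller drops (or counts) the line


# Naive patterns for illustration only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(message: str) -> str:
    """Replace matches with fixed placeholders before the line leaves the host."""
    message = SSN.sub("[REDACTED_SSN]", message)
    message = CARD.sub("[REDACTED_CARD]", message)
    message = EMAIL.sub("[REDACTED_EMAIL]", message)
    return message
```

For example, `redact("card 4111 1111 1111 1111 for bob@example.com")` returns `"card [REDACTED_CARD] for [REDACTED_EMAIL]"`. Passing the clock into `TokenBucket.allow` keeps the limiter testable; a real collector would pass `time.monotonic()`.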
Correlation with traces and metrics#
Logs become far more useful when correlated with the other signals:
- Trace ID injection — every log line in a request carries the trace_id. Click a slow trace → jump to its logs.
- Service-name standardization — same label scheme across logs, metrics, traces.
- Common dashboards — Grafana panels show metric, logs, and traces side-by-side.
This is the OpenTelemetry pitch and the modern observability default.
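Trace ID injection does not require touching every log call. One way to sketch it with only the Python standard library is a context variable set once per request plus a logging filter that copies it onto each record. The names `current_trace_id`, `TraceIdFilter`, and `handle_request` are invented for the example, and this is a hand-rolled illustration rather than the OpenTelemetry SDK's own mechanism.

```python
import contextvars
import logging

# Holds the trace ID of the request currently being handled ("-" outside one).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drops a record; only annotates it


def handle_request(trace_id: str, logger: logging.Logger) -> None:
    """Set the trace ID for the duration of one request, then restore it."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("charging card")
    finally:
        current_trace_id.reset(token)


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
demo_logger = logging.getLogger("checkout-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

handle_request("abc123", demo_logger)
```

Because `contextvars` is task-local under `asyncio` as well as thread-local, concurrent requests on the same worker do not leak trace IDs into each other's log lines.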
Trade-offs#
Beyond the indexing debate covered under Storage and indexing, other axes:
- Push vs pull collection — push (agent → broker) is the dominant pattern; pull (Prometheus-style scrape) doesn’t really apply to logs since they’re discrete events, not gauges.
- JSON vs binary log format — JSON is universal but verbose; binary (Protobuf, MessagePack) is 3-5× smaller. Most pipelines use JSON for portability and rely on compression.
- Self-hosted vs SaaS — Loki + Grafana + S3 is cheap to run but operationally significant. Datadog / Splunk / Sumo Logic are turnkey but bill per-GB at high markup.
- Append-only vs indexed-mutable — append-only is faster ingestion; indexed-mutable supports update / delete (rarely needed for logs except for GDPR right-to-erasure).
Real-world examples#
- Loki + Grafana + Promtail — the open low-cost stack. Cloudflare, GitLab, and many cost-conscious teams run this.
- OpenSearch / Elasticsearch + Kibana + Filebeat — the rich-query stack. AWS, Netflix Mantis, and most enterprise log analytics.
- Splunk — the enterprise default for the last 20 years; powerful query language (SPL), expensive.
- Datadog Logs — agent + hosted backend + correlation with metrics and APM. Heavy at YC and fintech shops.
- Cloudflare — internally migrated to ClickHouse; ingests trillions of events/day across edge fleet.
- Vector (Datadog open source) — among the fastest collectors in the field; written in Rust.
- Honeycomb — collapses logs and traces into wide events; the unification thesis.
Related building blocks#
- Distributed Monitoring — sibling pillar; logs and metrics often share the same ingestion path.
- Distributed Messaging Queue — Kafka is the canonical transport between collectors and storage.
- Blob Store — cold log archive lives here.
- Distributed Search — Elasticsearch is both a search engine and a log store.