Server-Side Error Monitoring
Real-time error capture, deduplication, alerting, blast-radius scoping.
Use cases#
Server-side error monitoring captures unhandled exceptions, panics, and explicit error reports from backend services and turns them into actionable, deduplicated incidents:
- Production triage: a deploy ships a regression; the error tracker surfaces the new exception within seconds, attributes it to the bad release, and pages the on-call engineer.
- Long-tail bugs: errors that occur once in 10,000 requests rarely show up in tests but accumulate over months. The error tracker groups them and surfaces patterns.
- Customer support context: when a user reports a problem, search the error tracker by user ID to see exactly what blew up.
It complements metrics (which tell you the rate is up) and logs (which carry the full forensic context) by deduplicating raw errors into a much smaller incident list that a human can actually work through.
Functional requirements#
- Capture exceptions with stack trace, request context (URL, method, user ID, headers), and runtime state.
- Group similar errors into one issue so 10,000 instances of “NullPointerException at UserService.java:42” appear as one entry.
- Track first-seen, last-seen, frequency, and affected users per issue.
- Notify (email, Slack, PagerDuty) on new issues, regressions, and frequency thresholds.
- Integrate with source control (link stack frame to commit) and issue trackers (Jira, Linear).
- Release tracking — attribute each issue to the release that introduced it.
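The per-issue bookkeeping in the list above (first-seen, last-seen, frequency, affected users) can be sketched as a small aggregate record. This is a minimal illustration, not any particular tracker's schema: the `Issue` class and `record_event` helper are hypothetical names, and an exact set of user IDs stands in for whatever approximate counter a large deployment might use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional, Set


@dataclass
class Issue:
    """Aggregate record for one group of similar errors (hypothetical shape)."""
    fingerprint: str
    first_seen: datetime
    last_seen: datetime
    count: int = 0
    affected_users: Set[str] = field(default_factory=set)

    def record(self, seen_at: datetime, user_id: Optional[str]) -> None:
        # Events can arrive out of order, so take min/max instead of overwriting.
        self.count += 1
        self.first_seen = min(self.first_seen, seen_at)
        self.last_seen = max(self.last_seen, seen_at)
        if user_id is not None:
            self.affected_users.add(user_id)


def record_event(issues: Dict[str, Issue], fingerprint: str,
                 seen_at: datetime, user_id: Optional[str] = None) -> Issue:
    """Fold one event into its issue, creating the issue on first sight."""
    issue = issues.get(fingerprint)
    if issue is None:
        issue = Issue(fingerprint, first_seen=seen_at, last_seen=seen_at)
        issues[fingerprint] = issue
    issue.record(seen_at, user_id)
    return issue
```

Three events with the same fingerprint end up as one `Issue` whose `count` is 3, which is the "one entry instead of 10,000" behavior the requirements describe.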
Non-functional requirements#
- Ingest latency: end-to-end (error fires → visible in UI) under 30 s. Critical for fast triage.
- Throughput: hundreds of thousands of events per second in aggregate. Sentry’s hosted backend handles ~5M events/sec at peak.
- Reliability of capture: errors fire during outages — the SDK must buffer locally and retry, and must never throw an error itself.
- Storage: with grouping, store the full payload of one representative event per issue and aggregate counters for the rest.
High-level design#
    service        SDK                     pipeline                          UI
    ─────────      ────────                ──────────                        ────────
    exception ──>  capture(exc, ctx) ──>   async ─── ingest API
                   buffered                            │
                   queue (disk-backed)                 │
                                                       ├─ fingerprint ── group by hash
                                                       │                 normalize stack
                                                       ├─ sample (if over-budget)
                                                       ├─ enrich (release, user)
                                                       └─ persist payload + counter
                                                                         │
                                                                  dashboard, alerts

The SDK is a thin in-process library that captures, enriches, and ships. The pipeline (Kafka → processors → storage) handles grouping, deduplication, sampling, and rate-limiting. The UI surfaces grouped issues, sorted by frequency, recency, or impact.
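A rough sketch of the SDK's buffering side, under simplifying assumptions: the buffer is in memory rather than disk-backed, retries are immediate rather than spaced out with backoff, and `BufferedTransport` and its `send` callback are hypothetical names rather than any real SDK's API. The point it illustrates is that `capture` never raises and that undelivered events stay queued for the next flush.

```python
from collections import deque
from typing import Callable


class BufferedTransport:
    """Bounded in-memory buffer; a real SDK might back this with disk."""

    def __init__(self, send: Callable[[dict], None],
                 max_events: int = 1000, max_attempts: int = 3) -> None:
        self._send = send
        # With maxlen set, appending to a full deque silently drops the oldest event.
        self._queue: deque = deque(maxlen=max_events)
        self._max_attempts = max_attempts

    def capture(self, event: dict) -> None:
        """Queue an event. Never raises: monitoring must not break the host."""
        try:
            self._queue.append(event)
        except Exception:
            pass

    def flush(self) -> int:
        """Try to send buffered events in order; return how many were delivered."""
        delivered = 0
        while self._queue:
            event = self._queue[0]
            for _ in range(self._max_attempts):
                try:
                    self._send(event)
                except Exception:
                    continue  # transient failure: try this event again
                self._queue.popleft()
                delivered += 1
                break
            else:
                break  # every attempt failed: keep the rest for the next flush
        return delivered
```

In practice `flush` would run on a background thread or timer so that the request path only pays for the `append`.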
Detailed design#
Capturing#
In most languages, capture hooks into a global handler (Node's uncaughtException event, Python's sys.excepthook, Java's Thread.setDefaultUncaughtExceptionHandler) plus middleware in the web framework. Most frameworks have an official Sentry or Rollbar SDK that wraps both.
What you capture per event:
- Stack trace (with local variable values in dev, redacted in prod).
- Request context: method, URL, headers (PII-scrubbed), query params, body (rarely — too sensitive).
- User context: ID, email (if compliant), session ID.
- Runtime: hostname, region, deploy SHA, runtime version, env (prod/staging).
- Breadcrumbs: a short trail of log lines and HTTP calls leading up to the error.
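As an illustration of the capture step, here is a minimal Python sketch that builds an event from an exception and installs a global hook. The event shape, the `SENSITIVE_HEADERS` deny-list, and the `report` callback are assumptions made for this example; real SDKs define their own payload formats and scrubbing rules.

```python
import sys
import traceback
from typing import Callable

# Assumed deny-list; a real deployment would make this configurable.
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}


def scrub_headers(headers: dict) -> dict:
    """Replace the values of sensitive headers, keeping the header names."""
    return {k: ("[redacted]" if k.lower() in SENSITIVE_HEADERS else v)
            for k, v in headers.items()}


def build_event(exc: BaseException, request: dict, runtime: dict) -> dict:
    """Turn an exception plus request/runtime context into a plain dict."""
    frames = traceback.extract_tb(exc.__traceback__)
    return {
        "exception_type": type(exc).__name__,
        "message": str(exc),
        "stack": [{"file": f.filename, "function": f.name, "line": f.lineno}
                  for f in frames],
        "request": {
            "method": request.get("method"),
            "url": request.get("url"),
            "headers": scrub_headers(request.get("headers", {})),
        },
        "runtime": runtime,
    }


def install_excepthook(report: Callable[[dict], None], runtime: dict) -> None:
    """Report uncaught exceptions, then defer to the previously installed hook."""
    previous = sys.excepthook

    def hook(exc_type, exc, tb):
        try:
            report(build_event(exc, {}, runtime))
        except Exception:
            pass  # the capture path must never raise
        previous(exc_type, exc, tb)

    sys.excepthook = hook
```

Web-framework middleware would call `build_event` with the live request, while the global hook covers code running outside any request.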
Fingerprinting and grouping#
The pipeline’s hardest job is recognizing two events as “the same issue”. The canonical algorithm:
    fingerprint = hash(
        exception_type,                             // NullPointerException
        first_in_app_stack_frame(file, function),
        normalize(message)                          // strip variable substrings
    )

first_in_app_stack_frame excludes vendor / framework code so an error in requests.send is grouped by the user's code that called it, not by the library frame.
Normalization of the message strips things like UUIDs, numbers, and timestamps so "user 42 not found" and "user 39281 not found" collapse to one fingerprint.
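The fingerprint described above might look like the following in Python. This is a sketch under assumptions: frames are plain dicts ordered innermost first, "in-app" is decided by a hypothetical path prefix (`app/`), and the three regexes are only a starting set of normalizers, not a complete one.

```python
import hashlib
import re

# Order matters: replace UUIDs and timestamps before bare numbers.
_PATTERNS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<timestamp>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def normalize(message: str) -> str:
    """Strip variable substrings so similar messages compare equal."""
    for pattern, placeholder in _PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


def first_in_app_frame(stack, in_app_prefixes=("app/",)):
    """Return the innermost frame that belongs to application code.

    Assumes `stack` is ordered innermost frame first; falls back to the
    innermost frame when nothing matches.
    """
    for frame in stack:
        if frame["file"].startswith(in_app_prefixes):
            return frame
    return stack[0] if stack else None


def fingerprint(exception_type: str, stack, message: str) -> str:
    frame = first_in_app_frame(stack)
    location = f'{frame["file"]}:{frame["function"]}' if frame else ""
    # Join with a separator that cannot appear in the parts by accident.
    raw = "\x1f".join([exception_type, location, normalize(message)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Line numbers are left out of the hash on purpose in this sketch, so an unrelated edit higher up in the file does not split an existing issue in two.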
Edge cases:
- Wrapped exceptions: "RuntimeException: caused by IOException ..." should group by the root cause, not the wrapper.
- Minified stack traces (Node, Browser): apply source maps server-side before fingerprinting.
- Async stacks: JS Promise chains and Go goroutines split the apparent stack; the SDK has to stitch them back via instrumentation.
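For the wrapped-exception case, Python exposes the chain through `__cause__` and `__context__`, so finding the root cause is a short walk. The helper name `root_cause` is made up for this sketch; other runtimes expose the chain differently (for example `getCause()` on the JVM).

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow the exception chain to its innermost cause.

    Prefers the explicit `raise ... from ...` link, falls back to the implicit
    context unless it was suppressed, and guards against cyclic chains.
    """
    seen = {id(exc)}
    current = exc
    while True:
        nxt = current.__cause__
        if nxt is None and not current.__suppress_context__:
            nxt = current.__context__
        if nxt is None or id(nxt) in seen:
            return current
        seen.add(id(nxt))
        current = nxt
```

The fingerprint would then be computed from the type and stack of `root_cause(exc)` instead of the outer wrapper.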
Sampling and rate-limiting#
A production outage can generate millions of identical errors in seconds. The pipeline rate-limits per fingerprint:
    if events_for_fingerprint_in_window(fp, 60s) > 100:
        increment_counter(fp)   // keep the count
        drop_payload()          // discard the duplicate body

The UI shows “12,847 events in the last hour” without storing 12,847 copies.
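A runnable version of that rule, as a sketch: it uses a fixed window per fingerprint (a sliding window or token bucket would also work), keeps everything in process memory, and takes the current time as an argument so it is easy to test. `FingerprintLimiter` and `admit` are hypothetical names.

```python
from collections import defaultdict


class FingerprintLimiter:
    """Fixed-window limiter: always count, store at most `limit` payloads per window."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)      # total events per fingerprint, never dropped
        self._window_start = {}             # fingerprint -> start time of current window
        self._in_window = defaultdict(int)  # fingerprint -> events in current window

    def admit(self, fp: str, now: float) -> bool:
        """Count the event; return True if its payload should be stored."""
        self.counts[fp] += 1
        start = self._window_start.get(fp)
        if start is None or now - start >= self.window:
            # First event for this fingerprint, or the previous window expired.
            self._window_start[fp] = now
            self._in_window[fp] = 0
        self._in_window[fp] += 1
        return self._in_window[fp] <= self.limit
```

Because the counter is incremented before the limit check, the dashboard total stays accurate even while payloads are being dropped.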
Reservoir sampling preserves a uniform random subset of payloads when you do want to keep multiple representatives.
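One common way to do this is Algorithm R, sketched below with a hypothetical `Reservoir` class. It keeps at most `k` payloads per fingerprint, and every event seen so far has the same probability of being among them.

```python
import random


class Reservoir:
    """Keep a uniform random sample of up to k payloads from a stream (Algorithm R)."""

    def __init__(self, k: int, rng=None) -> None:
        self.k = k
        self.seen = 0
        self.items = []
        self._rng = rng or random.Random()

    def offer(self, payload) -> None:
        self.seen += 1
        if len(self.items) < self.k:
            # Still filling the reservoir: keep everything.
            self.items.append(payload)
            return
        # Replace a random slot with probability k / seen.
        j = self._rng.randrange(self.seen)  # uniform in [0, seen)
        if j < self.k:
            self.items[j] = payload
```

Memory use is fixed at `k` payloads regardless of how many events arrive, which is what makes it suitable for outage-scale bursts.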
Alerting and blast-radius scoping#
Useful alert dimensions:
- New issue — never-seen-before fingerprint just appeared. Often the most actionable alert.
- Regression — an issue marked “resolved” reappeared in a later release.
- Frequency spike — error count > 5× last week’s baseline at the same time of day.
- High-impact — affects > 0.1% of active users or > 1% of a specific route.
- Release health — % of sessions with errors in a release exceeds threshold.
Blast-radius scoping computes the percentage of total traffic (or of active users) that an issue affects, so a 1,000-event/min issue from a single misbehaving client doesn't page the whole on-call rotation.
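The alert dimensions and the blast-radius check can be combined into a small rule evaluator. The sketch below is illustrative only: `IssueWindow`, `alert_reasons`, and `should_page` are hypothetical names, the thresholds mirror the 5× and 0.1% figures above, and the choice to page only on wide user impact is one possible policy, not a recommendation from any specific tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IssueWindow:
    """Stats for one issue over the current evaluation window."""
    events: int              # events in the current window
    baseline_events: float   # events in the same window last week
    affected_users: int
    active_users: int
    is_new: bool = False


def alert_reasons(w: IssueWindow, spike_factor: float = 5.0,
                  user_fraction: float = 0.001) -> List[str]:
    """Return the alert conditions this issue currently satisfies."""
    reasons = []
    if w.is_new:
        reasons.append("new_issue")
    if w.baseline_events > 0 and w.events > spike_factor * w.baseline_events:
        reasons.append("frequency_spike")
    if w.active_users > 0 and w.affected_users / w.active_users > user_fraction:
        reasons.append("high_impact")
    return reasons


def should_page(w: IssueWindow) -> bool:
    """Page only when the blast radius is wide; other reasons just notify."""
    return "high_impact" in alert_reasons(w)
```

With these rules, 1,000 events from a single user out of 50,000 active users registers as a frequency spike but does not page anyone.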
Source maps and symbolication#
Server-side stack traces are usually plaintext, but minified code and stripped binaries need help:
- JavaScript (Node SSR, Lambda): upload source maps with each release; the pipeline resolves frames at ingest time.
- Go: built-in stack traces are plain; just strip the build path prefix.
- C++ / Rust / Swift: upload symbol files (.dSYM, .pdb); the backend resolves at ingest.
- JVM: ProGuard mappings or Java's built-in line tables.
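The simplest case in the list, stripping a build path prefix, can be sketched in a few lines. The function name and the example CI path are made up for illustration; the idea is only that frames from different build machines should normalize to the same repository-relative path before fingerprinting.

```python
from typing import Iterable


def strip_build_prefix(frame_path: str, prefixes: Iterable[str]) -> str:
    """Remove a known build-root prefix so paths are repository-relative.

    Longer prefixes are tried first so the most specific root wins.
    """
    for prefix in sorted(prefixes, key=len, reverse=True):
        if frame_path.startswith(prefix):
            return frame_path[len(prefix):].lstrip("/")
    return frame_path
```

A frame such as `/home/ci/build/app/users/service.go` becomes `app/users/service.go` when `/home/ci/build/` is a configured prefix; paths that match no prefix pass through unchanged.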
Release tracking#
Tag every event with a release identifier (commit SHA or semver). The dashboard shows “issues first seen in release X” which is gold for git bisect-style root cause work. Tie this to deploy webhooks for automatic release boundary detection.
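A sketch of how release tags support first-seen attribution and regression detection. It assumes the pipeline already has an ordered list of release identifiers (for example, built from deploy webhooks); `IssueState`, `observe`, and the `resolved_in` field are hypothetical names for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IssueState:
    first_release: str                 # release in which the issue was first seen
    resolved_in: Optional[str] = None  # newest release at the time it was marked resolved


def observe(issues: Dict[str, IssueState], release_order: List[str],
            fingerprint: str, release: str) -> str:
    """Classify an incoming event as 'new', 'regression', or 'known'.

    `release_order` lists release identifiers from oldest to newest.
    """
    state = issues.get(fingerprint)
    if state is None:
        issues[fingerprint] = IssueState(first_release=release)
        return "new"
    if (state.resolved_in is not None
            and release_order.index(release) > release_order.index(state.resolved_in)):
        state.resolved_in = None  # reopen the issue
        return "regression"
    return "known"
```

Events from older releases that are still draining after a fix do not count as regressions in this sketch; only a reappearance in a newer release does.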
Trade-offs#
The main design axes:
- Sample-everything vs sample-by-fingerprint: sampling uniformly across all events risks missing rare ones; sampling per fingerprint risks under-counting high-volume issues. Most systems combine the two: keep 100% of new-fingerprint events, sample subsequent ones.
- Inline ingest vs queue-based: inline (HTTP → store) is simpler; queue-based (HTTP → Kafka → processor → store) absorbs spikes without dropping. Production systems almost always queue.
- Errors-only vs all-events: Sentry-style is errors-only. APM-style tools (Datadog, New Relic) capture every transaction and tag the errored ones. Errors-only is cheaper and more focused; full APM gives latency context but at roughly 100× the volume.
Real-world examples#
- Sentry — the dominant SaaS for error tracking; SDKs for 40+ languages; on-call rotation integration; release health.
- Rollbar — pioneered “deploy tracking” (link issues to git revisions) circa 2012.
- Bugsnag — strong mobile + server story; “stability score” metric.
- Honeybadger — Ruby-first, with cron-job monitoring (alert when a job didn’t run).
- Airbrake — one of the originals; popular in Rails shops.
- Internal at scale — Google’s Buganizer, Meta’s Slasher, Stripe’s in-house tooling all share the same patterns: fingerprint, group, dedupe, blast-radius-scope.
Related building blocks#
- Distributed Logging — the broader log pipeline; error events are a high-priority subset.
- Distributed Monitoring — metrics tell you the rate is up, errors tell you what’s failing.
- Client-Side Error Monitoring — same problem on the browser/mobile side.
- Pub-Sub — the ingest queue between SDK and pipeline.