The Circuit Breaker Pattern
Closed → Open → Half-Open. Failing fast when a dependency is sick; the cascade-prevention pattern Netflix made famous.
What it is#
A circuit breaker is a piece of state that sits between your service and a downstream dependency, tracks whether that dependency is healthy, and fails fast when it isn’t. The name comes from the electrical circuit breaker in a household panel: when the current exceeds the safe limit, the breaker trips and stops the flow, protecting everything downstream from the fault.
The pattern has three states — Closed, Open, Half-Open — and transitions between them based on observed failure rates. When closed, requests flow through normally. When open, requests fail immediately without ever calling the downstream service. After a cooldown, the breaker enters half-open, lets a small number of probe requests through, and either closes (downstream is back) or re-opens (downstream is still sick).
Circuit breakers exist for one specific failure mode: cascading failures. When a downstream service is slow or down, every upstream caller’s threads pile up waiting for it, every caller’s queue fills, and every caller eventually fails too. Without circuit breakers, the sick service takes down its callers, then their callers, until the whole system is in a coordinated outage. With circuit breakers, the sick service is isolated within seconds; callers fail fast, free their threads, and stay healthy enough to serve other traffic.
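The thread pile-up can be made concrete with back-of-the-envelope arithmetic. The sketch below applies Little's law (threads in use ≈ arrival rate × latency); the pool size, request rate, and latencies are hypothetical numbers chosen for illustration, and the model ignores queueing and retries.

```python
from typing import Optional


def threads_needed(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of threads busy waiting on the dependency."""
    return arrival_rate_per_s * latency_s


def seconds_to_exhaust(pool_size: int, arrival_rate_per_s: float,
                       latency_s: float) -> Optional[float]:
    """Rough time until every thread is blocked, or None if the pool keeps up."""
    if threads_needed(arrival_rate_per_s, latency_s) <= pool_size:
        return None
    # Each new request pins one thread for the whole (long) latency,
    # so the pool drains at roughly the arrival rate.
    return pool_size / arrival_rate_per_s


# Hypothetical service: 200 threads, 50 requests/s to the dependency.
healthy = threads_needed(50, 0.1)         # about 5 threads busy at 100 ms latency
degraded = seconds_to_exhaust(200, 50, 30)  # pool is full after about 4 s at 30 s latency
```

Under these assumptions the caller goes from 5 busy threads to a fully blocked pool within a few seconds, which is why isolation has to happen automatically rather than by a human noticing.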
Netflix’s Hystrix library, released in 2012, made the pattern famous in the microservices world. Hystrix entered maintenance mode in 2018, and Netflix pointed new Java projects to Resilience4j. Equivalent implementations exist elsewhere: Polly (.NET), gobreaker (Go), and the circuit-breaking features of proxies and service meshes such as Envoy and Istio. The pattern is now standard infrastructure, but the engineering decisions inside it still matter.
When to use it#
Reach for a circuit breaker when:
- You call any downstream service over the network. External APIs, internal microservices, databases — anything where a failure can wedge a thread. Especially load-bearing for synchronous request-reply paths where the upstream caller is waiting.
- A slow dependency can cascade. For cascades, slow is worse than down. A dependency that’s still answering, but at 30 seconds per request, will fill every thread pool upstream before anyone notices. A breaker that also trips on slow calls catches this.
- You can degrade gracefully. The breaker only helps if you have a fallback (a cached result, a default value, a clean error). If failing fast just propagates a different error, the breaker reduces latency but doesn’t help reliability.
Avoid circuit breakers when:
- The call is to an in-process resource. Local cache, in-memory queue — no network, no thread-blocking I/O, no cascade.
- A retry would succeed quickly. A transient TCP reset doesn’t need a breaker; a quick retry handles it. Reserve the breaker for sustained badness.
- You have no fallback path. If “circuit open” just becomes a 500 anyway, the breaker is shifting where the error originates without improving the outcome.
How it works#
The three states#
```text
        ┌──────────────────┐
        │      Closed      │
        │  (normal flow)   │
        └────────┬─────────┘
                 │  failure rate exceeds threshold
                 ▼
        ┌──────────────────┐
  ┌────▶│       Open       │
  │     │   (fail fast)    │
  │     └────────┬─────────┘
  │              │  cooldown timer expires
  │              ▼
  │     ┌──────────────────┐
  │     │    Half-Open     │
  │     │ (probe requests) │
  │     └────────┬─────────┘
  │              │
  │        ┌─────┴──────┐
  │      probe        probe
  │      fails       succeeds
  │        │            │
  └────────┘            ▼
                 back to Closed
```

Closed: Default state. Requests pass through. A sliding-window counter tracks the failure rate. If the rate crosses a threshold (e.g., 50% failures over the last 20 requests, with a minimum sample size), trip to Open.
Open: Every call to the breaker returns immediately with a circuit-open error. No request is made downstream. A timer runs for the cooldown period (typically 30-60 seconds).
Half-Open: When the cooldown expires, the breaker lets a small number of probe requests through (often just 1, or a fixed quota like 5). If a probe succeeds, transition to Closed. If a probe fails, transition back to Open and reset the cooldown.
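The three states and their transitions fit in a few dozen lines. What follows is a minimal sketch, not any library's implementation: the names `SimpleBreaker` and `CircuitOpenError` are made up for illustration, the window is count-based (last N calls), a single successful probe closes the breaker, and there is no locking, so it is not safe for concurrent use as written.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class SimpleBreaker:
    def __init__(self, failure_threshold=0.5, window_size=20, min_calls=20,
                 cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls          # should not exceed window_size
        self.cooldown_s = cooldown_s
        self.clock = clock                  # injectable so tests can fake time
        self.state = State.CLOSED
        self.outcomes = deque(maxlen=window_size)  # True means "failed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = State.HALF_OPEN    # cooldown expired: let a probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.state = State.CLOSED       # probe succeeded
            self.outcomes.clear()
        else:
            self.outcomes.append(False)

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()                    # probe failed: reopen, restart cooldown
            return
        self.outcomes.append(True)
        enough = len(self.outcomes) >= self.min_calls
        if enough and sum(self.outcomes) / len(self.outcomes) >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.outcomes.clear()
```

A caller wraps the downstream call as `breaker.call(fetch, user_id)` and handles `CircuitOpenError` with its fallback. Production libraries add locking, probe quotas, and time-based windows on top of this skeleton.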
The four knobs you must set#
Every circuit breaker has the same four tuneable parameters. Getting them right takes telemetry, not theory.
| Knob | Typical value | What it controls |
|---|---|---|
| Failure threshold | 50% over 20 requests | When to trip from Closed to Open |
| Minimum sample size | 20 requests | Avoid tripping on noise (don’t trip after one failure) |
| Sliding window | 10 seconds or 100 requests | How far back the failure rate looks |
| Cooldown duration | 30-60 seconds | How long Open lasts before probing |
The minimum sample size is the most-missed parameter. Without it, a breaker can trip on a single failure: “1 failure out of 1 call = 100% failure rate”. Set a floor (often 20 calls) before the rate is even computed.
The cooldown is a trade-off: short enough to recover quickly when the dependency comes back, long enough to not hammer a still-recovering dependency. 30-60 seconds is typical for HTTP services; longer (5+ minutes) for slower-recovering things like databases.
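To show how the window and the minimum sample size interact, here is a small sketch of a time-based window that refuses to report a failure rate until it has seen enough calls. `FailureWindow` and its parameter names are hypothetical; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque
from typing import Optional


class FailureWindow:
    """Failure rate over the last `window_s` seconds, with a sample-size floor."""

    def __init__(self, window_s: float = 10.0, min_calls: int = 20) -> None:
        self.window_s = window_s
        self.min_calls = min_calls
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, now: float, failed: bool) -> None:
        self.events.append((now, failed))

    def failure_rate(self, now: float) -> Optional[float]:
        # Drop calls that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        # Below the floor the rate is noise, so report "unknown" instead.
        if len(self.events) < self.min_calls:
            return None
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)
```

With this shape, "1 failure out of 1 call" yields `None` rather than 100%, and the breaker only compares the rate against its threshold once the floor is met.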
“Failure” is per-call: define it carefully#
What counts as a failure? Naive answer: any exception or non-2xx response. The richer answer:
- Connection errors, timeouts, 5xx responses: yes, breaker-relevant failure.
- 4xx responses: usually no. A `400 Bad Request` is a client bug, not a server health signal. Tripping the breaker on 400s would cause one buggy caller to open the breaker for everyone else.
- 429 (rate limit): debatable. Some libraries treat it as a failure (the dependency is signalling “back off”); others as a non-failure (it’s a successful answer of “no”).
- Slow but successful responses: yes, often. Some breakers trip on latency too: if the p99 latency over the window exceeds a threshold, treat that as a soft failure even if responses are still 200.
Libraries in the Hystrix lineage, such as Resilience4j, expose this as `recordExceptionPredicate` / `recordResultPredicate`: the calling code decides what counts.
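The classification rules above can be written as a single predicate. This is a sketch under assumptions: the function name, the 2-second slow threshold, and the `count_429` switch are illustrative, and only Python's built-in `ConnectionError` and `TimeoutError` are treated as transport failures.

```python
from typing import Optional


def is_breaker_failure(status: Optional[int] = None,
                       exc: Optional[BaseException] = None,
                       latency_s: float = 0.0,
                       slow_threshold_s: float = 2.0,
                       count_429: bool = False) -> bool:
    """Decide whether one call should count against the dependency's health."""
    if exc is not None:
        # Transport-level problems say something about the dependency;
        # other exceptions (e.g. a local parsing bug) do not.
        return isinstance(exc, (ConnectionError, TimeoutError))
    if status is None:
        return False
    if status >= 500:
        return True
    if status == 429:
        return count_429          # debatable, so make it an explicit choice
    if status >= 400:
        return False              # client bug, not a health signal
    return latency_s > slow_threshold_s  # slow success counts as a soft failure
```

Keeping the decision in one function makes the policy reviewable: changing how 429s are treated is a one-argument change rather than a hunt through call sites.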
Hystrix’s legacy: bulkheads and timeouts#
Hystrix bundled the circuit breaker with two other patterns that complete the picture:
- Timeouts. Every downstream call has an aggressive timeout (often 1-3 seconds for synchronous APIs). Without this, a “slow” failure mode never trips the breaker because requests never complete.
- Bulkheads (thread isolation). Each downstream dependency gets its own thread pool. If one dependency goes slow, only its pool fills up; other dependencies’ calls have separate pools. This prevents one bad dependency from starving the entire upstream service of threads.
Modern libraries (Resilience4j, Polly) keep all three patterns but make thread isolation optional (use semaphore isolation instead for non-blocking IO).
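Semaphore isolation is simple enough to sketch directly. The `Bulkhead` class below is a hypothetical illustration, not the API of Resilience4j or Polly: it caps concurrent calls to one dependency and rejects immediately when the cap is reached, instead of letting callers queue.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the dependency already has its maximum concurrent calls."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject right away rather than wait for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many concurrent calls")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

One `Bulkhead` per dependency means a slow recommendations service can occupy at most its own quota of workers, leaving the rest free for other calls.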
Fallbacks — the breaker’s other half#
A circuit breaker only helps if the caller has something to do when the breaker is open. Common fallback patterns:
- Cached value. “If the recommendations service is down, return last-good recommendations from the cache.” Works for read-heavy paths.
- Default response. “If the personalisation service is down, return the generic homepage.” Slightly worse experience, but the system stays up.
- Empty result with a flag. “If the related-articles service is down, omit the section and add a `degraded: true` flag to the response.” Lets the UI hide the section.
- Clean error. Sometimes the only correct fallback is “return 503 with a clear message”. Failing fast is still better than hanging.
A breaker without a fallback path saves latency but not reliability.
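A fallback chain combining the cached-value, default-response, and degraded-flag patterns might look like the sketch below. The function name, the plain-dict cache, and the response shape are assumptions made for illustration; in practice `fetch` would be the breaker-wrapped call.

```python
from typing import Any, Callable, Dict


def fetch_with_fallback(fetch: Callable[[str], Any], cache: Dict[str, Any],
                        key: str, default: Any) -> Dict[str, Any]:
    """Try the live call, then the last-good cached value, then a default."""
    try:
        value = fetch(key)
    except Exception:
        # Breaker open or call failed: serve something usable and say so.
        if key in cache:
            return {"data": cache[key], "degraded": True}
        return {"data": default, "degraded": True}
    cache[key] = value  # remember the last good answer for next time
    return {"data": value, "degraded": False}
```

The `degraded` flag lets the caller or the UI decide whether to show stale data, hide a section, or surface a notice.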
Implementation — three-language example#
Wrapping an HTTP call in a circuit breaker:
```python
import requests
from pybreaker import CircuitBreaker, CircuitBreakerError


def _is_client_error(exc):
    """4xx responses are caller bugs, not a health signal; 5xx still count."""
    return (
        isinstance(exc, requests.exceptions.HTTPError)
        and exc.response is not None
        and exc.response.status_code < 500
    )


# Trip after 5 failures in a row; stay open for 30 seconds
recommendations_breaker = CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    exclude=[_is_client_error],  # 4xx not counted
)


def get_recommendations(user_id):
    try:
        return _call_recommendations(user_id)
    except CircuitBreakerError:
        # Breaker open: return cached/default value.
        # cached_recommendations_for is assumed to be defined elsewhere.
        return cached_recommendations_for(user_id) or []


@recommendations_breaker
def _call_recommendations(user_id):
    resp = requests.get(
        f"https://recs.internal/users/{user_id}",
        timeout=2,  # aggressive timeout is mandatory
    )
    resp.raise_for_status()
    return resp.json()["recommendations"]
```

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

var recommendationsBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:        "recommendations",
	MaxRequests: 5,                // probes when half-open
	Interval:    10 * time.Second, // counts reset every 10s while Closed
	Timeout:     30 * time.Second, // cooldown in Open
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.Requests >= 20 &&
			float64(counts.TotalFailures)/float64(counts.Requests) >= 0.5
	},
})

func getRecommendations(ctx context.Context, userID string) ([]string, error) {
	result, err := recommendationsBreaker.Execute(func() (any, error) {
		ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
		defer cancel()
		req, err := http.NewRequestWithContext(ctx, http.MethodGet,
			"https://recs.internal/users/"+userID, nil)
		if err != nil {
			return nil, err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return nil, fmt.Errorf("recs returned %d", resp.StatusCode)
		}
		var body struct {
			Recommendations []string `json:"recommendations"`
		}
		// Decode before reading the field; doing both in one return
		// statement leaves the evaluation order unspecified.
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			return nil, err
		}
		return body.Recommendations, nil
	})
	if err != nil {
		// Breaker open or call failed: fall back to cached.
		// cachedRecommendationsFor is assumed to be defined elsewhere.
		return cachedRecommendationsFor(userID), nil
	}
	return result.([]string), nil
}
```

```js
import CircuitBreaker from "opossum";

async function callRecommendations(userId) {
  const resp = await fetch(`https://recs.internal/users/${userId}`, {
    signal: AbortSignal.timeout(2_000),
  });
  if (!resp.ok) {
    const err = new Error(`HTTP ${resp.status}`);
    err.status = resp.status;
    throw err;
  }
  const body = await resp.json();
  return body.recommendations;
}

const recommendationsBreaker = new CircuitBreaker(callRecommendations, {
  timeout: 2_000, // aggressive request timeout
  errorThresholdPercentage: 50,
  resetTimeout: 30_000, // cooldown in Open
  volumeThreshold: 20, // minimum sample size before tripping
  // 4xx is a caller bug: reject, but don't count it against the dependency
  errorFilter: (err) => err.status >= 400 && err.status < 500,
});

// Fallback when the breaker is open or a counted call fails.
// cachedRecommendationsFor is assumed to be defined elsewhere.
recommendationsBreaker.fallback((userId) => cachedRecommendationsFor(userId) ?? []);

export function getRecommendations(userId) {
  return recommendationsBreaker.fire(userId);
}
```

Three things to notice across all three samples:
- Aggressive request timeout in addition to the breaker. Without it, slow responses never trip the breaker.
- Fallback is part of the contract. The breaker-open path returns something usable, not just an error.
- The `volumeThreshold` (Node), the `counts.Requests >= 20` floor (Go), and the consecutive-failure count `fail_max` (Python) each prevent tripping on a tiny sample.
Where the breaker lives — library vs proxy vs service mesh#
Three places to implement:
- In-process library. Hystrix, Resilience4j, Polly, gobreaker, opossum. Best fit for application logic that varies fallbacks per call site.
- Sidecar proxy or service mesh. Envoy, Istio, Linkerd. The proxy intercepts all outbound traffic and applies the breaker, with no application code change.
- API gateway. Kong, Zuul. Centralised, applies to all traffic flowing through the gateway.
The trend in microservices is toward the sidecar/mesh approach because it standardises the policy across services. The library approach survives where the fallback logic is application-specific (e.g., “return cached value” only makes sense in code, not at the proxy).
Variants#
| Variant | Mechanism | When it fits |
|---|---|---|
| Count-based | Trip after N failures in a row | Simple cases; bursty traffic |
| Rate-based | Trip when failure ratio over window exceeds threshold | Steady-state services |
| Latency-based | Trip when p99 over window exceeds threshold | Catches slow-failure (often the most damaging) |
| Adaptive | Threshold adjusts based on baseline | High-volume services where steady-state failure rate is non-zero |
| Forced-open / forced-closed | Manual override via flag | Incident response — operator can pre-emptively open or pin closed |
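The latency-based variant in the table can be sketched as a p99 check over the current window. The helper names, the nearest-rank percentile method, and the 2-second threshold are assumptions for illustration; real libraries differ in how they estimate percentiles.

```python
from typing import Sequence


def p99(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    ordered = sorted(latencies_s)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_trip_on_latency(latencies_s: Sequence[float],
                           p99_threshold_s: float = 2.0,
                           min_calls: int = 20) -> bool:
    """Trip when the window's p99 exceeds the threshold, given enough samples."""
    if len(latencies_s) < min_calls:
        return False
    return p99(latencies_s) > p99_threshold_s
```

Note how the percentile tolerates a single outlier: with 100 samples, one slow call leaves p99 untouched, while two slow calls push it over the threshold.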
Trade-offs#
What circuit breakers give you:
- Cascading-failure protection. A bad downstream stays contained instead of taking the whole system with it.
- Fast-fail latency. Calls during Open take microseconds instead of timing out at 30 seconds.
- Thread-pool protection. Threads aren’t tied up waiting on a dead dependency.
- Observability. Breaker state transitions are high-signal events to monitor.
What circuit breakers cost you:
- False positives. A flaky network during a healthy window can trip the breaker, denying service when the dependency is fine.
- Tuning complexity. Four+ knobs per breaker, per dependency. Tuning is real work.
- Fallback requirement. A breaker without a fallback is just a latency optimisation.
- State coordination. In multi-replica services, each replica has its own breaker state. Either accept that (per-replica breakers) or build a shared store (extra failure mode).
- Hard to test. Reproducing the failure modes that should trip the breaker (slow dependency, partial failure) is non-trivial in CI.
Common pitfalls#
- No timeout on the wrapped call. Slow failures never trip the breaker; threads pile up. Always set an aggressive request timeout in addition to the breaker.
- Tripping on 4xx. A buggy caller triggering 400s opens the breaker for every other caller. Filter the failure predicate carefully.
- No minimum sample size. Breaker trips on the first failure (“1 of 1 = 100%”). Always set a floor.
- No fallback. Breaker open just becomes a 503. The latency is better; the user experience isn’t.
- Cooldown too short. Probes hammer a dependency that’s still recovering. 30-60 seconds is a typical starting point for HTTP services; go longer for slower-recovering systems.
- Cooldown too long. Dependency recovers in 10 seconds but the breaker stays open for 5 minutes, rejecting calls that would have succeeded.
- No telemetry on state transitions. Without alerts on “breaker X opened”, you find out by reading the changelog after the postmortem.
- One breaker shared across endpoints. A failure in one endpoint trips for all. Use per-dependency or per-endpoint breakers.
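One way to avoid the shared-breaker pitfall is a small registry that hands out one breaker per dependency and endpoint. The sketch below is library-agnostic and hypothetical: `BreakerRegistry` takes a factory that builds whatever breaker object you use, and it passes a name so state-transition telemetry can be labelled per breaker. It is not thread-safe as written.

```python
from typing import Callable, Dict, Generic, Tuple, TypeVar

B = TypeVar("B")


class BreakerRegistry(Generic[B]):
    """Lazily creates and caches one breaker per (dependency, endpoint)."""

    def __init__(self, factory: Callable[[str], B]) -> None:
        self._factory = factory
        self._breakers: Dict[Tuple[str, str], B] = {}

    def get(self, dependency: str, endpoint: str) -> B:
        key = (dependency, endpoint)
        if key not in self._breakers:
            # The name ends up in logs and metrics, e.g. "recs:/users".
            self._breakers[key] = self._factory(f"{dependency}:{endpoint}")
        return self._breakers[key]
```

A failing `/search` endpoint then trips only its own breaker, while `/users` on the same dependency keeps flowing.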
Related building blocks#
- Managing Retries — the other half of the resilience pair; retries handle transient failures, breakers handle sustained ones.
- Rate Limiting — protects the server from clients; circuit breakers protect clients from the server.
- API Monitoring — breaker state transitions are the highest-signal events to alert on.
- What Causes API Failures — A Taxonomy — cascading failures are the canonical “why we use breakers” story.
- Caching at Different Layers — a populated cache is the most common fallback when a breaker is open.