Cloudflare — 2019 regex catastrophic backtracking

A single WAF rule with exponential regex backtracking burned 100% CPU across every edge node simultaneously.

Postmortem Foundational
7 min read
regex cpu waf global-deploy

Summary#

On 2 July 2019, Cloudflare suffered a 27-minute global outage. The cause was a single regular expression added to Cloudflare’s Web Application Firewall (WAF) ruleset. The regex, when evaluated against certain inputs, triggered catastrophic backtracking — its evaluation time grew exponentially with input length.

The rule was deployed globally in seconds. Every Cloudflare edge node simultaneously began consuming 100% CPU on the WAF process. HTTP traffic across the entire Cloudflare network came to a near-halt. Recovery required rolling back the rule globally and then restoring stability across the fleet.

The event is the canonical example of how a tiny change with no apparent risk profile can take down a global system if the change is deployed simultaneously to every node and the change has runtime behavior the deploy validation didn’t catch.

Timeline#

All times UTC. Reconstructed from Cloudflare’s own published postmortem.

  • 13:42 — a new WAF rule is deployed to mitigate emerging JavaScript-based attacks. The rule contains a regex with a nested quantifier (the canonical shape of catastrophic-backtracking expressions).
  • 13:42 — the rule propagates to every edge data center within seconds.
  • 13:42-13:43 — CPU usage on edge servers spikes to 100% as the regex engine spends exponential time evaluating the rule against typical HTTP request bodies.
  • 13:43 — alarms fire globally. Cloudflare engineers see CPU exhaustion across the fleet. HTTP traffic begins to fail with 502 errors and connection timeouts.
  • 13:44-13:55 — engineers triage; identify the new WAF rule as the proximate cause.
  • 13:56 — the WAF is disabled globally via a kill-switch (a separate config path from the WAF rule deployment system).
  • 13:56-14:09 — edge servers’ CPUs recover as the offending rule stops being evaluated. HTTP traffic resumes.
  • 14:09 — full recovery; total outage window approximately 27 minutes.

Root cause#

The regex (paraphrased; the exact form is described in Cloudflare’s postmortem) had the shape (?:.*?(?:....)*)*X. This pattern contains a nested quantifier (* inside another *), which can cause a backtracking regex engine to try exponentially many ways of matching when the final part X doesn’t match.

Concrete example of the problem class: matching (a+)+b against the input aaaaaaaaaaaaaaaaaaaaaaaaaaaaa (no b at the end). A naive backtracking engine tries every possible partition of the as between the inner and outer quantifier — 2^n partitions for n characters. With 30 as, that’s a billion attempts.

The rule was tested before deployment, but against inputs that didn’t trigger the worst case. The input dimension that matters for backtracking is not “does it match” but “what’s the maximum prefix that almost matches” — a property the test suite did not exercise.

Once deployed, the rule met real-world traffic. Many HTTP requests contained input bodies with the structure that triggered backtracking. Every match attempt cost seconds of CPU. With requests arriving at the rate they normally do, every CPU was saturated within ~1 second of the rule going live.

Cascading effects#

The outage’s blast radius was global because of three reinforcing factors:

  1. The WAF runs on the request path. Every HTTP request through Cloudflare passes through the WAF. A WAF that hangs hangs the request. Customer applications behind Cloudflare became unreachable even though their origin servers were healthy.

  2. The WAF deployment was simultaneous and global. New rules were pushed to every edge data center concurrently. There was no canary, no per-region rollout, no soak time. A bug in a rule was a global bug instantly.

  3. CPU exhaustion is bad in particular ways. A worker hung at 100% CPU on regex evaluation doesn’t release threads, doesn’t return errors, doesn’t fail fast. It holds onto the request slot until it completes (which it never does, for this regex) or is killed. Pipeline behind the WAF stalls.

Downstream effects included Cloudflare’s own internal tools — dashboards, status page, certain debugging utilities — all running through Cloudflare’s network. Engineers had to use out-of-band tooling to investigate (a planned-for scenario, but always more friction than the in-band path).

Cloudflare’s status page itself was reachable (it’s served via different infrastructure for exactly this reason) but the broader impact on customers was severe: many customer-facing services (Discord, Medium, news sites, Shopify storefronts, etc.) experienced outages downstream of the Cloudflare WAF.

What was fixed#

Per Cloudflare’s published postmortem:

  • Catastrophic-backtracking detection added to the WAF rule deployment pipeline. Rules with regexes that exhibit non-linear evaluation time are flagged or refused before deployment.
  • WAF engine migration: Cloudflare moved over time toward regex engines that have guaranteed linear-time evaluation (RE2-style, NFA-based without backtracking). For some classes of rule, this requires accepting reduced expressive power.
  • Staged WAF rule rollouts introduced. New rules deploy first to a subset of edge nodes; metrics are evaluated; rollout continues only if CPU and latency remain healthy.
  • CPU watchdog at the WAF level: any single request that takes more than N milliseconds is killed and the request continues with a fallback evaluation. This prevents one bad rule from monopolizing CPU.
  • Kill-switch verified: the global WAF disable path that was used to recover was already in place; the postmortem emphasizes that this kind of kill-switch must exist and be tested for every globally-deployed configuration class.

Lessons that generalize#

Concretely:

  • Catastrophic backtracking is a known shape, not bad luck. Regex patterns with nested quantifiers ((a+)+, (a|a)*, (.*)+) have a well-documented failure mode. Static analysis tools (rxxr, recheck, others) detect these patterns. Any system that accepts user-provided or operator-provided regexes should run this analysis before evaluation.

  • Linear-time regex engines exist and should be preferred for production filtering. RE2 (and the engines derived from it) guarantee O(n) evaluation by refusing features like backreferences. For 99% of WAF / log-parsing use cases, RE2 is sufficient and immune to this class of bug.

  • Global simultaneous deploys are convenient and dangerous. Anything pushed to every edge node at once has zero opportunity for canary detection. Even fast-iterating configuration systems benefit from staged rollouts. The trade-off is propagation latency — sometimes seconds matter (DDoS mitigation rules need to deploy fast) — but the design must include a way to validate during rollout, not just before.

  • Kill-switches for every globally-deployed change class. A separate code path that can disable the changed component, owned by a different deploy pipeline, accessible during the failure mode. Cloudflare had this for the WAF; it’s what enabled the 14-minute recovery instead of an hours-long one.

  • Runtime resource limits in shared workers. Any worker process serving traffic should have a CPU budget per request. If a request exceeds the budget, kill it. This converts “bad request takes forever and starves the worker” into “bad request fails fast and the worker keeps serving”. Many languages and runtimes make this easy (Go’s context.WithTimeout, Erlang processes, OS-level resource limits) but it has to be configured.

  • The blast radius of edge infrastructure is the entire internet. When an edge provider hosts a large fraction of the web (Cloudflare, Fastly, AWS CloudFront, Akamai), their incidents are widely visible. The operational discipline required scales with the blast radius — and at edge-CDN scale, treating any config push as a potential globally-correlated failure is the right mental model.

A subtler lesson: the asymmetry between change-time validation and runtime behavior is the source of most production surprises. A regex that compiles, looks fine, and passes a basic match test can still hang a CPU at runtime on specific inputs. A SQL query that executes in 10ms in test can lock a hot table in production. The way to bridge the gap is staged rollout with real-traffic exposure, not better static analysis.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.