Cloudflare 2019 — The Regex Outage
A bad WAF regex with catastrophic backtracking; 27 minutes of global edge downtime; the postmortem on safe deploys.
What happened
On 2 July 2019, between 13:42 and 14:09 UTC, Cloudflare’s edge network experienced a global 27-minute outage. Every Cloudflare data centre simultaneously hit 100% CPU utilisation; HTTP and HTTPS traffic to customer origins behind Cloudflare returned 502 Bad Gateway errors or simply timed out. Sites that depended on Cloudflare for their public surface — Discord, Medium, Coinbase, Shopify storefronts, Patreon, a large fraction of news sites — went dark for the duration.
The trigger was a single new rule added to Cloudflare’s Web Application Firewall (WAF) rule set as part of routine work on JavaScript attack mitigations. The rule contained a regular expression with catastrophic backtracking; once deployed globally, the regex engine consumed all available CPU evaluating it against typical HTTP request bodies. The published postmortem on the Cloudflare blog (blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/) is the canonical source.
Context
Cloudflare operates one of the largest CDN and edge-security networks on the Internet, with data centres in 200+ cities. Every HTTP request to a Cloudflare-fronted site is terminated at the nearest edge POP and passed through a pipeline: TLS termination, DDoS scrubbing, WAF, cache lookup, finally a fetch to the origin. The WAF inspects every request for patterns matching known attacks (SQL injection, XSS, file inclusion, malicious JavaScript) and blocks or challenges anything suspicious.
WAF rules are written in a Cloudflare-specific rule language that compiles down to regular expressions, executed by a regex engine (PCRE-compatible at the time). The WAF rule deployment pipeline was designed for speed — security researchers needed to push new mitigations globally within minutes when novel attacks emerged. The rule set was treated as data, not code: no canary deploys, no soak time, no per-region rollout.
The pipeline pushed new rule sets simultaneously to every edge data centre worldwide.
Trigger and propagation
A new rule was authored to detect a class of JavaScript-based attacks. The regex (paraphrased in the postmortem) had the shape (?:.*?(?:....)*)*. This is a textbook example of a pattern with nested unbounded quantifiers — the * inside another * — which can cause a backtracking regex engine to try exponentially many ways of matching against inputs that almost-but-don’t-quite match.
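To make the growth concrete, the sketch below counts the attempts a naive backtracking matcher makes for the textbook pattern ^(a+)+$. It is a toy model written for this illustration (the name count_attempts is hypothetical); it is not PCRE and not the rule from the outage.

```python
def count_attempts(text: str) -> int:
    """Count the split points a naive backtracking matcher tries for ^(a+)+$."""
    attempts = 0

    def match_groups(pos: int) -> bool:
        nonlocal attempts
        # Inner a+: find the end of the run of 'a' starting at pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy first, then backtrack to shorter and shorter runs.
        for stop in range(end, pos, -1):
            attempts += 1
            if stop == len(text):      # '$' matches: overall success
                return True
            if match_groups(stop):     # outer '+': try another repetition
                return True
        return False

    match_groups(0)
    return attempts


# Matching input vs. "almost matching" input (same run of a's, then a 'b').
for n in (4, 8, 12, 16):
    print(n, count_attempts("a" * n), count_attempts("a" * n + "b"))
```

In this model a fully matching input succeeds on the first attempt, while n copies of "a" followed by "b" cost 2^n - 1 attempts before the matcher gives up. Each extra character doubles the work, which is why near-miss inputs are the dangerous ones.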
The rule passed pre-deploy validation. It matched the test inputs the author had in mind. The validation framework did not include an analysis for backtracking complexity, and no fuzzer was throwing pathological “almost matching” inputs at it.
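As a rough idea of the kind of check the validation step lacked, here is a minimal heuristic that flags an unbounded quantifier applied to a group that already contains one. The function name is hypothetical and the scan is deliberately crude: it ignores {n,} bounds and overlapping alternations such as (a|a)*, and it flags some safe patterns such as (a+b)+. Dedicated analysers are far more rigorous.

```python
def has_nested_unbounded_quantifier(pattern: str) -> bool:
    """Heuristic: is * or + applied to a group whose body contains * or +?"""
    stack = [False]          # per open group: has its body seen * or + yet?
    prev_group_inner = None  # set only when the previous token was ')'
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        just_closed = prev_group_inner
        prev_group_inner = None
        if ch == "\\":                      # escaped character: skip it
            i += 2
            continue
        if ch == "[":                       # character class: skip to its ']'
            i += 1
            if i < len(pattern) and pattern[i] == "^":
                i += 1
            if i < len(pattern) and pattern[i] == "]":
                i += 1
            while i < len(pattern) and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
            i += 1
            continue
        if ch == "(":
            stack.append(False)
        elif ch == ")":
            if len(stack) > 1:
                inner = stack.pop()
                stack[-1] = stack[-1] or inner   # quantifiers nest upward
                prev_group_inner = inner
        elif ch in "*+":
            if just_closed:                 # quantified group with inner * or +
                return True
            stack[-1] = True
        i += 1
    return False


for candidate in [r"(a+)+", r"(?:.*?(?:....)*)*", r"a+b*", r"\(a+\)+"]:
    print(candidate, has_nested_unbounded_quantifier(candidate))
```

A check like this is cheap enough to run on every rule before deploy. Its false positives are the price of not needing to execute the pattern at all.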
At 13:42 UTC the rule was deployed globally. Within seconds it propagated to every edge POP. Real-world HTTP traffic immediately began hitting the regex against bodies that triggered the worst-case backtracking. Each match attempt consumed seconds of CPU time. The Cloudflare worker processes were single-threaded per request; a worker stuck in a backtracking loop didn’t release its slot. New requests piled up behind the wedged workers. Within roughly one second of the rule going live, every CPU on every edge node was at 100%.
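The pile-up can be illustrated with a toy discrete-time model of a fixed worker pool. All numbers and names below are made up for the illustration; this is not a model of Cloudflare's actual architecture.

```python
from typing import List


def simulate(workers: int, arrivals_per_tick: int,
             service_ticks: int, ticks: int) -> List[int]:
    """Return the number of queued (unserved) requests after each tick."""
    busy: List[int] = []   # remaining ticks for each in-flight request
    queue = 0
    backlog = []
    for _ in range(ticks):
        queue += arrivals_per_tick
        # Requests with one tick left finish now; the rest keep their worker.
        busy = [t - 1 for t in busy if t > 1]
        # Hand queued requests to whatever workers are free.
        started = min(workers - len(busy), queue)
        busy.extend([service_ticks] * started)
        queue -= started
        backlog.append(queue)
    return backlog


print(simulate(workers=4, arrivals_per_tick=2, service_ticks=1, ticks=5))
print(simulate(workers=4, arrivals_per_tick=2, service_ticks=1000, ticks=5))
```

With a one-tick service time the backlog stays at zero. Once every request wedges its worker, the pool fills in two ticks and the backlog then grows by the full arrival rate every tick, with no recovery until the wedged work is removed.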
13:42:00 Rule deploys to all POPs
13:42:01 CPU hits 100% globally
13:42:05 HTTP error rate spikes; 502s flooding
13:43-13:55 Engineers investigate, identify the WAF as the proximate cause
13:56 WAF disabled globally via kill-switch
13:56-14:09 CPU drains, traffic resumes
14:09 Full recovery

Detection and response
Detection was immediate — Cloudflare’s monitoring dashboard lit up across every region within seconds of the deploy. The challenge was diagnosis. With every edge node at 100% CPU, internal Cloudflare tools that ran through the same edge infrastructure (dashboards, debugging utilities) were themselves slow or unreachable. Engineers had to fall back to out-of-band access: SSH to specific machines via paths that didn’t traverse the affected workers.
Triage took ~14 minutes. The team noticed the timing correlated with a WAF deploy, identified the responsible rule, and used the global WAF kill-switch — a separate deploy path from the rule pipeline, specifically designed for exactly this scenario — to disable the entire WAF. This was the right call: rather than trying to identify the bad rule precisely and risking a wrong guess, they removed the whole problematic subsystem and let traffic flow.
CPU utilisation dropped within seconds. Within 13 more minutes of natural drain (timeouts, retries, queue clearing), traffic was fully recovered.
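The value of the kill-switch comes from where it sits: it is checked before any rule runs and is set through a path that does not depend on the rule pipeline. A minimal sketch of that shape, with hypothetical class and function names, might look like this:

```python
from dataclasses import dataclass, field
from typing import Callable, List

Rule = Callable[[str], bool]   # returns True if the request should be blocked


@dataclass
class KillSwitch:
    """Hypothetical flag, delivered by a path separate from rule deploys."""
    waf_disabled: bool = False


@dataclass
class Waf:
    rules: List[Rule] = field(default_factory=list)

    def inspect(self, body: str) -> bool:
        return any(rule(body) for rule in self.rules)


def handle_request(body: str, waf: Waf, switch: KillSwitch) -> str:
    # The switch is consulted first, so a pathological rule is never evaluated.
    if switch.waf_disabled:
        return "pass"
    return "block" if waf.inspect(body) else "pass"


waf = Waf(rules=[lambda body: "<script" in body])
print(handle_request("<script>alert(1)</script>", waf, KillSwitch()))
print(handle_request("<script>alert(1)</script>", waf, KillSwitch(waf_disabled=True)))
```

The second call shows the cost of the design: while the switch is on, traffic flows but nothing is inspected. That is an acceptable trade during an outage and a reason to keep the window short.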
Root cause
The proximate cause was a regex with catastrophic backtracking. The deeper cause was a deployment pipeline that treated WAF rules as data and pushed them globally and simultaneously, with no canary, no staged rollout, and no automated rollback triggered on edge health metrics.
A regex is a program. It has runtime behaviour, complexity properties, and failure modes invisible from inspecting the source. Treating it as static configuration meant the pipeline never asked “what happens if this consumes more CPU than expected?” — a question every code deploy pipeline asks via canaries and gradual rollouts.
A secondary contributing factor: the regex engine in use (PCRE, with backtracking semantics) admits the catastrophic-backtracking failure mode. Linear-time regex engines (RE2 and other automaton-based implementations that do not backtrack) cannot exhibit this failure: they refuse certain features, such as backreferences, in exchange for evaluation time that is guaranteed to be linear in the input length, O(n). Cloudflare’s choice of a more expressive engine accepted the risk; the deploy pipeline did not have guardrails to contain it.
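The reason automaton-based matching stays linear can be shown on a single pattern. The sketch below hand-builds a two-state NFA for ^(a+)+$ and advances a set of states one character at a time. It illustrates the idea only: it is not how RE2 is implemented, and the table and function names are assumptions of this example.

```python
from typing import Dict, Set, Tuple

# State 0: nothing consumed yet. State 1: at least one 'a' consumed.
# "Continue the inner a+" and "start a new outer repetition" both land in
# state 1, so the exponentially many backtracking paths collapse into one.
TRANSITIONS: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {1},
    (1, "a"): {1},
}
START = 0
ACCEPTING = {1}


def nfa_match(text: str) -> Tuple[bool, int]:
    """Return (matched, steps) for ^(a+)+$ using state-set simulation."""
    states = {START}
    steps = 0
    for ch in text:
        next_states: Set[int] = set()
        for state in states:
            steps += 1
            next_states |= TRANSITIONS.get((state, ch), set())
        states = next_states
        if not states:          # no live state left: fail immediately
            return False, steps
    return bool(states & ACCEPTING), steps


print(nfa_match("a" * 20))
print(nfa_match("a" * 20 + "b"))
```

The work per character is bounded by the number of NFA states, so the near-miss input costs 21 steps here rather than roughly a million. The price is expressiveness: features such as backreferences cannot be represented as a fixed set of states.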
Lessons and changes
Per Cloudflare’s published postmortem, the following changes were made:
- Static analysis for backtracking patterns. Tools like rxxr and recheck flag regexes with nested quantifiers. Added to the rule deploy pipeline; rules failing analysis are rejected.
- Migration toward linear-time regex engines. For new rules and rewritten existing rules where feature compatibility allows. RE2 cannot exhibit catastrophic backtracking.
- Staged WAF rule rollouts. New rules deploy first to a single canary POP, then to a region, then globally — with metrics gating each step.
- Per-request CPU watchdog at the WAF layer. A request that exceeds a CPU budget (single-digit milliseconds) is killed and the request continues with a fallback evaluation path. Converts “one bad rule pegs every CPU” into “one bad request fails fast”.
- Kill-switch made explicit and tested. The WAF kill-switch existed and worked, but the team formalised it as a required capability for every globally-deployed subsystem: a separate path, owned by a separate pipeline, that can disable the component during a failure mode.
- Better out-of-band tooling. Internal tools that needed to work during an edge-wide outage were moved off the affected path.
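The staged rollout with metric gates from the list above can be sketched as a small loop. Everything here is illustrative: the POP names, the 80% CPU threshold, and the deploy, rollback, and read_cpu callables are assumptions of the example, not details from the postmortem.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[List[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    read_cpu: Callable[[str], float],
    cpu_limit: float = 0.80,
) -> bool:
    """Deploy stage by stage; undo everything if any POP in a stage runs hot."""
    deployed: List[str] = []
    for stage in stages:
        for pop in stage:
            deploy(pop)
            deployed.append(pop)
        # Metric gate: a real pipeline would wait for a soak period here.
        if any(read_cpu(pop) > cpu_limit for pop in stage):
            for pop in reversed(deployed):
                rollback(pop)
            return False
    return True


# Fake fleet for illustration: a bad rule pegs the CPU wherever it is live.
live = set()
stages = [["canary-pop"], ["eu-1", "eu-2"], ["us-1", "ap-1"]]
ok = staged_rollout(
    stages,
    deploy=live.add,
    rollback=live.discard,
    read_cpu=lambda pop: 1.0 if pop in live else 0.2,
)
print(ok, sorted(live))
```

In this run the bad rule only ever reaches the canary POP, and the rollout returns False with nothing left deployed. The same rule pushed to every POP at once would have had no such checkpoint.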
What it teaches in general
Generalising:
- Global simultaneous deploys are convenient and dangerous. Anything pushed to every node at once has zero opportunity for canary detection. Even fast-iterating systems need staged rollout. The trade-off is propagation latency, and sometimes seconds matter, but the design must include some way to validate during rollout, not just before.
- Catastrophic backtracking is a known shape, not bad luck. Patterns with nested quantifiers or overlapping alternatives under a quantifier, such as (a+)+, (.*)+, and (a|a)*, have a documented failure mode. Static analysis catches them. Any system that accepts operator-provided or user-provided regexes should run this analysis before evaluation.
- Linear-time regex engines exist; prefer them for production filtering. RE2 (Google) and engines derived from it guarantee O(n) evaluation by refusing backreferences and some other features. For most WAF and log-parsing use cases, RE2 is sufficient and immune to this class of bug.
- Runtime resource limits in shared workers. Any worker process serving traffic should have a per-request CPU budget. If a request exceeds it, kill it. This converts “bad input takes forever and starves the worker” into “bad input fails fast and the worker keeps serving”.
- The blast radius of edge infrastructure is the Internet. When an edge provider hosts a large fraction of the web (Cloudflare, Fastly, AWS CloudFront, Akamai), their incidents are widely visible. Operational discipline scales with blast radius.
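The per-request budget from the list above can be sketched by giving a matcher a step counter and aborting when it runs out. The example uses a toy backtracking matcher for ^(a+)+$ and counts steps as a stand-in for CPU time; the names and the 10,000-step limit are assumptions of this illustration.

```python
class BudgetExceeded(Exception):
    """Raised when a single match attempt uses more steps than allowed."""


def match_with_budget(text: str, max_steps: int) -> bool:
    """Toy backtracking match of ^(a+)+$ that aborts after max_steps attempts."""
    steps = 0

    def groups(pos: int) -> bool:
        nonlocal steps
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        for stop in range(end, pos, -1):
            steps += 1
            if steps > max_steps:
                raise BudgetExceeded(f"gave up after {max_steps} steps")
            if stop == len(text) or groups(stop):
                return True
        return False

    return groups(0)


def inspect(body: str, max_steps: int = 10_000) -> str:
    try:
        return "block" if match_with_budget(body, max_steps) else "pass"
    except BudgetExceeded:
        # The caller decides the fallback: fail open, fail closed, or retry
        # on a cheaper rule set. The worker itself is free again either way.
        return "budget_exceeded"


print(inspect("aaa"))
print(inspect("aab"))
print(inspect("a" * 30 + "b"))
```

Without the budget, the last input would need about a billion attempts in this model. With it, the request is abandoned after 10,000 and the worker moves on to the next one.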
Was the recovery time of 27 minutes good or bad?
Mixed. The fact that a global outage was recovered without manual rollback of the offending rule — by using a pre-built kill-switch — is excellent. The 14 minutes to identify the WAF as the cause is mediocre; many monitoring systems would correlate “WAF rule deploy at 13:42” with “CPU saturation at 13:42:05” automatically. The fact that internal tools required out-of-band access to function during the event is a known cost of running monitoring on the same infrastructure you monitor.
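The automatic correlation mentioned above needs very little machinery: a list of recent change events and the time an anomaly began. The sketch below assumes a hypothetical in-memory deploy log and a five-minute look-back window. Only the 13:42 deploy time and the 13:42:05 saturation time come from this write-up; the other entry is invented for contrast.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def suspect_deploys(
    deploys: Sequence[Tuple[str, datetime]],
    anomaly_start: datetime,
    window: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Names of deploys that landed within `window` before the anomaly,
    most recent first."""
    recent = [
        (ts, name)
        for name, ts in deploys
        if timedelta(0) <= anomaly_start - ts <= window
    ]
    return [name for ts, name in sorted(recent, reverse=True)]


deploys = [
    ("dns-config-change", datetime(2019, 7, 2, 11, 10, 0)),  # hypothetical
    ("waf-rule-deploy", datetime(2019, 7, 2, 13, 42, 0)),
]
cpu_saturated_at = datetime(2019, 7, 2, 13, 42, 5)
print(suspect_deploys(deploys, cpu_saturated_at))
```

A check like this does not prove causation, but putting the most recent change next to the alert gives responders the first thing to look at.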
Related material#
- Cloudflare’s published postmortem on the company blog is the canonical source.
- Facebook 2021 — The BGP Withdrawal — different layer, similar shape: a single global change with no soak time.
- AWS us-east-1 2017 — The S3 Outage — also small-trigger / large-blast-radius, but the trigger was a typo’d command rather than a bad rule.
- HTTP — Requests, Responses, Status Codes, Headers
- TCP Fundamentals — Header, Handshake, Teardown