Bug Hunting and Root-Cause Analysis — Claude Code

The workflow#

Reproduce the bug deterministically. Isolate the smallest failing input. Trace until you understand why it fails, not just where. Fix the cause, not the symptom. Pin the fix with a regression test that fails on the old code and passes on the new. Commit.

Five steps. The order is load-bearing. The most common failure mode in bug hunting — with Claude or without — is jumping from “I see a symptom” to “I changed something and the symptom stopped.” That’s not a fix; that’s a coincidence with a commit hash.

When to reach for it#

This is the workflow for any bug serious enough that you’d be embarrassed to see it again:

Production incidents. A regression test is non-negotiable. If you can’t reproduce it locally, you can’t trust the fix.
Heisenbugs — intermittent, race-condition, timing-dependent. These are exactly the ones you must root-cause, because “I made the symptom go away” usually means you made it less frequent, not absent.
Cross-cutting bugs. Anything that involves more than one module, or where the symptom is far from the cause. Auth bug surfacing as a 500 in a search handler, that kind of thing.
Bugs in code you’re about to refactor. Fix the bug first, with a test pinning the behaviour, then refactor with the test as your safety net.

Skip the heavy version of this workflow for trivial typos (off-by-one in a log message), purely cosmetic issues (wrong color), or tests that are flaky for known reasons (CI runner under load). In those cases, an inline edit and a quick verification is fine.

Step-by-step#

1. Get a deterministic reproducer#

Before anything else, you need to be able to reliably reproduce the bug. This is the single most-skipped step in bug hunting and the one that pays the highest dividends.

For a backend bug, that’s a curl invocation, a unit test, or a script that exhibits the failure every run. For a frontend bug, a sequence of clicks and a URL. For a heisenbug, a reproduction script that runs 100 iterations and fails on at least one.

Hand the reproducer to Claude:

Running pnpm test src/api/search.test.ts -t "empty query" fails on main with the error below. Reproduce it locally and confirm you see the same failure before we start digging.

If Claude can’t reproduce, the environment differs from yours in a way that matters. Fix that first. Most often it’s missing env vars, a different node version, or a stale database snapshot.

2. Minimize the reproducer#

Once you have something that reliably fails, shrink it. Remove inputs, headers, fields, request bodies — anything not strictly needed to provoke the failure. Each removal is a hypothesis: “I don’t think this matters.” When the failure goes away, the thing you just removed does matter.

A minimized reproducer of three lines tells you more than a reproducer of two hundred. Claude is excellent at this — ask it to bisect by half, repeatedly, until the reproducer is irreducible.

3. Trace, don’t guess#

Now you understand the symptom. You don’t yet understand the cause. The temptation — for humans and for models — is to skip from “I see X” to “I bet it’s Y” to “I changed Y.” That’s how you get a fix that works on your machine and breaks in production.

Instead, trace. Ask Claude to walk the call path from the entry point to the failing line, reading every function, listing every state mutation. Insert log statements at suspicious points. Set breakpoints if you’re in an IDE. Watch what actually happens, not what you assume happens.

Claude is excellent at this kind of patient reading — it doesn’t get bored. Let it.

4. Form a hypothesis, then disprove it#

When you have a candidate root cause, try to prove it wrong. State the hypothesis explicitly:

Hypothesis: the bug happens because normalizeQuery returns null when the input is the empty string, and the downstream caller in runSearch doesn’t null-check before calling .toLowerCase().

Then write a test that fails if the hypothesis is true — call normalizeQuery('') and assert the return value. If you see null, you’ve confirmed the hypothesis. If you see something else, your hypothesis was wrong and you would have spent the next hour fixing nothing.

This is the step where Claude’s habit of “looks plausible” becomes dangerous. Always check the hypothesis with a real probe, not by re-reading the code.

5. Fix the cause, not the symptom#

Symptom fixes look like: “throw a try/catch around the failing line.” Cause fixes look like: “make normalizeQuery return an empty string instead of null when the input is empty, since that’s the documented contract.”

The test for whether you’re fixing the cause: would a different downstream caller, that doesn’t yet exist, hit the same bug? If yes, you’re patching one site of the symptom and leaving the cause in place. Fix the cause.

There’s a real cost to cause-fixes — they sometimes touch more files, change a contract, ripple. That’s usually right. The alternative is a graveyard of try/catches around every call site.

6. Pin the fix with a regression test#

Write a test that fails on the old code and passes on the new code. Run it against the broken commit (git stash, or run the test before applying the fix) to verify it actually fails. Run it against the fixed commit to verify it passes.

If you don’t pin it, the bug will come back. Maybe not from your hand — from a refactor six months from now by someone who never knew the original failure mode. The regression test is the institutional memory.

7. Read the surrounding code for sibling bugs#

When you’ve root-caused a bug, you’ve often discovered a pattern — a wrong assumption, a missing null check, a confused contract. Ask Claude to search the codebase for the same pattern. There’s usually one or two more instances of the same root cause hiding elsewhere.

This is where bug hunting transitions to small refactor, and where you save your future self from filing the same bug twice.

8. Commit with a message that explains the cause#

The commit message is your second piece of institutional memory after the regression test. It should answer: what was broken, why, and what changed. Not “fix bug” — describe the cause in one sentence so the next person reading git blame understands the why, not just the what.

Anti-patterns#

The shapes that don’t work:

“It works now, ship it.” You made a change. The bug went away. You don’t know why. This is the worst possible state — you’ve spent the debug budget and bought no understanding. Six months later the bug returns and you have no notes.
Diff-storm fixing. Changing five things at once and seeing the symptom disappear. Now you don’t know which of the five mattered. Change one thing at a time when probing; bundle only after you understand.
Skipping the minimization step. A 200-line reproducer hides the cause in noise. Five minutes of bisecting reveals which five lines matter, and the cause is usually obvious once the noise is gone.
Letting Claude propose fixes before reproducing. The model is excellent at pattern-matching to plausible causes, but plausible is not actual. Without a reproducer, you have no way to distinguish “guess that turned out to be right” from “guess that papered over the symptom.”
Symptom suppression. A try/catch swallowing the actual error so the surface goes quiet. Logs you’ll never read because the exception was eaten. This is debt that compounds — every silent failure makes the next bug harder to find.
No regression test. “The fix is obvious, I don’t need a test.” Maybe. But the next person to touch this code doesn’t know it’s obvious, and they will absolutely reintroduce the bug if there’s no test guarding it.
Trusting Claude’s confidence. Claude will sometimes state a cause definitively when the evidence is circumstantial. Always have a probe — a print statement, a test, a debugger value — that confirms the hypothesis before you accept it.
Bug-hunting in plan mode. Plan mode is for designing changes, not for forming and testing hypotheses. You need to run things to debug. Use the normal conversation loop with appropriate verification cadence.

Evaluation#

How do you know the workflow is working?

Time-to-reproducer is short. Five to fifteen minutes for most bugs. If you’re an hour in and still can’t reproduce, the rest of the workflow is blocked — focus there.
Hypothesis-to-evidence ratio is one-to-one. Every claim about the cause is backed by a concrete probe value, not just code-reading intuition. If you’re forming three hypotheses and disproving zero, you’re guessing.
Fixes touch the cause, not the call site. Most fixes change a contract, a default, or an invariant — not a single try/catch at the symptom location. If your fix commits are mostly try/catches, you’re symptom-suppressing.
Regression tests exist for every shipped fix. Always. If the test would be hard to write, the bug is in something you don’t understand yet — keep digging.
You discover sibling bugs. Roughly half the time, root-causing one bug reveals one or two more instances of the same pattern. If that’s never happening, you’re not really finding the root cause.
The same bug doesn’t return. The truest evaluation is the longest one: a year later, no one has reopened the issue. If old bugs keep coming back, the regression tests aren’t pinning what you thought they were pinning.

Root-cause-driven hunting. Reproduce, minimize, trace, prove the cause, fix it, pin it. Slow on the first bug; fast on the second instance, because you find sibling bugs along the way. Builds permanent regression coverage.

Make-it-go-away hunting. See symptom, guess cause, change thing, symptom gone, ship it. Fast on the surface; slow over time, because the cause is still there and surfaces somewhere else. No regression coverage; same bug returns.

The specific prompt I use to get Claude into bug-hunt mode

When opening a bug-hunt session, I paste the failing test or error output and then explicitly say: “Before proposing any fix, I want you to (1) reproduce the failure locally and confirm you see the same symptom, (2) minimize the reproducer, (3) trace the failing call path and tell me what’s actually happening, and (4) state a hypothesis I can probe with a print statement. Don’t change any production code until I confirm the hypothesis.” That single paragraph is worth a 10x in bug-fix quality because it routes Claude through the same five steps that work for humans, instead of letting it shortcut to a guess-and-patch.