Refactoring with Claude Code

The workflow

Pin behaviour with tests. State the refactor’s intent (the shape, not the steps). Let Claude restructure while the suite stays green. Commit at every green state. If the change is too large for a single PR, split along natural seams — one PR per seam — and merge in order.

The unlock with Claude Code is that the typing cost of a large refactor collapses. What used to be “two days of mechanical edits I’ll never finish” becomes “an afternoon of designing the new shape and watching the model apply it.” But the mechanical-cost collapse changes nothing about the review cost or the risk cost. Those still scale with diff size, and they’re where most refactoring projects die.

When to reach for it

Refactoring with Claude Code shines on:

Renames at scale. A function, type, or module renamed across hundreds of call sites. Claude finds and updates them; you review the diff for collateral.
Pattern migrations. Switching from callbacks to async/await, from class components to function components, from one logging library to another. The shape is mechanical but tedious.
Extracting modules. Pulling 800 lines of related logic out of a god-file into its own module, with a clean public surface. Claude does the surgery; you choose the surface.
Tightening contracts. Adding stricter types, narrowing return types, replacing any with concrete types. Mechanically uniform, semantically careful — exactly the model’s strength.
Test refactors. Replacing a brittle test pattern with a better one across the suite. Coverage is preserved by design.
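
As a concrete picture of the "tightening contracts" case above, here is a minimal sketch. The function names (`parseQuantityLoose`, `parseQuantity`) and the parsing rules are hypothetical, invented for illustration. The point is only that the runtime behaviour stays the same while the signature narrows.

```typescript
// Hypothetical "before": accepts anything, returns anything.
function parseQuantityLoose(input: any): any {
  const n = Number(input);
  return isFinite(n) && Math.floor(n) === n && n > 0 ? n : null;
}

// Hypothetical "after": identical runtime behaviour, narrower contract.
function parseQuantity(input: string): number | null {
  const n = Number(input);
  return isFinite(n) && Math.floor(n) === n && n > 0 ? n : null;
}

// Inputs on which both versions must agree before the old one is removed.
const samples = ["3", "0", "-2", "1.5", "abc", ""];
const agree = samples.every((s) => parseQuantityLoose(s) === parseQuantity(s));
```

Once the signature is narrowed, call sites that were passing non-string values stop compiling. That is the collateral worth reviewing in the diff.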

Skip — or de-scope — the workflow for:

Architectural rewrites with no tests. No tests means no safety net. Don’t refactor until you have one, or until you accept the risk explicitly.
Code you don’t understand. “I’ll refactor this and then I’ll understand it.” Wrong order. Read first, refactor second. Claude can help with the reading.
Refactors masking bug fixes. A refactor PR is for behaviour-preserving changes. If you’re also fixing a bug, fix the bug in its own PR with its own regression test.

Step-by-step

1. Establish behavioural pinning

Before changing a line of structure, you need a test suite that pins the observable behaviour. Not the implementation — the behaviour. Inputs in, outputs out. Side effects you care about, observed at the boundary.

If you don’t have that suite, write it first. This is the single highest-leverage step in refactoring and the one most often skipped. Claude is excellent at writing characterization tests — point it at the module and ask it to test “what the code currently does,” not “what the code should do.” See Test-Driven Development with Claude Code for the tactical mechanics.

The output of this step is a green suite that will go red if the refactor changes behaviour.
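
A characterization test for this step might look like the sketch below. Everything in it is hypothetical: `processOrder` stands in for whatever legacy function you are pinning, the coupon and rounding rules are invented, and the tiny `check` helper stands in for your real test runner. The expected values are recorded from the current behaviour, quirks included, and say nothing about what the code should do.

```typescript
interface Order {
  items: { price: number; qty: number }[];
  coupon?: string;
}

// Hypothetical legacy function standing in for the code under refactor.
function processOrder(order: Order): number {
  let total = 0;
  for (const item of order.items) {
    total += item.price * item.qty;
  }
  // Quirk: the coupon check is case-sensitive. Pin it; do not "fix" it here.
  if (order.coupon === "SAVE10") {
    total = total * 0.9;
  }
  return Math.round(total * 100) / 100;
}

// Minimal stand-in for a test runner: collects failures instead of throwing.
const failures: string[] = [];
function check(label: string, actual: number, expected: number): void {
  if (actual !== expected) {
    failures.push(label + ": expected " + expected + ", got " + actual);
  }
}

// Expected values were recorded by running the current code.
check("empty order", processOrder({ items: [] }), 0);
check("plain order", processOrder({ items: [{ price: 10, qty: 3 }] }), 30);
check("coupon applied", processOrder({ items: [{ price: 10, qty: 3 }], coupon: "SAVE10" }), 27);
check("lowercase coupon ignored", processOrder({ items: [{ price: 10, qty: 3 }], coupon: "save10" }), 30);
```

The last case is the interesting one. A lowercase coupon being ignored may well be a bug, but during a refactor it is behaviour to preserve, and any fix belongs in its own PR.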

2. State the target shape, not the steps

Bad: “First rename processOrder to handleOrder, then move it to src/orders/, then update all call sites, then update the tests.”

Good: “The order-processing logic should live in src/orders/, exposed as handleOrder. Current name is processOrder and it lives in src/utils.ts. Rest of the codebase imports it.”

You’re describing the destination, not the route. Claude plans the route. This matters because the steps you’d guess are often the wrong ones: the model can survey call sites and dependencies that you don’t have in your head at the moment you start.

3. Start in plan mode for anything non-trivial

If the refactor is bigger than a single file, drop into plan mode first. Have Claude survey the call sites, identify dependencies, flag risky areas (dynamic imports, string-keyed lookups, generated code), and propose a sequence.

The plan is a contract. Read it. Edit it. Approve it. Then leave plan mode and execute. The 15 minutes spent on a plan saves the two hours spent on a refactor that took the wrong path.
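
To see why string-keyed lookups earn a place on that risk list, consider this sketch. The registry, the `dispatch` helper, and the payload are all hypothetical. A symbol rename updates the function and its direct references, but nothing ties the string key to the function name, so the type-checker stays quiet whether or not the key changes.

```typescript
type Handler = (payload: string) => string;

// State after a hypothetical rename from processOrder to handleOrder.
function handleOrder(payload: string): string {
  return "handled:" + payload;
}

// The registry key is a plain string. The rename left it untouched, and
// external callers that send "processOrder" depend on it staying that way.
const handlers: { [key: string]: Handler } = {
  processOrder: handleOrder,
};

function dispatch(key: string, payload: string): string {
  const handler = handlers[key];
  if (handler === undefined) {
    throw new Error("no handler registered for " + key);
  }
  return handler(payload);
}

// A test that pins the string key at the boundary catches an accidental key rename.
const pinned = dispatch("processOrder", "order-42");
```

Whether the key should follow the rename is a decision for the plan to make. A mechanical pass should not make it by accident.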

4. Take the first slice

A refactor of any size is a sequence of small, individually-green steps. The first step should be the smallest verifiable change that makes meaningful progress — typically a rename, an extraction, or a single call-site update done as a proof of concept.

Run the suite. Confirm it’s green. Read the diff. Commit.

This first slice is also a probe: if it was harder than expected, the rest of the refactor will be harder than expected too. Re-plan now, not after five wrong slices.

5. Repeat one slice at a time

Each subsequent slice extends the refactor by another natural seam — one module, one type, one pattern. Each slice has the same invariants: suite stays green, diff is reviewable, commit lands cleanly.

Resist the urge to “just do all the call sites in one go.” A 30-file diff might be technically correct but is unreviewable. Reviewers approve unreviewable diffs by skim-reading, and that’s how regressions ship.

If a slice goes red, do not move on. Either revert the slice and re-plan, or fix the failure right now. Red commits during a refactor turn into “I’ll fix it later” and “later” never comes.
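
The loop described in steps 4 and 5 can be sketched as a small driver. This illustrates the invariant and is not a Claude Code feature: the `Slice` type and the `runSuite`, `commit`, and `revert` callbacks are hypothetical stand-ins for your test command and version-control operations.

```typescript
interface Slice {
  label: string;
  apply: () => void; // performs the structural edit for this slice
}

interface SliceOps {
  runSuite: () => boolean; // true when the suite is green
  commit: (message: string) => void;
  revert: () => void;
}

// Applies slices in order. Stops at the first red suite and reverts that slice.
// Returns the labels of the slices that were committed.
function runSlices(slices: Slice[], ops: SliceOps): string[] {
  const committed: string[] = [];
  for (const slice of slices) {
    slice.apply();
    if (!ops.runSuite()) {
      ops.revert(); // never build on a red state; re-plan instead
      break;
    }
    ops.commit("refactor: " + slice.label);
    committed.push(slice.label);
  }
  return committed;
}

// Usage with fake operations: the second slice breaks the suite.
let broken = false;
const log: string[] = [];
const committedSlices = runSlices(
  [
    { label: "rename", apply: () => {} },
    { label: "move", apply: () => { broken = true; } },
    { label: "cleanup", apply: () => {} },
  ],
  {
    runSuite: () => !broken,
    commit: (m) => { log.push("commit " + m); },
    revert: () => { broken = false; log.push("revert"); },
  },
);
```

In the fake run, only the first slice is committed. The second is reverted and the third never starts, which is the "do not move on" rule in executable form.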

6. Decide: one PR or many

The decision tree for splitting:

Pure mechanical, fewer than ~30 files, suite stays green throughout. One PR. Reviewers can scan the pattern, sample a few sites, approve.
Mechanical but touches > ~30 files. Split by seam — e.g., one PR per package, or one PR for the rename and a second for the move. Reviewers can hold each PR in their head.
Semantic + mechanical (e.g., changing a contract and updating all call sites). Two PRs minimum: PR 1 introduces the new shape alongside the old (additive, fully backward-compatible). PR 2 migrates call sites. PR 3 (optional) removes the old shape.
Cross-team or cross-service. Always split. The first PR introduces the new interface; downstream teams migrate at their own pace; a final PR removes the old.

The rule of thumb: if a reviewer would say “I can’t tell if this is right without running it myself,” the PR is too big. Split.
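
For the semantic-plus-mechanical case, PR 1 might look something like this sketch. The `OrderInput` and `OrderResult` types and the function bodies are hypothetical. In a real codebase the new function would live in `src/orders/` and the old name in `src/utils.ts` would delegate to it. Both appear in one block here only to keep the example self-contained.

```typescript
interface OrderInput { id: string; total: number; }
interface OrderResult { id: string; status: "accepted" | "rejected"; }

// New shape: the destination name from the refactor's stated intent.
function handleOrder(order: OrderInput): OrderResult {
  return { id: order.id, status: order.total > 0 ? "accepted" : "rejected" };
}

/**
 * @deprecated Use handleOrder. Kept so existing call sites compile and behave
 * identically until PR 2 migrates them; PR 3 can then delete this alias.
 */
function processOrder(order: OrderInput): OrderResult {
  return handleOrder(order);
}

// Old and new call sites coexist during the migration window.
const viaOld = processOrder({ id: "a1", total: 25 });
const viaNew = handleOrder({ id: "a1", total: 25 });
```

Because the old name only delegates, PR 1 is purely additive. Reverting PR 2 or PR 3 later never requires touching PR 1.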

7. Run the long-tail checks

After the bulk refactor is done, the long tail begins: things that don’t show up in unit tests but show up in production.

Type-checker on the full repo, not just touched files.
Lint and format, with the strict ruleset you actually deploy.
End-to-end or integration tests if they exist.
Build artefacts diffed if you have a way (bundle size, generated SQL, OpenAPI spec). Unexpected diffs in artefacts often catch bugs unit tests miss.
Manual smoke test of the main user paths, especially for UI refactors.

Claude can run all of these and report. Don’t skip the manual smoke test for UI — the model can’t see “the button is now misaligned by 4 pixels.”
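
The artefact check can be as simple as comparing two summaries taken before and after the refactor. A minimal sketch, assuming you can produce a name-to-size map for your build output: the `ArtefactSizes` type, the `diffArtefacts` function, the file names, and the 1% tolerance are all hypothetical.

```typescript
type ArtefactSizes = { [file: string]: number }; // bytes per artefact

interface ArtefactChange {
  file: string;
  before: number | null; // null when the artefact is new
  after: number | null;  // null when the artefact disappeared
}

// Reports artefacts that were added, removed, or changed size by more than
// `tolerance`, expressed as a fraction of the original size.
function diffArtefacts(before: ArtefactSizes, after: ArtefactSizes, tolerance: number): ArtefactChange[] {
  const found: ArtefactChange[] = [];
  const files = Object.keys(before).concat(
    Object.keys(after).filter((f) => !(f in before)),
  );
  for (const file of files) {
    const b = file in before ? before[file] : null;
    const a = file in after ? after[file] : null;
    if (b === null || a === null) {
      found.push({ file: file, before: b, after: a });
    } else if (Math.abs(a - b) > b * tolerance) {
      found.push({ file: file, before: b, after: a });
    }
  }
  return found;
}

const artefactReport = diffArtefacts(
  { "main.js": 100000, "vendor.js": 250000, "legacy.js": 4000 },
  { "main.js": 100500, "vendor.js": 290000, "orders.js": 3900 },
  0.01,
);
```

In this made-up run, `main.js` stays within tolerance while `vendor.js`, the removed `legacy.js`, and the new `orders.js` are reported. A behaviour-preserving refactor that grows the vendor bundle by 16% deserves a second look.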

8. Write a migration note for the team

If the refactor changes a public surface, even an internal-but-widely-used one, write a one-paragraph note in the PR description: what changed, why, how to migrate, and the deadline if any. Future-you will thank present-you when the same question arrives via Slack three weeks later.

For team-wide refactors, this often becomes a small doc in docs/ or a changelog entry, not just a PR description.

Anti-patterns

The shapes that don’t work:

Refactoring without tests. “I’ll be careful.” You won’t be. Or you will be, but someone reviewing won’t, and the regression ships. Pin behaviour first, refactor second.
Bundling refactor with feature. “While I’m in here, I’ll also add the new endpoint.” Now reviewers can’t tell which diff is structure-preserving and which is new behaviour. Two PRs, always.
Bundling refactor with bug fix. Same problem. The bug fix needs its own regression test; the refactor needs its own behavioural-pinning suite. They don’t belong together.
Mega-PRs because “it’s all the same change.” Reviewers approve big PRs by skimming. The big-PR refactor is roughly correct but ships subtle regressions because the review was shallow. Split.
Skipping plan mode on multi-file refactors. You start typing, you get five files in, you discover a snag, you partially-revert, you re-plan in flight. Twice as slow as planning up front.
Treating Claude’s first plan as final. The first plan is a draft. Read it for dependencies it missed, ordering it got wrong, risks it didn’t flag. Push back. Iterate. Then execute.
Letting the diff grow across many slices without committing. Same as conversation-driven development: each green state is a save point. Don’t lose them.
Refactoring under deadline pressure. Refactoring is investment; deadline-driven work is delivery. Mixing them is how refactors get rushed and ship regressions. Carve refactor time out of the schedule explicitly, or defer.
Refactoring code you’re about to delete. Sometimes the answer is “delete the module and write the replacement.” Look at the destination before polishing the path.

Evaluation

How do you know refactoring with Claude Code is working?

Suite stays green commit-to-commit. Always. If you have red commits in the refactor’s history, you’ve lost the safety net that makes refactoring tractable.
Each commit is reviewable in under five minutes. If a reviewer can’t hold the commit in their head, the slice was too big. Split next time.
Public behaviour is unchanged. No bug reports from the refactor PR. No “this used to work” tickets. The whole point of behavioural pinning is to make this true.
The refactor actually finished. Counter-intuitive evaluation: most refactor projects don’t ship because they grow beyond the energy budget. With Claude doing the typing, that’s no longer the bottleneck — but the planning, splitting, and reviewing still are. Did you ship?
The code is better, by some measurable standard. Fewer lines, lower cyclomatic complexity, fewer dependencies, clearer module boundaries, faster tests. If you can’t point at a metric that improved, the refactor wasn’t worth running.
You can describe the new shape in one sentence. “The order logic now lives in src/orders/ as a single module exposing handleOrder.” If you can’t, the new shape isn’t sharp enough — keep refactoring.

Sliced refactor, multiple PRs. Each PR small and reviewable. Behaviour pinned by tests. Each slice independently revertable. Slower-feeling; ships reliably. Best for anything cross-module or cross-team.

Big-bang refactor, one PR. All changes bundled. Reviewers skim. Some slices are tested, others aren’t. Faster-feeling; ships unreliably. Acceptable only for pure mechanical renames under ~30 files.

The size-of-PR heuristic I actually use

Roughly: if the diff has more than ~500 lines of non-test changes, or touches more than ~30 files, split. Mechanical exceptions exist — a pure rename can be 2000 lines and one PR because reviewers spot-check rather than read every line. But anything semantic over those numbers gets split, even if the splitting itself takes an extra hour. The cost of an unreviewable PR shipping a regression is much higher than the cost of an extra hour of splitting.
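
Written down as code, the heuristic is roughly the following. The `DiffStats` shape and the `shouldSplit` name are hypothetical, and the thresholds are the approximate ones quoted above, so treat them as defaults to tune. The sketch also makes one interpretive assumption: the mechanical exception waives the line limit but keeps the file limit, which matches the decision tree in step 6.

```typescript
interface DiffStats {
  nonTestLinesChanged: number;
  filesTouched: number;
  pureMechanical: boolean; // e.g. a rename with no semantic change
}

const MAX_NON_TEST_LINES = 500; // approximate, per the heuristic
const MAX_FILES = 30;           // approximate, per the heuristic

function shouldSplit(diff: DiffStats): boolean {
  // Too many files is a reason to split even when the change is mechanical.
  if (diff.filesTouched > MAX_FILES) {
    return true;
  }
  // Mechanical exception: reviewers spot-check the pattern, so line count is waived.
  if (diff.pureMechanical) {
    return false;
  }
  return diff.nonTestLinesChanged > MAX_NON_TEST_LINES;
}

const bigRename = shouldSplit({ nonTestLinesChanged: 2000, filesTouched: 20, pureMechanical: true });
const semanticChange = shouldSplit({ nonTestLinesChanged: 600, filesTouched: 10, pureMechanical: false });
```

Here `bigRename` is `false` and `semanticChange` is `true`: a 2000-line rename can ship as one PR, while a 600-line semantic change gets split.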