Test-Driven Development with Claude Code — Claude Code

The workflow#

Write the failing test. Hand the test to Claude as the spec. Let Claude implement until the test passes. Read the implementation. Add the next failing test. Repeat until the feature is whole.

The discipline is unchanged from human-only TDD; the leverage is different. Claude writes the boring implementation while you write the contract. When TDD works with Claude Code, it works much better than vibes-driven prompting, because the test is an unambiguous oracle the model can re-check itself against without you in the loop.

When to reach for it#

Test-driven is the right default whenever the task has a clear input/output contract and verification is fast. Specifically:

Pure functions and data transforms. Parsing, serialization, calculation, format conversion. The shape is “given X, return Y” — perfect for a test-first cycle.
Bug fixes with a reproducer. Once you have a failing test that captures the bug, the fix is just “make this green without breaking the others.” See Bug Hunting and Root-Cause Analysis for how to get to the reproducer.
API endpoints and route handlers. Request goes in, response comes out. The integration test is the spec.
Algorithm-shaped work. Sorting, searching, graph walks, state machines. The test cases double as the design document.
Refactors with behaviour-preserving intent. The pinning tests guarantee Claude doesn’t change observable behaviour while restructuring.

Don’t reach for it when verification is slow (multi-minute e2e suites, manual visual checks), when the contract is genuinely fuzzy (“make the dashboard feel snappier”), or when the test would be more code than the implementation (one-line config changes). For those, conversation-driven development without a test-first frame is fine.

Step-by-step#

1. Write the failing test yourself#

The single highest-leverage thing you do in this workflow is the first test. Write it before you ever ask Claude to implement anything.

Keep it small. One assertion is enough. A failing test that says expect(parseDuration('1h30m')).toBe(5400) is a better spec than three paragraphs of English describing how duration parsing should work. The test is unambiguous; the prose is not.

If you genuinely don’t know what the test should look like, you’re not ready to implement. Drop into plan mode and design the surface first.

2. Run the test and confirm it fails for the right reason#

Before bringing Claude in, run the suite and read the failure. You want to see “function not defined” or “expected 5400, got undefined” — not “module not found” or “syntax error.” The failure should be a missing implementation, not a broken harness.

This step takes 10 seconds and saves the next five minutes of Claude chasing a phantom failure.

3. Hand the test to Claude as the spec#

Now open the conversation. Point Claude at the failing test and state the intent:

The test in src/lib/duration.test.ts is failing. Implement parseDuration in src/lib/duration.ts so it passes. Don’t change the test.

The “don’t change the test” clause matters. Without it, Claude will sometimes “fix” a failing test by relaxing the assertion to match a wrong implementation. Pin the test as the source of truth.

4. Let Claude implement and run the test itself#

Claude Code can run pnpm test (or npm test, pytest, go test, whichever) and see the result. Let it iterate against the test until it passes. You’re not editing in this step — you’re watching.

This is the part of TDD that’s qualitatively different with Claude Code: the inner loop runs without you. You see the diff at the end of a green test, not after each character.

5. Read the implementation before adding the next test#

Don’t skip this. When the test goes green, read the diff. Look for:

Hardcoded values. Did Claude write if (input === '1h30m') return 5400? That passes one test and fails everything else. Catch it now, not three tests later.
Over-fitting to the test. Implementation that handles exactly the cases in the test and falls off a cliff outside them.
Missing edge cases. The test didn’t cover negative numbers or empty strings, so the implementation doesn’t either. Decide whether that’s fine or whether the next test should cover it.
Wrong abstraction. A function that’s correct but lives in the wrong file, uses the wrong type, or duplicates an existing helper.

If anything’s off, redirect with a single sentence (see Conversation-Driven Development) before adding more tests.

6. Add the next failing test#

The triangulation principle still holds: write a test that the current implementation fails, picking the case that forces the most useful generalization. If your first test was '1h30m' -> 5400, the next might be '45m' -> 2700 (forces handling missing hours) or '2h' -> 7200 (forces handling missing minutes), not '1h31m' -> 5460 (just one more arithmetic case).

Each new test should add a capability, not a data point.

7. Commit on green#

Every time the full suite is green, commit. Small green commits are the save points TDD with Claude depends on. If a later test sends the implementation in a bad direction, you can reset to the last green state without losing earlier work.

A reasonable cadence: one commit per 1–3 tests. Not one commit per feature. Not one commit per character.

8. Refactor with the suite as your safety net#

Once the feature is whole and the suite is green, ask Claude to clean up the implementation. Inline duplicated logic. Pull out helpers. Rename for clarity. The tests will catch any behaviour change instantly.

This is where TDD with Claude really earns its keep — the refactor step is fast because Claude does the typing and the tests do the verifying. You stay in the loop just enough to keep taste.

Anti-patterns#

The shapes that don’t work:

Asking for “tests and implementation in one shot.” “Write tests and implement parseDuration.” This generates plausible-looking tests that always pass because they were co-designed with the implementation. They cover nothing. Always write the first test yourself.
Letting Claude pick the assertions. When the model writes both the test and the implementation, the test becomes a snapshot of “what the model produced,” not “what the function should do.” Sometimes useful as documentation, never useful as a spec.
Tests that re-state the implementation. expect(fn).toHaveBeenCalledWith(...) over and over with no actual behaviour check. These tests pin implementation details and break on any refactor. Test the contract, not the call graph.
Too-coarse tests. One giant test that exercises ten cases. When it fails, you don’t know which case broke. Prefer narrow tests with one assertion each — TDD-with-Claude loves them.
Mocking everything. If your test stubs the file system, the database, the clock, the network, and three internal modules, the implementation Claude produces will be correct against the mocks and broken against reality. Mock the boundary, not the body.
Letting the suite stay red across commits. Red commits poison the workflow. You lose the save-point property, and Claude (next session) can’t tell whether the red is “your fault” or “the previous state was already red.” Always commit on green.
Skipping the read-the-implementation step. If you only look at the test result, you never notice the hardcoded value or the wrong abstraction. The diff is part of the workflow, not an optional extra.
Testing through the UI when you could test the unit. Slow e2e tests for things a unit test would catch instantly. The feedback loop is the whole point; protect it.

Evaluation#

How do you know TDD with Claude Code is working?

Test failures match the bug. When a test fails, the message describes the actual problem in one line. If you’re reading the failure and going “what does that even mean,” your tests aren’t telling Claude (or you) anything useful.
Red-to-green cycle is under two minutes. From writing the failing test to seeing the green checkmark. Longer than that and either the test is too coarse, the suite is too slow, or the contract is too fuzzy to TDD against.
Commits map one-to-one with capabilities added. Looking at git log should read like a spec: “parses hours,” “parses minutes,” “handles missing hours,” “rejects empty input.” Not “stuff” or “wip” or “more fixes.”
The suite catches regressions Claude introduces. Occasionally Claude will refactor and break a case. A good suite catches it on the next run. If you only notice in code review, the test was missing.
You write fewer tests than you would solo. Counter-intuitive but real: when Claude does the implementation, the cost of not having a test is lower in time terms, so you write the high-value tests and skip the trivial ones. Net coverage usually goes up because the high-value tests get written more often.
Refactors take minutes, not afternoons. Because the suite is comprehensive and Claude does the typing, structural changes that used to scare you become routine. If refactors still feel risky, the suite isn’t doing its job.

TDD with Claude. Human writes the test. Claude implements. Human reads diff. Human writes next test. Inner loop runs without human. Strong contract, fast iteration, comprehensive coverage. Best for clear input/output work.

No-test prompting. Human describes feature in prose. Claude generates implementation and maybe-tests. Human reviews at the end. No save points along the way. Fast for trivial work, accumulates risk on anything non-trivial.

A concrete first-test recipe I use for new features

When starting on a feature where I’m not sure what the function shape should be, I write a single test with the simplest possible happy path, no error cases, no edge cases. The act of naming the function, choosing the argument shape, and picking the return type forces 80% of the design decisions before I write a line of implementation. I then hand that test to Claude and ask for the simplest implementation that passes — explicitly “no error handling yet.” That gives me a vertical slice in five minutes that I can refine, instead of an hour of speculative design.