Pass a prompt non-interactively. Constrain permissions so unattended runs can’t do harmful things. Parse the structured output. Check the exit code. Set a timeout. Log everything. Fail loudly on unexpected states.
Headless Claude Code is the same engine as interactive Claude Code with two structural differences: there is no human in the loop, and there is no conversation across runs (unless you explicitly thread one). Those two facts shape everything about how the workflow is designed. The mode unlocks CI checks, scheduled audits, bulk file transformations, and autonomous agent loops — but it also strips away every safety net that interactive mode quietly provides.
CI checks. A nightly review of opened PRs, a pre-merge lint that needs reasoning beyond regex, a doc-freshness check that runs against changed files.
Scheduled audits. Weekly scan of the codebase for stale TODOs, missing test coverage on critical paths, drift from architectural conventions.
Bulk transformations. Applying the same change across hundreds of files where the change is too semantic for sed/awk but too uniform to warrant interactive iteration.
Triage automations. First-pass labeling of issues, first-pass severity assignment for alerts, first-pass categorization of bug reports.
Cron-style loops. Periodic check-and-act: poll a queue, process one item, write back a result, repeat. Often paired with the agent SDK rather than the CLI directly — see The Claude Agent SDK.
Don’t reach for headless mode when:
You’d benefit from in-flight redirection. Most non-trivial coding work falls here. The interactive loop exists for a reason.
The risk of a bad action is high and the cost of an alert is low. If a wrong action could ship a bug, leak data, or burn money, the human-in-the-loop friction is buying you something worth more than it costs.
You haven’t run the workflow interactively at least once. Headless mode is for workflows you understand. If you can’t predict roughly what the run will do, run it interactively first and codify only once you trust the shape.
Before scripting anything, run the workflow as an interactive session. Get it to the point where the same prompt, against the same starting state, produces the right result reliably. Note the prompt, the verification step, and any edge cases.
Skipping this step is the most common headless failure mode. You write a CI script for a workflow you’ve never actually completed by hand, deploy it, and find out at 3am why the workflow doesn’t survive a particular input shape.
2. Convert the prompt into a parameterized template
A good headless prompt is terse, complete, and parameterized. Terse because verbose prompts get expensive at scale. Complete because there’s no human to clarify ambiguities. Parameterized because the same prompt should serve many invocations with different inputs.
A useful pattern: pull the prompt template into its own file (e.g., prompts/review-pr.md) and have the script interpolate the parameters. Two benefits: the prompt is version-controlled, and you can iterate on it without redeploying the script.
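A minimal sketch of the template-plus-interpolation pattern, using only the Python standard library. The placeholder names (`pr_number`, `repo`, `diff`) and the template text are hypothetical; in practice you would read the template from a file such as prompts/review-pr.md.

```python
from string import Template


def render_prompt(template_text: str, **params: str) -> str:
    """Fill $placeholders in the template.

    Template.substitute raises KeyError when a placeholder has no value,
    so a half-filled prompt never reaches an unattended run.
    """
    return Template(template_text).substitute(**params)


# Hypothetical template text; normally loaded from a version-controlled file.
TEMPLATE = (
    "Review pull request $pr_number in $repo.\n"
    "Comment only on the files in this diff:\n"
    "$diff\n"
)

prompt = render_prompt(TEMPLATE, pr_number="123", repo="acme/widgets", diff="(diff text)")
```

One design note: `string.Template` treats `$` as special, so a literal dollar sign in the template has to be written as `$$`. Values passed in as parameters are inserted verbatim and are not re-scanned for placeholders.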
Headless runs should never have full permissions. The principle of least privilege applies the same here as anywhere else.
Allowlist the tools the workflow needs. If it only reads, deny edit and write. If it only edits in one directory, scope to that directory.
Allowlist specific bash commands. Deny everything by default; permit the exact commands the workflow needs (pnpm test, git diff, gh pr list).
Use the bypass permission mode only inside well-fenced environments. Sandboxed containers, ephemeral CI runners — yes. Your laptop with credentials mounted — no.
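One way to make the allowlist explicit is to build the command line in code and refuse to run without one. This sketch assumes the `claude` CLI's `-p` (print) flag and `--allowedTools` flag with a comma-separated list and `Bash(command:*)` rule syntax; treat those as assumptions and confirm them against `claude --help` for your installed version. The tool list itself is a hypothetical read-only review scope.

```python
def build_claude_argv(prompt: str, allowed_tools: list[str]) -> list[str]:
    """Build an argv list for a scoped, non-interactive run.

    Returning a list (rather than a shell string) avoids shell quoting
    problems when the prompt or a rule contains spaces or parentheses.
    """
    if not allowed_tools:
        # Deny by default: an empty allowlist is treated as a configuration bug.
        raise ValueError("refusing to run without an explicit tool allowlist")
    return ["claude", "-p", prompt, "--allowedTools", ",".join(allowed_tools)]


# Hypothetical read-only scope: no edit or write tools, two exact bash commands.
READ_ONLY_REVIEW = ["Read", "Grep", "Glob", "Bash(git diff:*)", "Bash(gh pr list:*)"]

argv = build_claude_argv("Review the staged diff.", READ_ONLY_REVIEW)
```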
Emit no other JSON. Do not wrap it in markdown fences.
Then in your script, parse only the last JSON object and ignore the prose. This makes downstream automation reliable in a way that grepping for “✓” or “approved” never is.
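A sketch of "parse only the last JSON object", assuming the run's output is free-form prose with one or more JSON objects embedded in it. The function name is hypothetical. It relies on `json.JSONDecoder.raw_decode`, which parses a value starting at a given index and reports where it ended.

```python
import json


def last_json_object(text: str) -> dict:
    """Return the last top-level JSON object found in text.

    Scans for '{', tries to decode from there, and skips past each object
    that decodes so nested objects are not mistaken for top-level ones.
    Raises ValueError if no object is found.
    """
    decoder = json.JSONDecoder()
    idx, last = 0, None
    while True:
        start = text.find("{", idx)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            idx = start + 1  # a stray brace in prose; keep scanning
            continue
        last = obj
        idx = end
    if last is None:
        raise ValueError("no JSON object found in output")
    return last
```

Raising on "nothing found" rather than returning `None` keeps an empty or prose-only run from being mistaken for a result.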
Every headless invocation needs a wall-clock timeout. Without one, a single stuck run can occupy a CI slot forever.
Set a hard timeout per invocation. A reasonable default: 5-15 minutes for review-style workflows, 30-60 minutes for bulk transformations. Tune to your workflow’s typical runtime plus 50%.
Map exit codes to outcomes. 0 = success, 1 = workflow ran but flagged a problem, 2 = workflow failed to run. Don’t conflate “found issues” with “ran badly” — they need different handling.
Log the full transcript. Even on success. When something goes weird later, the transcript is the only forensic evidence.
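The three points above can be sketched together with `subprocess.run`. The function and constant names are hypothetical, and "transcript" here means whatever the process wrote to stdout and stderr. This wrapper only distinguishes "ran" from "failed to run"; deciding that a run flagged a problem (exit code 1) is left to the caller, after it has parsed the output.

```python
import subprocess
import sys
from pathlib import Path

EXIT_OK, EXIT_FLAGGED, EXIT_RUN_FAILED = 0, 1, 2


def run_logged(argv: list[str], timeout_s: float, log_path: Path) -> tuple[int, str]:
    """Run argv under a wall-clock timeout and always write a log file.

    Returns (EXIT_OK or EXIT_RUN_FAILED, stdout).
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # partial output may arrive as bytes
            partial = partial.decode("utf-8", "replace")
        log_path.write_text(f"TIMEOUT after {timeout_s}s\n{partial}", encoding="utf-8")
        return EXIT_RUN_FAILED, ""
    except OSError as exc:  # e.g. the binary is not installed
        log_path.write_text(f"FAILED TO START: {exc}\n", encoding="utf-8")
        return EXIT_RUN_FAILED, ""

    # Log on success too: this is the forensic record for later.
    log_path.write_text(
        f"exit={proc.returncode}\n--- stdout ---\n{proc.stdout}\n--- stderr ---\n{proc.stderr}",
        encoding="utf-8",
    )
    if proc.returncode != 0:
        return EXIT_RUN_FAILED, proc.stdout
    return EXIT_OK, proc.stdout
```

A caller would typically end with `sys.exit(code)`, returning `EXIT_FLAGGED` when the run itself succeeded but the parsed output reports problems.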
A headless run that returns malformed JSON is a failure, not a success-with-a-warning. Fail the script, alert, and investigate. The most insidious failure mode for headless agents is “kinda worked, output was garbage, downstream consumer silently shipped wrong data.”
Validate the output shape before any downstream action. If the validation fails, halt and surface the raw transcript.
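A sketch of shape validation with no third-party dependencies. The schema is hypothetical (a `verdict` string plus a list of `issues`, each with `file` and `message`); substitute whatever your prompt actually asks for. The point is that every check raises rather than logging a warning and continuing.

```python
# Hypothetical schema values. A tuple, not a set, so an unhashable
# value such as a list simply fails the membership test.
ALLOWED_VERDICTS = ("approve", "request_changes")


def validate_review(obj: object) -> dict:
    """Return obj unchanged if it matches the expected shape, else raise ValueError."""
    if not isinstance(obj, dict):
        raise ValueError("output is not a JSON object")
    if obj.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {obj.get('verdict')!r}")
    issues = obj.get("issues")
    if not isinstance(issues, list):
        raise ValueError("'issues' must be a list")
    for i, issue in enumerate(issues):
        ok = (
            isinstance(issue, dict)
            and isinstance(issue.get("file"), str)
            and isinstance(issue.get("message"), str)
        )
        if not ok:
            raise ValueError(f"issue {i} must have string 'file' and 'message'")
    return obj
```

On a `ValueError`, the calling script would stop, exit non-zero, and attach the raw output to the alert instead of passing anything downstream.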
Headless mode can run repeatedly. If the workflow takes a destructive action, two consecutive runs must not duplicate the damage.
Idempotent operations preferred. “Set this label” instead of “add this label.” “Upsert this comment” instead of “comment again.”
Audit-logged when not idempotent. Every state change writes a record. If the same record appears twice in the audit log within a short window, the script bails — you’ve detected the duplicate before it compounds.
Dry-run mode as a first-class feature. Build the workflow with a --dry-run flag from day one. The flag should produce the same output as a real run minus the side effects. You’ll use it constantly.
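To make "upsert this comment" and the dry-run flag concrete, here is an in-memory sketch. The `Comment` type, the hidden marker string, and the function name are all hypothetical; a real version would read and write comments through your code host's API instead of a Python list.

```python
from dataclasses import dataclass

MARKER = "<!-- headless-review -->"  # hypothetical hidden marker identifying our comment


@dataclass
class Comment:
    id: int
    body: str


def upsert_comment(comments: list[Comment], body: str, dry_run: bool = False) -> str:
    """Edit the marked comment if it exists, otherwise create it.

    Returns the action taken. With dry_run=True the decision logic is
    identical but nothing is modified.
    """
    marked = f"{MARKER}\n{body}"
    existing = next((c for c in comments if c.body.startswith(MARKER)), None)
    if existing is not None and existing.body == marked:
        return "unchanged"  # a rerun with the same result is a no-op
    action = "update" if existing is not None else "create"
    if dry_run:
        return f"would-{action}"
    if existing is not None:
        existing.body = marked
    else:
        next_id = max((c.id for c in comments), default=0) + 1
        comments.append(Comment(next_id, marked))
    return action
```

Running this twice with the same body leaves one comment, not two, and the dry-run path exercises the same lookup and comparison as the real path, which is what makes it useful for testing.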
Deploying a workflow you’ve never run interactively. You don’t know what it does at the edge cases. You will find out in production. Always prototype interactively first.
Wide-open permissions. “It’s just a script, it’ll be fine” until the day an injected prompt tells the script to run rm -rf and the script can. Scope permissions tightly.
Parsing prose output. Looking for the word “approved” in stdout. Brittle, breaks the day Claude phrases it differently. Use structured output.
No timeout. A stuck run can hold a CI slot, a database connection, a queue lease indefinitely. Always set a timeout.
Treating “no output” as success. Empty output is at least as suspicious as malformed output. If a run produces nothing, treat it as a failure: alert and investigate.
Logging only the final result. When something goes wrong, the final result alone tells you nothing about why. Log the full transcript; keep it for at least a week.
Non-idempotent state changes without an audit log. Two consecutive runs comment twice. Three runs send three emails. Build idempotency in from the start, or you’ll spend a Friday afternoon manually cleaning up.
Cron loops without a kill switch. A misbehaving loop can do a lot of damage in an hour. Always have a way to disable the loop without redeploying — a config file, an env var, a feature flag.
Trusting headless output the way you’d trust interactive output. In interactive mode, you saw what Claude was doing as it happened. In headless mode, you didn’t — you have only the artefact. Validate before trusting.
Treating headless as a way to “save time.” Sometimes true. Often it shifts the work — from interactive iteration to script maintenance, prompt engineering, and incident response. Choose deliberately.
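The kill-switch item in the list above can be very small. This sketch checks an environment variable before every iteration; the variable name `HEADLESS_LOOP_DISABLED` and the `poll`/`process` callables are hypothetical stand-ins for your own queue and worker.

```python
import os
from typing import Callable, Mapping, Optional


def loop_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True unless the (hypothetical) kill-switch variable is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("HEADLESS_LOOP_DISABLED", "").strip().lower() not in ("1", "true", "yes")


def run_loop(
    poll: Callable[[], Optional[str]],
    process: Callable[[str], None],
    env: Optional[Mapping[str, str]] = None,
    max_items: int = 100,
) -> int:
    """Process items until the queue is empty, the cap is hit, or the switch flips."""
    done = 0
    while done < max_items and loop_enabled(env):  # re-checked on every iteration
        item = poll()
        if item is None:
            break
        process(item)
        done += 1
    return done
```

The `max_items` cap is a second, independent brake: even if nobody flips the switch, one invocation cannot process an unbounded number of items.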
The script does the same thing twice on the same input. Determinism is the headless equivalent of “the suite stays green.” If two runs of the same workflow give materially different results, the workflow isn’t ready.
Alerts fire only on real signal. If you’re muting your own alerts, the noise-to-signal ratio is wrong. Tighten the alert conditions.
Mean time to detect a bad run is under one cycle. If a misbehaving headless workflow keeps running for a week before anyone notices, your observability isn’t sufficient.
You can roll back a bad action. Whether by idempotent rerun, audit-log replay, or manual undo — there’s a path. If a headless mistake is permanently destructive, the workflow shouldn’t be running headless.
Permission scope matches the actual usage. Periodically audit: what tools did the workflow actually use last month? If it’s narrower than the allowlist, tighten. If it tried to use something blocked, you have evidence about an edge case to investigate.
Cost is predictable. Per-invocation token cost stays within a band; total monthly cost stays within budget. Surprises here usually mean a stuck loop, a runaway expansion of context, or a model-routing change you didn’t notice.
Humans don’t override the workflow regularly. If humans are correcting the headless output more than ~10% of the time, the workflow isn’t ready for automation. Move it back to interactive until you understand why.
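The permission-scope audit in the list above reduces to a few set comparisons, assuming you can extract tool names from your logs. The function name and the input sets are hypothetical; how you collect "used" and "blocked" depends on what your transcripts record.

```python
def audit_scope(allowlist: set[str], used: set[str], blocked: set[str]) -> dict[str, list[str]]:
    """Compare granted tools with observed behaviour over an audit window.

    unused_grants:    allowed but never used, so candidates for removal.
    blocked_attempts: tried but denied, so edge cases worth investigating.
    """
    return {
        "unused_grants": sorted(allowlist - used),
        "blocked_attempts": sorted(blocked),
    }


report = audit_scope(
    allowlist={"Read", "Grep", "Glob", "Bash(git diff:*)"},
    used={"Read", "Bash(git diff:*)"},
    blocked={"Edit"},
)
```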
Interactive Claude Code. Human in the loop. Mid-flight redirection. Visible reasoning. Safe by default — the human notices wrong turns. Best for non-trivial work and anything novel.
Headless Claude Code. No human in the loop. Structured output, exit codes, timeouts. Tight permissions required. Unsafe by default — only as safe as the constraints you wrap around it. Best for prototyped, well-understood, idempotent workflows running at scale.
A minimal CI-review scaffold I recommend starting from
A starter shape, in pseudocode: load the PR diff from the CI environment, render a parameterized prompt template, invoke Claude Code with a 10-minute timeout and a read-only permission scope, capture the structured JSON output, validate the shape, and post a single PR comment summarizing findings. On non-zero exit or invalid output, fail the CI job and skip the comment. Idempotency on the comment is achieved by editing a pinned comment ID rather than posting fresh each time. From this scaffold, every team’s customization is just changing the prompt template and the output schema — the infrastructure scaffolding is fixed. The mistake is starting with a custom infrastructure stack and a casual prompt; the right order is rigid infrastructure, iterable prompt.
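The scaffold described above might look like this in Python. It is a sketch under several assumptions: the `claude -p` and `--allowedTools` flags should be checked against your installed CLI, the prompt asks for the JSON object on the final output line, the output schema is hypothetical, and `post` is a caller-supplied function that edits the pinned comment. The runner is passed in as a parameter so the control flow can be tested without invoking the CLI.

```python
import json
import subprocess
import sys
from string import Template
from typing import Callable

# Hypothetical prompt and schema; iterate on these, not on the code below.
PROMPT = Template(
    "Review this diff. On the final line, emit exactly one JSON object: "
    '{"verdict": "approve" or "request_changes", "issues": [strings]}.\n$diff'
)
FAILURES = (subprocess.SubprocessError, OSError, ValueError, KeyError, IndexError, TypeError)


def call_claude(prompt: str) -> str:
    """Read-only invocation with a 10-minute timeout (flag names are assumptions)."""
    proc = subprocess.run(
        ["claude", "-p", prompt, "--allowedTools", "Read,Grep,Glob"],
        capture_output=True, text=True, timeout=600, check=True,
    )
    return proc.stdout


def review(diff: str, run: Callable[[str], str], post: Callable[[str], None]) -> int:
    """Return a CI exit code: 0 clean, 1 issues flagged, 2 run failed."""
    try:
        raw = run(PROMPT.substitute(diff=diff))
        result = json.loads(raw.strip().splitlines()[-1])  # final line only
        if result["verdict"] not in ("approve", "request_changes"):
            raise ValueError(f"unexpected verdict: {result['verdict']!r}")
        issues = result["issues"]
        if not isinstance(issues, list) or not all(isinstance(i, str) for i in issues):
            raise ValueError("'issues' must be a list of strings")
    except FAILURES as exc:
        # Fail loudly and skip the comment on any run or validation failure.
        print(f"headless review failed: {exc!r}", file=sys.stderr)
        return 2
    summary = "\n".join(f"- {i}" for i in issues) or "No issues found."
    post(f"Verdict: {result['verdict']}\n{summary}")
    return 1 if result["verdict"] == "request_changes" else 0
```

In CI, the entry point would be roughly `sys.exit(review(diff, call_claude, post))`, with `diff` loaded from the CI environment. Swapping the prompt and the validation lines is the only per-team change; the timeout, exit-code mapping, and skip-comment-on-failure behaviour stay fixed.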