Mock Interview — Agent System Design

A 45-minute mock loop for an agent-system-design interview. The prompt, the rubric, and the common follow-up questions.

Exercise Intermediate

Scenario

This is a 45-minute mock interview loop for an agent-system-design round. Treat it as a single sitting: an interviewer hands you the prompt, you have a whiteboard, and you have to drive the conversation. The prompt below is the kind of thing a real round opens with — broad enough to leave you room, narrow enough that hand-waving gets caught fast.

The prompt:

You’re joining a mid-sized SaaS company that sells a developer-productivity platform. Engineering leadership wants to ship an “AI engineer” — an agent that takes a Jira ticket describing a small bug or refactor, and lands a working pull request against the codebase that resolves the ticket. The agent should read the ticket, explore the relevant code, write the fix, run the tests, iterate on failures, and open a PR with a clear description. Human engineers review and merge.

Design the system. Architecture, agent shape, tool surface, memory, evaluation, failure handling. We have 45 minutes. Drive.

The interviewer will sit back for the first 8–10 minutes and let you talk. After that they will probe, push back, suggest reframes, and ask follow-ups. The grade isn’t whether your design is perfect — it’s whether you handle the probes well, surface trade-offs without being prompted, and know which decisions are load-bearing.

This page walks through what a strong candidate does in each phase, the rubric the interviewer is filling in mentally, and the common follow-ups you should be ready for.

Constraints

What the interview prompt fixes — these are explicitly in the brief, so leaning on them is fair game:

  • The scope is small bugs and refactors, not greenfield features. The agent does not architect; it executes scoped changes.
  • A human reviews and merges. No autonomous merges. The agent’s success state is “PR opened with passing CI.”
  • The codebase is the company’s own — call it a Python or TypeScript monorepo of moderate size (say 500k LOC, several hundred contributors).
  • Tests exist and are runnable. Coverage isn’t perfect but the CI pipeline is the source of truth for “does this change break anything.”
  • One agent run per ticket. The agent isn’t running continuously; it’s triggered by a Jira state transition or an engineer’s “send to AI” button.

What the interview prompt leaves variable — these are the axes you get to choose, and the interviewer will probe each:

  • Agent shape: single-agent ReAct loop vs hierarchical planner+executor vs multi-agent (planner, coder, reviewer).
  • Model selection: hosted frontier vs hosted small + RAG vs hybrid.
  • Tool surface: how broad, how dangerous, how typed.
  • Memory: per-run only, or across runs in the same repo.
  • Evaluation: how you’d measure “this thing is actually useful.”
  • Cost and latency budgets: you should propose numbers, not be handed them.

What’s wishful — the things you should not assume away if you want to be taken seriously:

  • That the ticket is well-written. Many are not. The agent will get “broken pls fix” half the time.
  • That the agent will get it right first try. The median run will involve 2–5 iterations on failing tests.
  • That every fix is small. Some “small bugs” are entangled with three other systems. The agent must detect when it’s over its head.

Approach

What a strong candidate says in the first 10 minutes, in order — this is the opening you should aim to deliver:

Minutes 0–2: Clarify the bounds. Don’t dive in. Ask 2–3 sharp questions, even if the brief seems clear:

  • “Is the agent expected to handle tickets that span multiple services, or strictly single-repo?”
  • “What’s our success metric — PR-opened rate, PR-merged-as-is rate, or post-merge bug-free rate?”
  • “What’s the cost ceiling per ticket? Is there a latency expectation?”

The interviewer will answer with whatever they want pinned. The questions themselves are scoring — they show you know what to scope before you build.

Minutes 2–4: Sketch the architecture. Draw, talking out loud:

┌──────────────────────────────────────────────┐
│            Jira ticket (trigger)             │
└───────────────────────┬──────────────────────┘
              ┌─────────▼──────────┐
              │   Planner agent    │  ← reads ticket,
              │  (capable model)   │    scopes the work
              └─────────┬──────────┘
                        │ plan
              ┌─────────▼──────────┐
              │    Coder agent     │  ← ReAct loop:
              │  (capable model)   │    read → edit →
              │                    │    test → repeat
              └─────────┬──────────┘
              ┌─────────▼──────────┐
              │   Reviewer agent   │  ← self-critique
              │  (different model  │    pass before PR
              │   where possible)  │
              └─────────┬──────────┘
              ┌─────────▼──────────┐
              │  PR + description  │  (human reviews)
              └────────────────────┘

Three sub-agents because the responsibilities don’t share well:

  • Planner. Reads the ticket, scopes the change (“which files? which tests? what’s the change shape?”), produces a structured plan. Bails out if the ticket is too vague or the change too large.
  • Coder. Executes the plan. Standard ReAct loop with a code-aware tool surface. The bulk of token spend lives here.
  • Reviewer. Reads the final diff with fresh eyes, runs an explicit checklist (does this match the ticket? are there tests? does the diff scope creep?), proposes the PR description or rejects the change.
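
The planner's structured plan and its bail-out rule can be sketched as a small data type. This is a minimal illustration, not a prescribed schema: the `Plan` fields, the `MAX_FILES` threshold, and the `bail_out_reason` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_FILES = 5  # assumed scope limit; would need tuning against real tickets


@dataclass
class Plan:
    ticket_id: str
    summary: str
    files_to_change: List[str] = field(default_factory=list)
    tests_to_run: List[str] = field(default_factory=list)
    change_shape: str = ""  # e.g. "bugfix" or "refactor"


def bail_out_reason(plan: Plan) -> Optional[str]:
    """Return why the planner should abort, or None if the plan is in scope."""
    if not plan.files_to_change:
        return "ticket too vague: no candidate files identified"
    if len(plan.files_to_change) > MAX_FILES:
        return f"change too large: touches {len(plan.files_to_change)} files"
    return None
```

A non-None reason would be passed straight to abort(reason), so the coder never starts on a ticket the planner could not scope.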

Minutes 4–7: Specify the tool surface. This is where weak candidates wave hands and strong ones name things:

  • read_ticket(ticket_id)
  • search_code(query, file_pattern?)
  • read_file(path)
  • read_file_range(path, start, end)
  • apply_patch(path, diff) — structured patch, not free-form rewrites
  • run_tests(test_pattern?)
  • run_lint()
  • list_recent_changes(path, limit)
  • open_pr(branch, title, description)
  • abort(reason) — the agent’s escape hatch

apply_patch is intentionally not “write_file.” Bounded edits via structured diffs are vastly more recoverable and reviewable than free-form file rewrites.
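
A narrow, typed tool surface can be enforced with a validation layer that sits between the model's tool calls and their execution. The sketch below is an illustration under assumptions: `TOOL_SCHEMAS` and `validate_call` are hypothetical names, only a subset of the tools is listed, and the path check is a simplified stand-in for real sandbox enforcement.

```python
from typing import Any, Dict

# Hypothetical registry: tool name -> required argument names and types.
# Anything not listed here (a shell, a free-form write_file) is rejected.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "read_file": {"path": str},
    "read_file_range": {"path": str, "start": int, "end": int},
    "apply_patch": {"path": str, "diff": str},
    "abort": {"reason": str},
}


def validate_call(tool: str, args: Dict[str, Any]) -> None:
    """Raise ValueError unless the call names a known tool with well-typed args."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise ValueError(f"unknown tool: {tool}")
    if set(args) != set(schema):
        raise ValueError(f"{tool} expects arguments {sorted(schema)}")
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            raise ValueError(f"{tool}.{name} must be {expected.__name__}")
    # Simplified containment check: relative paths only, no parent traversal.
    path = args.get("path")
    if path is not None and (path.startswith("/") or ".." in path.split("/")):
        raise ValueError("path must stay inside the sandboxed working copy")
```

Rejecting unknown tools by default is the point: the blast radius is defined by what the registry lists, not by what the model decides to ask for.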

Minutes 7–10: Surface the load-bearing trade-offs without being asked. Strong candidates volunteer:

  • “I’m using planner+coder+reviewer because the reviewer needs to be independent — if I let the coder critique itself, it’ll rubber-stamp its own work.”
  • “I’m capping the coder’s step budget at 30 tool calls — past that, it almost certainly gets worse, not better.”
  • “I’d separate the model used for planning from the one used for coding if budget allows, because the planning quality matters disproportionately.”
  • “Cost-wise, I’d target $2 per attempted ticket and $5 per merged-PR ticket. I’d want those numbers validated before promising leadership a roll-out.”
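
The two cost targets are linked by the merge rate: cost per merged PR is cost per attempt divided by the fraction of attempts that end up merged. A quick check, using hypothetical helper names, shows that $2 per attempt and $5 per merged PR together imply a merge rate of about 40%.

```python
def cost_per_merged_pr(cost_per_attempt: float, merge_rate: float) -> float:
    """Spend per merged PR when only a fraction of attempts get merged."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    return cost_per_attempt / merge_rate


def implied_merge_rate(cost_per_attempt: float, cost_per_merged: float) -> float:
    """Merge rate that makes the two cost targets consistent with each other."""
    if cost_per_merged <= 0:
        raise ValueError("cost_per_merged must be positive")
    return cost_per_attempt / cost_per_merged


# The targets quoted above: $2 per attempt, $5 per merged PR.
rate = implied_merge_rate(2.0, 5.0)
```

Being able to state the implied merge rate out loud is what turns the two dollar figures into a claim the interviewer can challenge.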

If you’ve done all this in 10 minutes, the rest of the interview is the interviewer probing your design. That’s the goal — you want to be defending, not improvising.

Design decisions to make

The interviewer will probe three to five of the axes below. For each, here is what a strong defence looks like:

  1. Single-agent vs multi-agent. Many candidates default to multi-agent because it sounds sophisticated. The interviewer’s probe: “Why not just one capable agent with a long system prompt?” The defensible answer: the planner/coder split exists because the failure modes differ. Planner failures mean wasted effort on the wrong scope; coder failures mean a buggy diff. You want different prompts, different eval signals, and ideally different cost tiers. The reviewer is separate for independence: a reviewer that shares the coder’s model and prompt tends to miss the same errors the coder made.

  2. Step budget and loop control. Probe: “What if the coder agent gets stuck in a test-fix-test-fix loop?” Defensible answer: hard step cap (30 calls), stagnation detection (3 consecutive identical failures on the same test = abort), and an explicit abort(reason) tool the agent can call. Letting the agent run forever is the single most expensive failure mode in this design.

  3. Tool safety and blast radius. Probe: “What stops the agent from rm -rfing the repo?” Defensible answer: the tool surface is deliberately narrow — no shell, no arbitrary file write, only structured apply_patch. The agent runs in a sandboxed working copy (a clone, a worktree, a container) — never against the canonical checkout. PRs are opened on branches; nothing the agent does is destructive at the repo level.

  4. Memory across runs. Probe: “Should the agent remember what it’s seen in this repo before?” Defensible answer: a retrieval layer over prior tickets and their PRs is high-value (e.g. “the auth flow has these gotchas, last fixed in PR-123”), but a free-form long-term memory the agent writes to is dangerous — it amplifies its own errors over time. Indexed retrieval over an external store, not a self-mutating memory.

  5. Evaluation. Probe: “How do you know this thing is working?” Defensible answer: three buckets — a frozen suite of ~50 historical tickets where you know the right fix and can grade exactly; a shadow-mode run on incoming tickets where the agent’s PR is compared to the human’s; live tracking of merge-without-edit rate, time-to-merge, and post-merge bug rate. Punting on this question is a fail signal.
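
The live-tracking bucket can be made concrete with a small aggregation over run records. The sketch below is illustrative only: `RunRecord` and its fields are hypothetical, and it assumes merge-without-edit rate is measured against opened PRs, which is one of several reasonable denominators. Time-to-merge is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    ticket_id: str
    pr_opened: bool
    merged: bool
    edited_before_merge: bool
    post_merge_bug: bool


def live_metrics(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate headline rates; a rate with an empty denominator reports 0.0."""
    opened = [r for r in runs if r.pr_opened]
    merged = [r for r in opened if r.merged]
    clean_merges = sum(1 for r in merged if not r.edited_before_merge)
    buggy_merges = sum(1 for r in merged if r.post_merge_bug)

    def rate(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "pr_opened_rate": rate(len(opened), len(runs)),
        "merge_without_edit_rate": rate(clean_merges, len(opened)),
        "post_merge_bug_rate": rate(buggy_merges, len(merged)),
    }
```

The same function can run over the frozen suite and the shadow-mode runs, which keeps the three buckets comparable.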

The interviewer may also probe:

  • Cost. “How do you keep this under $X per ticket?” — model tiering, step cap, exemplar-budgeted retrieval, aborting cheap on out-of-scope tickets.
  • Auth. “How does the agent authenticate to GitHub / Jira?” — a service account with the narrowest possible scopes, audit logging of every action, no human credential reuse.
  • Bias / regression. “What if the agent’s median fix is worse than the median junior engineer’s?” — that’s the empirical question the eval harness answers; if true, scope down what the agent attempts.

Trade-offs to discuss

The follow-up debates the interviewer pushes into. Be ready for both sides of each.

  • Planner + coder split. Upside: each agent has a focused prompt, a focused eval signal, and separable failure attribution. Cost: coordination is real work. The plan has to be machine-readable enough for the coder to follow, and the coder has to be permitted to diverge from the plan when reality contradicts it.
  • Single coder agent that plans inline. Upside: simpler architecture, fewer moving parts, less prompt engineering. Cost: failure attribution is harder (was the bad PR because the agent misread the ticket, or because it wrote buggy code?), and step counts tend to balloon.
  • Self-reflection by the coder before PR. Upside: cheaper than a separate reviewer agent, since it is one model making two passes. Cost: the model is biased toward its own outputs, so meaningful errors get rubber-stamped. Useful as a first filter, not the only one.
  • Independent reviewer agent. Upside: genuinely different inputs (it sees only the final diff and the ticket, not the trajectory) catch a different class of errors. Cost: roughly +20% per run. The dollar value of preventing one bad PR from being merged usually justifies it.
  • Frontier model end-to-end. Upside: best raw capability, simpler to evaluate, fewer tier boundaries. Cost: per-ticket spend dominates, and many tickets are mundane and don’t need this much horsepower.
  • Tiered models per role. A cheap model for ticket classification and the planner shortlist, a capable model for coding, and a different capable model for review. Upside: saves meaningful budget. Cost: adds tier-boundary failures (the cheap classifier sends easy tickets to the wrong path).

Other axes the interviewer may push you on:

  • In-context retrieval vs filesystem traversal. Pre-indexing the codebase and retrieving top-K relevant chunks vs letting the agent traverse the filesystem with search_code / read_file. Retrieval is faster but misses context that’s not lexically obvious; traversal is slower but more faithful. Combine — retrieval gives the agent its starting points, traversal lets it expand outward.

  • Synchronous vs queued. The agent could run synchronously when an engineer clicks “send to AI” or asynchronously when a ticket transitions state. Sync gives faster perceived turnaround for the engineer who triggered it; async gives better batching, easier rate limiting, more recoverable failures. Sync for engineer-triggered, async for auto-triggered.

  • What counts as success. PR-opened? PR-merged-as-is? PR-merged-with-edits? Post-merge bug-free for 30 days? Each is a different objective. “PR-merged-as-is rate” is the most honest near-term metric; “post-merge bug rate vs baseline” is the only one that proves real value, but it’s slow to measure.

  • Failure recovery and ticket commenting. When the agent aborts — out of scope, can’t find the bug, tests keep failing — it should comment back on the Jira ticket with what it tried and what it concluded. Silence on failure is worse than failure. The “what I tried” log becomes input for the next human triaging it.

  • Cross-run memory and exemplars. A retrieval index over (past ticket, past PR, eventual outcome) tuples lets the planner pull “this is what tickets that look like this usually need” without any model weight updates. Powerful and cheap. The exemplar quality decays as the codebase evolves — need a freshness signal.
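
One way to build the freshness signal is to discount each exemplar's similarity score by its age. The sketch below assumes an exponential decay with a half-life; the function names, the 180-day default, and the tuple layout are hypothetical choices for illustration, and a real system might instead key freshness on how much the touched files have changed since the exemplar was merged.

```python
from typing import List, Tuple


def exemplar_score(similarity: float, age_days: float,
                   half_life_days: float = 180.0) -> float:
    """Similarity discounted by age: the score halves every half_life_days."""
    return similarity * 0.5 ** (age_days / half_life_days)


def top_exemplars(candidates: List[Tuple[str, float, float]],
                  k: int = 3) -> List[str]:
    """candidates holds (pr_id, similarity, age_days); return the k best pr_ids."""
    ranked = sorted(
        candidates,
        key=lambda c: exemplar_score(c[1], c[2]),
        reverse=True,
    )
    return [pr_id for pr_id, _, _ in ranked[:k]]
```

With this scoring, a moderately similar exemplar from last month can outrank a near-identical one from two years ago, which is the behaviour the freshness signal is meant to produce.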

Evaluation criteria

The rubric the interviewer is filling in, roughly weighted:

  • Architecture clarity (25%) — components named, data flow drawn, sub-agent responsibilities distinct.
  • Tool design (15%) — concrete tools with realistic signatures, blast radius bounded, escape hatches present.
  • Trade-off awareness (20%) — candidate surfaces axes without being asked, defends both sides of choices, doesn’t sell their design as obviously correct.
  • Failure handling (15%) — step caps, stagnation detection, abort path, ticket-comment-on-failure, sandboxed working copy.
  • Evaluation strategy (15%) — frozen + shadow + live metrics; doesn’t punt the question.
  • Driving the room (10%) — uses the time well, doesn’t over-design one section and skip another, responds to probes without losing the thread.
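
To see how the weights combine, the rubric can be treated as a weighted sum. This is only a sketch for self-grading a practice run: the weights come from the list above, but the 0-to-1 rating scale and the names `RUBRIC_WEIGHTS` and `rubric_score` are assumptions, and a real interviewer is unlikely to score this mechanically.

```python
from typing import Dict

# Weights copied from the rubric above; they sum to 100.
RUBRIC_WEIGHTS: Dict[str, int] = {
    "architecture_clarity": 25,
    "tool_design": 15,
    "trade_off_awareness": 20,
    "failure_handling": 15,
    "evaluation_strategy": 15,
    "driving_the_room": 10,
}


def rubric_score(ratings: Dict[str, float]) -> float:
    """Weighted total out of 100. Each rating is assumed to lie in [0, 1]."""
    missing = set(RUBRIC_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[name] * ratings[name] for name in RUBRIC_WEIGHTS)
```

Under this scoring, a candidate who is flawless everywhere but skips evaluation entirely tops out at 85, which matches the warning that punting on evaluation is costly.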

Common reasons candidates fail this interview:

  • Skipping the clarifying questions. Diving straight into a multi-agent architecture without asking what success looks like. Always pin scope first.
  • Naming frameworks instead of designing. “I’d use LangChain with AutoGen on top of Pinecone with OpenSearch” tells the interviewer you’ve read marketing pages. Name a tool surface, not a stack.
  • No evaluation strategy. If you can’t say how you’d know the agent is working, you can’t design improvements to it. This single omission has tanked many otherwise-solid interviews.
  • No failure handling. Designs that assume the agent succeeds are designs that haven’t shipped. Step caps, abort paths, sandboxes — these aren’t garnishes, they’re the design.
  • Over-confidence about cost and latency. Claiming “this will run in 30 seconds for $0.50” without a number-by-number breakdown reads as undisciplined.

What a passing answer covers

  • A multi-agent or thoughtfully-justified single-agent architecture, drawn clearly.
  • A concrete tool surface with bounded blast radius and an explicit abort tool.
  • Step caps, stagnation detection, and a sandboxed working copy.
  • Independent review before PR (different prompt at minimum, different model ideally).
  • An evaluation harness with at least frozen + shadow buckets.
  • A failure path that produces a Jira comment, not silence.
  • Cost and latency targets stated explicitly, even if approximate.

What a strong answer adds

  • A phased rollout — v1 read-only “AI explains the ticket” comments; v2 PRs on a narrow ticket-class allowlist; v3 wider scope after eval gates clear. Not “ship it all at once.”
  • Independence as a measurable property — the reviewer’s errors are decorrelated from the coder’s, demonstrated by overlap analysis on the frozen eval set. Not just “I used a different model.”
  • Cross-run retrieval over an external index, with a freshness signal, used by the planner. Not a self-mutating memory.
  • A specific stance on what the agent will never do: never push to main, never delete branches, never touch CI config, never invoke rm. Stated upfront, not as a footnote.
  • Awareness of the socio-technical layer: engineers won’t merge AI PRs they don’t trust; trust takes time; the rollout plan needs to account for engineer adoption, not just model accuracy.
  • Honest reflection on which part of their own design they’d test first if they had a week — that self-awareness reads as senior even when parts of the design have gaps.
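
The overlap analysis mentioned above can be operationalized in more than one way. One simple option, sketched here with hypothetical names, compares how often the coder and the reviewer are both wrong on the same frozen-set item against what independence would predict. It assumes each item already has a boolean "was wrong" label for each agent.

```python
from typing import Dict, List


def error_overlap(coder_errs: List[bool], reviewer_errs: List[bool]) -> Dict[str, float]:
    """Compare the observed joint error rate with the rate expected if independent.

    A lift near 1.0 suggests decorrelated errors; well above 1.0 suggests the
    reviewer tends to fail on the same items as the coder.
    """
    if len(coder_errs) != len(reviewer_errs) or not coder_errs:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(coder_errs)
    p_coder = sum(coder_errs) / n
    p_reviewer = sum(reviewer_errs) / n
    p_both = sum(1 for c, r in zip(coder_errs, reviewer_errs) if c and r) / n
    expected = p_coder * p_reviewer
    lift = p_both / expected if expected else float("nan")
    return {"p_both": p_both, "expected_if_independent": expected, "lift": lift}
```

On a frozen set of around 50 tickets the estimate is noisy, so the lift is better read as a directional signal than as a precise measurement.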

The interviewer's single favourite follow-up question

“Walk me through what happens when the agent’s first attempt produces a PR with a failing test.”

Weak answer: “It retries with the failure output as feedback.”

Strong answer: “The coder’s ReAct loop catches run_tests returning failures and feeds the failure output back as the next observation. The agent gets one or two iterations to fix it. If the same test fails three times in a row, stagnation detection trips and the agent calls abort('repeated test failure on test_X — root cause may be outside scope'), which posts back to Jira with the trajectory summary. We don’t open a PR with red CI. The PR is only opened when tests pass and the reviewer agent has signed off. If the test was flaky rather than failing on this change, the agent has no way to know — we accept that ~3% of aborts will be due to flake, and the on-call engineer retries those manually.”

Notice what the strong answer does: names the specific mechanisms (stagnation detection, abort tool, reviewer gate, Jira comment), acknowledges the limitation (flake misattribution), and gives a number for it. That’s the texture of a passing answer.
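
The mechanisms named in the strong answer fit in a short control loop. The sketch below is a simplification under stated assumptions: `run_fix_loop`, `run_tests`, and `attempt_fix` are hypothetical callables, each loop iteration stands in for one test run rather than one tool call, and a failure is identified by a plain string signature such as the failing test's name.

```python
from typing import Callable, Optional, Tuple

MAX_STEPS = 30        # hard step cap from the design above
STAGNATION_LIMIT = 3  # same failure this many times in a row means abort


def run_fix_loop(
    run_tests: Callable[[], Optional[str]],
    attempt_fix: Callable[[str], None],
) -> Tuple[str, str]:
    """Drive test -> fix -> test until green, stagnation, or budget exhaustion.

    run_tests returns None when tests pass, else a failure signature.
    Returns (status, detail), where status is "ready_for_review" or "abort".
    """
    last_failure: Optional[str] = None
    repeats = 0
    for step in range(MAX_STEPS):
        failure = run_tests()
        if failure is None:
            return ("ready_for_review", f"tests green after {step} fix attempts")
        # Count consecutive identical failures; any new failure resets the count.
        repeats = repeats + 1 if failure == last_failure else 1
        last_failure = failure
        if repeats >= STAGNATION_LIMIT:
            return ("abort", f"repeated test failure on {failure}")
        attempt_fix(failure)  # failure output becomes the next observation
    return ("abort", "step budget exhausted")
```

An "abort" result would feed the abort(reason) tool and the Jira comment; only "ready_for_review" hands the diff to the reviewer agent, so a PR is never opened with red CI.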
