Design Your First Agent — Agentic · Engineering Playbook

Scenario

The brief is deliberately open-ended: pick a real workflow you do daily, and design the agent that would automate it.

Not a toy problem from a textbook. Not a generic “customer support chatbot.” Something you actually do, that costs you real minutes per day, that you’d genuinely want a competent assistant to handle. The constraint is realism — the workflow you pick must have specifics: actual systems, actual inputs, actual decisions you make, actual mistakes you sometimes regret.

A few that work well for this exercise:

Triaging your inbox each morning — sorting promo from work, drafting two-line replies to the obvious ones, flagging the ones that need real attention.
Processing customer-support tickets — pulling context from the user’s history, classifying the issue, drafting a response, deciding whether to escalate.
Writing your weekly status update: gathering this week’s merged PRs, closed tickets, and Slack threads into a 5-bullet summary for your manager.
Booking travel for a recurring trip — same airline preferences, same hotel chain, same expense-policy rules.
Reviewing pull requests for a specific kind of change — schema migrations, dependency upgrades, copy edits.

Pick one. Then design the agent. Walk through it as if presenting on a whiteboard: what inputs the agent perceives, what tools it has, what memory it carries, what decisions it makes autonomously vs defers to you, how you’d know it’s working.

For the rest of this exercise we’ll use a worked example — processing customer-support tickets — so the structure stays concrete. Your own choice can substitute in.

Constraints

What’s fixed in your scenario — this part you specify based on the workflow you chose. For the customer-support example:

Tickets arrive in Zendesk. The agent can read tickets via API; writes must go through Zendesk’s standard reply/macro/tag endpoints.
There’s a public help-centre with 400 articles. The agent can retrieve from these.
There’s an internal runbook for known issues. Markdown files in a Notion workspace.
The agent must not auto-send any reply with a refund, account credit, or escalation commitment. Those need human approval.
Cost is bounded. Roughly $0.10 per ticket of inference budget; the team handles ~5,000 tickets a week.
Latency. Drafting a reply within 30 seconds of ticket arrival is the target; the agent does not need to be sub-second.
Privacy. Customer email addresses are PHI-adjacent here (we’re in a healthcare-adjacent SaaS); the agent must use a vendor with a signed DPA.
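The cost constraint is worth turning into arithmetic before designing anything else. A minimal sketch using only the two figures stated above; the function names are illustrative, and the per-call costs you would pass to `within_budget` depend on your vendor's pricing, which is not specified here.

```python
# Figures taken from the constraints above.
BUDGET_PER_TICKET_USD = 0.10
TICKETS_PER_WEEK = 5_000


def weekly_budget_usd(per_ticket: float = BUDGET_PER_TICKET_USD,
                      tickets: int = TICKETS_PER_WEEK) -> float:
    """Total weekly inference budget implied by the per-ticket cap."""
    return per_ticket * tickets


def within_budget(call_costs_usd: list[float],
                  per_ticket: float = BUDGET_PER_TICKET_USD) -> bool:
    """True if all model calls made for one ticket fit under the cap."""
    return sum(call_costs_usd) <= per_ticket


print(round(weekly_budget_usd(), 2))  # 500.0
```

At roughly $500 a week, the budget is small enough that review time, not inference, is likely the dominant cost, which is worth checking against your own numbers.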

What you should always articulate, regardless of the workflow:

What the agent reads. What systems, what data shapes, what time horizon.
What the agent writes. What it commits, what it drafts for review, what it never touches.
What you’d consider “success” for a ticket / email / PR / trip. Be specific. “Resolved” is not a metric; “customer marked the reply helpful or didn’t reply within 48 hours” is.
Where the agent should hand off — when the case is too weird, too high-stakes, or too uncertain to handle.
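A success definition like the one above is specific enough to write as a predicate. A minimal sketch, assuming a hypothetical `TicketOutcome` record and assuming the check runs only after the 48-hour window has closed; neither the record shape nor the field names come from Zendesk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

FOLLOW_UP_WINDOW = timedelta(hours=48)


@dataclass
class TicketOutcome:
    # Hypothetical record; fields are assumptions for illustration.
    reply_sent_at: datetime
    marked_helpful: bool
    customer_replied_at: Optional[datetime] = None  # None means no follow-up


def is_success(outcome: TicketOutcome) -> bool:
    """Success: marked helpful, or no customer reply within 48 hours.

    Assumes it is called after the 48-hour window has closed.
    """
    if outcome.marked_helpful:
        return True
    if outcome.customer_replied_at is None:
        return True
    # A reply that arrives after the window does not count against the draft.
    return outcome.customer_replied_at - outcome.reply_sent_at > FOLLOW_UP_WINDOW
```

If you cannot write your own metric in this form, it is probably still too vague.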

What’s wishful (the constraints you should not let yourself assume away):

That your inputs are clean. Tickets contain rage-typing, multilingual snippets, attached screenshots, forwarded threads.
That the agent will be right most of the time. For a foundational design, target “right or honestly uncertain” — never “wrong with confidence.”
That you’ll have time to babysit it. Whatever the agent does, you should be able to walk away for an hour and come back to a sane state.

Approach

A worked example for the customer-support agent, in broad strokes:

                ┌──────────────────────────────────────────────┐
                │       New ticket arrives in Zendesk          │
                └───────────────────────┬──────────────────────┘
                                        │
                              ┌─────────▼──────────┐
                              │   Classifier       │  ← what kind of
                              │   (small/fast)     │     ticket is this?
                              └─────────┬──────────┘
                                        │
              ┌─────────────────────────┼──────────────────────────┐
              ▼                         ▼                          ▼
     ┌────────────────┐       ┌─────────────────┐        ┌────────────────────┐
     │  "How do I X"  │       │ "It's broken"   │        │ "I want a refund"  │
     │  → KB answer   │       │ → runbook check │        │ → assemble context │
     │     agent      │       │   → diagnose    │        │  → draft + human   │
     └───────┬────────┘       └────────┬────────┘        └─────────┬──────────┘
             │                         │                            │
             └─────────────────────────┼────────────────────────────┘
                                       ▼
                          ┌──────────────────────┐
                          │   Reply composer     │  (draft only)
                          └──────────┬───────────┘
                                     ▼
                          ┌──────────────────────┐
                          │   Human review queue │  (you click send)
                          └──────────┬───────────┘
                                     ▼
                          ┌──────────────────────┐
                          │   Outcome tracker    │  (CSAT, reply-rate)
                          └──────────────────────┘

The single-agent shape is genuinely fine for v1 — it would look like this:

One ReAct loop, one prompt, calls a handful of tools, drafts a reply, hands to you.

The reason the diagram has three branches is that the behaviour differs enough by ticket class that splitting saves prompt complexity and lets you tune each path. But that’s a v1.5 move — start with one agent, one prompt, and let the failure modes tell you where to split.
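The single-agent shape can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `decide` stands in for the model call, `tools` is a plain dictionary of callables, and the step limit of 6 is arbitrary.

```python
from typing import Any, Callable

Action = tuple[str, dict[str, Any]]   # (tool name, keyword arguments)
History = list[tuple[str, Any]]       # (tool name, observation) pairs

# Tools that end the loop by handing something to the human.
TERMINAL_TOOLS = {"draft_reply", "request_human_handoff"}


def run_agent(ticket_id: str,
              decide: Callable[[History], Action],
              tools: dict[str, Callable[..., Any]],
              max_steps: int = 6) -> Action:
    """Ask the model for an action, run the tool, feed the result back."""
    history: History = [("ticket_id", ticket_id)]
    for _ in range(max_steps):
        name, args = decide(history)
        if name in TERMINAL_TOOLS:
            return name, args  # the draft (or the handoff) goes to the human
        history.append((name, tools[name](**args)))
    # Running out of steps is treated as "I don't know", not as an error.
    return "request_human_handoff", {"ticket_id": ticket_id,
                                     "reason": "step budget exhausted"}


def scripted(history: History) -> Action:
    """Stand-in for the model: read the ticket once, then draft."""
    ticket_id = history[0][1]
    if len(history) == 1:
        return "get_ticket", {"ticket_id": ticket_id}
    return "draft_reply", {"ticket_id": ticket_id, "body": "Thanks for writing in."}


result = run_agent("T-1", scripted, {"get_ticket": lambda ticket_id: {"id": ticket_id}})
print(result[0])  # draft_reply
```

The loop has exactly two exits, a draft or a handoff, which keeps the "you click send" property visible in the control flow.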

The tool surface, even at v1:

get_ticket(ticket_id)
get_customer_history(customer_id, limit)
search_help_centre(query)
search_runbook(query)
get_account_info(customer_id)
draft_reply(ticket_id, body) (stages a draft; does not send)
add_internal_note(ticket_id, body) (visible to teammates, not the customer)
request_human_handoff(ticket_id, reason)

No send_reply. No issue_refund. No escalate_to_account_manager. Those go to you, and you click.
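One way to make that surface concrete is to write the signatures down as a typed interface. A minimal sketch: the tool names come from the list above, while the `Ticket` fields, the return types, and the `call_tool` dispatcher are assumptions for illustration, not Zendesk's API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Ticket:
    # Hypothetical shape; real ticket objects carry many more fields.
    ticket_id: str
    customer_id: str
    subject: str
    body: str


class SupportTools(Protocol):
    def get_ticket(self, ticket_id: str) -> Ticket: ...
    def get_customer_history(self, customer_id: str, limit: int) -> list[Ticket]: ...
    def search_help_centre(self, query: str) -> list[str]: ...
    def search_runbook(self, query: str) -> list[str]: ...
    def get_account_info(self, customer_id: str) -> dict[str, str]: ...
    def draft_reply(self, ticket_id: str, body: str) -> None: ...
    def add_internal_note(self, ticket_id: str, body: str) -> None: ...
    def request_human_handoff(self, ticket_id: str, reason: str) -> None: ...


# Explicit allowlist: anything not named here is unreachable by the agent.
ALLOWED_TOOLS = frozenset({
    "get_ticket", "get_customer_history", "search_help_centre",
    "search_runbook", "get_account_info", "draft_reply",
    "add_internal_note", "request_human_handoff",
})


def call_tool(tools: SupportTools, name: str, **kwargs: Any) -> Any:
    """Dispatch a tool call, refusing anything outside the allowlist."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not available to the agent: {name}")
    return getattr(tools, name)(**kwargs)
```

Enforcing the allowlist in the dispatcher rather than in the prompt means a missing `send_reply` is a property of the code, not of the model's good behaviour.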

Design decisions to make

The canonical decisions every first-agent design must surface — these are the ones you’ll be asked about in any conversation about your design:

What does the agent actually decide? This is the most important question. An agent that “drafts a reply for you to send” makes essentially one decision: what words to put in the draft. An agent that “tags, drafts, and routes” makes three. Each decision is a chance to be wrong. Start with the smallest decision surface that delivers value.
What’s the autonomy boundary? What does the agent commit on its own? What does it propose for you to approve? What does it not touch at all? Draw a three-column table:
- Auto-commit: applying obvious tags, adding internal notes, retrieving context.
- Propose-for-approval: the reply draft itself, suggested macros, tone changes.
- Never-touch: refunds, account credit, deletions, anything involving money or legal commitment.
What’s the memory model? For most first agents: short-term working memory per ticket / per email / per PR, gone when the task ends. Long-term memory (“this customer always requests refunds, lower the priority”) is tempting but adds dependencies: a store to build, keep consistent, and debug. Defer it to v2.
What’s the tool surface? Concrete, named, schema-typed. “The agent reads Zendesk” is not a tool. get_ticket(ticket_id) returning a Ticket object is. The more concrete the tool list, the easier the design conversation goes.
What’s the prompt structure? The agent gets: a role / system prompt, the task description, the tools, and the working context. Decide where domain knowledge goes — in the system prompt (cheap, always present, no retrieval), in retrieved exemplars (richer, more brittle), or in a knowledge-base tool (decoupled, slower). For a first agent, lean toward stuffing the most-important 200 lines into the system prompt and using tools for the long tail.
What’s “success” on a single run? For the support agent: the drafted reply, when sent as-is, resolves the ticket without a follow-up reply from the customer within 48 hours. That’s a measurable outcome, not “the reply was good.” Pick something equivalent for your workflow.
What’s the evaluation set? A frozen sample of 50 real tickets (anonymised, with PII scrubbed). Run the agent on them every time you change the prompt. Track: % drafts you sent as-is, % drafts you heavily edited, % drafts you discarded. If you can’t measure those numbers, you can’t tell whether you’re improving.
What’s the failure path? When the agent doesn’t know what to do — when the classifier returns low confidence, when no relevant runbook entry exists, when the customer’s request involves money — the agent calls request_human_handoff() with a reason. The failure path is itself a feature, not an edge case.
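The autonomy boundary and the failure path above are both small enough to encode directly. A minimal sketch: the action names in `BOUNDARY` are illustrative, the 0.7 confidence threshold is an assumed value to tune against your eval set, and how you detect that a request involves money is left out.

```python
from enum import Enum
from typing import Optional


class Autonomy(Enum):
    AUTO_COMMIT = "auto-commit"
    PROPOSE = "propose-for-approval"
    NEVER = "never-touch"


# Rows of the three-column table; action names are illustrative.
BOUNDARY = {
    "apply_tag": Autonomy.AUTO_COMMIT,
    "add_internal_note": Autonomy.AUTO_COMMIT,
    "draft_reply": Autonomy.PROPOSE,
    "suggest_macro": Autonomy.PROPOSE,
    "issue_refund": Autonomy.NEVER,
    "delete_data": Autonomy.NEVER,
}

MIN_CONFIDENCE = 0.7  # assumed threshold, not a recommendation


def autonomy_for(action: str) -> Autonomy:
    """Unknown actions default to never-touch rather than auto-commit."""
    return BOUNDARY.get(action, Autonomy.NEVER)


def handoff_reason(confidence: float, runbook_hits: int,
                   involves_money: bool) -> Optional[str]:
    """Return why the agent should hand off, or None if it may proceed."""
    if involves_money:
        return "request involves money"
    if confidence < MIN_CONFIDENCE:
        return "classifier confidence too low"
    if runbook_hits == 0:
        return "no relevant runbook entry"
    return None
```

A non-`None` result is what the agent would pass as the `reason` argument to `request_human_handoff`.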

Trade-offs to discuss

Auto-send drafts when the agent is confident. Faster customer experience, less of your time spent reviewing. Costs: the agent’s confidence is rarely well-calibrated; one wrong auto-send to a high-value customer can outweigh many small wins.

Always queue for human review. Slower, more of your time, but no autonomous mistakes. Each review takes seconds once the draft is good; the queue stays smaller than you’d expect once accuracy is reasonable.

Use one capable model for everything. Simpler prompts, fewer moving parts, easier to evaluate end-to-end. Costs more per call. Probably overspends on trivial classifications.

Tier the models. Small/cheap for classification and routing, larger/capable for draft composition. Cuts cost meaningfully. Adds an integration surface, two latency profiles, two failure modes.
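The one-model and tiered options differ only in which model each step is wired to, which a small sketch makes visible. Everything here is hypothetical: `ModelFn` is an assumed prompt-in, text-out interface, and the prompts are placeholders rather than recommended wording.

```python
from dataclasses import dataclass
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical interface: prompt in, text out


@dataclass
class ModelTiers:
    classify: ModelFn  # small/cheap model in the tiered setup
    compose: ModelFn   # larger/capable model


def single_tier(model: ModelFn) -> ModelTiers:
    """The 'one capable model for everything' option."""
    return ModelTiers(classify=model, compose=model)


def handle(ticket_text: str, tiers: ModelTiers) -> tuple[str, str]:
    """Classify first, then compose a draft using the label."""
    label = tiers.classify(f"Classify this ticket: {ticket_text}")
    draft = tiers.compose(f"Ticket class: {label}\nDraft a reply to: {ticket_text}")
    return label, draft
```

Keeping both options behind the same `ModelTiers` shape lets you start with `single_tier` and swap in a cheaper classifier later without touching `handle`.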

Other axes to surface:

System-prompt density vs retrieval. A 200-line system prompt that fits everything in at once costs more per call but never misses the right context. RAG-style retrieval per call costs less overall but introduces the “retrieved the wrong article” failure mode. For first agents, lean into the system prompt; you can split later when call volume justifies the engineering.
Synchronous vs queued execution. A synchronous “draft this reply now” pattern matches user-driven workflows (you triggered it; you wait). A queued “process all overnight tickets and have drafts ready by morning” pattern matches background workflows (you didn’t trigger it; you arrive to outputs). Pick based on whether the agent is triggered by you or by an event.
Where errors live. With auto-commit, errors are visible to the customer / external system instantly. With propose-for-approval, errors are visible only to you and editable before commit. The first pattern surfaces failures faster (you’ll get angry replies) but at the cost of customer trust; the second is slower to surface but contained.
Drafting one option vs three. “Here are three reply candidates, pick one” is a UX pattern that pushes the decision back to you in a useful way for ambiguous tickets. It also doubles or triples the cost. Useful for high-stakes drafts (legal, refund-adjacent); wasteful for routine ones.
Tagging and metadata as a separate cheap pass. Before composing the reply, a fast pass tags the ticket and pulls context. This separation makes the reply prompt simpler and gives you intermediate observables — even if the reply is poor, you can tell whether the tagging was right.

Evaluation criteria

A passing answer:

Names a real workflow with real systems and real specifics. “An agent for customer support” is too vague; “an agent for inbound Zendesk tickets in a healthcare-adjacent SaaS” is right-sized.
Has an architecture diagram with components named and data flow drawn.
Lists the tools concretely — get_ticket, draft_reply, request_human_handoff. Not “the agent looks at the ticket.”
Draws the autonomy boundary — auto / propose / never — explicitly.
Has a success metric that is measurable, not vibey.
Names a failure path — what happens when the agent doesn’t know what to do.

A strong answer adds:

A phased rollout — v1 read-only drafts for the easiest ticket class, v2 expand to more classes, v3 selective auto-send on high-confidence trivial replies, each with its own success gate.
An eval harness — frozen ticket sample, manually-rated, tracked over time.
An acknowledgement of cost — what a ticket costs at scale, what the latency budget is, whether the unit economics actually work.
The candidate’s own reflection: which decision in this design they’re least sure about and what they’d test first to find out. That self-awareness reads as senior even when the design itself is foundational.
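The eval harness does not need to be elaborate to be useful. A minimal sketch of the tracking step, assuming each draft from the frozen sample has been hand-rated with one of three labels; the label strings and the sample ratings below are made up for illustration.

```python
from collections import Counter

RATINGS = ("sent_as_is", "heavily_edited", "discarded")


def summarise(ratings: list[str]) -> dict[str, float]:
    """Percentage of drafts in each rating bucket for one eval run."""
    if not ratings:
        raise ValueError("no ratings to summarise")
    unknown = set(ratings) - set(RATINGS)
    if unknown:
        raise ValueError(f"unknown ratings: {sorted(unknown)}")
    counts = Counter(ratings)
    return {r: 100 * counts[r] / len(ratings) for r in RATINGS}


# Illustrative run over a 50-ticket sample (invented numbers).
run = ["sent_as_is"] * 30 + ["heavily_edited"] * 15 + ["discarded"] * 5
print(summarise(run))
# {'sent_as_is': 60.0, 'heavily_edited': 30.0, 'discarded': 10.0}
```

Storing one such summary per prompt change gives you the "tracked over time" part with nothing more than a list of dictionaries.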

The single most underrated decision in first-agent design

Where to not let the agent act. Most designs over-index on what the agent can do — the tool list, the autonomy ceiling, the auto-send threshold. The senior move is to start by listing what the agent will never do: never send a reply without your click, never issue any credit, never delete data, never modify account state, never commit to a timeline on the company’s behalf. That list becomes the safety perimeter. Then you design forward from inside it. Designers who start from the perimeter ship faster than designers who start from the ambition, because the perimeter rarely needs revisiting and the ambition always does.