LLM-Powered Customer Support Bot
RAG over knowledge bases, conversation memory, escalation handoff, guardrails.
Step 1 — Clarify Requirements#
Functional
- A customer asks a question in chat; the bot answers using the company’s knowledge base.
- The bot can take account-specific actions: look up an order, check a balance, initiate a refund (within policy limits).
- When the bot is unsure or the user asks, it hands off to a human agent with full conversation context.
- Out of scope: voice channels, training the LLM (treated as a hosted API), the agent dashboard UI.
Non-functional
- 99.9% availability — outage of the bot falls back to a queue for human agents (degraded but functional).
- First-token latency under 1.5 s p95.
- Hallucination rate (answers not grounded in the KB) below 1%, ideally below 0.1% for high-stakes domains (financial, medical).
- Auditable: every answer must be tied to the KB documents it was derived from.
Step 2 — Capacity Estimation#
- Conversations/day: 1 M (mid-large SaaS company at scale).
- Concurrent active: ~10 K at peak.
- Messages per conversation: ~6 (3 user, 3 bot).
- KB size: 50 K documents × 5 KB = ~250 MB raw; embeddings: 50 K × 1024 × 4 B = ~200 MB.
- Per-message LLM cost: ~30 K/day** in inference. Real cost driver.
- Retrieval QPS: ~50 K/min at peak vector-search QPS — small, fits a single replica of any modern vector DB.
The interesting design pressures are grounding (don’t hallucinate) and escalation (hand off cleanly).
Step 3 — System Interface#
POST /chat Body: { conv_id?, user_message, account_id? } → SSE stream of bot tokens, plus structured events: - cited_sources: [doc_id, ...] - tool_call: { tool, args } (e.g., look up order) - escalation_proposed: bool - escalation_confirmed: bool (with handoff_id)
POST /escalate { conv_id, reason }GET /conversations/:idPOST /feedback { conv_id, rating, free_text } // for evals
GET /kb/sync_status // KB indexing freshnessStep 4 — High-Level Design#
┌── KB ingestor (docs → chunks → embeddings) ─→ vector store │ │ │ │client (chat) ──→ API ─→ orchestrator ──┬── retriever ───────────────────────────────────┘ │ │ │ └── citation tracker │ ├── tool-use planner ── allowed-tools registry │ ├── LLM provider (streaming) │ ├── output guardrails (safety + grounding check) │ └── escalation router ── agent assignment + handoff │ ▼ human agent dashboardThe orchestrator is the brain. Each user message goes through a deterministic pipeline (retrieve → reason → answer → check), with the LLM as a powerful subroutine, not the system itself.
Step 5 — Data Model#
Conversations:
table conversations conv_id uuid PK user_id uuid account_id uuid state enum(active, escalated, resolved) created_at timestamp agent_id uuid? -- assigned human, if escalated
table messages conv_id uuid PK msg_id timeuuid CK role enum(user, bot, agent, system, tool) content text cited_docs list<doc_id> tools_used list<{name, args, result}> ts timestampKnowledge base:
table kb_documents doc_id uuid PK title string body text source_url string version int updated_at timestamp
table kb_chunks chunk_id uuid PK doc_id uuid content text // ~500 tokens embedding vector(1024) metadata json // category, audience, freshnessVector index is built over kb_chunks.embedding.
Tool registry:
table tools name string PK description string -- LLM-facing description input_schema json scopes list<string> -- account / role permissions required handler_uri stringStep 6 — Detailed Design#
Retrieval-augmented generation (RAG)#
The hot path:
On user message M:1. Rewrite M to a search query (often via the LLM itself, conditioned on conversation history).2. Vector search: top-K relevant chunks by cosine similarity. K = 10-20.3. Optional: re-rank with a cross-encoder for higher relevance.4. Build a prompt: [system instructions] [retrieved chunks with their doc_ids] [conversation history] [user message]5. Stream LLM completion.6. Parse output for citations; verify each cited doc_id exists in retrieved set.7. Stream tokens back to client.The orchestrator enforces “answer only from retrieved chunks” via a system prompt and a structured output check. Chunks that aren’t cited get filtered out of follow-up turns to save tokens.
Tool use#
A subset of customer questions need account-specific data:
- “Where’s my order #1234?” → call
lookup_order(order_id=1234). - “Refund my last purchase” → call
initiate_refund(charge_id=..., amount=...).
The LLM decides when to call a tool by emitting structured output (function-calling format). The orchestrator validates the call against the registry, runs it, and feeds the result back into the conversation:
loop: call LLM with conversation + tool descriptions if LLM emits tool_call: run tool, append result to convo, continue else: stream final answer to userBounded loop (max 5 tool calls per turn) to prevent runaway agents.
Permissions on tools#
Each tool declares required scopes. The orchestrator checks the user’s account and role before invoking. The LLM never has direct access to backend APIs — only the orchestrator does, and only via vetted tool handlers.
This is the security boundary. A jailbreak that gets the LLM to “decide” to call delete_all_customer_data is foiled at the registry check.
Escalation to human#
When to escalate:
- User explicitly asks for a human.
- Bot’s confidence in its answer is below threshold (a calibrated head, or a heuristic over retrieval scores).
- The query is in a high-stakes category that policy requires (cancellations, complaints, anything involving cash > $X).
- Conversation has gone N turns without resolution.
On escalation:
1. Mark conversation state = ESCALATED, queued for agent assignment.2. Generate a summary of the conversation so far (LLM call, ~100 tokens).3. Route to the agent queue: pick a free agent based on language, skill, account tier.4. Hand off: agent's dashboard shows the conversation + summary + cited KB docs.5. From this point, bot is silent unless agent invites it back.The handoff summary is the difference between “agent reads 20 messages to catch up” and “agent reads 3 sentences”. It’s a major productivity lever.
Conversation memory#
A 20-turn conversation has thousands of tokens of history; sending all of it on every turn is expensive. Strategies:
For support, sliding window of ~10 turns plus the user’s account context (name, plan, recent orders) is usually enough.
KB ingestion pipeline#
docs source (Confluence, Notion, Zendesk articles) → fetch on schedule → chunk (typically 500 tokens, with overlap to preserve context across boundaries) → embed (current embedding model) → upsert into vector store → mark old chunks of changed docs as deleted (next index rebuild reclaims)Re-embed everything when the embedding model changes. Atomic swap of the index alias.
Output guardrails#
Before streaming the bot’s response to the user:
- Safety classifier: blocks toxic / discriminatory output.
- PII redaction: strip account numbers, emails, etc. that weren’t in the input.
- Policy compliance: regex / classifier for category-specific rules (“never promise a refund without a confirmation”).
- Citation enforcement: every factual claim must point to a chunk in the retrieved set.
Failures here pause generation, escalate, or return a fallback (“Let me get a teammate to help with that”).
Evaluation#
Bots regress silently. Continuous evaluation:
- Golden test set: ~500 hand-curated questions with expected references. Run after every model / KB change. CI gate.
- Online sampling: 1% of conversations sampled for human review; results feed back into training signals and KB gaps.
- Confidence calibration: track how often “high-confidence” answers were correct; alert if calibration drifts.
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: KB coverage and freshness. The bot is only as good as the KB. A question with no relevant chunks gets either a hallucinated answer or a vague refusal. Investment in KB authoring, retrieval-failure logging (“the bot couldn’t find an answer here”), and freshness alerts is what makes the system feel competent.
Bottleneck #2: latency of multi-step tool use. A conversation that calls 3 tools in sequence is 4 LLM round-trips and 3 backend round-trips — easily 10+ seconds. Most production bots cap at 2 tool calls per turn; complex multi-step flows are explicit (UI-driven), not free-form agents.
Bottleneck #3: inference cost at high message volume. $30 K/day in LLM calls is significant. Mitigations: use a smaller / cheaper model for the retrieval-query rewrite step; cache common questions (the same FAQ-class question gets the same answer); detect questions answerable from the FAQ directly and bypass LLM.
Alternative I’d push back on: a single mega-prompt that retrieves, decides, and answers in one call. Tempting because it’s simpler, but loses traceability (“why did the bot answer that?”), makes guardrails impossible to insert between steps, and makes the bot’s behavior emergent and hard to debug. The orchestrator pattern keeps each step inspectable.
What breaks first at 10× scale: the vector store. Already snug at 50 K docs; at 5 M chunks the index serving cost grows, and the recall quality of off-the-shelf ANN starts dropping. Domain-tuned embedding models + hybrid (sparse + dense) retrieval keep quality up at scale.
Companies this resembles#
Intercom Fin, Zendesk AI, Klarna’s AI assistant, Ada, Glean for internal IT support, Crisp. Cousins: ChatGPT plugins / Custom GPTs (general-purpose RAG), Microsoft Copilot for support agents (assistive rather than fully autonomous).
Related systems#
- ChatGPT-style Conversational System — the inference layer this design depends on.
- AI / ML Data Infrastructure — the vector store and embedding pipeline substrate.
- Distributed Search — hybrid retrieval (lexical + vector) often outperforms vector alone.
- Server-Side Error Monitoring — for tracking bot failures and escalations.