LLM-Powered Customer Support Bot — System Design

Step 1 — Clarify Requirements#

Functional

A customer asks a question in chat; the bot answers using the company’s knowledge base.
The bot can take account-specific actions: look up an order, check a balance, initiate a refund (within policy limits).
When the bot is unsure or the user asks, it hands off to a human agent with full conversation context.
Out of scope: voice channels, training the LLM (treated as a hosted API), the agent dashboard UI.

Non-functional

99.9% availability — outage of the bot falls back to a queue for human agents (degraded but functional).
First-token latency under 1.5 s p95.
Hallucination rate (answers not grounded in the KB) below 1%, ideally below 0.1% for high-stakes domains (financial, medical).
Auditable: every answer must be tied to the KB documents it was derived from.

Step 2 — Capacity Estimation#

Conversations/day: 1 M (mid-large SaaS company at scale).
Concurrent active: ~10 K at peak.
Messages per conversation: ~6 (3 user, 3 bot).
KB size: 50 K documents × 5 KB = ~250 MB raw; embeddings: 50 K × 1024 × 4 B = ~200 MB.
Per-message LLM cost: ~ $0.005 (back-of-envelope at frontier-model rates). 6 M messages/day = **$ 30 K/day** in inference. Real cost driver.
Retrieval QPS: ~50 K/min at peak vector-search QPS — small, fits a single replica of any modern vector DB.

The interesting design pressures are grounding (don’t hallucinate) and escalation (hand off cleanly).

Step 3 — System Interface#

POST /chat
  Body: { conv_id?, user_message, account_id? }
  → SSE stream of bot tokens, plus structured events:
       - cited_sources: [doc_id, ...]
       - tool_call: { tool, args }       (e.g., look up order)
       - escalation_proposed: bool
       - escalation_confirmed: bool       (with handoff_id)

POST /escalate { conv_id, reason }
GET  /conversations/:id
POST /feedback { conv_id, rating, free_text }    // for evals

GET  /kb/sync_status                     // KB indexing freshness

Step 4 — High-Level Design#

                                    ┌── KB ingestor (docs → chunks → embeddings) ─→ vector store
                                    │                                                    │
                                    │                                                    │
client (chat) ──→ API ─→ orchestrator ──┬── retriever ───────────────────────────────────┘
                                        │           │
                                        │           └── citation tracker
                                        │
                                        ├── tool-use planner ── allowed-tools registry
                                        │
                                        ├── LLM provider (streaming)
                                        │
                                        ├── output guardrails (safety + grounding check)
                                        │
                                        └── escalation router ── agent assignment + handoff
                                                  │
                                                  ▼
                                              human agent dashboard

The orchestrator is the brain. Each user message goes through a deterministic pipeline (retrieve → reason → answer → check), with the LLM as a powerful subroutine, not the system itself.

Step 5 — Data Model#

Conversations:

table conversations
  conv_id        uuid     PK
  user_id        uuid
  account_id     uuid
  state          enum(active, escalated, resolved)
  created_at     timestamp
  agent_id       uuid?    -- assigned human, if escalated

table messages
  conv_id        uuid       PK
  msg_id         timeuuid   CK
  role           enum(user, bot, agent, system, tool)
  content        text
  cited_docs     list<doc_id>
  tools_used     list<{name, args, result}>
  ts             timestamp

Knowledge base:

table kb_documents
  doc_id      uuid     PK
  title       string
  body        text
  source_url  string
  version     int
  updated_at  timestamp

table kb_chunks
  chunk_id    uuid     PK
  doc_id      uuid
  content     text        // ~500 tokens
  embedding   vector(1024)
  metadata    json         // category, audience, freshness

Vector index is built over kb_chunks.embedding.

Tool registry:

table tools
  name           string   PK
  description    string   -- LLM-facing description
  input_schema   json
  scopes         list<string>    -- account / role permissions required
  handler_uri    string

Step 6 — Detailed Design#

Retrieval-augmented generation (RAG)#

The hot path:

On user message M:
1. Rewrite M to a search query (often via the LLM itself, conditioned on conversation history).
2. Vector search: top-K relevant chunks by cosine similarity. K = 10-20.
3. Optional: re-rank with a cross-encoder for higher relevance.
4. Build a prompt:
     [system instructions]
     [retrieved chunks with their doc_ids]
     [conversation history]
     [user message]
5. Stream LLM completion.
6. Parse output for citations; verify each cited doc_id exists in retrieved set.
7. Stream tokens back to client.

The orchestrator enforces “answer only from retrieved chunks” via a system prompt and a structured output check. Chunks that aren’t cited get filtered out of follow-up turns to save tokens.

Tool use#

A subset of customer questions need account-specific data:

“Where’s my order #1234?” → call lookup_order(order_id=1234).
“Refund my last purchase” → call initiate_refund(charge_id=..., amount=...).

The LLM decides when to call a tool by emitting structured output (function-calling format). The orchestrator validates the call against the registry, runs it, and feeds the result back into the conversation:

loop:
   call LLM with conversation + tool descriptions
   if LLM emits tool_call: run tool, append result to convo, continue
   else: stream final answer to user

Bounded loop (max 5 tool calls per turn) to prevent runaway agents.

Permissions on tools#

Each tool declares required scopes. The orchestrator checks the user’s account and role before invoking. The LLM never has direct access to backend APIs — only the orchestrator does, and only via vetted tool handlers.

This is the security boundary. A jailbreak that gets the LLM to “decide” to call delete_all_customer_data is foiled at the registry check.

Escalation to human#

When to escalate:

User explicitly asks for a human.
Bot’s confidence in its answer is below threshold (a calibrated head, or a heuristic over retrieval scores).
The query is in a high-stakes category that policy requires (cancellations, complaints, anything involving cash > $X).
Conversation has gone N turns without resolution.

On escalation:

1. Mark conversation state = ESCALATED, queued for agent assignment.
2. Generate a summary of the conversation so far (LLM call, ~100 tokens).
3. Route to the agent queue: pick a free agent based on language, skill, account tier.
4. Hand off: agent's dashboard shows the conversation + summary + cited KB docs.
5. From this point, bot is silent unless agent invites it back.

The handoff summary is the difference between “agent reads 20 messages to catch up” and “agent reads 3 sentences”. It’s a major productivity lever.

Conversation memory#

A 20-turn conversation has thousands of tokens of history; sending all of it on every turn is expensive. Strategies:

Sliding window — keep the last K messages verbatim, drop older ones. Simple; loses information if user references something from earlier.

Running summary + window — periodically summarize older turns into a single rolling summary; keep last K verbatim. More tokens-efficient, costs an extra LLM call per summarize.

For support, sliding window of ~10 turns plus the user’s account context (name, plan, recent orders) is usually enough.

KB ingestion pipeline#

docs source (Confluence, Notion, Zendesk articles) → fetch on schedule
   → chunk (typically 500 tokens, with overlap to preserve context across boundaries)
   → embed (current embedding model)
   → upsert into vector store
   → mark old chunks of changed docs as deleted (next index rebuild reclaims)

Re-embed everything when the embedding model changes. Atomic swap of the index alias.

Output guardrails#

Before streaming the bot’s response to the user:

Safety classifier: blocks toxic / discriminatory output.
PII redaction: strip account numbers, emails, etc. that weren’t in the input.
Policy compliance: regex / classifier for category-specific rules (“never promise a refund without a confirmation”).
Citation enforcement: every factual claim must point to a chunk in the retrieved set.

Failures here pause generation, escalate, or return a fallback (“Let me get a teammate to help with that”).

Evaluation#

Bots regress silently. Continuous evaluation:

Golden test set: ~500 hand-curated questions with expected references. Run after every model / KB change. CI gate.
Online sampling: 1% of conversations sampled for human review; results feed back into training signals and KB gaps.
Confidence calibration: track how often “high-confidence” answers were correct; alert if calibration drifts.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: KB coverage and freshness. The bot is only as good as the KB. A question with no relevant chunks gets either a hallucinated answer or a vague refusal. Investment in KB authoring, retrieval-failure logging (“the bot couldn’t find an answer here”), and freshness alerts is what makes the system feel competent.

Bottleneck #2: latency of multi-step tool use. A conversation that calls 3 tools in sequence is 4 LLM round-trips and 3 backend round-trips — easily 10+ seconds. Most production bots cap at 2 tool calls per turn; complex multi-step flows are explicit (UI-driven), not free-form agents.

Bottleneck #3: inference cost at high message volume. $30 K/day in LLM calls is significant. Mitigations: use a smaller / cheaper model for the retrieval-query rewrite step; cache common questions (the same FAQ-class question gets the same answer); detect questions answerable from the FAQ directly and bypass LLM.

Alternative I’d push back on: a single mega-prompt that retrieves, decides, and answers in one call. Tempting because it’s simpler, but loses traceability (“why did the bot answer that?”), makes guardrails impossible to insert between steps, and makes the bot’s behavior emergent and hard to debug. The orchestrator pattern keeps each step inspectable.

What breaks first at 10× scale: the vector store. Already snug at 50 K docs; at 5 M chunks the index serving cost grows, and the recall quality of off-the-shelf ANN starts dropping. Domain-tuned embedding models + hybrid (sparse + dense) retrieval keep quality up at scale.

Companies this resembles#

Intercom Fin, Zendesk AI, Klarna’s AI assistant, Ada, Glean for internal IT support, Crisp. Cousins: ChatGPT plugins / Custom GPTs (general-purpose RAG), Microsoft Copilot for support agents (assistive rather than fully autonomous).

ChatGPT-style Conversational System — the inference layer this design depends on.
AI / ML Data Infrastructure — the vector store and embedding pipeline substrate.
Distributed Search — hybrid retrieval (lexical + vector) often outperforms vector alone.
Server-Side Error Monitoring — for tracking bot failures and escalations.