LLM-Powered Customer Support Bot

RAG over knowledge bases, conversation memory, escalation handoff, guardrails.

System Intermediate
8 min read
llm rag support escalation
Companies this resembles: Intercom Fin · Zendesk AI · Klarna · Ada

Step 1 — Clarify Requirements#

Functional

  • A customer asks a question in chat; the bot answers using the company’s knowledge base.
  • The bot can take account-specific actions: look up an order, check a balance, initiate a refund (within policy limits).
  • When the bot is unsure or the user asks, it hands off to a human agent with full conversation context.
  • Out of scope: voice channels, training the LLM (treated as a hosted API), the agent dashboard UI.

Non-functional

  • 99.9% availability — outage of the bot falls back to a queue for human agents (degraded but functional).
  • First-token latency under 1.5 s p95.
  • Hallucination rate (answers not grounded in the KB) below 1%, ideally below 0.1% for high-stakes domains (financial, medical).
  • Auditable: every answer must be tied to the KB documents it was derived from.

Step 2 — Capacity Estimation#

  • Conversations/day: 1 M (mid-large SaaS company at scale).
  • Concurrent active: ~10 K at peak.
  • Messages per conversation: ~6 (3 user, 3 bot).
  • KB size: 50 K documents × 5 KB = ~250 MB raw; embeddings: 50 K × 1024 × 4 B = ~200 MB.
  • Per-message LLM cost: ~0.005(backofenvelopeatfrontiermodelrates).6Mmessages/day=0.005 (back-of-envelope at frontier-model rates). 6 M messages/day = **30 K/day** in inference. Real cost driver.
  • Retrieval QPS: ~50 K/min at peak vector-search QPS — small, fits a single replica of any modern vector DB.

The interesting design pressures are grounding (don’t hallucinate) and escalation (hand off cleanly).

Step 3 — System Interface#

POST /chat
Body: { conv_id?, user_message, account_id? }
→ SSE stream of bot tokens, plus structured events:
- cited_sources: [doc_id, ...]
- tool_call: { tool, args } (e.g., look up order)
- escalation_proposed: bool
- escalation_confirmed: bool (with handoff_id)
POST /escalate { conv_id, reason }
GET /conversations/:id
POST /feedback { conv_id, rating, free_text } // for evals
GET /kb/sync_status // KB indexing freshness

Step 4 — High-Level Design#

┌── KB ingestor (docs → chunks → embeddings) ─→ vector store
│ │
│ │
client (chat) ──→ API ─→ orchestrator ──┬── retriever ───────────────────────────────────┘
│ │
│ └── citation tracker
├── tool-use planner ── allowed-tools registry
├── LLM provider (streaming)
├── output guardrails (safety + grounding check)
└── escalation router ── agent assignment + handoff
human agent dashboard

The orchestrator is the brain. Each user message goes through a deterministic pipeline (retrieve → reason → answer → check), with the LLM as a powerful subroutine, not the system itself.

Step 5 — Data Model#

Conversations:

table conversations
conv_id uuid PK
user_id uuid
account_id uuid
state enum(active, escalated, resolved)
created_at timestamp
agent_id uuid? -- assigned human, if escalated
table messages
conv_id uuid PK
msg_id timeuuid CK
role enum(user, bot, agent, system, tool)
content text
cited_docs list<doc_id>
tools_used list<{name, args, result}>
ts timestamp

Knowledge base:

table kb_documents
doc_id uuid PK
title string
body text
source_url string
version int
updated_at timestamp
table kb_chunks
chunk_id uuid PK
doc_id uuid
content text // ~500 tokens
embedding vector(1024)
metadata json // category, audience, freshness

Vector index is built over kb_chunks.embedding.

Tool registry:

table tools
name string PK
description string -- LLM-facing description
input_schema json
scopes list<string> -- account / role permissions required
handler_uri string

Step 6 — Detailed Design#

Retrieval-augmented generation (RAG)#

The hot path:

On user message M:
1. Rewrite M to a search query (often via the LLM itself, conditioned on conversation history).
2. Vector search: top-K relevant chunks by cosine similarity. K = 10-20.
3. Optional: re-rank with a cross-encoder for higher relevance.
4. Build a prompt:
[system instructions]
[retrieved chunks with their doc_ids]
[conversation history]
[user message]
5. Stream LLM completion.
6. Parse output for citations; verify each cited doc_id exists in retrieved set.
7. Stream tokens back to client.

The orchestrator enforces “answer only from retrieved chunks” via a system prompt and a structured output check. Chunks that aren’t cited get filtered out of follow-up turns to save tokens.

Tool use#

A subset of customer questions need account-specific data:

  • “Where’s my order #1234?” → call lookup_order(order_id=1234).
  • “Refund my last purchase” → call initiate_refund(charge_id=..., amount=...).

The LLM decides when to call a tool by emitting structured output (function-calling format). The orchestrator validates the call against the registry, runs it, and feeds the result back into the conversation:

loop:
call LLM with conversation + tool descriptions
if LLM emits tool_call: run tool, append result to convo, continue
else: stream final answer to user

Bounded loop (max 5 tool calls per turn) to prevent runaway agents.

Permissions on tools#

Each tool declares required scopes. The orchestrator checks the user’s account and role before invoking. The LLM never has direct access to backend APIs — only the orchestrator does, and only via vetted tool handlers.

This is the security boundary. A jailbreak that gets the LLM to “decide” to call delete_all_customer_data is foiled at the registry check.

Escalation to human#

When to escalate:

  • User explicitly asks for a human.
  • Bot’s confidence in its answer is below threshold (a calibrated head, or a heuristic over retrieval scores).
  • The query is in a high-stakes category that policy requires (cancellations, complaints, anything involving cash > $X).
  • Conversation has gone N turns without resolution.

On escalation:

1. Mark conversation state = ESCALATED, queued for agent assignment.
2. Generate a summary of the conversation so far (LLM call, ~100 tokens).
3. Route to the agent queue: pick a free agent based on language, skill, account tier.
4. Hand off: agent's dashboard shows the conversation + summary + cited KB docs.
5. From this point, bot is silent unless agent invites it back.

The handoff summary is the difference between “agent reads 20 messages to catch up” and “agent reads 3 sentences”. It’s a major productivity lever.

Conversation memory#

A 20-turn conversation has thousands of tokens of history; sending all of it on every turn is expensive. Strategies:

Sliding window — keep the last K messages verbatim, drop older ones. Simple; loses information if user references something from earlier.
Running summary + window — periodically summarize older turns into a single rolling summary; keep last K verbatim. More tokens-efficient, costs an extra LLM call per summarize.

For support, sliding window of ~10 turns plus the user’s account context (name, plan, recent orders) is usually enough.

KB ingestion pipeline#

docs source (Confluence, Notion, Zendesk articles) → fetch on schedule
→ chunk (typically 500 tokens, with overlap to preserve context across boundaries)
→ embed (current embedding model)
→ upsert into vector store
→ mark old chunks of changed docs as deleted (next index rebuild reclaims)

Re-embed everything when the embedding model changes. Atomic swap of the index alias.

Output guardrails#

Before streaming the bot’s response to the user:

  • Safety classifier: blocks toxic / discriminatory output.
  • PII redaction: strip account numbers, emails, etc. that weren’t in the input.
  • Policy compliance: regex / classifier for category-specific rules (“never promise a refund without a confirmation”).
  • Citation enforcement: every factual claim must point to a chunk in the retrieved set.

Failures here pause generation, escalate, or return a fallback (“Let me get a teammate to help with that”).

Evaluation#

Bots regress silently. Continuous evaluation:

  • Golden test set: ~500 hand-curated questions with expected references. Run after every model / KB change. CI gate.
  • Online sampling: 1% of conversations sampled for human review; results feed back into training signals and KB gaps.
  • Confidence calibration: track how often “high-confidence” answers were correct; alert if calibration drifts.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: KB coverage and freshness. The bot is only as good as the KB. A question with no relevant chunks gets either a hallucinated answer or a vague refusal. Investment in KB authoring, retrieval-failure logging (“the bot couldn’t find an answer here”), and freshness alerts is what makes the system feel competent.

Bottleneck #2: latency of multi-step tool use. A conversation that calls 3 tools in sequence is 4 LLM round-trips and 3 backend round-trips — easily 10+ seconds. Most production bots cap at 2 tool calls per turn; complex multi-step flows are explicit (UI-driven), not free-form agents.

Bottleneck #3: inference cost at high message volume. $30 K/day in LLM calls is significant. Mitigations: use a smaller / cheaper model for the retrieval-query rewrite step; cache common questions (the same FAQ-class question gets the same answer); detect questions answerable from the FAQ directly and bypass LLM.

Alternative I’d push back on: a single mega-prompt that retrieves, decides, and answers in one call. Tempting because it’s simpler, but loses traceability (“why did the bot answer that?”), makes guardrails impossible to insert between steps, and makes the bot’s behavior emergent and hard to debug. The orchestrator pattern keeps each step inspectable.

What breaks first at 10× scale: the vector store. Already snug at 50 K docs; at 5 M chunks the index serving cost grows, and the recall quality of off-the-shelf ANN starts dropping. Domain-tuned embedding models + hybrid (sparse + dense) retrieval keep quality up at scale.

Companies this resembles#

Intercom Fin, Zendesk AI, Klarna’s AI assistant, Ada, Glean for internal IT support, Crisp. Cousins: ChatGPT plugins / Custom GPTs (general-purpose RAG), Microsoft Copilot for support agents (assistive rather than fully autonomous).

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.