Retrieval-Augmented Generation (RAG) — Gen AI

Use cases

Retrieval-Augmented Generation gives a language model access to information it doesn’t have in its weights — your documents, your codebase, your support tickets, last week’s incidents. At query time the system retrieves the relevant pieces and includes them in the prompt; the model answers with that context.

The cases where RAG is clearly the right tool:

Q&A over a corpus that updates. Internal docs, knowledge bases, policy documents. The retrieval index can re-ingest hourly without retraining anything.
Citation-heavy answers. Legal, medical, scientific assistants. The user needs to know which document the claim came from.
Long-tail factual lookup. Specific product details, customer records, historical events not in the training data.
Domain adaptation without fine-tuning. Specialized vocabulary, internal jargon, organization-specific phrasing.

When RAG is not the right tool: tasks that need deep reasoning over the whole corpus (no single chunk is enough), tasks where the corpus fits in the model’s context window directly (just paste it), and tasks where the failure mode of the retriever is unsafe (don’t RAG a model into a medical diagnosis without a human in the loop).

System overview

A production RAG system has two pipelines — ingestion (offline, batch) and query (online, latency-sensitive):

Ingestion (offline)                Query (online)
─────────────────                  ──────────────
[Source documents]                 [User query]
       │                                 │
       ▼                                 ▼
[Loader / parser]                  [Query rewrite]
       │                                 │
       ▼                                 ▼
[Chunker]                          [Retrieve top-K]
       │                              ┌──┴──┐
       ▼                              │     │
[Embed + metadata extract]       Vector  Keyword
       │                          (dense)  (BM25)
       ▼                              │     │
[Index: vector + BM25 + meta]         └──┬──┘
       │                                 ▼
[Persisted store]                  [Hybrid merge]
                                         │
                                         ▼
                                   [Rerank top-N]
                                         │
                                         ▼
                                   [Assemble context]
                                         │
                                         ▼
                                   [LLM call]
                                         │
                                         ▼
                                  [Answer + citations]

Most “RAG isn’t working” debugging comes down to finding which box is failing. Symptoms surface in the right column, but the cause is often upstream in ingestion.

Key components

Chunking

Long documents are split into chunks before embedding. The shape of the chunks matters more than the model choice for retrieval quality.

Anchors that work in practice:

Size in the 200–800 token range. Below 200, the chunk lacks context; above 800, embeddings dilute and the retrieved chunk is too noisy to fit cleanly in the prompt.
Overlap of 10–20%. Stops information from being split across the chunk boundary unrecoverably.
Respect structure. Split on headings, paragraphs, code blocks — never mid-sentence or mid-function. Markdown-aware and code-aware chunkers are worth the engineering.
Metadata per chunk. Document title, section heading, source URL, last updated, ACL group. Used both for filtering at retrieval time and for citations at answer time.
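
The size, overlap, and structure anchors above can be sketched as a greedy paragraph packer. This is a minimal illustration, not a production chunker: `chunk_paragraphs` is a hypothetical helper, token counts are approximated by whitespace-separated words, and only blank-line paragraph boundaries are respected. A real chunker would use the embedding model's tokenizer and also handle headings and code blocks.

```python
import re


def chunk_paragraphs(text, max_tokens=800, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Paragraphs are never split. When a chunk is closed, its trailing
    paragraphs are carried into the next chunk as overlap, up to
    overlap_ratio * max_tokens words.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # crude stand-in for a real token count
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Walk backwards over the closed chunk to build the overlap.
            carried, carried_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > max_tokens * overlap_ratio:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The budget is soft: a single paragraph longer than `max_tokens` is kept whole rather than split mid-sentence, and a chunk that starts with carried-over overlap can run slightly over the limit.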

Embedding model

The embedding model maps each chunk to a vector. Two design knobs:

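Dimension. 384, 768, 1024, 1536, and 3072 are common. Higher dimensions can capture finer distinctions but cost more storage, memory, and bandwidth, and the retrieval gain depends on the model, so verify it on your own eval set. Most production systems sit at 768 or 1024.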
Domain. General-purpose embeddings (OpenAI text-embedding, Cohere, BGE, E5) work for most use cases. Domain-adapted embeddings (code-search, legal, biomedical) help for specialized corpora — but only if you have the eval set to prove the gain.

The embedding model used for ingestion must match the one used for queries. Re-embedding the entire corpus to switch models is a real migration; pick deliberately.

Index

The store needs to do approximate-nearest-neighbour search at low latency over potentially hundreds of millions of vectors:

Dedicated vector databases (Pinecone, Weaviate, Qdrant, Vespa, Milvus). Best-of-breed for vectors at scale; another system in the stack.
Existing OLTP / search store with vector support (Postgres + pgvector, OpenSearch, Elastic). One fewer system; lower ceiling on scale and latency.
In-process libraries (FAISS, hnswlib). Great for small corpora and notebooks; you own the persistence and the replication.

The index data structure is almost always HNSW (graph-based) for the working set, sometimes IVF or product quantization for archival at extreme scale.

Hybrid retrieval

Pure vector search misses on exact-match queries (product IDs, error codes, rare named entities). Pure keyword search misses on semantic queries (paraphrases, synonyms). Hybrid retrieval runs both, fuses the scores.

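The standard fusion approach is Reciprocal Rank Fusion (RRF): each result’s score is sum_over_retrievers(1 / (k + rank_in_that_retriever)), where rank starts at 1 and k is a smoothing constant (60 is a common default). A result that a retriever did not return contributes nothing for that retriever. RRF is parameter-light, doesn’t require score calibration between retrievers, and usually beats either retriever alone.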
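
The RRF formula above is short enough to write out directly. The sketch below assumes each retriever returns an ordered list of chunk IDs, best first; the function name and the default `k=60` (a commonly used value, not something this article prescribes) are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    rankings: iterable of lists, each ordered best-first by one retriever.
    An ID absent from a list simply gets no contribution from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # hypothetical vector-search results
bm25_hits = ["b", "d", "a"]    # hypothetical keyword-search results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
# fused == ["b", "a", "d", "c"]
```

Only ranks are used, never raw scores, which is why cosine similarities and BM25 scores can be fused without calibrating one against the other.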

Reranking

After hybrid retrieval returns top-K candidates (typically K = 50 to 200), a cross-encoder reranker scores (query, chunk) pairs jointly and re-orders. Cross-encoders are much more accurate than bi-encoder embeddings because they attend to query and chunk together, but they don’t scale to large corpora — which is why they only run on the K candidates the retriever already filtered down.

Common reranker choices: Cohere Rerank, BGE-reranker-v2, Voyage Rerank. Latency is typically 30–200 ms for K = 50, depending on model size, chunk length, and hardware.
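
The retrieve-then-rerank shape can be sketched with the scorer injected as a function. Everything here is illustrative: `rerank` is a hypothetical helper, and `overlap_score` is a toy word-overlap stand-in, not a cross-encoder. In a real system `score_fn` would wrap a call to one of the rerankers above.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-order (chunk_id, text) candidates by a joint (query, text) score.

    score_fn(query, text) -> float is called once per candidate, which is
    why this stage only runs on the K candidates the retriever returned.
    """
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Toy scorer: fraction of query words that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


candidates = [
    ("c1", "reset your password in settings"),
    ("c2", "billing and invoices"),
    ("c3", "password reset link expired"),
]
top = rerank("password reset", candidates, overlap_score, top_n=2)
# top == ["c1", "c3"]
```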

Query rewriting

The user’s literal query is often a bad retrieval input. Rewrite it before embedding:

HyDE (Hypothetical Document Embeddings): ask the LLM to write a hypothetical answer to the query, then embed that. Often retrieves better than the query itself because answers are closer to documents than questions are.
Decomposition: split a multi-part question into sub-questions, retrieve for each, merge.
Conversation rewrite: in multi-turn chats, rewrite the latest user message so it’s a self-contained query that doesn’t depend on history ("and what about the second one?" → "what are the dosage instructions for Drug X?").
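
HyDE and decomposition can both be sketched as thin wrappers around callables you already have. The names `generate`, `embed`, `split`, and `retrieve` are hypothetical placeholders for an LLM call, an embedding call, a sub-question splitter, and a retriever; the prompt wording is likewise only an assumption.

```python
def hyde_query_vector(query, generate, embed):
    """HyDE: embed a hypothetical answer instead of the raw query.

    generate(prompt) -> str and embed(text) -> vector are supplied by the caller.
    """
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\nPassage:"
    )
    hypothetical_answer = generate(prompt)
    return embed(hypothetical_answer)


def decompose_and_retrieve(question, split, retrieve, per_query_k=5):
    """Retrieve for each sub-question and merge, keeping first-seen order.

    split(question) -> list of sub-questions; retrieve(q) -> ranked chunk IDs.
    """
    seen, merged = set(), []
    for sub_question in split(question):
        for chunk_id in retrieve(sub_question)[:per_query_k]:
            if chunk_id not in seen:  # drop duplicates across sub-questions
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```

HyDE adds one LLM call to every query, so it is worth measuring the retrieval gain against the extra latency before turning it on by default.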

Implementation patterns

Citation-first prompting

Tell the model to cite the chunk for every claim, and pass each chunk with a stable ID. The post-processor checks that every citation maps to a retrieved chunk; uncited claims get flagged or stripped. Without this, models hallucinate fluently with retrieved context — the context becomes inspiration, not constraint.
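
A minimal sketch of that post-processing check, under two assumptions that are not from the article: citations appear as bracketed chunk IDs such as `[c1]` placed before the sentence's final punctuation, and a regex sentence split is good enough. `check_citations` is a hypothetical helper.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def check_citations(answer, retrieved_ids):
    """Report sentences with no citation and citations to unknown chunks."""
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    report = {"uncited": [], "unknown_ids": []}
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited:
            report["uncited"].append(sentence)
        for chunk_id in cited:
            if chunk_id not in retrieved_ids:
                report["unknown_ids"].append(chunk_id)
    return report


answer = "Refunds take 5 days [c1]. Shipping is free [c9]. Returns are easy."
report = check_citations(answer, retrieved_ids={"c1", "c2"})
# report == {"uncited": ["Returns are easy."], "unknown_ids": ["c9"]}
```

This only verifies that a citation points at a retrieved chunk. Whether the chunk actually supports the claim is a groundedness question for the evaluation layer.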

Refusal-on-no-result

If retrieval returns nothing above a score threshold, do not hand an empty context to the model — it will fabricate. Either short-circuit to “I don’t have information about that” or escalate to a different retrieval path (web search, broader corpus).
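
The short-circuit can be sketched as a guard in front of the LLM call. `retrieve` and `generate` are hypothetical callables, and `min_score=0.35` is an arbitrary placeholder: the right threshold depends on the retriever's score scale and should be tuned on an eval set.

```python
REFUSAL = "I don't have information about that."


def answer_or_refuse(query, retrieve, generate, min_score=0.35):
    """Call the model only when at least one chunk clears the threshold.

    retrieve(query) -> list of (chunk_text, score), best first.
    generate(query, context) -> str.
    """
    hits = [(text, score) for text, score in retrieve(query) if score >= min_score]
    if not hits:
        # An escalation path (web search, broader corpus) could go here instead.
        return REFUSAL
    context = "\n\n".join(text for text, _ in hits)
    return generate(query, context)
```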

Filtering before retrieval

If a user is asking about their own data, filter by user ID before the ANN search, not after. Filtering after means other users’ chunks can fill the top-K, so the actual answer is cut off before the filter ever sees it. Most vector databases support metadata-pre-filtered ANN search; use it.
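
A toy example makes the K-cutoff problem concrete. Exact dot-product search stands in for ANN here, and the `search` helper and row layout are assumptions made for illustration; a real vector database would take the filter as part of the query.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, rows, k, user_id=None, prefilter=True):
    """rows: list of (chunk_id, owner_id, vector). Returns chunk IDs, best first."""
    if prefilter and user_id is not None:
        pool = [r for r in rows if r[1] == user_id]  # filter, then rank
    else:
        pool = rows
    top = sorted(pool, key=lambda r: -dot(query_vec, r[2]))[:k]
    if not prefilter and user_id is not None:
        top = [r for r in top if r[1] == user_id]    # rank, then filter
    return [r[0] for r in top]


rows = [
    ("a1", "alice", [0.9, 0.1]),
    ("b1", "bob", [1.0, 0.0]),
    ("b2", "bob", [0.95, 0.05]),
]
pre = search([1.0, 0.0], rows, k=2, user_id="alice", prefilter=True)
post = search([1.0, 0.0], rows, k=2, user_id="alice", prefilter=False)
# pre == ["a1"]; post == [] because bob's chunks filled the top 2
```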

Re-ingestion pipeline as a first-class system

The retrieval index is only as fresh as the ingestion pipeline. Treat it as a system with its own monitoring: documents-ingested-per-hour, embedding-cost-per-day, freshness-lag-per-source, schema validation per doc type. Most “the RAG is wrong” complaints trace to ingestion gaps, not the retriever.
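
Of the metrics listed, freshness lag per source is the simplest to sketch. The example assumes you already record a last-successful-ingest timestamp per source; `stale_sources` and the source names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_ingested, max_lag, now=None):
    """Return {source: lag} for every source whose lag exceeds max_lag.

    last_ingested: {source_name: timezone-aware datetime of last successful ingest}.
    """
    now = now or datetime.now(timezone.utc)
    lags = {source: now - ts for source, ts in last_ingested.items()}
    return {source: lag for source, lag in lags.items() if lag > max_lag}


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_ingested = {
    "wiki": now - timedelta(minutes=30),
    "tickets": now - timedelta(hours=5),
}
stale = stale_sources(last_ingested, max_lag=timedelta(hours=1), now=now)
# stale == {"tickets": timedelta(hours=5)}
```

Alerting on this per source, rather than on a corpus-wide average, keeps one broken connector from hiding behind the healthy ones.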

Trade-offs

RAG — model knows nothing about the corpus; retrieves at query time. Knowledge is up-to-date and easy to expand. Failure mode is retrieval miss. Bounded by what fits in context.

Fine-tuning — model absorbs the corpus into weights. Knowledge is baked in; faster inference, no retrieval system. Failure mode is forgetting and stale data. Costly to refresh.

Most production systems are RAG, not fine-tuning, because corpora change. Fine-tuning is reserved for stable knowledge or for adapting style (tone, format), where the behavior should come from the model’s weights rather than from a retrieved source.

Other axes:

Many small chunks vs. few large chunks. Small chunks have precise retrieval but each carries less context; assembling many small chunks into a prompt costs tokens. Large chunks have noisier retrieval but are easier to consume.
Strict score threshold vs. always top-K. Strict thresholds let the system refuse cleanly when no relevant document exists; always-top-K guarantees an answer attempt at the cost of false positives.
One big index vs. many small indexes. Splitting by department / source / language gives you faster, cheaper retrieval and tighter ACLs; merging gives you cross-source answers. Most mature systems are a router on top of many small indexes.

Quality and evaluation

Evaluation is where most teams find that their RAG is worse than they thought. Build the eval harness on day one, not on day ninety.

Three eval layers, each independently informative:

Retrieval-only. For a curated query set with known relevant chunks (or known relevant documents), measure recall@K and MRR. Tells you whether the retriever is finding the right context.
End-to-end. Run the full pipeline and grade the answer. Use a strong LLM as a judge with a rubric: correctness, groundedness (every claim cited from context), completeness, faithfulness (no contradictions with context). RAGAS, TruLens, and similar frameworks formalize this.
Production telemetry. User thumbs-up/down, follow-up question rate, refusal rate, latency, cost per query. Feed examples that fail back into the eval set.
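
The retrieval-only layer needs nothing more than ranked IDs and a labelled query set. A minimal sketch of recall@K and MRR follows; the function names and the `(retrieved, relevant)` run format are assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """runs: non-empty list of (retrieved_ids, relevant_ids), one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


runs = [
    (["c1", "c2", "c3"], {"c2"}),  # relevant chunk at rank 2
    (["c4", "c5"], {"c9"}),        # relevant chunk never retrieved
]
metrics = evaluate(runs, k=2)
# metrics == {"recall@2": 0.5, "mrr": 0.25}
```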

Groundedness is often the most actionable single metric. If groundedness is 60%, the model is making claims the retrieved context does not support, which points at prompting or generation; if it’s 95% but correctness is 70%, the model is faithfully using context that lacks the right document, which points at retrieval. The diagnosis is different.

Common pitfalls

Chunking by character count, not structure. Mid-paragraph splits destroy semantics. Always respect headings, lists, code blocks.
Ignoring metadata. Chunks without source, freshness, and ACL metadata cannot be filtered, cited, or audited.
Skipping reranking. Bi-encoder retrieval alone is decent; adding a cross-encoder reranker is one of the largest quality wins in the pipeline. Worth its latency cost almost always.
Treating retrieval as a black box. When the model gives a wrong answer, log the retrieved chunks. The bug is almost always visible there.
Vector-only on exact-match queries. Pure dense retrieval misses product SKUs, error codes, names. Hybrid + RRF is the default; skipping BM25 is a self-inflicted wound.
No re-embedding plan when changing models. Vectors from different embedding models are not comparable, so querying an old index with a new model silently breaks retrieval. Plan the migration; keep both indexes online during cutover.
Letting the context grow unbounded. Each retrieved chunk costs tokens and degrades model attention. Cap the number of chunks and cap the total token budget; trim the tail past the cap.
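
The last pitfall, an unbounded context, has a mechanical fix: enforce both caps when assembling the prompt. In this sketch `assemble_context` is a hypothetical helper, the default caps are placeholders, and tokens are approximated by words unless a real `count_tokens` function is passed in.

```python
def assemble_context(ranked_chunks, max_chunks=8, max_tokens=3000, count_tokens=None):
    """Take chunks in rank order until either cap is hit; drop the tail.

    Returns (selected_chunks, tokens_used).
    """
    count_tokens = count_tokens or (lambda text: len(text.split()))
    selected, used = [], 0
    for chunk in ranked_chunks[:max_chunks]:  # cap on number of chunks
        cost = count_tokens(chunk)
        if used + cost > max_tokens:          # cap on total token budget
            break
        selected.append(chunk)
        used += cost
    return selected, used


selected, used = assemble_context(
    ["a b c", "d e", "f g h i", "j"], max_chunks=3, max_tokens=5
)
# selected == ["a b c", "d e"]; used == 5
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones further down, keeps the context strictly in rank order.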

A one-week RAG sanity audit

Day 1: pick 50 real user queries and 50 known-answer queries.
Day 2: log the top-10 retrieved chunks for each.
Day 3: count how many queries had the right chunk in the top 10 (recall@10) and how many had it at rank 1 (recall@1).
Day 4: for the queries where recall@10 was zero, look at the chunks; they’re either missing from the index (ingestion bug) or being out-ranked (retriever bug).
Day 5: fix the most common cause.

By the end of week one you’ll know whether your RAG problem is chunking, embedding, ranking, or the LLM. Most teams are surprised which one.
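
The Day 3 and Day 4 tallies can be automated once the retrieved chunks are logged. This sketch assumes each audit query has exactly one expected chunk ID and that you can list the IDs currently in the index; `triage` and its output keys are hypothetical.

```python
def triage(query_results, indexed_ids, k=10):
    """Count hits and classify misses for an audit run.

    query_results: list of (expected_chunk_id, retrieved_ids), best first.
    indexed_ids: set of chunk IDs present in the index.
    """
    counts = {"hit@1": 0, f"hit@{k}": 0, "missing_from_index": 0, "out_ranked": 0}
    for expected, retrieved in query_results:
        top = retrieved[:k]
        if expected in top:
            counts[f"hit@{k}"] += 1
            if top[0] == expected:
                counts["hit@1"] += 1
        elif expected not in indexed_ids:
            counts["missing_from_index"] += 1  # ingestion bug
        else:
            counts["out_ranked"] += 1          # retriever bug
    return counts


results = [
    ("c1", ["c1", "c2"]),  # hit at rank 1
    ("c2", ["c3", "c2"]),  # hit at rank 2
    ("c9", ["c1"]),        # c9 was never indexed
    ("c4", ["c1", "c2"]),  # c4 is indexed but out-ranked
]
counts = triage(results, indexed_ids={"c1", "c2", "c3", "c4"}, k=2)
# counts == {"hit@1": 1, "hit@2": 2, "missing_from_index": 1, "out_ranked": 1}
```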