AI-Powered Code Assistant
Latency-sensitive completion, repo-aware context, indexing strategies, evaluation harness.
Step 1 — Clarify Requirements#
Functional
- Inline completion: as the developer types, suggest the next chunk of code.
- Chat: developer can ask questions about their code or request refactors / new functions.
- Repo-aware: completions and answers use context from across the developer’s project, not just the current file.
- Multi-language; multi-IDE (VS Code, JetBrains, terminal).
- Out of scope: training the underlying code-model, organization-wide governance (treat as input).
Non-functional
- 99.9% availability — when the assistant is down, the IDE still works, but productivity drops measurably.
- Inline completion latency: p50 under 200 ms, p99 under 800 ms. Above this, developers stop accepting suggestions.
- Chat latency: first-token under 1.5 s.
- Privacy: by default, code never leaves the developer’s machine for indexing-only purposes; for inference, code is sent to the provider but not retained for training.
Step 2 — Capacity Estimation#
- Active developers: ~10 M (Copilot-scale figures).
- Concurrent active typing: ~500 K at peak.
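- Inline completions/sec: developers fire 1-5 completions/min when actively coding, which at 500 K concurrent typists is ~8-42 K/sec steady state. Plan for ~50-100 K completions/sec at peak to leave headroom for bursts and for requests that are cancelled and resubmitted mid-keystroke.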
- Chat messages/sec: ~1 K/sec at peak.
- Tokens per completion: ~50 input context + ~30 output = ~80 tokens. Cheap per call but high QPS.
- Repo indexes: 10 M devs × ~5 active projects = 50 M repos indexed at any time. Average index size: ~10 MB → 500 TB of indexes if all server-side.
Latency is the hard requirement; quality at low latency is the trade-off.
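The estimates above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that only replays the figures quoted in this step; the variable names are illustrative. Note that the raw per-developer rate works out to roughly 8-42 K requests/sec, so the 50-100 K/sec peak figure presumably includes headroom for bursts and resubmitted requests.

```python
# Back-of-envelope check of the Step 2 figures. All inputs are the
# estimates quoted above; nothing here is measured data.
CONCURRENT_TYPISTS = 500_000      # peak concurrent active typing
COMPLETIONS_PER_MIN = (1, 5)      # per developer, while actively coding
DEVELOPERS = 10_000_000
PROJECTS_PER_DEV = 5
INDEX_BYTES = 10 * 10**6          # ~10 MB average index


def completions_per_sec(concurrent: int, per_min: float) -> float:
    """Steady-state rate if every concurrent typist fires per_min requests."""
    return concurrent * per_min / 60


low = completions_per_sec(CONCURRENT_TYPISTS, COMPLETIONS_PER_MIN[0])
high = completions_per_sec(CONCURRENT_TYPISTS, COMPLETIONS_PER_MIN[1])
repos = DEVELOPERS * PROJECTS_PER_DEV
index_tb = repos * INDEX_BYTES / 10**12

print(f"completions/sec: {low:,.0f} to {high:,.0f}")
print(f"repos indexed: {repos:,}; server-side index: {index_tb:,.0f} TB")
```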
Step 3 — System Interface#
POST /completions
  Body: {
    prompt: string,                     // already-windowed context
    suffix: string,                     // text after the cursor (fill-in-the-middle)
    repo_context: { snippets: [...] },
    language: string,
    user_id, repo_id, file_id,
    max_tokens: int (~30 typical),
  }
  → SSE stream of tokens
POST /chat (same shape as a general LLM chat, plus repo_context)
POST /index/sync (client uploads index deltas; server returns ack)
GET /index/search (called by IDE during completion to fetch repo snippets)
The suffix enables fill-in-the-middle (FIM) completions: the model sees both what’s before and what’s after the cursor, producing context-appropriate insertions.
Step 4 — High-Level Design#
IDE plugin (local)
   │
   ├── local index: ctags, embeddings of code symbols, recent file history
   │
   ▼
secure tunnel ── API gateway ── auth + privacy controls
   │
   ▼
context assembler ── pulls repo snippets via
   │                   (a) IDE-provided context
   │                   (b) server-side repo index
   ▼
completion model (specialized code LLM, optimized for FIM)
   │
   ▼
output filters (license / secrets detection)
   │
   ▼
stream back
Side track: chat pipeline reuses retrieval; routes to bigger model
The IDE plugin is more than a thin client. It builds local context (recently-edited files, symbol references, imports) and ships only the relevant pieces to the server. This is the dominant latency win.
Step 5 — Data Model#
Local index (IDE-side)#
struct LocalIndex:
  files: map<path, {size, mtime, hash, embeddings: [chunk_emb]}>
  symbols: map<name, [{file, line, kind}]>
  imports: map<file, [imported_module]>
Built and maintained by the plugin. Updated incrementally as files change. Used to answer “what context should I send for this completion?” in single-digit ms.
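As an illustration of the incremental update, here is a minimal Python sketch of the local index. The class and method names (`FileEntry`, `SymbolRef`, `update_file`) are assumptions made for this example, and the line-based symbol extraction is a naive, Python-only stand-in for what ctags or a language server would provide. Embeddings are left empty.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    size: int
    mtime: float
    hash: str
    embeddings: list = field(default_factory=list)  # chunk embeddings, omitted here


@dataclass
class SymbolRef:
    file: str
    line: int
    kind: str


@dataclass
class LocalIndex:
    files: dict = field(default_factory=dict)    # path -> FileEntry
    symbols: dict = field(default_factory=dict)  # name -> [SymbolRef]
    imports: dict = field(default_factory=dict)  # path -> [module name]

    def update_file(self, path: str, content: str, mtime: float) -> bool:
        """Re-index one file only if its content hash changed.

        Returns True when the file was re-indexed, False when it was skipped.
        """
        digest = hashlib.sha256(content.encode()).hexdigest()
        old = self.files.get(path)
        if old is not None and old.hash == digest:
            old.mtime = mtime
            return False
        self.files[path] = FileEntry(len(content.encode()), mtime, digest)
        # Drop symbols previously recorded for this file, then re-extract.
        for name in list(self.symbols):
            kept = [s for s in self.symbols[name] if s.file != path]
            if kept:
                self.symbols[name] = kept
            else:
                del self.symbols[name]
        self.imports[path] = []
        for lineno, line in enumerate(content.splitlines(), start=1):
            words = line.split()
            if len(words) < 2:
                continue
            if words[0] in ("def", "class"):
                name = words[1].split("(")[0].rstrip(":")
                self.symbols.setdefault(name, []).append(
                    SymbolRef(path, lineno, words[0]))
            elif words[0] in ("import", "from"):
                self.imports[path].append(words[1])
        return True
```

Hashing the content rather than trusting mtime alone keeps a save-without-changes from triggering a re-index, which matters when the index has to answer within a few milliseconds.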
Server-side repo index (optional, for chat / search)#
Built by an indexer service when a developer enables “repo-wide chat”. Stores:
table repo_chunks
  repo_id            uuid
  chunk_id           uuid
  path               string
  start_line         int
  end_line           int
  content            text
  embedding          vector(768)
  symbol_signatures  list<string>   // ctags-style
Vector index per repo, sharded by repo_id. Built incrementally on push events; full rebuild on language-server change.
Completion request#
prompt = serialize:
  [system instructions: "complete the code"]
  [relevant repo snippets: 5-10 chunks of ~50 lines each]
  [current file before cursor: up to 2000 tokens]
  <FIM marker>
  [current file after cursor: up to 500 tokens]
The total prompt fits in 4K-8K tokens: small for an LLM, large enough to make completions repo-aware.
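The serialization above could be sketched as follows. Everything named here is an assumption for illustration: `FIM_MARKER` is a placeholder (real marker tokens are model-specific), and `count_tokens` is a crude whitespace count standing in for the model's tokenizer. The sketch trims whole lines so the text nearest the cursor survives on both sides.

```python
FIM_MARKER = "<FIM>"  # placeholder; real marker tokens depend on the model


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real system would use the model tokenizer."""
    return len(text.split())


def trim_lines(text: str, budget: int, keep: str) -> str:
    """Keep whole lines within budget. keep='tail' keeps the lines nearest
    the end of text; keep='head' keeps the lines nearest the start."""
    lines = text.splitlines()
    if keep == "tail":
        lines.reverse()
    kept, used = [], 0
    for line in lines:
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    if keep == "tail":
        kept.reverse()
    return "\n".join(kept)


def build_prompt(snippets, before, after,
                 max_snippets=10, before_budget=2000, after_budget=500):
    """Assemble instructions, repo snippets, prefix, marker, suffix."""
    parts = ["# complete the code"]
    parts += snippets[:max_snippets]
    parts.append(trim_lines(before, before_budget, keep="tail"))
    parts.append(FIM_MARKER)
    parts.append(trim_lines(after, after_budget, keep="head"))
    return "\n".join(parts)
```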
Step 6 — Detailed Design#
Latency budget (target 200 ms p50)#
Keystroke → IDE plugin debounce:       50 ms
Local context selection (plugin):      10 ms
Network to provider (warm):            20-50 ms
Server-side context fetch (cached):    10 ms
Prefill (4K tokens):                   30 ms   -- depends heavily on hardware
First decode token:                    20 ms
Stream first chunk back:               10 ms
                                total: ~150-180 ms p50 for the first token
200 ms is achievable only with:
- Per-region inference fleets (no transcontinental hop).
- Prefix-cache reuse across consecutive completions (the user types "for i in ra", then "for i in ran", then "for i in range"; each can reuse the previous prefill).
- A model specifically trained for code completion, much smaller than a general chat model.
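To make the prefix-cache point concrete, the sketch below tracks how many tokens actually need prefill when consecutive requests share a prefix. It is a toy model under stated assumptions: `PrefixCache` is a hypothetical name, it remembers only the most recent request, and characters stand in for tokens. A real server caches KV state per token block rather than a plain list.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], new: Sequence[str]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


class PrefixCache:
    """Remembers the last prompt and reports how much prefill work is new."""

    def __init__(self) -> None:
        self.tokens: list = []

    def prefill(self, tokens: Sequence[str]) -> int:
        """Return how many tokens must be computed for this request."""
        reused = shared_prefix_len(self.tokens, tokens)
        self.tokens = list(tokens)
        return len(tokens) - reused


cache = PrefixCache()
for prompt in ("for i in ra", "for i in ran", "for i in range"):
    computed = cache.prefill(list(prompt))
    print(f"{prompt!r}: {computed} of {len(prompt)} tokens computed")
```

After the first request pays for the whole prompt, each follow-up keystroke only pays for the characters that were added.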
Local-vs-server indexing trade-off#
GitHub Copilot leans heavily local for completion context; for Chat / Workspace features, it ships repo context server-side per session. Cursor (with explicit consent) indexes server-side for full-repo search.
Context selection for completion#
The plugin doesn’t send the whole repo. It picks:
- The current file’s content before and after the cursor.
- Recently edited files (the user was just looking at them).
- Files in the import graph of the current file.
- Files with symbols referenced near the cursor (e.g., if the user is calling MyClass.foo(), send MyClass’s definition).
- A few snippets retrieved by embedding similarity to the cursor’s surrounding code.
This selection happens locally, in single-digit ms. Selecting the right context is the dominant quality driver; a model that sees the wrong context will hallucinate APIs that don’t exist in the project.
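One way to combine the signals listed above is a weighted score followed by a greedy fill under a size budget. This is a sketch only: the `Candidate` fields, the weights, and the line budget are all hypothetical values chosen for the example, and a production plugin would tune them against accept rate.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    text: str
    recently_edited: bool = False
    imported: bool = False               # in the current file's import graph
    defines_nearby_symbol: bool = False  # defines a symbol used near the cursor
    similarity: float = 0.0              # embedding similarity, 0..1


def score(c: Candidate) -> float:
    """Hypothetical weights; higher means more likely to be useful context."""
    return (3.0 * c.defines_nearby_symbol
            + 2.0 * c.imported
            + 1.5 * c.recently_edited
            + 1.0 * c.similarity)


def select_context(candidates, max_lines=500):
    """Greedily take the highest-scoring candidates that fit the budget."""
    chosen, used = [], 0
    for c in sorted(candidates, key=score, reverse=True):
        n = len(c.text.splitlines())
        if used + n > max_lines:
            continue  # too big for the remaining budget; try smaller ones
        chosen.append(c)
        used += n
    return chosen
```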
Fill-in-the-middle (FIM)#
A code-specialized model is trained to complete in the middle of a sequence given prefix and suffix:
<PRE> def add(a, b):
<SUF>     return result
<MID>     result = a + b
The training objective teaches the model to attend to both sides. Inference at completion time sends prefix + suffix; the model fills the gap. This is qualitatively much better than left-to-right completion when the user has incomplete code with a clear ending.
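The example above can be expressed as a pair of small helpers: one that lays out the prompt, and one that splices the model's output back between prefix and suffix. The sentinel strings reuse the names from the example; actual FIM tokens differ between models, so treat them as placeholders.

```python
# Sentinel names taken from the example above; real tokens vary by model.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def format_fim(prefix: str, suffix: str) -> str:
    """Inference-time layout: the model generates what follows MID."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


def make_training_example(doc: str, start: int, end: int) -> str:
    """Cut doc[start:end] out as the middle and move it to the end."""
    return format_fim(doc[:start], doc[end:]) + doc[start:end]


def splice(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the file once the model has produced the middle."""
    return prefix + middle + suffix


prefix = "def add(a, b):\n"
suffix = "    return result\n"
middle = "    result = a + b\n"   # what the model is expected to produce
print(format_fim(prefix, suffix))
print(splice(prefix, middle, suffix))
```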
Streaming + cancellation#
The IDE plugin sends the request; if the user types another character before the response arrives, cancel and resubmit. The server must:
- Honor cancellation fast (interrupt decode, free the KV slot).
- Be cheap to start (no big setup cost per request).
A typical typing session produces 10 inflight-but-cancelled requests for every accepted suggestion. This shapes server-side capacity — concurrent active inflight is high, but most are short-lived.
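On the client side, cancel-and-resubmit can be sketched with `asyncio` tasks. `CompletionSession` and `fake_fetch` are hypothetical names for this example; `fake_fetch` simulates a 50 ms round trip in place of the real network call. The server-side half (interrupting decode and freeing the KV slot) is not shown.

```python
import asyncio


class CompletionSession:
    """Keeps at most one request in flight; a new keystroke cancels the old one."""

    def __init__(self, fetch):
        self._fetch = fetch   # async callable: prompt -> completion text
        self._task = None
        self.cancelled = 0    # requests abandoned before they finished

    async def on_keystroke(self, prompt: str) -> asyncio.Task:
        if self._task is not None and not self._task.done():
            self._task.cancel()
            self.cancelled += 1
        self._task = asyncio.create_task(self._fetch(prompt))
        return self._task


async def fake_fetch(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for network + inference time
    return prompt + "ge(10):"


async def demo():
    session = CompletionSession(fake_fetch)
    first = await session.on_keystroke("for i in ra")
    await asyncio.sleep(0.01)  # user types again before the reply arrives
    second = await session.on_keystroke("for i in ran")
    result = await second
    return first.cancelled(), result, session.cancelled
```

Running `asyncio.run(demo())` shows the first request cancelled and only the second one producing a suggestion.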
Output filtering#
Two filter categories on every completion:
- License: detect verbatim matches to known licensed code corpora (Black Duck-style hashing); suppress the suggestion if matched.
- Secrets: regex over the output for API keys, AWS credentials, etc.; suppress if matched.
Both run in parallel with token streaming; on a hit, the stream is truncated and the user sees nothing.
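A minimal sketch of the two filters running alongside the token stream follows. The assumptions are stated in the code: the secret patterns cover only two well-known shapes (AWS access key IDs and PEM private-key headers), and the license check is a simple per-line hash lookup against a hypothetical `licensed_hashes` set, far cruder than a commercial matcher. Tokens are buffered here so that a hit yields an empty suggestion.

```python
import hashlib
import re

# Illustrative patterns only; production scanners use much larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID shape
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key header
]


def fingerprint(line: str) -> str:
    """Hash a line after collapsing whitespace, so reformatting does not evade it."""
    normalized = " ".join(line.split())
    return hashlib.sha256(normalized.encode()).hexdigest()


def contains_secret(text: str) -> bool:
    return any(p.search(text) for p in SECRET_PATTERNS)


def matches_licensed(text: str, licensed_hashes: set) -> bool:
    return any(fingerprint(l) in licensed_hashes for l in text.splitlines())


def filter_completion(tokens, licensed_hashes: set) -> str:
    """Check after every token; return '' (suppress) on the first hit."""
    text = ""
    for tok in tokens:
        text += tok
        if contains_secret(text) or matches_licensed(text, licensed_hashes):
            return ""
    return text
```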
Chat path#
For chat (multi-turn, longer answers, sometimes producing big code edits):
1. Retrieve repo context as for completion (broader budget: 8K tokens).
2. Apply conversation history (sliding window).
3. Call the larger model with streaming output.
4. Parse output for code-edit blocks; offer to apply them in IDE.
Code-edit blocks are structured (e.g., diff-style with file paths) so the IDE can apply them without ambiguity. The model is fine-tuned to emit this format.
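The text does not specify the edit-block format, so the sketch below invents one purely for illustration: a block opened by `<<<EDIT path`, closed by `>>>`, with `-` lines to remove and `+` lines to add. The point it tries to show is the "without ambiguity" requirement: an edit is applied only when the text to replace occurs exactly once in the file.

```python
import re
from dataclasses import dataclass


@dataclass
class Edit:
    path: str
    old: str
    new: str


# Hypothetical block format, not a real product's wire format.
EDIT_RE = re.compile(r"<<<EDIT (?P<path>\S+)\n(?P<body>.*?)\n>>>", re.DOTALL)


def parse_edits(output: str) -> list:
    """Extract every edit block from the model's chat output."""
    edits = []
    for m in EDIT_RE.finditer(output):
        old, new = [], []
        for line in m.group("body").splitlines():
            if line.startswith("-"):
                old.append(line[1:])
            elif line.startswith("+"):
                new.append(line[1:])
        edits.append(Edit(m.group("path"), "\n".join(old), "\n".join(new)))
    return edits


def apply_edit(content: str, edit: Edit) -> str:
    """Apply only when the target text is present exactly once."""
    if content.count(edit.old) != 1:
        raise ValueError(f"ambiguous or missing target in {edit.path}")
    return content.replace(edit.old, edit.new)
```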
Evaluation harness#
Code completion is hard to evaluate offline:
- Static eval: held-out repos; mask a line, measure exact-match and BLEU. Cheap but doesn’t correlate well with developer experience.
- Compile / lint check: emit completion, run language server / linter; track error rate.
- Online accept rate: A/B test new model versions; measure “did the user accept the suggestion?” as the primary metric.
Online accept rate is the gold standard. Offline metrics are CI gates to prevent obvious regressions.
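Two of these metrics are simple enough to sketch directly. The exact-match function is the offline gate; the two-proportion z-score is a standard way to judge whether a difference in accept rate between two A/B arms is larger than noise. The function names and the sample counts are made up for the example.

```python
import math


def exact_match_rate(predictions, references) -> float:
    """Fraction of masked lines reproduced exactly (ignoring edge whitespace)."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def accept_rate(shown: int, accepted: int) -> float:
    return accepted / shown if shown else 0.0


def two_proportion_z(shown_a, accepted_a, shown_b, accepted_b) -> float:
    """z-score for arm B's accept rate versus arm A's, using a pooled variance."""
    p_a = accept_rate(shown_a, accepted_a)
    p_b = accept_rate(shown_b, accepted_b)
    pooled = (accepted_a + accepted_b) / (shown_a + shown_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    return (p_b - p_a) / se


# Hypothetical counts: control accepts 250 of 1000, candidate 300 of 1000.
z = two_proportion_z(1000, 250, 1000, 300)
print(f"z = {z:.2f}")
```

A |z| above about 1.96 corresponds to the conventional 5% significance level for a two-sided test.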
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: latency under load. A spike in concurrent inflight (everyone typing at once during US morning hours) can overload the inference fleet. Mitigations: autoscale on inflight tokens (not just requests), per-user rate limiting, and graceful degradation to a smaller / cheaper model when load is extreme. Better a worse completion than no completion.
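The two load mitigations can be sketched as a pair of small functions: size the fleet from in-flight tokens rather than request count, and fall back to the smaller model once load crosses a threshold. The threshold, the per-replica token capacity, and the model labels are hypothetical values for illustration.

```python
import math


def desired_replicas(inflight_tokens: int, tokens_per_replica: int) -> int:
    """Autoscaling target driven by in-flight tokens, not request count."""
    return max(1, math.ceil(inflight_tokens / tokens_per_replica))


def choose_model(inflight_tokens: int, capacity_tokens: int,
                 degrade_at: float = 0.8) -> str:
    """Route to the smaller model when the fleet is close to saturation."""
    load = inflight_tokens / capacity_tokens
    return "small" if load >= degrade_at else "primary"
```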
Bottleneck #2: context quality. Almost all “the model gave a bad answer” complaints trace to the wrong context being sent. Investment in context-selection heuristics (and treating it as a tuned subsystem) pays more than larger models.
Bottleneck #3: server-side indexing cost. A 100 MB repo, embedded into 5K chunks at 1024-dim float32, is ~20 MB of vectors. 10 M devs × 5 repos = 50 M repos × 20 MB = 1 PB of vector storage. Tiered: hot repos in fast vector DB; cold repos with lighter / lazy indexing.
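The storage figure is easy to reproduce. This sketch uses the 1024-dimension, float32 numbers from the paragraph above; with the 768-dimension embedding listed in the data model, the same arithmetic gives about 15 MB per repo instead.

```python
def vector_bytes(chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw embedding storage for one repo, excluding index overhead."""
    return chunks * dims * bytes_per_value


REPOS = 10_000_000 * 5                 # 10 M devs x 5 repos
per_repo = vector_bytes(5_000, 1024)   # float32 = 4 bytes per value
total_pb = REPOS * per_repo / 10**15

print(f"per repo: {per_repo / 10**6:.2f} MB; fleet: {total_pb:.3f} PB")
print(f"per repo at 768 dims: {vector_bytes(5_000, 768) / 10**6:.2f} MB")
```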
Alternative I’d push back on: indexing the whole repo server-side for every developer. Privacy concerns aside, the bandwidth and storage cost is unsustainable. The local-first model with selective server uploads is the right shape.
What breaks first at 10× scale: cancelled-request waste. At 5 M concurrent active developers (10× the 500 K peak estimated in Step 2), the rate of cancellations dwarfs accepted completions. Server compute spent on cancelled requests is real loss. Investment in faster cancellation (immediate KV release, lazy prefill) compounds.
Companies this resembles#
GitHub Copilot, Cursor, Sourcegraph Cody, JetBrains AI Assistant, Codeium, Tabnine, Replit Ghostwriter, Amazon CodeWhisperer / Q Developer.
Related systems#
- ChatGPT-style Conversational System — inference primitives for the chat path.
- AI / ML Data Infrastructure — vector store substrate for repo indexes.
- Distributed Search — sometimes used alongside vector for hybrid retrieval (BM25 for symbols, vector for semantics).
- LLM-Powered Customer Support Bot — sibling RAG architecture in a different domain.