Typeahead Suggestion
Real-time prefix matching: trie indexes, ranking, server-side throttling, personalization.
Step 1 — Clarify Requirements#
Functional
- As a user types into a search box, suggest completions after each keystroke.
- Suggestions ranked by popularity, personalization, and recency.
- Update suggestions when new search queries appear in the wild (a new movie release should be suggested within hours).
- Out of scope here: the underlying search index that completed queries hit; spelling correction (assume queries are typed correctly for now).
Non-functional
- 99.99% availability — typeahead being broken degrades the entire search experience.
- p95 suggestion latency under 100 ms (the budget is tight because every keystroke triggers a request).
- 100 K queries/sec peak; each keystroke is a request, so multiply by ~5-10 for actual server load.
- Eventual consistency on the popularity index; new queries can be missing from suggestions for a few minutes.
Step 2 — Capacity Estimation#
- Distinct query corpus: ~10 B distinct queries seen historically, but the head is heavy — top 1 M queries cover ~30% of all searches; top 10 M cover ~60%.
- Index size: top 10 M queries × ~50 B (query text + score + metadata) = 500 MB core trie. Fits in memory.
- Keystroke QPS: if 100 K searches/sec finalize, each search averages ~12 characters and a request per character (with debouncing, ~5) → 500 K-1 M typeahead QPS.
- Latency budget: 100 ms p95 = a hard cap. Network RTT alone is 30-50 ms internationally; the server has ~30 ms to think.
The system is read-mostly with a high-throughput offline indexing pipeline. The hot path is trie traversal in memory.
Step 3 — System Interface#
GET /suggest?q=<prefix>&limit=10&context=<optional context> → { suggestions: [ { text, score, kind: 'query'|'entity'|'completion' }, ... ] }
POST /log_search { user_id, query, result_clicked_id, ts } -- training signalThe endpoint is idempotent and cacheable; CDN can cache for a few seconds per prefix.
Step 4 — High-Level Design#
client (debounce 50ms) → CDN/edge cache ──→ LB ──→ suggest service ──→ trie (in-memory, replicated) │ │ │ └ personalized rerank │ └→ /log_search → Kafka → offline pipeline │ ▼ aggregate query → score │ ▼ build new trie image │ ▼ deploy to suggest service fleetThe interesting moving parts:
- Trie (or DAWG / FST): the data structure that turns a prefix into a list of completions in O(prefix_length).
- Offline indexer: aggregates query logs, ranks queries, builds a fresh trie image hourly.
- Personalization layer: optional reranking on top of the global suggestions.
Step 5 — Data Model#
Trie node#
struct TrieNode: children: map<char, TrieNode> // The K most-popular completions of the prefix ending at this node top_k: list<(score, query_id)> // size <= 10Storing top_k at every node is the key trick — it makes a prefix-lookup O(prefix_length) without traversing the subtree:
suggest("piz"): walk root → 'p' → 'i' → 'z' return node.top_k // already sorted, already top-10Build-time pipeline#
Daily / hourly:1. Aggregate Kafka logs → counts per query (or query x context bucket).2. Apply a decayed-popularity score: score = 0.95^days * count.3. Insert each query into the trie, updating top_k along the path.4. Serialize to a compact disk format (FST / minimized DAWG).5. Roll out: new trie image replicated to all suggest service instances.Online store#
The trie is held in process memory on every suggest service instance. There’s no remote read on the hot path — the entire trie is local. New images are loaded by atomic pointer swap; old image is dereferenced and freed.
Step 6 — Detailed Design#
Why trie over alternatives#
prefix → top_k directly. Conceptually simpler, but storage is O(N × avg_query_len) which blows up; only feasible for short prefixes (say, up to 4-5 chars) with fallback elsewhere. Real systems often combine: a trie / FST for the bulk, and a precomputed prefix → top_k cache for the hottest 100 K prefixes (which absorb most of the QPS).
Personalization#
After the trie returns the global top-K, an online layer reranks:
final_score(suggestion) = w1 * global_score + w2 * personal_history_score(user, suggestion) + w3 * context_score(time, location, recent_session)personal_history_score reads from a per-user feature store (last 100 queries, click history). Cached aggressively because a typing user issues many requests in succession.
Trending and freshness#
New queries (just-launched movies, breaking news terms) must surface within hours of becoming popular. Two paths:
- Hourly trie rebuild picks them up automatically.
- Real-time hot-list overlay: a streaming top-K of “queries gaining velocity in the last 30 minutes” maintained over Kafka, merged into the trie’s top_k results at read time.
The overlay handles minute-scale freshness; the hourly rebuild handles hour-scale.
Latency budget (target p95 100 ms)#
Client debounce: ~50 ms (not counted in server budget)CDN/edge cache hit: 2 msLB + TLS (warm conn): 5 msSuggest service trie walk + top_k: 1 msPersonalized rerank (feature lookup): 15 msSerialize + network back: 30 ms total: ~50-60 ms typical, < 100 ms p95Edge caching for very common prefixes (“a”, “b”, “ne”, “fa”) is where the latency win comes from — those prefixes have a fixed top_k that doesn’t depend on the user.
Internationalization#
Tries become awkward for languages with rich morphology (Finnish, Turkish) and infeasible for ideographic scripts (Chinese, Japanese without pinyin). Variants:
- Pinyin trie for Chinese (user types pinyin, suggest Chinese characters).
- Substring index (FM-index, suffix array) for languages where word stems matter more than prefixes.
- Per-language tries with locale-based routing.
Throttling and abuse#
Without per-IP rate limits, an attacker can issue requests faster than humans type, costing CPU. The suggest service rate-limits by IP and by user (token bucket — see /system-design/rate-limiter). Clients debounce keystrokes by ~50 ms so a typist’s normal pace doesn’t trigger the limit.
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: trie memory. 500 MB is fine on one host; at 100 M unique queries with longer texts, the trie can grow to many GB. Sharded by prefix (a-d, e-h, …) — each shard fits comfortably; query routing decides which shard to hit based on the first character.
Bottleneck #2: rebuild latency. A full rebuild over billions of log events takes ~30 minutes on a large Spark cluster. To get faster freshness, incremental updates (top_k merge from a streaming pipeline) handle the head while the full rebuild keeps the long tail current.
Bottleneck #3: personalization feature reads. Per-keystroke feature lookup compounds quickly. Cache features per-user for 5 minutes — typing a query is fast enough that nothing meaningful changes mid-query.
Alternative I’d push back on: doing typeahead as a real-time query against a full search index (Elasticsearch match_phrase_prefix). Works at low scale, falls over at high QPS — every keystroke is a real search query, costing 10-100× more than a trie lookup. Trie-with-top-K is a 1000× efficiency win for the typeahead-specific case.
What breaks first at 10× scale (10 M QPS): the CDN cache hit rate becomes critical. Even a 90% hit rate leaves 1 M QPS for the suggest service fleet — manageable. Below 90% hit rate, the fleet needs to scale fast.
Companies this resembles#
Google Search autocomplete, Amazon product search, LinkedIn member search, Bing, GitHub repo search. Pattern variants in IDE autocomplete (LSP) and code editors.
Related systems#
- Distributed Search — the substrate for actual query execution after suggestion is selected.
- Distributed Cache — for edge caching of popular prefixes.
- Rate Limiter — per-user / per-IP throttling.
- Quora — uses typeahead heavily for question lookup.