Typeahead Suggestion — System Design

Step 1 — Clarify Requirements#

Functional

As a user types into a search box, suggest completions after each keystroke.
Suggestions ranked by popularity, personalization, and recency.
Update suggestions when new search queries appear in the wild (a new movie release should be suggested within hours).
Out of scope here: the underlying search index that completed queries hit; spelling correction (assume queries are typed correctly for now).

Non-functional

99.99% availability — typeahead being broken degrades the entire search experience.
p95 suggestion latency under 100 ms (the budget is tight because every keystroke triggers a request).
100 K queries/sec peak; each keystroke is a request, so multiply by ~5-10 for actual server load.
Eventual consistency on the popularity index; new queries can be missing from suggestions for a few minutes.

Step 2 — Capacity Estimation#

Distinct query corpus: ~10 B distinct queries seen historically, but the head is heavy — top 1 M queries cover ~30% of all searches; top 10 M cover ~60%.
Index size: top 10 M queries × ~50 B (query text + score + metadata) = 500 MB core trie. Fits in memory.
Keystroke QPS: if 100 K searches/sec finalize, each search averages ~12 characters and a request per character (with debouncing, ~5) → 500 K-1 M typeahead QPS.
Latency budget: 100 ms p95 = a hard cap. Network RTT alone is 30-50 ms internationally; the server has ~30 ms to think.

The system is read-mostly with a high-throughput offline indexing pipeline. The hot path is trie traversal in memory.

Step 3 — System Interface#

GET /suggest?q=<prefix>&limit=10&context=<optional context>
    → { suggestions: [
          { text, score, kind: 'query'|'entity'|'completion' },
          ...
        ] }

POST /log_search
    { user_id, query, result_clicked_id, ts }    -- training signal

The endpoint is idempotent and cacheable; CDN can cache for a few seconds per prefix.

Step 4 — High-Level Design#

client (debounce 50ms) → CDN/edge cache ──→ LB ──→ suggest service ──→ trie (in-memory, replicated)
                                                       │                        │
                                                       │                        └ personalized rerank
                                                       │
                                                       └→ /log_search → Kafka → offline pipeline
                                                                                 │
                                                                                 ▼
                                                                          aggregate query → score
                                                                                 │
                                                                                 ▼
                                                                          build new trie image
                                                                                 │
                                                                                 ▼
                                                                          deploy to suggest service fleet

The interesting moving parts:

Trie (or DAWG / FST): the data structure that turns a prefix into a list of completions in O(prefix_length).
Offline indexer: aggregates query logs, ranks queries, builds a fresh trie image hourly.
Personalization layer: optional reranking on top of the global suggestions.

Step 5 — Data Model#

Trie node#

struct TrieNode:
  children: map<char, TrieNode>
  // The K most-popular completions of the prefix ending at this node
  top_k: list<(score, query_id)>     // size <= 10

Storing top_k at every node is the key trick — it makes a prefix-lookup O(prefix_length) without traversing the subtree:

suggest("piz"):
   walk root → 'p' → 'i' → 'z'
   return node.top_k    // already sorted, already top-10

Build-time pipeline#

Daily / hourly:
1. Aggregate Kafka logs → counts per query (or query x context bucket).
2. Apply a decayed-popularity score: score = 0.95^days * count.
3. Insert each query into the trie, updating top_k along the path.
4. Serialize to a compact disk format (FST / minimized DAWG).
5. Roll out: new trie image replicated to all suggest service instances.

Online store#

The trie is held in process memory on every suggest service instance. There’s no remote read on the hot path — the entire trie is local. New images are loaded by atomic pointer swap; old image is dereferenced and freed.

Step 6 — Detailed Design#

Why trie over alternatives#

Trie / DAWG — pre-stored top-K at each node; lookup is one path walk. Memory cost: ~10× the raw query text in a naive trie; DAWG (deduplicating common suffixes) brings it down to 2-3×.

Sorted KV store keyed by prefix — store prefix → top_k directly. Conceptually simpler, but storage is O(N × avg_query_len) which blows up; only feasible for short prefixes (say, up to 4-5 chars) with fallback elsewhere.

Real systems often combine: a trie / FST for the bulk, and a precomputed prefix → top_k cache for the hottest 100 K prefixes (which absorb most of the QPS).

Personalization#

After the trie returns the global top-K, an online layer reranks:

final_score(suggestion) = w1 * global_score
                        + w2 * personal_history_score(user, suggestion)
                        + w3 * context_score(time, location, recent_session)

personal_history_score reads from a per-user feature store (last 100 queries, click history). Cached aggressively because a typing user issues many requests in succession.

New queries (just-launched movies, breaking news terms) must surface within hours of becoming popular. Two paths:

Hourly trie rebuild picks them up automatically.
Real-time hot-list overlay: a streaming top-K of “queries gaining velocity in the last 30 minutes” maintained over Kafka, merged into the trie’s top_k results at read time.

The overlay handles minute-scale freshness; the hourly rebuild handles hour-scale.

Latency budget (target p95 100 ms)#

Client debounce:                         ~50 ms (not counted in server budget)
CDN/edge cache hit:                       2 ms
LB + TLS (warm conn):                     5 ms
Suggest service trie walk + top_k:        1 ms
Personalized rerank (feature lookup):    15 ms
Serialize + network back:                30 ms
                              total:    ~50-60 ms typical, < 100 ms p95

Edge caching for very common prefixes (“a”, “b”, “ne”, “fa”) is where the latency win comes from — those prefixes have a fixed top_k that doesn’t depend on the user.

Internationalization#

Tries become awkward for languages with rich morphology (Finnish, Turkish) and infeasible for ideographic scripts (Chinese, Japanese without pinyin). Variants:

Pinyin trie for Chinese (user types pinyin, suggest Chinese characters).
Substring index (FM-index, suffix array) for languages where word stems matter more than prefixes.
Per-language tries with locale-based routing.

Throttling and abuse#

Without per-IP rate limits, an attacker can issue requests faster than humans type, costing CPU. The suggest service rate-limits by IP and by user (token bucket — see /system-design/rate-limiter). Clients debounce keystrokes by ~50 ms so a typist’s normal pace doesn’t trigger the limit.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: trie memory. 500 MB is fine on one host; at 100 M unique queries with longer texts, the trie can grow to many GB. Sharded by prefix (a-d, e-h, …) — each shard fits comfortably; query routing decides which shard to hit based on the first character.

Bottleneck #2: rebuild latency. A full rebuild over billions of log events takes ~30 minutes on a large Spark cluster. To get faster freshness, incremental updates (top_k merge from a streaming pipeline) handle the head while the full rebuild keeps the long tail current.

Bottleneck #3: personalization feature reads. Per-keystroke feature lookup compounds quickly. Cache features per-user for 5 minutes — typing a query is fast enough that nothing meaningful changes mid-query.

Alternative I’d push back on: doing typeahead as a real-time query against a full search index (Elasticsearch match_phrase_prefix). Works at low scale, falls over at high QPS — every keystroke is a real search query, costing 10-100× more than a trie lookup. Trie-with-top-K is a 1000× efficiency win for the typeahead-specific case.

What breaks first at 10× scale (10 M QPS): the CDN cache hit rate becomes critical. Even a 90% hit rate leaves 1 M QPS for the suggest service fleet — manageable. Below 90% hit rate, the fleet needs to scale fast.

Companies this resembles#

Google Search autocomplete, Amazon product search, LinkedIn member search, Bing, GitHub repo search. Pattern variants in IDE autocomplete (LSP) and code editors.

Distributed Search — the substrate for actual query execution after suggestion is selected.
Distributed Cache — for edge caching of popular prefixes.
Rate Limiter — per-user / per-IP throttling.
Quora — uses typeahead heavily for question lookup.