Distributed Search
Inverted indexes, sharded indexing, replication, query fan-out, ranking pipelines.
Use cases#
A distributed search engine answers “find me documents matching this query” over very large corpora, ranked by relevance. Different from a database because:
- The query language is text (tokens, phrases, fuzzy match), not SQL predicates.
- Results are ranked, not just filtered.
- The index data structure is an inverted index, not a B-tree.
Canonical use cases:
- Product / catalog search — Shopify, Amazon, Etsy. The user types “blue suede running shoes size 11”; the engine returns ranked products.
- Log search — Splunk, Elasticsearch / OpenSearch indexing terabytes/day of structured log lines.
- Full-text on documents — Wikipedia, Confluence, Google Docs.
- Geo + filter search — Yelp, Airbnb (find nearby Italian restaurants open after 9pm, > 4 stars).
- Vector / semantic search — embedding-based retrieval for RAG and image similarity, increasingly merged into the same engines.
Functional requirements#
- Ingest documents with structured fields (text, keyword, numeric, geo) at high throughput.
- Build and maintain a searchable index near-real-time (NRT — sub-second from write to search).
- Execute queries with text matching, filtering, sorting, aggregation, and ranking.
- Return paginated, ranked results with highlighting.
- Optional: facets, suggestions, semantic / vector search, geo-distance, cross-cluster federation.
Non-functional requirements#
- Query latency: p99 under 100 ms for the typical 10-100M document corpus; sub-second on multi-billion-doc indexes.
- Indexing latency: NRT — under 1 s from ingest to searchable.
- Throughput: thousands of queries/sec per replica; tens to hundreds of thousands of indexes/sec per shard.
- Recall: “did we find everything we should have?” — fundamental to user satisfaction.
- Precision: “are the top 10 results all relevant?” — the user-perceived metric.
High-level design#
documents query │ │ ▼ ▼ ingest pipeline query coordinator (analyze, tokenize, enrich) (fan out to all shards) │ │ ▼ ┌─────────┼────────┐ index router ──hash──> ▼ ▼ ▼ (key, shard) shard 0 shard 1 shard 2 │ │ │ primary + primary + primary + N replicas replicas replicas │ │ │ └─────────┼────────┘ ▼ merge + re-rank │ top-K → userThe two flows — index and query — touch every shard. Indexing routes by document ID (or routing key). Queries fan out to all shards because any shard might hold matching documents; results are merged at the coordinator.
Detailed design#
Inverted index#
For each token (word stem, n-gram), store the list of document IDs containing it:
"cat" → [doc_3, doc_17, doc_42, doc_109, ...] ← postings list"dog" → [doc_2, doc_42, doc_88, ...]"running" → [doc_5, doc_19, doc_42, ...]Query "cat AND dog" intersects the two postings lists. Postings are stored sorted and compressed with delta encoding + variable-byte (or FOR/PFOR) coding, giving ~1-3 bytes per docID for typical distributions.
Each posting can also carry positions (for phrase queries) and term frequencies (for ranking).
Analyzers#
Raw text → tokens via a chain:
"Running shoes!" → tokenizer (whitespace, ICU) → ["Running", "shoes!"] → lowercase filter → ["running", "shoes!"] → punctuation strip → ["running", "shoes"] → stop-word filter (the, a, of, ...) → ["running", "shoes"] → stemmer (Porter, Snowball) → ["run", "shoe"] → synonym filter (sneaker→shoe) → ["run", "shoe"]The same chain is applied to queries — so “shoes” at query time also stems to “shoe” and matches.
Per-language pipelines are critical. English snowball stemming is wrong for Japanese (use kuromoji); Chinese needs n-gram or dictionary segmentation; etc.
Sharding and replication#
Documents are routed by hash(routing_key) % num_shards (default routing_key = doc_id).Each shard has 1 primary + N replicas. Primary writes; replicas serve reads.
A search query fans out to ALL shards (1 replica per shard, picked by load).Each shard returns its top K; coordinator merges into global top K.Cost: query latency = slowest shard. Tail latency at fan-out grows with shard count — the classic “tail at scale” problem (Dean & Barroso, 2013).
Near-real-time indexing#
Writes go to an in-memory buffer. On refresh (default 1 s), the buffer is flushed to a small immutable segment on disk and becomes searchable. Periodically, small segments are merged into larger ones — the same LSM pattern as Cassandra and RocksDB.
This is why NRT search adds ~1 s of latency from ingest to visibility. The refresh interval is tunable — 30 s drops indexing pressure significantly on write-heavy workloads.
Ranking#
The classic relevance score is BM25: a refinement of TF-IDF that handles document length. For each term:
score = IDF(term) * (tf * (k+1)) / (tf + k * (1 - b + b * |doc|/avg_doc_len))Multi-term queries sum per-term scores. The result is a relevance-ordered list.
Production systems run a multi-stage pipeline:
Stage 1 (retrieval) — BM25 / lexical match returns top 1000 from the index. Fast.Stage 2 (re-ranking) — a more expensive model (gradient-boosted trees on engagement features, learning-to-rank, neural cross-encoders) re-ranks the top 1000 down to top 10. Slow, but operating on 1000 docs not 10M.Stage 3 (business rules) — promote sponsored results, boost recent items, demote known- bad URLs. Often hard-coded.This is how Google, Amazon, Algolia all balance recall (find everything) with precision (top 10 are great).
Vector / semantic search#
Modern engines (Elasticsearch 8+, OpenSearch, Vespa, dedicated stores like Pinecone, Weaviate, Qdrant) index embeddings as well as text. Documents become vectors via an embedding model; queries are embedded and the nearest vectors are returned via HNSW (Hierarchical Navigable Small World) or IVF-PQ indexes.
Often combined with BM25 via hybrid search: take top 100 from each, re-rank by a weighted blend. Critical for RAG pipelines: BM25 finds exact-term matches (product SKUs, error codes), vectors find semantic matches.
Faceting and aggregation#
Faceted search returns counts alongside results: “248 results, 80 in size:11, 120 in color:blue”. Implemented with column-store-like data structures (Lucene’s DocValues). Aggregations (avg, percentile, terms, histogram) leverage the same column store and power Kibana / OpenSearch Dashboards.
Trade-offs#
Other axes:
- Lexical vs vector vs hybrid — lexical excels at exact matches and rare terms; vector excels at semantic similarity; hybrid covers both. For RAG, hybrid is the modern default.
- Refresh interval — short refresh (~1 s) for NRT search; long (~30 s) for log indexing throughput.
- Cross-shard queries — query latency is bounded by the slowest shard. Use replicas + load-aware routing; consider speculative requests for low-tail-latency needs.
Real-world examples#
- Elasticsearch — the dominant generic search engine. Powers GitHub search, Uber’s logs, Wikipedia’s search, Cloudflare’s analytics.
- OpenSearch — AWS’s fork of Elasticsearch (post-license-change). Drop-in compatible.
- Algolia — fully-managed, sub-50 ms global p99 for product / docs search. Stack Overflow, Hacker News, Twitch.
- Vespa (Yahoo, now open source) — engine behind Yahoo Mail, Spotify recommendations, real-time + structured + vector hybrid.
- Solr — Lucene-based, predates Elasticsearch; still in heavy use at Bloomberg, Bestbuy.
- Pinecone, Weaviate, Qdrant — vector-first stores for embeddings; integrate with LLM pipelines.
- Meilisearch, Typesense — typo-tolerant, fast indexing, designed for instant-search UI on mid-sized corpora.
Related building blocks#
- Databases — primary store; the search index is a derived view fed by CDC or app-level dual-writes.
- Distributed Cache — caches popular queries to avoid hitting the search cluster.
- Distributed Logging — log search is the dominant Elasticsearch workload.
- Pub-Sub — the transport for change events that update the search index.