Distributed Search — System Design

Use cases#

A distributed search engine answers “find me documents matching this query” over very large corpora, ranked by relevance. Different from a database because:

The query language is text (tokens, phrases, fuzzy match), not SQL predicates.
Results are ranked, not just filtered.
The index data structure is an inverted index, not a B-tree.

Canonical use cases:

Product / catalog search — Shopify, Amazon, Etsy. The user types “blue suede running shoes size 11”; the engine returns ranked products.
Log search — Splunk, Elasticsearch / OpenSearch indexing terabytes/day of structured log lines.
Full-text on documents — Wikipedia, Confluence, Google Docs.
Geo + filter search — Yelp, Airbnb (find nearby Italian restaurants open after 9pm, > 4 stars).
Vector / semantic search — embedding-based retrieval for RAG and image similarity, increasingly merged into the same engines.

Functional requirements#

Ingest documents with structured fields (text, keyword, numeric, geo) at high throughput.
Build and maintain a searchable index near-real-time (NRT — sub-second from write to search).
Execute queries with text matching, filtering, sorting, aggregation, and ranking.
Return paginated, ranked results with highlighting.
Optional: facets, suggestions, semantic / vector search, geo-distance, cross-cluster federation.

Non-functional requirements#

Query latency: p99 under 100 ms for the typical 10-100M document corpus; sub-second on multi-billion-doc indexes.
Indexing latency: NRT — under 1 s from ingest to searchable.
Throughput: thousands of queries/sec per replica; tens to hundreds of thousands of indexes/sec per shard.
Recall: “did we find everything we should have?” — fundamental to user satisfaction.
Precision: “are the top 10 results all relevant?” — the user-perceived metric.

High-level design#

        documents                              query
            │                                    │
            ▼                                    ▼
   ingest pipeline                       query coordinator
   (analyze, tokenize, enrich)           (fan out to all shards)
            │                                    │
            ▼                          ┌─────────┼────────┐
       index router  ──hash──>         ▼         ▼        ▼
       (key, shard)                 shard 0   shard 1  shard 2
                                       │         │        │
                                  primary +   primary +  primary +
                                  N replicas  replicas   replicas
                                       │         │        │
                                       └─────────┼────────┘
                                                 ▼
                                            merge + re-rank
                                                 │
                                            top-K → user

The two flows — index and query — touch every shard. Indexing routes by document ID (or routing key). Queries fan out to all shards because any shard might hold matching documents; results are merged at the coordinator.

Detailed design#

Inverted index#

For each token (word stem, n-gram), store the list of document IDs containing it:

"cat"     → [doc_3, doc_17, doc_42, doc_109, ...]      ← postings list
"dog"     → [doc_2,  doc_42, doc_88,  ...]
"running" → [doc_5,  doc_19, doc_42,  ...]

Query "cat AND dog" intersects the two postings lists. Postings are stored sorted and compressed with delta encoding + variable-byte (or FOR/PFOR) coding, giving ~1-3 bytes per docID for typical distributions.

Each posting can also carry positions (for phrase queries) and term frequencies (for ranking).

Analyzers#

Raw text → tokens via a chain:

"Running shoes!"
  → tokenizer (whitespace, ICU)         → ["Running", "shoes!"]
  → lowercase filter                    → ["running", "shoes!"]
  → punctuation strip                   → ["running", "shoes"]
  → stop-word filter (the, a, of, ...)  → ["running", "shoes"]
  → stemmer (Porter, Snowball)          → ["run", "shoe"]
  → synonym filter (sneaker→shoe)       → ["run", "shoe"]

The same chain is applied to queries — so “shoes” at query time also stems to “shoe” and matches.

Per-language pipelines are critical. English snowball stemming is wrong for Japanese (use kuromoji); Chinese needs n-gram or dictionary segmentation; etc.

Sharding and replication#

Documents are routed by hash(routing_key) % num_shards (default routing_key = doc_id).
Each shard has 1 primary + N replicas. Primary writes; replicas serve reads.

A search query fans out to ALL shards (1 replica per shard, picked by load).
Each shard returns its top K; coordinator merges into global top K.

Cost: query latency = slowest shard. Tail latency at fan-out grows with shard count — the classic “tail at scale” problem (Dean & Barroso, 2013).

Near-real-time indexing#

Writes go to an in-memory buffer. On refresh (default 1 s), the buffer is flushed to a small immutable segment on disk and becomes searchable. Periodically, small segments are merged into larger ones — the same LSM pattern as Cassandra and RocksDB.

This is why NRT search adds ~1 s of latency from ingest to visibility. The refresh interval is tunable — 30 s drops indexing pressure significantly on write-heavy workloads.

Ranking#

The classic relevance score is BM25: a refinement of TF-IDF that handles document length. For each term:

score = IDF(term) * (tf * (k+1)) / (tf + k * (1 - b + b * |doc|/avg_doc_len))

Multi-term queries sum per-term scores. The result is a relevance-ordered list.

Production systems run a multi-stage pipeline:

Stage 1 (retrieval)   — BM25 / lexical match returns top 1000 from the index. Fast.
Stage 2 (re-ranking)  — a more expensive model (gradient-boosted trees on engagement
                        features, learning-to-rank, neural cross-encoders) re-ranks
                        the top 1000 down to top 10. Slow, but operating on 1000 docs
                        not 10M.
Stage 3 (business rules) — promote sponsored results, boost recent items, demote known-
                           bad URLs. Often hard-coded.

This is how Google, Amazon, Algolia all balance recall (find everything) with precision (top 10 are great).

Vector / semantic search#

Modern engines (Elasticsearch 8+, OpenSearch, Vespa, dedicated stores like Pinecone, Weaviate, Qdrant) index embeddings as well as text. Documents become vectors via an embedding model; queries are embedded and the nearest vectors are returned via HNSW (Hierarchical Navigable Small World) or IVF-PQ indexes.

Often combined with BM25 via hybrid search: take top 100 from each, re-rank by a weighted blend. Critical for RAG pipelines: BM25 finds exact-term matches (product SKUs, error codes), vectors find semantic matches.

Faceting and aggregation#

Faceted search returns counts alongside results: “248 results, 80 in size:11, 120 in color:blue”. Implemented with column-store-like data structures (Lucene’s DocValues). Aggregations (avg, percentile, terms, histogram) leverage the same column store and power Kibana / OpenSearch Dashboards.

Trade-offs#

Generic engine (Elasticsearch, OpenSearch, Vespa) — flexible, runs many workloads in one cluster (logs, products, geo). Operational complexity grows with cluster size; mixed workloads compete for resources.

Specialized hosted (Algolia, Typesense) — tuned for one job (catalog search), ms-level latency, less knobs but higher quality defaults. Costs more per QPS; less flexibility for analytics or vector hybrid.

Other axes:

Lexical vs vector vs hybrid — lexical excels at exact matches and rare terms; vector excels at semantic similarity; hybrid covers both. For RAG, hybrid is the modern default.
Refresh interval — short refresh (~1 s) for NRT search; long (~30 s) for log indexing throughput.
Cross-shard queries — query latency is bounded by the slowest shard. Use replicas + load-aware routing; consider speculative requests for low-tail-latency needs.

Real-world examples#

Elasticsearch — the dominant generic search engine. Powers GitHub search, Uber’s logs, Wikipedia’s search, Cloudflare’s analytics.
OpenSearch — AWS’s fork of Elasticsearch (post-license-change). Drop-in compatible.
Algolia — fully-managed, sub-50 ms global p99 for product / docs search. Stack Overflow, Hacker News, Twitch.
Vespa (Yahoo, now open source) — engine behind Yahoo Mail, Spotify recommendations, real-time + structured + vector hybrid.
Solr — Lucene-based, predates Elasticsearch; still in heavy use at Bloomberg, Bestbuy.
Pinecone, Weaviate, Qdrant — vector-first stores for embeddings; integrate with LLM pipelines.
Meilisearch, Typesense — typo-tolerant, fast indexing, designed for instant-search UI on mid-sized corpora.

Databases — primary store; the search index is a derived view fed by CDC or app-level dual-writes.
Distributed Cache — caches popular queries to avoid hitting the search cluster.
Distributed Logging — log search is the dominant Elasticsearch workload.
Pub-Sub — the transport for change events that update the search index.

Use cases#

Functional requirements#

Non-functional requirements#

High-level design#

Detailed design#

Inverted index#

Analyzers#

Sharding and replication#

Near-real-time indexing#

Ranking#

Vector / semantic search#

Faceting and aggregation#

Trade-offs#

Real-world examples#

Related building blocks#