Distributed Search

Inverted indexes, sharded indexing, replication, query fan-out, ranking pipelines.

Building Block Advanced
7 min read
search inverted-index ranking
Companies this resembles: Elasticsearch · OpenSearch · Solr · Algolia · Vespa

Use cases#

A distributed search engine answers “find me documents matching this query” over very large corpora, ranked by relevance. Different from a database because:

  • The query language is text (tokens, phrases, fuzzy match), not SQL predicates.
  • Results are ranked, not just filtered.
  • The index data structure is an inverted index, not a B-tree.

Canonical use cases:

  • Product / catalog search — Shopify, Amazon, Etsy. The user types “blue suede running shoes size 11”; the engine returns ranked products.
  • Log search — Splunk, Elasticsearch / OpenSearch indexing terabytes/day of structured log lines.
  • Full-text on documents — Wikipedia, Confluence, Google Docs.
  • Geo + filter search — Yelp, Airbnb (find nearby Italian restaurants open after 9pm, > 4 stars).
  • Vector / semantic search — embedding-based retrieval for RAG and image similarity, increasingly merged into the same engines.

Functional requirements#

  • Ingest documents with structured fields (text, keyword, numeric, geo) at high throughput.
  • Build and maintain a searchable index near-real-time (NRT — sub-second from write to search).
  • Execute queries with text matching, filtering, sorting, aggregation, and ranking.
  • Return paginated, ranked results with highlighting.
  • Optional: facets, suggestions, semantic / vector search, geo-distance, cross-cluster federation.

Non-functional requirements#

  • Query latency: p99 under 100 ms for the typical 10-100M document corpus; sub-second on multi-billion-doc indexes.
  • Indexing latency: NRT — under 1 s from ingest to searchable.
  • Throughput: thousands of queries/sec per replica; tens to hundreds of thousands of indexes/sec per shard.
  • Recall: “did we find everything we should have?” — fundamental to user satisfaction.
  • Precision: “are the top 10 results all relevant?” — the user-perceived metric.

High-level design#

documents query
│ │
▼ ▼
ingest pipeline query coordinator
(analyze, tokenize, enrich) (fan out to all shards)
│ │
▼ ┌─────────┼────────┐
index router ──hash──> ▼ ▼ ▼
(key, shard) shard 0 shard 1 shard 2
│ │ │
primary + primary + primary +
N replicas replicas replicas
│ │ │
└─────────┼────────┘
merge + re-rank
top-K → user

The two flows — index and query — touch every shard. Indexing routes by document ID (or routing key). Queries fan out to all shards because any shard might hold matching documents; results are merged at the coordinator.

Detailed design#

Inverted index#

For each token (word stem, n-gram), store the list of document IDs containing it:

"cat" → [doc_3, doc_17, doc_42, doc_109, ...] ← postings list
"dog" → [doc_2, doc_42, doc_88, ...]
"running" → [doc_5, doc_19, doc_42, ...]

Query "cat AND dog" intersects the two postings lists. Postings are stored sorted and compressed with delta encoding + variable-byte (or FOR/PFOR) coding, giving ~1-3 bytes per docID for typical distributions.

Each posting can also carry positions (for phrase queries) and term frequencies (for ranking).

Analyzers#

Raw text → tokens via a chain:

"Running shoes!"
→ tokenizer (whitespace, ICU) → ["Running", "shoes!"]
→ lowercase filter → ["running", "shoes!"]
→ punctuation strip → ["running", "shoes"]
→ stop-word filter (the, a, of, ...) → ["running", "shoes"]
→ stemmer (Porter, Snowball) → ["run", "shoe"]
→ synonym filter (sneaker→shoe) → ["run", "shoe"]

The same chain is applied to queries — so “shoes” at query time also stems to “shoe” and matches.

Per-language pipelines are critical. English snowball stemming is wrong for Japanese (use kuromoji); Chinese needs n-gram or dictionary segmentation; etc.

Sharding and replication#

Documents are routed by hash(routing_key) % num_shards (default routing_key = doc_id).
Each shard has 1 primary + N replicas. Primary writes; replicas serve reads.
A search query fans out to ALL shards (1 replica per shard, picked by load).
Each shard returns its top K; coordinator merges into global top K.

Cost: query latency = slowest shard. Tail latency at fan-out grows with shard count — the classic “tail at scale” problem (Dean & Barroso, 2013).

Near-real-time indexing#

Writes go to an in-memory buffer. On refresh (default 1 s), the buffer is flushed to a small immutable segment on disk and becomes searchable. Periodically, small segments are merged into larger ones — the same LSM pattern as Cassandra and RocksDB.

This is why NRT search adds ~1 s of latency from ingest to visibility. The refresh interval is tunable — 30 s drops indexing pressure significantly on write-heavy workloads.

Ranking#

The classic relevance score is BM25: a refinement of TF-IDF that handles document length. For each term:

score = IDF(term) * (tf * (k+1)) / (tf + k * (1 - b + b * |doc|/avg_doc_len))

Multi-term queries sum per-term scores. The result is a relevance-ordered list.

Production systems run a multi-stage pipeline:

Stage 1 (retrieval) — BM25 / lexical match returns top 1000 from the index. Fast.
Stage 2 (re-ranking) — a more expensive model (gradient-boosted trees on engagement
features, learning-to-rank, neural cross-encoders) re-ranks
the top 1000 down to top 10. Slow, but operating on 1000 docs
not 10M.
Stage 3 (business rules) — promote sponsored results, boost recent items, demote known-
bad URLs. Often hard-coded.

This is how Google, Amazon, Algolia all balance recall (find everything) with precision (top 10 are great).

Modern engines (Elasticsearch 8+, OpenSearch, Vespa, dedicated stores like Pinecone, Weaviate, Qdrant) index embeddings as well as text. Documents become vectors via an embedding model; queries are embedded and the nearest vectors are returned via HNSW (Hierarchical Navigable Small World) or IVF-PQ indexes.

Often combined with BM25 via hybrid search: take top 100 from each, re-rank by a weighted blend. Critical for RAG pipelines: BM25 finds exact-term matches (product SKUs, error codes), vectors find semantic matches.

Faceting and aggregation#

Faceted search returns counts alongside results: “248 results, 80 in size:11, 120 in color:blue”. Implemented with column-store-like data structures (Lucene’s DocValues). Aggregations (avg, percentile, terms, histogram) leverage the same column store and power Kibana / OpenSearch Dashboards.

Trade-offs#

Generic engine (Elasticsearch, OpenSearch, Vespa) — flexible, runs many workloads in one cluster (logs, products, geo). Operational complexity grows with cluster size; mixed workloads compete for resources.
Specialized hosted (Algolia, Typesense) — tuned for one job (catalog search), ms-level latency, less knobs but higher quality defaults. Costs more per QPS; less flexibility for analytics or vector hybrid.

Other axes:

  • Lexical vs vector vs hybrid — lexical excels at exact matches and rare terms; vector excels at semantic similarity; hybrid covers both. For RAG, hybrid is the modern default.
  • Refresh interval — short refresh (~1 s) for NRT search; long (~30 s) for log indexing throughput.
  • Cross-shard queries — query latency is bounded by the slowest shard. Use replicas + load-aware routing; consider speculative requests for low-tail-latency needs.

Real-world examples#

  • Elasticsearch — the dominant generic search engine. Powers GitHub search, Uber’s logs, Wikipedia’s search, Cloudflare’s analytics.
  • OpenSearch — AWS’s fork of Elasticsearch (post-license-change). Drop-in compatible.
  • Algolia — fully-managed, sub-50 ms global p99 for product / docs search. Stack Overflow, Hacker News, Twitch.
  • Vespa (Yahoo, now open source) — engine behind Yahoo Mail, Spotify recommendations, real-time + structured + vector hybrid.
  • Solr — Lucene-based, predates Elasticsearch; still in heavy use at Bloomberg, Bestbuy.
  • Pinecone, Weaviate, Qdrant — vector-first stores for embeddings; integrate with LLM pipelines.
  • Meilisearch, Typesense — typo-tolerant, fast indexing, designed for instant-search UI on mid-sized corpora.
  • Databases — primary store; the search index is a derived view fed by CDC or app-level dual-writes.
  • Distributed Cache — caches popular queries to avoid hitting the search cluster.
  • Distributed Logging — log search is the dominant Elasticsearch workload.
  • Pub-Sub — the transport for change events that update the search index.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.