Proximity Service (Yelp)

Geo-indexing, dynamic segments, search-within-radius queries at city scale.

System Intermediate
7 min read
geo search indexing ranking
Companies this resembles: Yelp · Google Maps Places · Foursquare · TripAdvisor

Step 1 — Clarify Requirements#

Functional

  • Given a location and radius (or a viewport), return nearby businesses matching a query (coffee, dentist, vegan ramen).
  • Filter by category, rating, open-now, price tier.
  • Sort by distance, rating, or relevance.
  • Business profile pages (name, hours, photos, reviews) — but the read path for those is uninteresting and out of scope.
  • Out of scope: write path for reviews, photo upload, business onboarding.

Non-functional

  • 99.9% availability.
  • p99 search latency under 300 ms.
  • 100 M MAU, ~10 K searches/sec peak.
  • ~50 M businesses globally.
  • Updates to business data are infrequent (hours, address); the index can refresh nightly.

Step 2 — Capacity Estimation#

  • Businesses indexed: 50 M.
  • Search QPS: 10 K/sec peak, ~3 K/sec average.
  • Per-search work: query the geo index for businesses within radius, then rank top 20. With ~100 candidates per typical urban query, ranking cost is small.
  • Storage: business metadata averages 5 KB (name, address, hours, categories, summary). Photos and full reviews are out of band (blob store). Total: 50 M × 5 KB = 250 GB of structured data. Tiny — fits comfortably on a single shard, but partitioning by region helps with locality.
  • Geo index size: lat/lng + business_id + category bitset per row ≈ 40 B; 50 M × 40 B = 2 GB. In-memory across the fleet.

The system is read-heavy and modest in scale. The interesting parts are the spatial query, the relevance ranking, and the freshness story.

Step 3 — System Interface#

GET /search
?q=<query> (e.g., "coffee", optional)
&lat=<float>&lng=<float>
&radius_m=<int> (or &viewport=...)
&category=<id> (optional)
&open_now=<bool>
&min_rating=<float>
&sort=<distance|rating|relevance>
&limit=20
&cursor=<opaque>
→ { results: [...], next_cursor }
GET /businesses/:id (full profile)

The search response includes only enough for a list (name, photo URL, rating, distance); detail is fetched on tap.

Step 4 — High-Level Design#

┌── photo + review CDN
client → LB → API ──┬── /search ──→ search service ──→ geo index (sharded by region)
│ │ │
│ ▼ ▼
│ ranker / scorer business store (Postgres / KV)
│ │
│ └── reads metadata for candidates
└── /businesses/:id ──→ business store

The geo index and the business store can be one and the same (a Postgres database with PostGIS, partitioned by region), or split (an Elasticsearch / OpenSearch geo index + a relational store for the truth). Both shapes work at this scale; the trade-off is below.

Step 5 — Data Model#

Businesses:

table businesses
business_id uuid PK
name string
geo point // lat, lng
geohash string // for non-PostGIS implementations
categories array<string>
rating decimal
review_count int
hours jsonb // for "open now" filter
metadata jsonb

Geo index — several viable approaches:

PostGIS GiST index on geo (point) — proper R-tree / GiST, accurate distances, native to Postgres. Best when the business store and geo index are co-located. Scales to 50 M rows easily.
Geohash-based string index — convert lat/lng to a string like 9q8yyk, index as B-tree. Cheap in any KV store. Querying “within radius” requires querying multiple neighboring geohashes — coarse but very simple.

A third option, often better at scale: quadtree or H3 hex indexes. Each business lives in an H3 cell at resolution ~8 (~0.7 km²); a search expands outward, ring by ring, until it has enough candidates.

Step 6 — Detailed Design#

The geo query#

Most “search within radius” queries are answered as “search within bounding box, then post-filter”:

1. Compute bbox: (lat ± dlat, lng ± dlng) where dlat, dlng approximate the radius.
2. SELECT * FROM businesses WHERE
ST_DWithin(geo, ST_Point(lat,lng)::geography, radius)
AND categories @> [category]
AND CASE WHEN open_now THEN check_hours(hours, now())
LIMIT 200;
3. Rank the 200 candidates by ranker.
4. Return top 20.

PostGIS uses the GiST index to prune the bbox before the precise ST_DWithin filter. For 50 M rows partitioned by region, each query touches one partition and returns in 5-20 ms.

Why H3 / quadtree for huge cities#

Manhattan has ~100 K businesses in 60 km². A naive radius query returns hundreds of thousands of candidates. H3 indexing lets us bound candidate count by cell hit, not by area:

center_cell = h3_index(lat, lng, resolution=8)
candidate_cells = h3_grid_ring(center_cell, k=1) // 7 cells
candidates = SELECT * FROM businesses WHERE h3_cell IN candidate_cells AND ...
if len(candidates) < limit:
expand to k=2 (19 cells), and so on

Adaptive expansion gives k-nearest semantics — return the closest 20, regardless of how dense the area is.

Ranking#

Sort by relevance is the interesting one; distance and rating are trivial.

Relevance combines:

  • Lexical match: business name and description match the query.
  • Distance: closer is better, exponentially decaying.
  • Quality: rating × log(1 + review_count).
  • Personalization: user’s category preferences (from past clicks), if logged in.
  • Recency / freshness: very recent positive reviews bump a business.

Implementation: a lightweight scorer (linear model or boosted trees) over the candidate set. Latency budget for ranking ~50 ms over 200 candidates.

Caching#

Each search query has a natural cache key: (query, geohash_prefix, filters, sort). Identical queries from the same neighborhood hit the cache (Redis, 60-second TTL). Hit rate is modest (~20%) because users tweak filters often; cache mostly absorbs viral queries (“happy hour” on a Friday evening in a popular area).

Freshness#

Business data changes slowly: hours updated by owners, new businesses onboarded daily. The index updates run hourly (incremental) and nightly (full rebuild). For most queries, hour-old data is invisible to the user.

The exception is “open now” — opens and closes happen by the minute. Implement as a runtime filter over hours jsonb rather than precomputing an is_open boolean (which would go stale). Cost: ~1 µs per row, well within budget.

Photos and reviews#

Listed search results show one photo URL and a star rating. The photo lives in the CDN; the URL is precomputed and cached as part of the business metadata. Full review text is fetched only when the user taps into the business detail page.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: ranking cost during peak. 10 K QPS × 200 candidates × 50 ms ranking = ~5 cores fully saturated per shard. Fan out ranking work across the search-service fleet; cache rankings for popular (query, neighborhood) pairs.

Bottleneck #2: cold-cache latency in dense urban areas. First search of the day in San Francisco hits the GiST index, fans into ~5,000 candidates, picks 20. Page-warming on app launch (a background prefetch of “near you” results) makes the first interaction feel instant.

Bottleneck #3: write amplification on a viral business. A new restaurant featured in a major article gets 100× normal review-write traffic for a week. Write-only paths (post review, post photo) are out of scope here but worth noting: the rating field must be a sharded counter or a periodic recompute, not a per-write update.

Alternative I’d push back on: building this on Elasticsearch alone with no relational store. ES is fine for the geo index and lexical scoring, but it’s the wrong store for “business hours and ownership info” — those are mutable, transactional, and benefit from relational constraints. Two stores, joined at query time, is the right shape.

What breaks first at 10× scale (1 B MAU): the global business store would still fit on one DB, but write amplification on review counts and ratings would dominate. Partition by region from day one; even if the data fits centrally, the operational shape does not.

Companies this resembles#

Yelp, Google Maps Places, Foursquare, TripAdvisor, Zomato. Cousins: real-estate search (Zillow, same geo problem, different ranking signals), dating apps (geo + soft-match).

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.