Proximity Service (Yelp)
Geo-indexing, dynamic segments, search-within-radius queries at city scale.
Step 1 — Clarify Requirements#
Functional
- Given a location and radius (or a viewport), return nearby businesses matching a query (
coffee,dentist,vegan ramen). - Filter by category, rating, open-now, price tier.
- Sort by distance, rating, or relevance.
- Business profile pages (name, hours, photos, reviews) — but the read path for those is uninteresting and out of scope.
- Out of scope: write path for reviews, photo upload, business onboarding.
Non-functional
- 99.9% availability.
- p99 search latency under 300 ms.
- 100 M MAU, ~10 K searches/sec peak.
- ~50 M businesses globally.
- Updates to business data are infrequent (hours, address); the index can refresh nightly.
Step 2 — Capacity Estimation#
- Businesses indexed: 50 M.
- Search QPS: 10 K/sec peak, ~3 K/sec average.
- Per-search work: query the geo index for businesses within radius, then rank top 20. With ~100 candidates per typical urban query, ranking cost is small.
- Storage: business metadata averages 5 KB (name, address, hours, categories, summary). Photos and full reviews are out of band (blob store). Total: 50 M × 5 KB = 250 GB of structured data. Tiny — fits comfortably on a single shard, but partitioning by region helps with locality.
- Geo index size: lat/lng + business_id + category bitset per row ≈ 40 B; 50 M × 40 B = 2 GB. In-memory across the fleet.
The system is read-heavy and modest in scale. The interesting parts are the spatial query, the relevance ranking, and the freshness story.
Step 3 — System Interface#
GET /search ?q=<query> (e.g., "coffee", optional) &lat=<float>&lng=<float> &radius_m=<int> (or &viewport=...) &category=<id> (optional) &open_now=<bool> &min_rating=<float> &sort=<distance|rating|relevance> &limit=20 &cursor=<opaque>→ { results: [...], next_cursor }
GET /businesses/:id (full profile)The search response includes only enough for a list (name, photo URL, rating, distance); detail is fetched on tap.
Step 4 — High-Level Design#
┌── photo + review CDN │client → LB → API ──┬── /search ──→ search service ──→ geo index (sharded by region) │ │ │ │ ▼ ▼ │ ranker / scorer business store (Postgres / KV) │ │ │ └── reads metadata for candidates │ └── /businesses/:id ──→ business storeThe geo index and the business store can be one and the same (a Postgres database with PostGIS, partitioned by region), or split (an Elasticsearch / OpenSearch geo index + a relational store for the truth). Both shapes work at this scale; the trade-off is below.
Step 5 — Data Model#
Businesses:
table businesses business_id uuid PK name string geo point // lat, lng geohash string // for non-PostGIS implementations categories array<string> rating decimal review_count int hours jsonb // for "open now" filter metadata jsonbGeo index — several viable approaches:
geo (point) — proper R-tree / GiST, accurate distances, native to Postgres. Best when the business store and geo index are co-located. Scales to 50 M rows easily. 9q8yyk, index as B-tree. Cheap in any KV store. Querying “within radius” requires querying multiple neighboring geohashes — coarse but very simple. A third option, often better at scale: quadtree or H3 hex indexes. Each business lives in an H3 cell at resolution ~8 (~0.7 km²); a search expands outward, ring by ring, until it has enough candidates.
Step 6 — Detailed Design#
The geo query#
Most “search within radius” queries are answered as “search within bounding box, then post-filter”:
1. Compute bbox: (lat ± dlat, lng ± dlng) where dlat, dlng approximate the radius.2. SELECT * FROM businesses WHERE ST_DWithin(geo, ST_Point(lat,lng)::geography, radius) AND categories @> [category] AND CASE WHEN open_now THEN check_hours(hours, now()) LIMIT 200;3. Rank the 200 candidates by ranker.4. Return top 20.PostGIS uses the GiST index to prune the bbox before the precise ST_DWithin filter. For 50 M rows partitioned by region, each query touches one partition and returns in 5-20 ms.
Why H3 / quadtree for huge cities#
Manhattan has ~100 K businesses in 60 km². A naive radius query returns hundreds of thousands of candidates. H3 indexing lets us bound candidate count by cell hit, not by area:
center_cell = h3_index(lat, lng, resolution=8)candidate_cells = h3_grid_ring(center_cell, k=1) // 7 cellscandidates = SELECT * FROM businesses WHERE h3_cell IN candidate_cells AND ...if len(candidates) < limit: expand to k=2 (19 cells), and so onAdaptive expansion gives k-nearest semantics — return the closest 20, regardless of how dense the area is.
Ranking#
Sort by relevance is the interesting one; distance and rating are trivial.
Relevance combines:
- Lexical match: business name and description match the query.
- Distance: closer is better, exponentially decaying.
- Quality: rating × log(1 + review_count).
- Personalization: user’s category preferences (from past clicks), if logged in.
- Recency / freshness: very recent positive reviews bump a business.
Implementation: a lightweight scorer (linear model or boosted trees) over the candidate set. Latency budget for ranking ~50 ms over 200 candidates.
Caching#
Each search query has a natural cache key: (query, geohash_prefix, filters, sort). Identical queries from the same neighborhood hit the cache (Redis, 60-second TTL). Hit rate is modest (~20%) because users tweak filters often; cache mostly absorbs viral queries (“happy hour” on a Friday evening in a popular area).
Freshness#
Business data changes slowly: hours updated by owners, new businesses onboarded daily. The index updates run hourly (incremental) and nightly (full rebuild). For most queries, hour-old data is invisible to the user.
The exception is “open now” — opens and closes happen by the minute. Implement as a runtime filter over hours jsonb rather than precomputing an is_open boolean (which would go stale). Cost: ~1 µs per row, well within budget.
Photos and reviews#
Listed search results show one photo URL and a star rating. The photo lives in the CDN; the URL is precomputed and cached as part of the business metadata. Full review text is fetched only when the user taps into the business detail page.
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: ranking cost during peak. 10 K QPS × 200 candidates × 50 ms ranking = ~5 cores fully saturated per shard. Fan out ranking work across the search-service fleet; cache rankings for popular (query, neighborhood) pairs.
Bottleneck #2: cold-cache latency in dense urban areas. First search of the day in San Francisco hits the GiST index, fans into ~5,000 candidates, picks 20. Page-warming on app launch (a background prefetch of “near you” results) makes the first interaction feel instant.
Bottleneck #3: write amplification on a viral business. A new restaurant featured in a major article gets 100× normal review-write traffic for a week. Write-only paths (post review, post photo) are out of scope here but worth noting: the rating field must be a sharded counter or a periodic recompute, not a per-write update.
Alternative I’d push back on: building this on Elasticsearch alone with no relational store. ES is fine for the geo index and lexical scoring, but it’s the wrong store for “business hours and ownership info” — those are mutable, transactional, and benefit from relational constraints. Two stores, joined at query time, is the right shape.
What breaks first at 10× scale (1 B MAU): the global business store would still fit on one DB, but write amplification on review counts and ratings would dominate. Partition by region from day one; even if the data fits centrally, the operational shape does not.
Companies this resembles#
Yelp, Google Maps Places, Foursquare, TripAdvisor, Zomato. Cousins: real-estate search (Zillow, same geo problem, different ranking signals), dating apps (geo + soft-match).
Related systems#
- Google Maps — same geo substrate, different problem (routing vs proximity search).
- Distributed Search — the lexical-relevance backbone.
- Uber — same H3 geo-index pattern, write-heavy variant.
- Typeahead Suggestion — what runs as the user types the query.