Design a Search Service API

Query, suggest, rank, paginate. Where latency budget lives and how to write a search API that won't go viral on the wrong page.

System Intermediate
11 min read
search api-design pagination ranking

Context#

A search service is the foundational API every product team eventually owns. E-commerce search, employee search, content search, log search, support-ticket search — the surface looks different but the contract is the same: take a query string, return a ranked list of matches, support pagination, support suggestions, expose enough filtering for the use cases that matter.

The interviewer’s hidden objectives, roughly in order:

  • Can you scope the search without spinning into an entire Google rebuild?
  • Can you write a clean read API with sensible pagination, sorting, and filtering?
  • Can you produce a latency budget that respects the 100 ms perception threshold for typeahead and the 300 ms threshold for full results?
  • Can you handle the suggest vs query distinction correctly — they have different latency budgets, different SLOs, often different backends?
  • Can you defend the ranking story without claiming to have built BM25 in the room?

This is a Foundational-tier system in the source curriculum precisely because it forces you to engage with the core API-design vocabulary: endpoints, schemas, pagination, filtering, ranking, idempotency (read APIs are inherently idempotent, but their cache semantics involve a real trade-off), and a latency budget.

Requirements (functional and non-functional)#

Functional — in scope:

  • Query by free-text string. Returns ranked results.
  • Suggest (autocomplete) endpoint optimised for sub-100 ms response.
  • Pagination via opaque cursor (not page numbers).
  • Optional filters (faceted by stable enum fields — category, date range, language).
  • Optional sort overrides (relevance is default; price-asc, recency, etc. as overrides).
  • Result hit counts (approximate, not exact at scale).
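
The opaque-cursor requirement above can be sketched as follows. This is a minimal illustration, not the service's actual cursor format: it assumes the cursor carries the last hit's sort value and id, and signs the payload with HMAC so clients cannot forge positions. The names `encode_cursor`, `decode_cursor`, and `SECRET` are hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical signing key; a real service would load this from a secret store.
SECRET = b"demo-secret"


def encode_cursor(last_sort_value, last_id):
    """Pack the position of the last returned hit into an opaque, signed token."""
    payload = json.dumps(
        {"v": last_sort_value, "id": last_id}, separators=(",", ":")
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()  # 32 bytes
    return base64.urlsafe_b64encode(sig + payload).decode()


def decode_cursor(token):
    """Verify the signature and return (last_sort_value, last_id)."""
    try:
        raw = base64.urlsafe_b64decode(token.encode())
    except ValueError as exc:  # binascii.Error is a ValueError subclass
        raise ValueError("malformed cursor") from exc
    sig, payload = raw[:32], raw[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("cursor signature mismatch")
    data = json.loads(payload)
    return data["v"], data["id"]
```

Because the token is opaque, the server stays free to change what it encodes (for example, adding a shard hint) without breaking clients; a rejected cursor would map naturally to a 400 response.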

Functional — out of scope:

  • Index-management API (admin only; not part of the public search contract).
  • Personalisation (treat the query as anonymous in v1; user-aware ranking is a future axis).
  • Spell correction as a separate endpoint (server applies it transparently and reports it in the response).
  • Vector / semantic search (call out as a follow-up; not in v1 scope).

Non-functional:

  • Latency: typeahead <= 100 ms p95, full query <= 300 ms p95.
  • Throughput: 10k QPS sustained, 50k QPS burst.
  • Availability: 99.95% on the read path.
  • Freshness: writes show up in results within 60 seconds.

Use case diagram#

              ┌──────────────┐
              │   End user   │
              └──────┬───────┘
      ┌──────────────┼──────────────┐
      ▼              ▼              ▼
[type in box]   [paginate]   [filter / sort]
      │              │              │
      ▼              ▼              ▼
 ┌──────────────────────────────────────┐
 │          Search Service API          │
 └────────────────┬─────────────────────┘
          ┌───────────────┐
          │ Internal team │ … index manage, observe metrics
          └───────────────┘

Two actors (End user, Internal team). The internal-team use cases are admin and out of public-API scope.

Class diagram#

            ┌───────────────────────────┐
            │ SearchService             │
            ├───────────────────────────┤
            │ query(req): SearchResp    │
            │ suggest(req): SuggestResp │
            └─────────────┬─────────────┘
┌───────────────────────┐     ┌─────────────────────┐
│ SearchRequest         │     │ SearchResponse      │
├───────────────────────┤     ├─────────────────────┤
│ q : string            │     │ hits : Hit[]        │
│ cursor? : string      │     │ next_cursor? : str  │
│ page_size : int (≤100)│     │ total_estimate : int│
│ filters? : Filter[]   │     │ took_ms : int       │
│ sort? : SortKey       │     │ correction? : Corr  │
└───────────────────────┘     └─────────────────────┘
            │                            │
            │                            ▼
            │                     ┌──────────────┐
            │                     │ Hit          │
            │                     ├──────────────┤
            │                     │ id, title    │
            │                     │ snippet      │
            │                     │ score        │
            │                     │ fields{...}  │
            │                     └──────────────┘
┌───────────────────────┐
│ Filter                │  facet : enum   op : in | eq | range
│ SortKey               │  field : enum   dir : asc | desc
└───────────────────────┘

SearchService is the API. SearchRequest and SearchResponse are the wire shapes. Hit is per-result. Filter is constrained — only stable enum fields are filterable, so the URL can be reasoned about and the result cached.
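
One way to make the "constrained filters, cacheable request" idea concrete is a request type that validates itself and derives a canonical cache key. This is a sketch under stated assumptions: the facet whitelist, the lower-casing of `q`, and the `cache_key` method are illustrative choices, not part of the contract above.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional

# Assumed whitelists; the text only says filters and sorts are stable enums.
ALLOWED_FACETS = {"category", "language", "date_range"}
ALLOWED_SORTS = {"relevance", "recency", "price_asc", "price_desc"}


@dataclass(frozen=True)
class SearchRequest:
    q: str
    cursor: Optional[str] = None
    page_size: int = 20
    filters: dict = field(default_factory=dict)  # facet name -> value
    sort: str = "relevance"

    def __post_init__(self):
        if not 1 <= len(self.q) <= 256:
            raise ValueError("q must be 1..256 characters")
        if not 1 <= self.page_size <= 100:
            raise ValueError("page_size must be 1..100")
        if self.sort not in ALLOWED_SORTS:
            raise ValueError(f"unknown sort: {self.sort}")
        unknown = set(self.filters) - ALLOWED_FACETS
        if unknown:
            raise ValueError(f"unknown facets: {sorted(unknown)}")

    def cache_key(self):
        """Same logical request -> same key, regardless of parameter order."""
        canonical = json.dumps(
            {
                # Assumption: queries are case- and whitespace-insensitive.
                "q": " ".join(self.q.lower().split()),
                "cursor": self.cursor,
                "page_size": self.page_size,
                "filters": sorted(self.filters.items()),
                "sort": self.sort,
            },
            sort_keys=True,
        )
        return "search:" + hashlib.sha256(canonical.encode()).hexdigest()
```

Restricting filters to a known set is what keeps the key space small enough for the cache to be useful; a free-text filter would make almost every key unique.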

Sequence diagram (key flows)#

The full query flow, client to backend and back:

Client             Gateway       SearchAPI         Indexer            Cache
   │ GET /search?q=...│              │                │                 │
   │─────────────────►│              │                │                 │
   │                  │ rate limit?  │                │                 │
   │                  │ auth?        │                │                 │
   │                  │─────────────►│                │                 │
   │                  │              │ cache key      │                 │
   │                  │              │ ───────────────────────────────► │
   │                  │              │ hit?           │                 │
   │                  │              │ ◄─────────────────────────────── │
   │                  │              │ miss           │                 │
   │                  │              │ query(q,…)     │                 │
   │                  │              │───────────────►│                 │
   │                  │              │ ranked hits    │                 │
   │                  │              │◄───────────────│                 │
   │                  │              │ set cache      │                 │
   │                  │              │ ───────────────────────────────► │
   │                  │ SearchResp   │                │                 │
   │                  │◄─────────────│                │                 │
   │ 200 OK + body    │              │                │                 │
   │◄─────────────────│              │                │                 │

The suggest flow is identical in shape but bypasses the Cache layer (typeahead changes every keystroke; cache hit rate is poor) and runs against a separate index optimised for prefix matches.
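
To illustrate what "a separate index optimised for prefix matches" might look like at its simplest, here is a sketch that keeps terms in a sorted list and uses binary search to find the run of terms sharing a prefix. The `PrefixIndex` class and its popularity weights are hypothetical; production systems more often use tries, FSTs, or an engine's built-in completion suggester.

```python
import bisect


class PrefixIndex:
    """In-memory suggest index: sorted terms plus a popularity weight per term."""

    def __init__(self, weighted_terms):
        # weighted_terms: dict of term -> popularity weight (illustrative data).
        self._weights = {t.lower(): w for t, w in weighted_terms.items()}
        self._terms = sorted(self._weights)

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        if not prefix:
            return []
        # All terms sharing the prefix form one contiguous run in sorted order.
        lo = bisect.bisect_left(self._terms, prefix)
        hi = bisect.bisect_left(self._terms, prefix + chr(0x10FFFF))
        matches = self._terms[lo:hi]
        # Highest weight first; ties broken alphabetically for stable output.
        matches.sort(key=lambda t: (-self._weights[t], t))
        return matches[:limit]


idx = PrefixIndex({
    "wireless mouse": 90,
    "wireless keyboard": 70,
    "wired mouse": 40,
    "webcam": 10,
})
print(idx.suggest("wire"))
# ['wireless mouse', 'wireless keyboard', 'wired mouse']
```

The lookup costs two binary searches plus a sort over the matching run, which is why this kind of structure can stay comfortably inside a tight typeahead budget when held in memory.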

Activity diagram (for non-trivial state)#

The query lifecycle is simple — read-only, no state machine — but the error / degradation path has structure worth showing:

[request arrives]
┌───────────────┐
│ validate q    │── invalid → 400 Bad Request
└───────┬───────┘
        │ valid
┌───────────────┐
│ check rate    │── exceeded → 429 + Retry-After
└───────┬───────┘
        │ within budget
┌───────────────┐
│ cache lookup  │── hit → 200 (cached)
└───────┬───────┘
        │ miss
┌───────────────┐
│ indexer       │── timeout → 503 + cached-fallback?
└───────┬───────┘
        │ success
┌───────────────┐
│ rank + slice  │
└───────┬───────┘
     200 OK

The 503 + cached-fallback branch is the interesting one — when the live indexer is slow, returning a slightly-stale cached result for the same query is almost always better than failing the request. Stripe’s Tracing-vs-Outage trade-off illustrates this principle.
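
The cache, indexer, and fallback steps of that path can be sketched as below. Validation and rate limiting are left out for brevity. Everything here is an assumption for illustration: `StaleCache` is an in-process stand-in for the real cache that keeps entries past their TTL, and `IndexerTimeout` is a hypothetical exception raised by the indexer client.

```python
import time


class IndexerTimeout(Exception):
    """Raised by the (hypothetical) indexer client when it misses its deadline."""


class StaleCache:
    """Stand-in cache that retains expired entries so they can serve as fallback."""

    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._data = {}  # key -> (stored_at, value)

    def put(self, key, value):
        self._data[key] = (self._clock(), value)

    def get(self, key, allow_stale=False):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        fresh = self._clock() - stored_at <= self._ttl
        return value if (fresh or allow_stale) else None


def handle_query(key, cache, query_indexer):
    """Return (status, body, stale) for the cache -> indexer -> fallback path."""
    cached = cache.get(key)
    if cached is not None:
        return 200, cached, False          # fresh cache hit
    try:
        body = query_indexer()             # live path
    except IndexerTimeout:
        fallback = cache.get(key, allow_stale=True)
        if fallback is not None:
            return 200, fallback, True     # fail soft with stale data
        return 503, None, False            # nothing to fall back on
    cache.put(key, body)
    return 200, body, False
```

The design point is that the fallback only works if expired entries are retained somewhere; a cache that evicts strictly at TTL leaves nothing to serve when the indexer times out.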

API implementation#

Endpoint catalogue#

Method | Path        | Purpose
GET    | /v1/search  | Full search; returns ranked hits + cursor
GET    | /v1/suggest | Typeahead; returns top-N completions only

Two endpoints. No POSTs. Read-only. Cacheable at every layer.

OpenAPI schema (excerpt)#

OpenAPI 3.1 — Search API
paths:
  /v1/search:
    get:
      operationId: search
      parameters:
        - name: q
          in: query
          required: true
          schema: { type: string, minLength: 1, maxLength: 256 }
        - name: cursor
          in: query
          schema: { type: string }
        - name: page_size
          in: query
          schema: { type: integer, minimum: 1, maximum: 100, default: 20 }
        - name: filter
          in: query
          style: deepObject
          explode: true  # deepObject is only defined with explode: true -> filter[category]=...
          schema:
            type: object
            additionalProperties: { type: string }
        - name: sort
          in: query
          schema:
            type: string
            enum: [relevance, recency, price_asc, price_desc]
            default: relevance
      responses:
        '200':
          description: Ranked hits
          headers:
            X-RateLimit-Remaining:
              schema: { type: integer }
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/SearchResponse'
        '400': { description: Invalid query }
        '429': { description: Too many requests }
        '503': { description: Indexer unavailable and no cached fallback }
components:
  schemas:
    SearchResponse:
      type: object
      required: [hits, took_ms]
      properties:
        hits:
          type: array
          items: { $ref: '#/components/schemas/Hit' }
        # OpenAPI 3.1 drops `nullable`; nullability is expressed as a type array.
        next_cursor: { type: [string, 'null'] }
        total_estimate: { type: integer }
        took_ms: { type: integer }
        correction:
          type: [object, 'null']
          properties:
            applied_q: { type: string }
            original_q: { type: string }
    Hit:
      type: object
      required: [id, score]
      properties:
        id: { type: string }
        title: { type: string }
        snippet: { type: string }
        score: { type: number }
        fields: { type: object, additionalProperties: true }

Client samples — three languages#

The GET /v1/search call from the client's side. Only the Python sample is reproduced here; Go and Node versions would follow the same shape, matching the multi-language convention used across this workbook.

Search client — Python
import requests


def search(q, cursor=None, page_size=20, sort="relevance", filters=None):
    """Call GET /v1/search and return the decoded JSON body."""
    params = {"q": q, "page_size": page_size, "sort": sort}
    if cursor:
        params["cursor"] = cursor
    if filters:
        # deepObject style: filter[category]=peripherals
        for key, val in filters.items():
            params[f"filter[{key}]"] = val
    resp = requests.get(
        "https://api.example.com/v1/search",
        params=params,
        headers={"Authorization": "Bearer eyJhbGciOi..."},
        timeout=2,  # seconds; fail fast rather than hang on a slow backend
    )
    resp.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return resp.json()


page = search("wireless mouse", filters={"category": "peripherals"})
for hit in page["hits"]:
    # Only id and score are required by the schema, so read title defensively.
    print(hit["id"], hit.get("title"), hit["score"])
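
Cursor pagination on the client side usually reduces to a loop that follows next_cursor until it is absent. The generator below is a sketch of that loop; `iter_hits` and its `max_hits` safety cap are hypothetical names, and `fetch_page` is any callable with the same signature as the `search` function in the sample.

```python
def iter_hits(fetch_page, q, max_hits=1000, **kwargs):
    """Yield hits across pages by following next_cursor until it runs out.

    fetch_page(q, cursor=..., **kwargs) must return a dict with a "hits" list
    and an optional "next_cursor" string.
    """
    cursor = None
    seen = 0
    while seen < max_hits:
        page = fetch_page(q, cursor=cursor, **kwargs)
        for hit in page["hits"]:
            yield hit
            seen += 1
            if seen >= max_hits:
                return
        cursor = page.get("next_cursor")
        # Stop on a missing cursor, or on an empty page as a defensive guard.
        if not cursor or not page["hits"]:
            return


# Usage with a fake fetcher standing in for the HTTP client:
_pages = {
    None: {"hits": [{"id": "1"}, {"id": "2"}], "next_cursor": "c1"},
    "c1": {"hits": [{"id": "3"}], "next_cursor": None},
}


def fake_fetch(q, cursor=None, **kwargs):
    return _pages[cursor]


print([h["id"] for h in iter_hits(fake_fetch, "wireless mouse")])
# ['1', '2', '3']
```

The client never inspects or constructs the cursor; it only echoes back what the server returned, which is what keeps the token format free to change.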

Latency budget#

The 300 ms p95 full-query budget breaks down as:

Phase                      | Budget    | Notes
TLS + HTTP/2 setup         | 0 ms      | Connection pool keeps these warm.
Gateway (rate limit, auth) | 5 ms      | In-memory token-bucket, JWT verify cached.
Cache lookup               | 2 ms      | Redis in the same DC.
Indexer fan-out + merge    | 80 ms p95 | The dominant cost.
Ranking + scoring          | 30 ms     | After candidate retrieval.
Serialize + transport      | 20 ms     | JSON encoding + ~20 KB body.
Margin for tail            | 60 ms     | Headroom for slow shards.
Total                      | 197 ms    | Safely under 300 ms.

Typeahead’s 100 ms budget roughly halves every line and drops two of them: the cache lookup (low hit rate) and ranking (top-N by prefix match is enough).
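
The arithmetic behind both budgets is easy to check mechanically. The snippet below simply re-adds the table's figures and applies the "halve and skip" rule for typeahead; the dictionary keys are made-up labels for the table rows.

```python
FULL_QUERY_BUDGET_MS = 300

# Per-phase budgets from the table, in milliseconds.
phases_ms = {
    "tls_http2_setup": 0,
    "gateway": 5,
    "cache_lookup": 2,
    "indexer_fanout_merge": 80,
    "ranking_scoring": 30,
    "serialize_transport": 20,
    "tail_margin": 60,
}

total_ms = sum(phases_ms.values())
headroom_ms = FULL_QUERY_BUDGET_MS - total_ms

# Typeahead: halve every phase, then drop the two phases it skips.
typeahead_ms = {
    name: ms / 2
    for name, ms in phases_ms.items()
    if name not in {"cache_lookup", "ranking_scoring"}
}
typeahead_total_ms = sum(typeahead_ms.values())

print(total_ms, headroom_ms, typeahead_total_ms)  # 197 103 82.5
```

So the full query leaves about 103 ms of slack against the 300 ms target, and the typeahead path lands at roughly 82.5 ms against its 100 ms target.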

Trade-offs and extensions#

Decision                            | Why                                | Cost if requirements change
Cursor pagination, not page numbers | Stable under writes; index-friendly | Can’t jump to an arbitrary page (a UX trade-off)
Filters as enum-only fields         | Cacheable URLs; tractable index    | Free-text filters require a re-design
Approximate total_estimate          | O(1) to compute at scale           | Users expect exact counts under 100
GET-only API                        | Cacheable at every layer           | No personalisation possible without auth-aware caching
Single relevance + sort dimension   | Simple model; predictable          | Multi-dimensional ranking needs a richer sort grammar
60 s freshness SLO                  | Indexers can batch                 | Real-time use cases (chat search) need < 5 s

Likely follow-up extensions and the shape of the answer:

  • Personalisation. Move from anonymous to per-user ranking. Adds an auth requirement; changes cache key shape (user-aware); forces splitting the cache into per-user buckets.
  • Vector search. A vector mode that uses dense embeddings. Same API shape; new backend. Most teams build vector and lexical in parallel and merge results.
  • Multi-tenant search. Add tenant scoping at the API level (X-Tenant-Id header); per-tenant rate limits; per-tenant indexes for isolation.
  • Webhook on index-staleness. Push notifications to clients when their last query is now stale enough to re-run.
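
For the vector-search extension, "merge results" needs some rule for combining two ranked lists whose scores are not comparable. Reciprocal rank fusion (RRF) is one widely used option that works on ranks alone; the text does not prescribe it, so treat the sketch below as one possible merge step, with `rrf_merge` a hypothetical helper name.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60, limit=10):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, then by id so ties are deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:limit]


lexical = ["a", "b", "c"]   # e.g. from the lexical index
vector = ["b", "d", "a"]    # e.g. from the embedding index
print(rrf_merge([lexical, vector]))  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, the API contract can keep returning a single score per hit without exposing which backend produced it.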

Mock interview follow-ups#

Questions interviewers reach for and the shortest correct answer:

  • “How do you prevent a deep-paginated request from blowing up?” — Cursor-based pagination caps it naturally; we never seek past N results. Hard limit at 10k.
  • “What’s the cache invalidation story?” — Cache key on the full normalised request (q, filters, sort, cursor, page_size); TTL of 60 s (matching the freshness SLO); plus a soft “stampede” lock so concurrent misses for the same key trigger only one indexer query.
  • “How do you do typeahead in 100 ms?” — Separate index optimised for prefix matches; in-memory; warm cache at the edge; degrade gracefully on miss (return fewer results, not slower).
  • “How do you rank?” — Out of scope to design the ranking algorithm in an API round. The API contract is score: number; the engine behind it is BM25 / dense retrieval / a learned-to-rank model — the API doesn’t care.
  • “What happens at 10x scale?” — Shard the index by document partition (e.g. category); fan out the query; merge top-K. Add a read-replica gateway closer to the user.
  • “How do you A/B test ranking changes?” — Header-based experiment routing; the API contract stays unchanged; observability captures experiment_id per request.
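  • “What if the indexer is down?” — Return cached results marked as stale and fail soft. A Warning: 110 header is the historical signal, but RFC 9111 has obsoleted Warning, so a stale flag in the body or a custom header is the safer choice. If the cache is also cold, return 503 with Retry-After: 1.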
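
The "fan out the query; merge top-K" answer can be sketched with a lazy k-way merge. This assumes each shard returns its hits already sorted by score, highest first, and that scores are comparable across shards (which in practice needs shared corpus statistics); `merge_top_k` is a hypothetical helper name.

```python
import heapq
from itertools import islice


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists into a global top-k.

    Each shard list holds (score, doc_id) tuples sorted by score, descending.
    heapq.merge walks the lists lazily, so only about k items are consumed.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


shard_a = [(9.1, "a1"), (4.0, "a2")]
shard_b = [(7.5, "b1"), (7.0, "b2"), (1.0, "b3")]
print(merge_top_k([shard_a, shard_b], 3))
# [(9.1, 'a1'), (7.5, 'b1'), (7.0, 'b2')]
```

Each shard only needs to return its own top-K for the global top-K to be exact, which bounds the merge cost no matter how large the index grows.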