Uber — System Design · Engineering Playbook

Step 1 — Clarify Requirements#

Functional

A rider requests a ride from point A to point B → system finds and dispatches a nearby driver.
Drivers stream their location to the server; riders see nearby drivers on a live map.
Trip lifecycle: request → match → en route to pickup → in trip → drop-off → payment.
Surge pricing when demand outpaces supply in a region.
Fare estimation before request.
Out of scope: ratings & reviews, driver onboarding, the routing engine itself (see /system-design/google-maps), pool / shared rides.

Non-functional

99.99% availability for the rider request path.
p99 match latency under 3 seconds (request to driver assigned).
25 M trips/day globally; ~5 M active drivers; ~100 M monthly riders.
Strong consistency for payments and trip state; eventual consistency acceptable for driver location and surge.

Step 2 — Capacity Estimation#

Active drivers: 5 M streaming location every 4 seconds → 1.25 M location writes/sec.
Active riders polling map: ~1 M concurrent at peak → 200 K read QPS for nearby-driver queries.
Trip requests: 25 M/day ≈ 300 trips/sec average, 2 K/sec peak.
Storage (location pings): 1.25 M writes/sec × 86 400 × 60 B ≈ 6 TB/day raw. Retain only the latest per driver for matching; archive trip routes only.
Storage (trips): 25 M × ~10 KB = 250 GB/day, 90 TB/year of trip records.
Surge calc: city-grid cells × 30-second windows; aggregation runs continuously over the location stream.

The whole problem is a geo-indexed lookup under continuous updates — a write-heavy spatial index for drivers, a read-heavy nearest-neighbor query for riders.

Step 3 — System Interface#

// Driver
POST  /driver/location          { driver_id, lat, lng, heading, ts }
POST  /driver/online            { driver_id }
POST  /driver/accept            { trip_id }

// Rider
POST  /rider/estimate           { origin, destination } → { fare, eta }
POST  /rider/request            { origin, destination, payment_method } → { trip_id }
GET   /rider/nearby             ?lat&lng → [ {driver_id, lat, lng}, ... ]
GET   /trip/:id                 (state + driver location)

// Lifecycle (server-emitted via WebSocket to both rider and driver)
TRIP_MATCHED, DRIVER_ARRIVING, TRIP_STARTED, TRIP_ENDED, PAYMENT_SETTLED

Trip requests are idempotent on (rider_id, request_id) — if the rider’s phone retries on a flaky network, the server must not create two trips.

Step 4 — High-Level Design#

  driver app                                              rider app
       │                                                       │
       ▼                                                       ▼
   location ingest ──→ Kafka ──→ geo index (Redis Geo + city shard)
       │                              ▲                        │
       │                              │                        ▼
       │                              │             nearby-drivers service
       │                              │                        │
       ▼                              │                        ▼
   trip state ◀── trip orchestrator ──┴── matching service ◀── request
   service              │                       ▲
                        ▼                       │
                  payments + ledger        surge engine (city-grid demand vs supply)

The hot loop on request is: rider → matching service → query geo index for K nearest drivers → score / pick one → dispatch via push notification → driver accepts → trip orchestrator binds state.

Step 5 — Data Model#

Geo index (Redis with GEOADD, sharded by city / H3 hex):

key: drivers:city:{city_id} → GEOSET (lng, lat, driver_id)
GEORADIUS drivers:city:42 BYRADIUS lng lat 3 km COUNT 20

Sharding by city (or an H3 hex of resolution 6, ~1 km diameter) keeps each shard’s working set small and reads cheap. Cross-city queries are rare (no rider in Manhattan needs to see Brooklyn drivers).

Trips (relational, partitioned by trip_id):

table trips
  trip_id        uuid     PK
  rider_id       uuid
  driver_id      uuid     null until matched
  origin         { lat, lng, addr }
  destination    { lat, lng, addr }
  state          enum(requested, matched, en_route_pickup, in_trip, completed, cancelled)
  fare           money    null until completed
  surge_factor   decimal
  requested_at   timestamp
  ...

Driver state (Redis, hot):

key: driver:{id}:state → { online, on_trip, current_trip_id, last_ping }

Trip events (append-only Kafka topic, persisted to OLAP):

{ trip_id, event: 'matched', ts, driver_id, ... }

Surge cells (Redis):

key: surge:{city}:{h3_cell}:{minute_bucket} → { demand: N, supply: M, factor }

Step 6 — Detailed Design#

Driver location ingest#

Drivers send a location ping every 4 seconds (configurable; less frequent if stationary). Pings go to the nearest regional gateway over a long-lived connection.

ping → gateway → kafka topic 'driver-pings' (partition by driver_id)
   ↓
   consumer updates Redis GEOSET drivers:city:{city}
   ↓
   consumer updates driver:{id}:last_ping

We absorb 1.25 M writes/sec by discarding stale pings: if a ping arrives older than the latest seen for that driver, drop it. Kafka per-driver ordering plus consumer offset tracking gives us this.

Nearest-driver query#

A rider request lands at the matching service:

1. Look up rider's home city: city_id.
2. GEORADIUS drivers:city:{city_id} BYRADIUS lat lng 3 km COUNT 20.
3. Filter: online, not on_trip, vehicle_type matches.
4. Score each candidate:
     score = w1 / eta + w2 * acceptance_rate + w3 / recent_cancellations
5. Pick top 1, send dispatch push to that driver.
6. Wait for accept (timeout 15s). If declined or timed out, try next.

The matching algorithm is one of the most-tuned pieces of any rideshare backend. Real-world variants include batched matching (every 5 seconds, optimize over the bipartite graph of pending requests × free drivers) — see below.

Surge pricing#

A separate streaming pipeline aggregates demand and supply per (city, H3 cell, 30-second bucket):

demand[city][cell][bucket] = count of requests in window
supply[city][cell][bucket] = count of online unmatched drivers
surge_factor = clip(demand / supply, 1.0, 5.0)

The surge factor is eventually consistent and is shown in the rider app on the fare estimate. Once a rider accepts at surge factor X, the factor is locked into the trip record at request time and is not revised — payments must be predictable.

Trip orchestration#

The trip state machine is implemented as a per-trip workflow (think Temporal or an in-house orchestrator):

REQUESTED → MATCHED → DRIVER_EN_ROUTE → ARRIVED → IN_TRIP → COMPLETED → PAID
                  ↓
             CANCELLED (rider or driver)

State transitions are persisted; events emit on Kafka. The orchestrator is the source of truth for “what’s the current trip state” — clients subscribe and render accordingly.

Payment path#

When the trip ends:

1. Compute final fare (distance × per-km + time × per-min + surge × base).
2. Idempotent ledger write: debit rider's saved payment, credit Uber + driver split.
3. Async settlement with the payment processor (see /system-design/payment-system).
4. If charge fails, attempt fallback method; mark trip as `payment_pending` and notify.

The double-entry ledger guarantees no money is created or lost in the accounting. Real settlement (bank rails) happens hours later.

Driver / rider matching at edge cases#

No drivers within radius: expand radius progressively (3 → 5 → 8 km); if still nothing after 30 seconds, return “no drivers available” to the rider.
Driver loses connectivity mid-trip: trip orchestrator detects stale ping (>30 s), pings the rider for confirmation, and either continues with a partially-tracked route or marks as completed at last-known location.
Multiple concurrent requests target the same driver: matching service uses Redis SETNX driver:{id}:lock to claim a driver atomically. Losing requests fall back to next-best driver.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: location write throughput on the geo index. 1.25 M writes/sec into Redis is achievable only by sharding by city (or H3 hex) and accepting that all in-city writes hit one shard. Hot cities (NYC, SF, Mumbai) get sub-sharded by neighborhood-level H3 cells.

Bottleneck #2: matching latency at high demand. During surge, the matching service is the bottleneck — a city with 50 K pending requests/min and bipartite optimization runs into solver budget limits. Real systems degrade to greedy matching during extreme spikes and accept a quality dip.

Bottleneck #3: payment processor throughput. Even at 25 M trips/day, the payment processor can throttle. Async settlement with retry-with-backoff and a durable outbox prevents the trip path from blocking on payment.

Alternative I’d push back on: storing all historical location pings forever for the live geo index. A ping older than the current one is useless for matching; archive to OLAP for analytics, not the live store. Mixing “where are drivers now” and “where has driver X been” in the same index turns hot writes into hot writes against cold data.

What breaks first at 10× scale (250 M trips/day): the trip orchestrator’s per-trip state. A naive design (one row + state column) becomes a hot-write hellscape; the workflow engine pattern (per-trip durable timer + event log) scales horizontally. Plan for that re-architecture early.

Companies this resembles#

Uber, Lyft, Ola, Didi, Grab, Bolt. Cousins: Lime / Bird (geo-indexed vehicles without a driver dispatch), food delivery (see /system-design/ride-hailing-eats).

Google Maps — the routing and ETA engine Uber depends on (or builds in-house equivalents of).
Food Delivery (Uber Eats) — three-sided variant of the same dispatch problem.
Proximity Service (Yelp) — same geo-index, read-only queries.
Payment System — substrate for the fare collection path.