Payment System — System Design · Engineering Playbook

Step 1 — Clarify Requirements

Functional

Charge a customer’s payment method on behalf of a merchant.
Support multiple payment methods (cards, ACH, wallets, alternative methods per country).
Refund, partial refund, dispute / chargeback handling.
Idempotent API — duplicate requests never double-charge.
Reconciliation against bank / processor statements.
Out of scope here: subscription billing, invoicing UI, fraud-model training pipeline.

Non-functional

99.99% availability for the charge path; downtime directly maps to lost merchant revenue.
p99 charge latency under 1 second (including external processor call).
Strict consistency on money — no charge can be lost, double-applied, or rounded incorrectly.
30 K charges/sec peak (large-platform scale: a Black Friday at Stripe or PayPal).
7-year financial audit trail; immutable history.

Step 2 — Capacity Estimation

Charges/day: 30 K/sec peak × ~20% average-to-peak ratio ≈ 6 K/sec average; × 86 400 sec/day ≈ 518 M, rounded to ~500 M charges/day at top scale (Stripe-class).
Storage per charge: ~5 KB of structured data (charge record + envelope + idempotency key + ledger entries). 500 M × 5 KB = 2.5 TB/day; ~0.9 PB/year, so ~4.6 PB over 5 years and ~6.4 PB over the 7-year audit window, before replication.
External calls: each charge is ~2-3 round-trips to card networks / acquirers / 3DS providers. Network is the latency dominator.
Refund rate: ~2% of charges. ~10 M refunds/day at top scale.
Disputes / chargebacks: ~0.1% of charges. The dispute lifecycle takes weeks; storage matters more than throughput.
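
The estimates above are easy to re-derive. The script below is a minimal sketch that works the arithmetic through; decimal units (1 TB = 10^12 bytes), a 365-day year, and the constant names are assumptions made for illustration.

```python
# Back-of-envelope capacity numbers. Every input is an estimate from the list above.
PEAK_CHARGES_PER_SEC = 30_000
AVG_TO_PEAK_RATIO = 0.20            # average load as a fraction of peak
SECONDS_PER_DAY = 86_400
BYTES_PER_CHARGE = 5_000            # ~5 KB, decimal units (assumption)
REFUND_RATE = 0.02
DISPUTE_RATE = 0.001

# Unrounded daily volume, then the rounded figure the text works with.
charges_per_day = PEAK_CHARGES_PER_SEC * AVG_TO_PEAK_RATIO * SECONDS_PER_DAY
ROUNDED_CHARGES_PER_DAY = 500_000_000

storage_per_day_tb = ROUNDED_CHARGES_PER_DAY * BYTES_PER_CHARGE / 1e12
storage_5y_pb = storage_per_day_tb * 365 * 5 / 1000
storage_7y_pb = storage_per_day_tb * 365 * 7 / 1000

refunds_per_day = ROUNDED_CHARGES_PER_DAY * REFUND_RATE
disputes_per_day = ROUNDED_CHARGES_PER_DAY * DISPUTE_RATE

print(f"charges/day (unrounded): {charges_per_day:,.0f}")
print(f"storage/day: {storage_per_day_tb} TB")
print(f"storage over 5 years: {storage_5y_pb:.2f} PB, over 7 years: {storage_7y_pb:.2f} PB")
print(f"refunds/day: {refunds_per_day:,.0f}, disputes/day: {disputes_per_day:,.0f}")
```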

Neither the QPS nor the storage is exotic for a platform of this size; what sets the system apart is its correctness requirements. Every line of code has to survive the question “could this cause a money error?”.

Step 3 — System Interface

POST /charges
  Idempotency-Key: <client-provided UUID>
  Body: {
    amount: int (smallest unit, e.g., cents),
    currency: string (ISO 4217),
    payment_method: { type, ... },
    customer_id: string,
    metadata: { ... }
  }
  → 200 { id, status: 'succeeded'|'pending'|'requires_action'|'failed', ... }

POST /refunds
  Idempotency-Key: <UUID>
  Body: { charge_id, amount }

GET /charges/:id
POST /charges/:id/capture          (for two-phase auth + capture)
POST /charges/:id/cancel           (void an uncaptured authorization)

POST /webhooks/:network            (incoming events from processors)

The Idempotency-Key is the single most important header in the entire API. A client that retries after a flaky network call resends the same key, and the server uses it to guarantee the retry cannot double-charge.
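
On the client side, the rule is to generate the key once per logical charge and reuse it on every retry. The sketch below illustrates that with a hypothetical in-memory `FlakyTransport` standing in for an HTTP client and server; the class, its method names, and the `ch_` id format are assumptions for illustration, not part of the API above.

```python
import uuid


class FlakyTransport:
    """Hypothetical stand-in for client + server: the server processes the
    first call, but the response is lost before it reaches the client."""

    def __init__(self):
        self.seen = {}   # server side: idempotency key -> stored response
        self.calls = 0

    def post_charge(self, idempotency_key, body):
        self.calls += 1
        # Server side: create the charge only if the key is new.
        if idempotency_key not in self.seen:
            charge_id = f"ch_{len(self.seen) + 1}"
            self.seen[idempotency_key] = {"id": charge_id, "status": "succeeded"}
        # Simulate the response to the first call being lost in transit.
        if self.calls == 1:
            raise TimeoutError("response lost in transit")
        return self.seen[idempotency_key]


def charge_with_retries(transport, body, max_attempts=3):
    key = str(uuid.uuid4())  # generated once, reused on every retry
    for _ in range(max_attempts):
        try:
            return transport.post_charge(key, body)
        except TimeoutError:
            continue  # safe to retry: same key, so no second charge
    raise RuntimeError("charge outcome unknown after retries")


transport = FlakyTransport()
result = charge_with_retries(transport, {"amount": 1000, "currency": "USD"})
```

Generating a fresh key inside the retry loop would defeat the mechanism: each retry would look like a new charge to the server.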

Step 4 — High-Level Design

  merchant API call
        │
        ▼
   API gateway ─→ idempotency check (Redis + DB)
        │            │
        │            ├─ key seen? return stored result
        │            └─ new key: claim & proceed
        ▼
   charge service ─→ fraud service (sync or async based on risk)
        │
        ▼
   processor router ─→ external network (Visa, MC, ACH, etc.)
        │
        ▼
   ledger writer (double-entry, transactional)
        │
        ▼
   event bus ─→ webhook delivery + reconciliation + analytics
                      │
                      ▼
            reconciliation engine (compares ledger vs network statements)

The ledger is the source of truth. Every other component (webhooks, reports, refunds) is derived from it.

Step 5 — Data Model

Double-entry ledger

Each money movement is two equal-and-opposite entries in two accounts. Within a transaction, the entries in each currency always sum to zero.

table accounts
  account_id    uuid     PK
  type          enum(merchant_balance, customer_payable, fees, reserve, ...)
  currency      string
  balance       int      // derived (cached); source of truth is the entries

table ledger_entries
  entry_id      uuid     PK
  txn_id        uuid             -- groups entries that compose one logical transaction
  account_id    uuid
  amount        int              -- signed; positive = credit
  currency      string
  ts            timestamp
  description   string
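  INVARIANT: SUM(amount) = 0 per (txn_id, currency)  -- spans rows, so enforced at write time (application code or a deferred trigger), not by a row-level CHECK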

A charge of $10 from customer to merchant (amounts shown in dollars for readability; stored as integer cents):

txn_id = T1
+10.00 → merchant_balance[merchant_A]
-10.00 → customer_payable[customer_X]

A platform fee of $0.30 within that same charge:

txn_id = T1
-0.30 → merchant_balance[merchant_A]
+0.30 → fees[platform]

All four entries share txn_id = T1; their sum is zero. The ledger is append-only — corrections happen via new entries (reversals), never UPDATE / DELETE.
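
The four entries and the zero-sum rule can be sketched in a few lines. This is a minimal in-memory illustration, assuming integer cents and account names written as plain strings; `Entry`, `charge_entries`, `is_balanced`, and `balance` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: entries are never mutated after creation
class Entry:
    txn_id: str
    account: str
    amount: int      # signed, smallest currency unit; positive = credit
    currency: str


def charge_entries(txn_id, merchant, customer, amount, fee, currency="USD"):
    """Build the four entries for one charge plus its platform fee."""
    return [
        Entry(txn_id, f"merchant_balance[{merchant}]", +amount, currency),
        Entry(txn_id, f"customer_payable[{customer}]", -amount, currency),
        Entry(txn_id, f"merchant_balance[{merchant}]", -fee, currency),
        Entry(txn_id, "fees[platform]", +fee, currency),
    ]


def is_balanced(entries):
    """True if every (txn_id, currency) group sums to zero."""
    totals = defaultdict(int)
    for e in entries:
        totals[(e.txn_id, e.currency)] += e.amount
    return all(total == 0 for total in totals.values())


def balance(entries, account):
    """Derived balance: the sum of all entries posted to one account."""
    return sum(e.amount for e in entries if e.account == account)


ledger = charge_entries("T1", "merchant_A", "customer_X", amount=1000, fee=30)
```

Here `balance(ledger, "merchant_balance[merchant_A]")` is 970 cents: the $10.00 charge minus the $0.30 fee.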

Charges

table charges
  charge_id          uuid       PK
  idempotency_key    string     UNIQUE (merchant_id, idempotency_key)   // scoped per merchant
  merchant_id        uuid
  customer_id        uuid
  amount             int
  refunded_amount    int        // running total of refunds; updated by the refund flow
  currency           string
  status             enum(...)
  network_txn_id     string     // returned by Visa/MC
  ledger_txn_id      uuid       // links to ledger entries
  created_at         timestamp
  metadata           json

Idempotency

table idempotency_records
  merchant_id     uuid       PK part 1  (key uniqueness is per-merchant, not global)
  key             string     PK part 2
  request_hash    bytes      // hash of the full request body
  response_body   bytes
  status          enum(in_progress, completed, failed)
  created_at      timestamp
  expires_at      timestamp  // typically 24 hours

The key is the binding between a client’s “I sent this charge request” and the server’s response.

Step 6 — Detailed Design

Idempotency in detail

POST /charges with Idempotency-Key: abc-123 from merchant M:
1. Look up idempotency_records[M, abc-123].
2. If record exists with status=completed:
     if request_hash matches, return the stored response;
     otherwise reject with a 4xx (key reused with a different request).
3. If record exists with status=in_progress:
     return 409 (in progress) or wait briefly for completion.
4. If no record:
     INSERT idempotency_records[M, abc-123] (in_progress, request_hash) -- atomic via unique constraint.
     if the INSERT hits a unique violation, a concurrent request won the claim: go back to step 1.
     proceed with charge.
     UPDATE idempotency_records[M, abc-123] (completed, response_body).

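The unique constraint at step 4 is what enforces “only one writer per key” in the face of concurrent retries from the same client. PostgreSQL and MySQL provide this through primary-key or unique-index inserts; DynamoDB provides it through a conditional put (attribute_not_exists on the key), since an unconditional put would overwrite the existing item.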
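
The claim-then-complete flow can be sketched against any store with a unique constraint. The example below uses Python's built-in sqlite3 as a stand-in, assuming keys are scoped per merchant; the column name `idem_key`, the `handle` function, and the 422 status for a mismatched request hash are illustrative assumptions rather than the document's specification.

```python
import hashlib
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE idempotency_records (
               merchant_id   TEXT NOT NULL,
               idem_key      TEXT NOT NULL,
               request_hash  TEXT NOT NULL,
               response_body TEXT,
               status        TEXT NOT NULL,
               PRIMARY KEY (merchant_id, idem_key))"""
    )
    return db


def handle(db, merchant_id, idem_key, body, do_charge):
    """Return (http_status, response). do_charge runs at most once per key."""
    request_hash = hashlib.sha256(body.encode()).hexdigest()
    try:
        with db:  # commits on success, rolls back if the insert fails
            db.execute(
                "INSERT INTO idempotency_records VALUES (?, ?, ?, NULL, 'in_progress')",
                (merchant_id, idem_key, request_hash),
            )
    except sqlite3.IntegrityError:
        # Someone already claimed this key: replay or reject.
        stored_hash, response, status = db.execute(
            "SELECT request_hash, response_body, status FROM idempotency_records "
            "WHERE merchant_id = ? AND idem_key = ?",
            (merchant_id, idem_key),
        ).fetchone()
        if stored_hash != request_hash:
            return 422, "key reused with a different request"  # assumed status code
        if status == "in_progress":
            return 409, "in progress"
        return 200, response

    response = do_charge(body)  # we own the key, so this runs exactly once
    with db:
        db.execute(
            "UPDATE idempotency_records SET status = 'completed', response_body = ? "
            "WHERE merchant_id = ? AND idem_key = ?",
            (response, merchant_id, idem_key),
        )
    return 200, response
```

The sketch skips the lookup in step 1 and goes straight to the insert, which is equivalent: the insert either claims the key or fails and falls through to the lookup. It does not handle a worker that crashes while the record is in_progress; a real system would need a lease or timeout on that state.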

Charge lifecycle

1. POST /charges arrives. Idempotency check passes (new request).
2. Fraud scoring (sub-50 ms). If clear, proceed. If risky, route to manual review queue.
3. Processor router picks the right acquirer for (card brand, currency, country).
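4. Send authorization request to the card network. Timeout: 30 seconds. A timeout means the outcome is unknown, not declined: keep the charge pending and resolve it by querying or reversing with the network.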
5. Network responds: approved (with auth_code) or declined (with reason).
6. If approved:
     write ledger entries in a transaction with charge update.
     emit `charge.succeeded` event.
7. If declined:
     update charge to status=failed, write event.
8. Return result + persist in idempotency_records.

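The “transaction” in lifecycle step 6 is the critical part. The ledger entries and the charge status must commit together. If they do not, a crash between the two writes leaves either a charge marked succeeded with no ledger entries behind it (the merchant is never credited) or ledger entries for a charge that still looks unpaid (money booked that nobody can account for). The database transaction cannot span the network call itself, so the remaining gaps, “the network approved but our records don’t show it” and “our records say approved but the network never charged”, are what reconciliation exists to catch.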
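
A minimal sketch of the commit-together rule, using Python's built-in sqlite3 as a stand-in for the real database. The simplified two-table schema and the `record_approval` function are assumptions for illustration.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE charges (charge_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.execute("CREATE TABLE ledger_entries (txn_id TEXT, account TEXT, amount INTEGER)")
    return db


def record_approval(db, charge_id, txn_id, entries):
    """Write ledger entries and flip the charge to succeeded in one transaction.

    entries is a list of (account, signed_amount) pairs that must sum to zero.
    """
    if sum(amount for _, amount in entries) != 0:
        raise ValueError("unbalanced transaction")
    with db:  # commit on normal exit, roll back everything on any exception
        db.executemany(
            "INSERT INTO ledger_entries VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in entries],
        )
        cur = db.execute(
            "UPDATE charges SET status = 'succeeded' "
            "WHERE charge_id = ? AND status = 'pending'",
            (charge_id,),
        )
        if cur.rowcount != 1:
            # Raising here rolls back the ledger inserts above as well.
            raise RuntimeError("charge is not in pending state")
```

Calling `record_approval` twice for the same charge raises on the second call and leaves the ledger unchanged, because the status guard fails inside the same transaction that wrote the entries.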

Two-phase: authorize then capture

Many flows want to authorize first (place a hold on the card), then capture later (actually move the money). Used by hotels, car rentals, marketplaces with order fulfillment.

POST /charges { capture: false }    → status: authorized (hold for ~7 days)
POST /charges/:id/capture           → status: succeeded (ledger entries written)
POST /charges/:id/cancel            → status: voided (no ledger entries needed)

Authorizations expire if not captured. The system tracks expiration and auto-voids stale auths.
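
The authorize/capture flow is a small state machine. The sketch below encodes the three transitions described above plus the expiry check; the `expire` action name, the `transition` and `is_stale` helpers, and the exact 7-day constant are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Status(Enum):
    AUTHORIZED = "authorized"
    SUCCEEDED = "succeeded"
    VOIDED = "voided"


# Only an authorized charge can move; succeeded and voided are terminal here.
ALLOWED = {
    (Status.AUTHORIZED, "capture"): Status.SUCCEEDED,
    (Status.AUTHORIZED, "cancel"): Status.VOIDED,
    (Status.AUTHORIZED, "expire"): Status.VOIDED,   # auto-void of a stale auth
}

AUTH_HOLD = timedelta(days=7)  # assumed hold length; the text says "~7 days"


def transition(status, action):
    """Return the next status, or raise if the action is not allowed."""
    try:
        return ALLOWED[(status, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a charge in state {status.value}") from None


def is_stale(authorized_at, now):
    """True once the hold window has elapsed and the auth should be voided."""
    return now - authorized_at >= AUTH_HOLD


authorized_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
```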

Reconciliation

The card network sends nightly settlement files: “here are the transactions that cleared today, with our totals”. We compare against our ledger:

for each network_record:
   find local charge by network_txn_id
   if no local charge found → critical (orphaned charge)
   if local says succeeded and amount matches → reconciled
   if local says succeeded but amount differs → flag for investigation
   if local says failed but network record exists → critical (money on the line)
for each local succeeded charge with no matching network_record:
   flag (could be settlement delay)

Discrepancies are the daily reconciliation work. Most are settlement-timing related; the rare critical ones (orphaned charges) trigger an investigation runbook.
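
The comparison can be sketched as two passes: one over the network file, and one over local charges to find those the file never mentions. The dict shapes, bucket names, and the `reconcile` function are assumptions for illustration.

```python
def reconcile(local_charges, network_records):
    """Compare the local view against one settlement file.

    local_charges:   {network_txn_id: (status, amount)}
    network_records: {network_txn_id: amount}
    """
    report = {
        "reconciled": [],
        "amount_mismatch": [],
        "missing_at_network": [],
        "critical": [],
    }

    # Pass 1: every record the network settled must match a local success.
    for txn_id, network_amount in network_records.items():
        local = local_charges.get(txn_id)
        if local is None or local[0] != "succeeded":
            # Orphaned or locally failed charge that the network settled.
            # Assumption: any non-succeeded local status is treated as
            # critical; the text only names "failed".
            report["critical"].append(txn_id)
        elif local[1] == network_amount:
            report["reconciled"].append(txn_id)
        else:
            report["amount_mismatch"].append(txn_id)

    # Pass 2: local successes the file does not mention (often settlement delay).
    for txn_id, (status, _) in local_charges.items():
        if status == "succeeded" and txn_id not in network_records:
            report["missing_at_network"].append(txn_id)

    return report


local = {
    "n1": ("succeeded", 1000),
    "n2": ("succeeded", 500),
    "n3": ("succeeded", 700),
    "n4": ("failed", 300),
}
network = {"n1": 1000, "n2": 450, "n4": 300, "n5": 900}
report = reconcile(local, network)
```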

Refunds

POST /refunds { charge_id, amount }:
1. Idempotency check.
2. Verify the charge is refundable (status, time window, available balance, amount ≤ charge amount minus prior refunds).
3. Send refund request to processor.
4. On success: write reverse ledger entries (mirror image of the original charge).
   txn_id = R1 referencing original T1
   -10.00 → merchant_balance[merchant_A]
   +10.00 → customer_payable[customer_X]
5. Update charge.refunded_amount; update the charge status if it is now fully refunded.

Refunds use the same double-entry pattern; the ledger remains balanced.
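
A sketch of the amount check and the reversal entries for a full or partial refund, assuming integer cents. The `refund_entries` function and the `refunded` / `partially_refunded` status names are hypothetical.

```python
def refund_entries(refund_txn_id, merchant, customer, amount,
                   charge_amount, already_refunded):
    """Validate a refund and build its ledger entries.

    Returns (entries, new_refunded_total, new_status). Amounts are integer cents.
    """
    if amount <= 0:
        raise ValueError("refund amount must be positive")
    if already_refunded + amount > charge_amount:
        raise ValueError("refund exceeds the remaining refundable amount")

    # Mirror image of the original charge legs, scaled to the refund amount.
    entries = [
        (refund_txn_id, f"merchant_balance[{merchant}]", -amount),
        (refund_txn_id, f"customer_payable[{customer}]", +amount),
    ]
    new_refunded = already_refunded + amount
    status = "refunded" if new_refunded == charge_amount else "partially_refunded"
    return entries, new_refunded, status


# Two partial refunds against a $10.00 charge.
first = refund_entries("R1", "merchant_A", "customer_X", 400, 1000, 0)
second = refund_entries("R2", "merchant_A", "customer_X", 600, 1000, first[1])
```

A third refund of any amount would be rejected, since the two partial refunds already add up to the original charge.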

Disputes / chargebacks

When a customer disputes with their bank, the bank sends a “chargeback” message. The merchant has a window to respond with evidence. The system:

Marks the charge as disputed (does not yet move money).
On loss: writes ledger entries reversing the original charge plus a fee.
On win: no money moves.

This part of the system is procedurally complex (network-specific rules) but mechanically simple in the ledger model.

Multi-currency

Each entry has a currency; accounts are single-currency. Cross-currency transactions go through an explicit FX conversion step that writes both legs:

customer pays $10 USD → merchant settles in €:
   -10.00 USD → customer_payable[customer_X]
   +10.00 USD → fx_holding[platform]
   -9.10  EUR → fx_holding[platform]
   +9.10  EUR → merchant_balance[merchant_A]
   (rate snapshot kept for audit)

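The platform absorbs the FX risk between the rate snapshot and the moment the conversion actually executes. When the conversion runs at charge time that window is usually under a minute, so the exposure is very small; if conversion waits for settlement, the window grows to match.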
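
The four FX legs can be generated from the source amount and a rate snapshot. This sketch assumes both currencies use two-decimal minor units (true for USD and EUR) and rounds half-to-even; the `fx_entries` function, the rounding mode, and the 0.91 rate are illustrative assumptions.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def fx_entries(txn_id, customer, merchant, amount_minor, src, dst, rate):
    """Build the four legs of a cross-currency charge.

    amount_minor is in minor units of src; rate is dst per 1 src as a Decimal.
    Assumes src and dst have the same number of minor-unit decimals.
    """
    converted = int(
        (Decimal(amount_minor) * rate).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    )
    return [
        (txn_id, f"customer_payable[{customer}]", -amount_minor, src),
        (txn_id, "fx_holding[platform]", +amount_minor, src),
        (txn_id, "fx_holding[platform]", -converted, dst),
        (txn_id, f"merchant_balance[{merchant}]", +converted, dst),
    ]


def totals_by_currency(entries):
    """Sum the signed amounts per currency; each total should be zero."""
    totals = {}
    for _, _, amount, currency in entries:
        totals[currency] = totals.get(currency, 0) + amount
    return totals


legs = fx_entries("T2", "customer_X", "merchant_A", 1000, "USD", "EUR", Decimal("0.91"))
```

Using `Decimal` for the rate keeps the conversion exact until the single, explicit rounding step; binary floats would introduce rounding at every multiplication.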

Webhook delivery

Every state change emits an event. Webhooks deliver these to merchant endpoints:

At-least-once delivery with exponential backoff.
Signed payloads (HMAC) so the merchant can verify authenticity.
Idempotency on the merchant side (we send event_id; if they’ve seen it before, they no-op).

A merchant endpoint that’s down accumulates a backlog; replay-on-recovery is the standard pattern.
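
The three delivery properties can be sketched with the standard library. HMAC-SHA256 over the raw payload, the retry schedule parameters, and the `MerchantReceiver` class are assumptions for illustration; the text only specifies HMAC signing, exponential backoff, and event_id deduplication.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 signature of the raw payload, hex encoded."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sign(secret, payload), signature)


def backoff_schedule(base_seconds=1, factor=2, max_seconds=3600, attempts=8):
    """Delay before each retry: base * factor^n, capped at max_seconds."""
    return [min(base_seconds * factor ** n, max_seconds) for n in range(attempts)]


class MerchantReceiver:
    """Merchant-side handler that tolerates at-least-once delivery."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen = set()
        self.processed = []

    def receive(self, payload: bytes, signature: str) -> int:
        if not verify(self.secret, payload, signature):
            return 401
        event = json.loads(payload)
        if event["event_id"] in self.seen:
            return 200  # duplicate delivery: acknowledge and do nothing
        self.seen.add(event["event_id"])
        self.processed.append(event["type"])
        return 200


secret = b"whsec_example"
payload = json.dumps({"event_id": "evt_1", "type": "charge.succeeded"}).encode()
receiver = MerchantReceiver(secret)
```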

Step 7 — Evaluation & Trade-offs

Bottleneck #1: external network latency. The card network’s own response time (300-800 ms p99) dominates the charge latency. Nothing we do on our side closes that gap. Mitigation: parallelize the fraud check with the auth request when the merchant accepts the risk profile.

Bottleneck #2: ledger contention on a hot merchant. A platform with one merchant doing 1,000 charges/sec is writing 4,000+ ledger entries/sec, and half of them touch that merchant’s account, so a naive design updates the same cached balance row on every charge. The balance is derived, not authoritative: compute it lazily from entry sums or maintain a sharded cache (see /system-design/sharded-counters). Never lock the merchant’s balance row on every write.
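
One way to picture the sharded cache: split the cached balance into N slots, send each write to a random slot, and sum the slots on read. The in-memory class below is only a sketch of that idea; in a real store each slot would be its own row, and `ShardedBalance` and its shard count are hypothetical.

```python
import random


class ShardedBalance:
    """Cached balance split across slots so concurrent writers rarely collide."""

    def __init__(self, shards=16):
        self.shards = [0] * shards

    def add(self, amount):
        # Each write touches one randomly chosen slot instead of a single hot row.
        self.shards[random.randrange(len(self.shards))] += amount

    def total(self):
        # Reads pay the cost: they sum every slot.
        return sum(self.shards)


cache = ShardedBalance(shards=16)
for _ in range(1000):
    cache.add(970)   # net merchant credit for a $10.00 charge with a $0.30 fee
```

The trade-off is deliberate: writes are frequent and latency-sensitive, while balance reads are rarer and can afford to aggregate.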

Bottleneck #3: reconciliation throughput. Nightly files for 500 M charges are 100 GB+ each. Process them in parallel by merchant; the reconciliation join is the slow part. Many real systems are 12-24 hours behind on reconciliation, which is fine — it’s a check, not a live consistency requirement.

Alternative I’d push back on: writing balances as authoritative columns updated by triggers. Looks convenient, kills performance, and creates inconsistency bugs. Balances are a view over the ledger, not a source of truth.

What breaks first at 10× scale (300 K charges/sec): the idempotency record table. Per-key writes at high QPS need careful sharding by merchant_id × key_hash. Also: the network connection pool to processors — opening too many connections gets you rate-limited at the network level; pool sharing across customers becomes the next bottleneck.
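
Sharding by merchant_id × key_hash can be sketched as a stable hash over both values. The `shard_for` function, the use of SHA-256, and the shard count of 64 are assumptions for illustration.

```python
import hashlib


def shard_for(merchant_id: str, idempotency_key: str, num_shards: int = 64) -> int:
    """Map (merchant_id, key) to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(f"{merchant_id}:{idempotency_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# A single hot merchant's keys spread across shards instead of piling onto one.
shards_used = {shard_for("merchant_A", f"key-{i}") for i in range(1000)}
```

Hashing the key together with the merchant id is what keeps one very active merchant from overloading a single shard; hashing on merchant_id alone would not. Python's built-in `hash()` is avoided because it is randomized per process for strings.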

Companies this resembles

Stripe, Adyen, PayPal, Square, Braintree. In-house equivalents at Amazon (payment platform), Uber (rider payments), Airbnb (host payouts). Cousins: ledger-only systems for fintech (Modern Treasury, Increase) without the network integration.

Sequencer — for monotonic txn_id generation across regions.
Distributed Task Scheduler — webhook retry queue, reconciliation jobs.
Server-Side Error Monitoring — every failed charge gets a per-merchant alert.
Uber — depends on this design as the payment substrate for rides.