Payment System
Idempotency, double-entry ledgers, reconciliation, gateway integration, fraud signals.
Step 1 — Clarify Requirements
Functional
- Charge a customer’s payment method on behalf of a merchant.
- Support multiple payment methods (cards, ACH, wallets, alternative methods per country).
- Refund, partial refund, dispute / chargeback handling.
- Idempotent API — duplicate requests never double-charge.
- Reconciliation against bank / processor statements.
- Out of scope here: subscription billing, invoicing UI, fraud-model training pipeline.
Non-functional
- 99.99% availability for the charge path; downtime directly maps to lost merchant revenue.
- p99 charge latency under 1 second (including external processor call).
- Strict consistency on money — no charge can be lost, double-applied, or rounded incorrectly.
- 30 K charges/sec peak (large-platform scale: a Black Friday at Stripe or PayPal).
- 7-year financial audit trail; immutable history.
Step 2 — Capacity Estimation
- Charges/day: 30 K/sec peak × ~20% average-to-peak ratio × 86 400 sec/day ≈ 500 M charges/day at top scale (Stripe-class).
- Storage per charge: ~5 KB of structured data (charge record + envelope + idempotency key + ledger entries). 500 M × 5 KB = 2.5 TB/day, or roughly 0.9 PB/year: about 4.5 PB over 5 years and about 6.4 PB across the 7-year audit window.
- External calls: each charge is ~2-3 round-trips to card networks / acquirers / 3DS providers. Network is the latency dominator.
- Refund rate: ~2% of charges. ~10 M refunds/day at top scale.
- Disputes / chargebacks: ~0.1% of charges. The dispute lifecycle takes weeks; storage matters more than throughput.
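The arithmetic in these estimates can be reproduced with a few lines of Python. This is a rough sketch: the constant names are illustrative, KB/TB/PB are taken as decimal units, and the rounded 500 M charges/day figure is carried through the downstream numbers.

```python
# Inputs taken from the estimates above (names are illustrative).
PEAK_CHARGES_PER_SEC = 30_000
AVG_TO_PEAK_RATIO = 0.20      # average load as a fraction of peak
SECONDS_PER_DAY = 86_400
BYTES_PER_CHARGE = 5_000      # ~5 KB, decimal units
REFUND_RATE = 0.02
DISPUTE_RATE = 0.001


def estimate(charges_per_day: float) -> dict:
    """Derive daily and yearly volumes from a charges/day figure."""
    bytes_per_day = charges_per_day * BYTES_PER_CHARGE
    return {
        "storage_tb_per_day": bytes_per_day / 1e12,
        "storage_pb_per_year": bytes_per_day * 365 / 1e15,
        "refunds_per_day": charges_per_day * REFUND_RATE,
        "disputes_per_day": charges_per_day * DISPUTE_RATE,
    }


# Unrounded daily volume, before it is rounded to 500 M in the text.
exact_charges_per_day = PEAK_CHARGES_PER_SEC * AVG_TO_PEAK_RATIO * SECONDS_PER_DAY
rounded = estimate(500_000_000)

print(f"charges/day (unrounded): {exact_charges_per_day:,.0f}")
for name, value in rounded.items():
    print(f"{name}: {value:,.4g}")
```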
The system isn’t large in QPS or storage; it’s enormous in correctness requirements. Every line of code has to survive the question “could this cause a money error?”
Step 3 — System Interface
    POST /charges
      Idempotency-Key: <client-provided UUID>
      Body: {
        amount: int (smallest unit, e.g., cents),
        currency: string (ISO 4217),
        payment_method: { type, ... },
        customer_id: string,
        metadata: { ... }
      }
      → 200 { id, status: 'succeeded'|'pending'|'requires_action'|'failed', ... }

    POST /refunds
      Idempotency-Key: <UUID>
      Body: { charge_id, amount }

    GET /charges/:id
    POST /charges/:id/capture   (for two-phase auth + capture)
    POST /webhooks/:network     (incoming events from processors)

The Idempotency-Key is the single most important header in the entire API. It’s how the client (often retrying after a flaky network) guarantees not to double-charge.
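On the client side, two details of this interface are easy to get wrong: converting a display amount to integer minor units without floating-point rounding, and reusing the same Idempotency-Key across retries. A minimal sketch follows; `to_minor_units` and `build_charge_request` are hypothetical helpers, and the exponent table is a small illustrative subset of ISO 4217.

```python
import uuid
from decimal import Decimal

# Minor-unit exponents for a few ISO 4217 currencies (illustrative subset).
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as "10.00" to integer minor units."""
    scaled = Decimal(amount).scaleb(MINOR_UNIT_EXPONENT[currency])
    if scaled != scaled.to_integral_value():
        # Refuse to round silently: "10.005" USD has no exact cent value.
        raise ValueError(f"{amount} {currency} is not a whole number of minor units")
    if scaled <= 0:
        raise ValueError("amount must be positive")
    return int(scaled)


def build_charge_request(amount, currency, customer_id, idempotency_key=None):
    """Return (headers, body) for POST /charges.

    The caller should generate the key once and pass it again on every retry.
    """
    headers = {"Idempotency-Key": idempotency_key or str(uuid.uuid4())}
    body = {
        "amount": to_minor_units(amount, currency),
        "currency": currency,
        "payment_method": {"type": "card"},
        "customer_id": customer_id,
        "metadata": {},
    }
    return headers, body


headers, body = build_charge_request("10.00", "USD", "cus_123")
retry_headers, retry_body = build_charge_request(
    "10.00", "USD", "cus_123", idempotency_key=headers["Idempotency-Key"]
)
```

Because the retry carries the same key and an identical body, the server can recognize it as the same logical request.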
Step 4 — High-Level Design
    merchant API call
          │
          ▼
    API gateway ─→ idempotency check (Redis + DB)
          │            │
          │            ├─ key seen? return stored result
          │            └─ new key: claim & proceed
          ▼
    charge service ─→ fraud service (sync or async based on risk)
          │
          ▼
    processor router ─→ external network (Visa, MC, ACH, etc.)
          │
          ▼
    ledger writer (double-entry, transactional)
          │
          ▼
    event bus ─→ webhook delivery + reconciliation + analytics
          │
          ▼
    reconciliation engine (compares ledger vs network statements)

The ledger is the source of truth. Every other component (webhooks, reports, refunds) is derived from it.
Step 5 — Data Model
Double-entry ledger
Each money movement is two equal-and-opposite entries in two accounts. Total of all entries is always zero.
    table accounts
      account_id   uuid PK
      type         enum(merchant_balance, customer_payable, fees, reserve, ...)
      currency     string
      balance      int         // derived (cached); source of truth is the entries

    table ledger_entries
      entry_id     uuid PK
      txn_id       uuid        -- groups entries that compose one logical transaction
      account_id   uuid
      amount       int         -- signed; positive = credit
      currency     string
      ts           timestamp
      description  string
      CHECK: SUM(amount) OVER (PARTITION BY txn_id) = 0

A charge of $10 from customer to merchant:
    txn_id = T1
    +10.00 → merchant_balance[merchant_A]
    -10.00 → customer_payable[customer_X]

A platform fee of $0.30 within that same charge:
    txn_id = T1
    -0.30 → merchant_balance[merchant_A]
    +0.30 → fees[platform]

All four entries share txn_id = T1; their sum is zero. The ledger is append-only: corrections happen via new entries (reversals), never UPDATE / DELETE.
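The balanced-transaction rule can be sketched as an in-memory, append-only ledger. This is an illustration only: `Entry` and `Ledger` are hypothetical names, amounts are integer minor units, and a real system would enforce the same invariant inside a database transaction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount: int      # signed minor units; positive = credit
    currency: str


class Ledger:
    """Append-only list of entries; a transaction is accepted only if balanced."""

    def __init__(self):
        self._entries = []

    def post(self, entries):
        entries = list(entries)
        if not entries or len({e.txn_id for e in entries}) != 1:
            raise ValueError("post exactly one transaction at a time")
        totals = defaultdict(int)
        for e in entries:
            totals[e.currency] += e.amount
        if any(totals.values()):
            raise ValueError(f"unbalanced transaction: {dict(totals)}")
        self._entries.extend(entries)   # never updated or deleted afterwards

    def balance(self, account: str) -> int:
        """Derived balance: the sum of all entries for the account."""
        return sum(e.amount for e in self._entries if e.account == account)


ledger = Ledger()
ledger.post([
    Entry("T1", "merchant_balance[merchant_A]", +1000, "USD"),
    Entry("T1", "customer_payable[customer_X]", -1000, "USD"),
    Entry("T1", "merchant_balance[merchant_A]", -30, "USD"),
    Entry("T1", "fees[platform]", +30, "USD"),
])
```

After this transaction the merchant's derived balance is the $10.00 charge minus the $0.30 fee.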
Charges
    table charges
      charge_id        uuid PK
      idempotency_key  string      // UNIQUE per merchant: (merchant_id, idempotency_key)
      merchant_id      uuid
      customer_id      uuid
      amount           int
      currency         string
      status           enum(...)
      network_txn_id   string      // returned by Visa/MC
      ledger_txn_id    uuid        // links to ledger entries
      created_at       timestamp
      metadata         json

Idempotency
    table idempotency_records
      key            string
      merchant_id    uuid        // PK is (merchant_id, key): key uniqueness is per-merchant
      request_hash   bytes       // hash of the full request body
      response_body  bytes
      status         enum(in_progress, completed, failed)
      created_at     timestamp
      expires_at     timestamp   // typically 24 hours

The key is the binding between a client’s “I sent this charge request” and the server’s response.
Step 6 — Detailed Design
Idempotency in detail
    POST /charges with Idempotency-Key: abc-123:
    1. Look up idempotency_records[abc-123].
    2. If record exists with status=completed: verify request_hash matches, then return the stored response. If the hash differs, reject the request: the key was reused for a different payload.
    3. If record exists with status=in_progress: return 409 (in progress) or wait briefly for completion.
    4. If no record:
         INSERT idempotency_records[abc-123] (in_progress, request_hash)   -- atomic via unique constraint
         proceed with charge
         UPDATE idempotency_records[abc-123] (completed, response_body)

The unique constraint at step 4 is what enforces “only one writer per key” in the face of concurrent retries from the same client: if two requests both miss the lookup in step 1, only one INSERT succeeds, and the loser falls back to steps 2-3. PostgreSQL and MySQL provide this through primary-key insert semantics; DynamoDB needs a conditional put (attribute_not_exists on the key), because a plain PutItem overwrites an existing item.
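The claim-then-complete flow can be sketched with the standard-library `sqlite3` module standing in for the production database. Assumptions in this sketch: the `idem_key` column name, the `handle_charge` helper, the 422 status for a reused key with a different body, and the omission of expiry and failure handling are all illustrative choices, not part of the design above.

```python
import hashlib
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
    db.execute(
        """CREATE TABLE idempotency_records (
               merchant_id   TEXT NOT NULL,
               idem_key      TEXT NOT NULL,
               request_hash  TEXT NOT NULL,
               status        TEXT NOT NULL,
               response_body TEXT,
               PRIMARY KEY (merchant_id, idem_key))"""
    )
    return db


def request_hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle_charge(db, merchant_id, idem_key, body, do_charge):
    """Run do_charge(body) at most once per (merchant_id, idem_key)."""
    h = request_hash(body)
    try:
        # The primary key makes this INSERT the atomic "claim" of the key.
        db.execute(
            "INSERT INTO idempotency_records (merchant_id, idem_key, request_hash, status)"
            " VALUES (?, ?, ?, 'in_progress')",
            (merchant_id, idem_key, h),
        )
    except sqlite3.IntegrityError:
        stored_hash, status, response = db.execute(
            "SELECT request_hash, status, response_body FROM idempotency_records"
            " WHERE merchant_id = ? AND idem_key = ?",
            (merchant_id, idem_key),
        ).fetchone()
        if stored_hash != h:
            return 422, {"error": "idempotency key reused with a different request"}
        if status == "in_progress":
            return 409, {"error": "request in progress"}
        return 200, json.loads(response)

    response = do_charge(body)
    db.execute(
        "UPDATE idempotency_records SET status = 'completed', response_body = ?"
        " WHERE merchant_id = ? AND idem_key = ?",
        (json.dumps(response), merchant_id, idem_key),
    )
    return 200, response
```

The sketch goes straight to the INSERT instead of looking up first; the constraint violation then doubles as the "key seen" signal, which removes the race between lookup and insert.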
Charge lifecycle
    1. POST /charges arrives. Idempotency check passes (new request).
    2. Fraud scoring (sub-50 ms). If clear, proceed. If risky, route to manual review queue.
    3. Processor router picks the right acquirer for (card brand, currency, country).
    4. Send authorization request to the card network. Timeout: 30 seconds.
    5. Network responds: approved (with auth_code) or declined (with reason).
    6. If approved:
         write ledger entries in a transaction with charge update.
         emit `charge.succeeded` event.
    7. If declined: update charge to status=failed, write event.
    8. Return result + persist in idempotency_records.

Step 6’s “transaction” is the critical part. The ledger entries and the charge status must commit together; otherwise we could have “the network approved but our records don’t show it” (lost money) or “our records say approved but the network never charged” (free chargeback).
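The "commit together" requirement of step 6 can be illustrated with `sqlite3`, whose connection context manager commits on success and rolls back on an exception. This is a sketch under simplified assumptions: the two-table schema is reduced to the columns needed, and `record_approval` is a hypothetical helper.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE charges (
            charge_id TEXT PRIMARY KEY, status TEXT NOT NULL, ledger_txn_id TEXT);
        CREATE TABLE ledger_entries (
            txn_id TEXT NOT NULL, account TEXT NOT NULL, amount INTEGER NOT NULL);
        """
    )
    return db


def record_approval(db, charge_id, txn_id, entries):
    """Write ledger entries and flip the charge to succeeded atomically.

    entries is a list of (account, signed_amount) pairs that must sum to zero.
    """
    if sum(amount for _, amount in entries) != 0:
        raise ValueError("unbalanced ledger entries")
    with db:  # one transaction: both writes commit, or neither does
        db.executemany(
            "INSERT INTO ledger_entries (txn_id, account, amount) VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in entries],
        )
        cur = db.execute(
            "UPDATE charges SET status = 'succeeded', ledger_txn_id = ?"
            " WHERE charge_id = ? AND status = 'pending'",
            (txn_id, charge_id),
        )
        if cur.rowcount != 1:
            # Raising inside the block rolls back the ledger inserts as well.
            raise RuntimeError(f"charge {charge_id} is not pending")


db = open_db()
db.execute("INSERT INTO charges VALUES ('ch_1', 'pending', NULL)")
db.commit()
entries = [("merchant_balance[merchant_A]", 1000), ("customer_payable[customer_X]", -1000)]
record_approval(db, "ch_1", "T1", entries)
```

Calling `record_approval` for a charge that is not pending leaves the ledger untouched, which is the property the lifecycle depends on.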
Two-phase: authorize then capture
Many flows want to authorize first (place a hold on the card), then capture later (actually move the money). Used by hotels, car rentals, marketplaces with order fulfillment.
    POST /charges { capture: false }   → status: authorized (hold for ~7 days)
    POST /charges/:id/capture          → status: succeeded (ledger entries written)
    POST /charges/:id/cancel           → status: voided (no ledger entries needed)

Authorizations expire if not captured. The system tracks expiration and auto-voids stale auths.
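The authorize/capture/void flow is a small state machine plus a sweeper for stale holds. A minimal sketch, assuming a fixed 7-day hold and an in-memory dict of charges; `transition`, `void_stale_auths`, and the "expire" action name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

AUTH_HOLD = timedelta(days=7)  # assumed hold window; actual limits vary by network

# (current status, action) -> next status
TRANSITIONS = {
    ("authorized", "capture"): "succeeded",
    ("authorized", "cancel"): "voided",
    ("authorized", "expire"): "voided",
}


def transition(status: str, action: str) -> str:
    """Return the next status, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a charge in status {status!r}") from None


def void_stale_auths(charges: dict, now: datetime) -> list:
    """Void authorizations older than AUTH_HOLD; returns the voided charge ids.

    charges maps charge_id -> (status, authorized_at).
    """
    voided = []
    for charge_id, (status, authorized_at) in charges.items():
        if status == "authorized" and now - authorized_at >= AUTH_HOLD:
            charges[charge_id] = (transition(status, "expire"), authorized_at)
            voided.append(charge_id)
    return voided


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
charges = {
    "old": ("authorized", now - timedelta(days=8)),
    "fresh": ("authorized", now - timedelta(days=1)),
    "done": ("succeeded", now - timedelta(days=30)),
}
voided = void_stale_auths(charges, now)
```

Only the capture transition would trigger ledger writes; voids and expirations change status alone.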
Reconciliation
The card network sends nightly settlement files: “here are the transactions that cleared today, with our totals”. We compare against our ledger:
    for each network_record:
      find local charge by network_txn_id
      if local says succeeded and amount matches → reconciled
      if local says succeeded but amount differs → flag for investigation
      if local says failed but network record exists → critical (money on the line)
    for each local charge with status succeeded in the settlement window:
      if no network record → flag (could be settlement delay)

Discrepancies are the daily reconciliation work. Most are settlement-timing related; the rare critical ones (orphaned charges) trigger an investigation runbook.
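The comparison can be sketched as a function over two dictionaries keyed by network_txn_id. Assumptions: `reconcile` and the finding labels are hypothetical names, and a network record with no local charge at all is treated as critical alongside the "local says failed" case.

```python
def reconcile(local_charges: dict, network_records: dict) -> dict:
    """Classify every transaction seen on either side.

    local_charges:   network_txn_id -> (status, amount)
    network_records: network_txn_id -> settled amount
    """
    findings = {}

    # Pass 1: everything the network says it settled.
    for txn_id, net_amount in network_records.items():
        local = local_charges.get(txn_id)
        if local is None or local[0] != "succeeded":
            findings[txn_id] = "critical"            # money moved that we do not show
        elif local[1] == net_amount:
            findings[txn_id] = "reconciled"
        else:
            findings[txn_id] = "amount_mismatch"

    # Pass 2: local successes the network has not reported (yet).
    for txn_id, (status, _amount) in local_charges.items():
        if status == "succeeded" and txn_id not in network_records:
            findings[txn_id] = "missing_from_network"

    return findings


local = {
    "n1": ("succeeded", 1000),
    "n2": ("succeeded", 2500),
    "n3": ("succeeded", 700),
    "n4": ("failed", 900),
}
network = {"n1": 1000, "n2": 2400, "n4": 900, "n5": 50}
findings = reconcile(local, network)
```

The second pass is needed because a charge missing from the settlement file can never be found by iterating over that file.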
Refunds
    POST /refunds { charge_id, amount }:
    1. Idempotency check.
    2. Verify the charge is refundable (status, time window, available balance).
    3. Send refund request to processor.
    4. On success: write reverse ledger entries (mirror image of the original charge).
         txn_id = R1 referencing original T1
         -10.00 → merchant_balance[merchant_A]
         +10.00 → customer_payable[customer_X]
    5. Update charge.refunded_amount; status if fully refunded.

Refunds use the same double-entry pattern; the ledger remains balanced.
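Steps 2, 4, and 5 can be sketched as a pure function that validates the refundable amount and returns the mirror-image entries. Assumptions: `plan_refund` is a hypothetical helper, amounts are integer minor units, and the status names "partially_refunded" and "refunded" are illustrative.

```python
def plan_refund(charge_amount, refunded_so_far, refund_amount,
                merchant, customer, refund_txn_id):
    """Return (ledger_entries, new_refunded_amount, new_status) for a refund.

    Entries are (txn_id, account, signed_amount) tuples that mirror the
    original charge, scaled to the refund amount.
    """
    if refund_amount <= 0:
        raise ValueError("refund amount must be positive")
    if refunded_so_far + refund_amount > charge_amount:
        raise ValueError("refund exceeds the remaining refundable amount")

    entries = [
        (refund_txn_id, f"merchant_balance[{merchant}]", -refund_amount),
        (refund_txn_id, f"customer_payable[{customer}]", +refund_amount),
    ]
    new_refunded = refunded_so_far + refund_amount
    status = "refunded" if new_refunded == charge_amount else "partially_refunded"
    return entries, new_refunded, status


# Partial refund of $4.00 on a $10.00 charge, then the remaining $6.00.
entries1, refunded1, status1 = plan_refund(1000, 0, 400, "merchant_A", "customer_X", "R1")
entries2, refunded2, status2 = plan_refund(1000, refunded1, 600, "merchant_A", "customer_X", "R2")
```

Tracking the cumulative refunded amount is what prevents a series of partial refunds from exceeding the original charge.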
Disputes / chargebacks
When a customer disputes with their bank, the bank sends a “chargeback” message. The merchant has a window to respond with evidence. The system:
- Marks the charge as disputed (does not yet move money).
- On loss: writes ledger entries reversing the original charge plus a fee.
- On win: no money moves.
This part of the system is procedurally complex (network-specific rules) but mechanically simple in the ledger model.
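The ledger side of a dispute outcome can be sketched in a few lines. Assumptions: `resolve_dispute` is a hypothetical helper, the dispute fee is debited from the merchant and credited to fees[platform], and the status names are illustrative; how fees are actually routed depends on network rules.

```python
def resolve_dispute(outcome, txn_id, charge_amount, dispute_fee, merchant, customer):
    """Return (new_charge_status, ledger_entries) for a resolved dispute.

    Entries are (txn_id, account, signed_amount) in integer minor units.
    """
    if outcome == "won":
        # No money moves; the charge simply leaves the disputed state.
        return "succeeded", []
    if outcome == "lost":
        entries = [
            # Reverse the original charge.
            (txn_id, f"merchant_balance[{merchant}]", -charge_amount),
            (txn_id, f"customer_payable[{customer}]", +charge_amount),
            # Charge the dispute fee to the merchant.
            (txn_id, f"merchant_balance[{merchant}]", -dispute_fee),
            (txn_id, "fees[platform]", +dispute_fee),
        ]
        return "dispute_lost", entries
    raise ValueError(f"unknown outcome: {outcome!r}")


lost_status, lost_entries = resolve_dispute("lost", "D1", 1000, 1500, "merchant_A", "customer_X")
won_status, won_entries = resolve_dispute("won", "D2", 1000, 1500, "merchant_A", "customer_X")
```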
Multi-currency
Each entry has a currency; accounts are single-currency. Cross-currency transactions go through an explicit FX conversion step that writes both legs:
    customer pays $10 USD → merchant settles in €:
    -10.00 USD → customer_payable[customer_X]
    +10.00 USD → fx_holding[platform]
     -9.10 EUR → fx_holding[platform]
     +9.10 EUR → merchant_balance[merchant_A]
    (rate snapshot kept for audit)

The platform absorbs the FX risk between when the charge happens and when it settles (usually under a minute, so very small).
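The four legs can be generated and checked per currency as below. This is a sketch under stated assumptions: `fx_entries` is a hypothetical helper, both currencies share the same minor-unit exponent (true for USD and EUR), and half-even rounding of the converted amount is an illustrative policy choice.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_EVEN


def fx_entries(txn_id, amount_minor, src, dst, rate, customer, merchant):
    """Build the four FX legs plus a rate snapshot for audit."""
    converted = int(
        (Decimal(amount_minor) * rate).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    )
    entries = [
        (txn_id, f"customer_payable[{customer}]", -amount_minor, src),
        (txn_id, "fx_holding[platform]", +amount_minor, src),
        (txn_id, "fx_holding[platform]", -converted, dst),
        (txn_id, f"merchant_balance[{merchant}]", +converted, dst),
    ]
    snapshot = {"txn_id": txn_id, "pair": f"{src}/{dst}", "rate": str(rate)}
    return entries, snapshot


def sums_by_currency(entries):
    """Each currency must net to zero on its own; amounts never mix."""
    totals = defaultdict(int)
    for _txn, _account, amount, currency in entries:
        totals[currency] += amount
    return dict(totals)


entries, snapshot = fx_entries(
    "T1", 1000, "USD", "EUR", Decimal("0.91"), "customer_X", "merchant_A"
)
totals = sums_by_currency(entries)
```

The platform's fx_holding accounts end up long USD and short EUR, which is the exposure the text describes.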
Webhook delivery
Every state change emits an event. Webhooks deliver these to merchant endpoints:
- At-least-once delivery with exponential backoff.
- Signed payloads (HMAC) so the merchant can verify authenticity.
- Idempotency on the merchant side (we send event_id; if they’ve seen it before, they no-op).
A merchant endpoint that’s down accumulates a backlog; replay-on-recovery is the standard pattern.
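The three delivery properties above can be sketched with the standard library: HMAC-SHA256 signing, a capped exponential backoff schedule, and a receiver that deduplicates on event_id. The names `sign`, `verify`, `backoff_schedule`, and `MerchantReceiver` are hypothetical, as are the base delay and cap.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, payload: bytes) -> str:
    """Hex HMAC-SHA256 of the raw payload bytes."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sign(secret, payload), signature)


def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 3600.0) -> list:
    """Delay in seconds before each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


class MerchantReceiver:
    """Merchant-side handler: verify the signature, then no-op on repeated events."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen = set()

    def handle(self, payload: bytes, signature: str) -> str:
        if not verify(self._secret, payload, signature):
            return "rejected"
        event_id = json.loads(payload)["event_id"]
        if event_id in self._seen:
            return "duplicate"
        self._seen.add(event_id)
        return "processed"


secret = b"whsec_example"
payload = json.dumps({"event_id": "evt_1", "type": "charge.succeeded"}).encode()
signature = sign(secret, payload)
receiver = MerchantReceiver(secret)
results = [
    receiver.handle(payload, signature),   # first delivery
    receiver.handle(payload, signature),   # redelivery after a lost acknowledgement
    receiver.handle(payload, "0" * 64),    # forged signature
]
```

Signing the raw bytes, not a re-serialized object, keeps verification independent of JSON key ordering on the merchant side.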
Step 7 — Evaluation & Trade-offs
Bottleneck #1: external network latency. The card network’s own response time (300-800 ms p99) dominates the charge latency. Nothing we do on our side closes that gap. Mitigation: parallelize the fraud check with the auth request when the merchant accepts the risk profile.
Bottleneck #2: ledger contention on a hot merchant. A platform with one merchant doing 1,000 charges/sec is writing 4,000+ ledger entries/sec (four per charge, as in the $10 example), and most of that write pressure lands on the cached balance of that merchant’s account row. The balance is derived, not authoritative: compute it lazily from entry sums or maintain a sharded cache (see /system-design/sharded-counters). Never lock the merchant’s balance row on every write.
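The sharded-cache idea can be illustrated in process: each write touches one randomly chosen shard, and a read sums all shards. This is a sketch only; `ShardedBalance` is a hypothetical class, and in a real system the shards would be separate rows or keys so that concurrent writers rarely contend on the same one.

```python
import random


class ShardedBalance:
    """Cached balance split across N shards to spread write contention."""

    def __init__(self, num_shards: int = 16):
        self._shards = [0] * num_shards

    def add(self, amount: int) -> None:
        # Each write lands on one shard, so concurrent writers rarely collide.
        self._shards[random.randrange(len(self._shards))] += amount

    def total(self) -> int:
        # Reads are more expensive: they sum every shard.
        return sum(self._shards)


balance = ShardedBalance(num_shards=8)
for _ in range(1000):
    balance.add(+1000)   # charge credited to the merchant
    balance.add(-30)     # platform fee debited
```

The cache remains a derived value; the ledger entries are still what an audit would sum.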
Bottleneck #3: reconciliation throughput. Nightly files for 500 M charges are 100 GB+ each. Process them in parallel by merchant; the reconciliation join is the slow part. Many real systems are 12-24 hours behind on reconciliation, which is fine — it’s a check, not a live consistency requirement.
Alternative I’d push back on: writing balances as authoritative columns updated by triggers. Looks convenient, kills performance, and creates inconsistency bugs. Balances are a view over the ledger, not a source of truth.
What breaks first at 10× scale (300 K charges/sec): the idempotency record table. Per-key writes at high QPS need careful sharding by merchant_id × key_hash. Also: the network connection pool to processors — opening too many connections gets you rate-limited at the network level; pool sharing across customers becomes the next bottleneck.
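Routing idempotency records by merchant_id × key_hash can be sketched as a stable hash over both values. Assumptions: `shard_for` is a hypothetical helper, and plain modulo placement is used for brevity; a production system would more likely use consistent hashing or a directory so that shards can be added without remapping every key.

```python
import hashlib
from collections import Counter


def shard_for(merchant_id: str, idempotency_key: str, num_shards: int) -> int:
    """Map (merchant_id, key) to a shard index using a process-independent hash."""
    digest = hashlib.sha256(f"{merchant_id}:{idempotency_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# A single hot merchant still spreads across shards because the key is hashed too.
placement = Counter(shard_for("merchant_A", f"key-{i}", 16) for i in range(10_000))
```

Hashing the key together with the merchant id avoids pinning a single large merchant to one shard, while every retry of a given request still lands on the shard that holds its record.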
Companies this resembles
Stripe, Adyen, PayPal, Square, Braintree. In-house equivalents at Amazon (payment platform), Uber (rider payments), Airbnb (host payouts). Cousins: ledger-only systems for fintech (Modern Treasury, Increase) without the network integration.
Related systems
- Sequencer — for monotonic txn_id generation across regions.
- Distributed Task Scheduler — webhook retry queue, reconciliation jobs.
- Server-Side Error Monitoring — every failed charge gets a per-merchant alert.
- Uber — depends on this design as the payment substrate for rides.