Payment System

Idempotency, double-entry ledgers, reconciliation, gateway integration, fraud signals.


Step 1 — Clarify Requirements#

Functional

  • Charge a customer’s payment method on behalf of a merchant.
  • Support multiple payment methods (cards, ACH, wallets, alternative methods per country).
  • Refund, partial refund, dispute / chargeback handling.
  • Idempotent API — duplicate requests never double-charge.
  • Reconciliation against bank / processor statements.
  • Out of scope here: subscription billing, invoicing UI, fraud-model training pipeline.

Non-functional

  • 99.99% availability for the charge path; downtime directly maps to lost merchant revenue.
  • p99 charge latency under 1 second (including external processor call).
  • Strict consistency on money — no charge can be lost, double-applied, or rounded incorrectly.
  • 30 K charges/sec peak (large-platform scale: a Black Friday at Stripe or PayPal).
  • 7-year financial audit trail; immutable history.

Step 2 — Capacity Estimation#

  • Charges/day: 30 K/sec peak × ~20% average-to-peak ratio × 86 400 s ≈ 518 M, so call it ~500 M charges/day at top scale (Stripe-class).
  • Storage per charge: ~5 KB of structured data (charge record + envelope + idempotency key + ledger entries). 500 M × 5 KB = 2.5 TB/day; that is ~0.9 PB/year, or ~4.6 PB over 5 years and ~6.4 PB over the 7-year audit window, before replication.
  • External calls: each charge is ~2-3 round-trips to card networks / acquirers / 3DS providers. Network is the latency dominator.
  • Refund rate: ~2% of charges. ~10 M refunds/day at top scale.
  • Disputes / chargebacks: ~0.1% of charges. The dispute lifecycle takes weeks; storage matters more than throughput.
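As a quick sanity check, the daily figures above can be reproduced in a few lines of Python. The constant names are illustrative; the inputs are the estimates from the bullets.

```python
# Back-of-envelope check of the estimates above. Inputs are the figures
# from the bullets; the constant names are illustrative.
PEAK_CHARGES_PER_SEC = 30_000
AVG_TO_PEAK_RATIO = 0.20        # average load as a fraction of peak
SECONDS_PER_DAY = 86_400
BYTES_PER_CHARGE = 5_000        # ~5 KB, ledger entries included
REFUND_RATE = 0.02
DISPUTE_RATE = 0.001

exact_charges_per_day = PEAK_CHARGES_PER_SEC * AVG_TO_PEAK_RATIO * SECONDS_PER_DAY
charges_per_day = 500_000_000   # rounded figure used in the rest of the estimate

storage_tb_per_day = charges_per_day * BYTES_PER_CHARGE / 1e12
refunds_per_day = charges_per_day * REFUND_RATE
disputes_per_day = charges_per_day * DISPUTE_RATE

print(f"charges/day (exact): {exact_charges_per_day / 1e6:.1f} M")
print(f"storage/day:         {storage_tb_per_day:.1f} TB")
print(f"refunds/day:         {refunds_per_day / 1e6:.1f} M")
print(f"disputes/day:        {disputes_per_day / 1e3:.0f} K")
```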

Neither QPS nor storage is the hard part; the correctness requirements are. Every line of code has to be read with the question “could this cause a money error?” in mind.

Step 3 — System Interface#

POST /charges
  Idempotency-Key: <client-provided UUID>
  Body: {
    amount: int (smallest unit, e.g., cents),
    currency: string (ISO 4217),
    payment_method: { type, ... },
    customer_id: string,
    metadata: { ... }
  }
  → 200 { id, status: 'succeeded'|'pending'|'requires_action'|'failed', ... }

POST /refunds
  Idempotency-Key: <UUID>
  Body: { charge_id, amount }

GET /charges/:id
POST /charges/:id/capture   (for two-phase auth + capture)
POST /webhooks/:network     (incoming events from processors)

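The Idempotency-Key is the single most important header in the entire API. It is how a client retrying after a flaky network call can be sure the retry will not double-charge: the client generates the key once per logical charge and sends the same key on every retry.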
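A minimal client-side sketch of that contract follows. `FlakyServer` is a hypothetical in-memory stand-in for POST /charges that loses its first response; it is not a real SDK. The point is only that the key is generated once, outside the retry loop.

```python
import uuid


class FlakyServer:
    """Hypothetical stand-in for POST /charges; the first response is lost."""

    def __init__(self):
        self.responses = {}      # idempotency key -> stored response
        self.charges_made = 0
        self.calls = 0

    def post_charge(self, idempotency_key, body):
        self.calls += 1
        if idempotency_key not in self.responses:
            # First time this key is seen: actually perform the charge.
            self.charges_made += 1
            self.responses[idempotency_key] = {
                "id": f"ch_{self.charges_made}",
                "status": "succeeded",
            }
        if self.calls == 1:
            # The charge happened, but the client never hears about it.
            raise TimeoutError("response lost in transit")
        return self.responses[idempotency_key]


def charge_with_retries(server, body, max_attempts=3):
    key = str(uuid.uuid4())      # one key per logical charge, reused on retries
    for _ in range(max_attempts):
        try:
            return server.post_charge(key, body)
        except TimeoutError:
            continue
    raise RuntimeError("charge outcome unknown after retries")


server = FlakyServer()
result = charge_with_retries(server, {"amount": 1000, "currency": "USD"})
print(result, "charges made:", server.charges_made)
```

Had the key been generated inside the loop, the retry would have looked like a brand-new charge and the customer would have been charged twice.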

Step 4 — High-Level Design#

merchant API call
  ↓
API gateway ─→ idempotency check (Redis + DB)
  │               ├─ key seen? return stored result
  │               └─ new key: claim & proceed
  ↓
charge service ─→ fraud service (sync or async based on risk)
  ↓
processor router ─→ external network (Visa, MC, ACH, etc.)
  ↓
ledger writer (double-entry, transactional)
  ↓
event bus ─→ webhook delivery + reconciliation + analytics
  ↓
reconciliation engine (compares ledger vs network statements)

The ledger is the source of truth. Every other component (webhooks, reports, refunds) is derived from it.

Step 5 — Data Model#

Double-entry ledger#

Each money movement is recorded as equal-and-opposite entries (at least two) across accounts, so the entries of any transaction sum to zero, and so does the ledger as a whole.

table accounts
  account_id   uuid PK
  type         enum(merchant_balance, customer_payable, fees, reserve, ...)
  currency     string
  balance      int        -- derived (cached); source of truth is the entries

table ledger_entries
  entry_id     uuid PK
  txn_id       uuid       -- groups entries that compose one logical transaction
  account_id   uuid
  amount       int        -- signed, in the smallest currency unit; positive = credit
  currency     string
  ts           timestamp
  description  string
  INVARIANT: for every (txn_id, currency), SUM(amount) = 0
             -- enforced by the ledger writer at commit time; a row-level CHECK
             -- constraint cannot express a sum across rows

A charge of $10 from customer to merchant:

txn_id = T1
+10.00 → merchant_balance[merchant_A]
-10.00 → customer_payable[customer_X]

A platform fee of $0.30 within that same charge:

txn_id = T1
-0.30 → merchant_balance[merchant_A]
+0.30 → fees[platform]

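All four entries share txn_id = T1; their sum is zero. Amounts are shown in dollars for readability; they are stored as integer cents (+1000, -1000, -30, +30). The ledger is append-only: corrections happen via new entries (reversals), never UPDATE / DELETE.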
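The T1 example can be sketched as a tiny in-memory ledger. This is an illustration under simplifying assumptions (single currency, amounts in integer cents, account names written as `type:owner` strings), not a production design; the class and method names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount: int      # signed, in cents; positive = credit
    currency: str


class Ledger:
    def __init__(self):
        self._entries = []   # append-only: nothing is ever updated or deleted

    def post(self, txn_id, lines, currency="USD"):
        """Append every line of one transaction, or none if they do not sum to zero."""
        if sum(amount for _, amount in lines) != 0:
            raise ValueError(f"transaction {txn_id} is unbalanced")
        self._entries.extend(
            Entry(txn_id, account, amount, currency) for account, amount in lines
        )

    def balance(self, account):
        # Balances are derived from the entries, never stored as the source of truth.
        return sum(e.amount for e in self._entries if e.account == account)


ledger = Ledger()
ledger.post("T1", [
    ("merchant_balance:merchant_A", +1000),   # $10.00 charge
    ("customer_payable:customer_X", -1000),
    ("merchant_balance:merchant_A", -30),     # $0.30 platform fee
    ("fees:platform", +30),
])
print(ledger.balance("merchant_balance:merchant_A"))
```

The merchant ends up with the charge minus the fee, and the sum over all accounts stays at zero.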

Charges#

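table charges
  charge_id        uuid PK
  idempotency_key  string      -- UNIQUE (merchant_id, idempotency_key): keys are scoped per merchant
  merchant_id      uuid
  customer_id      uuid
  amount           int         -- smallest currency unit
  currency         string
  status           enum(...)
  refunded_amount  int         -- running total of refunds against this charge
  network_txn_id   string      -- returned by Visa/MC
  ledger_txn_id    uuid        -- links to ledger entries
  created_at       timestamp
  metadata         json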

Idempotency#

table idempotency_records
  merchant_id    uuid        -- keys are scoped per merchant
  key            string      -- PK is (merchant_id, key)
  request_hash   bytes       -- hash of the full request body
  response_body  bytes
  status         enum(in_progress, completed, failed)
  created_at     timestamp
  expires_at     timestamp   -- typically 24 hours

The key is the binding between a client’s “I sent this charge request” and the server’s response.

Step 6 — Detailed Design#

Idempotency in detail#

POST /charges with Idempotency-Key: abc-123:
1. Look up idempotency_records[abc-123] (scoped to the calling merchant).
2. If record exists with status=completed or failed:
   if request_hash differs: reject (e.g., 422), the key was reused for a different request.
   otherwise: return stored response.
3. If record exists with status=in_progress:
   return 409 (in progress) or wait briefly for completion.
4. If no record:
   INSERT idempotency_records[abc-123] (in_progress, request_hash) -- atomic via unique constraint.
   on a unique violation another request won the race: go back to step 1.
   proceed with charge.
   UPDATE idempotency_records[abc-123] (completed or failed, response_body).

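The unique constraint at step 4 is what enforces “only one writer per key” in the face of concurrent retries from the same client. PostgreSQL and MySQL give you this through primary-key (or unique-index) insert semantics; DynamoDB needs a conditional write (attribute_not_exists on the key), because a plain PutItem overwrites an existing item.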
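The claim-then-proceed flow can be sketched with Python's built-in sqlite3 standing in for the real database. Several things here are assumptions made for the example: `do_charge` is a hypothetical callable that performs the charge, the 422 status for a reused key is one possible choice, and crash recovery for records stuck in in_progress is left out.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE idempotency_records (
        merchant_id   TEXT NOT NULL,
        key           TEXT NOT NULL,
        request_hash  TEXT NOT NULL,
        status        TEXT NOT NULL,
        response_body TEXT,
        PRIMARY KEY (merchant_id, key)
    )
""")


def request_hash(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle_charge(merchant_id, key, body, do_charge):
    """Return (http_status, response) for one POST /charges attempt."""
    h = request_hash(body)
    try:
        with db:  # commits on success, rolls back on exception
            db.execute(
                "INSERT INTO idempotency_records "
                "(merchant_id, key, request_hash, status) "
                "VALUES (?, ?, ?, 'in_progress')",
                (merchant_id, key, h),
            )
    except sqlite3.IntegrityError:
        # The key was already claimed by an earlier or concurrent request.
        stored_hash, status, response = db.execute(
            "SELECT request_hash, status, response_body "
            "FROM idempotency_records WHERE merchant_id = ? AND key = ?",
            (merchant_id, key),
        ).fetchone()
        if stored_hash != h:
            return 422, {"error": "idempotency key reused with a different request"}
        if status == "in_progress":
            return 409, {"error": "request in progress"}
        return 200, json.loads(response)

    response = do_charge(body)  # only the request that won the INSERT gets here
    with db:
        db.execute(
            "UPDATE idempotency_records SET status = 'completed', response_body = ? "
            "WHERE merchant_id = ? AND key = ?",
            (json.dumps(response), merchant_id, key),
        )
    return 200, response


calls = []

def fake_charge(body):
    calls.append(body)
    return {"id": "ch_1", "status": "succeeded"}

body = {"amount": 1000, "currency": "USD"}
first = handle_charge("m1", "abc-123", body, fake_charge)
retry = handle_charge("m1", "abc-123", body, fake_charge)
reused = handle_charge("m1", "abc-123", {"amount": 2000, "currency": "USD"}, fake_charge)
print(first, retry, reused, len(calls))
```

The sketch inserts first and reads only on conflict, which folds the lookup and the claim into a single atomic step.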

Charge lifecycle#

1. POST /charges arrives. Idempotency check passes (new request).
2. Fraud scoring (sub-50 ms). If clear, proceed. If risky, route to manual review queue.
3. Processor router picks the right acquirer for (card brand, currency, country).
4. Send authorization request to the card network. Timeout: 30 seconds.
5. Network responds: approved (with auth_code) or declined (with reason).
   On timeout the outcome is unknown: keep the charge pending and resolve it via a status query or a reversal; never treat a timeout as a decline.
6. If approved:
   write ledger entries in a transaction with charge update.
   emit `charge.succeeded` event.
7. If declined:
   update charge to status=failed, write event.
8. Return result + persist in idempotency_records.

Lifecycle step 6’s “transaction” is the critical part. The ledger entries and the charge status must commit together; otherwise we could end up with “the network approved but our records don’t show it” (lost money) or “our records say approved but the network never charged” (the merchant is paid out money that was never collected).
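A minimal sketch of that atomic commit, again using sqlite3 as a stand-in for the real database. The table layouts are trimmed to the columns the example needs, and `record_approval` is a hypothetical helper name.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE charges (
        charge_id     TEXT PRIMARY KEY,
        status        TEXT NOT NULL,
        ledger_txn_id TEXT
    );
    CREATE TABLE ledger_entries (
        txn_id     TEXT NOT NULL,
        account_id TEXT NOT NULL,
        amount     INTEGER NOT NULL
    );
    INSERT INTO charges VALUES ('ch_1', 'pending', NULL);
""")


def record_approval(charge_id, txn_id, entries):
    """Write the ledger entries and flip the charge status in one transaction."""
    if sum(amount for _, amount in entries) != 0:
        raise ValueError("unbalanced transaction")
    with db:  # both writes commit together, or neither does
        db.executemany(
            "INSERT INTO ledger_entries VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in entries],
        )
        cur = db.execute(
            "UPDATE charges SET status = 'succeeded', ledger_txn_id = ? "
            "WHERE charge_id = ? AND status = 'pending'",
            (txn_id, charge_id),
        )
        if cur.rowcount != 1:
            # Raising inside the block rolls back the ledger inserts too.
            raise RuntimeError("charge is not pending")


entries = [
    ("merchant_balance:merchant_A", 1000),
    ("customer_payable:customer_X", -1000),
]
record_approval("ch_1", "T1", entries)

try:
    record_approval("ch_1", "T1-duplicate", entries)  # charge already succeeded
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True

rows = db.execute("SELECT txn_id FROM ledger_entries").fetchall()
print(duplicate_rejected, rows)
```

The second call fails on the status guard, and its ledger inserts are rolled back along with it, so the ledger never holds entries for a charge that did not transition.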

Two-phase: authorize then capture#

Many flows want to authorize first (place a hold on the card), then capture later (actually move the money). Used by hotels, car rentals, marketplaces with order fulfillment.

POST /charges { capture: false } → status: authorized (hold for ~7 days)
POST /charges/:id/capture → status: succeeded (ledger entries written)
POST /charges/:id/cancel → status: voided (no ledger entries needed)

Authorizations expire if not captured. The system tracks expiration and auto-voids stale auths.
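The authorize/capture/void transitions and the stale-auth sweep can be sketched as a small state machine. The 7-day hold is taken from the listing above; the class, the transition table, and the `expire` action are illustrative names, not part of the API.

```python
from datetime import datetime, timedelta, timezone

AUTH_HOLD = timedelta(days=7)   # assumed hold window

# (current status, action) -> new status
TRANSITIONS = {
    ("authorized", "capture"): "succeeded",
    ("authorized", "cancel"): "voided",
    ("authorized", "expire"): "voided",
}


class Charge:
    def __init__(self, charge_id, authorized_at):
        self.charge_id = charge_id
        self.status = "authorized"
        self.expires_at = authorized_at + AUTH_HOLD

    def apply(self, action, now):
        if action == "capture" and now >= self.expires_at:
            raise ValueError("authorization expired; cannot capture")
        new_status = TRANSITIONS.get((self.status, action))
        if new_status is None:
            raise ValueError(f"cannot {action} a charge in status {self.status}")
        self.status = new_status
        return new_status


def void_stale_auths(charges, now):
    """Void every authorization whose hold has lapsed; return their ids."""
    voided = []
    for charge in charges:
        if charge.status == "authorized" and now >= charge.expires_at:
            charge.apply("expire", now)
            voided.append(charge.charge_id)
    return voided


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
captured = Charge("ch_a", t0)
forgotten = Charge("ch_b", t0)

captured.apply("capture", t0 + timedelta(days=2))
stale = void_stale_auths([captured, forgotten], t0 + timedelta(days=8))
print(captured.status, forgotten.status, stale)
```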

Reconciliation#

The card network sends nightly settlement files: “here are the transactions that cleared today, with our totals”. We compare against our ledger:

for each network_record:
  find local charge by network_txn_id
  if local says succeeded and amount matches → reconciled
  if local says succeeded but amount differs → flag for investigation
  if local says failed, or no local charge exists → critical (money on the line)
for each local succeeded charge in the settlement window:
  if no network record → flag (could be settlement delay)

Discrepancies are the daily reconciliation work. Most are settlement-timing related; the rare critical ones (orphaned charges) trigger an investigation runbook.
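The comparison can be sketched as a function over two dictionaries keyed by network_txn_id. The verdict labels and record shapes are assumptions made for the example. Note that it needs two passes: one over the network file, and one over local charges to catch those the network never reported.

```python
def reconcile(local_charges, network_records):
    """Both arguments map network_txn_id -> record; returns a verdict per id."""
    verdicts = {}

    # Pass 1: everything the network says cleared.
    for txn_id, net in network_records.items():
        local = local_charges.get(txn_id)
        if local is None or local["status"] == "failed":
            verdicts[txn_id] = "critical"            # money moved that we do not show
        elif local["status"] == "succeeded" and local["amount"] == net["amount"]:
            verdicts[txn_id] = "reconciled"
        else:
            verdicts[txn_id] = "investigate"         # amount or status mismatch

    # Pass 2: local successes the network file does not mention.
    for txn_id, local in local_charges.items():
        if local["status"] == "succeeded" and txn_id not in network_records:
            verdicts[txn_id] = "missing_from_network"   # often a settlement delay

    return verdicts


local = {
    "n1": {"status": "succeeded", "amount": 1000},
    "n2": {"status": "succeeded", "amount": 2500},
    "n3": {"status": "failed", "amount": 700},
    "n4": {"status": "succeeded", "amount": 300},
}
network = {
    "n1": {"amount": 1000},
    "n2": {"amount": 2400},
    "n3": {"amount": 700},
    "n5": {"amount": 900},
}
verdicts = reconcile(local, network)
print(verdicts)
```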

Refunds#

POST /refunds { charge_id, amount }:
1. Idempotency check.
2. Verify the charge is refundable (status, time window, available balance).
3. Send refund request to processor.
4. On success: write reverse ledger entries (mirror image of the original charge).
txn_id = R1 referencing original T1
-10.00 → merchant_balance[merchant_A]
+10.00 → customer_payable[customer_X]
5. Update charge.refunded_amount; status if fully refunded.

Refunds use the same double-entry pattern; the ledger remains balanced.
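Steps 2 and 4 can be sketched as a pure function that validates the refund and returns the reversal lines. The charge is a plain dict with a `refunded_amount` field, amounts are integer cents, and the function name is hypothetical. The processor call and the time-window check are omitted.

```python
def build_refund_entries(charge, refund_amount):
    """Return reversal ledger lines for a refund, or raise if it is not allowed."""
    if charge["status"] != "succeeded":
        raise ValueError("only succeeded charges can be refunded")
    remaining = charge["amount"] - charge["refunded_amount"]
    if not 0 < refund_amount <= remaining:
        raise ValueError(f"refund must be between 1 and {remaining}")
    # Mirror image of the original charge entries, for the refunded amount only.
    return [
        (f"merchant_balance:{charge['merchant_id']}", -refund_amount),
        (f"customer_payable:{charge['customer_id']}", +refund_amount),
    ]


charge = {
    "charge_id": "ch_1",
    "merchant_id": "merchant_A",
    "customer_id": "customer_X",
    "amount": 1000,
    "refunded_amount": 0,
    "status": "succeeded",
}

partial = build_refund_entries(charge, 400)
charge["refunded_amount"] += 400

try:
    build_refund_entries(charge, 700)   # only 600 is left to refund
    over_refund_blocked = False
except ValueError:
    over_refund_blocked = True

print(partial, over_refund_blocked)
```

Tracking the running refunded total is what keeps a series of partial refunds from exceeding the original charge.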

Disputes / chargebacks#

When a customer disputes a charge with their bank, the bank sends a “chargeback” message. The merchant has a window to respond with evidence. The system:

  • Marks the charge as disputed (does not yet move money).
  • On loss: writes ledger entries reversing the original charge plus a fee.
  • On win: no money moves.

This part of the system is procedurally complex (network-specific rules) but mechanically simple in the ledger model.

Multi-currency#

Each entry has a currency; accounts are single-currency. Cross-currency transactions go through an explicit FX conversion step that writes both legs:

customer pays $10 USD → merchant settles in €:
-10.00 USD → customer_payable[customer_X]
+10.00 USD → fx_holding[platform]
-9.10 EUR → fx_holding[platform]
+9.10 EUR → merchant_balance[merchant_A]
(rate snapshot kept for audit)

The platform absorbs the FX risk between the rate snapshot and the moment the conversion actually executes. When the conversion runs at charge time that window is usually under a minute, so the exposure is very small.
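The four-leg example can be sketched as follows. It assumes both currencies use two minor-unit digits (true for USD and EUR) and picks banker's rounding for the converted amount; the rounding mode and the function names are assumptions, not something the design prescribes.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_EVEN


def fx_entries(amount_minor, src, dst, rate, customer, merchant):
    """Build the four legs of a cross-currency charge; amounts in minor units."""
    converted = int(
        (Decimal(amount_minor) * Decimal(rate)).quantize(
            Decimal("1"), rounding=ROUND_HALF_EVEN
        )
    )
    return [
        (f"customer_payable:{customer}", -amount_minor, src),
        ("fx_holding:platform", +amount_minor, src),
        ("fx_holding:platform", -converted, dst),
        (f"merchant_balance:{merchant}", +converted, dst),
    ]


def sums_by_currency(entries):
    totals = defaultdict(int)
    for _, amount, currency in entries:
        totals[currency] += amount
    return dict(totals)


# $10.00 USD at a snapshot rate of 0.91 EUR per USD.
legs = fx_entries(1000, "USD", "EUR", "0.91", "customer_X", "merchant_A")
totals = sums_by_currency(legs)
print(legs, totals)
```

The transaction balances per currency rather than overall, so the zero-sum invariant has to be checked per (txn_id, currency) once FX is involved.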

Webhook delivery#

Every state change emits an event. Webhooks deliver these to merchant endpoints:

  • At-least-once delivery with exponential backoff.
  • Signed payloads (HMAC) so the merchant can verify authenticity.
  • Idempotency on the merchant side (we send event_id; if they’ve seen it before, they no-op).

A merchant endpoint that’s down accumulates a backlog; replay-on-recovery is the standard pattern.
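The three properties above can be sketched with the standard library. HMAC-SHA256 over the raw payload, the retry schedule parameters, and the `MerchantEndpoint` class are assumptions for illustration; real providers differ in header format and often include a timestamp in the signed content.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, payload: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(secret, payload), signature)


def backoff_schedule(base_seconds=1, factor=2, max_attempts=5, cap_seconds=3600):
    """Delay before each retry attempt, growing exponentially up to a cap."""
    return [min(base_seconds * factor ** n, cap_seconds) for n in range(max_attempts)]


class MerchantEndpoint:
    """Receiver side: verify the signature, then de-duplicate on event_id."""

    def __init__(self, secret):
        self.secret = secret
        self.seen = set()
        self.processed = []

    def receive(self, payload: bytes, signature: str) -> int:
        if not verify(self.secret, payload, signature):
            return 400
        event = json.loads(payload)
        if event["event_id"] in self.seen:
            return 200                      # duplicate delivery: acknowledge, no-op
        self.seen.add(event["event_id"])
        self.processed.append(event["type"])
        return 200


secret = b"whsec_example"
payload = json.dumps({"event_id": "evt_1", "type": "charge.succeeded"}).encode()
signature = sign(secret, payload)

endpoint = MerchantEndpoint(secret)
first = endpoint.receive(payload, signature)
duplicate = endpoint.receive(payload, signature)
tampered = endpoint.receive(payload.replace(b"succeeded", b"refunded"), signature)
print(first, duplicate, tampered, endpoint.processed, backoff_schedule())
```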

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: external network latency. The card network’s own response time (300-800 ms p99) dominates the charge latency. Nothing we do on our side closes that gap. Mitigation: parallelize the fraud check with the auth request when the merchant accepts the risk profile.

Bottleneck #2: ledger contention on a hot merchant. One merchant doing 1,000 charges/sec generates 4,000+ ledger entries/sec (four per charge in the fee example), and half of them hit that merchant’s own account, so a cached balance on the account row becomes a hot spot. The balance is derived, not authoritative: compute it lazily from entry sums or maintain a sharded cache (see /system-design/sharded-counters). Never lock the merchant’s balance row on every write.
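The sharded-cache idea can be sketched in memory as below. In a real database each shard would be its own row, so concurrent writers rarely contend on the same one. The class name, shard count, and random shard selection are illustrative assumptions.

```python
import random


class ShardedBalance:
    """Cached balance split across shards; the ledger remains the source of truth."""

    def __init__(self, num_shards=16, seed=None):
        self._shards = [0] * num_shards
        self._rng = random.Random(seed)

    def add(self, amount):
        # Each write touches one randomly chosen shard instead of a single hot row.
        shard = self._rng.randrange(len(self._shards))
        self._shards[shard] += amount

    def total(self):
        # Reads pay the cost of summing all shards.
        return sum(self._shards)


balance = ShardedBalance(num_shards=16, seed=42)
for _ in range(1000):
    balance.add(1000)    # charge credited to the merchant
    balance.add(-30)     # platform fee debited
print(balance.total())
```

The trade-off is cheaper writes for more expensive reads, which suits a balance that is written thousands of times per second and read far less often.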

Bottleneck #3: reconciliation throughput. Nightly files for 500 M charges are 100 GB+ each. Process them in parallel by merchant; the reconciliation join is the slow part. Many real systems are 12-24 hours behind on reconciliation, which is fine — it’s a check, not a live consistency requirement.

Alternative I’d push back on: writing balances as authoritative columns updated by triggers. Looks convenient, kills performance, and creates inconsistency bugs. Balances are a view over the ledger, not a source of truth.

What breaks first at 10× scale (300 K charges/sec): the idempotency record table. Per-key writes at that rate need careful sharding by merchant_id × key_hash. Next is the connection pool to processors: opening too many connections gets you rate-limited at the network level, so connections have to be shared across merchants, and that shared pool becomes the next bottleneck.
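One way to read “merchant_id × key_hash” is to hash the pair and take it modulo the shard count, so a single hot merchant still spreads across shards. The shard count and function name below are assumptions, and a production system would more likely use consistent hashing to make resharding cheaper.

```python
import hashlib


def shard_for(merchant_id: str, idempotency_key: str, num_shards: int = 64) -> int:
    """Map (merchant_id, key) to a shard; stable across processes and restarts."""
    digest = hashlib.sha256(f"{merchant_id}:{idempotency_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# One hot merchant, many keys: writes land on many different shards.
shards_used = {shard_for("merchant_A", f"key-{i}") for i in range(1000)}
print(len(shards_used), "of 64 shards used by a single merchant")
```

Python's built-in `hash()` is deliberately avoided here because string hashing is randomized per process, which would route the same key to different shards on different hosts.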

Companies this resembles#

Stripe, Adyen, PayPal, Square, Braintree. In-house equivalents at Amazon (payment platform), Uber (rider payments), Airbnb (host payouts). Cousins: ledger-only systems for fintech (Modern Treasury, Increase) without the network integration.
