Amazon DynamoDB
Partition + sort key model, single-table design, on-demand vs provisioned capacity, GSIs, transactions, the cost model.
What it is#
Amazon DynamoDB is AWS’s fully-managed key-value and document database, announced in 2012 and an evolution of the Dynamo paper (DeCandia et al., 2007) that informed Cassandra, Riak, and Voldemort. It is the default operational database for AWS-native applications that need single-digit-millisecond p99 read/write at any scale: Amazon’s own retail order pipeline (which famously moved off Oracle to Dynamo in the late 2010s), Snapchat, Lyft, Disney+, Tinder, plus a long list of internal AWS services.
The model is deliberately narrow. Every item is keyed by a partition key (mandatory) and optionally a sort key. The engine guarantees consistent latency across data volumes from kilobytes to petabytes by hashing the partition key to a physical partition and scaling out by adding partitions. There are no joins, no foreign keys, no transactional SELECT ... ORDER BY across-tables. The trade is well-understood: give up the relational model, gain horizontal scalability and operational invisibility.
Architecture overview#
DynamoDB is built on a sharded distributed key-value store. The user sees a logical table; AWS sees thousands of physical partitions, each replicated three ways across availability zones in a region, behind a request router fleet.
client │ ▼ ┌── request router fleet ──┐ │ authenticate · rate- │ │ limit · route by hash │ └────────────┬──────────────┘ │ hash(partition_key) → partition_id ▼ ┌── storage partition ──┐ (one of many; 3 replicas across AZs) │ leader (Paxos) │ │ log + B-tree + memtable │ followers replicate │ └───────────────────────┘ │ ▼ S3 backupEach partition holds up to 10 GB of data and 3000 RCUs / 1000 WCUs of throughput. When a partition hits either ceiling, DynamoDB auto-splits it — the partition’s key range is divided, items rebalance, and the two new partitions take over. The split is online and invisible to clients.
A Read Capacity Unit (RCU) is 4 KB of eventually-consistent read per second (or 2 of strong read). A Write Capacity Unit (WCU) is 1 KB of write per second. These are the units of cost and throttling — the per-second budget the partition enforces against incoming requests.
The metadata service tracks the partition map, capacity allocations, and routing rules. Per the 2022 DynamoDB paper (Elhemali et al.), this metadata service is itself a critical hot path — early in DynamoDB’s life, it was a single-region MySQL deployment that the team replaced with a distributed solution after near-misses on cold-start storms.
Storage and indexing#
The on-disk format per partition is a B-tree-plus-log hybrid evolved from earlier internal storage engines; recent disclosures point at LSM-tree-like sorted-segment storage tuned for write throughput. Each item is stored as a JSON-like map; nested attributes are supported but not directly indexable.
Two index types:
- Local Secondary Index (LSI) — defined at table creation, must share the partition key, lets you query on an alternative sort key within a partition. Up to 5 per table; co-located with the base table, so writes don’t pay an extra partition hop.
- Global Secondary Index (GSI) — independent partitioning scheme, can use any attribute as partition key + sort key. Asynchronous propagation from base table; “consistent” only with eventual semantics. Up to 20 per table; effectively a second table you don’t have to maintain. Each GSI has its own RCU/WCU allocation.
The partition key choice is the central design decision in DynamoDB. Three rules of thumb:
- High cardinality. Distinct values numbering at least 10x the partition count. A user ID is good; a country code is bad.
- Even access. The hot partition is the classic DynamoDB failure. If 5% of partition keys take 50% of traffic, that 5% will throttle while the rest sits idle.
- Lookup-aligned. The most frequent query pattern should be answerable by
(partition_key, sort_key)directly. GSIs cost roughly as much as the base write throughput.
Single-table design is the canonical DynamoDB pattern: model the entire application — users, orders, line items, sessions, audit log — as one table with a generic partition key (PK) and sort key (SK) and overload them per entity type. PK = "USER#42", SK = "PROFILE" for the user record; PK = "USER#42", SK = "ORDER#2026-01-15#1234" for an order. A single query by PK then returns the user and their orders in one round-trip. Powerful, dense, harder to reason about than a relational schema — and the dominant pattern in production DynamoDB.
Query and transaction execution#
DynamoDB exposes a handful of API operations rather than a query language:
- GetItem / BatchGetItem — point lookup by full key. Strongly consistent or eventually consistent (twice as cheap eventually).
- PutItem / UpdateItem / DeleteItem — single-item writes. Conditional expressions enable atomic check-and-set without read-modify-write.
- Query — return items by partition key, optionally filtered/sorted by sort key. Up to 1 MB returned per call.
- Scan — full-table linear scan; quadratically expensive at scale, used mostly for one-off operations and exports.
- TransactWriteItems / TransactGetItems — up to 100 items, all-or-nothing across the transaction. Costs 2x normal capacity.
- PartiQL — SQL-like syntax layer atop the same operations; convenient, not a relational engine underneath.
Transactions are the 2018+ addition that closed the biggest gap with relational stores. They use two-phase commit under the hood — the request router coordinates with each partition’s leader, gathers prepare votes, then commits. Latency is higher than single-item writes (typically 2-3x); throughput per partition halves because each item is touched twice.
Consistency:
- Eventually consistent reads — may return slightly stale data; cheaper; read from any replica.
- Strongly consistent reads — return the latest committed write; routed to the leader; ~2x the cost; small latency increase.
- Global tables — multi-region active-active replication; last-writer-wins on conflict; eventually consistent across regions. The mechanism is streams + cross-region writers.
Replication and durability#
Each partition is replicated three ways across availability zones using Paxos for log consensus. Writes commit when a quorum (2 of 3) acknowledges the log append. Reads can serve from any replica (eventually consistent) or from the leader (strongly consistent).
DynamoDB Streams is the change feed: every write produces a stream record with the before/after image. Stream records are kept for 24 hours; downstream consumers (Lambda, Kinesis Data Firehose) replay them. Streams are how Global Tables, secondary integrations, and most “after-the-write” workflows are built.
Point-in-time recovery (PITR) allows restoring the table to any second in the last 35 days. The implementation continuously snapshots the change log to S3; restore creates a new table from the chosen point.
Backups — on-demand snapshots stored in S3, retained indefinitely. Restore creates a new table; the original is untouched.
Durability is multi-AZ replicated + multi-replica acked; AWS publishes 99.999999999% durability (“eleven nines”). Availability is 99.99% in-region or 99.999% with Global Tables, contractually.
Operational characteristics#
A DynamoDB table operates in this envelope:
- Latency: single-digit ms p50; ~10 ms p99 for GetItem; 20-50 ms p99 for transactions. Bounded by the partition’s local IOPS plus one network hop.
- Item size: up to 400 KB per item including all attributes. Large blobs go to S3 with a pointer in DynamoDB.
- Throughput: scales horizontally indefinitely as long as the partition key distributes load. AWS internal tables run at millions of RCU/WCU.
- Capacity modes:
- Provisioned — pay for declared RCU/WCU per hour. Autoscaling can adjust within limits. Cheapest at high steady utilisation.
- On-demand — pay per request. ~7x the per-request price of fully-utilised provisioned. Optimal for spiky or unpredictable workloads.
- Cost model — three buckets: request units (RCU/WCU/transactional), storage (per GB-month), data transfer out. Streams, PITR, Global Tables, and backups have additional per-feature line items.
Adaptive capacity (2018+) lets the engine temporarily exceed a partition’s provisioned throughput by borrowing from underutilised siblings. Burst capacity retains up to 5 minutes of unused capacity for short spikes. Together they smooth over many transient hot-partition cases — but neither extends past the 3000 RCU / 1000 WCU hard ceiling per partition.
The infrastructure-as-code story is mature: Terraform’s aws_dynamodb_table, AWS CDK, CloudFormation, the AWS SDK in every language. Schema changes are cheap (add an attribute by writing it) and expensive in tooling (rename a key requires a backfill + cutover pattern, never a single DDL).
Trade-offs and gotchas#
Where DynamoDB reliably surprises engineers from relational backgrounds:
- No joins, no aggregates, no
GROUP BY. Application code reassembles related items or denormalises at write time. Cross-table analytics moves to S3 + Athena or a separate warehouse. - The cost model rewards specific access patterns. Designing the table for the known queries is mandatory; “we’ll figure out the queries later” is the most expensive way to use DynamoDB.
- GSI eventual consistency means stale reads on writes. A write to the base table propagates to a GSI within seconds typically; building features that read-after-write on a GSI is a recurring source of bugs.
- Hot partition is invisible until it isn’t. CloudWatch metrics show throttled requests but the partition that’s throttling isn’t directly observable; debugging requires correlating throttle events with access patterns.
- Strong consistency costs 2x RCU. Defaulting to strongly-consistent reads everywhere is a common cost-creep mistake.
- Items don’t shrink. Updating a 400-byte item with a 200-byte payload doesn’t reclaim 200 bytes immediately; tombstones and version metadata persist until compaction.
- Streams have a 24-hour retention. A consumer that lags past 24 hours loses change events forever. Critical pipelines should mirror to Kinesis Firehose or S3.
- TTL is approximate. Items with a TTL attribute past expiration are deleted within ~48 hours, not immediately. Don’t rely on TTL for security or correctness.
- TransactWriteItems is limited to 100 items. Larger transactional workflows need application-level sagas with idempotent steps.
- Cross-region failover is asynchronous. Global Tables replicate eventually; an active-active strategy needs conflict-resolution logic in the application.
Why Amazon retail moved off Oracle to DynamoDB
By 2018 Amazon’s retail backend ran on roughly 7,500 Oracle databases. The migration to AWS-native services (mostly DynamoDB plus Aurora) was completed by 2019. The motivations were operational, not benchmark-driven: licensing costs in the hundreds of millions per year, DBA staff for Oracle-specific tuning, and the inability of vertically-scaled Oracle to absorb peak Prime-Day traffic without months of preparation. DynamoDB’s auto-partitioning meant capacity planning became a Terraform line item rather than a DBA project. The retail order pipeline — write a billion items per Prime Day, read trillions — runs on DynamoDB tables that look like the patterns documented in the public AWS blog series. The migration was the most ambitious internal proof that DynamoDB’s narrow API was sufficient for serious transactional workloads.
Related systems#
- LSM-Tree Storage — the storage engine paradigm DynamoDB’s partition storage is built on.
- Google Spanner — the closest direct competitor for globally-consistent scale-out storage.
- PostgreSQL — The Reference Open-Source RDBMS — the relational alternative when joins and ad-hoc queries matter more than scale.
- Write-Ahead Logging and Recovery — the durability mechanism underneath DynamoDB’s per-partition log.
- MVCC — Multi-Version Concurrency Control — DynamoDB’s transactions and conditional writes use version-stamped semantics.