Blob Store — System Design · Engineering Playbook

Use cases#

A blob store holds opaque binary objects, anything from a 1 KB user avatar to a 5 TB genomic dataset, keyed by string. It differs from a filesystem (no nested directories, no POSIX semantics such as fcntl locks or in-place partial writes) and from a database (no joins, no fine-grained queries). The dominant workloads:

User uploads — photos, videos, attachments, document files. The original S3 use case (2006).
Static assets behind a CDN — JS/CSS/image bundles; the blob store is the origin, the CDN fronts it.
Backups, snapshots, archives — DB dumps, container images, ML model checkpoints. Cheap, durable, infrequently accessed.
Data lake — Parquet/ORC files queried by Athena, Snowflake, BigQuery external tables. The modern lakehouse.
Application binaries — npm tarballs, PyPI wheels, Docker layers, mobile app bundles.

A bucket is a flat namespace; the appearance of folders comes from / in keys plus prefix-listing.
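
The folder illusion can be sketched in a few lines. This is a minimal illustration, assuming the keys are already held in sorted order; `list_keys` and its return shape are hypothetical names, not a real client API.

```python
from bisect import bisect_left

def list_keys(sorted_keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) that sit directly under `prefix`."""
    objects, common_prefixes = [], []
    # In sorted order, all keys sharing a prefix form one contiguous run.
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        key = sorted_keys[i]
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            objects.append(key)                  # a "file" at this level
        else:
            folder = prefix + rest[:cut + 1]     # a "subfolder" at this level
            if not common_prefixes or common_prefixes[-1] != folder:
                common_prefixes.append(folder)
        i += 1
    return objects, common_prefixes

keys = sorted([
    "photos/2024/a.jpg",
    "photos/2024/b.jpg",
    "photos/2025/c.jpg",
    "photos/readme.txt",
    "videos/intro.mp4",
])
print(list_keys(keys, prefix="photos/"))
# (['photos/readme.txt'], ['photos/2024/', 'photos/2025/'])
```

No directory objects exist anywhere: the "subfolders" are computed from the keys at list time.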

Functional requirements#

PUT(key, bytes, metadata) — store an object.
GET(key) with optional byte-range — retrieve part or all.
DELETE(key) — remove.
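LIST(prefix) — enumerate keys, paginated; depending on the system, results are strongly or only eventually consistent.
Multipart uploads for large objects (required on S3 above the 5 GB single-PUT cap, recommended from roughly 100 MB).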
Per-object metadata (user-defined x-amz-meta-* headers, content-type).
Lifecycle rules (transition / expire by age, prefix, or tag).
Versioning (keep historical versions, support undelete).
Signed URLs for time-limited public access.

Non-functional requirements#

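Durability: 11 nines (99.999999999%) per year is S3’s well-known design target: with 10 million objects stored, you would expect to lose one about every 10,000 years. It is achieved via replication + erasure coding.
Availability: 99.99% per region is the S3 Standard design target; the SLA commitment is lower, with service credits once monthly uptime falls below 99.9% and larger credits below 99%.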
Latency: first-byte under 100 ms for hot objects; sub-millisecond is not realistic — for that, use a Distributed Cache.
Throughput: per-prefix at least 3500 PUTs/sec and 5500 GETs/sec on S3; aggregate is essentially unlimited.
Object size: KB to 5 TB; single PUT capped at 5 GB.
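
To make the durability and availability figures concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it treats object losses as independent and uses a 365-day year, and the function names are illustrative.

```python
from fractions import Fraction

def expected_annual_losses(object_count, durability):
    """Expected number of objects lost per year, assuming independent losses."""
    return object_count * (1 - durability)

def downtime_minutes_per_year(availability):
    """Minutes of unavailability per 365-day year permitted by a target."""
    minutes_per_year = 365 * 24 * 60
    return float(minutes_per_year * (1 - availability))

# Fractions avoid floating-point noise when subtracting from 1.
eleven_nines = 1 - Fraction(1, 10**11)
losses = expected_annual_losses(10_000_000, eleven_nines)
print(losses)                                             # 1/10000
print(downtime_minutes_per_year(Fraction(9999, 10000)))   # 52.56
print(downtime_minutes_per_year(Fraction(999, 1000)))     # 525.6
```

One object in 10,000 years for every 10 million stored, and under an hour of downtime per year at 99.99% versus almost nine hours at 99.9%.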

High-level design#

   client                  API layer                metadata service              storage nodes
  ────────             ───────────────             ──────────────────           ─────────────────
  PUT key,bytes ──> auth, accept multipart ──> assign object ID, chunks ──> write chunks
                       ▲                        record metadata + ETag        to N hosts (RF=3
                       │                                                       or 6+3 erasure code)
  GET key      ──> lookup metadata ──> resolve to chunk hosts ──> stream chunks back
                       │                                                       to client
                       └── prefix list = scan keyspace shard

Three tiers: API (auth, parsing, routing), metadata (key → chunk locations + properties), storage (the actual bytes). All three are horizontally scaled.
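
The three tiers can be illustrated with a toy, in-memory model. Everything here is an assumption made for the sketch: the class names, the rotating placement policy, the tiny chunk size, and the use of SHA-256 as a stand-in ETag.

```python
import hashlib
import uuid

class StorageNode:
    """Storage tier: holds raw chunk bytes by chunk ID."""
    def __init__(self):
        self.chunks = {}

    def write(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def read(self, chunk_id):
        return self.chunks[chunk_id]

class BlobStore:
    """API tier (put/get) in front of a metadata dict and storage nodes."""
    def __init__(self, node_count=5, replication_factor=3, chunk_size=4):
        self.nodes = [StorageNode() for _ in range(node_count)]
        self.rf = replication_factor
        self.chunk_size = chunk_size
        self.metadata = {}  # key -> {"etag", "size", "chunks": [(id, hosts)]}

    def put(self, key, data):
        chunks = []
        for n, offset in enumerate(range(0, len(data), self.chunk_size)):
            chunk_id = uuid.uuid4().hex
            # Naive placement: RF consecutive nodes, rotated per chunk.
            hosts = [(n + r) % len(self.nodes) for r in range(self.rf)]
            for h in hosts:
                self.nodes[h].write(chunk_id, data[offset:offset + self.chunk_size])
            chunks.append((chunk_id, hosts))
        # Metadata is written last, so readers never see a half-written object.
        self.metadata[key] = {
            "etag": hashlib.sha256(data).hexdigest(),
            "size": len(data),
            "chunks": chunks,
        }

    def get(self, key, failed_nodes=()):
        out = bytearray()
        for chunk_id, hosts in self.metadata[key]["chunks"]:
            live = [h for h in hosts if h not in failed_nodes]
            if not live:
                raise IOError(f"all replicas of chunk {chunk_id} are down")
            out += self.nodes[live[0]].read(chunk_id)
        return bytes(out)

store = BlobStore()
store.put("greeting.txt", b"hello world")
print(store.get("greeting.txt", failed_nodes={0, 1}))  # b'hello world'
```

With RF=3, any two node failures still leave one live replica of every chunk.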

Detailed design#

Chunking#

Objects larger than a chunk size (typically 4-64 MB) are split. Each chunk is independently replicated and addressable. Benefits:

Parallel upload / download — N chunks streamed simultaneously.
Resumable uploads — chunk-level retries instead of restarting an 800 GB upload from byte 0.
Deduplication — content-addressed chunks (hash(bytes)) let popular bytes (Docker layers, npm packages) be stored once and referenced many times.
Byte-range GETs — Range: bytes=1000000-2000000 resolves to just the chunks that span the range.
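
As an illustration of the byte-range case, the sketch below maps an inclusive HTTP-style range onto fixed-size chunks. The function name and the fixed chunk size are assumptions; a real store may use variable-sized chunks.

```python
def chunks_for_range(first_byte, last_byte, chunk_size):
    """Map an inclusive byte range to (chunk_index, start, end) slices.

    `start` and `end` are offsets inside the chunk, with `end` exclusive.
    """
    plan = []
    for index in range(first_byte // chunk_size, last_byte // chunk_size + 1):
        chunk_start = index * chunk_size
        start = max(first_byte, chunk_start) - chunk_start
        end = min(last_byte, chunk_start + chunk_size - 1) - chunk_start + 1
        plan.append((index, start, end))
    return plan

MiB = 1024 * 1024
# Range: bytes=1000000-2000000 against 8 MiB chunks touches one chunk only.
print(chunks_for_range(1_000_000, 2_000_000, 8 * MiB))
# [(0, 1000000, 2000001)]
# With 1 MiB chunks the same range spans two chunks.
print(chunks_for_range(1_000_000, 2_000_000, 1 * MiB))
# [(0, 1000000, 1048576), (1, 0, 951425)]
```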

Multipart upload is the API surface: client calls CreateMultipartUpload → UploadPart (in any order) → CompleteMultipartUpload. The blob store stitches the parts into a single logical object.
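
The three-call flow can be mimicked with an in-memory stand-in. This is a sketch of the protocol shape only, not the S3 client API: the class, the method signatures, and the SHA-256 part ETags are assumptions.

```python
import hashlib

class MultipartUploads:
    def __init__(self):
        self.pending = {}    # upload_id -> {"key": ..., "parts": {number: bytes}}
        self.objects = {}    # key -> assembled bytes
        self._next_id = 0

    def create_multipart_upload(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self.pending[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Re-sending a part number overwrites it, which makes retries safe.
        self.pending[upload_id]["parts"][part_number] = data
        return hashlib.sha256(data).hexdigest()   # stand-in part ETag

    def complete_multipart_upload(self, upload_id, parts):
        """`parts` is the client's list of (part_number, etag) pairs."""
        upload = self.pending[upload_id]
        body = bytearray()
        for part_number, etag in sorted(parts):    # stitch in part order
            data = upload["parts"][part_number]
            if hashlib.sha256(data).hexdigest() != etag:
                raise ValueError(f"part {part_number} does not match its ETag")
            body += data
        del self.pending[upload_id]
        self.objects[upload["key"]] = bytes(body)

uploads = MultipartUploads()
uid = uploads.create_multipart_upload("big.bin")
etag2 = uploads.upload_part(uid, 2, b"world")     # parts may arrive out of order
etag1 = uploads.upload_part(uid, 1, b"hello ")
uploads.complete_multipart_upload(uid, [(2, etag2), (1, etag1)])
print(uploads.objects["big.bin"])                 # b'hello world'
```

The object only becomes visible at completion, so a failed upload never leaves a partial object under the key.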

Durability: replication vs erasure coding#

3-way replication        — store 3 copies. Survives 2 failures. Storage cost: 3×.
Reed-Solomon 6+3         — 6 data shards + 3 parity. Survives 3 failures. Storage cost: 1.5×.
Reed-Solomon 10+4        — 10 data + 4 parity. Survives 4 failures. Storage cost: 1.4×.

Erasure coding roughly halves storage cost versus 3-way replication (1.5× or 1.4× instead of 3×) while tolerating at least as many failures. The trade-off: a read has to gather shards from several hosts, a read that hits a failed shard must reconstruct it on the fly (slower), and encoding adds CPU cost on writes. A common pattern is replication for hot, recently written data and erasure coding once the data cools, for example after a few hours.
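
The overhead figures follow from (data + parity) / data. The recovery idea can be shown with the simplest possible erasure code, a single XOR parity shard. This is deliberately not Reed-Solomon: it survives only one lost shard and is meant as a sketch of the principle, with illustrative function names.

```python
def storage_overhead(data_shards, parity_shards):
    """Bytes stored per byte of user data."""
    return (data_shards + parity_shards) / data_shards

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_with_parity(shards):
    """XOR all equal-length data shards into one parity shard."""
    parity = shards[0]
    for shard in shards[1:]:
        parity = xor_bytes(parity, shard)
    return parity

def rebuild_missing(surviving_shards, parity):
    """Recover the single missing data shard from the survivors and parity."""
    missing = parity
    for shard in surviving_shards:
        missing = xor_bytes(missing, shard)
    return missing

print(storage_overhead(6, 3))    # 1.5
print(storage_overhead(10, 4))   # 1.4

shards = [b"blob", b"stor", b"e!!!"]
parity = encode_with_parity(shards)
# Lose the middle shard, then rebuild it from the other two plus parity.
print(rebuild_missing([shards[0], shards[2]], parity))   # b'stor'
```

Reed-Solomon generalizes this to several parity shards, so any k of the k+m shards are enough to rebuild the data.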

Metadata service#

The metadata service is the hard part. It maps (bucket, key) to {chunk_ids, etag, size, content_type, owner, acl, version_id}. For S3-scale traffic that’s hundreds of trillions of keys and many millions of QPS.

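Sharded KV — partition the keyspace by a hash of (bucket, key). Common approach. Point lookups are O(1), but hashing scatters neighbouring keys across shards, so it cannot serve prefix listings on its own.
B-tree per bucket — keeps keys in sorted order, which supports efficient prefix listing (LIST with a prefix). Sometimes layered on top of the KV.
Cached at edge — hot keys live in an in-memory cache; ETag + Last-Modified let a conditional GET whose validator still matches return 304 Not Modified with no body.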
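
A minimal sketch of the hash-sharded option shows why point lookups are cheap while LIST is not. The shard count, the NUL separator, and the function names are assumptions for illustration.

```python
import hashlib
import heapq

SHARD_COUNT = 4
shards = [dict() for _ in range(SHARD_COUNT)]   # each maps (bucket, key) -> metadata

def shard_for(bucket, key):
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    digest = hashlib.sha256(f"{bucket}\0{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

def put_metadata(bucket, key, meta):
    shards[shard_for(bucket, key)][(bucket, key)] = meta

def get_metadata(bucket, key):
    # Point lookup: exactly one shard is consulted.
    return shards[shard_for(bucket, key)][(bucket, key)]

def list_prefix(bucket, prefix):
    # LIST: every shard may hold matching keys, so scan all of them
    # and merge the per-shard sorted results.
    per_shard = (
        sorted(k for (b, k) in shard if b == bucket and k.startswith(prefix))
        for shard in shards
    )
    return list(heapq.merge(*per_shard))

put_metadata("photos", "2024/a.jpg", {"size": 10})
put_metadata("photos", "2024/b.jpg", {"size": 20})
put_metadata("photos", "2025/c.jpg", {"size": 30})
put_metadata("logs", "2024/x.log", {"size": 5})

print(get_metadata("photos", "2024/b.jpg"))   # {'size': 20}
print(list_prefix("photos", "2024/"))         # ['2024/a.jpg', '2024/b.jpg']
```

A sorted per-bucket index avoids the full fan-out, at the price of maintaining a second structure.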

Consistency#

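S3 has been strongly read-after-write consistent since December 2020: once a PUT (new object or overwrite) or DELETE succeeds, any subsequent GET, HEAD or LIST reflects it. Before that, overwrites, deletes and listings were only eventually consistent, which was a constant source of surprise.

LIST is the expensive operation to make consistent, because in a hash-sharded design it fans out across many metadata shards. S3’s LIST has been strongly consistent since the same 2020 change; systems that leave LIST eventually consistent trade that guarantee for cheaper listings.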

Lifecycle and tiering#

S3 storage classes by access pattern (prices are approximate US-region list prices and change over time):

Standard             — 99.99% availability, ~$0.023/GB/mo, ms latency
Intelligent-Tiering  — auto-migrates between tiers based on access
Standard-IA          — Infrequent Access, ~$0.0125/GB/mo, retrieval fee
One Zone-IA          — single AZ, ~$0.01/GB/mo, data is lost if that AZ is lost
Glacier Instant      — archive at ~$0.004/GB/mo, ms retrieval
Glacier Flexible     — minutes-to-hours retrieval, ~$0.0036/GB/mo
Glacier Deep Archive — retrieval within 12-48 hours, ~$0.00099/GB/mo

Lifecycle rules transition objects automatically: transition to STANDARD_IA after 30 days, transition to GLACIER after 90 days, expire after 7 years.
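
The example rule can be written down as data plus a small evaluator. The rule shape below is a simplified, hypothetical format rather than the S3 lifecycle configuration schema, the `logs/` prefix is invented for the example, and 7 years is approximated as 7 × 365 days.

```python
from datetime import date

RULES = [
    {
        "prefix": "logs/",
        "transitions": [(30, "STANDARD_IA"), (90, "GLACIER")],  # (age in days, class)
        "expire_days": 7 * 365,
    },
]

def lifecycle_action(key, created, today, rules=RULES):
    """Return the storage class an object should be in, or "EXPIRE"."""
    age = (today - created).days
    for rule in rules:
        if not key.startswith(rule["prefix"]):
            continue
        if age >= rule["expire_days"]:
            return "EXPIRE"
        storage_class = "STANDARD"
        # Walk transitions in age order; the last one reached wins.
        for days, target in sorted(rule["transitions"]):
            if age >= days:
                storage_class = target
        return storage_class
    return "STANDARD"   # no rule matches this key

today = date(2025, 1, 1)
print(lifecycle_action("logs/app.log", date(2024, 12, 20), today))  # STANDARD
print(lifecycle_action("logs/app.log", date(2024, 11, 1), today))   # STANDARD_IA
print(lifecycle_action("logs/app.log", date(2024, 6, 1), today))    # GLACIER
print(lifecycle_action("logs/app.log", date(2017, 1, 1), today))    # EXPIRE
```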

Versioning and immutability#

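Versioning keeps every PUT as a distinct version (?versionId=...). A DELETE without a version ID only adds a delete marker, so earlier versions stay retrievable. Useful for undelete and audit. Cost: storage grows with every overwrite until old versions are expired.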
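
A toy model of versioning with delete markers, to show why undelete is cheap. The class, its method names, and the `v1`, `v2` style version IDs are assumptions for this sketch.

```python
import itertools

class VersionedBucket:
    DELETE_MARKER = object()

    def __init__(self):
        self.versions = {}              # key -> [(version_id, body)], newest last
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self.versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        # A plain delete hides the object behind a marker; no data is erased.
        return self.put(key, self.DELETE_MARKER)

    def get(self, key, version_id=None):
        history = self.versions.get(key, [])
        if version_id is None:
            if not history or history[-1][1] is self.DELETE_MARKER:
                raise KeyError(key)
            return history[-1][1]
        for vid, body in history:
            if vid == version_id and body is not self.DELETE_MARKER:
                return body
        raise KeyError((key, version_id))

    def undelete(self, key):
        # Removing the delete marker makes the previous version current again.
        history = self.versions.get(key, [])
        if history and history[-1][1] is self.DELETE_MARKER:
            history.pop()

bucket = VersionedBucket()
v1 = bucket.put("report.pdf", b"draft")
v2 = bucket.put("report.pdf", b"final")
bucket.delete("report.pdf")
print(bucket.get("report.pdf", version_id=v1))   # b'draft'
bucket.undelete("report.pdf")
print(bucket.get("report.pdf"))                  # b'final'
```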

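Object Lock retention periods and legal holds enforce write-once-read-many (WORM) semantics: a locked version cannot be overwritten or deleted until its retention expires. This is used to meet record-retention rules such as SEC 17a-4, often alongside HIPAA retention policies, and it is a defense against ransomware encrypting or deleting your backups.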

Signed URLs#

Time-bounded credentials baked into a URL:

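https://example-bucket.s3.amazonaws.com/private/file.pdf?
  X-Amz-Algorithm=AWS4-HMAC-SHA256&
  X-Amz-Credential=<access-key-id>/<date>/<region>/s3/aws4_request&
  X-Amz-Date=<timestamp>&
  X-Amz-Expires=3600&
  X-Amz-SignedHeaders=host&
  X-Amz-Signature=<hex HMAC-SHA256 signature, keyed from the secret>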

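Clients GET / PUT the URL directly without proxying through your servers, so large transfers consume none of your bandwidth. A single presigned PUT is still bound by the 5 GB single-PUT cap; objects up to 5 TB need a multipart upload with one presigned URL per part. The signing happens on your backend; the bytes flow directly between user and S3.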
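
The idea behind a signed URL can be shown with a deliberately simplified HMAC scheme. This is not AWS Signature Version 4: the query parameter names, the message layout, and the demo secret are assumptions made for the sketch.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"demo-secret"   # hypothetical key, known only to the signer and verifier

def _signature(method, url, expires_at, secret):
    message = f"{method}\n{url}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def sign_url(method, url, expires_at, secret=SECRET):
    """Backend side: bind method, URL and expiry time into a signature."""
    query = urlencode({"expires": expires_at,
                       "signature": _signature(method, url, expires_at, secret)})
    return f"{url}?{query}"

def verify_url(method, signed_url, now, secret=SECRET):
    """Storage side: recompute the signature and check the expiry."""
    parts = urlsplit(signed_url)
    query = parse_qs(parts.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    expected = _signature(method, url, expires_at, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())

signed = sign_url("GET", "https://blobs.example.com/private/file.pdf", expires_at=1000)
print(verify_url("GET", signed, now=999))    # True
print(verify_url("GET", signed, now=1001))   # False (expired)
print(verify_url("PUT", signed, now=999))    # False (wrong method)
```

Because the method, path and expiry are all inside the signed message, a client cannot reuse a download link to upload, change the key, or extend its lifetime.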

Notifications#

Object events (created, deleted, restored) publish to an event bus:

S3 → SNS / SQS / Lambda / EventBridge → consumer

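Triggers the canonical pipeline: user uploads photo → S3 notification → Lambda resizes → writes derivatives back. No polling; events typically arrive within seconds, and because delivery is at-least-once the consumer should be idempotent. Write derivatives under a different prefix or bucket so the function does not trigger itself.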
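
A consumer for such events might look like the sketch below. The record fields follow the S3 event notification message layout (`eventName`, `s3.bucket.name`, URL-encoded `s3.object.key`, `sequencer`); the `derived/` prefix, the in-memory dedup set, and the `resize` callback are assumptions for illustration.

```python
from urllib.parse import unquote_plus

processed = set()   # stand-in for a durable dedup table

def handle_event(event, resize):
    """Process object-created records once each; return the keys handled."""
    handled = []
    for record in event.get("Records", []):
        if not record["eventName"].startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in the event payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith("derived/"):
            continue   # skip our own output so the pipeline cannot loop
        dedup_id = (bucket, key, record["s3"]["object"].get("sequencer"))
        if dedup_id in processed:
            continue   # duplicate delivery of the same event
        resize(bucket, key)
        processed.add(dedup_id)
        handled.append(key)
    return handled

event = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "uploads"},
           "object": {"key": "photos/my+cat.jpg", "sequencer": "0055AED6DCD90281E5"}},
}]}

calls = []
print(handle_event(event, lambda b, k: calls.append((b, k))))   # ['photos/my cat.jpg']
print(handle_event(event, lambda b, k: calls.append((b, k))))   # [] (already processed)
```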

Trade-offs#

3-way replication (RF=3) — fast writes, fast reads, simple recovery, 3× storage cost. Standard for hot tiers.

Erasure coding (e.g. 10+4) — 1.4× storage, same durability, but every read fetches multiple shards (more network) and reconstruction on failure is CPU-heavy. Used for cold tiers and archive.

Other axes:

Strong consistency vs LIST performance — LIST is fundamentally a fan-out across shards. Strong-consistent LIST is expensive; eventual LIST is cheap.
Inline metadata vs external index — small per-object metadata on the object itself is simple; a separate inverted index (Elasticsearch) on top enables rich queries at the cost of an extra service.
Self-hosted (MinIO, Ceph) vs cloud (S3, GCS) — self-hosted can be cheaper at petabyte scale and avoids egress fees, but you own durability operations (disk replacement, repair, capacity planning). S3 is the usual benchmark for managed durability.

Real-world examples#

Amazon S3 — the canonical blob store; stores data for Netflix and Pinterest, held Dropbox’s file data until Dropbox moved to in-house storage, and serves a vast fraction of the web’s static assets.
Google Cloud Storage — similar shape, with stronger multi-region semantics by default; backs Spotify’s CDN origin.
Azure Blob Storage — block / page / append-blob variants; Microsoft Teams stores chat attachments here.
Cloudflare R2 — S3-compatible API with zero egress fees, a disruptive move of the early 2020s; popular for CDN origins and AI training data.
MinIO — open-source, S3-compatible; deployed on-prem and in regulated industries.
Backblaze B2 — budget cloud storage with simple pricing; popular for backups and indie media archives.
Cloudflare’s R2 + Backblaze’s B2 — together with the Bandwidth Alliance, put real pressure on S3’s egress-fee lock-in.
Tigris, Wasabi — other S3-compatible providers; Tigris emphasizes globally distributed storage, Wasabi flat pricing with no egress charges.

Related building blocks#

Content Delivery Network — typical front for blob-store origin.
Distributed Search — used to index blob metadata and enable “search across uploaded documents”.
Databases — typically hold the blob keys as references, while the bytes themselves stay in the blob store.
Distributed Messaging Queue — object notifications feed pipelines via SNS/SQS or equivalents.