AI / ML Data Infrastructure — System Design

Step 1 — Clarify Requirements#

Functional

Ingest raw events (clicks, transactions, sensor data) and turn them into ML-usable features.
Serve features online (single-digit ms) for model inference.
Serve features offline (batched) for training and backfills.
Maintain lineage: which feature was derived from which source, with which transformation, at what version.
Store and search embeddings (vectors) for retrieval / similarity.
Out of scope: model training infrastructure, the inference serving layer (see /system-design/chatgpt-system).

Non-functional

Online feature serving: p99 latency under 10 ms.
Offline feature retrieval: training a model should be able to read TB-scale joined feature tables in tens of minutes.
Training/serving consistency: features used at training time and at serving time must be derived identically. Drift here is the leading cause of model regression.
Reproducibility: any prior model artifact must be re-trainable from versioned data.

Step 2 — Capacity Estimation#

Event ingestion: 10 M events/sec (top company scale).
Features computed per event: ~10-100. Total feature writes: ~1 B/sec.
Online feature reads: 1 M inference QPS × 200 features/req = 200 M feature reads/sec.
Offline feature volume: a year of events ≈ 300 PB raw; processed feature tables ≈ 100 PB.
Vector index: 1 B embeddings × 1024 dim × 4 B = 4 TB per index. Several indexes per company.
Models in production: ~1,000 across an org; each with its own feature set, retrained daily-to-monthly.

The system isn’t one piece — it’s three loosely-coupled subsystems (online features, offline features, vectors) sharing lineage and a control plane.

Step 3 — System Interface#

// Online feature serving (low latency, single-row)
GET /features/online
   ?entities={user_id: 'u-123', item_id: 'i-456'}
   &features=['user_avg_session_len', 'user_item_affinity', ...]
→ { feature_name: value, ... }

// Offline feature retrieval (batched, training)
POST /features/offline/get_historical
   Body: { entity_df: ParquetURI, features: [...], event_ts: ColumnName }
→ ParquetURI of joined table (point-in-time-correct features)

// Feature definition (registry)
POST /features/register
   Body: { name, source, transform, entities, freshness_sla, owner, ... }

// Vector
POST /vectors/:index/search
   Body: { query_vector, k, filter? }
→ list of (id, score, payload)

Step 4 — High-Level Design#

                        ┌── Kafka / Kinesis (raw events)
                        │
              raw event stream
                        │
            ┌───────────┴────────────┐
            │                        │
    streaming pipeline      batch pipeline
    (Flink / Spark Stream)   (Spark / dbt / Airflow)
            │                        │
            ▼                        ▼
     online feature store     offline feature store
     (Redis / DynamoDB)        (Parquet on object store + Iceberg / Hudi)
            │                        │
            └──────────┬─────────────┘
                       │
                       ▼
            Feature registry + lineage (control plane)
                       │
                       ├── used by training jobs ─→ model artifact store
                       │
                       └── used by inference  ─→ model serving layer

    Side track:  embedding model ─→ vector store (ANN index)

The control plane (feature registry, lineage, freshness SLAs) is the part that turns “a bunch of Spark jobs” into a real platform.

Step 5 — Data Model#

Feature definitions (registry)#

feature: user_avg_session_len_30d
  source:       events.user_sessions
  entity:       user_id
  transform:    avg(duration) over last 30d, group by user_id
  freshness:    daily
  schema:       double
  owner:        team-personalization
  version:      v3

Each feature is versioned. Existing models pin to a version; new training runs may pick up a newer version. Backwards-incompatible changes get a new feature name.

Online store#

A key-value table keyed by (feature_name, entity_id):

key:    feature:user_avg_session_len:v3:u-123
value:  { value: 12.5, ts: 2026-05-15T08:00:00Z, version: v3 }

Latency target: 1-5 ms per batch of feature lookups. Sharded by (feature_name, entity_id) hash for horizontal scale. Common: Redis, DynamoDB, Cassandra, or purpose-built (Feast, Tecton).

Offline store#

Materialized feature tables in Parquet on object storage, partitioned by date:

s3://features/user_avg_session_len/v3/dt=2026-05-15/part-*.parquet
  columns: user_id, value, ts, version

Read via Spark / Presto / Trino for training data assembly. Iceberg or Hudi provides snapshot isolation, schema evolution, and time travel.

Lineage#

feature: user_avg_session_len:v3
  source_datasets: [events.user_sessions]
  transform_code: git_sha
  dependencies: [user_session_canonical_v2]
  produced_at: timestamps of last runs
  consumed_by: [model_x_v7, model_y_v3, ...]

A graph; the registry walks it to answer “if I change this source, what features and models are affected?”.

Vector store#

Keyed by an opaque id; each entry has:

The vector (float32 or quantized).
A payload (e.g., article metadata) for filter and return.
Indexing metadata for ANN.

ANN algorithms: HNSW (hierarchical navigable small worlds) is the modern default — high recall, low query latency, manageable memory.

Step 6 — Detailed Design#

The training-serving skew problem#

The single most insidious bug in ML systems: a feature is computed one way for training and a slightly different way at serving time. Examples:

Training uses “events from yesterday joined on date” but serving uses “live events as of now” → skew.
Training computes mean over a window; serving uses an exponentially-weighted moving average → skew.

The fix is one transformation pipeline that produces both. The streaming pipeline writes to the online store; the batch pipeline writes to the offline store. The transform code is the same (same library, ideally same binary). A skew detector compares samples periodically and alerts on drift.

Online feature serving: latency#

Inference often needs 100-1000 features per call. Each lookup is 1-2 ms; sequential, that’s seconds. Strategy:

Batch lookup API: send all features at once; the store fans out internally and merges responses.
Per-entity locality: many features share the same entity (user_id). One row holds many features → one read = many values.
Co-locate hot features: features that are always requested together can be packed into a single column-family row.

Realistic budget: 200 features in 5 ms via batched lookups against a sharded Redis-class store.

Offline feature retrieval: point-in-time correctness#

A model trained on yesterday’s logs must see features as of the moment the training event happened, not as of training time. Otherwise we leak future information:

event at t=10:00:00:    user clicked
feature user_avg_session_len = "value as of 10:00:00", NOT as of training time

Implementation: every feature row carries a ts. The join is an as-of join:

SELECT * FROM events e
ASOF LEFT JOIN features f ON f.user_id = e.user_id AND f.ts <= e.event_ts

Modern frameworks (Feast, Tecton, Iceberg) support this natively.

Streaming vs batch: which features go where#

Streaming features — derived from real-time events (clicks-in-last-minute, current-cart-value). Low latency (< 5 s freshness). Higher infra cost; harder to backfill.

Batch features — derived from offline data (lifetime avg, demographics). Daily / hourly freshness. Cheap; trivially backfilled.

Most features are batch. Only the ~20% that need second-level freshness justify streaming complexity. The same feature can be dual-computed (streaming for online, batch for training catch-up) if needed.

Vector store#

write:
   embedding = embed_model(payload)
   index.insert(id, embedding, payload)

read:
   query_vec = embed_model(query)
   results = index.search(query_vec, k=10, filter=...)
   return results

ANN index choice:

HNSW: in-memory, fastest queries, high memory cost (vector × ~2-3× overhead). Best when index fits in RAM.
IVF + PQ (Faiss-style): disk-friendly, more tunable, slightly lower recall at similar memory. Best for very large indexes.
Hybrid (DiskANN / Vamana): vectors on disk, lightweight in-memory navigation graph.

Most modern systems use HNSW for indexes up to ~100 M; switch to IVF / disk-based for larger.

Embeddings pipelines#

new entity created → embed via current model → write to vector store
model upgrade      → re-embed entire corpus offline → swap indexes atomically

Re-embedding billions of items takes hours-to-days; atomic index swap (build the new index alongside the old; flip the alias) avoids downtime.

Lineage and impact analysis#

A change to a source table can ripple through hundreds of features and dozens of models. The lineage graph answers:

“If I change events.user_sessions schema, what features break?”
“Model X regressed yesterday — what upstream features changed?”
“Re-train model Y from scratch — what raw datasets and transform versions are needed?”

This is a graph database problem; storage is small (~millions of nodes), updates are infrequent.

Freshness SLAs and monitoring#

Each feature declares a freshness SLA: “this feature is at most 10 minutes stale”. The control plane monitors the actual freshness (timestamp of latest write per feature). If SLA is violated:

Alert the feature owner.
Optionally route inference to a fallback (older value, default, model-side handling).

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: online feature read fanout. A model needing 1000 features at 1 ms each is the worst case. Real systems aggressively co-locate features per entity (packed rows) and accept that the rare cross-entity feature lookup costs more. New models are constrained to feature sets within bandwidth budget.

Bottleneck #2: backfill cost. A new feature retroactively computed over 2 years of events is a multi-hour Spark job at petabyte scale. The control plane queues backfills with priority lanes so they don’t starve everyday batch jobs.

Bottleneck #3: vector index update latency. HNSW supports incremental insertion but degrades with high churn; periodic full rebuild is required. For high-churn corpora (e.g., LLM context retrieval over user docs), the rebuild cadence becomes critical.

Alternative I’d push back on: skipping the feature store and computing features inline in model code. Looks lean for one model; becomes spaghetti when 50 models share inputs. The feature store’s main job is forcing one canonical definition per feature — that’s the hard organizational win, not the technology.

What breaks first at 10× scale: lineage graph maintenance. A graph with 100s of millions of edges across thousands of teams becomes unmanageable without strict naming conventions and per-team ownership boundaries. The technical bottleneck is light; the organizational one is real.

Companies this resembles#

Uber Michelangelo, Airbnb Zipline, Meta FBLearner, Netflix’s metaflow ecosystem, LinkedIn’s Liger, Tecton (commercial Feast-like), Databricks Feature Engineering, Pinecone / Weaviate / Qdrant on the vector side.

ChatGPT-style Conversational System — depends on this design for retrieval features.
LLM-Powered Customer Support Bot — RAG on top of a vector store.
Newsfeed System — heaviest consumer of online feature serving in most companies.
Distributed Task Scheduler — substrate for batch backfills and re-embedding jobs.

Step 1 — Clarify Requirements#

Step 2 — Capacity Estimation#

Step 3 — System Interface#

Step 4 — High-Level Design#

Step 5 — Data Model#

Feature definitions (registry)#

Online store#

Offline store#

Lineage#

Vector store#

Step 6 — Detailed Design#

The training-serving skew problem#

Online feature serving: latency#

Offline feature retrieval: point-in-time correctness#

Streaming vs batch: which features go where#

Vector store#

Embeddings pipelines#

Lineage and impact analysis#

Freshness SLAs and monitoring#

Step 7 — Evaluation & Trade-offs#

Companies this resembles#

Related systems#