AI / ML Data Infrastructure
Feature stores, training data pipelines, online vs batch features, lineage, vector storage.
Step 1 — Clarify Requirements#
Functional
- Ingest raw events (clicks, transactions, sensor data) and turn them into ML-usable features.
- Serve features online (single-digit ms) for model inference.
- Serve features offline (batched) for training and backfills.
- Maintain lineage: which feature was derived from which source, with which transformation, at what version.
- Store and search embeddings (vectors) for retrieval / similarity.
- Out of scope: model training infrastructure, the inference serving layer (see
/system-design/chatgpt-system).
Non-functional
- Online feature serving: p99 latency under 10 ms.
- Offline feature retrieval: training a model should be able to read TB-scale joined feature tables in tens of minutes.
- Training/serving consistency: features used at training time and at serving time must be derived identically. Drift here is the leading cause of model regression.
- Reproducibility: any prior model artifact must be re-trainable from versioned data.
Step 2 — Capacity Estimation#
- Event ingestion: 10 M events/sec (top company scale).
- Features computed per event: ~10-100. Total feature writes: ~1 B/sec.
- Online feature reads: 1 M inference QPS × 200 features/req = 200 M feature reads/sec.
- Offline feature volume: a year of events ≈ 300 PB raw; processed feature tables ≈ 100 PB.
- Vector index: 1 B embeddings × 1024 dim × 4 B = 4 TB per index. Several indexes per company.
- Models in production: ~1,000 across an org; each with its own feature set, retrained daily-to-monthly.
The system isn’t one piece — it’s three loosely-coupled subsystems (online features, offline features, vectors) sharing lineage and a control plane.
Step 3 — System Interface#
// Online feature serving (low latency, single-row)GET /features/online ?entities={user_id: 'u-123', item_id: 'i-456'} &features=['user_avg_session_len', 'user_item_affinity', ...]→ { feature_name: value, ... }
// Offline feature retrieval (batched, training)POST /features/offline/get_historical Body: { entity_df: ParquetURI, features: [...], event_ts: ColumnName }→ ParquetURI of joined table (point-in-time-correct features)
// Feature definition (registry)POST /features/register Body: { name, source, transform, entities, freshness_sla, owner, ... }
// VectorPOST /vectors/:index/search Body: { query_vector, k, filter? }→ list of (id, score, payload)Step 4 — High-Level Design#
┌── Kafka / Kinesis (raw events) │ raw event stream │ ┌───────────┴────────────┐ │ │ streaming pipeline batch pipeline (Flink / Spark Stream) (Spark / dbt / Airflow) │ │ ▼ ▼ online feature store offline feature store (Redis / DynamoDB) (Parquet on object store + Iceberg / Hudi) │ │ └──────────┬─────────────┘ │ ▼ Feature registry + lineage (control plane) │ ├── used by training jobs ─→ model artifact store │ └── used by inference ─→ model serving layer
Side track: embedding model ─→ vector store (ANN index)The control plane (feature registry, lineage, freshness SLAs) is the part that turns “a bunch of Spark jobs” into a real platform.
Step 5 — Data Model#
Feature definitions (registry)#
feature: user_avg_session_len_30d source: events.user_sessions entity: user_id transform: avg(duration) over last 30d, group by user_id freshness: daily schema: double owner: team-personalization version: v3Each feature is versioned. Existing models pin to a version; new training runs may pick up a newer version. Backwards-incompatible changes get a new feature name.
Online store#
A key-value table keyed by (feature_name, entity_id):
key: feature:user_avg_session_len:v3:u-123value: { value: 12.5, ts: 2026-05-15T08:00:00Z, version: v3 }Latency target: 1-5 ms per batch of feature lookups. Sharded by (feature_name, entity_id) hash for horizontal scale. Common: Redis, DynamoDB, Cassandra, or purpose-built (Feast, Tecton).
Offline store#
Materialized feature tables in Parquet on object storage, partitioned by date:
s3://features/user_avg_session_len/v3/dt=2026-05-15/part-*.parquet columns: user_id, value, ts, versionRead via Spark / Presto / Trino for training data assembly. Iceberg or Hudi provides snapshot isolation, schema evolution, and time travel.
Lineage#
feature: user_avg_session_len:v3 source_datasets: [events.user_sessions] transform_code: git_sha dependencies: [user_session_canonical_v2] produced_at: timestamps of last runs consumed_by: [model_x_v7, model_y_v3, ...]A graph; the registry walks it to answer “if I change this source, what features and models are affected?”.
Vector store#
Keyed by an opaque id; each entry has:
- The vector (float32 or quantized).
- A payload (e.g., article metadata) for filter and return.
- Indexing metadata for ANN.
ANN algorithms: HNSW (hierarchical navigable small worlds) is the modern default — high recall, low query latency, manageable memory.
Step 6 — Detailed Design#
The training-serving skew problem#
The single most insidious bug in ML systems: a feature is computed one way for training and a slightly different way at serving time. Examples:
- Training uses “events from yesterday joined on date” but serving uses “live events as of now” → skew.
- Training computes
meanover a window; serving uses an exponentially-weighted moving average → skew.
The fix is one transformation pipeline that produces both. The streaming pipeline writes to the online store; the batch pipeline writes to the offline store. The transform code is the same (same library, ideally same binary). A skew detector compares samples periodically and alerts on drift.
Online feature serving: latency#
Inference often needs 100-1000 features per call. Each lookup is 1-2 ms; sequential, that’s seconds. Strategy:
- Batch lookup API: send all features at once; the store fans out internally and merges responses.
- Per-entity locality: many features share the same entity (
user_id). One row holds many features → one read = many values. - Co-locate hot features: features that are always requested together can be packed into a single column-family row.
Realistic budget: 200 features in 5 ms via batched lookups against a sharded Redis-class store.
Offline feature retrieval: point-in-time correctness#
A model trained on yesterday’s logs must see features as of the moment the training event happened, not as of training time. Otherwise we leak future information:
event at t=10:00:00: user clickedfeature user_avg_session_len = "value as of 10:00:00", NOT as of training timeImplementation: every feature row carries a ts. The join is an as-of join:
SELECT * FROM events eASOF LEFT JOIN features f ON f.user_id = e.user_id AND f.ts <= e.event_tsModern frameworks (Feast, Tecton, Iceberg) support this natively.
Streaming vs batch: which features go where#
< 5 s freshness). Higher infra cost; harder to backfill. Most features are batch. Only the ~20% that need second-level freshness justify streaming complexity. The same feature can be dual-computed (streaming for online, batch for training catch-up) if needed.
Vector store#
write: embedding = embed_model(payload) index.insert(id, embedding, payload)
read: query_vec = embed_model(query) results = index.search(query_vec, k=10, filter=...) return resultsANN index choice:
- HNSW: in-memory, fastest queries, high memory cost (vector × ~2-3× overhead). Best when index fits in RAM.
- IVF + PQ (Faiss-style): disk-friendly, more tunable, slightly lower recall at similar memory. Best for very large indexes.
- Hybrid (DiskANN / Vamana): vectors on disk, lightweight in-memory navigation graph.
Most modern systems use HNSW for indexes up to ~100 M; switch to IVF / disk-based for larger.
Embeddings pipelines#
new entity created → embed via current model → write to vector storemodel upgrade → re-embed entire corpus offline → swap indexes atomicallyRe-embedding billions of items takes hours-to-days; atomic index swap (build the new index alongside the old; flip the alias) avoids downtime.
Lineage and impact analysis#
A change to a source table can ripple through hundreds of features and dozens of models. The lineage graph answers:
- “If I change
events.user_sessionsschema, what features break?” - “Model X regressed yesterday — what upstream features changed?”
- “Re-train model Y from scratch — what raw datasets and transform versions are needed?”
This is a graph database problem; storage is small (~millions of nodes), updates are infrequent.
Freshness SLAs and monitoring#
Each feature declares a freshness SLA: “this feature is at most 10 minutes stale”. The control plane monitors the actual freshness (timestamp of latest write per feature). If SLA is violated:
- Alert the feature owner.
- Optionally route inference to a fallback (older value, default, model-side handling).
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: online feature read fanout. A model needing 1000 features at 1 ms each is the worst case. Real systems aggressively co-locate features per entity (packed rows) and accept that the rare cross-entity feature lookup costs more. New models are constrained to feature sets within bandwidth budget.
Bottleneck #2: backfill cost. A new feature retroactively computed over 2 years of events is a multi-hour Spark job at petabyte scale. The control plane queues backfills with priority lanes so they don’t starve everyday batch jobs.
Bottleneck #3: vector index update latency. HNSW supports incremental insertion but degrades with high churn; periodic full rebuild is required. For high-churn corpora (e.g., LLM context retrieval over user docs), the rebuild cadence becomes critical.
Alternative I’d push back on: skipping the feature store and computing features inline in model code. Looks lean for one model; becomes spaghetti when 50 models share inputs. The feature store’s main job is forcing one canonical definition per feature — that’s the hard organizational win, not the technology.
What breaks first at 10× scale: lineage graph maintenance. A graph with 100s of millions of edges across thousands of teams becomes unmanageable without strict naming conventions and per-team ownership boundaries. The technical bottleneck is light; the organizational one is real.
Companies this resembles#
Uber Michelangelo, Airbnb Zipline, Meta FBLearner, Netflix’s metaflow ecosystem, LinkedIn’s Liger, Tecton (commercial Feast-like), Databricks Feature Engineering, Pinecone / Weaviate / Qdrant on the vector side.
Related systems#
- ChatGPT-style Conversational System — depends on this design for retrieval features.
- LLM-Powered Customer Support Bot — RAG on top of a vector store.
- Newsfeed System — heaviest consumer of online feature serving in most companies.
- Distributed Task Scheduler — substrate for batch backfills and re-embedding jobs.