AI / ML Data Infrastructure

Feature stores, training data pipelines, online vs batch features, lineage, vector storage.

System Advanced
9 min read
ml feature-store training vector-db
Companies this resembles: Uber Michelangelo · Meta FBLearner · Tecton · Databricks

Step 1 — Clarify Requirements#

Functional

  • Ingest raw events (clicks, transactions, sensor data) and turn them into ML-usable features.
  • Serve features online (single-digit ms) for model inference.
  • Serve features offline (batched) for training and backfills.
  • Maintain lineage: which feature was derived from which source, with which transformation, at what version.
  • Store and search embeddings (vectors) for retrieval / similarity.
  • Out of scope: model training infrastructure, the inference serving layer (see /system-design/chatgpt-system).

Non-functional

  • Online feature serving: p99 latency under 10 ms.
  • Offline feature retrieval: training a model should be able to read TB-scale joined feature tables in tens of minutes.
  • Training/serving consistency: features used at training time and at serving time must be derived identically. Drift here is the leading cause of model regression.
  • Reproducibility: any prior model artifact must be re-trainable from versioned data.

Step 2 — Capacity Estimation#

  • Event ingestion: 10 M events/sec (top company scale).
  • Features computed per event: ~10-100. Total feature writes: ~1 B/sec.
  • Online feature reads: 1 M inference QPS × 200 features/req = 200 M feature reads/sec.
  • Offline feature volume: a year of events ≈ 300 PB raw; processed feature tables ≈ 100 PB.
  • Vector index: 1 B embeddings × 1024 dim × 4 B = 4 TB per index. Several indexes per company.
  • Models in production: ~1,000 across an org; each with its own feature set, retrained daily-to-monthly.

The system isn’t one piece — it’s three loosely-coupled subsystems (online features, offline features, vectors) sharing lineage and a control plane.

Step 3 — System Interface#

// Online feature serving (low latency, single-row)
GET /features/online
?entities={user_id: 'u-123', item_id: 'i-456'}
&features=['user_avg_session_len', 'user_item_affinity', ...]
→ { feature_name: value, ... }
// Offline feature retrieval (batched, training)
POST /features/offline/get_historical
Body: { entity_df: ParquetURI, features: [...], event_ts: ColumnName }
→ ParquetURI of joined table (point-in-time-correct features)
// Feature definition (registry)
POST /features/register
Body: { name, source, transform, entities, freshness_sla, owner, ... }
// Vector
POST /vectors/:index/search
Body: { query_vector, k, filter? }
→ list of (id, score, payload)

Step 4 — High-Level Design#

┌── Kafka / Kinesis (raw events)
raw event stream
┌───────────┴────────────┐
│ │
streaming pipeline batch pipeline
(Flink / Spark Stream) (Spark / dbt / Airflow)
│ │
▼ ▼
online feature store offline feature store
(Redis / DynamoDB) (Parquet on object store + Iceberg / Hudi)
│ │
└──────────┬─────────────┘
Feature registry + lineage (control plane)
├── used by training jobs ─→ model artifact store
└── used by inference ─→ model serving layer
Side track: embedding model ─→ vector store (ANN index)

The control plane (feature registry, lineage, freshness SLAs) is the part that turns “a bunch of Spark jobs” into a real platform.

Step 5 — Data Model#

Feature definitions (registry)#

feature: user_avg_session_len_30d
source: events.user_sessions
entity: user_id
transform: avg(duration) over last 30d, group by user_id
freshness: daily
schema: double
owner: team-personalization
version: v3

Each feature is versioned. Existing models pin to a version; new training runs may pick up a newer version. Backwards-incompatible changes get a new feature name.

Online store#

A key-value table keyed by (feature_name, entity_id):

key: feature:user_avg_session_len:v3:u-123
value: { value: 12.5, ts: 2026-05-15T08:00:00Z, version: v3 }

Latency target: 1-5 ms per batch of feature lookups. Sharded by (feature_name, entity_id) hash for horizontal scale. Common: Redis, DynamoDB, Cassandra, or purpose-built (Feast, Tecton).

Offline store#

Materialized feature tables in Parquet on object storage, partitioned by date:

s3://features/user_avg_session_len/v3/dt=2026-05-15/part-*.parquet
columns: user_id, value, ts, version

Read via Spark / Presto / Trino for training data assembly. Iceberg or Hudi provides snapshot isolation, schema evolution, and time travel.

Lineage#

feature: user_avg_session_len:v3
source_datasets: [events.user_sessions]
transform_code: git_sha
dependencies: [user_session_canonical_v2]
produced_at: timestamps of last runs
consumed_by: [model_x_v7, model_y_v3, ...]

A graph; the registry walks it to answer “if I change this source, what features and models are affected?”.

Vector store#

Keyed by an opaque id; each entry has:

  • The vector (float32 or quantized).
  • A payload (e.g., article metadata) for filter and return.
  • Indexing metadata for ANN.

ANN algorithms: HNSW (hierarchical navigable small worlds) is the modern default — high recall, low query latency, manageable memory.

Step 6 — Detailed Design#

The training-serving skew problem#

The single most insidious bug in ML systems: a feature is computed one way for training and a slightly different way at serving time. Examples:

  • Training uses “events from yesterday joined on date” but serving uses “live events as of now” → skew.
  • Training computes mean over a window; serving uses an exponentially-weighted moving average → skew.

The fix is one transformation pipeline that produces both. The streaming pipeline writes to the online store; the batch pipeline writes to the offline store. The transform code is the same (same library, ideally same binary). A skew detector compares samples periodically and alerts on drift.

Online feature serving: latency#

Inference often needs 100-1000 features per call. Each lookup is 1-2 ms; sequential, that’s seconds. Strategy:

  • Batch lookup API: send all features at once; the store fans out internally and merges responses.
  • Per-entity locality: many features share the same entity (user_id). One row holds many features → one read = many values.
  • Co-locate hot features: features that are always requested together can be packed into a single column-family row.

Realistic budget: 200 features in 5 ms via batched lookups against a sharded Redis-class store.

Offline feature retrieval: point-in-time correctness#

A model trained on yesterday’s logs must see features as of the moment the training event happened, not as of training time. Otherwise we leak future information:

event at t=10:00:00: user clicked
feature user_avg_session_len = "value as of 10:00:00", NOT as of training time

Implementation: every feature row carries a ts. The join is an as-of join:

SELECT * FROM events e
ASOF LEFT JOIN features f ON f.user_id = e.user_id AND f.ts <= e.event_ts

Modern frameworks (Feast, Tecton, Iceberg) support this natively.

Streaming vs batch: which features go where#

Streaming features — derived from real-time events (clicks-in-last-minute, current-cart-value). Low latency (< 5 s freshness). Higher infra cost; harder to backfill.
Batch features — derived from offline data (lifetime avg, demographics). Daily / hourly freshness. Cheap; trivially backfilled.

Most features are batch. Only the ~20% that need second-level freshness justify streaming complexity. The same feature can be dual-computed (streaming for online, batch for training catch-up) if needed.

Vector store#

write:
embedding = embed_model(payload)
index.insert(id, embedding, payload)
read:
query_vec = embed_model(query)
results = index.search(query_vec, k=10, filter=...)
return results

ANN index choice:

  • HNSW: in-memory, fastest queries, high memory cost (vector × ~2-3× overhead). Best when index fits in RAM.
  • IVF + PQ (Faiss-style): disk-friendly, more tunable, slightly lower recall at similar memory. Best for very large indexes.
  • Hybrid (DiskANN / Vamana): vectors on disk, lightweight in-memory navigation graph.

Most modern systems use HNSW for indexes up to ~100 M; switch to IVF / disk-based for larger.

Embeddings pipelines#

new entity created → embed via current model → write to vector store
model upgrade → re-embed entire corpus offline → swap indexes atomically

Re-embedding billions of items takes hours-to-days; atomic index swap (build the new index alongside the old; flip the alias) avoids downtime.

Lineage and impact analysis#

A change to a source table can ripple through hundreds of features and dozens of models. The lineage graph answers:

  • “If I change events.user_sessions schema, what features break?”
  • “Model X regressed yesterday — what upstream features changed?”
  • “Re-train model Y from scratch — what raw datasets and transform versions are needed?”

This is a graph database problem; storage is small (~millions of nodes), updates are infrequent.

Freshness SLAs and monitoring#

Each feature declares a freshness SLA: “this feature is at most 10 minutes stale”. The control plane monitors the actual freshness (timestamp of latest write per feature). If SLA is violated:

  • Alert the feature owner.
  • Optionally route inference to a fallback (older value, default, model-side handling).

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: online feature read fanout. A model needing 1000 features at 1 ms each is the worst case. Real systems aggressively co-locate features per entity (packed rows) and accept that the rare cross-entity feature lookup costs more. New models are constrained to feature sets within bandwidth budget.

Bottleneck #2: backfill cost. A new feature retroactively computed over 2 years of events is a multi-hour Spark job at petabyte scale. The control plane queues backfills with priority lanes so they don’t starve everyday batch jobs.

Bottleneck #3: vector index update latency. HNSW supports incremental insertion but degrades with high churn; periodic full rebuild is required. For high-churn corpora (e.g., LLM context retrieval over user docs), the rebuild cadence becomes critical.

Alternative I’d push back on: skipping the feature store and computing features inline in model code. Looks lean for one model; becomes spaghetti when 50 models share inputs. The feature store’s main job is forcing one canonical definition per feature — that’s the hard organizational win, not the technology.

What breaks first at 10× scale: lineage graph maintenance. A graph with 100s of millions of edges across thousands of teams becomes unmanageable without strict naming conventions and per-team ownership boundaries. The technical bottleneck is light; the organizational one is real.

Companies this resembles#

Uber Michelangelo, Airbnb Zipline, Meta FBLearner, Netflix’s metaflow ecosystem, LinkedIn’s Liger, Tecton (commercial Feast-like), Databricks Feature Engineering, Pinecone / Weaviate / Qdrant on the vector side.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.