Vectorizing Language — Gen AI · Engineering Playbook

Summary

Models do not consume words; they consume vectors. Every NLP system in the last 70 years has had some answer to the question “how do you turn text into numbers a model can use?” The progression goes from sparse, hand-counted vectors in the 1960s, through learned dense word vectors in the 2010s, through contextual embeddings that depend on surrounding text in the late 2010s, to the unified token embeddings that drive modern foundation models.

The mechanical story is short: each token gets mapped to a vector in some space; vectors that should be semantically close end up geometrically close; downstream math (similarity search, classification, generation) operates on the geometry. The interesting part is how the vectors come to have the right geometry, and what trade-offs each generation of techniques makes.

Why it matters

Embeddings are load-bearing in nearly every production AI system, and the choice of embedding strategy is the single biggest determinant of quality in retrieval, classification, clustering, and recommendation systems.

Two practical reasons to understand this from first principles:

First, RAG quality is mostly embedding quality. If your retriever returns the wrong chunks, no amount of prompt engineering on the generator side can recover the answer. Most RAG failures are embedding failures: the right document was in the index, but cosine similarity did not score it high enough against the query. Knowing why this happens (mismatched query/document semantics, training-data drift, anisotropy in the embedding space) is the difference between fixing the bug and adding more retries.

Second, embeddings are how models think. A transformer’s hidden state at any layer is, for our purposes, a sequence of embeddings being mixed and refined. The output of the model is a function of how those embeddings interact through attention. Reasoning about the model’s behaviour without reasoning about its embedding space is reasoning about a black box.

How it works

Sparse representations: bag-of-words and TF-IDF

The earliest representation: pick a vocabulary V, represent a document as a vector in R^|V| where entry i is the count of word i. With a 50k-word vocabulary, a 1000-word document is a vector with at most 1000 nonzero entries and at least 49,000 zeros, which is why it is called a sparse vector.
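
A minimal sketch of the counting step, assuming whitespace tokenization and a toy five-word vocabulary (both are illustrative choices, not part of any particular library). It stores only the nonzero entries, which is how sparse vectors are usually kept in practice.

```python
from collections import Counter

def bag_of_words(text: str, vocab: dict[str, int]) -> dict[int, int]:
    """Return a sparse count vector as {vocab_index: count}.

    Words outside the vocabulary are dropped.
    """
    counts = Counter(text.lower().split())
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

# Hypothetical toy vocabulary; a real one would have tens of thousands of entries.
vocab = {"the": 0, "river": 1, "bank": 2, "loan": 3, "flooded": 4}
sparse = bag_of_words("The river bank flooded the bank", vocab)

# Dense view, only practical here because the vocabulary is tiny.
dense = [sparse.get(i, 0) for i in range(len(vocab))]
print(dense)  # [2, 1, 2, 0, 1]
```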

TF-IDF is a refinement: weight each word’s count by how rare it is across the corpus. Common words (“the”, “and”) get downweighted; rare, distinguishing words get upweighted. The geometry now reflects topic similarity reasonably well — two documents about the same topic share rare keywords and end up close in cosine distance.
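
The weighting can be sketched in a few lines. This version assumes the plain unsmoothed form idf = log(N / df) and raw term counts for tf; production libraries typically add smoothing and normalization, so treat the exact numbers as illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term count by log(N / document_frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}  # assumption: unsmoothed idf
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "the stock market fell".split(),
]
vecs = tfidf_vectors(docs)
same_topic = cosine(vecs[0], vecs[1])   # share the rarer word "cat"
cross_topic = cosine(vecs[0], vecs[2])  # share only "the", which has weight 0
```

Because "the" appears in every document its weight is log(1) = 0, so the two cat documents are similar only through "cat", and the cat and stock-market documents have zero similarity.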

Sparse vectors have specific strengths: extremely fast retrieval (you only have to compare nonzero entries), interpretability (every dimension is a word), and graceful behaviour on rare terms. BM25 — a refinement of TF-IDF that handles document length better — is still in production at scale in 2026.
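
For reference, a sketch of one common BM25 formulation. The parameter values k1 = 1.5 and b = 0.75 and the "+1 inside the log" idf variant are assumptions chosen because they are widely used defaults; real search engines differ in the details.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a common BM25 variant."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code e4021 in payment service".split(),
    "payment service overview and architecture notes for new engineers".split(),
    "holiday schedule".split(),
]
scores = bm25_scores("e4021 payment".split(), docs)
```

The first document wins because it contains the rare identifier and is shorter; the third matches nothing and scores zero.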

Dense embeddings: word2vec, GloVe

The 2013 word2vec paper changed the field by making dense low-dimensional vectors easy and fast to train. The idea: predict a word from its surrounding words (CBOW) or predict the surrounding words from the word (skip-gram). The trained weights of the network’s hidden (projection) layer give a vector of roughly 300 dimensions for each word in the vocabulary.
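
To make the skip-gram objective concrete, here is a sketch of how its training examples are generated: every word is paired with each neighbour inside a fixed window. The function name and window size are illustrative; the actual training loop (negative sampling, subsampling of frequent words) is omitted.

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Return (center, context) pairs for every position in the token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window=1)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW uses the same windows but flips the direction: the context words together predict the center word.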

These vectors had the famous geometric property that vec("king") - vec("man") + vec("woman") ≈ vec("queen"). Whether or not analogies are the right test, the result demonstrated that something like semantic structure was being encoded in the geometry. GloVe (2014) achieved comparable results via matrix factorization of a co-occurrence matrix.
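
The analogy test is just vector arithmetic followed by a nearest-neighbour lookup. The sketch below uses hand-made 3-dimensional vectors (axes loosely meaning royalty, gender, fruit) purely to show the mechanics; they are not trained embeddings, and real vectors only satisfy the relation approximately.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def analogy(a: str, b: str, c: str, vectors: dict[str, list[float]]) -> str:
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors, constructed by hand for illustration only.
toy = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}
result = analogy("king", "man", "woman", toy)  # "queen"
```

Excluding the three input words from the candidates follows the usual evaluation convention; without it the nearest neighbour is often one of the inputs.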

The big advance over sparse representations: under cosine similarity, semantically related words sit close together even when they share no surface form, whereas sparse vectors only register overlap in literal terms. The big limitation: each word has one vector, regardless of context. “Bank” is the same vector whether you mean the river or the financial institution.

Contextual embeddings: ELMo, BERT, and onwards

The next leap was making the embedding depend on the surrounding text. ELMo (2018) ran bidirectional LSTMs over the text and used the hidden states as context-aware embeddings. BERT (2018) did the same with a transformer encoder trained on masked language modeling.

Now “bank” near “river” embeds differently from “bank” near “deposit.” The model’s representation of a token is its hidden state at some layer, after attention has mixed information across positions. Downstream tasks (classification, NER, retrieval) plug into these representations and get substantial quality improvements over static embeddings for almost every benchmark.

Sentence and document embeddings

For retrieval, you usually want a single vector per chunk of text, not one per token. The naïve approach — average the token embeddings — works surprisingly well as a baseline but has known failure modes (it loses ordering, treats stop words and content words equally).
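
The averaging baseline, sketched with plain lists. The mask argument is an assumption borrowed from how batched encoders mark padding positions; the token vectors here are made up.

```python
def mean_pool(token_vectors: list[list[float]], mask: list[int]) -> list[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [v for v, m in zip(token_vectors, mask) if m]
    dim = len(token_vectors[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # [0.5, 0.5]

# Reordering the real tokens gives the same vector: word order is lost.
reordered = mean_pool([tokens[1], tokens[0], tokens[2]], mask)
```

The second call illustrates the first failure mode named above: any permutation of the same tokens pools to an identical vector.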

Better approaches: SentenceBERT, E5, BGE, and other purpose-trained models with contrastive objectives. The training data is pairs of sentences that should be close (translations, paraphrases, query-document pairs) and contrastive negatives. The result is an embedding space where the geometry is specifically tuned for retrieval, not just for token prediction.
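
One widely used contrastive objective is InfoNCE with in-batch negatives: each query should score its own document higher than every other document in the batch. The sketch below assumes unit-normalized vectors and a temperature of 0.05; both are illustrative choices, and individual models differ in loss and negative-mining details.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries: list[list[float]], docs: list[list[float]],
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss; docs[i] is the positive for queries[i],
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])     # positives match
misaligned = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
```

The loss is near zero when each query points at its own document and large when the pairing is wrong, which is the pressure that shapes the retrieval geometry.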

In 2026, the dominant production embedding models are dedicated retrieval embedders (often 768 or 1024 dimensional, trained on tens of billions of pairs), and many of them are “matryoshka” embeddings that let you truncate to lower dimensions with graceful quality loss.
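
Truncating a matryoshka embedding is mechanically simple: keep the leading dimensions and renormalize so cosine and dot product stay comparable. A sketch, with the caveat that this only preserves quality when the model was trained with a matryoshka-style objective; applied to an ordinary embedding it usually degrades retrieval much faster.

```python
import math

def truncate_and_renormalize(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [3.0, 4.0, 12.0]  # made-up 3-dimensional embedding
short = truncate_and_renormalize(full, 2)
```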

Variants and trade-offs

The biggest decision in production is what kind of embedding to use for retrieval.

Dense embeddings (neural): handle synonyms and paraphrases well, work across languages with a multilingual model, capture semantic similarity. Drawbacks: brittle on rare terms, exact identifiers, code, or numbers; a model forward pass is needed for every document and every query; index cost is roughly linear in dimensionality.

Sparse embeddings (BM25 / TF-IDF / SPLADE): strong on exact keyword matches, interpretable, fast to compute, and for BM25 and TF-IDF there is no neural-model dependency at index time. SPLADE is the exception: it is a learned sparse model, so it needs a neural encoder at index time and recovers some synonym matching through term expansion. Drawbacks of the purely lexical methods: they miss synonyms and paraphrases, struggle with semantic similarity, and require language-specific tokenization.

Production retrieval systems usually combine both. Hybrid retrieval runs dense and sparse in parallel, takes the union of candidates, and reranks. The dense side finds semantically related content; the sparse side anchors on exact terms (names, identifiers, numbers, code) that the dense side underweights. Add a reranker (a cross-encoder that scores (query, candidate) pairs jointly) on top and you have the modern stack.
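
The text above does not prescribe how to merge the dense and sparse candidate lists before reranking. One common choice is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The sketch below uses the conventional constant k = 60; the document ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense retriever output
sparse_hits = ["doc_d", "doc_b", "doc_a"]  # hypothetical BM25 output
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents found by both retrievers rise to the top; the fused list is what would be handed to the cross-encoder reranker.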

Other axes worth knowing:

Dimensionality. 1536-dim was common in 2023; 768 and 1024 are common now. Higher dimensions cost more in storage and in every distance computation, both of which scale linearly with dimension. An ANN index reduces how many vectors each query is compared against, not the per-comparison cost. Quality gains diminish as dimension grows.
Symmetric vs asymmetric. Some embedders are trained with queries and documents in the same encoder (“symmetric”); some use separate encoders (“asymmetric”). Asymmetric usually wins on retrieval; symmetric is simpler.
Re-embedding cost. When you change models, you have to re-index. If your corpus is 100B tokens, this is a multi-day, expensive operation. Plan for embedding-model changes as a quarterly event, not a daily one.
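
A back-of-envelope estimate for the re-embedding point above. Every input except the 100B-token corpus size is an assumption picked for illustration (throughput per worker, worker count, chunk size, float32 storage); substitute measured numbers before planning around the result.

```python
def reembed_estimate(corpus_tokens: int, tokens_per_sec_per_worker: int,
                     workers: int, chunk_tokens: int, dim: int,
                     bytes_per_value: int = 4) -> dict[str, float]:
    """Rough wall-clock time and raw vector storage for a full re-index."""
    seconds = corpus_tokens / (tokens_per_sec_per_worker * workers)
    n_vectors = corpus_tokens // chunk_tokens
    index_bytes = n_vectors * dim * bytes_per_value
    return {
        "days": seconds / 86_400,
        "vectors": n_vectors,
        "index_gb": index_bytes / 1e9,
    }

est = reembed_estimate(
    corpus_tokens=100_000_000_000,
    tokens_per_sec_per_worker=50_000,  # assumed throughput
    workers=8,                         # assumed fleet size
    chunk_tokens=500,                  # assumed chunk size
    dim=1024,
)
# Roughly 2.9 days, 200M vectors, about 819 GB of raw float32 vectors.
```

Under these assumptions the job takes just under three days and the raw vectors alone need about 819 GB before any index overhead, which is consistent with treating a model change as an infrequent, planned event.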

There is also the choice between hosted embedding APIs and self-hosted models. Hosted gives you the latest quality with zero ops; self-hosted gives you predictable cost, data control, and the freedom to fine-tune on your own corpus. For high-volume retrieval workloads, self-hosted is usually cheaper at scale; the break-even is typically a few million embedding calls per month.

Anisotropy: the embedding-space pathology you should know about

Empirical studies have shown that the embeddings produced by many pretrained language models occupy a narrow cone in their space rather than filling it uniformly. This makes all embeddings somewhat similar to each other (high baseline cosine), which compresses the dynamic range of useful similarity scores. Practical effect: cosine of 0.6 might mean “very similar” in one model and “barely related” in another. Always benchmark embedding similarity against known-good and known-bad pairs before setting thresholds. The fix in modern retrieval models is to use contrastive training, which explicitly spreads the embeddings out.
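
A quick diagnostic for the narrow-cone effect is the mean pairwise cosine over a sample of embeddings: a high value means the baseline similarity is inflated. The sketch below uses made-up vectors that share a large common component, and shows that subtracting the mean vector removes the inflated baseline. Mean-centering is included only to illustrate the effect; it is a post-hoc adjustment, not the contrastive-training fix described above.

```python
import math
from itertools import combinations

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_cosine(vectors: list[list[float]]) -> float:
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def center(vectors: list[list[float]]) -> list[list[float]]:
    """Subtract the mean vector from every embedding."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[x - m for x, m in zip(v, mean)] for v in vectors]

# Hypothetical embeddings dominated by a shared first component.
cone = [[10.0, 1.0, 0.0], [10.0, 0.0, 1.0], [10.0, -1.0, 0.0], [10.0, 0.0, -1.0]]
baseline = mean_pairwise_cosine(cone)            # close to 1: everything looks similar
recentered = mean_pairwise_cosine(center(cone))  # negative: differences are visible again
```

Running the same measurement on a sample of known-unrelated pairs from your own corpus gives the baseline against which a similarity threshold should be set.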

When this is asked in interviews

This is a high-frequency interview topic — embeddings sit under retrieval, classification, recommendation, and any system that does similarity search.

Foundational probes:

“What is an embedding?” A dense vector representation in a space where similar things are geometrically close. Bonus: explain what “similar” means here (cosine similarity, or equivalently the dot product of unit-normalized vectors, is the usual metric).
“How is BERT’s embedding different from word2vec’s?” Contextual vs static: in BERT each token’s representation depends on the whole sequence, while word2vec assigns one fixed vector per word.
“Why don’t you just use the LLM’s hidden states as embeddings?” You can, but they are usually anisotropic and not trained for retrieval. Dedicated embedding models with contrastive objectives produce better retrieval-suited geometry.
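
For the metric question in the first probe, it helps to be able to show that cosine similarity and the dot product agree once vectors are scaled to unit length, which is why many vector stores normalize at index time and then use the cheaper dot product. A small check with made-up vectors:

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine(a, b)                          # 24 / 25 = 0.96
dot_unit = dot(normalize(a), normalize(b))     # same value after normalization
cos_scaled = cosine([30.0, 40.0], b)           # cosine ignores vector length
```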

Systems probes:

“Design the retrieval layer for a RAG system over 10M documents.” Expected hits: chunk strategy, embedding model choice, hybrid retrieval (dense + sparse), an ANN index for speed, a reranker, an eval set with hard negatives.
“Your retrieval quality dropped after switching embedding models. How do you debug?” Hits: confirm both sides use the new model (the corpus was re-indexed and queries are embedded with it, since vectors from different models are not comparable); check whether your evaluation set is still valid; check the new model’s training distribution against your domain.
“How would you reduce embedding cost?” Hits: smaller embedder, dimensionality reduction (truncation or matryoshka), batch embedding requests, cache embeddings of unchanged content.
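
The caching and batching hits can be sketched together: key each text by a hash of the model id plus its content, and send only unseen texts to the embedder in one batch. `EmbeddingCache`, `fake_embed`, and the in-memory dict are hypothetical stand-ins; a real system would call its embedding model and use a persistent store.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Content-addressed cache in front of a batch embedding function."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]],
                 model_id: str) -> None:
        self._embed_fn = embed_fn
        self._model_id = model_id  # part of the key: a new model invalidates the cache
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        payload = f"{self._model_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(t) for t in texts]
        # Deduplicate and keep only texts not seen before.
        missing = {k: t for k, t in zip(keys, texts) if k not in self._store}
        if missing:
            vectors = self._embed_fn(list(missing.values()))  # one batched call
            self.misses += len(missing)
            self._store.update(zip(missing.keys(), vectors))
        return [self._store[k] for k in keys]

def fake_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in embedder for the example: one number per text."""
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed, model_id="demo-model")
first = cache.embed_batch(["a", "bb", "a"])  # 2 unique texts are embedded
second = cache.embed_batch(["bb", "ccc"])    # only "ccc" is new
```

Including the model id in the key matters: cached vectors from an old model must never be served after a model switch.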

The senior version pushes on evaluation: “How do you know your retriever actually got better?” Answer needs to mention recall@k or MRR on a labelled eval set, not just “we tried some queries and they looked good.”
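
Both metrics are short enough to write on a whiteboard. A sketch, assuming each eval example is a ranked list of retrieved ids plus a set of labelled relevant ids; the ids below are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant hit, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }

runs = [
    (["d1", "d2", "d3"], {"d1", "d9"}),  # recall@2 = 0.5, reciprocal rank = 1.0
    (["d4", "d5", "d6"], {"d5"}),        # recall@2 = 1.0, reciprocal rank = 0.5
]
metrics = evaluate(runs, k=2)
print(metrics)  # {'recall@k': 0.75, 'mrr': 0.75}
```

Comparing these numbers for the old and new retriever on the same labelled set is what turns "it looked better" into evidence.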