Text Preprocessing Essentials — Gen AI

Summary#

Text preprocessing is the set of steps that converts raw input — UTF-8 bytes, HTML, copy-pasted prose with weird whitespace and Unicode oddities — into the integer IDs a model actually consumes. It is the least glamorous part of an NLP pipeline and the most consequential one. A misconfigured tokenizer can silently halve retrieval quality, change the meaning of an evaluation, double the inference cost of a request, or cause a model to mis-handle every name in a non-English script.

The pieces are: text normalization (Unicode, case, whitespace), tokenization (splitting into discrete units), and optionally morphological reduction (stemming or lemmatization). Modern foundation models collapse most of this into a single learned subword tokenizer; classical NLP and search systems use multiple explicit stages. Both are still in production, often in the same system.

Why it matters#

The model never sees your text. It sees integer IDs that index into an embedding table. Everything between the user’s keystroke and that integer ID is preprocessing, and every decision at that boundary is a hyperparameter that affects the entire downstream system.

Two real-world failure modes that recur in production:

Tokenizer mismatch between training and serving. A model was trained with one tokenizer and you fine-tune or evaluate with a different one. The IDs no longer mean what the model thinks they mean. Quality drops, often silently — the model still produces plausible-looking text but on the wrong distribution. This is one of the most common bugs in fine-tuning workflows.
Non-English token explosion. A tokenizer trained predominantly on English text splits Hindi, Arabic, or Chinese into many more tokens per character than it does English. A prompt with identical meaning can cost 3-5x as much and run slower. If your product launches in a new market and the unit economics collapse, this is the first thing to check.
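A rough way to see where the explosion comes from is to count UTF-8 bytes per code point, since a byte-level tokenizer with no learned merges for a script falls back to roughly one token per byte. The sketch below is only a proxy under that assumption; the sample strings and the function name are illustrative, and real costs should be measured with the model's actual tokenizer.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative greetings, written as escapes so the code points are unambiguous.
samples = {
    "English": "hello",
    "Arabic": "\u0645\u0631\u062d\u0628\u0627",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "Chinese": "\u4f60\u597d",
}

for language, text in samples.items():
    ratio = utf8_bytes_per_char(text)
    print(f"{language}: {ratio:.1f} bytes per code point")
```

The ratio is a worst case for byte-level fallback: a tokenizer that has learned merges for a script will do better, and one that has not will sit close to it.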

There are also subtler effects. The choice between stemming and lemmatization changes recall in a search system. The choice of whitespace normalization changes how code is rendered to a model. The choice of how to handle Unicode normalization (NFC vs NFD) changes whether a search for “café” matches the same word with a different accent encoding.

How it works#

Normalization: a uniform input#

Before tokenization, raw text usually goes through normalization. The exact steps vary, but most pipelines do some subset of:

Unicode normalization to NFC (composed) or NFKC (composed + compatibility folding). Without this, “café” with a precomposed é and “café” with a combining-accent é are different byte sequences that should mean the same thing.
Whitespace collapsing. Runs of spaces, tabs, and newlines are collapsed to a single space or newline. HTML and PDF inputs often have surprising whitespace patterns. Skip this step for source code, where indentation carries meaning.
Casing. Lowercasing was standard in earlier NLP. Modern subword tokenizers usually preserve case because case is information (“Apple” vs “apple”). The trade-off is vocabulary size.
Stripping or escaping control characters. Zero-width joiners, right-to-left marks, BOMs. These survive surprisingly long in production pipelines and cause obscure bugs.

Normalization is destructive — once you lowercase, you cannot recover the original. Pipelines that need to return text to the user typically preserve the original separately and only normalize the copy used for tokenization.
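The steps above can be sketched with the Python standard library. This is a minimal illustration, not a recommended pipeline: the function name, the choice of NFC, and the particular set of invisible characters stripped are all assumptions, and some scripts and emoji sequences need zero-width joiners preserved.

```python
import re
import unicodedata

# Assumed set of invisible characters to drop: zero-width space, zero-width
# non-joiner and joiner, left-to-right and right-to-left marks, and the BOM.
_INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u200e\u200f\ufeff"))


def normalize_for_tokenizer(text: str, lowercase: bool = False) -> str:
    """Return a normalized copy of text; the caller keeps the original."""
    text = unicodedata.normalize("NFC", text)   # compose base letter + accent
    text = text.translate(_INVISIBLE)           # delete invisible characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace runs
    return text.lower() if lowercase else text


raw = "\ufeffCafe\u0301   au\tlait\n"
record = {"original": raw, "normalized": normalize_for_tokenizer(raw)}
print(repr(record["normalized"]))
```

Storing both fields in `record` reflects the point about destructiveness: the normalized copy feeds the tokenizer, and the original is what gets shown back to the user.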

Tokenization: text to IDs#

Tokenization is the central decision. There are three families.

Word-level. Split on whitespace and punctuation. Simple but has a huge vocabulary (every inflected form is a separate entry) and breaks on out-of-vocabulary words. Useful for some classical search systems; obsolete for neural models.

Character-level. One token per Unicode codepoint or byte. Vocabulary is tiny (~256 for bytes), but sequences become very long, multiplying compute. Used in some specialized models.
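The first two families are simple enough to sketch directly. The toy vocabulary, the `<unk>` convention, and the function names below are illustrative assumptions; the point is only to show the out-of-vocabulary failure of word-level encoding next to the sequence-length cost of byte-level encoding.

```python
import re


def word_tokenize(text: str) -> list[str]:
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def encode_words(tokens: list[str], vocab: dict[str, int], unk_id: int = 0) -> list[int]:
    # Any token missing from the vocabulary collapses to the unknown ID.
    return [vocab.get(token, unk_id) for token in tokens]


def encode_bytes(text: str) -> list[int]:
    # One ID per UTF-8 byte; the vocabulary is just 0..255.
    return list(text.encode("utf-8"))


vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}
sentence = "the cat meowed."

word_ids = encode_words(word_tokenize(sentence), vocab)
byte_ids = encode_bytes(sentence)

print("word-level:", word_ids)       # "meowed" is lost to the unknown ID
print("byte-level length:", len(byte_ids))
```

Word-level encoding is short but throws away "meowed" entirely; byte-level encoding loses nothing but is several times longer for the same sentence.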

Subword. The current default. The vocabulary contains common whole words, common word pieces, and individual bytes as fallback. Three algorithms dominate:

BPE (Byte-Pair Encoding). Starts from characters, iteratively merges the most frequent pair until vocabulary size is hit. GPT-family models use byte-level BPE.
WordPiece. Similar to BPE but uses likelihood improvement rather than frequency as the merge criterion. BERT uses this.
SentencePiece / Unigram. SentencePiece is a library that treats text as a raw stream (no whitespace pre-split); its Unigram algorithm uses a probabilistic model to pick the most likely segmentation. Common in multilingual models.
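The BPE merge loop can be sketched in a few lines. This is a toy, character-level version under simplifying assumptions: no byte-level alphabet, no pre-tokenization, no end-of-word marker, and ties broken by first occurrence. Production tokenizers differ in all of those details, and the function names here are illustrative.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merges from a list of words."""
    words = Counter(tuple(word) for word in corpus)  # word -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        updated = Counter()
        for symbols, count in words.items():
            updated[merge_pair(symbols, best)] += count
        words = updated
    return merges


def encode_word(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply learned merges in training order to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=3)
print(merges)
print(encode_word("lowest", merges))
```

On this corpus the frequent suffix "est" becomes a single symbol after two merges, so an unseen word like "lowest" is still encoded from known pieces rather than failing as out-of-vocabulary.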

The output is a sequence of integer IDs. For typical English text, modern tokenizers average roughly 4 characters per token, or about 0.75 words per token (around 1.3 tokens per word).

Morphological reduction: stemming and lemmatization#

Pre-neural NLP and many classical search systems reduce inflected forms to a canonical form. Two flavours:

Stemming crops suffixes mechanically. “running” and “runs” both become “run”, although the classic Porter rules leave “runner” unchanged. Fast, language-specific, and aggressive, sometimes too aggressive (“university” and “universe” both stem to “univers” in Porter).
Lemmatization looks up the dictionary form. “ran” becomes “run”. “better” becomes “good”. Slower, requires a lexicon, more accurate.
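The difference between the two is easiest to see side by side. The sketch below is deliberately a toy: the suffix list is not the Porter algorithm, and the tiny lemma table stands in for a real lexicon. Both are hypothetical and exist only to show rule-based cropping next to dictionary lookup.

```python
# Hypothetical suffix rules, tried in order. Not the Porter stemmer.
SUFFIXES = ("ing", "es", "s")


def toy_stem(word: str) -> str:
    """Crop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical stand-in for a lexicon of irregular and inflected forms.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "better": "good"}


def toy_lemmatize(word: str) -> str:
    """Look up the dictionary form; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "runs", "ran", "better", "news"]:
    print(f"{word}: stem={toy_stem(word)} lemma={toy_lemmatize(word)}")
```

The stemmer handles regular suffixes without any data but cannot connect "ran" to "run", and it over-crops "news" to "new". The lemmatizer gets irregular forms right but only for words its lexicon covers.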

Modern neural pipelines mostly skip both — the model learns morphological relationships during training. But search systems still use them, RAG pipelines that combine sparse and dense retrieval still use them, and any system that does keyword matching against user queries still needs them.

Variants and trade-offs#

The biggest practical decision is whether to use a learned subword tokenizer (the default for transformer models) or a multi-stage classical pipeline (still standard for search and some retrieval workflows). They often coexist.

Learned subword tokenizer (BPE / WordPiece / Unigram) — one stage, language-agnostic in principle, no morphological knowledge needed, reversible (you can decode back to text). Drawback: non-English text can cost 3-5x more tokens, surprising splits on rare words, vocabulary is opaque.

Classical multi-stage pipeline (normalize → tokenize → stem / lemmatize) — interpretable, debugging-friendly, optimized per language, mature library support. Drawback: brittle on languages it was not designed for, lossy (you cannot reconstruct original text), high engineering cost per language.

Other axes worth knowing:

Byte-level versus codepoint-level BPE. Byte-level handles every Unicode input gracefully (never fails on unknown characters) but produces longer sequences for non-Latin scripts. Codepoint-level produces shorter sequences but needs a fallback path for unknown characters.
Pre-tokenization rules. Even byte-level BPE usually pre-splits on whitespace, digits, and punctuation. The pre-split rules are part of the tokenizer’s contract — change them and you change the IDs.
Tokenization at training time versus inference time. The tokenizer must be byte-for-byte identical between training and inference. Saving it (in the model artefact) and version-locking it is the only reliable approach.
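One way to enforce the byte-for-byte requirement is to hash everything that determines the IDs and compare the hash at serving time. The sketch below assumes the tokenizer can be described by a vocabulary, an ordered merge list, and a pre-tokenization pattern; those three inputs and both function names are illustrative, and a real tokenizer may have more state (normalizer settings, special tokens) that belongs in the fingerprint.

```python
import hashlib
import json


def tokenizer_fingerprint(
    vocab: dict[str, int],
    merges: list[tuple[str, str]],
    pretokenize_pattern: str,
) -> str:
    """Stable SHA-256 over the parts of a tokenizer that determine its IDs."""
    payload = json.dumps(
        {
            "vocab": sorted(vocab.items()),   # dict order must not matter
            "merges": merges,                 # merge order does matter
            "pattern": pretokenize_pattern,
        },
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_same_tokenizer(expected: str, actual: str) -> None:
    """Fail loudly at startup instead of serving with mismatched IDs."""
    if expected != actual:
        raise RuntimeError(
            f"tokenizer mismatch: expected {expected[:12]}, got {actual[:12]}"
        )


# Fingerprint recorded at training time and stored with the model artefact.
trained = tokenizer_fingerprint({"a": 0, "b": 1, "ab": 2}, [("a", "b")], r"\s+")
# Fingerprint computed from whatever the serving process actually loaded.
serving = tokenizer_fingerprint({"ab": 2, "b": 1, "a": 0}, [("a", "b")], r"\s+")

assert_same_tokenizer(trained, serving)
print("tokenizer fingerprint:", trained[:12])
```

Because the pre-tokenization pattern is part of the hash, a change to the split rules is caught the same way as a change to the vocabulary.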

A good rule of thumb: if your system has a model in the loop, use the model’s bundled tokenizer and do not invent your own. If your system has explicit keyword matching, BM25, or classical IR, you still need normalization, stemming, and possibly a stopword list.

A tokenizer bug that cost a company a quarter

An engineering team launched a fine-tuned chat model in a new region. Internal benchmarks were strong. After launch, customer complaints rolled in: the model would inexplicably break into garbled output on common queries. The root cause: the team had fine-tuned on a slightly newer tokenizer version that had added merges for emoji and certain Unicode ranges. The serving infrastructure was still on the older tokenizer. IDs above a certain value did not exist in the old vocabulary. Most inputs were fine because they did not hit the new ranges; the ones that did got corrupted before reaching the model. A single line in the deployment config — locking the tokenizer version — would have caught this. Three weeks of incident response did not.

When this is asked in interviews#

This shows up in two places.

Foundational ML interviews. “How does tokenization work in modern language models?” The interviewer wants you to name subword tokenization, explain why character-level is too long and word-level has out-of-vocabulary issues, and ideally describe BPE merging at a mechanical level. Bonus points: explain why tokens are not characters or words (a token averages roughly 4 characters of English text) and what that means for prompt-design heuristics like “leave room in the context window.”

Systems and production interviews. Less about how tokenization works and more about how it breaks. Expect questions about:

Token cost optimization (what controls cost per request — input size, output size, the tokenizer’s efficiency on your input language).
Tokenizer mismatch as a deployment bug source.
How to compare two models with different tokenizers fairly (you cannot compare on token counts; convert to characters or bytes).
When you would build a custom tokenizer (rarely — usually only for a new language or domain with very different vocabulary like protein sequences or DNA).
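For the fair-comparison question, one common approach is to normalize loss by bytes of text rather than by tokens, giving bits per byte. The sketch below assumes each model reports a total negative log-likelihood in nats over the same text; the token counts and losses are made-up numbers chosen only to show how the two rankings can disagree.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a total negative log-likelihood (nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


text = "The quick brown fox jumps over the lazy dog."

# Hypothetical results: model B uses more, smaller tokens than model A.
tokens_a, loss_per_token_a = 10, 2.0
tokens_b, loss_per_token_b = 14, 1.6

bpb_a = bits_per_byte(tokens_a * loss_per_token_a, text)
bpb_b = bits_per_byte(tokens_b * loss_per_token_b, text)

print(f"per-token loss: A={loss_per_token_a} B={loss_per_token_b}")
print(f"bits per byte:  A={bpb_a:.3f} B={bpb_b:.3f}")
```

With these numbers model B looks better per token but worse per byte, because it spends more tokens on the same text. The per-byte figure is the one that does not depend on either tokenizer.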

A senior follow-up sometimes asks you to design the tokenizer for a hypothetical system — say, a code-completion model. The answer should hit: handle every byte gracefully (never reject input), preserve indentation tokens that matter for code, avoid splitting common identifiers, balance vocabulary size against sequence length. Concrete trade-offs, not just “I would use BPE.”
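Part of such an answer can be made concrete with a pre-tokenization rule. The sketch below is one possible split for source code, under stated assumptions: it keeps each run of spaces or tabs as a single piece, keeps ASCII identifiers whole, and falls back to single characters for everything else so no input is rejected. The pattern and function name are illustrative and not taken from any existing tokenizer; a subword algorithm would then run inside each piece.

```python
import re

_PIECE = re.compile(
    r"""
      \n                        # a newline is its own piece
    | [ \t]+                    # a run of spaces or tabs stays together
    | [A-Za-z_][A-Za-z0-9_]*    # an ASCII identifier or keyword, kept whole
    | [0-9]+                    # a run of digits
    | .                         # fallback: any other single character
    """,
    re.VERBOSE | re.DOTALL,
)


def pretokenize_code(source: str) -> list[str]:
    """Split source code into pieces without dropping any character."""
    return _PIECE.findall(source)


source = "def add_one(x):\n    return x + 1\n"
pieces = pretokenize_code(source)

print(pieces)
# The split is lossless: joining the pieces reproduces the input exactly.
print("".join(pieces) == source)
```

Keeping indentation as one piece means a four-space indent can become a single vocabulary entry instead of four tokens, which is one of the vocabulary-size versus sequence-length trade-offs the answer should discuss.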