Bidirectional Transformers (BERT)
Masked language modeling. How BERT became the encoder of choice for classification, retrieval, and ranking.
Origin and intuition
When the transformer paper landed in 2017, it was an encoder-decoder used for machine translation. The encoder was bidirectional — every position could attend to every other position — but the decoder was causal. The natural question, picked up by Devlin, Chang, Lee, and Toutanova at Google in 2018, was: what happens if you pretrain just the encoder stack as a general-purpose feature extractor for any downstream task?
The obvious objective — predicting the next token — doesn’t work for a bidirectional encoder. Each token in a bidirectional stack can see every other token directly, so “predict the next token” is trivial (the next token is in the input). You need an objective that’s hard because of bidirectionality, not in spite of it.
BERT’s answer was masked language modeling (MLM): randomly replace ~15% of the input tokens with a [MASK] placeholder, and train the encoder to predict the original tokens from their bidirectional context. To predict a masked word you need to look at both the words before and after it. This is the inductive bias the model picks up: every position learns to be a rich summary of its surroundings.
The result was a step-change in NLP. BERT-large (340M parameters), pretrained on Wikipedia + BookCorpus (~3.3B words) and fine-tuned for a few epochs on labeled downstream data, beat the previous state of the art on 11 benchmark tasks simultaneously: question answering, named entity recognition, sentiment, natural language inference, semantic similarity. Fine-tuning BERT became the dominant pattern for any text classification or sequence-labeling task between 2018 and 2022, and the encoder-only architecture remained the default for retrieval, ranking, and embedding workloads even after generative models took over chat.
The deeper lesson was that the shape of the pretraining objective constrains what the model is good at. BERT’s bidirectional context makes it excellent at understanding (classifying, embedding, ranking) and structurally bad at generation (you can’t autoregressively decode from a model that sees the future). GPT’s causal pretraining makes the opposite trade. The two architectures are nearly identical at the layer level; the objective is what sent them into different ecosystems.
Inputs and outputs
BERT consumes a token sequence and emits, at every position, a d_model-dimensional contextual embedding. The input format includes special tokens that carry structural meaning:
- [CLS]: prepended to every input. The final-layer hidden state at this position is conventionally used as a single sentence/passage embedding for classification or pooling.
- [SEP]: separates two segments (for sentence-pair tasks like NLI and QA).
- [MASK]: used during pretraining to mark tokens the model must predict.
- [PAD]: fills to the max sequence length in a batch.
A sentence-pair input looks like [CLS] sentence A tokens [SEP] sentence B tokens [SEP]. BERT adds three embedding lookups per position: token embedding + position embedding + segment embedding (0 for sentence A, 1 for sentence B). All three are summed before entering the first transformer block.
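The packing described above can be sketched in a few lines. This is a minimal illustration, not the reference tokenizer: `pack_pair` is a hypothetical helper, the inputs are already-split tokens, and the truncation rule (trim the longer segment first) is an assumption made for the example.

```python
def pack_pair(tokens_a, tokens_b, max_len=512):
    """Build BERT-style inputs for a sentence pair.

    Returns (tokens, segment_ids, position_ids). The truncation strategy
    (trim the longer segment first) is an illustrative assumption.
    """
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve room for [CLS] and the two [SEP] tokens.
    budget = max_len - 3
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Absolute positions index into the learned position embedding table.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(["the", "cat", "sat"], ["it", "purred"])
print(tokens)       # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Each of the three id sequences indexes its own embedding table, and the three looked-up vectors are summed position by position.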
Output topology depends on the head:
- Classification. Take h_{[CLS]} from the final layer, pass through a small classifier head.
- Token-level tasks (NER, QA span prediction). Use the per-token embeddings h_1, ..., h_n directly.
- Embedding/retrieval. Mean-pool the final-layer hidden states (works better empirically than [CLS] for retrieval, per Sentence-BERT).
- Reranking. Concatenate query and document with [SEP], pass through BERT, project [CLS] to a relevance score.
The maximum sequence length is fixed at pretraining time (512 tokens for BERT-base/large), because the learned position embedding table has only that many rows. Longer inputs must be chunked or handled by a long-context variant such as Longformer or BigBird.
Architecture diagram
BERT is a stack of transformer encoder blocks. Same block as the original 2017 paper, just stacked deeper, with absolute learned positional embeddings and bidirectional (unmasked) self-attention:
Input tokens:   [CLS]  w_1  w_2  [MASK]  w_4  ...  [SEP]
                          │
                          ▼
  Token embedding + Position embedding + Segment embedding
                          │
                          ▼
┌──────────────────────────────────────────────────┐
│  Encoder block 1                                 │
│  ┌──────────────────────────────────────────┐    │
│  │ Bidirectional Multi-Head Self-Attention  │    │
│  └──────────────────────────────────────────┘    │
│  + residual, LayerNorm                           │
│  ┌──────────────────────────────────────────┐    │
│  │ Feed-Forward (d_model → 4·d_model → d)   │    │
│  └──────────────────────────────────────────┘    │
│  + residual, LayerNorm                           │
└──────────────────────────────────────────────────┘
                          │
                          ▼
                  (Encoder block 2)
                          ⋮
                  (Encoder block L)
                          │
                          ▼
Final contextual embeddings:  h_{[CLS]}   h_1  h_2  h_3  ...
                                  │              │
                                  ▼              ▼
                       Classification head    MLM head
                   (sentence classification)  (predict masked tokens)
Two flavors shipped in the original paper:
- BERT-base. 12 layers, d_model = 768, 12 attention heads, ~110M parameters.
- BERT-large. 24 layers, d_model = 1024, 16 attention heads, ~340M parameters.
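The two parameter counts can be roughly reproduced from the layer shapes. The sketch below is a back-of-the-envelope tally under stated assumptions that are not in the text above: a 30,522-entry WordPiece vocabulary (as in the released uncased checkpoints), 512 positions, 2 segment types, biases on every linear layer, and a final pooler layer over [CLS]. `bert_param_count` is a hypothetical helper name.

```python
def bert_param_count(num_layers, d_model, vocab_size=30522,
                     max_positions=512, num_segments=2):
    """Approximate parameter count of a BERT encoder (MLM/NSP heads excluded)."""
    d_ff = 4 * d_model                  # feed-forward width, as in the diagram
    layer_norm = 2 * d_model            # scale and shift vectors
    # Token + position + segment tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + num_segments) * d_model + layer_norm
    # Q, K, V, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = d_model * d_model + d_model  # assumed dense layer over [CLS]
    return embeddings + num_layers * per_layer + pooler


print(f"BERT-base:  {bert_param_count(12, 768) / 1e6:.1f}M")   # BERT-base:  109.5M
print(f"BERT-large: {bert_param_count(24, 1024) / 1e6:.1f}M")  # BERT-large: 335.1M
```

The number of attention heads does not appear in the formula: heads only partition the d_model-wide projections, so they change how the computation is split, not how many weights there are.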
The bidirectionality is just the absence of a causal mask on attention. Every position attends to every other position. The transformer math is otherwise identical to the 2017 encoder.
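To make the "absence of a causal mask" concrete, here is a small sketch that computes attention weights from raw scores with and without a causal mask. It is a toy in pure Python (a single head, no learned projections), and `attention_weights` is a hypothetical helper, not part of any library.

```python
import math


def attention_weights(scores, causal=False):
    """Row-wise softmax over raw attention scores, optionally with a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        # A causal mask hides positions j > i; BERT applies no such mask.
        visible = [j for j in range(n) if not causal or j <= i]
        m = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - m) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # uniform scores, to isolate the mask's effect
bidirectional = attention_weights(scores)
causal = attention_weights(scores, causal=True)
print(bidirectional[0])  # [0.25, 0.25, 0.25, 0.25]
print(causal[0])         # [1.0, 0.0, 0.0, 0.0]
```

With identical scores, the first position of the bidirectional model spreads its attention over the whole sequence, while the causal model can only look at itself.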
Training objective
BERT pretraining uses two objectives summed together:
Masked Language Modeling (MLM). Pick 15% of the tokens. Of those:
- 80% get replaced by [MASK].
- 10% get replaced by a random token.
- 10% are left unchanged.
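The model predicts the original token at every selected position, including the ones that were randomly replaced or left unchanged. The 80/10/10 split is a workaround for the pretraining/fine-tuning mismatch: [MASK] never appears at fine-tuning time, so a model that only ever made predictions at [MASK] positions would learn representations that depend on seeing it.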
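The selection and 80/10/10 corruption steps can be sketched as follows. This is a simplified illustration: it selects each token independently with probability 0.15, whereas a real pipeline may pick a fixed number of positions per sequence. `mask_tokens` and the toy vocabulary are hypothetical names for the example.

```python
import random

SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels are None at unselected positions."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"            # 80%: mask placeholder
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


rng = random.Random(0)
vocab = ["cat", "dog", "sat", "mat", "the", "on"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

The `labels` list doubles as the loss mask: only positions with a non-None label contribute to the MLM loss.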
Loss is cross-entropy at masked positions only:
L_MLM = - Σ_{t in masked} log P(x_t | x_{not masked})
Next Sentence Prediction (NSP). Given a pair of sentences, classify whether sentence B actually follows sentence A in the corpus (50% positive, 50% randomly sampled). The intuition was that this would teach inter-sentence relationships needed for tasks like NLI and QA.
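The L_MLM formula above can be sketched directly: a cross-entropy term at each selected position, summed, with every other position skipped. This is a pure-Python illustration in which `mlm_loss` is a hypothetical helper and targets are token ids; practical implementations usually also divide by the number of selected positions, which is omitted here to match the formula.

```python
import math


def mlm_loss(logits, labels):
    """Sum of -log P(target) over positions whose label is not None.

    logits: per-position list of vocabulary scores.
    labels: per-position target token id, or None if the position was not selected.
    """
    total = 0.0
    for scores, target in zip(logits, labels):
        if target is None:
            continue  # unselected positions contribute nothing
        m = max(scores)
        # Numerically stable log of the softmax normalizer.
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # equals -log softmax(scores)[target]
    return total


# Four-token toy vocabulary, three positions, two of them selected.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
labels = [2, None, 0]
print(mlm_loss(logits, labels))
```

With uniform scores over four tokens, each selected position costs log 4, so the example evaluates to 2·log 4; the confident middle position is ignored because it was never selected.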
NSP didn’t age well. RoBERTa (2019) showed that dropping NSP and training MLM longer on more data outperformed the original recipe. Most subsequent BERT-family models (RoBERTa, DeBERTa, modern embedding models) drop NSP. The MLM objective alone is enough.
The pretraining corpus was BookCorpus (800M words) + English Wikipedia (2.5B words). Training compute: 4 days on 16 Cloud TPUs (64 TPU chips) for BERT-large, and 4 days on 4 Cloud TPUs (16 TPU chips) for BERT-base.
Variants and refinements
BERT spawned a dense family tree between 2018 and 2022:
- RoBERTa (Liu et al., 2019). Same architecture, better recipe. Drop NSP, train longer (~10× more data), larger batches, dynamic masking (re-mask each epoch). Substantially better at the same parameter count. RoBERTa-large was the default fine-tuning backbone for ~3 years.
- ALBERT (Lan et al., 2019). Parameter sharing across layers + factorized embedding matrix. 18× fewer parameters than BERT-large at similar quality. Practical for memory-constrained deployments.
- DistilBERT (Sanh et al., 2019). Distillation of BERT-base into a 6-layer student. 40% smaller, 60% faster, retains ~97% of GLUE performance.
- ELECTRA (Clark et al., 2020). Replace MLM with a discriminator that classifies every token as real or generated-by-a-small-generator. Much more sample-efficient: ELECTRA-small clearly outperforms a comparably sized BERT and comes close to BERT-base at a fraction of the compute.
- DeBERTa (He et al., 2020). Disentangled attention separating content from position; enhanced mask decoder. Topped GLUE and SuperGLUE leaderboards. The “best” BERT-family encoder by most metrics through 2022.
- Multilingual BERT (mBERT) / XLM-R (Conneau et al., 2019). Pretrained on roughly 100 languages with the same MLM objective. XLM-R-large became the default for cross-lingual transfer.
- Domain-specific BERTs. SciBERT (scientific papers), BioBERT (biomedical), Legal-BERT, FinBERT. Continued pretraining on domain corpora consistently outperforms general BERT on in-domain tasks.
- Long-context BERT variants. Longformer (sliding-window + global attention), BigBird (random + window + global), supporting 4K-16K context.
- Sentence-BERT (Reimers and Gurevych, 2019). Bi-encoder architecture fine-tuned on NLI to produce sentence embeddings via mean-pooling. Foundational for modern semantic search.
Practical considerations
Fine-tuning recipe. The canonical BERT fine-tuning recipe is simple: take the pretrained checkpoint, add a small task-specific head (linear layer for classification, span predictor for QA), train end-to-end on labeled data for 2-4 epochs with AdamW, learning rate 2e-5 to 5e-5, batch size 16-32, linear warmup over 10% of steps then linear decay. This recipe works astonishingly often, and it’s the reason BERT changed industry NLP — you didn’t need an ML research team to ship a state-of-the-art classifier.
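The learning-rate schedule in that recipe (linear warmup over 10% of steps, then linear decay to zero) can be sketched as a plain function of the step index. `linear_warmup_decay` is a hypothetical helper, and the default peak of 3e-5 is just one value from the 2e-5 to 5e-5 range above.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-5, warmup_frac=0.1):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from the peak down to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total))
```

For 1000 total steps the rate reaches its peak at step 100, falls to half the peak at step 550, and hits zero on the final step.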
Embedding extraction. For retrieval and semantic similarity, mean-pool the final hidden states (excluding padding), then L2-normalize. The [CLS] token alone is consistently worse for retrieval than mean-pooling, despite being designed for sentence-level signals. Sentence-BERT-style supervised fine-tuning on natural language inference pairs adds another large quality jump.
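Masked mean-pooling followed by L2 normalization looks roughly like this. The sketch uses plain Python lists in place of tensors, and `mean_pool` is a hypothetical helper; the attention mask marks real tokens with 1 and padding with 0.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1, then L2-normalize."""
    dim = len(hidden_states[0])
    summed, kept = [0.0] * dim, 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:  # skip padding positions entirely
            kept += 1
            for d in range(dim):
                summed[d] += vec[d]
    mean = [s / max(kept, 1) for s in summed]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # guard against a zero vector
    return [x / norm for x in mean]


hidden = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]  # last row is a padding position
embedding = mean_pool(hidden, [1, 1, 0])
print(embedding)  # [0.6, 0.8]
```

Leaving the padding row in the average would drag the embedding toward [100, 100]; masking it out is what keeps batch padding from leaking into the vector.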
Inference cost shape. BERT-base inference is ~22 GFLOPs per forward pass at 128 tokens and on the order of 100 GFLOPs at the full 512 tokens, since the linear layers scale with sequence length and attention adds a quadratic term. That is fast enough for batch ranking workloads, marginal for sub-100ms request latencies on CPU. Most production BERT serving runs on GPU (or distilled/quantized variants on CPU). For retrieval at scale, you bi-encode the corpus offline (one BERT forward per document, embeddings stored in a vector index) and encode only the query at request time.
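The offline/online split for bi-encoder retrieval can be sketched as below. To keep the example self-contained, `toy_encode` is a hypothetical stand-in for a BERT forward pass plus pooling (it is a hashed bag-of-words, not a language model), and a Python list stands in for the vector index.

```python
import math


def toy_encode(text, dim=16):
    """Hypothetical stand-in for a BERT encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(docs):
    # Offline: one encoder call per document, embeddings stored for later.
    return [(doc, toy_encode(doc)) for doc in docs]


def search(query, index, k=2):
    # Online: encode only the query, then rank by dot product
    # (equal to cosine similarity because all vectors are unit length).
    q = toy_encode(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    return sorted(scored, reverse=True)[:k]


index = build_index([
    "bert is an encoder",
    "gpt is a decoder",
    "masked language modeling",
])
for score, doc in search("masked language modeling", index):
    print(f"{score:.3f}  {doc}")
```

The per-request cost is one encoder call plus a similarity search, independent of how expensive it was to embed the corpus.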
The 512-token limit. Standard BERT was pretrained with 512-token sequences. Going past that at inference fails ungracefully — positional embeddings beyond 512 don’t exist. Solutions: chunk long documents, use a long-context variant (Longformer, BigBird), or use a hierarchical approach (BERT over passages, then aggregate). The 512 limit is the single biggest practical constraint and the reason long-document retrieval often uses other architectures.
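The chunking option can be sketched as a sliding window over token ids. `chunk_token_ids` is a hypothetical helper; the overlap of 128 tokens and the two reserved slots for [CLS] and [SEP] are illustrative assumptions, not values prescribed by BERT.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, num_special=2):
    """Split token ids into overlapping windows that fit the model limit.

    num_special reserves room for [CLS] and [SEP]; stride is the number of
    tokens shared by consecutive windows.
    """
    window = max_len - num_special
    if stride >= window:
        raise ValueError("stride must be smaller than the usable window")
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the document
        start += window - stride
    return chunks


chunks = chunk_token_ids(list(range(1200)))
print([len(c) for c in chunks])  # [510, 510, 436]
```

Each chunk is then wrapped in [CLS] ... [SEP] and encoded separately; the per-chunk outputs still have to be aggregated (max score, mean embedding, or a second model over passages), which is the hierarchical approach mentioned above.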
Catastrophic forgetting at fine-tuning. Fine-tuning BERT on small datasets with a high learning rate can wreck the pretrained representations. The “discriminative fine-tuning” pattern (lower learning rate for early layers, higher for later layers and the head) helps. Adapters (Houlsby et al., 2019) and LoRA-style low-rank updates largely sidestep the problem by freezing the backbone and training only a small number of added parameters.
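One common way to implement discriminative fine-tuning is a geometric decay of the learning rate from the head down to the embeddings. The sketch below only computes the per-group rates; the group names, the decay factor of 0.9, and the helper `layerwise_learning_rates` are assumptions for illustration, and in practice each entry would become one optimizer parameter group.

```python
def layerwise_learning_rates(num_layers=12, head_lr=5e-5, decay=0.9):
    """Map parameter-group name to learning rate, shrinking toward the input."""
    rates = {"head": head_lr}
    for layer in range(num_layers, 0, -1):
        # Layer `num_layers` sits just below the head; layer 1 is closest to the input.
        depth_from_head = num_layers - layer + 1
        rates[f"encoder.layer.{layer}"] = head_lr * decay ** depth_from_head
    # Embeddings are the earliest parameters and get the smallest rate.
    rates["embeddings"] = head_lr * decay ** (num_layers + 1)
    return rates


for name, lr in layerwise_learning_rates().items():
    print(f"{name:<18} {lr:.2e}")
```

With these defaults the embeddings train at roughly a quarter of the head's rate, so the layers holding the most general pretrained features move the least.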
Real-world deployments
BERT-family encoders are deeply embedded in production systems even after generative models took the spotlight:
- Google Search ranking (2019). BERT was integrated into Google Search to better understand query intent — affected ~10% of English queries at launch, expanded to 70+ languages within a year. Still core to modern Google ranking signals.
- Bing, Yandex, Baidu all integrated BERT-derived models into ranking around the same time. Modern web search ranking stacks are full of fine-tuned encoders.
- Recommendation systems. Item-to-item similarity, user-query understanding, content tagging across Pinterest, Etsy, eBay, and most e-commerce search stacks run on BERT-derived embedding models.
- Email spam / abuse classification at Gmail, Outlook, Facebook. Fine-tuned BERT classifiers (or distilled variants for cost) handle billions of classifications per day.
- Customer support routing and intent classification at almost every major SaaS company. The standard pattern: fine-tune BERT-base on a few thousand labeled tickets, deploy on GPU, refresh weekly.
- Modern embedding models for RAG. Despite the “everything is generative now” framing, the embedding side of most RAG pipelines is still encoder-based. Open models such as bge-m3 and nomic-embed are encoder transformers descended from BERT, fine-tuned with contrastive objectives (InfoNCE, multiple-negatives ranking loss); hosted models such as text-embedding-3 and voyage-3 fill the same role, though their architectures are not published.
- Reranking in retrieval pipelines. Cross-encoders (BERT taking [query] [SEP] [doc] and emitting a relevance score) remain the gold standard for the final reranking stage in production search and RAG.
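The contrastive objective mentioned in the embedding-model item can be sketched as InfoNCE with in-batch negatives: each query's paired document is the positive, and every other document in the batch serves as a negative. This is a pure-Python illustration; `info_nce` is a hypothetical helper, the vectors are assumed to be L2-normalized, and the temperature of 0.05 is an assumed value.

```python
import math


def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """In-batch contrastive loss: doc_vecs[i] is the positive for query_vecs[i]."""
    loss = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every document in the batch.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[i]  # cross-entropy with the paired doc as the target
    return loss / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong document
print(info_nce(queries, aligned_docs))
print(info_nce(queries, swapped_docs))
```

The loss is near zero when each query is closest to its own document and large when the pairing is wrong, which is the pressure that pulls matching query and document embeddings together.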
The BERT lineage is the quiet backbone of search and retrieval at internet scale. Generative models grab headlines; encoders ship the queries.
Related architectures
- Attention Is All You Need (Transformer)
- Generative Pretraining (GPT)
- Encoder-Decoder Models
- Pretraining Paradigms
- Vectorizing Language
Why encoder-only models didn't scale to frontier LLMs
In 2018, BERT and GPT looked like roughly equivalent bets: same transformer, different pretraining objective, different downstream profile. By 2022, GPT’s lineage was the frontier and BERT’s was the embedding layer. Why? Two reasons. (1) The MLM objective doesn’t let you produce text. You can fine-tune for generation with masked-prediction tricks, but it’s awkward; causal pretraining gives you generation for free at every scale. (2) The MLM signal per training token is only ~15% as dense as causal LM, because the model learns only from the selected positions rather than from every position. At a fixed compute budget, causal LM extracts more learning per token. ELECTRA partially fixed this by making every position a training signal, but GPT-3 shipped within months of it and the field pivoted. Encoders won where their inductive bias mattered (bidirectional understanding, dense retrieval), but the bulk of foundation-model investment moved to the causal-decoder side and stayed there.