Encoder-Decoder Framework — Gen AI

Origin and intuition

By 2014, neural networks could read sequences (RNNs, LSTMs) and they could classify sequences (final-hidden-state into a softmax). What they couldn’t cleanly do was map one variable-length sequence to a different variable-length sequence — machine translation, summarization, question answering, dialogue. The input “Le chat est sur le tapis” is six tokens; the output “The cat is on the mat” is also six tokens, but in general source and target lengths and alignments differ.

Sutskever, Vinyals and Le’s 2014 paper “Sequence to Sequence Learning with Neural Networks” framed the solution: two RNNs, trained end-to-end. An encoder RNN reads the source sequence and produces a single fixed-size summary vector (its final hidden state). A decoder RNN takes that summary as its initial state and autoregressively generates the target sequence, one token at a time, conditioning each token on the summary plus all previously generated tokens. The training loss is cross-entropy on the target tokens.

This was a profound generalization. The encoder-decoder split decouples “understand the input” from “produce the output” and turns sequence transduction into two well-shaped subproblems. The same framework worked for translation, summarization, parsing (target = a serialized parse tree), and conversational response generation. By 2015 it was the dominant architecture for any seq2seq problem.

It had one structural flaw, immediately visible: the entire source sentence has to fit in a single fixed-size vector. For short inputs this works; for long inputs the encoder’s final hidden state becomes a lossy bottleneck, because information about the first few tokens has largely been overwritten by the time the encoder finishes. Bahdanau, Cho and Bengio’s 2014 paper fixed this by letting the decoder attend over all encoder hidden states instead of relying on the final one. Attention turned out to be the load-bearing idea. Three years later the transformer paper (Vaswani et al., 2017) kept attention and dropped the recurrence, and the encoder-decoder framework, now built entirely from attention layers, became the canonical seq2seq architecture again, this time for T5, BART, mT5, FLAN-T5, and the machine-translation systems of the GPT-3 era.

Inputs and outputs

The encoder consumes a source sequence x_1, ..., x_m (tokens, audio frames, image patches) and produces:

A sequence of contextual representations e_1, ..., e_m (for attention-based decoders).
And/or a single summary vector e (for the original Sutskever-style setup).

The decoder consumes the encoder output plus a target sequence prefix y_1, ..., y_{t-1} and produces a distribution over the next target token y_t. At inference time, the decoder is run autoregressively: emit y_1 from the encoder output (and a start-of-sequence token), feed y_1 back in to emit y_2, and so on until it emits a special end-of-sequence token.
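The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `greedy_decode`, `toy_model`, and `TOY_TABLE` are hypothetical names, and the "model" is a lookup table standing in for a real encoder plus decoder.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"


def greedy_decode(
    next_token_probs: Callable[[Sequence[str], Sequence[str]], Dict[str, float]],
    source: Sequence[str],
    max_len: int = 20,
) -> List[str]:
    """Generate target tokens one at a time until EOS or max_len tokens."""
    prefix: List[str] = [BOS]
    while len(prefix) - 1 < max_len:
        # Distribution over y_t given the source and the prefix y_1..y_{t-1}.
        probs = next_token_probs(source, prefix)
        token = max(probs, key=probs.get)  # greedy: take the most likely token
        if token == EOS:
            break
        prefix.append(token)  # feed the emitted token back in
    return prefix[1:]  # drop <bos>


# Hypothetical stand-in for encoder + decoder: word-by-word lookup.
TOY_TABLE = {"le": "the", "chat": "cat", "dort": "sleeps"}


def toy_model(source: Sequence[str], prefix: Sequence[str]) -> Dict[str, float]:
    step = len(prefix) - 1  # number of target tokens emitted so far
    if step >= len(source):
        return {EOS: 1.0}
    return {TOY_TABLE[source[step]]: 0.9, EOS: 0.1}


print(greedy_decode(toy_model, ["le", "chat", "dort"]))
```

A real system would replace `toy_model` with a forward pass through the decoder, and would often replace the `max` with beam search or sampling.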

Two interface points worth being precise about:

Teacher forcing at training. The decoder is fed the ground-truth target prefix y_1, ..., y_{t-1} to predict y_t. In a transformer decoder this makes training parallel across target positions; an RNN decoder still steps through time, but it never has to wait on its own samples. At inference, only the previously generated tokens are available; this train/test mismatch (exposure bias) is one of the structural curiosities of seq2seq training.
Cross-attention. In transformer encoder-decoders, the decoder’s self-attention attends to previously-generated target tokens (causal), and a separate cross-attention layer attends to the full encoder output. The encoder is bidirectional; the decoder is causal in the target and bidirectional over the source.
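Both interface points can be made concrete with small helpers. The sketch below assumes `<bos>` and `<eos>` marker tokens and uses hypothetical function names; it shows the shifted-right decoder input used for teacher forcing, and the two attention masks (causal over the target, unrestricted over the source).

```python
from typing import List, Sequence, Tuple


def teacher_forcing_pairs(
    target: Sequence[str], bos: str = "<bos>", eos: str = "<eos>"
) -> Tuple[List[str], List[str]]:
    """Return (decoder_input, labels).

    decoder_input is the target shifted right by one position, so the
    decoder sees y_1..y_{t-1} when it is asked to predict labels[t-1] = y_t.
    """
    decoder_input = [bos] + list(target)
    labels = list(target) + [eos]
    return decoder_input, labels


def causal_mask(n: int) -> List[List[int]]:
    """mask[i][j] == 1 when target position i may attend to target position j."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_mask(target_len: int, source_len: int) -> List[List[int]]:
    """Every target position may attend to every source position."""
    return [[1] * source_len for _ in range(target_len)]


decoder_input, labels = teacher_forcing_pairs(["the", "cat"])
print(decoder_input)  # ['<bos>', 'the', 'cat']
print(labels)         # ['the', 'cat', '<eos>']
print(causal_mask(3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```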

Architecture diagram

The original RNN-based seq2seq with attention:

Source: x_1  x_2  x_3  ...  x_m

         │    │    │         │
         ▼    ▼    ▼         ▼
        ┌──┐ ┌──┐ ┌──┐      ┌──┐
        │E1│▶│E2│▶│E3│ ... ▶│Em│        Encoder RNN
        └──┘ └──┘ └──┘      └──┘
         │    │    │         │
         ▼    ▼    ▼         ▼
        e_1  e_2  e_3       e_m         Encoder hidden states

                  │
                  ▼
        ┌──────────────────┐
        │ Attention over   │ ◄──── decoder query at step t
        │ encoder states   │
        └──────────────────┘
                  │
                  ▼
                context_t

         <bos>  y_1  y_2 ...
          │      │    │
          ▼      ▼    ▼
         ┌──┐  ┌──┐  ┌──┐
         │D1│▶ │D2│▶ │D3│ ...           Decoder RNN
         └──┘  └──┘  └──┘
          │     │     │
          ▼     ▼     ▼
         y_1   y_2   y_3                Target tokens (softmax over vocab)

The transformer encoder-decoder (T5-style) is structurally similar but every recurrence is replaced by stacked self-attention + feed-forward blocks:

Source tokens ──▶ Embedding + PosEnc ──▶ [Encoder block] × L_enc ──▶ encoder_out
                                              │
                                              │ (keys, values for cross-attn)
                                              ▼
Target tokens ──▶ Embedding + PosEnc ──▶ [Decoder block] × L_dec ──▶ logits
                  (shifted right)

  Decoder block:
    ┌─────────────────────────────────────┐
    │ Causal self-attention (over target) │
    ├─────────────────────────────────────┤
    │ Cross-attention (over encoder_out)  │
    ├─────────────────────────────────────┤
    │ Feed-forward                        │
    └─────────────────────────────────────┘
       (each with residual + LayerNorm)

The cross-attention layer is the entire point of the framework. Decoder query Q comes from the target position; keys and values K, V come from the encoder output. The decoder, at every output position and every layer, can look at the full source sequence with no compression bottleneck.
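To make the Q/K/V roles concrete, here is a minimal single-head sketch of scaled dot-product cross-attention in plain Python. It is a simplification under stated assumptions: the learned projection matrices are omitted (treated as identity), there is no multi-head split, and the function and variable names are illustrative only.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """queries: one vector per target position (decoder side).
    keys, values: one vector per source position (from encoder_out).
    Returns (contexts, weights), one row per target position.
    """
    d = len(keys[0])
    contexts, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # distribution over ALL source positions
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 source positions
decoder_states = [[4.0, 0.0], [0.0, 4.0]]           # 2 target positions
contexts, weights = cross_attention(decoder_states, encoder_out, encoder_out)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` spans all three source positions, which is the point: no target position is restricted to a compressed summary of the source.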

Training objective

Teacher-forced cross-entropy on target tokens:

L = - Σ_t log P(y_t | y_1, ..., y_{t-1}, x_1, ..., x_m)

The conditioning on x_1, ..., x_m enters via the encoder representations consumed by cross-attention. Training is fully parallel over target positions because of teacher forcing: you mask the future in the target self-attention, but you do not sequentially generate during training.
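As a small worked example of the loss above, suppose the decoder's per-step distributions are already available (in practice they come from a single teacher-forced forward pass). The probabilities and the function name below are made up for illustration.

```python
import math
from typing import Dict, Sequence


def teacher_forced_nll(
    step_probs: Sequence[Dict[str, float]], labels: Sequence[str]
) -> float:
    """L = - sum_t log P(y_t | y_1..y_{t-1}, x), given each step's distribution."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, labels))


# Hypothetical decoder outputs for the target "the cat <eos>".
step_probs = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
    {"<eos>": 1.0},
]
labels = ["the", "cat", "<eos>"]

loss = teacher_forced_nll(step_probs, labels)
# -(ln 0.5 + ln 0.25 + ln 1.0) = ln 8
print(loss, math.log(8))
```

Because every term depends only on the ground-truth prefix, all three terms can be computed at once; nothing in the sum requires sampling a token first.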

For pretraining (T5, BART, mT5 family), the objective is usually span corruption or denoising: corrupt the input (mask out spans, shuffle, drop tokens), feed the corrupted version to the encoder, and ask the decoder to reconstruct the missing spans. This combines elements of masked LM (BERT-style, filling in a corrupted input) and causal LM (GPT-style, left-to-right generation) in one objective. T5’s variant masks ~15% of tokens grouped into spans of average length 3, replaces each span with a sentinel token, and tasks the decoder with emitting <sentinel> recovered span <sentinel> recovered span ....
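The span-corruption format is easy to illustrate. The sketch below takes the spans as given rather than sampling them, so it does not reproduce the 15% rate or the span-length distribution. The `<extra_id_N>` sentinel names follow a convention used by common T5 tokenizers and should be treated as an assumption here, as should the function name.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel and build the decoder target.

    spans must be sorted and non-overlapping.
    """
    encoder_input: List[str] = []
    decoder_target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        encoder_input.extend(tokens[cursor:start])  # keep the uncorrupted text
        encoder_input.append(sentinel)              # one sentinel per dropped span
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:end])    # the span to recover
        cursor = end
    encoder_input.extend(tokens[cursor:])
    decoder_target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return encoder_input, decoder_target


tokens = "Thank you for inviting me to your party last week".split()
enc, dec = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(enc))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(dec))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```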

The pretraining objective matters enormously for what the model is good at downstream. Span-corruption pretraining produces models that handle both understanding and generation reasonably well — the encoder learns bidirectional representations, the decoder learns to generate. Pure causal LM pretraining (GPT-style) optimizes only for generation. Pure masked LM (BERT-style) optimizes only for understanding.

Encoder-decoder (T5, BART, FLAN-T5) — two stacks, bidirectional encoder, causal decoder, cross-attention bridge. Cleanly separates “understand input” from “generate output”. Stronger at tasks with a clear source/target distinction: translation, summarization, structured extraction.

Decoder-only (GPT family) — one stack, causal attention, source and target concatenated in the same sequence. Cleaner code, better at open-ended generation, easier to scale, dominant at the frontier. The encoder bidirectionality advantage shrinks as model size and instruction-tuning compensate.

Key variants of the encoder-decoder framework:

Sutskever et al. (2014). Stacked LSTM encoder + stacked LSTM decoder, no attention. Final encoder hidden state → decoder initial state. The original; bottlenecked on long source sequences.
Bahdanau attention (2014). Decoder attends over all encoder hidden states at each output step. Compatibility scored by a learned feed-forward over [h_{decoder}, h_{encoder}]. Solved the fixed-vector bottleneck.
Luong attention (2015). Same idea, dot-product compatibility instead of feed-forward. Cheaper, similar performance.
GNMT (Google Neural Machine Translation, 2016). 8-layer LSTM encoder + 8-layer LSTM decoder + attention + residual connections + length-normalized beam search. Production-scale neural translation.
Transformer encoder-decoder (Vaswani et al., 2017). Drop recurrence, use stacked attention layers everywhere. The original transformer was an encoder-decoder for WMT translation.
BART (2019). Encoder-decoder transformer pretrained with denoising. Strong on summarization and generation tasks.
T5 / mT5 / FLAN-T5 (2019-2022). “Text-to-text transfer transformer”. Every task framed as text-in, text-out. Span-corruption pretraining + multi-task fine-tuning. Multilingual variants.
Whisper (2022). Encoder-decoder transformer for speech-to-text. Encoder consumes mel-spectrogram, decoder emits text tokens conditioned on encoder output and special task tokens (language ID, translate-or-transcribe).
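The difference between the Bahdanau and Luong entries in the list above comes down to how the decoder state is scored against each encoder state. A minimal sketch follows, with hypothetical parameter names (`W_dec`, `W_enc`, `v`) and hand-picked weights in place of learned ones.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot_score(h_dec: Vector, h_enc: Vector) -> float:
    """Luong-style dot-product compatibility: h_dec . h_enc"""
    return sum(d * e for d, e in zip(h_dec, h_enc))


def additive_score(
    h_dec: Vector, h_enc: Vector, W_dec: Matrix, W_enc: Matrix, v: Vector
) -> float:
    """Bahdanau-style compatibility: v . tanh(W_dec h_dec + W_enc h_enc).

    Equivalent to a one-hidden-layer feed-forward net over the
    concatenation [h_dec, h_enc].
    """
    hidden = [
        math.tanh(
            sum(w * x for w, x in zip(row_d, h_dec))
            + sum(w * x for w, x in zip(row_e, h_enc))
        )
        for row_d, row_e in zip(W_dec, W_enc)
    ]
    return sum(vi * hi for vi, hi in zip(v, hidden))


h_dec, h_enc = [1.0, 0.0], [0.5, 0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights

s_dot = dot_score(h_dec, h_enc)
s_add = additive_score(h_dec, h_enc, identity, identity, [1.0, 1.0])
print(s_dot, s_add)
```

Either score is then passed through a softmax over source positions to produce attention weights. The dot-product form has no extra parameters, which is where its lower cost comes from.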

Practical considerations

When to choose encoder-decoder over decoder-only. The trade-off has shifted dramatically since 2020. Up through the GPT-2 era, encoder-decoder dominated translation, summarization, and any task with a clean input/output split. Post-GPT-3, decoder-only models with sufficient scale and good instruction tuning have closed the gap on most of these tasks, and they now account for almost all frontier model development.

Encoder-decoder still has advantages in three cases:

The input is from a different modality than the output. Speech-to-text (Whisper), image captioning (some variants), multilingual translation with very different scripts. Having a dedicated encoder optimized for the input modality is cleaner than tokenizing everything into the same stream.
The input is much longer than the output. Long-document summarization. The encoder is bidirectional and can be scaled to long inputs with techniques like Longformer-style sparse attention; the decoder only generates a short summary so its KV cache stays small.
You need bidirectional understanding of the source at every decoding step. Information extraction, structured parsing where the answer references arbitrary spans of the input.

For open-ended generation, dialogue, code generation, and most modern chat use cases, decoder-only is the pragmatic default.

KV cache. At inference, the decoder’s self-attention KV cache grows with the generated sequence (one new entry per generated token). The encoder output is computed once at the start and cached — it doesn’t grow. This is part of why encoder-decoder is efficient for long-input/short-output tasks.
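A back-of-the-envelope count makes the asymmetry visible. The sketch assumes a typical implementation in which each decoder layer caches its own projected keys and values for both attention types; the function name and the sizes are illustrative.

```python
from typing import Dict


def kv_cache_positions(
    source_len: int, generated_len: int, num_decoder_layers: int
) -> Dict[str, int]:
    """Count cached positions (each holds one key and one value vector)."""
    return {
        # Grows by one position per layer for every generated token.
        "self_attention": num_decoder_layers * generated_len,
        # Projected once from encoder_out, then fixed for the whole generation.
        "cross_attention": num_decoder_layers * source_len,
    }


# Long input, short output: a 4000-token document and a 12-layer decoder.
early = kv_cache_positions(source_len=4000, generated_len=1, num_decoder_layers=12)
late = kv_cache_positions(source_len=4000, generated_len=100, num_decoder_layers=12)
print(early)  # {'self_attention': 12, 'cross_attention': 48000}
print(late)   # {'self_attention': 1200, 'cross_attention': 48000}
```

The cross-attention entry is not small for a long source, but it is paid once and stays constant, while the only part that grows is tied to the short output.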

Beam search. Encoder-decoder models, especially for translation, traditionally use beam search at inference (typically beam width 4-12). Decoder-only models almost exclusively use sampling (nucleus, temperature, top-k). The choice reflects the task: when there’s one “right” answer in a tight semantic space (translation), beam search helps; when generating diverse text, beam search produces flat repetitive outputs.
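A compact beam search with length normalization might look like the following. This is a simplified sketch: the names are hypothetical, the toy model ignores the source side, and the normalization is a plain divide-by-length rather than any specific system's formula.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "<eos>"
Hypothesis = Tuple[List[str], float]  # (tokens, summed log-probability)


def beam_search(
    next_token_probs: Callable[[Sequence[str]], Dict[str, float]],
    beam_width: int = 2,
    max_len: int = 10,
) -> List[str]:
    """Return the hypothesis with the best length-normalized log-probability."""
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for tokens, logp in beams:
            for tok, p in next_token_probs(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + [tok], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:  # keep the top-k extensions
            (finished if tokens[-1] == EOS else beams).append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1] / max(len(c[0]), 1))[0]


# Toy model where the greedy first choice ("a") leads to a worse sequence.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy(prefix: Sequence[str]) -> Dict[str, float]:
    return TABLE.get(tuple(prefix), {EOS: 1.0})


print(beam_search(toy, beam_width=1))  # ['a', 'x', '<eos>']  (probability 0.30)
print(beam_search(toy, beam_width=2))  # ['b', 'z', '<eos>']  (probability 0.36)
```

With `beam_width=1` the search degenerates to greedy decoding and commits to "a"; a width of 2 keeps "b" alive long enough to find the higher-probability sequence.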

Real-world deployments

Encoder-decoder is the working architecture under many production systems:

Machine translation at Google, Microsoft, Meta, Amazon. Internal MT engines in 2025 are transformer encoder-decoders, often with adapter layers per language pair. NLLB (No Language Left Behind, Meta 2022) covers 200 languages; its largest model is a 54.5B-parameter mixture-of-experts encoder-decoder.
Document summarization in enterprise tools. Microsoft 365 Copilot’s summarization, Google Docs summary cards, Notion AI summaries — many run on T5/BART-derived encoder-decoders, sometimes alongside larger decoder-only models for harder cases.
OpenAI Whisper. Encoder-decoder transformer for speech-to-text, 32-layer encoder + 32-layer decoder at the large size. Open weights, runs locally, deployed widely.
Code translation and refactoring tools. Some specialized code-to-code tools (transpilers, language migration) use encoder-decoders because the input and output are both well-defined sequences with a clear relationship.
Structured extraction services. Document AI products that convert PDFs/forms to structured JSON often run an encoder-decoder with the input being the OCR’d text and the output being the schema-conformant JSON tokens.

The dominant frontier chat models (GPT-4, Claude, Gemini, Llama) are decoder-only, but the encoder-decoder shape persists in specialized production paths where its strengths still pay off.

The conceptual gift the encoder-decoder framework left behind

Even though decoder-only has won most of the frontier, the encoder-decoder framework gave the field three durable ideas. First, the split between “represent the input” and “generate the output” generalized to multimodal architectures: many modern vision-language models still pair a vision encoder with a language decoder. Second, cross-attention as a clean interface between two stacks turned out to be a durable abstraction for joining heterogeneous modalities; Flamingo’s gated cross-attention layers and BLIP-2’s Q-Former both build on the mechanism the 2017 paper described, although other multimodal assistants instead project image features directly into the decoder’s token stream. Third, the discipline of teacher-forced training with autoregressive inference is the contract every generative language model still operates under. The shape is foundational even where the literal encoder-decoder stack isn’t deployed anymore.