Reconstructing Context with Sequence Models (LSTM / GRU)

Origin and intuition

The vanilla RNN had a fatal structural problem: gradients flowing backward through the recurrence were multiplied by a Jacobian at every timestep, and with saturating activations like tanh and typical weight scales that Jacobian’s norm tended to sit below 1. Across 50 timesteps the gradient shrank to noise. The model could not learn dependencies past ~10-20 tokens, not because the architecture lacked capacity, but because the training signal couldn’t reach the relevant weights.
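
To make the shrinkage concrete, here is a minimal sketch using a scalar RNN, h_t = tanh(w * h_{t-1} + x_t), where the per-step Jacobian is the single number w * (1 - h_t^2). The weight 0.9 and the constant input 0.5 are illustrative assumptions, not values from any real model.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Per-step Jacobian (a scalar here): w * tanh'(pre-activation) = w * (1 - h^2)
        grad *= w * (1.0 - h * h)
    return grad

inputs = [0.5] * 50  # hypothetical constant input sequence
for steps in (5, 20, 50):
    # The printed gradient shrinks rapidly as the number of steps grows.
    print(steps, gradient_through_time(0.9, inputs[:steps]))
```

Every factor in the product is below 1 here, so the gradient decays geometrically with sequence length, which is the failure mode described above.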

Hochreiter and Schmidhuber’s 1997 LSTM paper proposed a way out: build an explicit memory pathway through the recurrence that is approximately the identity, so gradients flow through it nearly unattenuated. The pathway is the “cell state” c_t, a vector that the network can write to, read from, and erase from via learned gates, but that is otherwise carried forward unchanged across timesteps. (The erase operation, the forget gate, was not in the 1997 paper; Gers, Schmidhuber, and Cummins added it in 2000.) The gates use sigmoid activations producing values between 0 and 1, acting as soft on/off switches.

The crucial detail is that the cell state update is additive, not multiplicative through a saturating nonlinearity. Backpropagating through c_t = f_t * c_{t-1} + i_t * g_t, the direct-path derivative of c_t with respect to c_{t-1} is just f_t, so when the forget gate is close to 1 the gradient passes through with almost no decay. This was the insight: keep the gradient highway open, gate the read/write operations separately.
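
A small numeric sketch of that claim: along the direct cell-state path, the gradient of c_T with respect to c_0 is the product of the forget-gate values in between. The constant gate values below are assumptions chosen for illustration, and the sketch ignores the indirect paths through h_{t-1} into the gates.

```python
def cell_path_gradient(forget_gates):
    """Direct-path d c_T / d c_0 for one cell unit: the product of forget-gate values."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

open_highway = cell_path_gradient([0.99] * 100)  # forget gate held near 1
half_closed = cell_path_gradient([0.5] * 100)    # forget gate at the sigmoid midpoint
print(open_highway, half_closed)
```

With the gate near 1 the gradient after 100 steps is still a usable fraction of its starting value; at 0.5 it is numerically indistinguishable from zero.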

It worked. LSTMs trained reliably on sequences of hundreds of tokens, and by the mid-2010s they were the default architecture for handwriting recognition, speech recognition, and machine translation. The 2014 sequence-to-sequence paper that launched neural translation was a stack of LSTMs. The GRU (Cho et al., 2014) is a simplification with similar performance and fewer gates. Both were displaced by the transformer, introduced in 2017, for parallel-trainable tasks, but the gated-recurrence pattern has kept returning: in state-space models, in attention-augmented RNNs, in recent hybrid architectures.

Inputs and outputs

The interface is identical to a vanilla RNN: at each step, consume an input vector x_t, update internal state, optionally emit an output. What’s different is the internal state has two tensors instead of one:

h_t — the hidden state, exposed externally (this is what downstream layers see and what gets passed across steps as the “output”).
c_t — the cell state, internal to the cell, carrying long-term information through the gradient highway.

GRU collapses these back into a single hidden state by merging the cell-state and hidden-state pathways, with the gating arithmetic adjusted accordingly. Empirically GRU and LSTM perform within a percent or two of each other on most tasks; LSTM is slightly more expressive on very long sequences, GRU is ~25% cheaper to compute per step.

The same four input/output topologies apply (one-to-many, many-to-one, many-to-many aligned, many-to-many unaligned). Practically every neural NLP system between 2014 and 2017 ran on stacked LSTMs in one of these configurations.

Architecture diagram

The LSTM cell is the unit of interest. Three gates plus a candidate-update, all conditioned on [x_t, h_{t-1}]:

                           c_{t-1}
                              │
                              ▼
          ┌─────────────┐  ┌─────┐  ┌──────────────┐   c_t
   x_t ──▶│ forget gate │─▶│  *  │─▶│      +       │─────▶
   h_{t-1}│   f_t = σ() │  └─────┘  │              │       │
          └─────────────┘            └──────┬───────┘      │
                                            ▲              │
          ┌─────────────┐                    │              │
   x_t ──▶│ input gate  │──┐                 │              │
   h_{t-1}│   i_t = σ() │  │   ┌─────┐       │              │
          └─────────────┘  └──▶│  *  │───────┘              │
                               └─────┘                      │
          ┌─────────────┐         ▲                         │
   x_t ──▶│  candidate  │─────────┘                         │
   h_{t-1}│   g_t = tanh()│                                 │
          └─────────────┘                                   │
                                                            │
          ┌─────────────┐  ┌─────────┐                      │
   x_t ──▶│ output gate │─▶│   *     │◄──── tanh(c_t) ◄─────┘
   h_{t-1}│   o_t = σ() │  └────┬────┘
          └─────────────┘       │
                                ▼
                               h_t

Equations, all per-step:

f_t = σ(W_f · [x_t, h_{t-1}] + b_f)         # forget gate
i_t = σ(W_i · [x_t, h_{t-1}] + b_i)         # input gate
g_t = tanh(W_g · [x_t, h_{t-1}] + b_g)      # candidate
o_t = σ(W_o · [x_t, h_{t-1}] + b_o)         # output gate
c_t = f_t * c_{t-1} + i_t * g_t              # cell state update
h_t = o_t * tanh(c_t)                        # hidden state output

Four weight matrices per cell — the parameter count is ~4× a vanilla RNN at the same hidden size. The cell state c_t is the gradient highway: when f_t is near 1 and i_t is near 0, c_t ≈ c_{t-1} and gradients flow back unattenuated.
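
The six equations translate almost line for line into code. The following is a minimal sketch in plain Python lists (no tensor library, to keep it dependency-free); the `params` dictionary layout and the helper names `sigmoid` and `affine` are assumptions made for this example, not any library’s API.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps gate name -> (W, b), with W shaped [d_h][d_x + d_h]."""
    xh = x + h_prev  # list concatenation: [x_t, h_{t-1}]
    f = [sigmoid(v) for v in affine(*params["f"], xh)]    # forget gate
    i = [sigmoid(v) for v in affine(*params["i"], xh)]    # input gate
    g = [math.tanh(v) for v in affine(*params["g"], xh)]  # candidate
    o = [sigmoid(v) for v in affine(*params["o"], xh)]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Hypothetical parameters: zero weights, biases chosen to saturate the gates.
d_x, d_h = 2, 3
zeros = [[0.0] * (d_x + d_h) for _ in range(d_h)]
params = {
    "f": (zeros, [10.0] * d_h),   # forget gate near 1
    "i": (zeros, [-10.0] * d_h),  # input gate near 0
    "g": (zeros, [0.0] * d_h),
    "o": (zeros, [0.0] * d_h),
}
c_prev = [0.5, -0.2, 0.1]
h, c = lstm_step([1.0, -1.0], [0.0] * d_h, c_prev, params)
print(c)  # stays very close to c_prev
```

With the forget gate saturated open and the input gate saturated shut, the cell state passes through the step essentially unchanged, which is the highway behaviour in its purest form.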

GRU collapses to two gates (reset and update) and a single state:

r_t = σ(W_r · [x_t, h_{t-1}] + b_r)             # reset gate
z_t = σ(W_z · [x_t, h_{t-1}] + b_z)             # update gate
h̃_t = tanh(W_h · [x_t, r_t * h_{t-1}] + b_h)    # candidate
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t            # blend old and new

The update gate z_t plays the dual role of LSTM’s forget and input gates. Three matrices instead of four. Slightly less expressive on paper, indistinguishable in practice for most tasks.
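
The same kind of sketch for the GRU, again in plain Python lists with hypothetical helper names and a hypothetical `params` layout. The two calls at the bottom show the update gate doing both jobs: near 0 it keeps the old state, near 1 it overwrites it with the candidate.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(x, h_prev, params):
    """One GRU step. params maps name -> (W, b), with W shaped [d_h][d_x + d_h]."""
    xh = x + h_prev
    r = [sigmoid(v) for v in affine(*params["r"], xh)]  # reset gate
    z = [sigmoid(v) for v in affine(*params["z"], xh)]  # update gate
    x_rh = x + [rt * hp for rt, hp in zip(r, h_prev)]   # [x_t, r_t * h_{t-1}]
    h_cand = [math.tanh(v) for v in affine(*params["h"], x_rh)]
    return [(1 - zt) * hp + zt * hc for zt, hp, hc in zip(z, h_prev, h_cand)]

# Hypothetical parameters: zero weights, only the update-gate bias varies.
d_x, d_h = 2, 3
zeros = [[0.0] * (d_x + d_h) for _ in range(d_h)]
h_prev = [0.5, -0.2, 0.1]

def make_params(update_bias):
    return {
        "r": (zeros, [0.0] * d_h),
        "z": (zeros, [update_bias] * d_h),
        "h": (zeros, [0.0] * d_h),
    }

keep = gru_step([1.0, -1.0], h_prev, make_params(-10.0))      # z near 0: retain h_prev
overwrite = gru_step([1.0, -1.0], h_prev, make_params(10.0))  # z near 1: take the candidate
print(keep, overwrite)
```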

Training objective

The objective doesn’t change from vanilla RNNs — it depends on what you’re doing with the sequence. For language modeling, next-token cross-entropy. For classification, cross-entropy on the final hidden state. For sequence-to-sequence, cross-entropy on each generated token in the decoder.

Training uses backpropagation through time (BPTT), same as vanilla RNNs, but the cell-state path makes BPTT actually work over long sequences. Truncated BPTT is still standard practice — backpropagate through windows of 35-200 timesteps, carry hidden and cell states across windows without backpropagation crossing the boundary.

A practical training detail worth remembering: initialize the forget-gate bias b_f to a positive value (Jozefowicz et al., 2015, recommended +1). With zero bias, the sigmoid starts at 0.5, so the cell state decays by half each step — the gradient highway is half-closed at initialization. Biasing forget toward “remember” at init makes LSTMs train noticeably faster and more reliably.
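
The effect of the bias is easy to put numbers on. This sketch assumes the forget-gate pre-activation at initialization is just its bias (that is, the weights contribute roughly nothing yet) and asks what fraction of the cell state survives ten steps; the function name is made up for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def retained_fraction(forget_bias, steps):
    """Fraction of the cell state surviving `steps` updates when f_t = sigmoid(bias)."""
    return sigmoid(forget_bias) ** steps

for bias in (0.0, 1.0, 2.0):
    print(bias, retained_fraction(bias, 10))
```

Moving the bias from 0 to +1 raises the fraction that survives ten steps by well over an order of magnitude, which is why such a small change matters so much at the start of training.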

The 2014-2017 period generated dozens of LSTM variants. Most made no difference; a few stuck:

Peephole connections (Gers and Schmidhuber, 2000). The gates can see the cell state directly, not just [x_t, h_{t-1}]. Marginally better on some timing-sensitive tasks. Not standard.
Coupled forget/input gates. i_t = 1 - f_t. Reduces parameters; one less gate to train. Standard in many GRU-flavoured implementations.
Bidirectional LSTM (BiLSTM). Forward LSTM + backward LSTM, concatenated hidden states. Standard for any task where you can see the whole sequence at inference time — tagging, parsing, classification. The dominant architecture for non-generative NLP until BERT replaced it in 2018.
Stacked LSTM. Multiple LSTM layers, each taking the previous layer’s hidden-state sequence as input. Standard depth: 2-4 layers. Beyond that, gradient flow degrades again and residual connections are needed.
Highway LSTM / Recurrent Highway Network. Adds residual connections within the recurrence step. Useful for very deep stacks (~8+ layers).
GRU. Simpler, slightly faster. Best understood as “LSTM minus the cell-state vs hidden-state distinction”.

LSTM. Three gates plus a candidate update (four weight matrices), separate cell and hidden states, ~4× vanilla RNN parameters. The standard for tasks needing the longest memory: speech, handwriting, music generation. More expressive on very long sequences.

GRU. Two gates plus a candidate (three weight matrices), single state, ~3× vanilla RNN parameters. Trains faster, uses less memory, performs within a percent or two of LSTM on most language tasks. The pragmatic default when you don’t need every last point of accuracy.
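
The parameter ratios fall straight out of counting affine maps over [x_t, h_{t-1}]: one for a vanilla RNN, three for a GRU, four for an LSTM. A quick sketch, with hypothetical sizes d_x = 256 and d_h = 512 and one bias vector per map (an assumption; implementations differ on bias layout):

```python
def recurrent_params(d_x, d_h, n_blocks):
    """Weights plus biases for n_blocks affine maps over [x_t, h_{t-1}]."""
    return n_blocks * (d_h * (d_x + d_h) + d_h)

d_x, d_h = 256, 512  # hypothetical embedding and hidden sizes
rnn = recurrent_params(d_x, d_h, 1)
gru = recurrent_params(d_x, d_h, 3)
lstm = recurrent_params(d_x, d_h, 4)
print(rnn, gru, lstm)
print("GRU saving vs LSTM:", 1 - gru / lstm)
```

Three blocks against four is where the roughly 25% per-step saving of the GRU comes from.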

Other refinements that shipped in the LSTM era and influenced later architectures:

Attention over encoder states (Bahdanau, 2014; Luong, 2015). Bolt attention onto a sequence-to-sequence LSTM. The decoder at each step attends over the encoder’s hidden states instead of relying on a single final state. This was the bridge from RNNs to transformers — attention was first a fix for the encoder bottleneck, then the only thing left.
Layer normalization on recurrent activations. Stabilizes deep LSTM training. Standard in 2016+ implementations.
Variational dropout (Gal and Ghahramani, 2016). Apply the same dropout mask at every timestep instead of a fresh one. Much better regularization for RNNs than naive per-step dropout.
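
To make the attention item above concrete, here is a minimal sketch of the simplest scoring variant: a plain dot product between the decoder state and each encoder state (Luong’s “dot” score; Bahdanau’s original used a small additive network instead). The function name and the toy vectors are assumptions for illustration.

```python
import math

def attend(decoder_state, encoder_states):
    """Dot-product attention: returns (weights, context vector)."""
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    # Context is the attention-weighted average of the encoder states.
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states)) for j in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs
weights, context = attend([2.0, 0.0], encoder_states)
print(weights, context)
```

The decoder gets a fresh weighted summary of all encoder states at every step, so nothing has to be squeezed through a single final hidden state.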

Practical considerations

Useful context length. LSTMs reliably handle ~100-300 tokens of meaningful context on language modeling, ~500-1000 on speech and music tasks where the per-step information density is lower. Past ~1000 tokens, even gated RNNs struggle: the cell state is still a fixed-size vector, and information about token 1 has to survive 1000 multiplicative updates by the forget gate. The gradient flows; the information bottleneck remains.

Memory and compute shape. Per-step compute is dominated by the four matrix multiplies in the gate computations: O(d_h · (d_x + d_h)) per step, where d_h is hidden size (256-1024 typical) and d_x is input embedding size. Memory is O(d_h) per step for activations. For BPTT you store activations at every step, so peak training memory scales linearly with sequence length and hidden size.

A single sequence’s processing is unavoidably sequential — step t needs h_{t-1} — so per-sequence wall-clock time is linear in length. You parallelize across the batch dimension, which works fine until batch size hits the GPU memory wall.

Gradient explosion still happens. Vanishing is mostly fixed; exploding isn’t. Gradient clipping by global norm (typically clip-norm 1.0 or 5.0) is standard and non-optional.
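
Clipping by global norm means measuring one L2 norm across all parameter gradients together and, if it exceeds the threshold, scaling every gradient by the same factor so the direction is preserved. A minimal sketch over plain lists (the function name and toy gradients are made up for the example):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradient lists jointly so their combined L2 norm is at most max_norm.

    Returns (clipped gradients, norm before clipping).
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm

grads = [[3.0, 4.0], [12.0]]  # toy gradients for two parameters; global norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
print(norm_before, clipped)
```

Because a single scale factor is applied everywhere, the relative sizes of the per-parameter gradients are unchanged; only the overall step length is capped.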

Initialization. Orthogonal initialization for the recurrent matrices, Xavier/Glorot for the input projections. Forget-gate bias initialized positive (+1 or +2). These three details together account for most of the “LSTM is hard to train” folklore disappearing in well-tuned setups.
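
For the orthogonal part, one way to sketch it without a linear-algebra library is Gram-Schmidt on random Gaussian rows. Frameworks typically use a QR decomposition instead; the approach and the function name here are assumptions for illustration.

```python
import math
import random

def orthogonal_matrix(n, seed=0):
    """Random n x n orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along every row accepted so far.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip degenerate draws (vanishingly unlikely)
            rows.append([a / norm for a in v])
    return rows

W_rec = orthogonal_matrix(4)
# Rows are orthonormal, so W_rec times its transpose should be the identity.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in W_rec] for r1 in W_rec]
print(gram)
```

An orthogonal recurrent matrix preserves vector norms, so at initialization repeated multiplication by it neither shrinks nor inflates the hidden state.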

Real-world deployments

LSTMs and GRUs were the dominant sequence architecture from roughly 2014 to 2018. Production systems built on them include:

Google Translate’s 2016 neural rewrite (GNMT). Stacked 8-layer LSTM encoder + 8-layer LSTM decoder + attention. Replaced phrase-based statistical MT and cut translation errors by ~60% on average in the human evaluations Google reported. Itself replaced by transformer-based models in the following years.
Amazon Echo / Apple Siri / Google Assistant ASR backends, 2015-2018. Deep BiLSTM acoustic models, CTC or attention-based decoders. The streaming variants live on for low-latency speech recognition.
Smartphone keyboard prediction. Compact LSTMs running on-device for next-word prediction. Still in use as of the mid-2020s on memory-constrained handsets where transformer KV-caches don’t fit.
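Tacotron 2 (Google, 2017). Neural TTS with a bidirectional-LSTM encoder and an attention-equipped LSTM decoder that predicts mel spectrograms, fed to a WaveNet vocoder. WaveNet itself (DeepMind, 2016) is a stack of dilated causal convolutions, not an LSTM.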
Algorithmic trading and forecasting. Many time-series shops still run LSTM/GRU stacks for tabular sequence prediction — when the sequence is short (50-200 steps) and you need fast inference on CPU, gated RNNs are still competitive with transformers.

The “LSTMs are dead” narrative is overstated. They’ve been displaced from frontier language modeling, but the architecture is still shipping in production at scale wherever streaming inference with bounded per-token state matters more than ultimate context length.

What state-space models inherited from LSTMs

Mamba, RWKV, and the recent wave of state-space sequence models look superficially different from LSTMs — selective scan kernels, structured state matrices, time-varying parameters — but the family resemblance is direct. They all maintain a fixed-size recurrent state, update it additively with gating-like mechanisms, and use that state as a gradient-friendly pathway through time. The key innovation isn’t replacing recurrence; it’s reformulating the recurrence so the forward pass can be parallelized across positions (via the associative scan, or convolution-equivalent forms). This is the missing piece that handicapped LSTMs against transformers: training parallelism. Once you have that and gated additive memory, you get the best of both worlds. Whether SSMs displace transformers at frontier scale is still an open empirical question in 2026, but the architecture has clearly closed the gap on long-context benchmarks, and the lineage runs straight back to the 1997 LSTM paper.
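
The parallelization trick is easiest to see on the bare linear recurrence h_t = a_t * h_{t-1} + b_t, where a_t plays the role of a forget gate and b_t the gated write. Composing two such steps is associative, so a prefix scan can replace the sequential loop. The sketch below is a toy scalar version with made-up values, not any model’s actual kernel. It only works because a_t and b_t do not depend on h_{t-1}; an LSTM’s gates do, which is why the same trick does not apply to it directly.

```python
def combine(left, right):
    """Compose two steps of h -> a*h + b; `left` is applied first."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_l * a_r, a_r * b_l + b_r)

def sequential(a, b):
    """Reference: the ordinary step-by-step recurrence, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan(a, b):
    """Inclusive prefix scan (Hillis-Steele). Within each round every position
    updates independently, so a round can run in parallel; there are about
    log2(T) rounds instead of T sequential steps."""
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    return [b_t for _, b_t in elems]

a = [0.9, 0.5, 0.8, 1.0, 0.3]  # toy decay ("forget") values
b = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy write values
print(sequential(a, b))
print(scan(a, b))  # same values up to floating-point rounding
```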