[Review] A Practical Introduction to Sequence Learning Models—RNN, LSTM, GRU (Literature Review)


TL;DR: This review distills a concise preprint on sequence models into a practitioner-oriented overview covering task formulations, core architectures (RNN/LSTM/GRU), bidirectionality, depth, training dynamics (vanishing/exploding gradients), and practical guidance on model selection.


How to cite the source

Zargar, S. A. (2021). Introduction to Sequence Learning Models: RNN, LSTM, GRU (preprint). North Carolina State University. DOI: 10.13140/RG.2.2.36370.99522. (This review summarizes and paraphrases the work; readers should consult the original for full details.)


1) Scope and Contribution of the Source

The preprint introduces sequence learning from first principles, explains why standard feed-forward networks struggle with ordered data, and then walks through RNNs, LSTMs, GRUs, bidirectional variants, and deep stacks. Its primary value is pedagogical: it translates equations into clear block-level mechanics (e.g., hidden state recurrence, gating, memory cell) and situates architectural choices against training issues like vanishing/exploding gradients.


2) What Counts as “Sequence Learning”?

The text defines sequences as ordered observations where past (or future) context informs prediction, and categorizes tasks into:

  • Many-to-one (e.g., sentiment classification)
  • One-to-many (e.g., image captioning)
  • Many-to-many, with equal or unequal input/output lengths (e.g., NER, machine translation)

This taxonomy clarifies input/output shapes and drives architectural choices (e.g., encoder–decoder vs. time-aligned outputs).
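
To make the shape implications concrete, here is a minimal PyTorch sketch (not from the preprint; all sizes, layers, and heads are illustrative assumptions) showing how the same recurrent layer serves a many-to-one task by reading only its last output and a time-aligned many-to-many task by reading every step:

```python
# Illustrative sizes only; the layer and heads are placeholders, not the preprint's models.
import torch
import torch.nn as nn

batch, seq_len, n_features, hidden = 8, 20, 6, 32
x = torch.randn(batch, seq_len, n_features)        # (batch, time, features)

rnn = nn.RNN(input_size=n_features, hidden_size=hidden, batch_first=True)
outputs, h_last = rnn(x)                            # outputs: (batch, seq_len, hidden)

# Many-to-one (e.g., sentiment): classify from the final time step only.
cls_head = nn.Linear(hidden, 2)
sentiment_logits = cls_head(outputs[:, -1, :])      # (batch, 2)

# Many-to-many, equal lengths (e.g., tagging): predict at every time step.
tag_head = nn.Linear(hidden, 10)
tag_logits = tag_head(outputs)                      # (batch, seq_len, 10)
```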

3) Why Not Plain MLPs?

Two core limitations are highlighted: (i) independence assumptions that ignore temporal/spatial context; (ii) fixed-length input expectations incompatible with variable-length sequences. These motivate recurrent connections that carry state across time.


4) Recurrent Neural Networks (RNNs): Mechanics and Limits

Mechanics. A basic RNN maintains a hidden state (a^{\langle t\rangle}) that is updated from the current input (x^{\langle t\rangle}) and the previous hidden state (a^{\langle t-1\rangle}), producing an output (\hat{y}^{\langle t\rangle}) at each step. Unfolding the network in time reveals that the same weights are shared across all steps; training uses Backpropagation Through Time (BPTT).
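
The recurrence is compact enough to write out directly. Below is a minimal NumPy sketch of a single step in the preprint's a/x/ŷ notation; the softmax output layer and the weight shapes are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_cell_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One vanilla RNN step: new hidden state a_t and prediction y_hat_t.
    (Softmax output and weight shapes are illustrative choices.)"""
    a_t = np.tanh(Wax @ x_t + Waa @ a_prev + ba)   # hidden-state recurrence
    y_hat_t = softmax(Wya @ a_t + by)              # per-step output
    return a_t, y_hat_t

# The same weights (Wax, Waa, Wya) are reused at every time step; BPTT
# backpropagates through this loop unrolled over the whole sequence.
```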

Limitations. Long-range dependencies are hard due to vanishing/exploding gradients when repeatedly multiplying the same transition matrix during BPTT. Gradient clipping and truncated BPTT help with exploding gradients, but vanishing gradients usually require architectural changes (gating/memory).
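
For the exploding-gradient side specifically, clipping is a one-line addition to a training step. The following self-contained PyTorch sketch (model, sizes, and targets are all illustrative assumptions) shows where it goes:

```python
import torch
import torch.nn as nn

# All sizes, the model, and the dummy targets below are illustrative.
torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 50, 4)                        # (batch, time, features)
y = torch.randn(8, 1)                            # last-step regression target

out, _ = rnn(x)
loss = nn.functional.mse_loss(head(out[:, -1, :]), y)
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # cap the global gradient norm
opt.step()
```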


5) Long Short-Term Memory (LSTM): Gated Memory for Long Dependencies

Key idea. LSTMs introduce a cell state (c^{\langle t\rangle}) that acts like a conveyor belt with near-linear information flow, plus gates (input, forget, output) to regulate writes, retention, and exposure of memory. This design preserves gradient flow and enables learning dependencies over long horizons.

Computation sketch.

  • A candidate cell content (\tilde{c}^{\langle t\rangle}) is computed from the current input and the prior hidden state (typically via tanh).
  • The forget gate scales the previous cell state, the input gate scales the candidate update, and the two are combined to form the new (c^{\langle t\rangle}).
  • The output gate filters the cell state to produce the exposed hidden state (a^{\langle t\rangle}) and, from it, the prediction.

This gating lets the model learn what to remember, overwrite, or ignore.
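
Written out, one LSTM step looks roughly as follows. This NumPy sketch mirrors the bullets above; the weight layout (one matrix per gate acting on the concatenation of a_prev and x_t) is an assumed convention, not the preprint's exact notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, a_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM step: returns the new hidden state a_t and cell state c_t.
    (Per-gate weight matrices on [a_prev, x_t] are an assumed layout.)"""
    concat = np.concatenate([a_prev, x_t])   # shared input to all gates
    f_t = sigmoid(W_f @ concat + b_f)        # forget gate: how much of c_prev to keep
    i_t = sigmoid(W_i @ concat + b_i)        # input gate: how much candidate to write
    o_t = sigmoid(W_o @ concat + b_o)        # output gate: how much memory to expose
    c_tilde = np.tanh(W_c @ concat + b_c)    # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde       # near-linear "conveyor belt" update
    a_t = o_t * np.tanh(c_t)                 # exposed hidden state
    return a_t, c_t
```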

Practical note. LSTMs are widely effective across domains and remain a strong baseline for sequential forecasting, classification, and sequence-to-sequence tasks, especially when very long memory is necessary.


6) Gated Recurrent Unit (GRU): A Leaner Alternative

Key idea. GRUs simplify the LSTM by (i) merging the input and forget gates into a single update gate and (ii) combining the hidden and cell states into one state vector. A reset gate additionally controls how much past information contributes when forming the candidate state. With fewer gates and parameters, GRUs are often faster to train while retaining competitive performance.
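
For comparison with the LSTM sketch above, here is one GRU step in NumPy. The gate names (update gate z, reset gate r) and the interpolation convention follow the common Cho et al. formulation and are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_step(x_t, a_prev, W_z, W_r, W_a, b_z, b_r, b_a):
    """One GRU step: a single state a_t serves as both hidden and cell state.
    (Gate naming and update convention follow Cho et al.; an assumed choice.)"""
    concat = np.concatenate([a_prev, x_t])
    z_t = sigmoid(W_z @ concat + b_z)        # update gate (merged input/forget role)
    r_t = sigmoid(W_r @ concat + b_r)        # reset gate: how much past to use
    a_tilde = np.tanh(W_a @ np.concatenate([r_t * a_prev, x_t]) + b_a)  # candidate state
    a_t = (1.0 - z_t) * a_prev + z_t * a_tilde   # interpolate old state vs. candidate
    return a_t
```

Note that only two gates and one state vector are needed per step, which is where the parameter savings relative to the LSTM come from.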

When to pick GRU vs LSTM?

  • GRU: leaner; often a strong default when compute/data are constrained or sequences aren't extremely long.
  • LSTM: when you suspect long-range structure and want more explicit memory control.

The preprint doesn't claim categorical superiority; the choice remains empirical.

7) Bidirectional RNNs (BRNNs): Using Past and Future

A bidirectional network processes the sequence in both forward and backward directions and concatenates the two hidden representations to predict (\hat{y}^{\langle t\rangle}). This is useful when the full input is available at inference (e.g., tagging, offline transcription). Limitation: it requires the entire input sequence upfront, so it’s not suitable for strict online/causal settings.
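
In most frameworks, bidirectionality is a constructor flag rather than a new architecture. A short PyTorch sketch (sizes illustrative) shows the doubled feature dimension that comes from concatenating the two directions:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 30, 10)                    # full sequence must be available (sizes illustrative)
birnn = nn.LSTM(input_size=10, hidden_size=32,
                batch_first=True, bidirectional=True)
outputs, (h_n, c_n) = birnn(x)
print(outputs.shape)   # torch.Size([8, 30, 64]): forward and backward states concatenated
# Per-step labels can be read off `outputs`, but this is non-causal: it needs
# the whole input up front, so it does not suit streaming inference.
```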


8) Deep (Stacked) RNN/LSTM/GRU: Depth vs. Temporal Depth

Although RNNs are already “deep in time,” stacking 2–3 recurrent layers can learn richer abstractions. Very deep stacks are less common (training gets harder and returns may diminish), but modest depth often helps. You can also combine depth with bidirectionality.
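
In code, depth is again just a constructor argument. A short PyTorch sketch (depth, dropout, and sizes are illustrative choices, not recommendations from the preprint) combines stacking with bidirectionality:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 30, 10)                              # illustrative batch of sequences
deep_gru = nn.GRU(input_size=10, hidden_size=32, num_layers=2,
                  batch_first=True, dropout=0.2,        # dropout applied between stacked layers
                  bidirectional=True)
outputs, h_n = deep_gru(x)
print(outputs.shape)   # torch.Size([8, 30, 64]): last layer's per-step outputs
print(h_n.shape)       # torch.Size([4, 8, 32]): num_layers * num_directions final states
```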


9) Training Considerations Summarized

  • Exploding gradients: use gradient clipping; consider truncated BPTT for long sequences (a training-loop sketch follows this list).
  • Vanishing gradients: prefer gated cells (LSTM/GRU).
  • Sequence length & latency:
    • Online/low-latency → unidirectional RNN/LSTM/GRU.
    • Offline tagging/labeling with full context → bidirectional variants.
  • Model capacity vs data: GRU as lighter baseline; LSTM when capacity/memory control are important.
  • Stacking: 2–3 layers can help; watch optimization stability.
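
The clipping and truncated-BPTT points above come together in a training loop. This self-contained PyTorch sketch (all sizes and hyperparameters are illustrative assumptions) processes a long sequence in chunks and detaches the hidden state between chunks, so gradients only flow through a bounded window:

```python
import torch
import torch.nn as nn

# Illustrative model, data, and hyperparameters; not taken from the preprint.
torch.manual_seed(0)
model = nn.GRU(input_size=4, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(model.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 400, 4)          # one long batch of sequences: (batch, time, features)
y = torch.randn(8, 400, 1)          # per-step regression targets
chunk = 50                          # truncation window for BPTT

h = None                            # initial hidden state (zeros by default)
for start in range(0, x.size(1), chunk):
    xb = x[:, start:start + chunk]
    yb = y[:, start:start + chunk]
    out, h = model(xb, h)
    loss = nn.functional.mse_loss(head(out), yb)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # exploding-gradient guard
    opt.step()
    h = h.detach()                  # cut the graph: no backprop beyond this chunk
```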

10) Where This Piece Sits in the Literature

The preprint positions RNN-class models as the historical backbone of sequence learning, tracing development from early recurrent architectures (Hopfield, Elman/Jordan) to practical training (BPTT), then to the LSTM/GRU innovations that address gradient flow. It also notes broad applications (e.g., speech, language, structured signals) and points readers to canonical references for each idea. Although transformer-based models now dominate many NLP tasks, RNN-family models remain relevant for on-device, lower-latency, or smaller-data regimes, and for domains where temporal inductive biases and streaming inference matter. (Context synthesized from the preprint's historical framing and applications list.)


11) Comparative Snapshot

  • Handles long dependencies: Vanilla RNN is weak (vanishing gradients); LSTM is strong (cell state + gates); GRU is strong (simpler gates); bidirectional variants (any cell) are strong (they use both past and future context).
  • Parameters/complexity: Vanilla RNN lowest; LSTM higher; GRU medium; bidirectional roughly 2× per layer (two directions).
  • Latency at inference: Vanilla RNN low; LSTM moderate; GRU low to moderate; bidirectional higher (not causal).
  • Best-fit use cases: Vanilla RNN for short contexts and simple tasks; LSTM for long-range structure and complex tasks; GRU as an efficient baseline with competitive accuracy; bidirectional for offline tagging/labeling with full context.

12) Practical Takeaways for Your Projects

  1. Start simple: For many classification/forecasting tasks, begin with a single-layer GRU baseline (a minimal sketch follows this list); add another layer only if validation suggests underfitting.
  2. Need long context? Switch to LSTM (or keep GRU and increase hidden size), and extend sequence windows with truncated BPTT + gradient clipping.
  3. Latency vs. accuracy: Prefer unidirectional models for real-time streams; use bidirectional when you can batch full sequences offline.
  4. Stability tips: Normalize inputs, clip gradients, monitor exploding/vanishing diagnostics (grad norms, layer activations).
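
As a concrete starting point for takeaway 1, here is a minimal single-layer GRU classifier for many-to-one tasks; the architecture and sizes are illustrative assumptions, not prescriptions from the preprint:

```python
import torch
import torch.nn as nn

class GRUBaseline(nn.Module):
    """Single-layer GRU + linear head for sequence classification (illustrative sizes)."""
    def __init__(self, n_features: int, hidden: int, n_classes: int):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, time, features)
        _, h_n = self.gru(x)          # h_n: (1, batch, hidden), final hidden state
        return self.head(h_n[-1])     # logits: (batch, n_classes)

model = GRUBaseline(n_features=6, hidden=64, n_classes=3)
logits = model(torch.randn(16, 40, 6))
print(logits.shape)                   # torch.Size([16, 3])
```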

13) Strengths and Limitations of the Source

Strengths

  • Clear conceptualization of sequence problem types and architectural match.
  • Intuitive exposition of LSTM/GRU gating, with equations tied to block diagrams.
  • Pragmatic notes on vanishing/exploding gradients and training workarounds.

Limitations

  • Focuses on RNN-class models; does not cover modern attention/transformer approaches or hybrid architectures.
  • Empirical comparisons are largely conceptual—no dataset-by-dataset benchmarks.
  • Engineering practices (regularization, initialization, scheduling) are only briefly touched.

14) Suggested Reading Path (From This Starting Point)

  1. Use this preprint to master RNN/LSTM/GRU mechanics and training dynamics.
  2. Extend to encoder–decoder with attention, then transformers, to compare inductive biases and scalability.
  3. For operations research / air-cargo time series, prototype GRU/LSTM baselines with causal validation splits; compare to 1D-CNNs and transformer-style temporal models under your latency/compute constraints.

15) Attribution and Fair Use

This post is a paraphrased literature review of the cited preprint. Quotations (if any) are minimal and used for purposes of commentary and scholarship. For full figures, equations, and original wording, please read the author's preprint and respect the author's rights. If you plan to reuse figures or substantial excerpts, obtain permission from the rights holder.


16) Key Points Checklist

  • Sequence tasks come in many-to-one, one-to-many, and many-to-many forms—architectures must fit shapes.
  • Vanilla RNNs struggle with long-range dependencies due to gradient issues.
  • LSTM/GRU introduce gates/memory that stabilize training over long horizons.
  • Bidirectional variants exploit full context but are non-causal.
  • Stacking 2–3 recurrent layers can help, but watch optimization stability.