Transformer Deep Dive: Part 1 - The Original Transformer (2017)
A deep dive into the original Transformer architecture from 'Attention Is All You Need' - the encoder-decoder structure, scaled dot-product attention, multi-head attention, and the design decisions that revolutionized NLP.
Suchinthaka W.
January 15, 2025 · 5 min read
This is the first post in my series exploring the Transformer architecture, from the original 2017 paper to modern LLMs like GPT-4 and LLaMA. We'll start with the canonical Transformer v1, designed for machine translation.
Reference: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
Overall Architecture
The original Transformer has an Encoder-Decoder structure:
Input sentence (source) → Encoder stack (N=6) → Decoder stack (N=6) → Output sentence (target)
Key Properties
- Encoder reads the source language (e.g., English)
- Decoder generates the target language (e.g., German)
- No recurrence (no RNN, no LSTM)
- No convolution
- Relies entirely on attention mechanisms
Why Replace RNNs?
| RNN | Transformer |
|-----|-------------|
| Sequential processing (slow) | Parallel processing (fast) |
| Long-range gradient issues | Direct connections |
| Hard to parallelize | GPU-friendly |
Scaled Dot-Product Attention
The core innovation is the attention formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Step-by-Step

- Compute scores: $S = QK^T$
- Scale: $S / \sqrt{d_k}$
- Softmax: $A = \mathrm{softmax}\!\left(S / \sqrt{d_k}\right)$ (row-wise)
- Weighted sum: $\mathrm{output} = AV$
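To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function and variable names are my own, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```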
Why Scale by $\sqrt{d_k}$?
This is one of the most critical design decisions. When $d_k$ is large, dot products grow large in magnitude, pushing softmax into saturation regions with tiny gradients.
The Math: Assume Q and K vectors have independent components with mean 0 and variance 1 (standard initialization). The variance of their dot product is:

$$\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k$$
When $d_k = 64$, the standard deviation is $\sqrt{64} = 8$, meaning dot products can easily be ±16 or larger. This causes softmax to produce near-one-hot distributions:
| Without Scaling | With Scaling |
|-----------------|--------------|
| Scores: [-20, 15, 18, -12] | Scores: [-2.5, 1.9, 2.3, -1.5] |
| Softmax: [0.00, 0.05, 0.95, 0.00] | Softmax: [0.01, 0.40, 0.58, 0.01] |
| Gradients: Nearly zero | Gradients: Flow to all positions |
Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping softmax in a healthy gradient region.
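A quick numerical sanity check of this variance argument (my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n_samples = 100_000

# Random query/key vectors with i.i.d. components, mean 0, variance 1
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))

raw_scores = (q * k).sum(axis=1)           # dot products
scaled_scores = raw_scores / np.sqrt(d_k)  # divided by sqrt(d_k)

print(raw_scores.var())     # ≈ 64: variance grows linearly with d_k
print(scaled_scores.var())  # ≈ 1:  scaling restores unit variance
```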
Multi-Head Attention
Instead of performing a single attention function, we project queries, keys, and values $h$ times with different learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where each head is:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
| Configuration | Base Model | Big Model |
|---------------|------------|-----------|
| Number of heads ($h$) | 8 | 16 |
| $d_k = d_v$ (per head) | 64 | 64 |
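Here is a hedged PyTorch sketch of multi-head attention under these settings (the class name, tensor layout, and the use of one combined projection per Q/K/V are my choices, not prescribed by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Sketch: project, split into heads, attend per head, concat, project out."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads            # 512 / 8 = 64 per head
        self.num_heads = num_heads
        # One big projection per Q/K/V is equivalent to h separate per-head projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)     # W^O applied after concatenation

    def forward(self, q, k, v, mask=None):
        B, T, _ = q.shape
        # Project, then reshape the last dim into (num_heads, d_k)
        def split(x):
            return x.view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))

        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # (B, h, T, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                         # (B, h, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concat heads
        return self.w_o(out)
```

With $d_{model} = 512$ and $h = 8$, each head works in a 64-dimensional subspace, matching the table above.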
Three Types of Attention
1. Encoder Self-Attention
- Q, K, V all from encoder input
- Full attention: every position attends to all positions
- No masking
2. Masked Decoder Self-Attention
- Q, K, V all from decoder input
- Causal mask: position i can only attend to positions ≤ i
- Prevents looking at future tokens (see the mask sketch after this list)
3. Encoder-Decoder (Cross) Attention
- Q from decoder
- K, V from encoder output
- Decoder can attend to entire source sentence
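As a rough illustration of the causal mask from type 2 (assuming the common lower-triangular boolean convention, where `True` means "may attend"):

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```

In the multi-head sketch above, this boolean matrix can be passed as `mask`; positions where it is 0 are filled with $-\infty$ before the softmax, so they receive zero attention weight.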
The Encoder Layer
Each encoder layer has two sublayers with Post-Layer Normalization:
Input x
↓
Multi-Head Self-Attention
↓
Dropout
↓
Add (Residual): x + output
↓
LayerNorm ← Post-LN
↓
Feed Forward Network
↓
Dropout
↓
Add (Residual)
↓
LayerNorm ← Post-LN
↓
Output
Critical: In the original Transformer, LayerNorm comes AFTER the residual addition:

$$\mathrm{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$$
This is a key difference from modern LLMs, which use Pre-LN (we'll cover this in Part 2).
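To make the ordering concrete, here is a minimal PyTorch sketch of a single Post-LN encoder layer following the flow above (class and argument names are my own; `nn.MultiheadAttention` stands in for the hand-rolled attention earlier):

```python
import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """One encoder layer with Post-LN: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: self-attention -> dropout -> residual add -> LayerNorm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sublayer 2: position-wise FFN -> dropout -> residual add -> LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

Both LayerNorms wrap the residual sum, which is exactly the Post-LN pattern that Part 2 will contrast with Pre-LN.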
Feed Forward Network
Position-wise FFN applied identically to every token:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
| Property | Value |
|----------|-------|
| Hidden dimension ($d_{ff}$) | 2048 |
| Model dimension ($d_{model}$) | 512 |
| Activation | ReLU (not GELU) |
Dimensions: 512 → 2048 → 512
Positional Encoding
Unlike RNNs, Transformers process all tokens in parallel. Without positional information, "The cat sat on the mat" and "The mat sat on the cat" would be indistinguishable.
The original uses fixed sinusoidal encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Why Sinusoids?
- Can extrapolate to longer sequences than seen during training
- Relative positions representable as linear functions
- Different frequencies capture different position scales
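A small NumPy sketch of these fixed encodings (the helper name and array layout are my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of fixed positional encodings."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

These encodings are added to the token embeddings before the first encoder and decoder layers.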
Training Configuration
| Setting | Value |
|---------|-------|
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$) |
| LR Schedule | Warmup + inverse square root decay |
| Warmup steps | 4000 |
| Label smoothing | 0.1 |
| Dropout | 0.1 |
| Layers | 6 encoder + 6 decoder |
| Parameters (base) | ~65M |
| Parameters (big) | ~213M |
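The paper's warmup + inverse-square-root schedule can be written as a tiny helper (the function name is mine; the formula is the one given in the paper):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly for 4000 steps, then decays as 1/sqrt(step)
for step in (100, 4000, 16000, 100000):
    print(step, round(transformer_lr(step), 6))
```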
Summary
| Component | Original Design | |-----------|-----------------| | Architecture | Encoder-Decoder | | Layers | 6 + 6 | | LayerNorm | Post-LN (after residual) | | Attention | Full self-attention | | FFN Activation | ReLU | | Positional Encoding | Sinusoidal (fixed) | | Use Case | Translation |
In the next post, we'll explore Part 2: Architecture Changes - how modern LLMs evolved from this original design with decoder-only architectures, Pre-Layer Normalization, and RMSNorm.