Transformer Deep Dive: Part 1 - The Original Transformer (2017)
A deep dive into the original Transformer architecture from 'Attention Is All You Need' - the encoder-decoder structure, scaled dot-product attention, multi-head attention, and the design decisions that revolutionized NLP.
Suchinthaka W.
January 15, 2025 · 5 min read
This is the first post in my series exploring the Transformer architecture, from the original 2017 paper to modern LLMs like GPT-4 and LLaMA. We'll start with the canonical Transformer v1, designed for machine translation.
Reference: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
Overall Architecture
The original Transformer has an Encoder-Decoder structure:
Input sentence (source) → Encoder stack (N=6) → Decoder stack (N=6) → Output sentence (target)
Key Properties
- Encoder reads the source language (e.g., English)
- Decoder generates the target language (e.g., German)
- No recurrence (no RNN, no LSTM)
- No convolution
- Relies entirely on attention mechanisms
Why Replace RNNs?
| RNN | Transformer |
|-----|-------------|
| Sequential processing (slow) | Parallel processing (fast) |
| Long-range gradient issues | Direct connections |
| Hard to parallelize | GPU-friendly |
Scaled Dot-Product Attention
The core innovation is the attention formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Step-by-Step

- Compute scores: $S = QK^T$
- Scale: $S / \sqrt{d_k}$
- Softmax: $A = \mathrm{softmax}\!\left(S / \sqrt{d_k}\right)$ (row-wise)
- Weighted sum: $\mathrm{output} = AV$
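To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function and variable names are my own, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```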
Why Scale by $\sqrt{d_k}$?
This is one of the most critical design decisions. When $d_k$ is large, dot products grow large in magnitude, pushing softmax into saturation regions with tiny gradients.
The Math: Assume Q and K vectors have independent components with mean 0 and variance 1 (standard initialization). The variance of their dot product is:

$$\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k$$
When $d_k = 64$, the standard deviation is $\sqrt{64} = 8$, meaning dot products can easily be ±16 or larger. This causes softmax to produce near-one-hot distributions:
| Without Scaling | With Scaling |
|-----------------|--------------|
| Scores: [-20, 15, 18, -12] | Scores: [-2.5, 1.9, 2.3, -1.5] |
| Softmax: [0.00, 0.05, 0.95, 0.00] | Softmax: [0.01, 0.40, 0.58, 0.01] |
| Gradients: Nearly zero | Gradients: Flow to all positions |
Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping softmax in a healthy gradient region.
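A quick numerical sanity check of this variance argument (my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n_samples = 100_000

# Random query/key vectors with i.i.d. components, mean 0, variance 1
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))

raw_scores = (q * k).sum(axis=1)           # dot products
scaled_scores = raw_scores / np.sqrt(d_k)  # divided by sqrt(d_k)

print(raw_scores.var())     # ≈ 64: variance grows linearly with d_k
print(scaled_scores.var())  # ≈ 1:  scaling restores unit variance
```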
Multi-Head Attention
Instead of performing a single attention function, we project queries, keys, and values $h$ times with different learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where each head is:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
| Configuration | Base Model | Big Model |
|---------------|------------|-----------|
| Number of heads ($h$) | 8 | 16 |
| $d_k = d_v$ (per head) | 64 | 64 |
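Here is a hedged PyTorch sketch of multi-head attention under these settings (the class name, tensor layout, and the use of one combined projection per Q/K/V are my choices, not prescribed by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Sketch: project, split into heads, attend per head, concat, project out."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads            # 512 / 8 = 64 per head
        self.num_heads = num_heads
        # One big projection per Q/K/V is equivalent to h separate per-head projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)     # W^O applied after concatenation

    def forward(self, q, k, v, mask=None):
        B, T, _ = q.shape
        # Project, then reshape the last dim into (num_heads, d_k)
        def split(x):
            return x.view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))

        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # (B, h, T, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                         # (B, h, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concat heads
        return self.w_o(out)
```

With $d_{model} = 512$ and $h = 8$, each head works in a 64-dimensional subspace, matching the table above.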
Three Types of Attention
1. Encoder Self-Attention
- Q, K, V all from encoder input
- Full attention: every position attends to all positions
- No masking
2. Masked Decoder Self-Attention
- Q, K, V all from decoder input
- Causal mask: position i can only attend to positions ≤ i
- Prevents looking at future tokens (see the mask sketch after this list)
3. Encoder-Decoder (Cross) Attention
- Q from decoder
- K, V from encoder output
- Decoder can attend to entire source sentence
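As a rough illustration of the causal mask from type 2 (assuming the common lower-triangular boolean convention, where `True` means "may attend"):

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```

In the multi-head sketch above, this boolean matrix can be passed as `mask`; positions where it is 0 are filled with $-\infty$ before the softmax, so they receive zero attention weight.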
The Encoder Layer
Each encoder layer has two sublayers with Post-Layer Normalization:
Input x
↓
Multi-Head Self-Attention
↓
Dropout
↓
Add (Residual): x + output
↓
LayerNorm ← Post-LN
↓
Feed Forward Network
↓
Dropout
↓
Add (Residual)
↓
LayerNorm ← Post-LN
↓
Output
Critical: In the original Transformer, LayerNorm comes AFTER the residual addition:

$$\mathrm{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$$
This is a key difference from modern LLMs, which use Pre-LN (we'll cover this in Part 2).
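To make the ordering concrete, here is a minimal PyTorch sketch of a single Post-LN encoder layer following the flow above (class and argument names are my own; `nn.MultiheadAttention` stands in for the hand-rolled attention earlier):

```python
import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """One encoder layer with Post-LN: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: self-attention -> dropout -> residual add -> LayerNorm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sublayer 2: position-wise FFN -> dropout -> residual add -> LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

Both LayerNorms wrap the residual sum, which is exactly the Post-LN pattern that Part 2 will contrast with Pre-LN.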
Feed Forward Network
Position-wise FFN applied identically to every token:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
| Property | Value |
|----------|-------|
| Hidden dimension ($d_{ff}$) | 2048 |
| Model dimension ($d_{model}$) | 512 |
| Activation | ReLU (not GELU) |
Dimensions: 512 → 2048 → 512
Positional Encoding
Unlike RNNs, Transformers process all tokens in parallel. Without positional information, "The cat sat on the mat" and "The mat sat on the cat" would be indistinguishable.
The original uses fixed sinusoidal encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Why Sinusoids?
- Can extrapolate to longer sequences than seen during training
- Relative positions representable as linear functions
- Different frequencies capture different position scales
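A small NumPy sketch of these fixed encodings (the helper name and array layout are my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of fixed positional encodings."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

These encodings are added to the token embeddings before the first encoder and decoder layers.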
Training Configuration
| Setting | Value |
|---------|-------|
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$) |
| LR Schedule | Warmup + inverse square root decay |
| Warmup steps | 4000 |
| Label smoothing | 0.1 |
| Dropout | 0.1 |
| Layers | 6 encoder + 6 decoder |
| Parameters (base) | ~65M |
| Parameters (big) | ~213M |
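The paper's warmup + inverse-square-root schedule can be written as a tiny helper (the function name is mine; the formula is the one given in the paper):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly for 4000 steps, then decays as 1/sqrt(step)
for step in (100, 4000, 16000, 100000):
    print(step, round(transformer_lr(step), 6))
```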
Summary
| Component | Original Design | |-----------|-----------------| | Architecture | Encoder-Decoder | | Layers | 6 + 6 | | LayerNorm | Post-LN (after residual) | | Attention | Full self-attention | | FFN Activation | ReLU | | Positional Encoding | Sinusoidal (fixed) | | Use Case | Translation |
In the next post, we'll explore Part 2: Architecture Changes - how modern LLMs evolved from this original design with decoder-only architectures, Pre-Layer Normalization, and RMSNorm.