The GPT-2 Decoder Block: Before Writing Any Code

At university, long before the current AI period, neural networks were still niche. Academic appeal, limited hardware, results that rarely justified the compute. We wrote simple recurrent networks in C, used them in robotics and image-processing experiments, and spent most of the time fighting the hardware constraints rather than thinking about the model. I liked that part more than I probably should have admitted.

Then around 2017 and 2018 transformers arrived, and the field reorganized itself around them. Not a replacement for neural networks — a different way to structure them, with a mechanism for sequence modeling that turned out to scale in a way nothing before it had.

When LetGPU published a challenge to implement a GPT-2 Small decoder block, the old interest surfaced immediately. Before writing a single line of code I wanted to understand what the block actually is, how data flows through it, and what the challenge is really asking for. This post is that understanding. The implementation — in CUDA, on the December trip to Tokyo — comes separately.

What a Transformer Block Does

At a high level, a neural network is a system that repeatedly transforms a representation into a better one. In language models the representation is a vector per token. Each block takes those vectors, processes them, and passes a refined version forward.

The key difference from older sequence models is attention. RNNs process tokens one at a time and carry state forward; long-range dependencies get compressed into whatever the hidden state can hold. Transformers let each token look at other tokens directly and decide which ones matter for its current representation. That is not a minor optimization: any two positions are one attention step apart instead of many recurrent steps, and all positions can be processed in parallel during training, which changes both what the model can learn and how fast it can train.

Figure 1. A generic, simplified transformer block. Attention lets tokens exchange information; the feed-forward network refines each representation independently. Residual connections add the original signal back after each sublayer; LayerNorm keeps the values numerically stable.

The block does two jobs in sequence. First, attention: tokens share context with each other. Second, a feed-forward network: each token’s representation is deepened independently. Residual connections run around both, so the block learns updates rather than replacements. The output shape matches the input — same number of tokens, same hidden width.

The LetGPU Challenge

The task is to implement one GPT-2 Small decoder block. You are given an input tensor x of shape (seq_len, 768) and a packed flat buffer containing all parameters for the block. Compute the output — same shape as the input.

That sounds compact. It is not. One block contains:

two LayerNorms (each with learned scale and bias)
a combined QKV projection (768 → 2304)
12-head self-attention
an output projection (768 → 768)
two residual connections
a feed-forward network: 768 → 3072 → GELU → 768

About 7.1 million parameters in one block. GPT-2 Small stacks twelve of them.
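The 7.1 million figure can be checked directly from the shapes in the list. The short Python sketch below does only that arithmetic; the dictionary keys are labels chosen for illustration, not names from the challenge.

```python
# Parameter count for one GPT-2 Small decoder block, from the listed shapes.
D_MODEL = 768            # hidden width
D_QKV = 3 * D_MODEL      # 2304, combined Q, K, V projection
D_FF = 4 * D_MODEL       # 3072, feed-forward expansion

# Each entry is weights + biases (or scale + bias for LayerNorm).
block_params = {
    "ln1 (scale + bias)": 2 * D_MODEL,
    "qkv projection": D_MODEL * D_QKV + D_QKV,
    "attention output projection": D_MODEL * D_MODEL + D_MODEL,
    "ln2 (scale + bias)": 2 * D_MODEL,
    "ffn expand": D_MODEL * D_FF + D_FF,
    "ffn contract": D_FF * D_MODEL + D_MODEL,
}

total_params = sum(block_params.values())

for name, count in block_params.items():
    print(f"{name:30s} {count:>10,d}")
print(f"{'total':30s} {total_params:>10,d}")
```

The total comes to 7,087,872, and two thirds of it sits in the two feed-forward matrices.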

Figure 2. Data flow through the GPT-2 Small decoder block from the LetGPU challenge. Pre-norm: LayerNorm runs before attention and before the feed-forward network. Hidden size 768, twelve heads of 64 dimensions each, feed-forward expansion to 3072.

Pre-Norm and the Flow

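GPT-2 uses pre-norm: LayerNorm is applied to the input of each sublayer, and the sublayer's output is added back to the unnormalized input. The original "Attention Is All You Need" paper used post-norm, which normalizes after the residual add. Pre-norm leaves the residual path untouched from block input to block output, which tends to make deep stacks easier to train. It also fixes where the two sets of LayerNorm parameters apply, which is worth knowing before reading the weight layout.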

The full sequence through the block:

x → LN1 → QKV projection → multi-head attention → output projection → add(x, ·) → x1
x1 → LN2 → FC projection → GELU → output projection → add(x1, ·) → output

Ten operations, two residual paths, one block.
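The two lines of flow above translate almost literally into code. The sketch below is plain Python on lists of rows, meant to show the pre-norm wiring rather than to be fast. The names `layer_norm` and `pre_norm_block` are hypothetical, the sublayers are passed in as placeholder callables, and the epsilon of 1e-5 is an assumption that should be checked against the challenge statement.

```python
import math

def layer_norm(row, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)  # eps value is an assumption
    return [g * (v - mean) * inv_std + b for v, g, b in zip(row, gamma, beta)]

def add_rows(a, b):
    """Element-wise residual add of two (seq_len, width) lists."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def pre_norm_block(x, ln1, attn, ln2, ffn):
    """Pre-norm wiring of one block.

    ln1, ln2: map one token row to one token row.
    attn:     maps all rows to all rows (the only step that mixes tokens).
    ffn:      maps one token row to one token row.
    """
    attn_out = attn([ln1(row) for row in x])   # LN1 -> QKV -> attention -> projection
    x1 = add_rows(x, attn_out)                 # first residual, on the unnormalized x
    ffn_out = [ffn(ln2(row)) for row in x1]    # LN2 -> FC -> GELU -> projection
    return add_rows(x1, ffn_out)               # second residual
```

If both sublayers return zeros, the block returns its input unchanged, which is a quick way to confirm that the residual paths bypass the normalization.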

Attention: Q, K, V

Each token is projected into three vectors (Query, Key, Value) via a single combined projection whose 2304-wide output is then split into three 768-wide parts. The intuition that actually holds up: the Query describes what this token is looking for, the Key describes what it offers, and the Value is what gets passed along if the match is strong.

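Attention weights come from comparing queries against keys: each query-key dot product is scaled by 1/√64, and a softmax over the scores turns them into weights. The scaling keeps the variance of the dot products near 1 regardless of head dimension, assuming roughly unit-variance components; without it, the softmax inputs grow with d_k and push the softmax into saturated regions where gradients are close to zero. Those weights then determine how values are mixed. Each token ends up with a weighted sum of values from the positions it may attend to (in GPT-2 as published, a causal mask restricts these to the token itself and earlier positions), so its representation is updated by whatever context the model has learned to attend to.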
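For a single head, the computation fits in a few lines. This is a minimal sketch in plain Python, with `attention_head` and `softmax` as hypothetical helper names. The `causal` flag is an assumption added for illustration: GPT-2 as published masks out later positions, and whether the challenge expects the mask should be checked against its statement.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v, causal=True):
    """Scaled dot-product attention for one head.

    q, k, v: lists of seq_len rows, each row of length d_k (64 in GPT-2 Small).
    Returns seq_len rows of length d_k.
    """
    d_k = len(q[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for i, q_i in enumerate(q):
        # With the causal mask, token i sees positions 0..i only.
        visible = i + 1 if causal else len(k)
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(visible)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out
```

With the mask on, the first token can only attend to itself, so its output is exactly its own value vector.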

Multi-head attention runs this in parallel across twelve subspaces of 64 dimensions each. 12 × 64 = 768. Each head can specialize in a different kind of dependency; the results are concatenated and projected back to 768.
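Splitting into heads is index arithmetic on the 2304-wide projection. The sketch below assumes the combined output is laid out as Q, then K, then V, each 768 wide, with each head occupying a contiguous 64-wide slice. That ordering is an assumption for illustration and should be confirmed against the challenge; `split_qkv_row` and `concat_heads` are hypothetical names.

```python
N_HEADS = 12
HEAD_DIM = 64
D_MODEL = N_HEADS * HEAD_DIM  # 768

def split_qkv_row(qkv_row):
    """Split one token's 2304-wide projection into per-head Q, K and V slices.

    Assumed layout: [Q (768) | K (768) | V (768)], heads contiguous within each part.
    Returns three lists of 12 slices, each slice 64 wide.
    """
    if len(qkv_row) != 3 * D_MODEL:
        raise ValueError("expected a row of width 3 * 768")
    q, k, v = (qkv_row[part * D_MODEL:(part + 1) * D_MODEL] for part in range(3))

    def heads(vec):
        return [vec[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(N_HEADS)]

    return heads(q), heads(k), heads(v)

def concat_heads(per_head):
    """Concatenate 12 per-head outputs back into one 768-wide row."""
    return [value for head in per_head for value in head]
```

Under that layout, element `c` of head `h` in the key part sits at index `768 + h * 64 + c` of the projected row.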

Feed-Forward: Refine Independently

After attention, the feed-forward network processes each token on its own, with no cross-token interaction. It expands from 768 to 3072, applies GELU, and projects back to 768. The expansion gives the model room to build richer feature combinations before compressing back down. GELU acts as a smooth gate: it multiplies each input by the probability that a standard normal variable falls below it, so negative inputs are damped continuously rather than cut at a hard threshold the way ReLU does. That smoothness is the usual argument for preferring it to ReLU in this setting.
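A small sketch makes the gate concrete. GELU has an exact form based on the error function and a tanh approximation; the original GPT-2 code uses the tanh form, but which one the challenge expects is an assumption to verify. The `feed_forward` helper is hypothetical and assumes weights are stored as `w[input][output]`, matching the 768×3072 and 3072×768 shapes.

```python
import math

def gelu_exact(x):
    """GELU as x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def feed_forward(row, w_fc, b_fc, w_proj, b_proj):
    """Per-token feed-forward: expand, apply GELU, project back.

    row:    one token vector (width 768 in GPT-2 Small).
    w_fc:   [in][hidden] weights, b_fc: hidden biases (hidden is 3072).
    w_proj: [hidden][out] weights, b_proj: out biases (out is 768).
    """
    hidden = [
        gelu_tanh(sum(row[i] * w_fc[i][j] for i in range(len(row))) + b_fc[j])
        for j in range(len(b_fc))
    ]
    return [
        sum(hidden[i] * w_proj[i][j] for i in range(len(hidden))) + b_proj[j]
        for j in range(len(b_proj))
    ]
```

The two GELU variants agree closely over the range that matters, so the choice mostly affects whether results match a reference bit for bit.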

The Packed Weight Buffer

The parameters are not handed over as named tensors. They arrive as one flat device buffer and you are expected to know the layout:

Parameters	Size (elements)
γ₁, β₁ (LN1)	768 + 768
W_qkv, b_qkv	768×2304 + 2304
W_attn, b_attn	768×768 + 768
γ₂, β₂ (LN2)	768 + 768
W_fc, b_fc	768×3072 + 3072
W_proj, b_proj	3072×768 + 768

Working out the byte offsets is part of the exercise. It forces you to understand the model structure at the level of actual memory, not just diagram boxes.
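The offsets follow from a running sum over the table. The sketch below assumes the buffer is packed in table order with no padding, scale before bias for each LayerNorm, weights before biases for each projection, and 4-byte float32 elements. All of those are assumptions to check against the challenge statement, and the tensor names are hypothetical.

```python
D_MODEL, D_QKV, D_FF = 768, 2304, 3072

# (name, element count), in the assumed packing order.
LAYOUT = [
    ("ln1_gamma", D_MODEL), ("ln1_beta", D_MODEL),
    ("w_qkv", D_MODEL * D_QKV), ("b_qkv", D_QKV),
    ("w_attn", D_MODEL * D_MODEL), ("b_attn", D_MODEL),
    ("ln2_gamma", D_MODEL), ("ln2_beta", D_MODEL),
    ("w_fc", D_MODEL * D_FF), ("b_fc", D_FF),
    ("w_proj", D_FF * D_MODEL), ("b_proj", D_MODEL),
]

def element_offsets(layout):
    """Return ({name: starting element index}, total element count)."""
    offsets, cursor = {}, 0
    for name, size in layout:
        offsets[name] = cursor
        cursor += size
    return offsets, cursor

OFFSETS, TOTAL_ELEMENTS = element_offsets(LAYOUT)
BYTES_PER_ELEMENT = 4  # assumption: float32

for name, start in OFFSETS.items():
    print(f"{name:10s} element {start:>9,d}  byte {start * BYTES_PER_ELEMENT:>10,d}")
```

The final cursor lands on 7,087,872 elements, which matches the parameter count for the block and is a useful sanity check before any pointer arithmetic on the device.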

That is what I found interesting about the challenge. The transformer block is not complicated once you stop treating it as a black box. Attention shares context; the feed-forward network deepens each representation; residuals and LayerNorm keep the whole thing trainable. The packed buffer just makes you prove you understood the structure before the hardware will let you run it.

The CUDA implementation is in the follow-up post.