Vamsi Cheruku.
gpt · 2026-05-25

Decoder-Only Transformer Configurations

Block layout, causal routing, and standard hyperparameters for decoder-only models.

gpt transformer-blocks architecture

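Autoregressive language models such as GPT, LLaMA, and Mistral use a decoder-only architecture. Compared with the original encoder-decoder Transformer, it drops the encoder and the cross-attention sublayer of each decoder block, keeping only the causally masked self-attention and the feed-forward network. The causal mask lets each position attend only to itself and to earlier positions.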
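To make the causal mask concrete, here is a minimal framework-free sketch. The helper names (`causal_mask`, `masked_softmax`) and the all-zero scores are illustrative assumptions, not part of any library; a real model would compute the scores from query and key projections.

```python
import math

def causal_mask(seq_len):
    """seq_len x seq_len matrix: True where query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax over raw attention scores; disallowed positions get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Masked positions become -inf, so they contribute exp(-inf) = 0.
        masked = [s if ok else float("-inf") for s, ok in zip(score_row, mask_row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal, so the mask effect is visible).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attn = masked_softmax(scores, causal_mask(3))
```

With equal scores, token 0 puts all of its weight on itself, token 1 splits evenly over tokens 0 and 1, and token 2 splits evenly over all three. No row assigns weight to a later position.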

Block Structure

A single GPT decoder block contains:

  1. Pre-Layer Normalization (Pre-LN): Norm applied to the block input before self-attention.
  2. Causal Multi-Head Self-Attention: Computes token-to-token attention weights, with each position masked from attending to later positions.
  3. Residual Connection: Adds the un-normalized block input to the attention output.
  4. Pre-Layer Normalization (Pre-LN): Norm applied to the output of the residual add.
  5. Feed-Forward Network (FFN): Two linear layers with an activation function such as GELU in between. Gated variants such as SwiGLU add a third, gating projection.
  6. Residual Connection: Adds the output of step 3 (the un-normalized input of the FFN branch) to the FFN output.

Input ➔ LayerNorm ➔ Causal Attention ➔ (+) ➔ LayerNorm ➔ FFN ➔ (+) ➔ Output
  │                                     ▲  │                    ▲
  └─────────────────────────────────────┘  └────────────────────┘
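The two FFN styles named in step 5 can be sketched in plain Python. This is a simplified illustration under stated assumptions: biases are omitted, weights are nested lists, and the function names (`ffn_gelu`, `ffn_swiglu`, `matvec`) are hypothetical rather than taken from any library.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU (Swish): x * sigmoid(x), the activation used inside SwiGLU."""
    return x / (1.0 + math.exp(-x))

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def ffn_gelu(x, w_in, w_out):
    """Classic FFN: linear -> GELU -> linear (two weight matrices)."""
    hidden = [gelu(h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)

def ffn_swiglu(x, w_gate, w_up, w_out):
    """Gated FFN: SiLU(gate projection) * up projection -> linear (three weight matrices)."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_out, hidden)
```

The practical difference is the extra `w_gate` matrix: the gated form multiplies two projections of the same input elementwise instead of applying one activation to a single projection.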

Parameter Configurations (GPT-2 standards)

Hyperparameters for two GPT-2 model sizes:

Parameter    Symbol    GPT-2 Small    GPT-2 Medium    Description
Layers       n_layer   12             24              Number of transformer block stacks
Dimension    d_model   768            1024            Hidden state vector size
Heads        n_head    12             16              Number of attention heads
Block Size   n_ctx     1024           1024            Max context window tokens
Vocab Size   n_vocab   50257          50257           Total size of BPE vocabulary
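As a sanity check on the table, the sketch below derives the per-head dimension and a total parameter count from these five values. The `DecoderConfig` class and `count_parameters` function are hypothetical names. The count assumes the GPT-2 layout: learned position embeddings, biases on every linear layer, a 4x FFN expansion, a final LayerNorm, and an output projection that shares weights with the token embedding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoderConfig:
    n_layer: int
    d_model: int
    n_head: int
    n_ctx: int
    n_vocab: int

    @property
    def head_dim(self) -> int:
        # Each head works on an equal slice of the hidden state.
        if self.d_model % self.n_head != 0:
            raise ValueError("d_model must be divisible by n_head")
        return self.d_model // self.n_head

def count_parameters(cfg: DecoderConfig) -> int:
    d = cfg.d_model
    embeddings = cfg.n_vocab * d + cfg.n_ctx * d   # token table + learned position table
    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused Q/K/V projection + output projection
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)    # expand to 4 * d_model, project back
    layer_norms = 2 * (2 * d)                      # two LayerNorms per block (scale + shift)
    per_block = attention + ffn + layer_norms
    final_norm = 2 * d                             # LayerNorm after the last block
    # Assumption: the output head reuses the token embedding, so it adds nothing.
    return embeddings + cfg.n_layer * per_block + final_norm

GPT2_SMALL = DecoderConfig(n_layer=12, d_model=768, n_head=12, n_ctx=1024, n_vocab=50257)
GPT2_MEDIUM = DecoderConfig(n_layer=24, d_model=1024, n_head=16, n_ctx=1024, n_vocab=50257)
```

Under these assumptions both configurations use 64-dimensional heads, and the totals come to 124,439,808 parameters for Small and 354,823,168 for Medium.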

PyTorch Block Definition

import torch
import torch.nn as nn
 
class FeedForward(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(0.1)
        )
    def forward(self, x):
        return self.net(x)
 
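class Block(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        # Assumption: PyTorch's built-in nn.MultiheadAttention stands in for the
        # MultiHeadAttention module that this note references but does not define.
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffwd = FeedForward(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Boolean causal mask: True marks key positions a query may NOT attend to.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        # Pre-LN architecture (modern standard): normalize, transform, then add.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffwd(self.ln2(x))
        return x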
