Reference Notes

Concise reference sheets, mathematical derivations, parameter bounds, and architecture summaries compiled during active research.

Stage 1

Deep Learning Foundations (2)

Calculus derivatives, gradients, and backpropagation math.

2026-05-16

Reverse-Mode Auto-Diff (Backprop Graph Math)

Backpropagation derived as reverse-mode automatic differentiation over computational graphs.

#math

2026-05-15

Vector Gradients and Jacobians Reference

Basic formulas for multivariate vector derivatives, Jacobian matrices, and matrix calculus operations.

#math

Stage 2

Transformers (3)

Self-attention queries, keys, values, and multi-head splits.

2026-05-20

Multi-Head Attention Dimensional Splitting

Tensor reshapes, shape changes, and the concatenation and output projection in multi-head attention.

#attention
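
The head split and concatenation described in this entry can be sketched in NumPy. This is a minimal illustration, not the sheet's own code: the function names `split_heads` and `merge_heads`, the `(batch, seq_len, d_model)` layout, and the example sizes are all assumptions, and the learned output projection that normally follows the concatenation is omitted.

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)."""
    batch, seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    # Carve the last axis into (num_heads, d_head), then move heads forward.
    x = x.reshape(batch, seq_len, num_heads, d_head)
    return x.transpose(0, 2, 1, 3)

def merge_heads(x):
    """Inverse of split_heads: concatenate heads back into d_model."""
    batch, num_heads, seq_len, d_head = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * d_head)

# Hypothetical sizes: batch=2, seq_len=5, d_model=8, num_heads=4 (d_head=2).
x = np.arange(2 * 5 * 8, dtype=np.float32).reshape(2, 5, 8)
heads = split_heads(x, num_heads=4)
restored = merge_heads(heads)
```

Splitting and merging are pure reshapes, so `merge_heads(split_heads(x, h))` returns the original tensor unchanged.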

2026-05-19

Causal Masking and Attention Bounds

How a lower-triangular causal mask prevents each position from attending to later positions during training.

#attention
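
A lower-triangular causal mask can be sketched in NumPy as follows. This is a minimal illustration under assumed names (`causal_mask`, `masked_softmax`) and an assumed sequence length of 4, not the sheet's own implementation; the scores are all zeros purely so the effect of the mask is easy to see.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask: entry (i, j) is True when position i may attend to j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed positions forced to zero weight."""
    # Masked-out scores become -inf, so exp(-inf) = 0 after the softmax.
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is
    # always allowed, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Uniform scores: each row spreads its weight evenly over allowed positions.
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, the first row places all of its weight on position 0 and the last row spreads it evenly across all four positions, while every entry above the diagonal is exactly zero.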

2026-05-18

Query, Key, and Value Projection Dimensions

Matrix shapes, projections, and dot-product calculations in self-attention layers.

#attention

Stage 3

Tokenization (2)

BPE merges, vocabulary builders, and UTF-8 byte fallbacks.

2026-05-23

UTF-8 Byte Fallback Strategy

How modern tokenizers handle out-of-vocabulary characters using raw byte encoding.

#tokenization

2026-05-22

Byte-Pair Encoding Merge Tables

How Byte-Pair Encoding (BPE) algorithms construct vocabularies and manage token merge tables.

#tokenization

Stage 4

GPT Architecture (2)

Decoder-only blocks, causal masking, and LayerNorm placements.

2026-05-26

LayerNorm Mechanics (Pre-LN vs Post-LN)

Formulas for Pre-LN and Post-LN architectures and how the two differ in stabilizing transformer training.

#gpt

2026-05-25

Decoder-Only Transformer Configurations

Block layout, causal routing, and standard hyperparameters for decoder-only models.

#gpt
