Reference Notes.
Concise reference sheets, mathematical derivations, parameter bounds, and architecture summaries compiled during active research.
Deep Learning Foundations (2)
Calculus derivatives, gradients, and backpropagation math.
Reverse-Mode Auto-Diff (Backprop Graph Math)
Backpropagation derived as reverse-mode automatic differentiation over computational graphs.
Vector Gradients and Jacobians Reference
Basic formulas for derivatives of vector-valued functions, Jacobian matrices, and matrix calculus operations.
Transformers (3)
Self-attention queries, keys, values, and multi-head splits.
Multi-Head Attention Dimensional Splitting
Tensor reshapes, per-head splits, and the concatenation and output projection steps in multi-head attention.
Causal Masking and Attention Bounds
How a lower-triangular causal mask prevents each position from attending to future tokens during training.
Query, Key, and Value Projection Dimensions
Matrix shapes, projection dimensions, and dot-product calculations in self-attention layers.
Tokenization (2)
BPE merges, vocabulary builders, and UTF-8 byte fallbacks.
UTF-8 Byte Fallback Strategy
How modern tokenizers handle out-of-vocabulary characters using raw byte encoding.
Byte-Pair Encoding Merge Tables
How Byte-Pair Encoding (BPE) algorithms construct vocabularies and manage token merge tables.
GPT Architecture (2)
Decoder-only blocks, causal masking, and LayerNorm placements.
LayerNorm Mechanics (Pre-LN vs Post-LN)
Formulas for Pre-LN and Post-LN placement and how each affects transformer training stability.
Decoder-Only Transformer Configurations
Block layout, causal masking, and standard hyperparameters for decoder-only models.