Resources & Exploration

Reflective notes on completed study resources, integrated with codebases I am exploring in systems-level AI.

Timeline Progression

Study reviews and codebase exploration grouped by roadmap stages.

Stage 1

Deep Learning Foundations

Linear algebra, the chain rule from calculus, and backpropagation math.

Studied Materials (1)

3Blue1Brown Neural Networks & Deep Learning Series

A visual, mathematical decomposition of multi-layer networks, parameter optimization, and gradient descent trajectories.

Most Valuable Insight: "Weight updates are coordinate transforms warping a high-dimensional space to isolate boundaries, and backprop is the reverse chain-rule tracking through these warps."

Personal Reflection

Revisiting Sanderson's series reframed my understanding from programming APIs to geometric tensor warps. It helped me visualize gradient descent not just as a tuning step, but as a path through high-dimensional cost surfaces.

Concepts Gained

Forward propagation, Backpropagation graph calculus, Geometric weight transformations, Sigmoid & ReLU activations
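The chain-rule view of backpropagation can be checked on the smallest possible case. The following is a minimal sketch, assuming a single sigmoid neuron with a squared-error loss; the function names and the learning rate are illustrative choices, not taken from the series.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward propagation through one sigmoid neuron."""
    return sigmoid(w * x + b)

def loss(w, b, x, y):
    """Squared error between the prediction and the target y."""
    a = forward(w, b, x)
    return 0.5 * (a - y) ** 2

def grads(w, b, x, y):
    """Backpropagation: multiply local derivatives from the loss back to w and b."""
    a = forward(w, b, x)
    dL_da = a - y              # derivative of 0.5 * (a - y)^2 with respect to a
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz    # dz/dw = x, dz/db = 1

def train(w, b, x, y, lr=0.5, steps=200):
    """Plain gradient descent on a single (x, y) example."""
    for _ in range(steps):
        dw, db = grads(w, b, x, y)
        w -= lr * dw
        b -= lr * db
    return w, b
```

Comparing `grads` against a finite-difference estimate of `loss` is a quick way to confirm that the chain-rule product is assembled in the right order.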

Stage 2

Transformers

Self-Attention mechanics, query/key/value dot products, and multi-head splits.

Studied Materials (2)

Attention Is All You Need (Original Paper)

The seminal 2017 paper by Vaswani et al. replacing recurrence (RNNs/LSTMs) entirely with self-attention.

Most Valuable Insight: "By discarding sequence-step recurrence, MHA enables full parallelization during training, though at the cost of quadratic sequence memory scaling."

Personal Reflection

Reading the original paper forced me to look at the raw tensor dimension calculations. Dropping recurrence removes the sequential bottleneck during training, but it introduces the context-length memory walls that we struggle to optimize in production inference.

Concepts Gained

Scaled Dot-Product Attention, Multi-Head Attention (MHA), Linear dimensional projections, Sinusoidal Positional Encoding
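To make the quadratic cost concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, written with plain Python lists. The function names are illustrative, and real implementations use batched tensor operations instead of nested loops.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v, all as lists of lists."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # The score matrix is n x m. For self-attention n == m, so memory
    # grows with the square of the sequence length.
    scores = [[scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K] for q in Q]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
           for w_row in weights]
    return out, weights
```

Each output row is a weighted average of the rows of `V`, with the weights set by how strongly that query matches each key.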

Jay Alammar – The Illustrated Transformer

A visual structural analysis of the transformer block, detailing embedding mappings and decoder/encoder projections.

Most Valuable Insight: "Self-attention acts as a soft key-value database retrieval mechanism where token vectors exchange contextual queries."

Personal Reflection

Alammar's diagrams clarified the dimensional transformations that happen within the Multi-Head Attention layer. It helped me visualize how head splits allow the model to attend to different representation subspaces concurrently.

Concepts Gained

Self-Attention projection, Multi-Head splits, Positional Encoding, Residual additions & normalization
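The head split is mostly bookkeeping over the feature dimension. The sketch below assumes the learned linear projections have already been applied and shows only how a seq_len x d_model matrix is cut into per-head slices and stitched back together; `split_heads` and `merge_heads` are hypothetical helper names. In the original paper, d_model = 512 with 8 heads gives 64 dimensions per head.

```python
def split_heads(x, num_heads):
    """Reshape seq_len x d_model into num_heads x seq_len x d_head."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate num_heads x seq_len x d_head back into seq_len x d_model."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]
```

Each head then runs attention on its own slice, which is what lets different heads attend to different representation subspaces at the same time.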

Stage 3

Tokenization

Byte-Pair Encoding (BPE) vocab construction and byte fallback strategies.

Associated Codebases (1)

Karpathy/minbpe

Upcoming

Purpose: A clean, educational Python library implementing BPE tokenization.

Why I'm studying it: To study the exact token-merge iterations, regex splitting, and byte-level fallback configurations.

Key Concepts

BPE Vocabulary Merges, Regex Splitting Patterns, Byte Fallback Mapping
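The merge loop at the heart of byte-level BPE fits in a few lines. This is a minimal sketch under my own naming (`pair_counts`, `merge_pair`, `train_bpe`), not a copy of the minbpe code, and it leaves out regex pre-splitting and special tokens.

```python
from collections import Counter

def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes (ids 0..255) and learn num_merges merges."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)   # most frequent adjacent pair
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges
```

Because the base vocabulary is all 256 byte values, any input string can be encoded even before a single merge is learned, which is the idea behind byte-level fallback.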

Stage 4

GPT Architecture

Decoder-only models, causal masking, and LayerNorm placement.

Studied Materials (1)

Andrej Karpathy – Let's Build GPT from Scratch

A practical, line-by-line coding implementation of a character-level decoder-only transformer in PyTorch.

Most Valuable Insight: "Correct parameter weight initializations and Pre-LN routing are critical for training stability. Without them, gradients collapse silently."

Personal Reflection

Coding along with Karpathy bridged the gap between equations and floating-point computations. Correcting loss calculation details showed me how model errors remain hidden from standard compiler checks.

Concepts Gained

Decoder block configuration, Causal masking matrix, LayerNorm (Pre-LN), Weight initializations, Regularization (Dropout, Weight Decay)
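Two of these concepts are easy to show in isolation: the causal mask and the Pre-LN residual pattern. The sketch below uses plain Python lists and hypothetical function names. The layer norm omits the learned gain and bias, so it is a simplification of what a PyTorch implementation would do.

```python
import math

def causal_mask(n):
    """n x n mask: True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Set disallowed scores to -inf so they get zero attention weight."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)                       # finite: the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """Pre-LN: normalize, apply the sublayer, then add the untouched input."""
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]
```

With Pre-LN the residual path carries `x` through unchanged, so the gradient always has a direct route back to earlier layers.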

Associated Codebases (3)

Karpathy/nanoGPT

Exploring

Purpose: The cleanest, fastest repository for training medium-scale decoder-only transformers.

Why I'm studying it: To study the training loop structure, learning rate cosine schedules, and multi-GPU DDP abstractions.

Key Concepts

Cosine Warmup Schedulers, Weight Initialization, PyTorch DDP Abstractions
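A cosine schedule with linear warmup can be written as a pure function of the step count. This is a minimal sketch of the general technique; the parameter names and the values used below are illustrative assumptions, not nanoGPT's actual configuration.

```python
import math

def lr_at(step, max_lr, min_lr, warmup_steps, decay_steps):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup_steps
    if step >= decay_steps:
        # After the decay window, hold at the floor.
        return min_lr
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
```

For example, with `max_lr=1.0`, `min_lr=0.1`, `warmup_steps=10`, and `decay_steps=100`, the rate climbs to 1.0 by step 9, passes through roughly 0.55 at the midpoint of the decay, and settles at 0.1 from step 100 onward.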

Karpathy/llm.c

Future

Purpose: LLM training written in raw C/CUDA without heavy PyTorch dependencies.

Why I'm studying it: To understand the low-level systems engineering of backprop math running directly on GPU registers.

Key Concepts

Custom CUDA Kernels, Memory-Mapped Files, Hardware-Level Backpropagation

ggml-org/llama.cpp

Future

Purpose: Inference engine for LLaMA models in pure C/C++.

Why I'm studying it: To study quantization (INT8/INT4), KV caching optimizations, and thread pool orchestration on local processors.

Key Concepts

Model Quantization, Inference CPU Abstractions, KV Cache Orchestration
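The basic idea behind INT8 quantization can be sketched without any of the engine's machinery. The example below assumes simple symmetric absmax quantization with one scale per block of values. It is not llama.cpp's on-disk format, which uses its own block layouts, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map a block of floats to int8 codes in [-127, 127] plus one float scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0   # avoid dividing by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [qi * scale for qi in q]
```

The round trip loses at most half a quantization step per value, so blocks with one large outlier pay for it with a coarser scale on every other value in the block.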

Stage 7

Model Context Protocol (MCP)

Standardizing tool integrations via JSON-RPC messages over stdio/SSE transports.

Associated Codebases (1)

modelcontextprotocol/servers

Future

Purpose: Exemplar MCP servers implementing tools and resources.

Why I'm studying it: To explore stdio/SSE transport boundary checks and tool exposure designs.

Key Concepts

JSON-RPC Schemas, Security Sandboxing, Protocol Transport Layouts
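At the wire level, the protocol is built from JSON-RPC 2.0 messages. The sketch below builds and parses such messages with the standard library only. The newline-delimited framing and the `tools/call` method reflect my current understanding of the stdio transport and should be treated as assumptions to verify against the specification. The `echo` tool named in the usage note is hypothetical.

```python
import json

def make_request(request_id, method, params=None):
    """Build a JSON-RPC 2.0 request object."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        msg["params"] = params
    return msg

def make_error(request_id, code, message):
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}

def encode_line(msg):
    """Serialize one message as a single newline-terminated line."""
    return json.dumps(msg, separators=(",", ":")) + "\n"

def decode_line(line):
    """Parse one line and reject anything that is not JSON-RPC 2.0."""
    msg = json.loads(line)
    if not isinstance(msg, dict) or msg.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    return msg
```

A request such as `make_request(1, "tools/call", {"name": "echo", "arguments": {"text": "hi"}})` would be written to the server's stdin as one line. An unknown method would be answered with `make_error(1, -32601, "Method not found")`, using the standard JSON-RPC error code.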
