Complete Learning Path

How AI Really Works

From biological neurons to the Transformer — a ground-up explanation for people who can code but want to understand the why.

CHAPTER 01

What is a Neural Network, Really?

Starting from biology, then turning it into math

The biological inspiration

Your brain has ~86 billion neurons. Each neuron receives signals from thousands of other neurons, does a tiny calculation, and decides: "should I fire or not?". If it fires, it passes a signal forward. That's it.

In 1943, McCulloch & Pitts said: what if we model this in math?

Think of a single neuron like a committee vote. 5 people are voting. Some votes count more than others (weights). If the total weighted votes cross a threshold, the motion passes (the neuron fires). If not, it stays silent.

The artificial neuron — one unit

An artificial neuron takes several inputs, multiplies each by a weight, sums them all up, adds a bias, then passes through an activation function.

Single neuron output:
output = activation( (x₁×w₁) + (x₂×w₂) + (x₃×w₃) + bias )
PYTHON — ONE NEURON FROM SCRATCH
import numpy as np

# Three inputs (e.g., pixel values, sensor readings, etc.)
inputs = [0.8, 0.3, 0.6]

# Weights — how much does this neuron care about each input?
# These start random. Training will LEARN the right values.
weights = [0.4, -0.2, 0.9]

bias = 0.1  # A constant offset (shifts the decision boundary)

# Step 1: weighted sum
z = np.dot(inputs, weights) + bias
# = (0.8×0.4) + (0.3×-0.2) + (0.6×0.9) + 0.1 = 0.98

# Step 2: activation function (sigmoid squeezes output to 0..1)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

output = sigmoid(z)  # → 0.727
# Closer to 1 = "neuron fired", closer to 0 = "neuron silent"

Why do we need layers?

One neuron can only draw one straight line to separate data. Real problems are not linearly separable. You cannot classify cats vs dogs with one line.

Stack neurons into layers, and each layer learns increasingly abstract features:

Input Layer
28×28 = 784 raw pixels
Hidden Layer 1
128 neurons · detects edges
Hidden Layer 2
64 neurons · combines edges into shapes
Output Layer
10 classes · "this is a 7"

A neural network is just many neurons stacked in layers. The weights are the only thing that gets trained. Everything else — the architecture, the layers, the activation functions — you design. The training process finds the right weights automatically.

Activation functions — why they matter

Without an activation function, stacking layers is useless — you'd just get one big linear equation. Activations add non-linearity, which is what lets networks learn curves, shapes, and complex patterns.

Activation Range / Formula Used for
Sigmoid output ∈ (0, 1) Binary classification outputs
Tanh output ∈ (−1, 1) Centered, better for hidden layers
ReLU max(0, x) Modern default — fast, simple, works great
Softmax outputs sum to 1 Multi-class output (probabilities)
CHAPTER 02

How a Network Actually Learns

Backpropagation, loss functions, and gradient descent — demystified

The core loop: guess → measure error → adjust

Learning is just a repetitive loop. The network makes a guess, you measure how wrong it was, and you nudge every weight slightly in the direction that makes it less wrong. Repeat millions of times.

Forward Pass

Input flows through every layer, layer by layer. Each neuron computes its weighted sum + activation. You get a prediction at the end.

Compute Loss

Compare the prediction to the real answer using a loss function. This gives you a single number: "how wrong are we right now?"

Backward Pass (Backpropagation)

Calculate how much each weight contributed to the error. This uses calculus (chain rule) to propagate error signals backward through the network.

Gradient Descent — Update Weights

Nudge each weight in the direction that reduces the loss. The size of the nudge is controlled by the learning rate.

Loss functions

A loss function is like your exam score — but inverted. Instead of maximizing your score, you're minimizing your mistakes. The loss is 0 if you're perfect, and grows the more wrong you are. The network's only goal during training is to make this number as small as possible.

PYTHON — COMMON LOSS FUNCTIONS
import numpy as np

# === Mean Squared Error (regression tasks) ===
def mse_loss(predictions, targets):
    return np.mean((predictions - targets) ** 2)

# === Cross-Entropy Loss (classification tasks) ===
# "How surprised was the model?" — penalizes confident wrong answers hard
def cross_entropy(pred_prob, true_class):
    return -np.log(pred_prob[true_class])

# Example: model predicted [0.1, 0.8, 0.1], correct class is 1
# Loss = -log(0.8) ≈ 0.22  (low — model was right and confident)

# Example: model predicted [0.1, 0.1, 0.8], correct class is 1
# Loss = -log(0.1) ≈ 2.3  (high — model was confidently wrong)

Gradient Descent — finding the bottom of the hill

Imagine you're blindfolded on a hilly landscape. Your goal is to reach the lowest valley. You can't see — but you can feel which direction is downhill under your feet. Gradient descent is: take one small step in the downhill direction, repeat. The "landscape" is the loss function across all possible weight values.

Weight update rule:
w = w − learning_rate × (∂Loss/∂w)
PYTHON — GRADIENT DESCENT (conceptual)
# learning_rate controls step size
# too large: you overshoot and bounce around
# too small: training takes forever
learning_rate = 0.01

for epoch in range(1000):
    # Forward pass — get predictions
    predictions = forward(X, weights)

    # Compute how wrong we are
    loss = mse_loss(predictions, y_true)

    # Backward pass — compute gradient (how loss changes per weight)
    gradients = backward(loss, weights)

    # Update: move weights in the opposite direction of gradient
    weights = weights - learning_rate * gradients

What backpropagation actually is

Backpropagation is just the chain rule from calculus, applied systematically across all layers. It answers: "if I change this weight by a tiny amount, how much does the total loss change?"

In PyTorch/TensorFlow, calling loss.backward() does all of this automatically. You never write backprop by hand — but knowing it exists is crucial to debugging.

Nothing magic is happening. Training a neural network is just: (1) guess, (2) measure error, (3) calculate which direction each weight should move to reduce error, (4) move them a tiny bit, (5) repeat on millions of examples. The "intelligence" that emerges is a statistical pattern captured in billions of weight values.

CHAPTER 03

How Sequences Were Handled Before 2017

RNNs, LSTMs, and why they were the best we had — and still not good enough

The sequence problem

Language, audio, and time-series data are sequences. The order matters. "The dog bit the man" and "The man bit the dog" have the same words but completely different meanings.

A regular neural network (feedforward) takes a fixed-size input and produces a fixed-size output. It has no concept of order or time. So how do you handle sequences?

Recurrent Neural Networks (RNNs)

The idea: process one word at a time, and keep a "hidden state" — a memory vector that carries information forward as you read each word.

Imagine you're reading a book, but you can only look at one word at a time, and you have a small sticky note to jot down what you remember so far. Each new word, you update the note. The note is the hidden state. When you've read all words, the final note is your "understanding" of the sentence.

"The" "cat" "sat" "on" "the" "mat" h₀ h₁ h₂ h₃ h₄ h₅ Each hₙ = memory after reading word n. Final h₅ should encode the whole sentence. Problem: by h₅, how much of h₀ ("The") still survives in memory?
RNN unrolled across time — information must flow sequentially through every step
PYTHON — SIMPLE RNN CELL (what happens at each word)
import numpy as np

def rnn_cell(x_t, h_prev, W_x, W_h, b):
    """
    x_t    : current word's vector representation
    h_prev : hidden state from previous step (memory)
    W_x    : weights for input
    W_h    : weights for previous hidden state
    b      : bias
    """
    # Combine current input + previous memory
    z = W_x @ x_t + W_h @ h_prev + b
    
    # Squash through tanh to keep values bounded
    h_new = np.tanh(z)
    
    return h_new

# Process the sentence "The cat sat"
h = np.zeros(128)  # initial memory: all zeros

for word_vector in sentence_embeddings:
    h = rnn_cell(word_vector, h, W_x, W_h, b)
    # h now "remembers" everything up to this word

The vanishing gradient problem

Here's the fundamental issue with RNNs: when you backpropagate through time (through 50+ steps), the gradients get multiplied together at each step. If each multiplication makes them slightly smaller... after 50 steps they're essentially zero.

The RNN effectively forgets. By the time it reads word 50, the gradient signal from word 1 is so tiny that the weights at step 1 never actually get updated. The network can't learn long-range dependencies. "The animal didn't cross the street because it was too tired" — what does "it" refer to? A human knows it's the animal (word 2). An RNN often doesn't.

LSTMs — the engineering fix

Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) added explicit memory management. Instead of one hidden state, an LSTM has a cell state (long-term memory) and three gates that control what to remember, forget, and output.

Forget gate

"How much of the old memory do I keep?" — outputs 0 (forget all) to 1 (keep all) per dimension.

Input gate

"How much of the new information do I add to memory?" — controls writes into the cell state.

Output gate

"What part of memory do I expose right now?" — controls what becomes the new hidden state passed forward.

The cell state (long-term memory) flows through with minimal modification — so gradients can now travel further back without vanishing.

LSTMs were dominant from ~2013–2017 and powered Google Translate, Siri, and many production NLP systems. But they were still fundamentally sequential — you had to process word 1 before word 2. You couldn't parallelize.

The Encoder-Decoder architecture

For translation (English → French), the approach was: use one LSTM to encode the entire input sentence into a fixed vector, then use another LSTM to decode that vector into the output sentence.

"How are you?"
English input
LSTM Encoder
reads sentence word-by-word
Context Vector
one fixed-size vector ⚠
LSTM Decoder
generates word-by-word
"Comment allez-vous?"
French output

The bottleneck: the entire input sentence must fit into one fixed-size context vector. For long sentences, this vector simply can't hold everything — and information from early words gets squeezed out.

In 2015, Bahdanau et al. introduced attention as an add-on to fix this bottleneck. The decoder could now "look back" at all encoder hidden states, not just the last one. This was the precursor to everything.

CHAPTER 04

What Was Actually Broken

The exact problems that "Attention Is All You Need" was solving

Three fundamental problems with RNNs/LSTMs

Problem With RNN/LSTM What we needed
Sequential dependency Must process word 1, then 2, then 3... Cannot parallelize. Training a 512-word sentence on GPU = GPU waiting for 512 serial steps. Process all words at once, in parallel. GPUs are good at that.
Long-range dependencies Even with LSTM, remembering something from 100 words ago is unreliable. Information degrades with distance. Any word should be able to directly attend to any other word, regardless of distance.
Fixed bottleneck Encoder-decoder compresses entire input into one vector. Long documents lose information. Keep all token representations available throughout the whole process.

The word embedding problem — inputs as vectors

Before understanding Attention, you need to know how words become numbers. Neural networks only understand numbers.

Think of word embeddings like GPS coordinates for meaning. Words that mean similar things are "near" each other in this mathematical space. "King" and "Queen" are close. "King" and "Pizza" are far apart. The classic example: King − Man + Woman ≈ Queen. The vector arithmetic captures semantic relationships.

PYTHON — WORD EMBEDDINGS CONCEPT
# Each word maps to a dense vector of floats (e.g., 512 dimensions)
# These vectors are LEARNED during training, not hand-crafted

from torch import nn

# vocab_size: how many unique tokens in our vocabulary (e.g. 50,000)
# embed_dim: how many numbers represent each word (e.g. 512)
embedding = nn.Embedding(vocab_size=50000, embedding_dim=512)

# Word ID → vector
word_id = 342   # e.g., token for "cat"
vector = embedding(word_id)
# vector is now shape [512] — a point in 512D space

# A sentence "The cat sat" becomes a matrix:
# shape = [3 words × 512 dimensions]
# This matrix is the input to the Transformer

The core insight that changed everything

What if instead of passing a hidden state left-to-right through a sequence, we let every word look at every other word directly — and decide how much to "pay attention" to each one?

This means the word "it" in "The animal didn't cross because it was tired" can directly look at "animal" and compute a high attention score — without needing to relay information through 4 other words.

The key shift: stop thinking about sequences as chains you process one link at a time. Start thinking about sets of tokens that all interact with each other simultaneously. This is what enables parallel computation and removes the distance limitation.

CHAPTER 05

The Attention Mechanism

The single idea that made modern AI possible

The query-key-value intuition

Attention is built around three vectors per token: Query (Q), Key (K), and Value (V). Each is computed from the token's embedding using learned weight matrices.

Think of it like a search engine inside the network. Every word sends out a Query (what am I looking for?). Every word also broadcasts a Key (what do I contain?). When a query matches a key well, the corresponding Value (the actual information) is retrieved and passed forward. The word "it" queries "who am I referring to?", finds "animal" has a matching key, and retrieves animal's value information.

Compute Q, K, V for every token

Multiply each token embedding by three learned weight matrices: W_Q (what am I looking for?), W_K (what do I offer?), W_V (what information do I carry?).

Score: Q · Kᵀ

Dot product between every Query and every Key produces an N×N matrix of similarity scores — "how relevant is token j to token i?"

Scale by √d_k

Without scaling, large dot products push softmax into saturation regions where gradients vanish. Dividing by √d_k keeps things stable.

Softmax → attention weights

Convert each row to a probability distribution that sums to 1. This is "what % of attention does token i pay to each other token?"

Multiply by V

Take a weighted blend of all Value vectors using those attention weights. Each token's output is now a context-aware mix of every other token.

The attention formula:
Attention(Q, K, V) = softmax( Q × Kᵀ / √d_k ) × V

In Python — from scratch

PYTHON — ATTENTION MECHANISM (numpy)
import numpy as np

def attention(Q, K, V):
    """
    Q: [seq_len, d_k]  — queries (what each token looks for)
    K: [seq_len, d_k]  — keys   (what each token offers)
    V: [seq_len, d_v]  — values (what each token contains)
    """
    d_k = Q.shape[-1]

    # How much does each query match each key?
    # scores[i][j] = "how much should token i attend to token j?"
    scores = Q @ K.T / np.sqrt(d_k)
    # shape: [seq_len, seq_len]

    # Convert to probabilities (each row sums to 1)
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    attn_weights = softmax(scores)
    # Each row = "what % of attention goes to each other token?"

    # Weighted blend of value vectors
    output = attn_weights @ V
    # output[i] = a mix of all tokens' values, weighted by relevance

    return output, attn_weights

Critical insight: Every token attends to every other token in one matrix multiplication. Sequence of length N requires one N×N matrix multiply — all done in parallel on GPU. Compare to RNN's N sequential steps. This is why Transformers train so much faster.

Try it: which word does "it" attend to?

Click any token to see which other tokens it would most strongly attend to in this sentence. (Scores are illustrative — but they reflect what real attention heads actually learn.)

The0
animal0
didn't0
cross0
the0
street0
because0
it0
was0
tired0
↑ click any word ↑

Multi-Head Attention

One attention operation can only capture one type of relationship. Multi-head attention runs the attention mechanism multiple times in parallel (e.g., 8 or 16 heads), each with its own learned W_Q, W_K, W_V matrices.

It's like having 8 different analysts all reading the same document simultaneously. Head 1 might focus on subject-verb relationships. Head 2 might focus on pronoun coreference. Head 3 might focus on syntactic dependencies. Their outputs are concatenated and projected into a single vector. More perspectives = richer understanding.

Input tokens Head 1subj↔verb Head 2coref Head 3syntax Head 4position Head 5semantics Head 6rare↔common Head 7next-token Head 8long-range Concatenate all 8 outputs Linear projection W_O
Multi-Head Attention — labels are illustrative; real heads learn whatever helps reduce loss

"Do we tell each head what to look for?"

Short answer: no. Each head's W_Q, W_K, W_V matrices start as small random numbers. Nothing in the architecture says "head 3 = pronouns." During training, the only signal the model gets is the next-token loss, propagated back through every weight via backpropagation.

So why do heads specialize? Because the architecture forces them to:

The output is a concatenation of all heads

If two heads compute the exact same thing, one is redundant — it doesn't reduce loss, but it costs parameters. Gradient descent quickly pushes them apart.

Each head has a smaller dimension

If d_model=512 and there are 8 heads, each head only operates on 64 dimensions. No single head has the capacity to model everything — they have to divide the labor.

Loss rewards diverse, useful patterns

Heads that capture genuinely different relationships (one tracks syntax, another tracks semantics, another tracks long-range dependencies) reduce loss the most. So that's what they drift toward.

This is one of the most beautiful things about deep learning: specialization is emergent, not designed. Researchers later probe trained models (e.g., the famous "What does BERT look at?" paper) and discover heads that track subject-verb agreement, coreference, syntactic dependencies — none of which were programmed in. The architecture creates pressure for diversity, and the data does the rest.

CHAPTER 06

The Full Transformer Architecture

"Attention Is All You Need" — what the paper actually built

The original paper — "Attention Is All You Need"

The 2017 Google paper (Vaswani et al., arXiv:1706.03762) stacked attention layers into a full architecture. No recurrence. No convolutions. Just attention and feed-forward layers.

ENCODER processes input sentence DECODER generates output one token at a time N × 6 N × 6 Input Embeddings + Positional Encoding Multi-Head Self-Attention Add & LayerNorm Feed-Forward Network Add & LayerNorm ↓ encoder output (passed to every decoder layer) K, V Output Embeddings (shifted) + Positional Encoding Masked Multi-Head Self-Attention Add & LayerNorm Cross-Attention (Q from decoder, K/V from encoder) Add & LayerNorm Feed-Forward Network Add & LayerNorm Linear projection Softmax → next-token probs Output: "Comment allez-vous?" Input: "How are you?"

Walking through every stage

Here's exactly what happens to a sentence as it flows through the architecture above, top to bottom:

Tokenization & Input Embeddings

"How are you?" is split into tokens, each token ID is looked up in an embedding table, producing a vector of size d_model (e.g., 512). Output: a matrix of shape [seq_len × 512].

Positional Encoding

Attention has no built-in notion of order. A sinusoidal positional vector is added to each token embedding so the model knows token 1 is different from token 5.

Encoder · Multi-Head Self-Attention

Every input token attends to every other input token in parallel. Each token's representation becomes a weighted blend of all input tokens — context-aware, bidirectional.

Add & LayerNorm (after attention)

The original input is added back to the attention output (residual connection — prevents vanishing gradients in deep stacks). Then LayerNorm normalizes each token vector independently for training stability.

Encoder · Feed-Forward Network

Two linear layers with ReLU in between, applied independently to each token position. This is where most of the model's parameters live, and where factual knowledge is stored as a key-value memory.

Add & LayerNorm (after FFN) — repeat × 6

Another residual + LayerNorm. This whole block (attention → norm → FFN → norm) is stacked 6 times. The output is a deeply contextualized representation of the input sentence.

Decoder · Masked Self-Attention

Same as encoder self-attention, but with a causal mask: token t can only attend to tokens 1…t. This prevents cheating — when generating word 5, the model can't peek at words 6+.

Decoder · Cross-Attention

The crucial bridge between input and output. Q comes from the decoder (the partially generated output), K and V come from the encoder's final output. This is how the decoder "looks at" the source sentence while generating each target word.

Decoder · Feed-Forward Network — repeat × 6

Same FFN as encoder, applied per position. Stack of 6 decoder layers total.

Linear + Softmax → next-token probabilities

The final decoder output is projected into vocabulary space (size ~50,000) and softmax produces a probability distribution. The highest-probability token is emitted, fed back into the decoder, and the process repeats until an end-of-sequence token is generated.

The same building block is reused everywhere: Attention → Add&Norm → FFN → Add&Norm. The encoder uses self-attention, the decoder adds masking and cross-attention. That's the entire architecture. Modern GPT models drop the encoder entirely (decoder-only), and BERT drops the decoder (encoder-only) — both are subsets of this original design.

Positional Encoding — solving "no order" problem

Attention has no concept of order. If you shuffle the input words, you get the same attention scores. So the paper adds positional encodings — vectors that encode each token's position in the sequence — and add them to the embeddings.

PYTHON — POSITIONAL ENCODING
import torch
import math

def positional_encoding(seq_len, d_model):
    """
    seq_len: number of tokens in the sequence
    d_model: embedding dimension (e.g., 512)
    Returns positional vectors to ADD to token embeddings
    """
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)

    # Use sine/cosine waves of different frequencies
    # Each position gets a unique "fingerprint" vector
    div_term = torch.exp(torch.arange(0, d_model, 2) *
                       (-math.log(10000.0) / d_model))

    PE[:, 0::2] = torch.sin(position * div_term)
    PE[:, 1::2] = torch.cos(position * div_term)

    return PE

# Add to your embeddings before feeding to Transformer
x = token_embeddings + positional_encoding(seq_len, d_model)

Layer Norm and Feed-Forward — the other components

After each attention block, two things happen:

Add & LayerNorm: The input is added back to the output (residual connection), then normalized. Residuals prevent vanishing gradients in deep networks and enable extremely deep architectures.

Feed-Forward Network (FFN): Two linear layers with ReLU in between. Applied independently to each token position. This is where the network stores and retrieves factual knowledge — each FFN layer acts like a key-value memory store.

PYTHON — ONE TRANSFORMER BLOCK
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Multi-head self-attention
        self.attention = nn.MultiheadAttention(d_model, n_heads)
        # Layer normalizations
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward network (applied to each position independently)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Self-attention + residual connection + normalization
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)   # "Add & Norm"

        # Feed-forward + residual + normalization
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)   # "Add & Norm"

        return x

GPT vs BERT — two ways to use the Transformer

Aspect BERT (encoder only) GPT (decoder only)
Uses Encoder stack only Decoder stack only
Attention type Bidirectional (sees all tokens) Causal/masked (only sees past tokens)
Training task Mask random words, predict them Predict next token given previous
Good for Classification, Q&A, embeddings Generation, completion, chat
Examples BERT, RoBERTa, your Qdrant embeddings GPT-4, Claude, Llama, Mistral
CHAPTER 07

How LLMs Are Actually Trained

Pre-training, fine-tuning, RLHF — the full pipeline

Stage 1: Pre-training — next token prediction

The base LLM is trained on a massive corpus (trillions of tokens from the internet, books, code). The task is brutally simple: given all previous tokens, predict the next token.

Imagine an exam where the only question is: "given this half-sentence, what word comes next?" But the exam has 1 trillion questions. To answer them well, the model must implicitly learn grammar, facts, reasoning, coding patterns, world knowledge — everything. The single task of "predict the next word" compresses an extraordinary amount of knowledge into the weights.

PYTHON — TRAINING LOOP CONCEPT (GPT-style)
for batch in training_data:
    # batch = tokenized text like: [42, 8931, 102, 5531, 99, ...]

    # Input: all tokens except the last
    # Target: all tokens except the first (shifted by 1)
    input_ids  = batch[:-1]  # "The cat sat on the"
    target_ids = batch[1:]   # "cat sat on the mat"

    # Model predicts probability distribution over vocab for each position
    logits = model(input_ids)
    # logits shape: [seq_len, vocab_size]  e.g., [512, 50000]

    # Cross-entropy loss: how surprised was the model by the real next word?
    loss = cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1))

    loss.backward()   # backprop through ALL layers of the Transformer
    optimizer.step()  # update all weights (billions of them)

GPT-3 was trained on ~300 billion tokens. The model has 175 billion parameters (weights). Training cost ~$4.6 million in compute. Llama 3 70B was trained on 15 trillion tokens. This is why you don't train from scratch — you fine-tune.

Stage 2: Supervised Fine-Tuning (SFT)

A base model is just a text completion engine. It will complete "How do I make a bomb?" by continuing in the style of whatever it was trained on. To make it a helpful assistant, you fine-tune on instruction-response pairs.

JSON — TYPICAL SFT EXAMPLE
{
  "instruction": "Summarize this text in 3 bullet points",
  "input":       "... long article ...",
  "output":      "• Key point 1\n• Key point 2\n• Key point 3"
}

Thousands-to-millions of such examples teach the model to produce outputs that match this instruction-response format. The result: a model that follows instructions instead of just continuing text.

Stage 3: RLHF — making it actually good

SFT gets you an instruction-following model. RLHF (Reinforcement Learning from Human Feedback) makes it aligned — honest, helpful, harmless.

Collect human preference data

Show humans two model responses to the same prompt. They pick which one is better. Collect millions of these comparisons.

Train a Reward Model

Train a separate model to predict which response a human would prefer. This model gives a "quality score" to any response.

PPO — Reinforcement Learning

Use the reward model as the "environment". Optimize the LLM using PPO to maximize the reward score while staying close to the SFT model (to prevent it going off the rails).

Think of RLHF like training a new employee. Pre-training = they went to school and read everything. SFT = you give them a job manual. RLHF = you watch them work, give feedback on what they do well and poorly, and they adjust their behavior accordingly.

Tokenization — what the model actually sees

Models don't see words. They see tokens — sub-word chunks. "unhappiness" → ["un", "happi", "ness"]. This lets the model handle words it has never seen.

PYTHON — TOKENIZATION
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

text = "The telecom network had 5G latency issues"
tokens = tokenizer.encode(text)
# → [450, 14584, 3904, 2400, 29871, 29945, 29954, 21162, 5626]

# Decode to see the actual chunks:
decoded = [tokenizer.decode([t]) for t in tokens]
# → ['The', ' tele', 'com', ' network', ' had', ' 5', 'G', ' lat', 'ency', ' issues']

# This is why domain-specific terms fragment strangely —
# "nrcelldu" would tokenize into many small pieces
# This is the exact reason your schema synonym enrichment matters!
CHAPTER 08

From Text Input to Generated Response

The full modern stack — generation, RAG, and how your rag-sql fits in

Inference: how a model generates text

During generation, the model outputs one token at a time. Each token becomes part of the input for the next step (autoregressive generation). The model predicts a probability distribution over the entire vocabulary, and you sample from it.

Step 1

Input ["SELECT", "*", "FROM"] → model produces P(next token). Highest prob: "customers".

Step 2

Input ["SELECT", "*", "FROM", "customers"] → model produces P(next). Highest prob: "WHERE".

Step 3 … N

Repeat — append predicted token, run again — until the model emits the <EOS> end-of-sequence token.

Temperature Behavior Best for
0 Always pick the single most likely token (deterministic) SQL generation, structured output, code
0.7 Sample from distribution with mild flattening Default for chat — balanced
1.0+ More random, more creative, less coherent Brainstorming, creative writing
PYTHON — GENERATION WITH TEMPERATURE
import torch
import torch.nn.functional as F

def generate(model, prompt_tokens, max_new_tokens=100, temperature=0.7):
    tokens = prompt_tokens.copy()

    for _ in range(max_new_tokens):
        # Forward pass: get logits for ALL positions
        logits = model(tokens)

        # Take only the LAST position's logits (the next token prediction)
        next_logits = logits[-1] / temperature

        # Convert to probabilities
        probs = F.softmax(next_logits, dim=-1)

        # Sample (don't just take argmax — that's too rigid)
        next_token = torch.multinomial(probs, 1).item()

        if next_token == EOS_TOKEN:
            break

        tokens.append(next_token)

    return tokens

What to explore next

You now have the foundation. The next chapter covers everything that has happened since the Transformer — and the modern stack you'll actually build on.

For your rag-sql work specifically
Quantization — why Llama runs on CPU (INT4/INT8 weight compression)
KV cache — why long prompts are slow, what Ollama reuses
Fine-tuning vs RAG — when to add knowledge vs inject it
LoRA / QLoRA — efficient fine-tuning, only train small adapter matrices
For deeper understanding
Andrej Karpathy — "Let's build GPT from scratch" (YouTube, 2hr)
Sebastian Raschka — "Build a Large Language Model from Scratch" (book)
Annotated Transformer — Harvard NLP blog, line-by-line walkthrough
CHAPTER 09

The Modern Stack — Everything After the Transformer

RAG, quantization, agents, reasoning models, MCP — what's actually used in production through 2026

The chronological map

The Transformer (2017) was the architectural breakthrough. Everything since has been about scaling it up, making it cheaper, connecting it to tools, and letting it think and act. This timeline only includes things that are actually used in production today — no hype, no toys.

2018
Pre-train + Fine-tune era (BERT, GPT-1)One huge pre-training run, then cheap task-specific fine-tuning. The "transfer learning" template.
2020
GPT-3 + scaling laws (Kaplan, Hoffmann)175B parameters. Few-shot learning emerges from scale alone. Chinchilla (2022) later refines the compute-optimal ratio.
2020
RAG (Lewis et al.)Pair an LLM with a vector database. Inject fresh facts at inference time instead of retraining.
2021
LoRA (Microsoft)Fine-tune billion-parameter models by training only tiny low-rank adapter matrices. Democratized fine-tuning.
Jan 2022
Chain-of-Thought (Wei et al.)"Let's think step by step" — eliciting reasoning by prompting alone. Foundation for later reasoning models.
Mar 2022
InstructGPT — RLHFSFT + reward model + PPO turns a base model into a helpful assistant. ChatGPT ships in November.
May 2022
FlashAttention (Dao et al.)IO-aware attention kernel. 2–4× faster training, lower memory. The reason long context windows became feasible.
Feb 2023
Llama + llama.cpp + GGUFMeta releases Llama. Georgi Gerganov ports it to C++ with INT4 quantization. Local AI revolution begins.
Jun 2023
OpenAI function calling + vLLMModels start invoking APIs as structured JSON. vLLM ships PagedAttention — 24× higher throughput than HF Transformers.
Sep 2023
Mistral 7B + Mixtral (MoE)Open-weight model that beats Llama 2 13B. Mixtral 8×7B (Dec) brings sparse Mixture of Experts to open weights.
Nov 2023
GPT-4V + Claude 100K + Gemini 1M contextMultimodal mainstream. Long context becomes table stakes via RoPE scaling, YaRN, ring attention.
Apr 2024
Llama 3 + speculative decoding everywhereMeta's Llama 3 (8B/70B). vLLM, llama.cpp, TGI all ship speculative decoding by default. 2–3× speedup for free.
Sep 2024
OpenAI o1 — test-time computeRL on reasoning traces. Models "think for longer" and dominate math/code. The biggest paradigm shift since RLHF.
Nov 2024
MCP (Model Context Protocol) — AnthropicStandardized wire protocol connecting LLMs to tools, data, prompts. Becomes industry standard through 2025.
Jan 2025
DeepSeek-R1 + GRPOOpen-weight reasoning model matching o1. Introduces GRPO (Group Relative Policy Optimization) + RLVR — RL with programmatic verifiable rewards. The training story of 2025.
Feb 2025
Hybrid architectures go mainstreamMamba-2, Jamba, Falcon-Mamba, Nemotron-H — Transformer + State Space Model hybrids. Linear-time inference for long context.
Mid 2025
Hybrid reasoning + long-horizon agentsClaude 3.7 / 4 "extended thinking" toggle. SWE-bench Verified saturates. Cursor / Claude Code / Codex agents handle multi-hour autonomous tasks.
2025–26
MCP ecosystem maturitySampling, Roots, Elicitation primitives standardized. Hundreds of MCP servers in the wild. Cursor / Claude Desktop / VS Code / Zed all speak MCP natively.
2026
The current frontierReasoning + agents + tools fused into single deployable artifacts. Inference-time compute as a knob you turn. Local 30B-class models on consumer GPUs via 4-bit quant + speculative decoding.

Concept 1 · RAG (Retrieval-Augmented Generation)

An LLM's knowledge is frozen at training time. RAG fixes this without retraining: when a question comes in, retrieve relevant documents from a vector database and stuff them into the prompt. The model now "knows" your private data.

① OFFLINE — INDEXING Documents PDFs · DB · wiki Chunker split into passages Embedding model BERT-style encoder Vector DB Qdrant · Pinecone · pgvector ② AT QUERY TIME User question "What's our refund policy?" Embedding model same as indexing Vector search top-k cosine sim Retrieved passages e.g. top-5 chunks Prompt = question + retrieved passages + system instructions ["Answer using ONLY the context below: …"] LLM → grounded answer with citations
RAG = vector search + prompt augmentation. Knowledge stays outside the model.

Why RAG matters: retraining a 70B model costs millions. Adding a document to a vector DB costs cents. RAG also gives you citations (you can show which chunks were retrieved) and fresh data (last week's docs are queryable today).

Concept 2 · Context, the Context Window, and KV Cache

"Context" is just everything you put into the prompt: system message, conversation history, retrieved RAG chunks, tool outputs. The context window is the hard limit on how many tokens fit.

Model Context window (tokens) Roughly
GPT-3.5 (2022)4K~5 pages
GPT-4 Turbo128K~250 pages
Claude 3.5 Sonnet200K~400 pages
Gemini 1.5 Pro1M – 2M~3,000+ pages, an hour of video
Llama 3.1 8B (local)128K~250 pages

Attention is O(n²) in sequence length — doubling context makes attention 4× slower and uses 4× more memory. The KV cache is what makes inference tractable: during generation, the K and V matrices for previous tokens are cached so you don't recompute attention for them every step. This is why your first token is slow ("prefill") and subsequent tokens are fast ("decode").

Concept 3 · Quantization & Local Models (llama.cpp)

A 70B-parameter model in float16 is 140GB. You can't fit that in consumer RAM. Quantization compresses each weight from 16 bits to 8, 5, 4, or even 2 bits — with surprisingly small quality loss.

Precision Bits/weight Llama-70B size Quality loss
FP16 (full)16140 GBbaseline
INT8870 GBnegligible
Q5_K_M~548 GBvery small
Q4_K_M~440 GBsmall (sweet spot)
Q2_K~2~26 GBnoticeable
HuggingFace
FP16 .safetensors
llama.cpp / convert
quantize → GGUF format
.gguf file
single portable binary
Ollama / LM Studio
runs on CPU + GPU

llama.cpp (Georgi Gerganov, 2023) is the C++ inference engine that started the local-AI revolution. GGUF is its file format. Ollama wraps llama.cpp with a friendly API. When you run ollama pull llama3.1:8b, you're downloading a quantized GGUF and running it through llama.cpp under the hood.

Concept 4 · Mixture of Experts (MoE)

A standard Transformer activates every parameter for every token. MoE replaces the FFN with N experts and a router that picks just 2 of them per token. You get a model with the knowledge capacity of all N experts, but the inference cost of just 2.

Token after attention Router (small linear net) picks top-2 experts per token Expert 1 Expert 2 Expert 3 ✓ Expert 4 Expert 5 Expert 6 ✓ Expert 7 Expert 8 Weighted sum of chosen 2
MoE: 8 experts in storage, only 2 active per token. Mixtral 8×7B = 47B total params, ~13B active.

Examples: Mixtral 8×7B (open), GPT-4 (rumored MoE), DeepSeek V3, Grok-1. The downside: you still need RAM for all experts because the router can pick any of them next token.

Concept 5 · Multimodal Models

A multimodal model accepts more than text — typically images, but also audio and video. The architectural trick is simple: convert every modality into tokens, then feed them into the same Transformer.

Image
224×224 pixels
Vision encoder (ViT)
split into 16×16 patches → tokens
Image tokens
treated like text tokens
Text prompt
"What's in this image?"
Tokenizer
BPE → text tokens
Text tokens
concatenated with image tokens
Single Transformer (decoder)
attends across image + text uniformly
Text output
"A cat on a mat."

Examples: GPT-4o, Claude 3.5, Gemini 1.5, Llama 3.2 Vision, Qwen-VL. The same recipe works for audio (whisper-style tokens) and video (frame-by-frame ViT tokens).

Concept 6 · Tool Use & Function Calling

By itself an LLM can't access the internet, query a database, or run code. Function calling teaches it to emit structured JSON describing what tool it wants to invoke. Your code runs the tool, feeds the result back, and the model continues.

JSON — TOOL DEFINITION + MODEL OUTPUT
// You provide tool definitions in the system prompt:
{
  "name": "get_weather",
  "description": "Get current weather for a city",
  "parameters": { "city": "string" }
}

// User: "What's the weather in Tokyo?"
// Model emits this instead of plain text:
{
  "tool_call": {
    "name": "get_weather",
    "arguments": { "city": "Tokyo" }
  }
}

// Your code runs the function, feeds result back:
{ "tool_result": { "temp": 22, "condition": "clear" } }

// Model now produces the final answer:
"It's 22°C and clear in Tokyo right now."

Concept 7 · MCP — Model Context Protocol

Function calling is per-application. Every team re-invents tool plumbing. MCP (Anthropic, Nov 2024) standardizes it: a client-server protocol where any LLM can talk to any tool / data source / prompt-library as long as both speak MCP. Think "USB for AI" — and by 2026 it has won as the de facto standard, the way LSP won for editors.

MCP exposes four primitive types over JSON-RPC:

Tools — actions the model can invoke Resources — data the model can read Prompts — reusable templates Sampling — server asks model to generate

Plus newer 2025 additions: Roots (which directories/URIs the host exposes to a server) and Elicitation (server asks the user a question mid-flow — "which env should I deploy to?"). These are what turned MCP from "function calling 2.0" into a real two-way agent protocol.

MCP HOST (Cursor, Claude Desktop, …) LLM Claude · GPT · Llama MCP Client discovers + calls tools JSON-RPC over stdio/HTTP MCP Server: GitHub issues, PRs, repos MCP Server: Postgres read-only DB queries MCP Server: Filesystem read/write local files MCP Server: your-rag-sql your custom tool
MCP — one client, many interoperable servers. Cursor uses MCP to expose your tools to its agent.

Before MCP: every product re-built tool integrations from scratch. After MCP: write your tool as an MCP server once, and Cursor, Claude Desktop, VS Code, Zed, and every cloud agent platform can use it. By 2026 there are hundreds of community MCP servers (GitHub, Postgres, Slack, Linear, Sentry, Datadog, Notion, …) and SDKs in TypeScript, Python, Rust, Go, Kotlin, C#. This is exactly the protocol the MCP servers in your Cursor setup are speaking.

Concept 8 · Agents & the Harness

An agent is an LLM in a loop: think → act → observe → think. The model decides what tool to call, your code runs it, the result goes back into the prompt, and the model decides the next step. The loop terminates when the model emits a "done" signal.

THINK LLM reasons about next step ACT call tool / write file OBSERVE tool result back to prompt DONE? if yes → return
The ReAct loop — the foundation under Cursor agents, Claude Code, AutoGPT, etc.

The harness is the surrounding code that makes the loop work: tool registry, prompt assembly, output parsing, retries, safety checks, token-budget management, and the UI. Frontier-class agentic products are 90% harness engineering on top of an off-the-shelf model.

Concept 9 · Mixture of Agents (MoA)

If one agent is good, multiple specialist agents in collaboration can be better. Mixture of Agents (Wang et al., 2024) stacks layers of agents: each layer's outputs become the next layer's inputs.

User query LAYER 1 Llama 70Bdraft 1 Qwen 72Bdraft 2 Mixtraldraft 3 DBRXdraft 4 LAYER 2 Llama 70Brefine using all 4 Qwen 72Brefine using all 4 AGGREGATOR Final synthesizer LLM picks/blends best response
MoA — open-source ensembles can beat GPT-4 on benchmarks like AlpacaEval 2.0

Concept 10 · Test-Time Compute & Reasoning Models

For 7 years the recipe was: spend more compute at training time → get a smarter model at inference. In Sep 2024, OpenAI's o1 flipped the script: spend more compute at inference time (let the model think longer) and you get the same gain. This is the biggest shift since RLHF.

Hard problem
competition math, hard SQL, tricky bug
Internal "thinking" tokens
1K–100K tokens · self-check · backtrack · explore branches
Final answer
concise · usually correct
Model (released) Open / Closed Notable trait
OpenAI o1 / o3 / o4-mini (2024–25)ClosedFirst, set the bar. Hidden CoT.
DeepSeek-R1 (Jan 2025)Open weightsMatches o1, full CoT visible, GRPO recipe public
Qwen QwQ / Qwen3-ThinkingOpen weightsStrong open replication, runs locally
Claude 3.7 / 4 — Extended ThinkingClosedToggleable: same model, with/without thinking
Gemini 2.0/2.5 Flash ThinkingClosedCheap reasoning at scale

The new knob: at inference time you can now choose how much to spend per query. A factual lookup? Skip thinking. A debugging session? Crank thinking budget to 32K tokens. Modern APIs (OpenAI, Anthropic) expose reasoning_effort / thinking_budget parameters. By 2026, this is just a regular dial in your prompt config.

Concept 11 · GRPO & RLVR — how reasoning models are actually trained

RLHF needs humans to label "which answer is better." That doesn't scale to math problems with 10,000-token solutions. RLVR (Reinforcement Learning with Verifiable Rewards) replaces the human reward model with a program: run the unit tests, check the math, parse the SQL — if it passes, reward = 1.

GRPO (Group Relative Policy Optimization, DeepSeek) is the lightweight RL algorithm that made this practical. Instead of training a separate value/critic network like PPO, it samples multiple candidate solutions per prompt and uses their relative rewards as the signal.

Prompt "Solve: 2x + 3 = 11" Policy (LLM) samples N=8 solutions "x = 4" ✓ "x = 5" ✗ "x = 4" ✓ "x = -7" ✗ … (8 total samples) Verifier code · math · regex Rewards: [1, 0, 1, 0, …] advantage = (r − group_mean) / group_std ↑ GRPO update — push up rewarded traces, push down failed traces No human labelers · no separate reward model · no critic network
GRPO + RLVR: scale RL training by replacing humans with verifiers

Why this matters for your work: RLVR works for anything you can verify programmatically — and SQL execution is a perfect verifier. By 2025–26, fine-tuning a small SQL-specialized model with GRPO on "did the query run + return correct rows?" is a real, accessible recipe (TRL, Unsloth, verl all support it). Worth knowing this exists when you outgrow pure prompting.

Concept 12 · Hybrid architectures: Mamba & State Space Models

Pure attention is O(n²). For 1M-token contexts that becomes painful. State Space Models (Mamba, Dec 2023) revisit RNN-style linear-time sequence processing, but with selective gating that — empirically — competes with attention on quality.

By 2025, the winning recipe turned out to be hybrid: a few Transformer layers (for precise recall) interleaved with many Mamba/SSM layers (for cheap long context).

Architecture Compute per token Memory per token Long-context behavior
Pure Transformer O(n) · grows with seq O(n) · KV cache grows Best recall, but expensive
Pure SSM (Mamba) O(1) · constant O(1) · constant Cheap, weaker on exact recall
Hybrid (Jamba, Nemotron-H) ~O(1) for most layers Mostly constant Near-Transformer quality at SSM cost

Production-deployed hybrids (2025–26): AI21 Jamba 1.5, NVIDIA Nemotron-H, TII Falcon-Mamba, Zyphra Zamba. You won't write Mamba code by hand — but if a model card says "hybrid" or "SSM," now you know what's inside and why it's faster on long inputs.

Concept 13 · Speculative Decoding — the silent 2× speedup

Generation is sequential: predict token, append, predict next. Each step needs a full forward pass through 70B parameters. Speculative decoding uses a small "draft" model to propose k tokens at once, then the big model verifies all k in a single parallel pass. Accepted tokens are kept; the first rejected one resets the draft.

Prompt + tokens so far "SELECT * FROM" Draft model (1B) cheap · fast · proposes 4 tokens customers WHERE id = Target model (70B) — verifies ALL 4 in ONE forward pass checks: would I have produced these same tokens? ✓ accept ✓ accept ✓ accept ✗ reject discard
Speculative decoding — accept the matches, discard the rest. 2–3× faster on average.

Where you'll see it: vLLM, TGI, llama.cpp, SGLang, Ollama, every commercial API. By 2025–26, it's not optional — every production inference stack runs it by default. Variants: EAGLE, Medusa, self-speculation (no separate draft model needed).

Concept 14 · Production inference stacks — beyond llama.cpp

llama.cpp is for local single-user inference. For serving thousands of concurrent users, you need a different stack — one that batches requests, manages KV cache memory, and supports speculative decoding at scale.

Stack Sweet spot Key feature
llama.cpp / Ollama Local · single user · CPU+GPU GGUF quantization, runs anywhere
vLLM High-throughput GPU serving PagedAttention — packs KV cache like virtual memory
SGLang Structured generation, agents RadixAttention — caches prefixes across requests
TensorRT-LLM NVIDIA GPUs · max throughput Hand-tuned CUDA kernels, in-flight batching
HuggingFace TGI Easy deploy · HF ecosystem Drop-in for any HF model

Rule of thumb (2026): running on your laptop → Ollama / llama.cpp. Serving an internal team (10–100 QPS) → vLLM. Building an agent platform with shared system prompts → SGLang (its prefix caching shines when 1000 agents share the same 5K-token system prompt). Behind a paid API at scale → TensorRT-LLM on H100/B200.

When to use what — the practical decision matrix

The hardest part of using all this isn't understanding any single concept — it's knowing which tool to reach for. Here's the rough decision tree you'll actually use:

Your problem Reach for Avoid
Model doesn't know our private docs / DB schema RAG (vector DB + retrieval) Fine-tuning · stuffing 1M tokens of context
Model needs to follow a specific style / format consistently SFT or LoRA fine-tune on 100–10k examples RAG · long system prompts
Model hallucinates on rare technical terms (e.g. nrcelldu) RAG with synonym enrichment, or LoRA Just prompting harder
Need to reason through hard math / multi-step bugs Reasoning model (o-series, R1, Claude extended thinking) Cheap chat model with longer prompts
Need to read / process a 200-page document end-to-end Long-context model (Gemini 1M, Claude 200K) Naive chunked RAG (loses cross-references)
Need to take multi-step actions (edit files, run tests, deploy) Agent loop + MCP tools + a real harness A single LLM call with a long prompt
Need cheap, private, fully offline inference Quantized open-weight model on Ollama / llama.cpp A frontier API for everything
Serving 100+ QPS to internal users vLLM / SGLang / TGI + speculative decoding Single-process Ollama
Want to specialize a model on a verifiable task (SQL, code, math) GRPO / RLVR fine-tune (TRL, Unsloth, verl) Classic RLHF (needs human labelers)
Want your tool to plug into many AI clients Expose it as an MCP server Custom HTTP adapter per client

How everything fits together

Foundation: the Transformer (2017)
Scale it — pretraining · scaling laws · RLHF · constitutional AI
Make it cheaper — quantization · MoE · LoRA · llama.cpp · vLLM · speculative decoding · SSM/hybrid
Give it knowledge — RAG · long context · multimodal
Give it actions — function calling · MCP · agents · harnesses
Make it think harder — chain-of-thought · reasoning models · GRPO/RLVR · test-time compute

Where your rag-sql project sits in 2026: you already use an embedding model (encoder Transformer) for Qdrant, a quantized decoder (Llama via Ollama / llama.cpp) for SQL generation, and RAG to inject schema context. The natural next steps: (1) expose it as an MCP server so Cursor's agent can call it directly, (2) move serving to vLLM when you outgrow single-user, (3) try a small GRPO/RLVR fine-tune where the verifier is "did the SQL execute and return the right rows?", and (4) optionally route hard queries to a reasoning model with a thinking budget. You're already using six of the concepts on this page — knowing the names just gives you the vocabulary to extend it.

1 / 9