What is a Neural Network, Really?
Starting from biology, then turning it into math
The biological inspiration
Your brain has ~86 billion neurons. Each neuron receives signals from thousands of other neurons, does a tiny calculation, and decides: "should I fire or not?". If it fires, it passes a signal forward. That's it.
In 1943, McCulloch & Pitts said: what if we model this in math?
Think of a single neuron like a committee vote. 5 people are voting. Some votes count more than others (weights). If the total weighted votes cross a threshold, the motion passes (the neuron fires). If not, it stays silent.
The artificial neuron — one unit
An artificial neuron takes several inputs, multiplies each by a weight, sums them all up, adds a bias, then passes through an activation function.
import numpy as np # Three inputs (e.g., pixel values, sensor readings, etc.) inputs = [0.8, 0.3, 0.6] # Weights — how much does this neuron care about each input? # These start random. Training will LEARN the right values. weights = [0.4, -0.2, 0.9] bias = 0.1 # A constant offset (shifts the decision boundary) # Step 1: weighted sum z = np.dot(inputs, weights) + bias # = (0.8×0.4) + (0.3×-0.2) + (0.6×0.9) + 0.1 = 0.98 # Step 2: activation function (sigmoid squeezes output to 0..1) def sigmoid(x): return 1 / (1 + np.exp(-x)) output = sigmoid(z) # → 0.727 # Closer to 1 = "neuron fired", closer to 0 = "neuron silent"
Why do we need layers?
One neuron can only draw one straight line to separate data. Real problems are not linearly separable. You cannot classify cats vs dogs with one line.
Stack neurons into layers, and each layer learns increasingly abstract features:
A neural network is just many neurons stacked in layers. The weights are the only thing that gets trained. Everything else — the architecture, the layers, the activation functions — you design. The training process finds the right weights automatically.
Activation functions — why they matter
Without an activation function, stacking layers is useless — you'd just get one big linear equation. Activations add non-linearity, which is what lets networks learn curves, shapes, and complex patterns.
| Activation | Range / Formula | Used for |
|---|---|---|
| Sigmoid | output ∈ (0, 1) | Binary classification outputs |
| Tanh | output ∈ (−1, 1) | Centered, better for hidden layers |
| ReLU | max(0, x) | Modern default — fast, simple, works great |
| Softmax | outputs sum to 1 | Multi-class output (probabilities) |
How a Network Actually Learns
Backpropagation, loss functions, and gradient descent — demystified
The core loop: guess → measure error → adjust
Learning is just a repetitive loop. The network makes a guess, you measure how wrong it was, and you nudge every weight slightly in the direction that makes it less wrong. Repeat millions of times.
Forward Pass
Input flows through every layer, layer by layer. Each neuron computes its weighted sum + activation. You get a prediction at the end.
Compute Loss
Compare the prediction to the real answer using a loss function. This gives you a single number: "how wrong are we right now?"
Backward Pass (Backpropagation)
Calculate how much each weight contributed to the error. This uses calculus (chain rule) to propagate error signals backward through the network.
Gradient Descent — Update Weights
Nudge each weight in the direction that reduces the loss. The size of the nudge is controlled by the learning rate.
Loss functions
A loss function is like your exam score — but inverted. Instead of maximizing your score, you're minimizing your mistakes. The loss is 0 if you're perfect, and grows the more wrong you are. The network's only goal during training is to make this number as small as possible.
import numpy as np # === Mean Squared Error (regression tasks) === def mse_loss(predictions, targets): return np.mean((predictions - targets) ** 2) # === Cross-Entropy Loss (classification tasks) === # "How surprised was the model?" — penalizes confident wrong answers hard def cross_entropy(pred_prob, true_class): return -np.log(pred_prob[true_class]) # Example: model predicted [0.1, 0.8, 0.1], correct class is 1 # Loss = -log(0.8) ≈ 0.22 (low — model was right and confident) # Example: model predicted [0.1, 0.1, 0.8], correct class is 1 # Loss = -log(0.1) ≈ 2.3 (high — model was confidently wrong)
Gradient Descent — finding the bottom of the hill
Imagine you're blindfolded on a hilly landscape. Your goal is to reach the lowest valley. You can't see — but you can feel which direction is downhill under your feet. Gradient descent is: take one small step in the downhill direction, repeat. The "landscape" is the loss function across all possible weight values.
# learning_rate controls step size # too large: you overshoot and bounce around # too small: training takes forever learning_rate = 0.01 for epoch in range(1000): # Forward pass — get predictions predictions = forward(X, weights) # Compute how wrong we are loss = mse_loss(predictions, y_true) # Backward pass — compute gradient (how loss changes per weight) gradients = backward(loss, weights) # Update: move weights in the opposite direction of gradient weights = weights - learning_rate * gradients
What backpropagation actually is
Backpropagation is just the chain rule from calculus, applied systematically across all layers. It answers: "if I change this weight by a tiny amount, how much does the total loss change?"
In PyTorch/TensorFlow, calling loss.backward() does all of this automatically. You never write backprop by hand — but knowing it exists is crucial to debugging.
Nothing magic is happening. Training a neural network is just: (1) guess, (2) measure error, (3) calculate which direction each weight should move to reduce error, (4) move them a tiny bit, (5) repeat on millions of examples. The "intelligence" that emerges is a statistical pattern captured in billions of weight values.
How Sequences Were Handled Before 2017
RNNs, LSTMs, and why they were the best we had — and still not good enough
The sequence problem
Language, audio, and time-series data are sequences. The order matters. "The dog bit the man" and "The man bit the dog" have the same words but completely different meanings.
A regular neural network (feedforward) takes a fixed-size input and produces a fixed-size output. It has no concept of order or time. So how do you handle sequences?
Recurrent Neural Networks (RNNs)
The idea: process one word at a time, and keep a "hidden state" — a memory vector that carries information forward as you read each word.
Imagine you're reading a book, but you can only look at one word at a time, and you have a small sticky note to jot down what you remember so far. Each new word, you update the note. The note is the hidden state. When you've read all words, the final note is your "understanding" of the sentence.
import numpy as np def rnn_cell(x_t, h_prev, W_x, W_h, b): """ x_t : current word's vector representation h_prev : hidden state from previous step (memory) W_x : weights for input W_h : weights for previous hidden state b : bias """ # Combine current input + previous memory z = W_x @ x_t + W_h @ h_prev + b # Squash through tanh to keep values bounded h_new = np.tanh(z) return h_new # Process the sentence "The cat sat" h = np.zeros(128) # initial memory: all zeros for word_vector in sentence_embeddings: h = rnn_cell(word_vector, h, W_x, W_h, b) # h now "remembers" everything up to this word
The vanishing gradient problem
Here's the fundamental issue with RNNs: when you backpropagate through time (through 50+ steps), the gradients get multiplied together at each step. If each multiplication makes them slightly smaller... after 50 steps they're essentially zero.
The RNN effectively forgets. By the time it reads word 50, the gradient signal from word 1 is so tiny that the weights at step 1 never actually get updated. The network can't learn long-range dependencies. "The animal didn't cross the street because it was too tired" — what does "it" refer to? A human knows it's the animal (word 2). An RNN often doesn't.
LSTMs — the engineering fix
Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) added explicit memory management. Instead of one hidden state, an LSTM has a cell state (long-term memory) and three gates that control what to remember, forget, and output.
Forget gate
"How much of the old memory do I keep?" — outputs 0 (forget all) to 1 (keep all) per dimension.
Input gate
"How much of the new information do I add to memory?" — controls writes into the cell state.
Output gate
"What part of memory do I expose right now?" — controls what becomes the new hidden state passed forward.
The cell state (long-term memory) flows through with minimal modification — so gradients can now travel further back without vanishing.
LSTMs were dominant from ~2013–2017 and powered Google Translate, Siri, and many production NLP systems. But they were still fundamentally sequential — you had to process word 1 before word 2. You couldn't parallelize.
The Encoder-Decoder architecture
For translation (English → French), the approach was: use one LSTM to encode the entire input sentence into a fixed vector, then use another LSTM to decode that vector into the output sentence.
The bottleneck: the entire input sentence must fit into one fixed-size context vector. For long sentences, this vector simply can't hold everything — and information from early words gets squeezed out.
In 2015, Bahdanau et al. introduced attention as an add-on to fix this bottleneck. The decoder could now "look back" at all encoder hidden states, not just the last one. This was the precursor to everything.
What Was Actually Broken
The exact problems that "Attention Is All You Need" was solving
Three fundamental problems with RNNs/LSTMs
| Problem | With RNN/LSTM | What we needed |
|---|---|---|
| Sequential dependency | Must process word 1, then 2, then 3... Cannot parallelize. Training a 512-word sentence on GPU = GPU waiting for 512 serial steps. | Process all words at once, in parallel. GPUs are good at that. |
| Long-range dependencies | Even with LSTM, remembering something from 100 words ago is unreliable. Information degrades with distance. | Any word should be able to directly attend to any other word, regardless of distance. |
| Fixed bottleneck | Encoder-decoder compresses entire input into one vector. Long documents lose information. | Keep all token representations available throughout the whole process. |
The word embedding problem — inputs as vectors
Before understanding Attention, you need to know how words become numbers. Neural networks only understand numbers.
Think of word embeddings like GPS coordinates for meaning. Words that mean similar things are "near" each other in this mathematical space. "King" and "Queen" are close. "King" and "Pizza" are far apart. The classic example: King − Man + Woman ≈ Queen. The vector arithmetic captures semantic relationships.
# Each word maps to a dense vector of floats (e.g., 512 dimensions) # These vectors are LEARNED during training, not hand-crafted from torch import nn # vocab_size: how many unique tokens in our vocabulary (e.g. 50,000) # embed_dim: how many numbers represent each word (e.g. 512) embedding = nn.Embedding(vocab_size=50000, embedding_dim=512) # Word ID → vector word_id = 342 # e.g., token for "cat" vector = embedding(word_id) # vector is now shape [512] — a point in 512D space # A sentence "The cat sat" becomes a matrix: # shape = [3 words × 512 dimensions] # This matrix is the input to the Transformer
The core insight that changed everything
What if instead of passing a hidden state left-to-right through a sequence, we let every word look at every other word directly — and decide how much to "pay attention" to each one?
This means the word "it" in "The animal didn't cross because it was tired" can directly look at "animal" and compute a high attention score — without needing to relay information through 4 other words.
The key shift: stop thinking about sequences as chains you process one link at a time. Start thinking about sets of tokens that all interact with each other simultaneously. This is what enables parallel computation and removes the distance limitation.
The Attention Mechanism
The single idea that made modern AI possible
The query-key-value intuition
Attention is built around three vectors per token: Query (Q), Key (K), and Value (V). Each is computed from the token's embedding using learned weight matrices.
Think of it like a search engine inside the network. Every word sends out a Query (what am I looking for?). Every word also broadcasts a Key (what do I contain?). When a query matches a key well, the corresponding Value (the actual information) is retrieved and passed forward. The word "it" queries "who am I referring to?", finds "animal" has a matching key, and retrieves animal's value information.
Compute Q, K, V for every token
Multiply each token embedding by three learned weight matrices: W_Q (what am I looking for?), W_K (what do I offer?), W_V (what information do I carry?).
Score: Q · Kᵀ
Dot product between every Query and every Key produces an N×N matrix of similarity scores — "how relevant is token j to token i?"
Scale by √d_k
Without scaling, large dot products push softmax into saturation regions where gradients vanish. Dividing by √d_k keeps things stable.
Softmax → attention weights
Convert each row to a probability distribution that sums to 1. This is "what % of attention does token i pay to each other token?"
Multiply by V
Take a weighted blend of all Value vectors using those attention weights. Each token's output is now a context-aware mix of every other token.
In Python — from scratch
import numpy as np def attention(Q, K, V): """ Q: [seq_len, d_k] — queries (what each token looks for) K: [seq_len, d_k] — keys (what each token offers) V: [seq_len, d_v] — values (what each token contains) """ d_k = Q.shape[-1] # How much does each query match each key? # scores[i][j] = "how much should token i attend to token j?" scores = Q @ K.T / np.sqrt(d_k) # shape: [seq_len, seq_len] # Convert to probabilities (each row sums to 1) def softmax(x): e = np.exp(x - x.max(axis=-1, keepdims=True)) return e / e.sum(axis=-1, keepdims=True) attn_weights = softmax(scores) # Each row = "what % of attention goes to each other token?" # Weighted blend of value vectors output = attn_weights @ V # output[i] = a mix of all tokens' values, weighted by relevance return output, attn_weights
Critical insight: Every token attends to every other token in one matrix multiplication. Sequence of length N requires one N×N matrix multiply — all done in parallel on GPU. Compare to RNN's N sequential steps. This is why Transformers train so much faster.
Try it: which word does "it" attend to?
Click any token to see which other tokens it would most strongly attend to in this sentence. (Scores are illustrative — but they reflect what real attention heads actually learn.)
Multi-Head Attention
One attention operation can only capture one type of relationship. Multi-head attention runs the attention mechanism multiple times in parallel (e.g., 8 or 16 heads), each with its own learned W_Q, W_K, W_V matrices.
It's like having 8 different analysts all reading the same document simultaneously. Head 1 might focus on subject-verb relationships. Head 2 might focus on pronoun coreference. Head 3 might focus on syntactic dependencies. Their outputs are concatenated and projected into a single vector. More perspectives = richer understanding.
"Do we tell each head what to look for?"
Short answer: no. Each head's W_Q, W_K, W_V matrices start as small random numbers. Nothing in the architecture says "head 3 = pronouns." During training, the only signal the model gets is the next-token loss, propagated back through every weight via backpropagation.
So why do heads specialize? Because the architecture forces them to:
The output is a concatenation of all heads
If two heads compute the exact same thing, one is redundant — it doesn't reduce loss, but it costs parameters. Gradient descent quickly pushes them apart.
Each head has a smaller dimension
If d_model=512 and there are 8 heads, each head only operates on 64 dimensions. No single head has the capacity to model everything — they have to divide the labor.
Loss rewards diverse, useful patterns
Heads that capture genuinely different relationships (one tracks syntax, another tracks semantics, another tracks long-range dependencies) reduce loss the most. So that's what they drift toward.
This is one of the most beautiful things about deep learning: specialization is emergent, not designed. Researchers later probe trained models (e.g., the famous "What does BERT look at?" paper) and discover heads that track subject-verb agreement, coreference, syntactic dependencies — none of which were programmed in. The architecture creates pressure for diversity, and the data does the rest.
The Full Transformer Architecture
"Attention Is All You Need" — what the paper actually built
The original paper — "Attention Is All You Need"
The 2017 Google paper (Vaswani et al., arXiv:1706.03762) stacked attention layers into a full architecture. No recurrence. No convolutions. Just attention and feed-forward layers.
Walking through every stage
Here's exactly what happens to a sentence as it flows through the architecture above, top to bottom:
Tokenization & Input Embeddings
"How are you?" is split into tokens, each token ID is looked up in an embedding table, producing a vector of size d_model (e.g., 512). Output: a matrix of shape [seq_len × 512].
Positional Encoding
Attention has no built-in notion of order. A sinusoidal positional vector is added to each token embedding so the model knows token 1 is different from token 5.
Encoder · Multi-Head Self-Attention
Every input token attends to every other input token in parallel. Each token's representation becomes a weighted blend of all input tokens — context-aware, bidirectional.
Add & LayerNorm (after attention)
The original input is added back to the attention output (residual connection — prevents vanishing gradients in deep stacks). Then LayerNorm normalizes each token vector independently for training stability.
Encoder · Feed-Forward Network
Two linear layers with ReLU in between, applied independently to each token position. This is where most of the model's parameters live, and where factual knowledge is stored as a key-value memory.
Add & LayerNorm (after FFN) — repeat × 6
Another residual + LayerNorm. This whole block (attention → norm → FFN → norm) is stacked 6 times. The output is a deeply contextualized representation of the input sentence.
Decoder · Masked Self-Attention
Same as encoder self-attention, but with a causal mask: token t can only attend to tokens 1…t. This prevents cheating — when generating word 5, the model can't peek at words 6+.
Decoder · Cross-Attention
The crucial bridge between input and output. Q comes from the decoder (the partially generated output), K and V come from the encoder's final output. This is how the decoder "looks at" the source sentence while generating each target word.
Decoder · Feed-Forward Network — repeat × 6
Same FFN as encoder, applied per position. Stack of 6 decoder layers total.
Linear + Softmax → next-token probabilities
The final decoder output is projected into vocabulary space (size ~50,000) and softmax produces a probability distribution. The highest-probability token is emitted, fed back into the decoder, and the process repeats until an end-of-sequence token is generated.
The same building block is reused everywhere: Attention → Add&Norm → FFN → Add&Norm. The encoder uses self-attention, the decoder adds masking and cross-attention. That's the entire architecture. Modern GPT models drop the encoder entirely (decoder-only), and BERT drops the decoder (encoder-only) — both are subsets of this original design.
Positional Encoding — solving "no order" problem
Attention has no concept of order. If you shuffle the input words, you get the same attention scores. So the paper adds positional encodings — vectors that encode each token's position in the sequence — and add them to the embeddings.
import torch import math def positional_encoding(seq_len, d_model): """ seq_len: number of tokens in the sequence d_model: embedding dimension (e.g., 512) Returns positional vectors to ADD to token embeddings """ PE = torch.zeros(seq_len, d_model) position = torch.arange(0, seq_len).unsqueeze(1) # Use sine/cosine waves of different frequencies # Each position gets a unique "fingerprint" vector div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) PE[:, 0::2] = torch.sin(position * div_term) PE[:, 1::2] = torch.cos(position * div_term) return PE # Add to your embeddings before feeding to Transformer x = token_embeddings + positional_encoding(seq_len, d_model)
Layer Norm and Feed-Forward — the other components
After each attention block, two things happen:
Add & LayerNorm: The input is added back to the output (residual connection), then normalized. Residuals prevent vanishing gradients in deep networks and enable extremely deep architectures.
Feed-Forward Network (FFN): Two linear layers with ReLU in between. Applied independently to each token position. This is where the network stores and retrieves factual knowledge — each FFN layer acts like a key-value memory store.
import torch.nn as nn class TransformerBlock(nn.Module): def __init__(self, d_model=512, n_heads=8, d_ff=2048): super().__init__() # Multi-head self-attention self.attention = nn.MultiheadAttention(d_model, n_heads) # Layer normalizations self.norm1 = nn.LayerNorm(d_model) self.norm2 = nn.LayerNorm(d_model) # Feed-forward network (applied to each position independently) self.ffn = nn.Sequential( nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model) ) def forward(self, x): # Self-attention + residual connection + normalization attn_out, _ = self.attention(x, x, x) x = self.norm1(x + attn_out) # "Add & Norm" # Feed-forward + residual + normalization ffn_out = self.ffn(x) x = self.norm2(x + ffn_out) # "Add & Norm" return x
GPT vs BERT — two ways to use the Transformer
| Aspect | BERT (encoder only) | GPT (decoder only) |
|---|---|---|
| Uses | Encoder stack only | Decoder stack only |
| Attention type | Bidirectional (sees all tokens) | Causal/masked (only sees past tokens) |
| Training task | Mask random words, predict them | Predict next token given previous |
| Good for | Classification, Q&A, embeddings | Generation, completion, chat |
| Examples | BERT, RoBERTa, your Qdrant embeddings | GPT-4, Claude, Llama, Mistral |
How LLMs Are Actually Trained
Pre-training, fine-tuning, RLHF — the full pipeline
Stage 1: Pre-training — next token prediction
The base LLM is trained on a massive corpus (trillions of tokens from the internet, books, code). The task is brutally simple: given all previous tokens, predict the next token.
Imagine an exam where the only question is: "given this half-sentence, what word comes next?" But the exam has 1 trillion questions. To answer them well, the model must implicitly learn grammar, facts, reasoning, coding patterns, world knowledge — everything. The single task of "predict the next word" compresses an extraordinary amount of knowledge into the weights.
for batch in training_data: # batch = tokenized text like: [42, 8931, 102, 5531, 99, ...] # Input: all tokens except the last # Target: all tokens except the first (shifted by 1) input_ids = batch[:-1] # "The cat sat on the" target_ids = batch[1:] # "cat sat on the mat" # Model predicts probability distribution over vocab for each position logits = model(input_ids) # logits shape: [seq_len, vocab_size] e.g., [512, 50000] # Cross-entropy loss: how surprised was the model by the real next word? loss = cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1)) loss.backward() # backprop through ALL layers of the Transformer optimizer.step() # update all weights (billions of them)
GPT-3 was trained on ~300 billion tokens. The model has 175 billion parameters (weights). Training cost ~$4.6 million in compute. Llama 3 70B was trained on 15 trillion tokens. This is why you don't train from scratch — you fine-tune.
Stage 2: Supervised Fine-Tuning (SFT)
A base model is just a text completion engine. It will complete "How do I make a bomb?" by continuing in the style of whatever it was trained on. To make it a helpful assistant, you fine-tune on instruction-response pairs.
{
"instruction": "Summarize this text in 3 bullet points",
"input": "... long article ...",
"output": "• Key point 1\n• Key point 2\n• Key point 3"
}
Thousands-to-millions of such examples teach the model to produce outputs that match this instruction-response format. The result: a model that follows instructions instead of just continuing text.
Stage 3: RLHF — making it actually good
SFT gets you an instruction-following model. RLHF (Reinforcement Learning from Human Feedback) makes it aligned — honest, helpful, harmless.
Collect human preference data
Show humans two model responses to the same prompt. They pick which one is better. Collect millions of these comparisons.
Train a Reward Model
Train a separate model to predict which response a human would prefer. This model gives a "quality score" to any response.
PPO — Reinforcement Learning
Use the reward model as the "environment". Optimize the LLM using PPO to maximize the reward score while staying close to the SFT model (to prevent it going off the rails).
Think of RLHF like training a new employee. Pre-training = they went to school and read everything. SFT = you give them a job manual. RLHF = you watch them work, give feedback on what they do well and poorly, and they adjust their behavior accordingly.
Tokenization — what the model actually sees
Models don't see words. They see tokens — sub-word chunks. "unhappiness" → ["un", "happi", "ness"]. This lets the model handle words it has never seen.
from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B") text = "The telecom network had 5G latency issues" tokens = tokenizer.encode(text) # → [450, 14584, 3904, 2400, 29871, 29945, 29954, 21162, 5626] # Decode to see the actual chunks: decoded = [tokenizer.decode([t]) for t in tokens] # → ['The', ' tele', 'com', ' network', ' had', ' 5', 'G', ' lat', 'ency', ' issues'] # This is why domain-specific terms fragment strangely — # "nrcelldu" would tokenize into many small pieces # This is the exact reason your schema synonym enrichment matters!
From Text Input to Generated Response
The full modern stack — generation, RAG, and how your rag-sql fits in
Inference: how a model generates text
During generation, the model outputs one token at a time. Each token becomes part of the input for the next step (autoregressive generation). The model predicts a probability distribution over the entire vocabulary, and you sample from it.
Step 1
Input ["SELECT", "*", "FROM"] → model produces P(next token). Highest prob: "customers".
Step 2
Input ["SELECT", "*", "FROM", "customers"] → model produces P(next). Highest prob: "WHERE".
Step 3 … N
Repeat — append predicted token, run again — until the model emits the <EOS> end-of-sequence token.
| Temperature | Behavior | Best for |
|---|---|---|
| 0 | Always pick the single most likely token (deterministic) | SQL generation, structured output, code |
| 0.7 | Sample from distribution with mild flattening | Default for chat — balanced |
| 1.0+ | More random, more creative, less coherent | Brainstorming, creative writing |
import torch import torch.nn.functional as F def generate(model, prompt_tokens, max_new_tokens=100, temperature=0.7): tokens = prompt_tokens.copy() for _ in range(max_new_tokens): # Forward pass: get logits for ALL positions logits = model(tokens) # Take only the LAST position's logits (the next token prediction) next_logits = logits[-1] / temperature # Convert to probabilities probs = F.softmax(next_logits, dim=-1) # Sample (don't just take argmax — that's too rigid) next_token = torch.multinomial(probs, 1).item() if next_token == EOS_TOKEN: break tokens.append(next_token) return tokens
What to explore next
You now have the foundation. The next chapter covers everything that has happened since the Transformer — and the modern stack you'll actually build on.
The Modern Stack — Everything After the Transformer
RAG, quantization, agents, reasoning models, MCP — what's actually used in production through 2026
The chronological map
The Transformer (2017) was the architectural breakthrough. Everything since has been about scaling it up, making it cheaper, connecting it to tools, and letting it think and act. This timeline only includes things that are actually used in production today — no hype, no toys.
Concept 1 · RAG (Retrieval-Augmented Generation)
An LLM's knowledge is frozen at training time. RAG fixes this without retraining: when a question comes in, retrieve relevant documents from a vector database and stuff them into the prompt. The model now "knows" your private data.
Why RAG matters: retraining a 70B model costs millions. Adding a document to a vector DB costs cents. RAG also gives you citations (you can show which chunks were retrieved) and fresh data (last week's docs are queryable today).
Concept 2 · Context, the Context Window, and KV Cache
"Context" is just everything you put into the prompt: system message, conversation history, retrieved RAG chunks, tool outputs. The context window is the hard limit on how many tokens fit.
| Model | Context window (tokens) | Roughly |
|---|---|---|
| GPT-3.5 (2022) | 4K | ~5 pages |
| GPT-4 Turbo | 128K | ~250 pages |
| Claude 3.5 Sonnet | 200K | ~400 pages |
| Gemini 1.5 Pro | 1M – 2M | ~3,000+ pages, an hour of video |
| Llama 3.1 8B (local) | 128K | ~250 pages |
Attention is O(n²) in sequence length — doubling context makes attention 4× slower and uses 4× more memory. The KV cache is what makes inference tractable: during generation, the K and V matrices for previous tokens are cached so you don't recompute attention for them every step. This is why your first token is slow ("prefill") and subsequent tokens are fast ("decode").
Concept 3 · Quantization & Local Models (llama.cpp)
A 70B-parameter model in float16 is 140GB. You can't fit that in consumer RAM. Quantization compresses each weight from 16 bits to 8, 5, 4, or even 2 bits — with surprisingly small quality loss.
| Precision | Bits/weight | Llama-70B size | Quality loss |
|---|---|---|---|
| FP16 (full) | 16 | 140 GB | baseline |
| INT8 | 8 | 70 GB | negligible |
| Q5_K_M | ~5 | 48 GB | very small |
| Q4_K_M | ~4 | 40 GB | small (sweet spot) |
| Q2_K | ~2 | ~26 GB | noticeable |
llama.cpp (Georgi Gerganov, 2023) is the C++ inference engine that started the local-AI revolution. GGUF is its file format. Ollama wraps llama.cpp with a friendly API. When you run ollama pull llama3.1:8b, you're downloading a quantized GGUF and running it through llama.cpp under the hood.
Concept 4 · Mixture of Experts (MoE)
A standard Transformer activates every parameter for every token. MoE replaces the FFN with N experts and a router that picks just 2 of them per token. You get a model with the knowledge capacity of all N experts, but the inference cost of just 2.
Examples: Mixtral 8×7B (open), GPT-4 (rumored MoE), DeepSeek V3, Grok-1. The downside: you still need RAM for all experts because the router can pick any of them next token.
Concept 5 · Multimodal Models
A multimodal model accepts more than text — typically images, but also audio and video. The architectural trick is simple: convert every modality into tokens, then feed them into the same Transformer.
Examples: GPT-4o, Claude 3.5, Gemini 1.5, Llama 3.2 Vision, Qwen-VL. The same recipe works for audio (whisper-style tokens) and video (frame-by-frame ViT tokens).
Concept 6 · Tool Use & Function Calling
By itself an LLM can't access the internet, query a database, or run code. Function calling teaches it to emit structured JSON describing what tool it wants to invoke. Your code runs the tool, feeds the result back, and the model continues.
// You provide tool definitions in the system prompt: { "name": "get_weather", "description": "Get current weather for a city", "parameters": { "city": "string" } } // User: "What's the weather in Tokyo?" // Model emits this instead of plain text: { "tool_call": { "name": "get_weather", "arguments": { "city": "Tokyo" } } } // Your code runs the function, feeds result back: { "tool_result": { "temp": 22, "condition": "clear" } } // Model now produces the final answer: "It's 22°C and clear in Tokyo right now."
Concept 7 · MCP — Model Context Protocol
Function calling is per-application. Every team re-invents tool plumbing. MCP (Anthropic, Nov 2024) standardizes it: a client-server protocol where any LLM can talk to any tool / data source / prompt-library as long as both speak MCP. Think "USB for AI" — and by 2026 it has won as the de facto standard, the way LSP won for editors.
MCP exposes four primitive types over JSON-RPC:
Plus newer 2025 additions: Roots (which directories/URIs the host exposes to a server) and Elicitation (server asks the user a question mid-flow — "which env should I deploy to?"). These are what turned MCP from "function calling 2.0" into a real two-way agent protocol.
Before MCP: every product re-built tool integrations from scratch. After MCP: write your tool as an MCP server once, and Cursor, Claude Desktop, VS Code, Zed, and every cloud agent platform can use it. By 2026 there are hundreds of community MCP servers (GitHub, Postgres, Slack, Linear, Sentry, Datadog, Notion, …) and SDKs in TypeScript, Python, Rust, Go, Kotlin, C#. This is exactly the protocol the MCP servers in your Cursor setup are speaking.
Concept 8 · Agents & the Harness
An agent is an LLM in a loop: think → act → observe → think. The model decides what tool to call, your code runs it, the result goes back into the prompt, and the model decides the next step. The loop terminates when the model emits a "done" signal.
The harness is the surrounding code that makes the loop work: tool registry, prompt assembly, output parsing, retries, safety checks, token-budget management, and the UI. Frontier-class agentic products are 90% harness engineering on top of an off-the-shelf model.
Concept 9 · Mixture of Agents (MoA)
If one agent is good, multiple specialist agents in collaboration can be better. Mixture of Agents (Wang et al., 2024) stacks layers of agents: each layer's outputs become the next layer's inputs.
Concept 10 · Test-Time Compute & Reasoning Models
For 7 years the recipe was: spend more compute at training time → get a smarter model at inference. In Sep 2024, OpenAI's o1 flipped the script: spend more compute at inference time (let the model think longer) and you get the same gain. This is the biggest shift since RLHF.
| Model (released) | Open / Closed | Notable trait |
|---|---|---|
| OpenAI o1 / o3 / o4-mini (2024–25) | Closed | First, set the bar. Hidden CoT. |
| DeepSeek-R1 (Jan 2025) | Open weights | Matches o1, full CoT visible, GRPO recipe public |
| Qwen QwQ / Qwen3-Thinking | Open weights | Strong open replication, runs locally |
| Claude 3.7 / 4 — Extended Thinking | Closed | Toggleable: same model, with/without thinking |
| Gemini 2.0/2.5 Flash Thinking | Closed | Cheap reasoning at scale |
The new knob: at inference time you can now choose how much to spend per query. A factual lookup? Skip thinking. A debugging session? Crank thinking budget to 32K tokens. Modern APIs (OpenAI, Anthropic) expose reasoning_effort / thinking_budget parameters. By 2026, this is just a regular dial in your prompt config.
Concept 11 · GRPO & RLVR — how reasoning models are actually trained
RLHF needs humans to label "which answer is better." That doesn't scale to math problems with 10,000-token solutions. RLVR (Reinforcement Learning with Verifiable Rewards) replaces the human reward model with a program: run the unit tests, check the math, parse the SQL — if it passes, reward = 1.
GRPO (Group Relative Policy Optimization, DeepSeek) is the lightweight RL algorithm that made this practical. Instead of training a separate value/critic network like PPO, it samples multiple candidate solutions per prompt and uses their relative rewards as the signal.
Why this matters for your work: RLVR works for anything you can verify programmatically — and SQL execution is a perfect verifier. By 2025–26, fine-tuning a small SQL-specialized model with GRPO on "did the query run + return correct rows?" is a real, accessible recipe (TRL, Unsloth, verl all support it). Worth knowing this exists when you outgrow pure prompting.
Concept 12 · Hybrid architectures: Mamba & State Space Models
Pure attention is O(n²). For 1M-token contexts that becomes painful. State Space Models (Mamba, Dec 2023) revisit RNN-style linear-time sequence processing, but with selective gating that — empirically — competes with attention on quality.
By 2025, the winning recipe turned out to be hybrid: a few Transformer layers (for precise recall) interleaved with many Mamba/SSM layers (for cheap long context).
| Architecture | Compute per token | Memory per token | Long-context behavior |
|---|---|---|---|
| Pure Transformer | O(n) · grows with seq | O(n) · KV cache grows | Best recall, but expensive |
| Pure SSM (Mamba) | O(1) · constant | O(1) · constant | Cheap, weaker on exact recall |
| Hybrid (Jamba, Nemotron-H) | ~O(1) for most layers | Mostly constant | Near-Transformer quality at SSM cost |
Production-deployed hybrids (2025–26): AI21 Jamba 1.5, NVIDIA Nemotron-H, TII Falcon-Mamba, Zyphra Zamba. You won't write Mamba code by hand — but if a model card says "hybrid" or "SSM," now you know what's inside and why it's faster on long inputs.
Concept 13 · Speculative Decoding — the silent 2× speedup
Generation is sequential: predict token, append, predict next. Each step needs a full forward pass through 70B parameters. Speculative decoding uses a small "draft" model to propose k tokens at once, then the big model verifies all k in a single parallel pass. Accepted tokens are kept; the first rejected one resets the draft.
Where you'll see it: vLLM, TGI, llama.cpp, SGLang, Ollama, every commercial API. By 2025–26, it's not optional — every production inference stack runs it by default. Variants: EAGLE, Medusa, self-speculation (no separate draft model needed).
Concept 14 · Production inference stacks — beyond llama.cpp
llama.cpp is for local single-user inference. For serving thousands of concurrent users, you need a different stack — one that batches requests, manages KV cache memory, and supports speculative decoding at scale.
| Stack | Sweet spot | Key feature |
|---|---|---|
| llama.cpp / Ollama | Local · single user · CPU+GPU | GGUF quantization, runs anywhere |
| vLLM | High-throughput GPU serving | PagedAttention — packs KV cache like virtual memory |
| SGLang | Structured generation, agents | RadixAttention — caches prefixes across requests |
| TensorRT-LLM | NVIDIA GPUs · max throughput | Hand-tuned CUDA kernels, in-flight batching |
| HuggingFace TGI | Easy deploy · HF ecosystem | Drop-in for any HF model |
Rule of thumb (2026): running on your laptop → Ollama / llama.cpp. Serving an internal team (10–100 QPS) → vLLM. Building an agent platform with shared system prompts → SGLang (its prefix caching shines when 1000 agents share the same 5K-token system prompt). Behind a paid API at scale → TensorRT-LLM on H100/B200.
When to use what — the practical decision matrix
The hardest part of using all this isn't understanding any single concept — it's knowing which tool to reach for. Here's the rough decision tree you'll actually use:
| Your problem | Reach for | Avoid |
|---|---|---|
| Model doesn't know our private docs / DB schema | RAG (vector DB + retrieval) | Fine-tuning · stuffing 1M tokens of context |
| Model needs to follow a specific style / format consistently | SFT or LoRA fine-tune on 100–10k examples | RAG · long system prompts |
Model hallucinates on rare technical terms (e.g. nrcelldu) |
RAG with synonym enrichment, or LoRA | Just prompting harder |
| Need to reason through hard math / multi-step bugs | Reasoning model (o-series, R1, Claude extended thinking) | Cheap chat model with longer prompts |
| Need to read / process a 200-page document end-to-end | Long-context model (Gemini 1M, Claude 200K) | Naive chunked RAG (loses cross-references) |
| Need to take multi-step actions (edit files, run tests, deploy) | Agent loop + MCP tools + a real harness | A single LLM call with a long prompt |
| Need cheap, private, fully offline inference | Quantized open-weight model on Ollama / llama.cpp | A frontier API for everything |
| Serving 100+ QPS to internal users | vLLM / SGLang / TGI + speculative decoding | Single-process Ollama |
| Want to specialize a model on a verifiable task (SQL, code, math) | GRPO / RLVR fine-tune (TRL, Unsloth, verl) | Classic RLHF (needs human labelers) |
| Want your tool to plug into many AI clients | Expose it as an MCP server | Custom HTTP adapter per client |
How everything fits together
Where your rag-sql project sits in 2026: you already use an embedding model (encoder Transformer) for Qdrant, a quantized decoder (Llama via Ollama / llama.cpp) for SQL generation, and RAG to inject schema context. The natural next steps: (1) expose it as an MCP server so Cursor's agent can call it directly, (2) move serving to vLLM when you outgrow single-user, (3) try a small GRPO/RLVR fine-tune where the verifier is "did the SQL execute and return the right rows?", and (4) optionally route hard queries to a reasoning model with a thinking budget. You're already using six of the concepts on this page — knowing the names just gives you the vocabulary to extend it.