ML Bible

ML Bible · Chapter 1

Neural Networks

From a single neuron up through backpropagation, optimizers, and the practical craft of training a neural network.

1. What Machine Learning Is Doing

We begin with the big picture. Normally, to make a computer do something, you write down the rules yourself. "If the email contains the word lottery, mark it as spam." The computer follows your rules to the letter. That approach works fine until the rules get too tangled to write down, and for a surprising number of useful problems, they are impossible to write down at all.

Try to write instructions a computer can follow that separate a cat from a dog in a photo, stated in terms of the millions of pixel values it actually receives. I'll save you the effort: you can't do it. Nobody can. You know a cat when you see one, but that knowledge lives in your head as intuition, and intuition doesn't come with instructions attached.

Machine learning fixes this problem. Instead of writing the rules, you write a program that figures out the rules from examples. You show it thousands of photos already labeled "cat" or "dog," and it works out the pattern on its own. You never have to say what makes a cat a cat. You just need examples and a procedure for turning examples into a rule.

That is the whole enterprise. Everything in this chapter serves that one goal: take a ton of examples and squeeze a rule out of them that the computer can apply to the task.

So what is the rule, in practice? It's always a function, which is just a machine that takes something in and gives something out. You hand it an input, it hands you back an output. We write it $f(x)$, read as "the output of the function $f$ on input $x$." Feed in the pixels of a photo, get back "cat." Feed in the square footage of a house, get back a predicted price.

The part that makes this learning rather than just a function is that ours has adjustable knobs inside it. Picture a machine with a few million little dials on the side. Set the dials one way and it maps cat photos to "dog." Set them another way and it gets the answer right. Learning is the process of finding the dial settings that make the function behave. We call those dials the parameters and bundle them under one symbol, the Greek letter theta, $\theta$. So a more precise way to write the function is $f(x; \theta)$: the output depends both on the input $x$ and on the current setting $\theta$ of all the knobs. The semicolon just separates "the thing we're classifying" from "the thing we're tuning."

Training means finding a good setting of $\theta$. Hold on to that sentence, because the rest of this chapter is in service of it.
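To make "a function with knobs" concrete, here is a minimal sketch. The toy model, its two knobs (one weight and one bias), and the input value are all assumptions chosen for illustration; a real network has millions of knobs, but the idea is the same: same input, different $\theta$, different output.

```python
def f(x, theta):
    """A toy model f(x; theta): one input, two knobs (a weight and a bias)."""
    w, b = theta
    return w * x + b

x = 2.0
print(f(x, (0.5, 1.0)))    # knobs set one way: 2.0
print(f(x, (3.0, -1.0)))   # same input, different knobs: 5.0
```

Training, in this picture, is whatever procedure picks the tuple passed as `theta`.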

Check your understanding

Why can't we just write explicit rules for a task like recognizing cats in photos, the way we would for sorting numbers?

Because the rule lives in our intuition, not as something we can state in terms of raw pixel values. For most perception and language tasks, no human can actually write the rule down, so instead we let the model learn it from labeled examples.

2. Preliminary Definitions

Here is the working vocabulary. Skim it now to get the shape of each term, and it will settle in as we start using them.

Model
A function, with "learnable" values, that maps inputs to outputs. Formally, a model is a function $f(x; \theta)$ where $x$ is the input and $\theta$ are the learnable parameters set during training.
Label
The output or target variable the model is trying to predict, typically denoted $y$. For example, the house price in a model that predicts house prices.
Features
The input variables that describe each example, typically denoted $x$ or $X$. For example, square footage in a house price model.
Supervised Learning
A method of training where a model is trained on input-output pairs, with each training example having a label.
Unsupervised Learning
A method of training where the model is only given inputs and no labels, and must discover the structure in the data on its own. The goal here is to get the model to discover hidden patterns by itself.
Semi-Supervised
The model trains on a small amount of labeled data and a lot of unlabeled data. This is particularly useful when labels are "expensive," for example medical imaging.
Self-Supervised Learning
The model generates its own labels from the structure of the input data, then trains in a supervised manner on those auto-generated labels. For example, next-token prediction (Chapter 3) is a form of self-supervised learning.
Classification
A supervised learning task where the output (label) is a discrete category, assigning each input to one of a finite set of classes. For instance, spam detection is a classification task: each email is assigned to one of two classes, spam or not spam.
Regression
A supervised learning task where the output is a continuous numerical value. The model predicts a real number rather than a category. For example, predicting house prices.
Train/Validation/Test Split
When working with a model, we split our dataset into subsets. The training set is used to fit model parameters. The validation set is used to tune hyperparameters. The test set is used only for estimating model performance (e.g. accuracy).
Cross-Validation
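A method for estimating model performance by repeatedly splitting the data into different training and validation subsets (folds), training on all but one fold, evaluating on the held-out fold, and averaging the scores. It is especially useful when data is limited.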
Parameters
The internal variables of a model learned from data during training. In a linear regression model, given by $y = wx + b$, the weight $w$ and the bias $b$ are parameters.
Hyperparameters
Configuration values set before training begins that are not learned from data. For example, the learning rate (we'll explore this in a bit).
Underfitting
When a model is too simple or undertrained to capture the underlying structure in the data, resulting in poor performance on both the training data and unseen data.
Overfitting
When a model memorizes idiosyncrasies and noise of the training data rather than learning generalizable patterns. This results in very high accuracy on training data but poor performance on unseen data.
Generalization
The ability of a model to perform well on new, unseen data drawn from the same distribution as the training data. It is the ultimate goal of machine learning.
Tensor
A multi-dimensional array of numbers, and the fundamental data structure of modern ML. A tensor's rank describes how many dimensions it has: a single number has rank 0, a vector has rank 1, and a matrix has rank 2. Storing data as tensors lets operations run efficiently on GPUs and TPUs.
Matrix
A two-dimensional, rectangular array.
Pre-Training
The first and most expensive training phase, where a model learns broad, general-purpose patterns from a large body of data before it is adapted to any specific task.
Post-Training
Everything done after pre-training to adapt the model to do something useful, such as fine-tuning on curated examples or learning from human feedback.
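
The train/validation/test split from the definitions above can be sketched in a few lines. This is a minimal illustration, not a standard recipe: the function name `train_val_test_split`, the 80/10/10 proportions, and the fixed seed are all assumptions.

```python
import random

def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of the data, then cut it into three disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))   # stand-in for 100 labeled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))   # 80 10 10
```

The important property is that the three subsets never overlap: the test set can only give an honest performance estimate if no test example was seen while fitting parameters or tuning hyperparameters.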

Check your understanding

What's the difference between a parameter and a hyperparameter?

A parameter is learned from the data during training, like the weights and bias. A hyperparameter is something you set by hand before training starts, and training never changes it, like the learning rate or the number of layers.

3. Building a Neuron

At a high level, a neuron is a tiny decision-maker: it looks at some inputs, decides how much each one matters, and produces a single number. That's all it is. Let's build one from scratch and motivate every piece as we add it.

Suppose you want to predict one number from a handful of input numbers. Say you're guessing whether a loan should be approved, with inputs like income, credit score, and existing debt. The natural first idea is that some inputs matter more than others, so you give each one an importance and add them up. High income should push toward approval, high debt should push against it. So you assign each input a weight, a number saying how much that input counts and in which direction. A large positive weight means "this input is strong evidence for yes," a large negative weight means "strong evidence for no," and a weight near zero means "ignore this one."

Then you combine them the obvious way: multiply each input by its weight and add up the results.

$$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

Let's read every symbol, because this small formula is the atom of everything that follows. The $x_1, x_2, \ldots, x_n$ are your $n$ input numbers (the features). The $w_1, \ldots, w_n$ are their weights, one per input. You multiply each input by its weight and add up the products. The $b$ on the end is the bias, and $z$ is the result, which we'll call the pre-activation for a reason that will be clear in a minute.

The bias earns its place because, without it, when all your inputs happen to be zero the output is forced to be zero too, and that's an arbitrary limitation. The bias is a constant offset that lets the neuron shift its whole output up or down regardless of the inputs. You can think of it as the neuron's default leaning, where it sits before it has looked at any evidence. In the line $y = wx + b$ you saw in school, $w$ is the slope and $b$ is where the line crosses the vertical axis. Same $b$, same job.

That sum-of-products has a compact name and notation. If you gather the weights into a vector $\mathbf{w}$ (bold, to signal it's a whole list of numbers rather than one number) and the inputs into a vector $\mathbf{x}$, then "multiply matching entries and add them all up" is the dot product, written $\mathbf{w}^\top \mathbf{x}$. The small $\top$ means "transpose," which here is just bookkeeping that lines the shapes up so the multiplication is defined; you can read $\mathbf{w}^\top \mathbf{x}$ as "the dot product of $\mathbf{w}$ and $\mathbf{x}$." So the whole formula shortens to:

$$z = \mathbf{w}^\top \mathbf{x} + b$$

This is identical to the long version; we've only stopped writing out the sum. Get comfortable with it, because from here on, whenever you see a weight vector dotted with an input vector, your mind should translate it straight back to "weighted sum of the inputs."
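
The weighted sum is short enough to write out directly. The sketch below uses made-up, already-scaled values for the three loan inputs; the numbers and the helper name `pre_activation` are illustrative assumptions.

```python
def pre_activation(w, x, b):
    """Weighted sum z = w . x + b for a single neuron (no activation yet)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

x = [1.5, 1.0, 0.8]     # income, debt, age (illustrative, already scaled)
w = [1.2, -0.8, 0.3]    # positive weight on income, negative weight on debt
b = 0.5                 # the neuron's default leaning

z = pre_activation(w, x, b)
print(round(z, 2))      # 1.80 - 0.80 + 0.24 + 0.50 = 1.74
```

Flipping the sign of a weight flips the direction in which that input pushes $z$, and setting a weight to zero removes that input's influence entirely.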

[Interactive figure. Default settings: income x = 1.5, w = 1.2 (w·x = 1.80); debt x = 1.0, w = -0.8 (w·x = -0.80); age x = 0.8, w = 0.3 (w·x = 0.24); bias b = 0.50. Result: z = w·x + b = 1.74, with no activation applied yet.]
Fig 1.1 — A neuron playground: drag each input, its weight, and the bias, and watch the weighted sum z respond.

So now we have something that takes inputs, weighs them, sums them, and adds an offset. Is that a neuron? Almost. There is one missing ingredient, and it turns out to be the ingredient that everything depends on. It gets its own section.

Check your understanding

What does the bias let a neuron do that the weighted sum alone cannot?

It shifts the neuron's output up or down independent of the inputs, so the output isn't forced to zero when all inputs are zero. It sets the neuron's default leaning, or threshold for activating.

4. Nonlinearity: Why a Network Needs a Bend

High-level idea first: a plain weighted sum can only draw straight lines, and the world is not made of straight lines. So we add a gentle bend to each neuron, and that single change is what lets a deep stack of them represent genuinely complicated patterns. Now let's see why, by trying to build something out of just weighted sums and watching it fall short.

What we want from depth is layering: early units detect simple things, later units combine them into complex things, the way you might first notice edges, then shapes, then faces. So stack two weighted-sum units. The first takes input $x$ and produces an intermediate value. The second takes that value and produces the output. Each is just "multiply by a weight, add a bias."

Watch what happens. The first unit gives $w_1 x + b_1$. Feed that into the second: $w_2 (w_1 x + b_1) + b_2$. Multiply it out: $w_2 w_1 x + w_2 b_1 + b_2$. Look at that result. The thing multiplying $x$ is just a single number ($w_2 w_1$), and the rest is just another number ($w_2 b_1 + b_2$). So the whole two-layer contraption is exactly the same as one weighted-sum unit with weight $w_2 w_1$ and bias $w_2 b_1 + b_2$. We stacked two and got nothing more than one.

This is not a quirk of two layers. Stack a hundred and they still collapse into a single weighted sum, because a chain of straight-line operations is itself a straight line. The formal way to say it is that the composition of linear functions is linear, where "linear" means straight-line, no curves. A weighted sum can only ever split the world with a straight cut. It can never bend. And almost nothing worth predicting is separable by a straight cut.
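
The collapse is easy to check numerically. The following sketch stacks two weighted-sum units with no activation and compares the result against the single equivalent unit; the specific weights and biases are arbitrary illustrative values.

```python
def linear(w, b):
    """Return a weighted-sum unit x -> w*x + b (no activation)."""
    def unit(x):
        return w * x + b
    return unit

w1, b1 = 2.0, 1.0     # first unit (illustrative values)
w2, b2 = -3.0, 0.5    # second unit

first = linear(w1, b1)
second = linear(w2, b2)

def stacked(x):
    """Two units in a row: the second reads the first one's output."""
    return second(first(x))

# One unit with weight w2*w1 and bias w2*b1 + b2.
collapsed = linear(w2 * w1, w2 * b1 + b2)

for x in [-2.0, 0.0, 1.5]:
    print(x, stacked(x), collapsed(x))   # the last two columns agree
```

Whatever values you pick for the four numbers, the two columns match, which is the algebra above restated as an experiment.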

So we give it a bend. After the weighted sum produces $z$, we pass $z$ through one more step: a curved, nonlinear function written $\sigma$ (the Greek letter sigma), called the activation function. The neuron's output is then:

$$a = \sigma(z)$$

Here $z$ is the weighted sum, $\sigma$ is some fixed curve we apply to it, and $a$ is the neuron's final output, called the activation. That's the full neuron, start to finish: take inputs, weigh and sum them into $z$, then bend $z$ through $\sigma$ to get $a$. The weighted sum decides which inputs matter and the bias sets the threshold, but it's the bend that gives the network the ability to represent curves, corners, and the complicated shapes real patterns demand. Now when you stack neurons, each bend stays a bend, and the stack can express things a single unit never could. That is why depth buys you anything. Take the bends out and the whole tower collapses back into one straight line.

[Interactive figure: f(x) = w₂·a(w₁x + b₁) + b₂ plotted against g(x) = Wx + B, with W = w₂·w₁ = -1.68 and B = w₂·b₁ + b₂ = 2.06. With the nonlinearity off, the two curves coincide exactly.]
Fig 1.2 — Two stacked linear layers collapse into a single line — until you insert a nonlinearity between them.

So which curve do we use for $\sigma$? A few have been popular, and going through them in order is really a tour of the field correcting its own mistakes.

The old favorite is the sigmoid, $\sigma(z) = \frac{1}{1 + e^{-z}}$. The symbol $e$ is a fixed number, about $2.718$, that shows up everywhere in math because it makes calculus clean; $e^{-z}$ means $e$ raised to the power $-z$. When $z$ is a large positive number, $e^{-z}$ is nearly zero, so the fraction is nearly $1$. When $z$ is a large negative number, $e^{-z}$ is huge, so the fraction is nearly $0$. In between it slides smoothly from $0$ up to $1$ in a soft S-shape. That's its appeal: it squashes any input, however large, into the range between $0$ and $1$, which is exactly what you want if you're going to read the output as a probability. The sigmoid ruled the early decades and still earns its keep at the very end of a network when you want a single yes/no probability.

But the sigmoid has a quiet flaw that nearly stalled deep learning for good, and it's worth seeing now because it returns in the training sections. Look at the flat parts of the S. When $z$ is very positive or very negative, the curve is almost horizontal, so nudging $z$ barely changes the output. Training, as we'll see, depends on those nudges carrying a signal. When the curve goes flat, the signal dies. Stack many sigmoid layers and the signal, passing through one flat region after another, fades to nothing before it reaches the early layers, and they stop learning. This is the vanishing gradient problem, which we'll name properly once we have gradients in hand. Its cousin tanh (hyperbolic tangent) is the same S-shape squashed into the range $-1$ to $1$ instead of $0$ to $1$. Being centered on zero helps training a little, but it has the same flat-tails problem.
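
The flat tails can be seen in numbers. The sketch below relies on a standard identity that the text has not derived, namely that the sigmoid's slope at $z$ equals $\sigma(z)\,(1 - \sigma(z))$. The final line is a deliberately simplified picture that ignores the weights and only multiplies slopes together.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Slope of the sigmoid at z, using the identity s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_slope(z))   # steepest at z = 0 (slope 0.25), tiny in the tails

# Simplified view: even at its steepest, each sigmoid passes on at most a
# quarter of the signal, so ten layers shrink it by at least 0.25 ** 10.
print(0.25 ** 10)
```

The slope never exceeds 0.25, and it falls off quickly as $z$ moves away from zero, which is why a long chain of sigmoids starves the early layers of signal.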

The fix, and a big reason deep learning took off after 2012, is simple. It's called ReLU, the Rectified Linear Unit, and it's just $\sigma(z) = \max(0, z)$: if the input is positive, pass it through unchanged; if it's negative, output zero. A flat line for negatives, a 45-degree ramp for positives, with a sharp corner at zero. It looks too crude to work, yet it's the default for almost every hidden layer built today. The reason is the flat-tail problem in reverse: on the positive side ReLU has no flat region, so the learning signal passes through undamped no matter how many layers you stack. It's also very cheap to compute, which matters when you do it billions of times.

ReLU isn't perfect. A neuron can get pushed into the negative region and stay there, where its output is always zero and, since the curve is flat there too, no learning signal ever reaches it to revive it. The neuron is effectively dead. This is the dying ReLU problem. The patch is Leaky ReLU, $\max(\alpha z, z)$ with a small $\alpha$ like $0.01$, which gives the negative side a gentle slope instead of a flat zero, so a stuck neuron always has a thread of signal to climb back out on. The smooth, modern relative used in transformers like GPT is GELU, the Gaussian Error Linear Unit, which behaves like ReLU but rounds the sharp corner into a soft transition.

GELU is worth a short detour because it leans on a concept you'll meet again: the Gaussian, also called the normal distribution, the famous bell curve. The idea: take a quantity that's the sum of many small random influences, like a person's height or the noise in a measurement, and it tends to pile up symmetrically around an average, common near the middle and rare at the extremes. Plot how often each value occurs and you get the bell shape. GELU uses the bell curve's running total, the probability that a standard bell-curve draw lands below your value $z$, as a soft, smoothly increasing gate on the input. You don't need to compute it by hand; just picture a smoothed-out ReLU and you have the intuition.
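
Side by side, the activations discussed so far are each a line or two of code. This is a sketch for intuition rather than a library implementation. GELU is written here in its exact form, $z$ times the standard normal's running total, computed with the error function `math.erf`; that formula is the standard definition but is not spelled out in the text above, so treat it as an added detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negatives keep a small slope alpha instead of a flat zero."""
    return max(alpha * z, z)

def gelu(z):
    """z gated by the probability that a standard normal draw lands below z."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in [-2.0, 0.0, 2.0]:
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z), gelu(z))
```

At $z = -2$ ReLU outputs exactly zero, Leaky ReLU outputs a small negative number, and GELU outputs a small negative number that fades toward zero as $z$ becomes more negative.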

[Interactive figure: the selected activation plotted for z from -6 to 6. At z = 0.00, f(z) = 0.500 and slope f'(z) = 0.250.]
Fig 1.3 — Activation explorer: each curve with its slope (derivative). The slope vanishes in the flat tails of sigmoid and tanh.

There's one more activation that lives in a special place, the output. When you classify among several classes, you don't want one number, you want a full set of probabilities, one per class, all positive and summing to exactly $1$ (because the answer is some class, so a total probability of $1$ is split across the options). The function that turns a raw vector of scores into exactly such a distribution is softmax:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Take it piece by piece. You have $K$ classes and a raw score $z_i$ for each. The $\sum_{j=1}^{K}$ means "add up the following over every class $j$ from $1$ to $K$." So for class $i$, you exponentiate its score, $e^{z_i}$, and divide by the sum of the exponentiated scores of all classes. Exponentiating makes every number positive (no negative probabilities) and exaggerates differences (so the network can express confidence), and dividing by the total forces the whole set to sum to $1$. Out comes a clean probability distribution. Softmax sits at the end of nearly every classifier.
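
Here is softmax as a short sketch. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels from the numerator and the denominator. The example scores are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive numbers that sum to 1."""
    m = max(scores)                              # stability shift (assumption, see lead-in)
    exps = [math.exp(s - m) for s in scores]     # every entry becomes positive
    total = sum(exps)
    return [e / total for e in exps]             # dividing by the total forces sum = 1

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # [0.659, 0.242, 0.099]
print(sum(probs))                     # 1.0 up to floating-point rounding
```

Note how a score gap of 1.0 between the first two classes becomes a probability ratio of about $e \approx 2.7$: exponentiation is what exaggerates the differences.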

[Interactive figure: example class probabilities A = 0.55, B = 0.20, C = 0.12, D = 0.03, E = 0.09, with Σ p = 1.00. At T ≈ 1 this is standard softmax.]
Fig 1.4 — Softmax turns raw scores into probabilities that always sum to 1; temperature sharpens or flattens them.

Check your understanding

If you stack two linear layers with no activation between them, what do you end up with, and why does that matter?

You end up with something equivalent to a single linear layer, a plain straight-line function, so the extra depth bought you nothing. The nonlinear activation between layers is what lets a deep stack represent curved, complicated patterns instead of one straight cut.

5. From One Neuron to a Network

The high-level picture: one neuron makes one bent cut through the data, which isn't much. Put many neurons side by side into a layer, then stack layers, and you get a function flexible enough to describe almost anything.

A layer is a row of neurons that all look at the same input at the same time and each produce their own output. If $n$ inputs come in and you want $m$ neurons in the layer, each neuron has its own weight vector of length $n$ and its own bias. Rather than track $m$ separate weight vectors, we stack them as the rows of a single grid of numbers, a matrix, written $W$. So $W$ has $m$ rows (one per neuron) and $n$ columns (one per input), summarized as $W \in \mathbb{R}^{m \times n}$. (The symbol $\mathbb{R}$ means "the real numbers," ordinary numbers; $\mathbb{R}^{m \times n}$ means "an $m$-by-$n$ grid of ordinary numbers.") The biases of all $m$ neurons stack into one vector $\mathbf{b}$, and the whole layer computes:

$$\mathbf{z} = W \mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \sigma(\mathbf{z})$$

This is the single-neuron equation from before, done $m$ times in parallel and packed into matrix form. The matrix-vector product $W\mathbf{x}$ means "take the dot product of each row of $W$ with $\mathbf{x}$," which produces every neuron's weighted sum in one operation. Add the bias vector, apply $\sigma$ to every entry (that's what "element-wise" means, one bend per neuron), and you have the layer's output vector $\mathbf{a}$, one number per neuron. The reason for all the matrix machinery, instead of looping over neurons one at a time, is the tensor point from the definitions: a matrix-vector product is exactly the operation a GPU runs fastest, so writing it this way is what makes the whole thing practical to run.
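
A layer can be sketched with plain lists to show what the matrix form is doing. The explicit Python loops below are for readability only (a real implementation would hand the product to a tensor library), and the helper name `layer` is an assumption. The numbers are a small 2-neuron, 3-input example with a sigmoid activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x, b, activation):
    """Compute z = W x + b row by row, then apply the activation element-wise."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    a = [activation(z_i) for z_i in z]
    return z, a

W = [[0.5, -1.0, 2.0],     # row 1: weights of neuron 1
     [1.5, 0.5, -0.5]]     # row 2: weights of neuron 2
x = [1.0, 2.0, -1.0]
b = [0.5, -1.0]

z, a = layer(W, x, b, sigmoid)
print(z)                            # [-3.0, 2.0]
print([round(v, 2) for v in a])     # [0.05, 0.88]
```

Each entry of `z` is one row of `W` dotted with `x` plus that neuron's bias, which is exactly the single-neuron formula repeated once per row.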

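Worked example:

$$W = \begin{bmatrix} 0.50 & -1.00 & 2.00 \\ 1.50 & 0.50 & -0.50 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 1.00 \\ 2.00 \\ -1.00 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 0.50 \\ -1.00 \end{bmatrix}$$

$$\mathbf{z} = W\mathbf{x} + \mathbf{b} = \begin{bmatrix} -3.00 \\ 2.00 \end{bmatrix}, \qquad \mathbf{a} = \sigma(\mathbf{z}) = \begin{bmatrix} 0.05 \\ 0.88 \end{bmatrix}$$

$$z_1 = w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + b_1 = (0.50)(1.00) + (-1.00)(2.00) + (2.00)(-1.00) + 0.50 = 0.50 - 2.00 - 2.00 + 0.50 = -3.00$$

$$a_1 = \sigma(z_1) = \sigma(-3.00) \approx 0.05$$

Each row of $W$ is one neuron's weights.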
Fig 1.5 — A layer as a matrix–vector product: each row of W is one neuron, and the product runs every neuron at once.

To build a deep network you feed the output of one layer in as the input to the next. The first layer reads the raw features, the last layer produces the final answer, and the layers between are called hidden layers, because you never directly observe what they compute; they're the network's private scratch space for inventing intermediate features. A network with $L$ layers strung together is the composition:

$$f(x; \theta) = f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1(x)$$

The small circle $\circ$ means "compose," that is, "feed the output of the right-hand function into the left-hand one." Read right to left: $f_1$ runs on the input $x$, its output feeds $f_2$, and so on up to $f_L$, whose output is the prediction. The full bundle of parameters $\theta$ is every weight matrix and every bias vector across all the layers: $\theta = \{W^{(1)}, \mathbf{b}^{(1)}, \ldots, W^{(L)}, \mathbf{b}^{(L)}\}$. The superscripts in parentheses are just layer labels, "the weights of layer 1," not exponents.
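
Composition translates directly into a loop: run $f_1$, hand its output to $f_2$, and so on. The sketch below builds a tiny two-layer network. The layer sizes, the weight values, the choice of ReLU for the hidden layer, and the choice of no activation on the output are all illustrative assumptions, as are the helper names `dense` and `compose`.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(W, b, activation):
    """Build one layer f(x) = activation(W x + b), applied element-wise."""
    def f(x):
        return [activation(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
                for row, b_i in zip(W, b)]
    return f

def compose(layers):
    """Chain layers so the output of each feeds the next (f_1 first, f_L last)."""
    def network(x):
        for f in layers:
            x = f(x)
        return x
    return network

# theta: every weight matrix and bias vector, one (W, b) pair per layer.
theta = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),   # W(1), b(1): 2 inputs -> 2 hidden units
    ([[2.0, 1.0]], [0.5]),                       # W(2), b(2): 2 hidden units -> 1 output
]

f1 = dense(theta[0][0], theta[0][1], relu)
f2 = dense(theta[1][0], theta[1][1], identity)
net = compose([f1, f2])

print(f1([3.0, 1.0]))    # hidden activations: [2.0, 1.0]
print(net([3.0, 1.0]))   # prediction: [5.5]
```

The intermediate list printed by `f1` is what the text calls a hidden layer's output: it exists only as a stepping stone on the way to the prediction.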

It's fair to ask how much this stacking actually buys us. Are there shapes a neural network cannot represent? The reassuring answer is a result called the universal approximation theorem, which says a network with even a single hidden layer, given enough neurons and a nonlinear activation, can approximate essentially any continuous function over a bounded range of inputs to any accuracy you like. So expressiveness is not the bottleneck. Then why bother with depth, if one wide layer can in principle do anything? Because "in principle" hides a hard "in practice." A shallow network might need an enormous number of neurons to capture a pattern a deep network captures with few. Depth lets the network reuse its intermediate features, building complex ideas out of simpler ones layer by layer, and that reuse is the difference between a model that's merely possible and one you can actually train. Depth buys efficiency, not new powers.

[Interactive figure: a target curve (dashed) and its approximation S(x) = Σ blocks. With 1 block the error is 1.061.]
Fig 1.6 — Universal approximation: sum a handful of simple bent pieces to mold almost any target curve.

So we now have our function with knobs: a deep stack of weighted sums and bends, with millions of weights and biases waiting to be set. But notice what we don't have yet. A freshly built network has random knobs, so it's useless; it maps cat photos to nonsense. Nothing so far makes it learn. Learning, concretely, means adjusting those knobs until the network's outputs fit the data, and we don't yet have any procedure for doing the adjusting.

That procedure is what the next three sections build, in order, each one needed by the next. First we need a way to measure how wrong the network currently is, because you can't improve what you can't measure; that's the loss (section 6). Once we can score wrongness, we need a rule for which way to turn each knob to make that score smaller; that's gradient descent (section 7), the actual learning mechanism. And gradient descent turns out to need one ingredient it can't easily get, the slope of the loss with respect to every knob at once; computing that efficiently for millions of knobs is what backpropagation does (section 8). Loss tells us how wrong, gradient descent tells us which way to move, backprop makes the move computable. Keep that chain in mind as we go.

Check your understanding

What does each row of a layer's weight matrix $W$ represent?

One neuron's weights. Multiplying W by the input vector takes the dot product of each row with the input, which computes every neuron's weighted sum at once.

6. The Loss: Measuring How Wrong We Are

Before we can make the network better, we need a single number that says how bad it is right now. You can't improve what you can't measure. That number is the loss.

The loss function takes the network's prediction, written $\hat{y}$ (the small hat means "predicted," to set it apart from the true answer $y$), compares it to the true answer, and returns one number that's large when the prediction is far off and small when it's close. Training then has a clear target: turn the knobs to make that number as small as possible, averaged over all the training examples.

For regression, where the answer is a number, the natural measure of wrongness is the gap between prediction and truth, squared:

$$L(\hat{y}, y) = (\hat{y} - y)^2$$

The $(\hat{y} - y)$ is the error, the difference between your guess and the truth. We square it for two reasons. First, squaring removes the sign, so being off by $+5$ or by $-5$ counts the same. Second, squaring punishes big misses far more than small ones (an error of $10$ contributes $100$, an error of $2$ contributes only $4$), which pushes the network away from catastrophic mistakes. Averaged over many examples, this is mean squared error, the standard loss for regression.
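
The squared-error numbers in the paragraph above can be reproduced in a few lines. The function names and the example predictions are assumptions for illustration.

```python
def squared_error(y_hat, y):
    """Loss for one example: the gap between prediction and truth, squared."""
    return (y_hat - y) ** 2

def mean_squared_error(preds, targets):
    """Average the per-example squared errors."""
    return sum(squared_error(p, t) for p, t in zip(preds, targets)) / len(preds)

print(squared_error(12.0, 2.0))   # error of 10 contributes 100.0
print(squared_error(4.0, 2.0))    # error of 2 contributes 4.0
print(squared_error(-3.0, 2.0) == squared_error(7.0, 2.0))   # sign doesn't matter: True
print(mean_squared_error([3.0, 5.0], [1.0, 5.0]))            # (4 + 0) / 2 = 2.0
```

An error five times larger costs twenty-five times as much, which is the "punish big misses" behavior described above.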

For classification, where the network outputs a probability distribution from softmax, we want a loss that's happy when the model puts high probability on the correct class and unhappy when it confidently backs the wrong one. That loss is cross-entropy:

$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

Here's how to read it. The true label $y$ is a one-hot vector: a list with a $1$ in the slot of the correct class and $0$ everywhere else (if the answer is "class 3 of 5," then $y = [0, 0, 1, 0, 0]$). The prediction $\hat{y}$ is the probability the model assigned to each class. The sum runs over all $K$ classes, but because $y_k$ is zero for every class except the correct one, every term drops out except the single term for the right answer. So the loss reduces to $-\log(\text{probability the model gave the correct class})$. Recall what a logarithm does: $\log$ is the inverse of exponentiation, and the key fact is that the log of a number close to $1$ is close to $0$, while the log of a number close to $0$ falls toward negative infinity. So if the model gave the right answer probability $0.99$, then $-\log(0.99)$ is a tiny loss, nearly zero, good. If it gave the right answer probability $0.01$, then $-\log(0.01)$ is a large positive loss, a heavy penalty for being confidently wrong. Cross-entropy rewards the model for assigning high probability to the truth.
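
A minimal sketch of cross-entropy for one example, using the natural logarithm. The two predicted distributions are made up for illustration. Skipping the terms where the label is zero is an implementation choice: it matches the "every term drops out" argument above and avoids evaluating the log of a zero probability for a class that doesn't count anyway.

```python
import math

def cross_entropy(y_onehot, y_hat):
    """L = -sum_k y_k * log(y_hat_k); only the correct class contributes."""
    return -sum(y_k * math.log(p_k)
                for y_k, p_k in zip(y_onehot, y_hat)
                if y_k > 0)

y = [0, 0, 1, 0, 0]                                      # the answer is class 3 of 5

confident_right = [0.0025, 0.0025, 0.99, 0.0025, 0.0025]
confident_wrong = [0.90, 0.03, 0.01, 0.03, 0.03]

print(round(cross_entropy(y, confident_right), 3))   # 0.01
print(round(cross_entropy(y, confident_wrong), 3))   # 4.605
```

The loss depends only on the probability in the correct slot: how the remaining probability is spread over the wrong classes makes no difference.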

[Interactive figure, two panels. MSE: L = (ŷ − y)² = 2.25. Cross-entropy: L = −log(p) = 0.51.]
Fig 1.7 — Loss explorer: squared error punishes big misses quadratically; cross-entropy punishes confident wrong answers.

Whichever loss we use, the full training objective is to minimize its average over the entire training set:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i; \theta), y_i\big)$$

Reading it: $N$ is the number of training examples, the sum runs over all of them, $f(x_i; \theta)$ is the network's prediction on example $i$ given the current knobs $\theta$, $y_i$ is that example's true answer, and the $\frac{1}{N}$ averages the whole thing. The capital $\mathcal{L}(\theta)$ is the average loss as a function of the knobs. The job from section 1 is now precise: **find the $\theta$ that makes $\mathcal{L}(\theta)$ as small as possible.**
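
To see the objective as a function of the knobs, here is a sketch that scores two different settings of $\theta$ on the same fixed data. The one-weight, one-bias model and the three data points (generated from $y = 2x + 1$) are assumptions for illustration, and squared error stands in for the per-example loss $L$.

```python
def f(x, theta):
    """Toy model: one weight and one bias."""
    w, b = theta
    return w * x + b

def average_loss(theta, xs, ys):
    """Mean over the training set of the per-example squared error."""
    return sum((f(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]     # generated by y = 2x + 1

print(average_loss((2.0, 1.0), xs, ys))   # the right knobs: 0.0
print(average_loss((1.0, 0.0), xs, ys))   # worse knobs: (4 + 9 + 16) / 3
```

The data never changes between the two calls. Only `theta` does, and the loss reports how good each setting is.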

Check your understanding

Why does cross-entropy punish a confident wrong answer so heavily?

The loss reduces to -log(probability assigned to the correct class). As that probability approaches zero, -log of it shoots toward infinity, so the more confident the model was in the wrong answer, the larger the penalty.

7. Gradient Descent

This is the section where the network finally learns. We have a network full of random knobs and a loss that scores how wrong they are. What we're missing is the actual mechanism of learning: a rule that takes the current knobs and the current wrongness and produces better knobs. Gradient descent is that rule. It's how the network fits the data, by repeatedly nudging every knob in whatever direction lowers the loss a little, until the loss is small and the outputs match the answers.

The Cost Function, Seen as a Surface

We already met this number in the last section as the loss. (People say "loss" and "cost" almost interchangeably.)

Stop thinking of the cost as a function of the data, and start thinking of it as a function of the knobs. The training data is fixed, baked in. What's free to vary is the millions of weights and biases. So the cost is really a function that takes in one complete setting of every weight and bias and returns a single number: how badly that setting does across the data. Write it $C(\theta)$, where $\theta$ is the whole bundle of weights and biases and $C$ is the cost. Training the network means hunting through the space of possible $\theta$ for the setting that makes $C$ smallest.

Now try to picture that function, and do it by climbing the dimensions one at a time. Suppose the network had a single weight. Then $C$ depends on one number, and you can draw it as an ordinary curve: the weight along the horizontal axis, the cost along the vertical. Making the network better is just finding the bottom of the curve. Now suppose it had two weights. Then $C$ depends on two numbers, and you draw it as a surface: a landscape floating above a flat plane of the two weights, where the height at each point is the cost there, full of hills and valleys. Making the network better is rolling a ball to the lowest point of that surface. A real network has not one or two weights but millions, so the true cost surface lives in a space with millions of directions, which nobody can draw or even imagine. And here is the leap worth making: you do not need to. The reasoning that works for the curve and the surface, find the downhill direction and step that way, works exactly the same with a million directions. Everything we figure out from the 2D and 3D pictures is literally what happens up in the enormous space; there are just more directions to choose a step in.
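
The one-weight case is small enough to tabulate. The sketch below evaluates the cost at a grid of candidate weights for a toy model $\hat{y} = wx$ on made-up data generated by $y = 2x$, then picks the lowest point. Trying every value on a grid is only feasible here because there is a single knob; it is shown to make the curve tangible, not as a training method.

```python
def cost(w, xs, ys):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]     # generated by y = 2x, so the best weight is 2

grid = [i * 0.5 for i in range(-2, 11)]    # candidate weights -1.0, -0.5, ..., 5.0
for w in grid:
    print(w, round(cost(w, xs, ys), 2))    # falls toward w = 2.0, then rises again

best_w = min(grid, key=lambda w: cost(w, xs, ys))
print(best_w)                              # 2.0
```

With millions of weights, a grid like this would need more evaluations than could ever be run, which is why the rest of the section looks for the downhill direction instead of searching everywhere.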

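[Interactive figure: cost plotted against a single weight w, with the best weight marked and the current weight at w = -1.60. One weight: the cost is a curve, and training finds the bottom. A real network has millions of weights: same downhill logic, just more directions.]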
Fig 1.8 — The cost as a surface over the weights: training is the search for the lowest valley.

This reframing is what gradient descent acts on. The cost surface is the thing we want to get to the bottom of, and from here the section is really just answering one question: standing somewhere on that surface, which way is downhill, and how big a step do we take?

The high-level idea is a hike in fog. You have a number, the cost, that depends on millions of knobs, and you want it small. So you imagine standing on the surface we just described, feel which way is downhill, take a step, and repeat. Let's make that precise.

That surface is the landscape. Every setting of the knobs $\theta$ is a location on it, and the height there is the cost at that setting. Bad settings sit up on hills and ridges; good settings sit down in valleys. Training is the search for the lowest valley. You're standing somewhere on this landscape, in thick fog, and you want to get downhill. What do you do? You feel the slope under your feet and step in the steepest downhill direction. Then you feel the slope again and repeat. That's the whole method. The only thing we need to make it work is a way to feel the slope, and that's where calculus comes in.

The slope of a function is its derivative. For a function of one variable, the derivative at a point answers a single question: if I nudge the input a tiny bit, how much, and in which direction, does the output move? A positive derivative means the output rises as you move right; a negative one means it falls. The size of the derivative tells you how steep the slope is. That's all a derivative is, a rate of change, the steepness of the curve at a point.

Our loss (the cost from above; from here on we write it $\mathcal{L}$) doesn't depend on one knob, though; it depends on millions. So instead of a single slope we have a slope for each knob: if I nudge this particular weight a tiny bit and hold all the others fixed, how does the loss move? That per-knob slope is a partial derivative (partial because you vary one variable at a time and freeze the rest). Collect all those partial derivatives into one big vector, one slope per knob, and you have the gradient, written $\nabla \mathcal{L}(\theta)$. The upside-down triangle $\nabla$ is just the symbol for "gradient of." The gradient is the many-dimensional version of a slope, and it has a useful property: it points in the direction of steepest increase of the loss, the most uphill direction. Its exact opposite, the negative gradient, points most steeply downhill, which is the way we want to walk.

So the update rule follows directly. Stand at your current knobs $\theta_t$ (the subscript $t$ labels which step you're on), compute the gradient, and step in the downhill direction:

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$$

The new setting $\theta_{t+1}$ is the old one minus a small multiple of the gradient. The minus sign turns "uphill" into "downhill." And $\eta$ (the Greek letter eta) is the learning rate, the size of the step you take. This is the single most important hyperparameter you'll tune. Too small and you inch down the mountain so slowly you may never arrive. Too large and you take giant reckless leaps, overshooting the valley floor, maybe bouncing out of the valley entirely or flying off to infinity. Getting $\eta$ right, or scheduling how it changes over training, is much of the practical art. Repeat the step over and over and you descend toward a valley. The whole procedure is gradient descent, and it's the engine underneath essentially all of modern machine learning.
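To see the update rule and the effect of the learning rate in a few lines, here is a minimal sketch. The two-knob bowl-shaped loss and the names `loss`, `grad`, and `gradient_descent` are assumptions made up for illustration; a real network's loss is far messier.

```python
def loss(theta):
    """Toy bowl-shaped loss with its minimum at (0, 0). Assumed for illustration."""
    return theta[0] ** 2 + 10.0 * theta[1] ** 2


def grad(theta):
    """Gradient of the toy loss: one partial derivative per knob."""
    return [2.0 * theta[0], 20.0 * theta[1]]


def gradient_descent(theta, eta, steps):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


start = [3.0, 2.0]
good = gradient_descent(start, eta=0.05, steps=200)   # settles near the minimum
tiny = gradient_descent(start, eta=1e-5, steps=200)   # barely moves
wild = gradient_descent(start, eta=0.11, steps=200)   # overshoots and diverges

print(loss(start), loss(good), loss(tiny), loss(wild))
```

The three runs differ only in $\eta$: the same rule either converges, crawls, or blows up along the steeper of the two directions.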

One honest caveat. The loss landscape of a real network is not a single smooth bowl with one obvious bottom. It's a crinkled, high-dimensional terrain with countless valleys, some deeper than others. Gradient descent only promises to bring you to a low point near where you started, not the lowest point anywhere. For years people assumed this would be fatal. In practice, in these huge spaces, the many valleys tend to be roughly as good as each other, and the method works remarkably well anyway. We'll mostly set the worry aside.

Fig 1.9 — Gradient descent: drag the learning rate and step the ball downhill. Too large a rate overshoots and diverges.

Check your understanding

What happens if the learning rate is set too large?


The steps overshoot the bottom of the valley, so instead of settling at a minimum the parameters bounce back and forth across it or diverge entirely toward infinity.

8. Backpropagation

Gradient descent told us the rule for learning: nudge every knob opposite its slope. But it quietly assumed we already had those slopes, the gradient, the partial derivative of the loss with respect to every single knob. That assumption is the whole catch. A real network has millions or billions of knobs, and gradient descent is useless until we can actually produce that gradient. So gradient descent and backpropagation are two halves of one idea: gradient descent decides which way to move each knob, and backpropagation is what makes computing that direction, for all the knobs at once, fast enough to be possible. Without backprop, the learning rule from the last section stays a nice idea you can't run.

At a high level, backpropagation is just careful bookkeeping with the chain rule. It computes the whole gradient in one sweep backward through the network, reusing shared work so nothing is recomputed.

First, why the obvious approach fails. You could, for each weight, trace by hand how a nudge to it ripples forward through every later layer to finally move the loss, then write down that one partial derivative. But each such trace walks the whole depth of the network, and you'd repeat it for every one of billions of weights, redoing nearly all of the same intermediate work each time. The cost explodes with depth and width. It doesn't scale.

The escape rests on one idea from calculus, the chain rule, which is the rule for differentiating a function of a function. In plain terms: if a change in xx causes a change in uu, and that change in uu causes a change in yy, then the effect of xx on yy is the product of the two link-by-link effects. Sensitivities multiply along a chain. A neural network is one long chain (the input feeds layer 1, which feeds layer 2, on up to the loss), so the chain rule is exactly the right tool. It tells us we can find how the loss responds to an early weight by multiplying together the local sensitivities of each link along the way. What makes it fast is that those link sensitivities are shared across all the weights, so if we compute them in the right order and reuse them, we never redo work.

[Figure: a chain x → u → y with x = 1.00, u = 1.50, y = 2.25. Local slopes: du/dx = a = 1.50 and dy/du = 2bu = 3.00, so dy/dx = dy/du · du/dx = 3.00 · 1.50 = 4.50. A nudge Δx = 0.50 gives Δy ≈ 2.25: sensitivity is the product of local slopes.]
Fig 1.10 — The chain rule: a nudge to x is scaled by each local derivative along the path to y.
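You can check the chain rule numerically. The sketch below uses the figure's numbers and assumes, from the figure's labels, that the two links are $u = a x$ and $y = b u^2$ with $a = 1.5$ and $b = 1$; the function names are illustrative.

```python
a, b = 1.5, 1.0          # assumed constants, inferred from the figure's labels


def u_of(x):
    return a * x


def y_of(u):
    return b * u ** 2


x = 1.0
u = u_of(x)              # 1.5
du_dx = a                # local slope of the first link
dy_du = 2.0 * b * u      # local slope of the second link
dy_dx = dy_du * du_dx    # chain rule: multiply the local slopes

# Finite-difference check: nudge x a tiny bit each way and measure the change in y.
h = 1e-6
numeric = (y_of(u_of(x + h)) - y_of(u_of(x - h))) / (2.0 * h)

print(dy_dx, numeric)
```

The product of the two local slopes and the measured end-to-end slope agree, which is all the chain rule claims.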

Here is how it runs. Picture the network as a graph of operations with data flowing through it. First the forward pass: push the input through the network layer by layer, computing and remembering each layer's $\mathbf{z}$ and $\mathbf{a}$ along the way (we keep them because the backward pass will need them), until we reach the end and compute the loss. For each layer $\ell$ from first to last:

$$\mathbf{z}^{(\ell)} = W^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)})$$

This is the layer equation from section 5 applied down the stack, with $\mathbf{a}^{(0)}$ being the raw input $x$ and the final $\mathbf{a}^{(L)}$ being the prediction $\hat{y}$ (run through softmax if you're classifying). Then compute the loss.

Now the backward pass, where the real work happens. The quantity we send backward is the sensitivity of the loss to each layer's pre-activation $\mathbf{z}^{(\ell)}$. We name it $\boldsymbol{\delta}^{(\ell)}$ (the Greek letter delta) and define it as exactly that:

$$\boldsymbol{\delta}^{(\ell)} = \frac{\partial L}{\partial \mathbf{z}^{(\ell)}}$$

In words, $\boldsymbol{\delta}^{(\ell)}$ is the error signal at layer $\ell$: how much the final loss would change if you jiggled that layer's pre-activations. The reason to track this quantity is that once you have it for a layer, the gradients of that layer's actual knobs follow with almost no extra work. So the plan is: get $\boldsymbol{\delta}$ at the last layer, then pass it backward layer by layer, reading off the weight and bias gradients at each stop.

Getting it started, at the output layer, is where an earlier design choice pays off. If you used softmax with cross-entropy (the standard classification pairing), the error signal at the final layer simplifies to something very clean:

$$\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$$

Just the prediction minus the truth. This isn't luck; softmax and cross-entropy were chosen because they combine into this tidy form. No activation-slope factor survives at the output, so even a confidently wrong prediction sends back a full-sized error signal instead of a vanishing one.

To pass the error from one layer back to the previous one, we use this rule:

$$\boldsymbol{\delta}^{(\ell)} = \left( W^{(\ell+1)\top} \boldsymbol{\delta}^{(\ell+1)} \right) \odot \sigma'(\mathbf{z}^{(\ell)})$$

It looks busy but it tells a two-step story. The piece $W^{(\ell+1)\top} \boldsymbol{\delta}^{(\ell+1)}$ takes the error from the layer ahead and pushes it back through the weights that connected the two layers (the transpose $\top$ reverses the direction of flow, sending the signal backward instead of forward). That spreads the blame for the error across the neurons of the current layer, in proportion to how much each contributed. Then the $\odot \, \sigma'(\mathbf{z}^{(\ell)})$ part scales that blame by how sensitive each neuron's bend actually was. The $\odot$ symbol means element-wise multiplication (multiply matching entries, no matrix product), and $\sigma'$ is the derivative of the activation, the slope of the bend at that point. This is exactly where the vanishing gradient bites: if a neuron's activation was saturated, out on a flat tail of the sigmoid, then $\sigma'$ there is nearly zero, so it multiplies the error signal down to almost nothing, and that neuron and everything behind it get almost no signal to learn from. Use ReLU, whose slope is a healthy $1$ on the positive side, and the signal survives the trip.

Finally, the payoff. Once you have $\boldsymbol{\delta}^{(\ell)}$ for a layer, its parameter gradients are immediate:

$$\frac{\partial L}{\partial W^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \, \mathbf{a}^{(\ell-1)\top}, \qquad \frac{\partial L}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}$$

The bias gradient is just the error signal itself. The weight gradient is an outer product of the error signal with the input that came into the layer, which is a compact way of saying each individual weight's gradient is "how wrong the neuron it feeds was" times "what input rode in on that weight." A weight gets a big correction when the neuron it serves was badly wrong and the input flowing through it was large. That single idea, blame times input, is the heart of the algorithm.
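The four backward-pass equations fit in a few lines of NumPy. What follows is a sketch under stated assumptions, not a reference implementation: one sigmoid hidden layer, a softmax output with cross-entropy, a single training example, and arbitrary layer sizes. A finite-difference check at the end confirms that one entry of the weight gradient matches a direct nudge-and-measure estimate.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()


def forward(params, x):
    """Forward pass for one example; returns the stored z's and a's."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_hat = softmax(z2)
    return z1, a1, z2, y_hat


def loss_fn(params, x, y):
    y_hat = forward(params, x)[3]
    return -np.sum(y * np.log(y_hat))    # cross-entropy with a one-hot target


def backward(params, x, y):
    """Backward pass: delta at the output, then one layer back."""
    W1, b1, W2, b2 = params
    z1, a1, z2, y_hat = forward(params, x)
    delta2 = y_hat - y                             # output error signal
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)     # sigmoid'(z1) = a1 * (1 - a1)
    # Gradients in the same order as params: W1, b1, W2, b2.
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]


# Tiny network: 4 inputs -> 5 hidden -> 3 classes (sizes chosen arbitrarily).
params = [rng.normal(size=(5, 4)), rng.normal(size=5),
          rng.normal(size=(3, 5)), rng.normal(size=3)]
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])

grads = backward(params, x, y)

# Finite-difference check on one entry of W1.
h = 1e-6
W1_plus = params[0].copy()
W1_plus[2, 1] += h
W1_minus = params[0].copy()
W1_minus[2, 1] -= h
numeric = (loss_fn([W1_plus] + params[1:], x, y)
           - loss_fn([W1_minus] + params[1:], x, y)) / (2.0 * h)

print(grads[0][2, 1], numeric)
```

Note that the backward pass reuses `delta2` for every weight in the layer behind it; that reuse is the shared work the naive per-weight approach keeps redoing.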

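[Figure: a small network (in → h1 → h2 → out → ŷ). The forward pass fills in the activations; on the way back, each sigmoid slope (about 0.25) shrinks the error signal δ.]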
Fig 1.11 — Backpropagation: the forward pass fills activations; the backward pass sends the error signal back, gated by each slope.

So one training step, in full: run the forward pass and store the intermediate values; compute the loss; compute $\boldsymbol{\delta}$ at the output; walk it backward layer by layer, reading off each layer's weight and bias gradients as you go; then hand all those gradients to gradient descent, which nudges every knob a little downhill. Repeat over many batches and the network learns. You'll rarely code this by hand, because every framework computes it automatically through automatic differentiation, where each basic operation knows its own local derivative and the framework chains them together for you. But knowing what's underneath is the difference between fixing a stuck network and staring at it.

Check your understanding

Why do we compute the error signal $\boldsymbol{\delta}$ for a layer before computing that layer's weight gradients?


Because once you know delta for a layer, every weight and bias gradient in that layer follows immediately (the weight gradient is delta times the incoming activation), and the same delta is reused for all of them. Tracking delta is what lets backprop avoid recomputing shared work for each weight.

9. Optimizers

This is the densest section in the chapter, so we'll take it slowly and build up one step at a time. The good news is that every optimizer here is a small patch on the one before it. If you understand plain gradient descent from two sections back, you can understand all of these, because each one keeps that same core idea (step opposite the gradient) and only changes how the step is sized or smoothed.

Plain gradient descent, step opposite the gradient, works, but it's a little dumb, and seeing exactly how it's dumb motivates every improvement people have added. Each optimizer here fixes a specific weakness of the one before it, so read it as a sequence of repairs.

The starting point is SGD: sample a mini-batch (a small random handful of training examples, since using all of them every step is too slow), compute the gradient $g_t$ on it, and step against it. Its signature failure shows up in a ravine, a long narrow gully in the landscape, steep on the sides but only gently sloped along the floor toward the minimum. Plain SGD, following the steepest local direction, bounces back and forth between the steep walls while crawling along the gentle floor, wasting most of its motion rattling side to side. (Sampling a mini-batch each step rather than the full dataset also adds a little noise to the gradient, which is mostly fine and sometimes even helps, by jostling you out of shallow dead-end valleys. One full pass through the training data is called an epoch.)

[Figure: three optimizer paths (SGD, SGD+momentum, Adam) descending a ravine toward the minimum.]
Fig 1.12 — An optimizer race down a ravine: plain SGD zig-zags while momentum and Adam glide along the floor.

The first repair is momentum, and the picture is exactly what the word suggests. Instead of each step depending only on the current gradient, give the optimizer inertia, like a heavy ball rolling downhill. The ball builds up speed in directions where the gradient consistently points, and the back-and-forth jitter across the ravine walls cancels out because it keeps reversing. We track a running velocity $v$ and step along it:

$$v_{t+1} = \beta v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

The velocity $v_{t+1}$ blends the old velocity (scaled by $\beta$, typically 0.9, so about 90% of the previous momentum carries over) with the current gradient $g_t$. Then we step along that accumulated velocity rather than the raw gradient. Consistent directions build speed; oscillating ones average to nearly nothing. The ravine problem largely goes away.
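Here is a small sketch that races plain gradient steps against momentum on a ravine. The ravine loss $0.5\,(x^2 + 50\,y^2)$, the starting point, and the hyperparameters are all assumptions picked for illustration, and the gradient is exact rather than sampled from a mini-batch, so this isolates the ravine effect from the noise.

```python
def loss(theta):
    """Assumed ravine: gentle along x, steep along y."""
    x, y = theta
    return 0.5 * (x ** 2 + 50.0 * y ** 2)


def grad(theta):
    x, y = theta
    return (x, 50.0 * y)


def plain_steps(theta, eta, steps):
    """theta <- theta - eta * g, with no memory of past gradients."""
    for _ in range(steps):
        g = grad(theta)
        theta = (theta[0] - eta * g[0], theta[1] - eta * g[1])
    return theta


def momentum_steps(theta, eta, beta, steps):
    """v <- beta * v + g, then theta <- theta - eta * v."""
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        v = (beta * v[0] + g[0], beta * v[1] + g[1])
        theta = (theta[0] - eta * v[0], theta[1] - eta * v[1])
    return theta


start = (10.0, 1.0)
plain_end = plain_steps(start, eta=0.03, steps=200)
momentum_end = momentum_steps(start, eta=0.03, beta=0.9, steps=200)

print(loss(plain_end), loss(momentum_end))
```

With the same learning rate and step budget, the momentum run ends much closer to the minimum, because its velocity keeps building along the gentle $x$ direction.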

[Figure: SGD and momentum paths toward the minimum, with the velocity update v = β·v_old + grad and its components vx, vy shown at each step. Momentum glides along the floor.]
Fig 1.13 — Momentum builds velocity along the valley floor while the side-to-side oscillation cancels out.

A refinement called Nesterov Accelerated Gradient (NAG) adds a bit of foresight. Plain momentum measures the slope where it currently stands and then leaps. Nesterov first looks ahead to roughly where the momentum is about to carry it, measures the slope there, and uses that to correct the jump mid-flight, like a runner adjusting before the corner instead of after. On well-behaved problems it converges a little faster:

$$v_{t+1} = \beta v_t + \nabla \mathcal{L}(\theta_t - \eta \beta v_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

The only change is where the gradient is evaluated: at the looked-ahead point $\theta_t - \eta \beta v_t$ rather than at $\theta_t$.
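One Nesterov step, written out. This sketch assumes a one-dimensional loss $0.5\,x^2$, a learning rate of 0.3, and $\beta = 0.9$; these values are illustrative choices, and `nesterov_step` is a made-up name.

```python
def grad(x):
    """Gradient of the assumed loss 0.5 * x**2."""
    return x


def nesterov_step(x, v, eta=0.3, beta=0.9):
    x_look = x - eta * beta * v       # where momentum alone is about to carry us
    v_new = beta * v + grad(x_look)   # measure the slope there, not at x
    x_new = x - eta * v_new
    return x_new, v_new


x, v = -4.0, 0.0
x, v = nesterov_step(x, v)   # no velocity yet, so the look-ahead point is x itself
print(x, v)
x, v = nesterov_step(x, v)   # now the look-ahead point sits ahead of x
print(x, v)
```

On the first step there is no velocity, so Nesterov and plain momentum coincide; the look-ahead only starts to matter once some speed has built up.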

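[Figure: one Nesterov step, with the look-ahead point and the gradient measured there. Example readout: x = -4.00, v = 0.00, x_look = -4.00, grad = -4.00, v_new = -4.00, x_new = -2.80, loss = 8.00.]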
Fig 1.14 — Nesterov look-ahead: measure the gradient where momentum is about to land, then correct the step.

The next family attacks a different weakness: one global learning rate for all knobs is crude, because some knobs need large updates and others tiny ones. AdaGrad gives every parameter its own learning rate by tracking how much gradient each one has accumulated and shrinking the step for the busy ones:

$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

Here $G_t$ is a running sum of the squared gradients for each parameter, and dividing the step by $\sqrt{G_t}$ throttles parameters that have seen large gradients while keeping large steps for rarely-touched ones. (The $\epsilon$ is a microscopic constant, around $10^{-8}$, parked in the denominator only to avoid dividing by zero; you'll see this guard everywhere.) This helps with rare, sparse features, like uncommon words in text. But AdaGrad dies slowly: $G_t$ only grows, so the effective learning rate $\frac{\eta}{\sqrt{G_t}}$ marches toward zero, and eventually the model freezes and stops learning.
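To see the per-parameter rates diverge, here is a sketch that tracks the accumulator for two knobs: a "busy" one that gets a gradient every step and a "sparse" one that gets one only occasionally. The gradient values, the base rate of 0.5, and the 60-step horizon are assumptions chosen for illustration.

```python
import math

eta, eps = 0.5, 1e-8           # illustrative values, not tuned
G_busy, G_sparse = 0.0, 0.0    # running sums of squared gradients

for t in range(60):
    g_busy = 1.0                            # this knob gets a gradient every step
    g_sparse = 1.0 if t % 9 == 0 else 0.0   # this one only every ninth step
    G_busy += g_busy ** 2
    G_sparse += g_sparse ** 2

eff_busy = eta / math.sqrt(G_busy + eps)
eff_sparse = eta / math.sqrt(G_sparse + eps)

print(G_busy, eff_busy)       # large accumulator, small effective rate
print(G_sparse, eff_sparse)   # small accumulator, larger effective rate
```

Neither accumulator ever decreases, so both effective rates can only fall from here, which is the slow death described above.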

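[Figure: AdaGrad's accumulator G_t (Σ g²) and effective rate eff_t = η / √(G_t + ε) over 600 steps. G only grows, so the effective rate goes to 0 and learning freezes (which motivates RMSProp). Example readout: busy parameter G = 60, eff = 0.065; sparse parameter G = 7, eff = 0.189.]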
Fig 1.15 — AdaGrad gives each parameter its own rate, but its accumulator only grows, so the rate decays toward zero.

RMSProp fixes that death by replacing the ever-growing sum with a decaying average that gently forgets old gradients, so the accumulator can no longer grow without bound:

$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t$$

Now $v_t$ is a weighted average leaning mostly on recent squared gradients (with $\beta$ around 0.9), so it stays responsive instead of grinding to a halt. (A small piece of history: RMSProp was never formally published. Geoff Hinton described it in an online lecture and it caught on anyway.)
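A side-by-side sketch of the two accumulators makes the difference concrete. It assumes a constant gradient of 1 on every step, a base rate of 0.5, and $\beta = 0.9$, all chosen for illustration.

```python
import math

eta, beta, eps = 0.5, 0.9, 1e-8   # illustrative values
G = 0.0                           # AdaGrad: ever-growing sum
v = 0.0                           # RMSProp: decaying average

for t in range(120):
    g = 1.0                       # constant gradient, for simplicity
    G = G + g ** 2
    v = beta * v + (1.0 - beta) * g ** 2

adagrad_rate = eta / math.sqrt(G + eps)
rmsprop_rate = eta / math.sqrt(v + eps)

print(G, adagrad_rate)   # sum keeps climbing, rate keeps shrinking
print(v, rmsprop_rate)   # average levels off near g**2, rate holds steady
```

Run it for more steps and `adagrad_rate` keeps falling while `rmsprop_rate` stays put.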

[Figure: accumulator and effective rate over 120 steps for AdaGrad and RMSProp. AdaGrad's G_t grows without bound (G = 120, rate 0.046); RMSProp's v_t levels off (v = 1.01, rate 0.498), so its rate holds. Forgetting old gradients keeps RMSProp alive where AdaGrad freezes.]
Fig 1.16 — AdaGrad keeps accumulating and freezes; RMSProp forgets old gradients and stays responsive.

Combine the two good ideas, momentum's inertia and RMSProp's per-parameter scaling, and you get Adam (Adaptive Moment Estimation), the default optimizer for almost everything today:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Two running averages: $m_t$ tracks the gradient itself (the momentum, the "first moment") and $v_t$ tracks the squared gradient (the scale, the "second moment"). The $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected versions; the correction matters early in training because both averages start at zero and would otherwise be pulled toward zero for a while, and dividing by $1 - \beta^t$ undoes that startup bias (note $\beta^t$ shrinks to nothing as training proceeds, so the correction quietly switches itself off). The final step uses the momentum direction $\hat{m}_t$ scaled per-parameter by $\sqrt{\hat{v}_t}$. Typical settings are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and they work across a wide range of problems with little tuning, which is why Adam is everywhere.
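The four Adam equations translate almost line for line into NumPy. This is a sketch, with `adam_step` a made-up name and the starting values and gradients invented for the demonstration. It shows a handy consequence of the bias correction: on the very first step, every knob moves by roughly $\eta$ in the downhill direction, whether its gradient is small or huge.

```python
import numpy as np


def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t counts steps starting from 1."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment: smoothed gradient
    v = beta2 * v + (1.0 - beta2) * g ** 2     # second moment: smoothed squared gradient
    m_hat = m / (1.0 - beta1 ** t)             # undo the pull toward the zero start
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


theta = np.array([1.0, -2.0])
g = np.array([0.5, -300.0])        # wildly different gradient scales (made up)
m = np.zeros_like(theta)
v = np.zeros_like(theta)

new_theta, m, v = adam_step(theta, g, m, v, t=1)
print(new_theta - theta)           # both knobs move by about eta, opposite their gradient
```

Without the bias correction, the first step would instead be scaled by $(1 - \beta_1)/\sqrt{1 - \beta_2}$, which is why the correction is worth its two extra lines.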

[Figure: Adam over steps t = 0 to 59. First moment m_t, the momentum part (which direction, smoothed), shown raw (starting near 0) and bias-corrected; second moment v_t, the RMSProp part (how big the gradients have been); and the resulting Adam step. Adam isn't a new idea: it's the two previous ideas running at once.]
Fig 1.17 — Adam = momentum (first moment) + RMSProp (second moment). Toggle each ingredient on and off.

Adam has one subtle issue with weight decay (a regularization technique from the next section), where the decay gets unintentionally scaled by the per-parameter learning rate. The fix is AdamW, which separates the weight decay out and applies it cleanly:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

The extra $\lambda \theta_t$ shrinks every weight a little on each step, untouched by the adaptive scaling. AdamW is the standard for training large modern models.
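A sketch of the AdamW step, following the equation above. The function name `adamw_step` and the default values (including $\lambda = 0.01$) are assumptions for illustration. The demonstration feeds in a zero loss gradient so that only the decay term acts, which makes the decoupling visible.

```python
import numpy as np


def adamw_step(theta, g, m, v, t, eta=1e-3, lam=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update: the Adam step plus a decay term outside the adaptive scaling."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v


theta = np.array([2.0, -4.0])
zeros = np.zeros_like(theta)

# With a zero loss gradient, only the decay acts: theta shrinks by (1 - eta * lam).
decayed, _, _ = adamw_step(theta, zeros, zeros, zeros, t=1)
print(decayed / theta)
```

Every weight shrinks by the same factor regardless of its gradient history, which is exactly what "untouched by the adaptive scaling" means.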

One more lever, separate from the choice of optimizer: the learning rate doesn't have to stay fixed. You usually want bold steps early, when you're far from any good solution, and small steps later, when you're closing in on a valley floor and a big step would overshoot. A learning rate schedule varies $\eta$ over training. You can drop it by a factor every so often (step decay), shrink it smoothly (exponential decay), or ease it down along a cosine curve from a high value to a low one (cosine annealing), a common choice for transformers:

$$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\tfrac{t}{T}\pi\right)\right)$$

Here $t$ is the current step and $T$ is the total number of steps, so as $t$ runs from $0$ to $T$ the cosine sweeps the rate smoothly from $\eta_{\max}$ down to $\eta_{\min}$. There's also a counterintuitive move at the very start called warmup, where you ramp the rate up from nearly zero over the first few thousand steps before letting it decay, because the adaptive averages are unreliable at the very beginning and a big early step can throw them off. Warmup followed by cosine decay is a common recipe.
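The cosine formula and a warmup ramp can be sketched as two small functions. The linear shape of the warmup, the way the cosine is restarted after it, and the specific numbers in the usage lines are assumptions; frameworks offer several variants.

```python
import math


def cosine_lr(t, T, eta_max, eta_min):
    """Cosine annealing from eta_max at t = 0 down to eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


def warmup_cosine_lr(t, T, eta_max, eta_min, warmup):
    """Linear ramp for the first `warmup` steps, then cosine decay over the rest."""
    if t < warmup:
        return eta_max * (t + 1) / warmup
    return cosine_lr(t - warmup, T - warmup, eta_max, eta_min)


T = 300
for t in (0, 15, 30, 165, 300):
    print(t, warmup_cosine_lr(t, T, eta_max=1e-3, eta_min=1e-5, warmup=30))
```

The rate climbs to its peak by the end of warmup and then eases down to the floor by the final step.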

[Figure: learning rate η against step t, from 0 to T = 300, moving between η_max and η_min.]

Fig 1.18 — Learning-rate schedules shape the step size over training: bold early, gentle late.

Check your understanding

What problem does momentum solve compared to plain SGD?


In a narrow ravine, plain SGD zig-zags across the steep walls and crawls along the floor. Momentum builds up velocity in the consistent downhill direction while the side-to-side oscillation cancels itself out, so it moves much faster toward the minimum.

10. Regularization

The big idea: stop the model from memorizing. Recall the villain from the definitions, overfitting, where the model latches onto the noise in the training data and falls apart on anything new. Everything in this section works against it. The umbrella term is regularization: any technique that shrinks the gap between training performance and unseen-data performance, usually by discouraging the model from getting too complicated or too sure of itself.

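[Figure: left, a degree-3 polynomial fit (labeled balanced) against the true curve (dashed); right, training and validation error against polynomial degree d (example readout: train 0.121, val 0.168). Low d underfits, high d overfits, and λ tames the wiggle.]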
Fig 1.19 — Underfitting vs overfitting: training error keeps falling while validation error bottoms out and turns back up.

The most direct approach is to penalize complexity. An overfit model often does its overfitting by cranking some weights to extreme values to thread the needle through every noisy point. So we add a term to the loss that grows whenever the weights get large, nudging the model to keep them modest. L2 regularization (also called weight decay) adds $\lambda \|\theta\|^2$ to the loss, where $\|\theta\|^2$ is the sum of the squares of all the weights (a measure of their overall size) and $\lambda$ is a hyperparameter setting how hard you push. The result is a preference for smaller, smoother solutions that don't lurch around to chase noise. A close relative, L1 regularization, adds $\lambda \|\theta\|_1$ (the sum of absolute values instead of squares), which tends to drive many weights to exactly zero, in effect letting the model ignore some features entirely.
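The two penalties pull on the weights in different ways, and a few lines make that visible. In this sketch the weights, $\lambda$, and step size are made up, and the step is taken on the penalty alone (ignoring the data loss) to isolate its effect.

```python
import numpy as np


def l2_penalty(theta, lam):
    return lam * np.sum(theta ** 2)


def l2_grad(theta, lam):
    return 2.0 * lam * theta


def l1_penalty(theta, lam):
    return lam * np.sum(np.abs(theta))


def l1_grad(theta, lam):
    return lam * np.sign(theta)   # a subgradient; taken as 0 where a weight is exactly 0


theta = np.array([3.0, -0.5, 0.0])
lam, eta = 0.1, 0.5

# One descent step on the penalty alone.
after_l2 = theta - eta * l2_grad(theta, lam)   # every weight shrinks by the same factor
after_l1 = theta - eta * l1_grad(theta, lam)   # every nonzero weight moves the same distance

print(after_l2, after_l1)
```

L2's pull weakens as a weight gets small, so weights shrink but rarely reach zero; L1's pull stays constant, which is why it can push small weights all the way to zero.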

A different and surprisingly effective idea is dropout. During training, on each step, you randomly switch off a fraction of the neurons (say 10% to 50%), forcing the network to cope without them. Because any given neuron might vanish at any moment, the network can't build fragile arrangements that lean on one specific neuron being present; it has to spread its knowledge redundantly across many neurons, which is exactly the robustness that generalizes. At test time you switch all the neurons back on and keep the benefit. A rescaling keeps the expected activations consistent between the two modes; it can be applied at test time or, in the common "inverted dropout" variant, during training.
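A minimal dropout sketch in NumPy. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; that choice, the function name, and the drop rate are assumptions for illustration.

```python
import numpy as np


def dropout(a, p, training, rng):
    """Inverted dropout: zero each activation with probability p while training."""
    if not training or p == 0.0:
        return a                              # test time: every neuron stays on
    keep = rng.random(a.shape) >= p           # True for the neurons that survive
    return a * keep / (1.0 - p)               # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
a = np.ones((1000, 50))

out_train = dropout(a, p=0.5, training=True, rng=rng)
out_test = dropout(a, p=0.5, training=False, rng=rng)

print(out_train.mean(), out_test.mean())      # both close to 1 on average
```

Each training call draws a fresh mask, so the network effectively trains a different thinned-out version of itself on every step.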

There's also the simplest effective move, early stopping: watch the validation loss as you train, and the moment it stops improving and starts creeping back up, stop. That upward creep is overfitting beginning, so you quit while you're ahead. It costs nothing.
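In practice validation loss is noisy, so a common refinement is to wait a few epochs before giving up. The sketch below adds such a `patience` setting, which is an assumption layered on top of the description above; the loss values are invented.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Stops once `patience` epochs have passed without beating the best loss so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, val in enumerate(val_losses):
        if val < best_loss:
            best_loss, best_epoch = val, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Validation loss falls, bottoms out at epoch 2, then creeps back up.
history = [1.00, 0.80, 0.70, 0.72, 0.75, 0.90]
best_epoch, stop_epoch = early_stopping(history, patience=2)
print(best_epoch, stop_epoch)
```

You would normally save a copy of the weights at each new best epoch and restore that copy when the loop stops.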

If you can't get more data, you can manufacture more with data augmentation: apply changes to your training examples that don't change the answer. Flip, rotate, crop, or recolor an image and it's still the same cat, but to the network it's a fresh example, and training on these variants teaches it to ignore irrelevant details. The text equivalent is swapping in synonyms or translating a sentence back and forth.

A few subtler regularizers round out the toolkit. Label smoothing softens the targets so the model aims for, say, 90% confidence on the right answer rather than a brittle 100%, which keeps it from getting overconfident. Mixup and CutMix train on blended combinations of pairs of examples and their labels, a strong regularizer for images. And stochastic depth randomly drops entire layers during training in very deep networks, the dropout idea scaled up from neurons to whole layers.

Check your understanding

During training you notice the training loss is still falling while the validation loss has started rising. What's happening, and what can you do about it?


That's overfitting: the model is memorizing the training set instead of learning a general pattern. You can add regularization (L2 or dropout), get more data, shrink the model, or stop training earlier (early stopping).

11. Normalization and Initialization

Two practical concerns can sink a network before it learns anything, and both come down to keeping the numbers flowing through it in a sensible range. The high-level point: if activations or gradients drift too large or too small, training becomes unstable or stalls, so we manage their scale deliberately.

The first concern is normalization. As data passes through layer after layer, the scale of the numbers can drift, ballooning in some layers and shrinking in others, which makes training erratic and slow. Normalization layers fix this by rescaling the activations back to a steady distribution at each step, then letting the network learn to shift and scale them as it sees fit. The best-known is batch normalization, which for each feature normalizes across the current mini-batch to zero average and unit spread:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Reading it: $\mu_B$ is the mean (the average) of that feature over the batch, and $\sigma_B^2$ is its variance (a measure of how spread out the values are, the average squared distance from the mean). Subtracting the mean re-centers the values on zero, and dividing by the square root of the variance (the standard deviation) rescales them to a standard spread. The $\gamma$ and $\beta$ are learnable knobs that let the network stretch and shift the normalized result if a different scale serves it better, so we keep full expressiveness while gaining stability. Batch norm speeds up training of convolutional networks a lot, though it behaves differently during training (where it uses the live batch statistics) than at test time (where it uses running averages collected during training), and it needs a reasonably large batch to estimate those statistics well.
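The training-time half of batch norm is short enough to write out. This sketch covers only the live-batch computation (the running averages used at test time are left out), and the batch shape, input scale, and `eps` value are assumptions for illustration.

```python
import numpy as np


def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) across the batch (rows), then scale and shift."""
    mu = x.mean(axis=0)                        # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 examples, 4 features, off-center

y = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))               # roughly 0 and 1 for every feature
```

With `gamma` at 1 and `beta` at 0 the output is the plain normalized value; training then moves those two knobs wherever the network finds useful.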

That batch-size dependence is why layer normalization exists: it normalizes across the features within a single example rather than across the batch, so it doesn't care how big your batch is, which suits transformers, where it's standard. A leaner version called RMSNorm skips the mean-subtraction and only rescales, which is cheaper and used in many recent large models. There are further variants (group normalization for small batches, instance normalization for style transfer), all the same idea applied to a different slice of the data.
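The only change from batch norm is the axis: statistics are taken across the features of each example, so a batch of one works fine. This sketch leaves out the learnable scale and shift for brevity, and the `eps` value and input are assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each example (row) across its own features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def rms_norm(x, eps=1e-5):
    """Rescale each example by its root-mean-square, with no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms


rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=4.0, size=(1, 8))   # a "batch" of a single example

ln = layer_norm(x)
rn = rms_norm(x)
print(ln.mean(), ln.std())          # roughly 0 and 1
print(np.mean(rn ** 2))             # roughly 1; the mean is not forced to 0
```

RMSNorm does one less pass over the data than layer norm, which is where its savings come from.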

The second concern is initialization: where the knobs start before training. This matters more than you'd guess, because a bad starting point can make activations explode or vanish before the first gradient step lands. The clear mistake is setting all weights to zero: if every neuron in a layer starts identical, they all receive identical gradients and update identically forever, so they never differentiate into doing different jobs. The symmetry is never broken and a wide layer collapses into the behavior of a single neuron. The fix is random initialization, but the scale of the randomness has to match the size of the layer, because too small and the signal shrinks layer by layer toward nothing, too large and it explodes. Two recipes solve this by tuning the variance of the random weights: Xavier (Glorot) initialization uses variance $2/(n_\text{in} + n_\text{out})$ and is tuned for tanh-style activations, while He initialization uses variance $2/n_\text{in}$ and is tuned for ReLU (which, by zeroing the negatives, halves the variance and so needs a compensating boost). For ReLU networks, which is most of them, He initialization is standard. ($n_\text{in}$ and $n_\text{out}$ are simply the number of inputs and outputs of the layer.)
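You can watch the signal survive, vanish, or explode by pushing random inputs through a stack of ReLU layers. This is a sketch: the width, depth, input batch, and the two "bad" standard deviations are assumptions chosen to make the effect obvious, and biases are left at zero.

```python
import numpy as np


def final_rms(weight_std_fn, width=256, depth=20, seed=0):
    """Push random inputs through `depth` ReLU layers and return the output's RMS size."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(width, 128))               # 128 random inputs, one per column
    for _ in range(depth):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        a = np.maximum(0.0, W @ a)                  # ReLU layer, biases left at zero
    return float(np.sqrt(np.mean(a ** 2)))


he = final_rms(lambda n: np.sqrt(2.0 / n))   # He: variance 2 / n_in
tiny = final_rms(lambda n: 0.01)             # too small: signal shrinks every layer
big = final_rms(lambda n: 1.0)               # too large: signal grows every layer

print(he, tiny, big)
```

Only the He-scaled stack hands the last layer numbers of roughly the size that went in; the other two have drifted by many orders of magnitude before training even starts.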

Check your understanding

Why is initializing all the weights to zero a bad idea?


Every neuron in a layer would compute the same thing and receive the same gradient, so they would update identically and never become different from one another. The symmetry is never broken, and the whole layer behaves like a single neuron.

12. Diagnostics

A trained sense for what a training curve means is one of the most useful skills in deep learning, and it separates people who can fix a broken model from people who just re-run it and hope. The loss curve is the network's vital sign. Here is what the common patterns are telling you.

Loss not decreasing at all
Learning rate too low, dead activations, a bug in the loss function, labels misaligned with inputs, or gradients not flowing (check for dead ReLUs, saturated sigmoid or tanh units, or missing skip connections).
Loss explodes to NaN
Learning rate too high, exploding gradients, numerical instability in the loss (e.g. $\log(0)$), or bad initialization. Apply gradient clipping, lower the learning rate, and check for division by zero.
Training loss decreases but validation loss increases
Classic overfitting. Add regularization, get more data, reduce model size, or stop earlier.
Training and validation both plateau high
Underfitting. Use a bigger model, better features, more training, or less regularization.
Loss oscillates wildly
Learning rate too high, or batch size too small.
Loss decreases then suddenly spikes
Often a single bad batch or numerical instability. Inspect the outlier examples.

A few sanity checks are worth running first, because they catch most catastrophic bugs in minutes:

  • Can the model overfit a single batch? If it can't drive the loss to nearly zero on five examples, something is fundamentally broken and no tuning will save it.
  • Does the initial loss match the expected value for random predictions (e.g. $\log K$ for $K$-class cross-entropy)? If it's wildly off, your loss or labels are wrong.
  • Are the gradients a reasonable magnitude across all layers, rather than vanishing in the early ones?
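The second check is easy to compute by hand. A sketch, assuming a 10-class problem and a freshly initialized model that spreads its probability evenly over the classes (both assumptions for illustration):

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss for one example: minus the log-probability of the true class."""
    return -math.log(probs[label])


K = 10
uniform = [1.0 / K] * K                 # an untrained model's best guess: no preference
initial_loss = cross_entropy(uniform, label=3)

print(initial_loss, math.log(K))        # these should match
```

If your real model's first-batch loss is far from this value, look at the loss code and the labels before touching any hyperparameter.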

That's the foundation. A function with knobs, a loss that measures wrongness, a gradient that points downhill, and backpropagation to compute it efficiently. The architectures get fancier, but the spine stays the same.

Next, we look at the advanced math behind these networks.