ML Bible

ML Bible · Chapter 2

Math of Neural Networks

The calculus underneath Chapter 1 — gradients, Jacobians, the chain rule, and the four equations of backpropagation, derived from scratch.

1. What This Chapter Is For

In Chapter 1 we built the whole picture of how a network learns, and we stated the rules without proving them. We said the error at the output is the prediction minus the truth, that it propagates backward through the transposed weights, that the weight gradient is an outer product of an error signal and an activation. We used those facts. We never earned them.

This chapter earns them. Everything here is the calculus underneath Chapter 1, derived step by step so that by the end the four equations of backpropagation are not magic words you memorize but results you could rederive on a napkin. We are not going to re-explain what a neuron is or why gradient descent works; you have that. We are going deeper into the math of how.

The plan, in order. First we fix the notation, which matters more here than in any other subject because of the swarm of indices. Then we review the two pieces of calculus we lean on: gradients and Jacobians, and the chain rule that ties them together. Then we write forward propagation as clean math. Then we differentiate it the slow, honest way for a tiny network, watch a pattern appear, and watch that slow way fail to scale. Finally we fix the scaling problem by naming one reusable quantity and propagating it backward, which is backpropagation, and we derive its four equations and prove they match the slow way.

A quick word on what you need. From Chapter 1 you already have derivatives (a slope, a rate of change), partial derivatives (the slope with respect to one variable while the rest are frozen), the gradient (all the partials stacked into a vector), the dot product, and the basic shape of a matrix. We will build on those rather than restate them. The genuinely new piece of machinery is the Jacobian, and we will take that one slowly.

Check your understanding

What is the goal of this chapter, as opposed to Chapter 1?

Show answer ▸

Chapter 1 gave the intuition and stated the rules of forward propagation and backpropagation. This chapter derives those rules from calculus, so the four backprop equations become results you can prove rather than facts you accept.

2. Notation, Set Up Carefully

Most confusion in this subject is not confusion about ideas. It is misreading an index. So before any derivation, here is the bookkeeping, and it is worth fixing in your head now.

We use four typographic conventions. A lowercase italic letter like xx, ww, or bb is a single number, a scalar. A lowercase bold letter like x\mathbf{x}, w\mathbf{w}, or b\mathbf{b} is a vector, an ordered list of numbers, which we always treat as a column unless we say otherwise. An uppercase italic letter like WW is a matrix, a grid of numbers. And a superscript in parentheses, like w()w^{(\ell)} or a()\mathbf{a}^{(\ell)}, is a layer label, not an exponent; a()\mathbf{a}^{(\ell)} means "the activations of layer \ell," and the parentheses are there precisely so you never mistake it for raising something to the power \ell.

Subscripts pick out a component. The one to internalize is the weight index. We write

wjk()w^{(\ell)}_{jk}

for the weight in layer \ell on the connection going into the jj-th neuron of layer \ell, from the kk-th neuron of layer 1\ell-1. Read the order out loud: destination first (jj), source second (kk). This feels backward, because when you draw an arrow you naturally think source-then-destination. There is a concrete payoff for the inversion, and we will cash it in two sections from now: it is exactly what makes the layer's matrix multiplication work with no stray transposes.

Two more symbols carry the whole story. We write zj()z^{(\ell)}_j for the weighted input to neuron jj in layer \ell, the value before the activation function, and aj()=σ(zj())a^{(\ell)}_j = \sigma(z^{(\ell)}_j) for the activation, the value after it. The cost is CC. And one quantity we will define carefully later, the error of a neuron, is

δj()=Czj().\delta^{(\ell)}_j = \frac{\partial C}{\partial z^{(\ell)}_j}.

You do not need to understand that last line yet. Just register that δ\delta (delta) lives at the heart of backpropagation and is defined as a partial derivative of the cost with respect to a neuron's weighted input. Keep this notation nearby. When something looks impenetrable later, nine times out of ten it is a misread index, not a hard idea.

layer ℓ−1layer ℓlayer ℓ+101201201w(ℓ)02into neuron 0 of layer ℓ, from neuron 2 of layer ℓ−1
selected: w(ℓ)_02 — dest j=0, src k=2
Fig 2.1 — Weight-index decoder: click a connection to read w⁽ˡ⁾₍ⱼₖ₎ as 'into neuron j of layer ℓ, from neuron k of layer ℓ−1.'

Check your understanding

In wjk()w^{(\ell)}_{jk}, which index is the destination neuron and which is the source?

Show answer ▸

j is the destination (the neuron in layer l that the connection feeds into), and k is the source (the neuron in layer l-1 the connection comes from). Destination first, source second.

3. Gradients and Jacobians

Start with what you know and stretch it by one step. A gradient is the object you get when a function takes in a vector and returns a single number. Take f:RnRf : \mathbb{R}^n \to \mathbb{R}, read as "a function from nn-dimensional vectors to single numbers" (the symbol R\mathbb{R} means the ordinary real numbers, and Rn\mathbb{R}^n means a list of nn of them). For example f(x)=x12+x22+x32f(\mathbf{x}) = x_1^2 + x_2^2 + x_3^2. You can take a partial derivative with respect to each input, and the gradient simply stacks them into a column:

f(x)=[fx1fx2fxn].\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The \nabla symbol (read "del" or "gradient of") just means "collect all the partials." For the example, f=[2x1,2x2,2x3]\nabla f = [2x_1, 2x_2, 2x_3]^\top, where the small \top ("transpose") is flipping the row I wrote on the page into the column it should be. We met the geometric meaning in Chapter 1: the gradient points in the direction of steepest ascent, and the negative gradient points downhill, which is why we walk against it. We will treat the gradient as a column throughout, because it makes the Jacobian conventions line up cleanly, which is the new idea we turn to now.

What if the function returns not one number but several? That is the common case in a network: a layer eats a vector and produces a vector. Take f:RnRm\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m, a vector in and a vector out, so f(x)\mathbf{f}(\mathbf{x}) has mm output components, each depending on all nn inputs. Stack the gradient of each output as a row and you get the Jacobian matrix:

J=fx=[f1x1f1x2f1xnf2x1f2x2f2xnfmx1fmx2fmxn].J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.

There is one rule to hold onto, and it governs everything: rows are outputs, columns are inputs. Entry (i,j)(i, j) of the Jacobian is fixj\frac{\partial f_i}{\partial x_j}, the sensitivity of output ii to input jj. So row ii is the gradient of the ii-th output, and the matrix is mm rows tall (one per output) by nn columns wide (one per input). That shape, output-by-input, is what makes the dimensions click together when we chain things in the next section.

The gradient is just a Jacobian in disguise. If m=1m = 1, meaning the function returns a single number, the Jacobian collapses to a single 1×n1 \times n row. Transpose it and you have the column-vector gradient from before. So gradients and Jacobians are the same animal; the gradient is the Jacobian of a function whose output happens to be one number.

One special shape shows up constantly in networks and is worth memorizing, because it makes the algebra later evaporate. Suppose f\mathbf{f} acts elementwise, meaning each output depends only on the input in the same position: fi(x)=g(xi)f_i(\mathbf{x}) = g(x_i) for some single-variable function gg. Activation functions are exactly this; σ\sigma hits each entry on its own. Then output ii does not depend on input jj at all when iji \neq j, so every off-diagonal partial is zero, and the Jacobian is diagonal, carrying g(xi)g'(x_i) down the diagonal:

fx=diag(g(x1),g(x2),,g(xn)).\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \mathrm{diag}\big(g'(x_1), g'(x_2), \ldots, g'(x_n)\big).

A diagonal matrix is one whose only nonzero entries sit on the top-left-to-bottom-right diagonal. Why care? Because a diagonal Jacobian, when it multiplies a vector inside a chain rule, behaves exactly like multiplying entry by entry. If J=diag(v)J = \mathrm{diag}(\mathbf{v}) and w\mathbf{w} is any vector, then Jw=vwJ\mathbf{w} = \mathbf{v} \odot \mathbf{w}, where \odot is the Hadamard product, plain element-wise multiplication. That single fact is what turns the intimidating matrix expressions in backprop into the friendly σ(z)\odot \, \sigma'(\mathbf{z}) you saw in Chapter 1.

A concrete worked example to make the rows-are-outputs rule stick. Let f(x)=[x1x2,  x1+x2]\mathbf{f}(\mathbf{x}) = [x_1 x_2, \; x_1 + x_2]^\top, with two inputs and two outputs. Differentiate each output by each input:

J=[x2x111].J = \begin{bmatrix} x_2 & x_1 \\ 1 & 1 \end{bmatrix}.

Row 1 is the gradient of f1=x1x2f_1 = x_1 x_2 (its partials are x2x_2 and x1x_1), and row 2 is the gradient of f2=x1+x2f_2 = x_1 + x_2 (its partials are 11 and 11). Rows are outputs, columns are inputs. Always.

J = ∂f / ∂x (rows = outputs, cols = inputs)x1x2x3inputs x_j → columnsf1 = σ(x1)output (row)f2 = σ(x2)output (row)f3 = σ(x3)output (row)σ'(x1)000σ'(x2)000σ'(x3)f(x)f1f2f3diagonal Jacobian × vector = element-wise (Hadamard) product
hover a cell to trace its output row & input column
Fig 2.2 — Jacobian builder: rows are outputs, columns are inputs. For an elementwise function the off-diagonal entries vanish, leaving a diagonal.

Check your understanding

Why is the Jacobian of an elementwise function (like an activation) diagonal?

Show answer ▸

Because each output depends only on the input in the same position, so the partial of output i with respect to input j is zero whenever i and j differ. Only the same-position partials survive, and they sit on the diagonal as g'(x_i).

4. The Chain Rule, as a Sum Over Paths

You know the basic chain rule from Chapter 1: if y=f(u)y = f(u) and u=g(x)u = g(x), then dydx=dydududx\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}. Sensitivities multiply along a chain. Before we lift this to vectors, there is a slightly richer version that makes the vector case feel inevitable instead of surprising.

Suppose yy depends on xx through two intermediate variables, both of which are functions of xx: y=f(u1(x),u2(x))y = f\big(u_1(x), u_2(x)\big). Now xx has two separate routes to influence yy, one through u1u_1 and one through u2u_2. The total derivative adds up both routes:

dydx=yu1du1dx+yu2du2dx.\frac{dy}{dx} = \frac{\partial y}{\partial u_1}\frac{du_1}{dx} + \frac{\partial y}{\partial u_2}\frac{du_2}{dx}.

Read the structure, not just the symbols. Each term is one path from xx to yy: travel from xx to u1u_1 (that is du1dx\frac{du_1}{dx}), then from u1u_1 to yy (that is yu1\frac{\partial y}{\partial u_1}), and multiply the two sensitivities along that path. Do the same for the path through u2u_2. Then sum over all the paths. This "multiply along a path, sum over paths" idea is the entire content of the multivariable chain rule, and once it is in your head the matrix version below is just bookkeeping for doing many paths at once.

This matters for networks specifically because a single neuron's output usually fans out to every neuron in the next layer. So when we ask how wiggling one neuron changes the cost, there is not one path back to the cost, there are many, one through each downstream neuron, and we will be summing over exactly those paths. That sum is where the matrix transpose in backprop comes from. Keep the picture handy.

du1/dx= 2du2/dx= -1∂y/∂u1= 3∂y/∂u2= 4xu1u2ydy/dx = (dy/du1)(du1/dx) + (dy/du2)(du2/dx) = 6 + (-4) = 2total sensitivity = sum of per-path products
x reaches y by two routes — dy/dx sums them, just like a neuron feeding many downstream neurons.
Fig 2.3 — Sum over paths: when x reaches y by several routes, the total derivative adds the product of edge derivatives along each path.

Check your understanding

If xx influences yy through two intermediate variables, how do you combine the two routes?

Show answer ▸

Multiply the sensitivities along each path from x to y, then add the path totals together. Two routes means two products summed. This sum-over-paths is the multivariable chain rule.

5. The Jacobian Chain Rule

Now the generalization that runs the whole machine. Suppose y=f(u)\mathbf{y} = \mathbf{f}(\mathbf{u}) and u=g(x)\mathbf{u} = \mathbf{g}(\mathbf{x}), so a vector x\mathbf{x} produces a vector u\mathbf{u}, which produces a vector y\mathbf{y}. The chain rule says: multiply the Jacobians.

yx=yuux.\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}}.

The only thing to check is that the shapes fit, and they do, by design. Say xRn\mathbf{x} \in \mathbb{R}^n, uRk\mathbf{u} \in \mathbb{R}^k, and yRm\mathbf{y} \in \mathbb{R}^m. The first Jacobian yu\frac{\partial \mathbf{y}}{\partial \mathbf{u}} has rows for y\mathbf{y} and columns for u\mathbf{u}, so it is m×km \times k. The second, ux\frac{\partial \mathbf{u}}{\partial \mathbf{x}}, is k×nk \times n. A matrix product is allowed exactly when the inner dimensions match, and they do (both are kk), and the result is m×nm \times n, which is precisely the shape yx\frac{\partial \mathbf{y}}{\partial \mathbf{x}} should have, with rows for y\mathbf{y} and columns for x\mathbf{x}. The output-by-input convention from the last section is what guarantees this lines up.

Write out a single entry of that product and the sum-over-paths idea from the previous section walks right back in:

yixj=p=1kyiupupxj.\frac{\partial y_i}{\partial x_j} = \sum_{p=1}^{k} \frac{\partial y_i}{\partial u_p} \frac{\partial u_p}{\partial x_j}.

Each intermediate component upu_p is one path from input xjx_j to output yiy_i, the term upxj\frac{\partial u_p}{\partial x_j} is the first leg and yiup\frac{\partial y_i}{\partial u_p} is the second, and we sum over all kk paths. Matrix multiplication is quite literally an automated way of doing sum-over-paths for every input-output pair at once.

For a deeper stack you just keep multiplying. If y=f(g(h(x)))\mathbf{y} = \mathbf{f}(\mathbf{g}(\mathbf{h}(\mathbf{x}))), then

yx=ygghhx.\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{g}} \, \frac{\partial \mathbf{g}}{\partial \mathbf{h}} \, \frac{\partial \mathbf{h}}{\partial \mathbf{x}}.

That is, in one line, what backpropagation does. A network is a deep composition of functions, and computing how the cost depends on an early weight means multiplying a string of Jacobians together as you travel from the output back to that weight. Everything from here is figuring out what those particular Jacobians are and how to multiply them in an order that avoids redoing work.

∂y/∂um × k∂u/∂xk × n·=∂y/∂xm × ninner k must match → cancelsouter m, n survive into ∂y/∂x[m×k] · [k×n] = [m×n]a network is a deep composition — backprop multiplies thischain of Jacobians from output back to input.
Fig 2.4 — The Jacobian chain rule as shapes: the inner dimensions must match and cancel, leaving an m×n result.

Check your understanding

When you chain yu\frac{\partial \mathbf{y}}{\partial \mathbf{u}} (size m×km \times k) with ux\frac{\partial \mathbf{u}}{\partial \mathbf{x}} (size k×nk \times n), what size is the result and why?

Show answer ▸

It is m-by-n. The inner dimensions (both k) match and cancel in the matrix product, leaving rows from the first matrix (m, the outputs y) and columns from the second (n, the inputs x), which is exactly the shape of the Jacobian of y with respect to x.

6. Forward Propagation, Written Out

We covered the forward pass conceptually in Chapter 1. Here we write it precisely, with the indexing from section 2, because the derivations later depend on getting these expressions exactly right.

A single neuron takes its inputs, forms a weighted sum, adds a bias, and applies the activation:

z=wx+b,a=σ(z).z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z).

Here wx\mathbf{w}^\top \mathbf{x} is the dot product (multiply matching entries, add them up), bb is the bias scalar, zz is the weighted input, and aa is the activation after the nonlinearity σ\sigma. Nothing new yet; this is Chapter 1 in symbols.

Now the indexing payoff promised earlier. Put every weight of layer \ell into a matrix W()W^{(\ell)} whose entry in row jj, column kk is wjk()w^{(\ell)}_{jk}. Because we indexed destination-first, the jj-th row of W()W^{(\ell)} is exactly the list of weights belonging to neuron jj. So the matrix-vector product W()a(1)W^{(\ell)} \mathbf{a}^{(\ell-1)} produces, in its jj-th entry, the dot product of neuron jj's weights with the previous layer's activations, which is precisely the weighted sum neuron jj wants. No transposes, no rearranging. That is the entire reason for the destination-first convention. The matrix W()W^{(\ell)} has dimensions n×n1n_\ell \times n_{\ell-1} (number of neurons in this layer, by number in the previous layer), and the bias b()Rn\mathbf{b}^{(\ell)} \in \mathbb{R}^{n_\ell} is a column with one entry per neuron. A whole layer is then:

z()=W()a(1)+b(),a()=σ(z()),\mathbf{z}^{(\ell)} = W^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)}),

where σ\sigma applied to a vector means applied to each entry. To run the whole network you set a(0)=x\mathbf{a}^{(0)} = \mathbf{x}, the raw input, and apply this pair of equations for =1,2,,L\ell = 1, 2, \ldots, L, and the final activation a(L)\mathbf{a}^{(L)} is the prediction y^\hat{\mathbf{y}}.

A small worked example to see the matrix do its job. Say layer 1\ell-1 has 3 neurons and layer \ell has 2, so W()W^{(\ell)} is 2×32 \times 3:

W()=[w11()w12()w13()w21()w22()w23()],a(1)=[a1(1)a2(1)a3(1)].W^{(\ell)} = \begin{bmatrix} w^{(\ell)}_{11} & w^{(\ell)}_{12} & w^{(\ell)}_{13} \\ w^{(\ell)}_{21} & w^{(\ell)}_{22} & w^{(\ell)}_{23} \end{bmatrix}, \qquad \mathbf{a}^{(\ell-1)} = \begin{bmatrix} a^{(\ell-1)}_1 \\ a^{(\ell-1)}_2 \\ a^{(\ell-1)}_3 \end{bmatrix}.

The product is

W()a(1)=[w11()a1(1)+w12()a2(1)+w13()a3(1)w21()a1(1)+w22()a2(1)+w23()a3(1)],W^{(\ell)} \mathbf{a}^{(\ell-1)} = \begin{bmatrix} w^{(\ell)}_{11} a^{(\ell-1)}_1 + w^{(\ell)}_{12} a^{(\ell-1)}_2 + w^{(\ell)}_{13} a^{(\ell-1)}_3 \\ w^{(\ell)}_{21} a^{(\ell-1)}_1 + w^{(\ell)}_{22} a^{(\ell-1)}_2 + w^{(\ell)}_{23} a^{(\ell-1)}_3 \end{bmatrix},

and each row is one neuron's weighted sum. Add the biases, apply σ\sigma, and you have a()\mathbf{a}^{(\ell)}.

Stacking all the layers, the entire network is one deeply nested function:

y^=σ(W(L)σ(W(L1)σ(W(1)x+b(1))+b(L1))+b(L)).\hat{\mathbf{y}} = \sigma\Big(W^{(L)} \, \sigma\big(W^{(L-1)} \cdots \sigma(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) \cdots + \mathbf{b}^{(L-1)}\big) + \mathbf{b}^{(L)}\Big).

This is the object we are about to differentiate. It looks fearsome, but it is just our layer equation wrapped around itself LL times, and the Jacobian chain rule from the last section is built exactly for peeling apart compositions like this.

ŷ = σ(·σ(·σ( x + ) + ) + )evaluate inside-out:press Step to beginxsize 3size 4size 4size 2step 0 / 6
the network is just the layer equation composed 3 times
Fig 2.5 — Forward pass: the network is just the layer equation composed L times — step through it from the inside out.

Check your understanding

Why does the destination-first weight indexing let us write the layer as W()a(1)W^{(\ell)} \mathbf{a}^{(\ell-1)} with no transpose?

Show answer ▸

Because indexing destination-first makes row j of W the weights belonging to neuron j. The matrix-vector product then puts neuron j's weighted sum in entry j automatically, which is exactly what we want, so no rearranging is needed.

7. The Cost Function

The cost is the single number we minimize, and we treat it, as in Chapter 1, as a function of all the weights and biases. Collect them into θ\theta. The network defines y^=f(x;θ)\hat{\mathbf{y}} = f(\mathbf{x}; \theta), and we have training pairs (x(i),y(i))(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) for i=1,,Ni = 1, \ldots, N, where here ii indexes training examples, not layers. The default cost for this chapter is mean squared error:

C=12Ni=1Ny(i)y^(i)2=12Ni=1Nj(yj(i)y^j(i))2.C = \frac{1}{2N} \sum_{i=1}^{N} \big\| \mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)} \big\|^2 = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j} \big( y^{(i)}_j - \hat{y}^{(i)}_j \big)^2.

Unpacking the symbols: the double bars 2\| \cdot \|^2 mean the squared length of a vector, which is just the sum of the squares of its entries, which is why the right-hand form expands into a sum over the output components jj. The outer sum averages over all NN training examples, and the 12\frac{1}{2} out front is pure convenience: when we differentiate a square we get a factor of 22, and the 12\frac{1}{2} is there to cancel it and keep the algebra tidy. It changes nothing about where the minimum is.

For the rest of the derivations we make one simplifying move: pretend there is a single training example, so C=12yy^2C = \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2. The full-dataset cost is just the average of the single-example costs, and the gradient of an average is the average of the gradients, so the structure of every derivative we find is identical; we would just average at the end. Carrying the sum around would only clutter the page.

Check your understanding

Why is there a factor of 12\frac{1}{2} in front of the mean squared error?

Show answer ▸

Pure convenience. Differentiating the square produces a factor of 2, and the 1/2 cancels it so the derivatives stay clean. It does not change where the minimum is.

8. Differentiating a Neuron's Operations

To differentiate the whole cost we first need the derivatives of the small operations a neuron performs, because the chain rule will glue these together. A layer does three things in sequence: it forms weighted sums (a matrix-vector product, or for one neuron a dot product, or for one weight a single multiplication), it adds the bias (a vector addition), and it applies the activation (an elementwise function). Let us find the Jacobian of each.

Begin with elementwise operations, since the activation is one and they have that pleasant diagonal structure. A binary elementwise function combines two vectors entry by entry: fi(u,v)=g(ui,vi)f_i(\mathbf{u}, \mathbf{v}) = g(u_i, v_i). The Hadamard product uv\mathbf{u} \odot \mathbf{v} (entry-wise multiplication) is the standard example. Because output ii depends only on uiu_i and viv_i and nothing else, the Jacobian with respect to either input is diagonal. For the Hadamard product specifically, fi=uivif_i = u_i v_i, so fiui=vi\frac{\partial f_i}{\partial u_i} = v_i, giving

(uv)u=diag(v),(uv)v=diag(u).\frac{\partial (\mathbf{u} \odot \mathbf{v})}{\partial \mathbf{u}} = \mathrm{diag}(\mathbf{v}), \qquad \frac{\partial (\mathbf{u} \odot \mathbf{v})}{\partial \mathbf{v}} = \mathrm{diag}(\mathbf{u}).

In words, the derivative with respect to one operand is the diagonal matrix of the other operand. And recall the collapse from section 3: a diagonal Jacobian times a vector is just an element-wise product. So when a Hadamard product appears in a chain rule, the incoming gradient simply gets multiplied entry-by-entry by the other operand. This is exactly why the backprop equations are dotted with σ\sigma' rather than carrying around full matrices.

Next, addition, which is the bias step. Take f(z,b)=z+b\mathbf{f}(\mathbf{z}, \mathbf{b}) = \mathbf{z} + \mathbf{b}, so fi=zi+bif_i = z_i + b_i. Each output depends on its own ziz_i and bib_i with derivative 11 and on nothing else, so both Jacobians are the identity matrix II (ones on the diagonal, zeros elsewhere):

fz=I,fb=I.\frac{\partial \mathbf{f}}{\partial \mathbf{z}} = I, \qquad \frac{\partial \mathbf{f}}{\partial \mathbf{b}} = I.

This is the lazy case: addition passes gradients straight through, unchanged. Whenever you see a sum in the forward pass, the matching Jacobian is an identity, and identities do nothing in a product, so you can mentally skip them.

Finally, the lone sum that collapses a vector to a scalar, as in z=kwkxk+bz = \sum_k w_k x_k + b for one neuron. Treating "sum" as the function s(u)=iuis(\mathbf{u}) = \sum_i u_i, its derivative with respect to each input is 11, so its gradient is the all-ones vector 1=[1,1,,1]\mathbf{1} = [1, 1, \ldots, 1]^\top. Differentiating a sum just adds up the upstream contributions equally. Pulling these together for a single neuron's z=wx+bz = \mathbf{w}^\top \mathbf{x} + b, the three derivatives we will reach for again and again are

zwk=xk,zb=1,zxk=wk.\frac{\partial z}{\partial w_k} = x_k, \qquad \frac{\partial z}{\partial b} = 1, \qquad \frac{\partial z}{\partial x_k} = w_k.

Each reads naturally: the weighted input is sensitive to a weight in proportion to the input riding on it (xkx_k), sensitive to the bias with rate 11, and sensitive to an input in proportion to that input's weight (wkw_k).

That leaves the activation. For a single neuron a=σ(z)a = \sigma(z), the derivative is just dadz=σ(z)\frac{da}{dz} = \sigma'(z). For a whole layer, σ\sigma is applied elementwise, so by the diagonal rule the Jacobian is

az=diag(σ(z1),,σ(zn))=diag(σ(z)).\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \mathrm{diag}\big(\sigma'(z_1), \ldots, \sigma'(z_n)\big) = \mathrm{diag}(\sigma'(\mathbf{z})).

The single most useful instance is the sigmoid, because its derivative is unusually clean. With σ(z)=11+ez\sigma(z) = \frac{1}{1 + e^{-z}}, differentiating gives

σ(z)=ez(1+ez)2=11+ezez1+ez=σ(z)(1σ(z)).\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\big(1 - \sigma(z)\big).

That last equality is worth savoring. The derivative of the sigmoid is expressible entirely in terms of the sigmoid's own value. So during the backward pass, if you already computed σ(z)\sigma(z) in the forward pass, you get its derivative almost for free, just multiply by (1σ(z))(1 - \sigma(z)), with no exponentials to recompute. For ReLU the derivative is even simpler: σ(z)=1\sigma'(z) = 1 when z>0z > 0 and 00 when z<0z < 0 (it is undefined right at 00, a single point we ignore in practice).

-6-4-2246-11f(z) = 0.500slope f'(z) = 0.250z = 0.00
Fig 2.6 — The sigmoid and its derivative σ′ = σ(1−σ), which peaks at 0.25 and decays in the tails — the seed of the vanishing gradient. Toggle ReLU for a flat slope of 1.

Check your understanding

The sigmoid's derivative is σ(z)=σ(z)(1σ(z))\sigma'(z) = \sigma(z)(1 - \sigma(z)). Why is that convenient during the backward pass?

Show answer ▸

Because sigma(z) was already computed in the forward pass, you can get its derivative just by multiplying by (1 - sigma(z)), with no need to recompute any exponentials. The forward computation is reused in the backward pass.

9. The Cost Derivative for a Tiny Network

Now we differentiate the cost by hand, on the smallest network that still shows the structure: one neuron per layer, two layers. Watching this done explicitly is what makes the eventual backprop equations feel obvious rather than arbitrary. The forward pass is

z(1)=w(1)x+b(1),a(1)=σ(z(1)),z^{(1)} = w^{(1)} x + b^{(1)}, \quad a^{(1)} = \sigma(z^{(1)}),
z(2)=w(2)a(1)+b(2),a(2)=σ(z(2)),z^{(2)} = w^{(2)} a^{(1)} + b^{(2)}, \quad a^{(2)} = \sigma(z^{(2)}),
C=12(ya(2))2.C = \tfrac{1}{2}(y - a^{(2)})^2.

We want Cw(2)\frac{\partial C}{\partial w^{(2)}} and Cw(1)\frac{\partial C}{\partial w^{(1)}}.

Start with the last-layer weight, which is the short walk. The cost depends on w(2)w^{(2)} only through z(2)z^{(2)}, then through a(2)a^{(2)}. So the chain rule runs three links deep:

Cw(2)=Ca(2)a(2)z(2)z(2)w(2).\frac{\partial C}{\partial w^{(2)}} = \frac{\partial C}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}}.

Each link we already know how to compute. Differentiating 12(ya(2))2\tfrac{1}{2}(y - a^{(2)})^2 with respect to a(2)a^{(2)} gives Ca(2)=(ya(2))=a(2)y\frac{\partial C}{\partial a^{(2)}} = -(y - a^{(2)}) = a^{(2)} - y. The activation link is a(2)z(2)=σ(z(2))\frac{\partial a^{(2)}}{\partial z^{(2)}} = \sigma'(z^{(2)}). And since z(2)=w(2)a(1)+b(2)z^{(2)} = w^{(2)} a^{(1)} + b^{(2)}, the last link is z(2)w(2)=a(1)\frac{\partial z^{(2)}}{\partial w^{(2)}} = a^{(1)}. Multiply them:

  Cw(2)=(a(2)y)σ(z(2))a(1).  \boxed{\;\frac{\partial C}{\partial w^{(2)}} = (a^{(2)} - y)\,\sigma'(z^{(2)})\,a^{(1)}.\;}

Now the first-layer weight, the longer walk. The cost depends on w(1)w^{(1)} through z(1)z^{(1)}, then a(1)a^{(1)}, then z(2)z^{(2)}, then a(2)a^{(2)}, so the chain is five links:

Cw(1)=Ca(2)a(2)z(2)z(2)a(1)a(1)z(1)z(1)w(1).\frac{\partial C}{\partial w^{(1)}} = \frac{\partial C}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}.

The two new links: z(2)a(1)=w(2)\frac{\partial z^{(2)}}{\partial a^{(1)}} = w^{(2)} (because z(2)=w(2)a(1)+b(2)z^{(2)} = w^{(2)} a^{(1)} + b^{(2)}), and a(1)z(1)=σ(z(1))\frac{\partial a^{(1)}}{\partial z^{(1)}} = \sigma'(z^{(1)}), and finally z(1)w(1)=x\frac{\partial z^{(1)}}{\partial w^{(1)}} = x. Multiply the whole string:

  Cw(1)=(a(2)y)σ(z(2))w(2)σ(z(1))x.  \boxed{\;\frac{\partial C}{\partial w^{(1)}} = (a^{(2)} - y)\,\sigma'(z^{(2)})\,w^{(2)}\,\sigma'(z^{(1)})\,x.\;}

Stare at those two boxed results, because the patterns in them are the whole of backpropagation in miniature.

First, reuse. The first two factors of Cw(1)\frac{\partial C}{\partial w^{(1)}}, namely (a(2)y)σ(z(2))(a^{(2)} - y)\,\sigma'(z^{(2)}), are exactly the leading factors of Cw(2)\frac{\partial C}{\partial w^{(2)}}. We recomputed them. If instead we had saved that piece, we could have reached the first-layer gradient by just tacking on w(2)σ(z(1))xw^{(2)}\,\sigma'(z^{(1)})\,x. That saved, reused quantity is the seed of the error signal δ\delta.

Second, the **role of σ\sigma'**. Every layer the chain passes through contributes one factor of σ(z())\sigma'(z^{(\ell)}). For the sigmoid, σ\sigma' never exceeds 0.250.25, so in a deep network you are multiplying many numbers each at most a quarter, and the product collapses toward zero. That is the vanishing gradient problem from Chapter 1, now visible as a literal product of small factors stacking up.

Third, a structural rhythm. Both formulas share one skeleton: an error at the output, (a(2)y)(a^{(2)} - y), propagated backward through the network, and then, at the very last step, multiplied by the input that fed the weight in question, which is a(1)a^{(1)} for the second-layer weight and xx for the first. That skeleton, error-from-the-end times input-that-fed-the-weight, is precisely the form the four equations will take.

The bias derivatives confirm the rhythm. Same network, but now the final link changes, because z(2)b(2)=1\frac{\partial z^{(2)}}{\partial b^{(2)}} = 1 replaces the a(1)a^{(1)}, and z(1)b(1)=1\frac{\partial z^{(1)}}{\partial b^{(1)}} = 1 replaces the xx:

Cb(2)=(a(2)y)σ(z(2)),Cb(1)=(a(2)y)σ(z(2))w(2)σ(z(1)).\frac{\partial C}{\partial b^{(2)}} = (a^{(2)} - y)\,\sigma'(z^{(2)}), \qquad \frac{\partial C}{\partial b^{(1)}} = (a^{(2)} - y)\,\sigma'(z^{(2)})\,w^{(2)}\,\sigma'(z^{(1)}).

So the bias gradient is the weight gradient with the trailing input factor stripped off. Said another way, the bias gradient is the propagated error at that neuron, which is the strongest hint yet that we should give that propagated error a name. We are about to.

∂z1/∂w1 = x∂a1/∂z1 = σ'(z1)∂z2/∂a1 = w2∂a2/∂z2 = σ'(z2)∂C/∂a2 = a2 − yxz1a1z2a2C∂C/∂w2 = (a2 − y)·σ'(z2)·a1the shared green factors are the reusable error signalδ = (a2 − y)·σ'(z2) — this is what backprop saves and propagates.
target:green = shared δ (reused across both gradients)
Fig 2.7 — Chain-rule walk on a two-layer network: pick a weight and watch its gradient assemble. The shared leading factors are the reusable error signal δ.

Check your understanding

In Cw(1)\frac{\partial C}{\partial w^{(1)}}, which factors did we also already compute for Cw(2)\frac{\partial C}{\partial w^{(2)}}, and why does that matter?

Show answer ▸

The leading factors (a2 - y) and sigma'(z2) appear in both. Recomputing them is wasted work; if we save that shared quantity and reuse it, we get the earlier-layer gradient almost for free. That reusable quantity becomes the error signal delta in backprop.

10. Why the Naive Way Doesn't Scale

We just differentiated a network with two parameters by hand. The obvious thought is to do the same for a real network: write a chain rule for every weight and grind through it. Let us see exactly why that is hopeless, because the failure points straight at the fix.

First, what we are even computing. For a weight matrix W()W^{(\ell)}, the gradient CW()\frac{\partial C}{\partial W^{(\ell)}} is itself a matrix the same shape as W()W^{(\ell)}, whose entry (j,k)(j, k) is Cwjk()\frac{\partial C}{\partial w^{(\ell)}_{jk}}. So we need one partial derivative per weight, arranged in a grid.

Now the cost of getting them the naive way. A network with LL layers and roughly nn neurons per layer has on the order of Ln2L \cdot n^2 weights. For each one, a chain rule walk back to the output passes through on the order of LnL \cdot n intermediate quantities. Multiply: the total work is on the order of L2n3L^2 \cdot n^3 per training example. For anything beyond a toy, that is absurd, and worse, it is absurd while being wildly redundant, because, as we saw in the last section, the walks for different weights share almost all of their factors and we would be recomputing the same shared pieces over and over.

There is also an ugliness of representation lurking. The moment you try to write the Jacobian of a layer's activation vector with respect to its weight matrix, you are differentiating a vector with respect to a matrix, which is a three-dimensional array of numbers, a rank-3 tensor. The notation and bookkeeping get genuinely unpleasant fast.

Both problems have the same cure, and the last section already whispered it. Identify the one quantity that is reused across all of a layer's weight gradients, compute it once per layer, and propagate it backward instead of redoing full chain walks. That quantity is the error of a neuron. Naming it and propagating it turns the cost from O(L2n3)O(L^2 n^3) down to roughly the cost of a single forward pass, linear in the number of parameters. That move is backpropagation, and it is the rest of the chapter.

(For completeness: once we have all the gradients, we feed them to gradient descent, which we covered fully in Chapter 1. The one-line recap is that we update each parameter by stepping against its gradient, θθηθC\theta \leftarrow \theta - \eta \nabla_\theta C, and in practice we average the gradient over a small mini-batch of examples per step rather than the whole dataset, repeating over many epochs. Nothing about that changes here; backprop is just the efficient way to get the θC\nabla_\theta C that gradient descent consumes.)

naive (re-trace each weight)backprop (sweep once)naive ops: 0backprop ops: 0weights processed: 0 / 17same gradients, hugely different cost — backprop never recomputes a shared subpath
naive 0 vs backprop 0 ops — the gap widens with every weight.
Fig 2.8 — Naive vs backprop: the naive way recomputes shared subpaths over and over; backprop computes each neuron's error once in one backward sweep.

Check your understanding

What single idea turns the naive O(L2n3)O(L^2 n^3) cost into something roughly as cheap as one forward pass?

Show answer ▸

Identify the quantity that all of a layer's weight gradients share (the neuron's error signal), compute it once per neuron in a single backward pass, and reuse it, instead of re-walking a full chain rule for every weight. That reuse is backpropagation.

11. The Error of a Node

Here is the quantity that fixes everything. For each neuron in each layer, define its error as the sensitivity of the cost to that neuron's weighted input:

δj():=Czj().\delta^{(\ell)}_j := \frac{\partial C}{\partial z^{(\ell)}_j}.

In words, δj()\delta^{(\ell)}_j asks: if I wiggle the weighted input zj()z^{(\ell)}_j of neuron jj in layer \ell by a hair, how much does the final cost move? It is a single number per neuron, and we gather a layer's worth into a column vector δ()Rn\boldsymbol{\delta}^{(\ell)} \in \mathbb{R}^{n_\ell}.

Why define the error at zz, the weighted input, rather than at aa, the activation, or at the weights directly? Because zz sits at the perfect hinge. Everything upstream of zj()z^{(\ell)}_j, namely the weights wjk()w^{(\ell)}_{jk}, the bias bj()b^{(\ell)}_j, and the incoming activations, feeds into zj()z^{(\ell)}_j through plain linear operations, so once we know how sensitive the cost is to zj()z^{(\ell)}_j, the sensitivities to those upstream parameters are a trivial extra step. And everything downstream of zj()z^{(\ell)}_j, the entire rest of the network up to the cost, is already bundled inside δj()\delta^{(\ell)}_j by definition. So δ\delta cleanly separates the easy local part from the hard global part, and the global part is exactly what we will pass backward.

The strategy is now sharp. Compute δ(L)\boldsymbol{\delta}^{(L)} at the output layer. Then use it to compute δ(L1)\boldsymbol{\delta}^{(L-1)}, then δ(L2)\boldsymbol{\delta}^{(L-2)}, and so on backward to δ(1)\boldsymbol{\delta}^{(1)}. Then, for every layer, read off the weight and bias gradients from δ()\boldsymbol{\delta}^{(\ell)} in one cheap step. Four equations make this concrete: one to start δ\boldsymbol{\delta} at the output, one to pass it back a layer, and two to read off the parameter gradients. We derive them next.

Check your understanding

Why define the error δ\delta at the weighted input zz rather than at the activation aa?

Show answer ▸

Because z is the hinge between the linear part (weights, bias, incoming activations all feed z linearly) and the nonlinear part (the activation and everything after). Knowing the cost's sensitivity to z makes the weight and bias gradients a trivial next step, while everything downstream is already captured inside delta.

12. The Four Equations of Backpropagation

These are the four equations Chapter 1 handed you as facts. Now we derive each one.

BP1: the error at the output layer

The cost depends on the output neuron's weighted input zj(L)z^{(L)}_j only through its activation aj(L)=σ(zj(L))a^{(L)}_j = \sigma(z^{(L)}_j). So a two-link chain rule gives

δj(L)=Caj(L)aj(L)zj(L)=Caj(L)σ(zj(L)).\delta^{(L)}_j = \frac{\partial C}{\partial a^{(L)}_j} \cdot \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} = \frac{\partial C}{\partial a^{(L)}_j} \cdot \sigma'(z^{(L)}_j).

Stacking this over all output neurons jj, and using the diagonal-Jacobian collapse (an elementwise activation contributes a diagonal Jacobian, which acts as element-wise multiplication), the vector form is

  δ(L)=a(L)Cσ(z(L)).  (BP1)\boxed{\;\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} C \,\odot\, \sigma'(\mathbf{z}^{(L)}).\;} \tag{BP1}

Here a(L)C\nabla_{\mathbf{a}^{(L)}} C is the gradient of the cost with respect to the output activations, and \odot is the element-wise product. For the mean squared error cost, a(L)C=a(L)y\nabla_{\mathbf{a}^{(L)}} C = \mathbf{a}^{(L)} - \mathbf{y}, so δ(L)=(a(L)y)σ(z(L))\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(L)}). Notice this matches the leading factors (a(2)y)σ(z(2))(a^{(2)} - y)\,\sigma'(z^{(2)}) from our hand derivation in section 9. Good sign.

BP2: the error of any earlier layer

This is the keystone, the equation that does the actual propagating. We want δ()\boldsymbol{\delta}^{(\ell)} assuming we already have δ(+1)\boldsymbol{\delta}^{(\ell+1)} from the layer ahead.

Start from the definition δj()=Czj()\delta^{(\ell)}_j = \frac{\partial C}{\partial z^{(\ell)}_j} and ask how zj()z^{(\ell)}_j reaches the cost. It does so only through the next layer, and specifically through every neuron of the next layer, because neuron jj's activation aj()a^{(\ell)}_j fans out to all of them. That fan-out is the sum-over-paths situation from section 4. So

δj()=kCzk(+1)zk(+1)zj()=kδk(+1)zk(+1)zj(),\delta^{(\ell)}_j = \sum_k \frac{\partial C}{\partial z^{(\ell+1)}_k} \cdot \frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_j} = \sum_k \delta^{(\ell+1)}_k \cdot \frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_j},

where we recognized Czk(+1)\frac{\partial C}{\partial z^{(\ell+1)}_k} as exactly δk(+1)\delta^{(\ell+1)}_k, the error we already have. Now we just need zk(+1)zj()\frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_j}. Write out the next layer's weighted input:

zk(+1)=mwkm(+1)am()+bk(+1)=mwkm(+1)σ(zm())+bk(+1).z^{(\ell+1)}_k = \sum_m w^{(\ell+1)}_{km} a^{(\ell)}_m + b^{(\ell+1)}_k = \sum_m w^{(\ell+1)}_{km} \sigma(z^{(\ell)}_m) + b^{(\ell+1)}_k.

Differentiate with respect to zj()z^{(\ell)}_j. Only the m=jm = j term in the sum depends on it, and it brings down wkj(+1)w^{(\ell+1)}_{kj} times the activation's derivative:

zk(+1)zj()=wkj(+1)σ(zj()).\frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_j} = w^{(\ell+1)}_{kj}\,\sigma'(z^{(\ell)}_j).

Substitute back, and pull the σ(zj())\sigma'(z^{(\ell)}_j) out of the sum since it does not depend on kk:

δj()=kδk(+1)wkj(+1)σ(zj())=σ(zj())kwkj(+1)δk(+1).\delta^{(\ell)}_j = \sum_k \delta^{(\ell+1)}_k\, w^{(\ell+1)}_{kj}\,\sigma'(z^{(\ell)}_j) = \sigma'(z^{(\ell)}_j) \sum_k w^{(\ell+1)}_{kj}\, \delta^{(\ell+1)}_k.

Look closely at that remaining sum, kwkj(+1)δk(+1)\sum_k w^{(\ell+1)}_{kj}\, \delta^{(\ell+1)}_k. In the original matrix W(+1)W^{(\ell+1)}, the row index kk is the destination and the column index jj is the source. Here we are summing over the destination kk and the free index is the source jj, which means jj is now playing the role of a row. That is exactly the transpose W(+1)W^{(\ell+1)\top}. So the sum is the jj-th entry of (W(+1))δ(+1)(W^{(\ell+1)})^\top \boldsymbol{\delta}^{(\ell+1)}, and the whole thing becomes

  δ()=((W(+1))δ(+1))σ(z()).  (BP2)\boxed{\;\boldsymbol{\delta}^{(\ell)} = \left( (W^{(\ell+1)})^\top \boldsymbol{\delta}^{(\ell+1)} \right) \odot \sigma'(\mathbf{z}^{(\ell)}).\;} \tag{BP2}

This is the equation that propagates the error backward. Read it as two moves. The transposed weight matrix takes the error from the layer ahead and sends it back through the linear connections, handing each upstream neuron a share of the blame in proportion to how strongly it fed forward. Then the element-wise σ(z())\odot \, \sigma'(\mathbf{z}^{(\ell)}) scales each neuron's share by how responsive its activation actually was at that point. And here is the vanishing gradient made precise: if a neuron sat on a flat tail of the sigmoid, σ(zj())\sigma'(z^{(\ell)}_j) is nearly zero, so it throttles that neuron's error toward nothing, and every neuron further back gets starved of signal. The transpose is the same matrix doing the forward fan-out, now run in reverse.

BP3: the gradient with respect to any bias

Easy now that we have δ\delta. The bias bj()b^{(\ell)}_j enters the cost only through zj()z^{(\ell)}_j, and zj()bj()=1\frac{\partial z^{(\ell)}_j}{\partial b^{(\ell)}_j} = 1 from section 8. So

Cbj()=Czj()zj()bj()=δj()1=δj(),\frac{\partial C}{\partial b^{(\ell)}_j} = \frac{\partial C}{\partial z^{(\ell)}_j} \cdot \frac{\partial z^{(\ell)}_j}{\partial b^{(\ell)}_j} = \delta^{(\ell)}_j \cdot 1 = \delta^{(\ell)}_j,

or in vector form,

  Cb()=δ().  (BP3)\boxed{\;\frac{\partial C}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}.\;} \tag{BP3}

The bias gradient simply is the error. This formalizes the hint we noticed at the end of section 9.

BP4: the gradient with respect to any weight

The weight wjk()w^{(\ell)}_{jk} enters the cost only through zj()z^{(\ell)}_j, and since zj()=mwjm()am(1)+bj()z^{(\ell)}_j = \sum_m w^{(\ell)}_{jm} a^{(\ell-1)}_m + b^{(\ell)}_j, only the m=km = k term depends on it, giving zj()wjk()=ak(1)\frac{\partial z^{(\ell)}_j}{\partial w^{(\ell)}_{jk}} = a^{(\ell-1)}_k. So

Cwjk()=Czj()zj()wjk()=δj()ak(1),\frac{\partial C}{\partial w^{(\ell)}_{jk}} = \frac{\partial C}{\partial z^{(\ell)}_j} \cdot \frac{\partial z^{(\ell)}_j}{\partial w^{(\ell)}_{jk}} = \delta^{(\ell)}_j \, a^{(\ell-1)}_k,

or, putting destination and source in their natural reading order,

  Cwjk()=ak(1)δj().  (BP4)\boxed{\;\frac{\partial C}{\partial w^{(\ell)}_{jk}} = a^{(\ell-1)}_k \, \delta^{(\ell)}_j.\;} \tag{BP4}

This is the structural rhythm from section 9, now exact and general: a weight's gradient is the error of the neuron it feeds (δj()\delta^{(\ell)}_j) times the activation it receives (ak(1)a^{(\ell-1)}_k). Two numbers, multiplied.

Vectorizing BP4

Equation BP4 gives one matrix entry at a time. To get the gradient of the whole matrix W()W^{(\ell)} in a single expression, notice that CW()\frac{\partial C}{\partial W^{(\ell)}} has the same shape as W()W^{(\ell)}, that is n×n1n_\ell \times n_{\ell-1}, with entry (j,k)(j, k) equal to δj()ak(1)\delta^{(\ell)}_j a^{(\ell-1)}_k. A matrix whose (j,k)(j,k) entry is the product of the jj-th entry of one vector and the kk-th entry of another is exactly an outer product:

  CW()=δ()(a(1)).  (BP4-vec)\boxed{\;\frac{\partial C}{\partial W^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \, (\mathbf{a}^{(\ell-1)})^\top.\;} \tag{BP4-vec}

The outer product uv\mathbf{u}\,\mathbf{v}^\top of a column u\mathbf{u} and a row v\mathbf{v}^\top is the matrix whose (j,k)(j,k) entry is ujvku_j v_k. Here δ()\boldsymbol{\delta}^{(\ell)} is n×1n_\ell \times 1 and (a(1))(\mathbf{a}^{(\ell-1)})^\top is 1×n11 \times n_{\ell-1}, so their product is n×n1n_\ell \times n_{\ell-1}, matching W()W^{(\ell)} exactly. Shapes check. Those four boxed equations are the complete mathematical content of backpropagation.

inh1h2outpick an equationWalk the four equations across one network, one backward step at a time.
Fig 2.9 — The four equations on one network: BP1 starts δ at the output, BP2 propagates it back through Wᵀ, BP3 and BP4 read off the gradients.
aᵀ (row)δ (col)1.00a_12.00a_2-0.50a_31.50a_40.50δ_1-1.00δ_20.80δ_30.501.00-0.250.75-1.00-2.000.50-1.500.801.60-0.401.20hover a matrix cell to see one δ times one activation

BP4-vec: ∂C/∂W = δ aᵀ — every entry is one δ times one activation.

Fig 2.10 — BP4 as an outer product: the whole weight-gradient matrix is one error column δ times one activation row aᵀ.

Check your understanding

In BP2, why does the weight matrix appear transposed?

Show answer ▸

Because we sum over the destination index k of the next layer while the free index is the source j. In W the row is the destination and the column is the source, so summing over the destination and keeping the source as the row is exactly the transpose. Intuitively, the same weights that fanned the signal forward now route the error backward.

13. Tying It Back Together

Before trusting the four equations, let us confirm they reproduce the answers we got the slow way in section 9, on the two-layer, one-neuron network. (For a single neuron per layer, every vector is just a number and every matrix transpose is itself, so the equations look scalar.)

From BP1: δ(2)=(a(2)y)σ(z(2))\delta^{(2)} = (a^{(2)} - y)\,\sigma'(z^{(2)}).

From BP4: Cw(2)=a(1)δ(2)=a(1)(a(2)y)σ(z(2))\frac{\partial C}{\partial w^{(2)}} = a^{(1)}\,\delta^{(2)} = a^{(1)}(a^{(2)} - y)\sigma'(z^{(2)}). That is exactly the boxed Cw(2)\frac{\partial C}{\partial w^{(2)}} from section 9.

From BP2: δ(1)=w(2)δ(2)σ(z(1))=w(2)(a(2)y)σ(z(2))σ(z(1))\delta^{(1)} = w^{(2)}\,\delta^{(2)}\,\sigma'(z^{(1)}) = w^{(2)}(a^{(2)} - y)\sigma'(z^{(2)})\sigma'(z^{(1)}).

From BP4 again: Cw(1)=xδ(1)=xw(2)(a(2)y)σ(z(2))σ(z(1))\frac{\partial C}{\partial w^{(1)}} = x\,\delta^{(1)} = x\,w^{(2)}(a^{(2)} - y)\sigma'(z^{(2)})\sigma'(z^{(1)}). Exactly the section 9 result.

And from BP3, Cb(2)=δ(2)\frac{\partial C}{\partial b^{(2)}} = \delta^{(2)} and Cb(1)=δ(1)\frac{\partial C}{\partial b^{(1)}} = \delta^{(1)}, matching the bias derivatives we found by hand.

Identical answers. So what did backprop actually buy us, if the results are the same? The computation. The slow way rederived each parameter's gradient from scratch with a fresh long chain walk, repeating shared factors every time. Backprop computes δ(L),δ(L1),,δ(1)\boldsymbol{\delta}^{(L)}, \boldsymbol{\delta}^{(L-1)}, \ldots, \boldsymbol{\delta}^{(1)} once each, in a single backward sweep, then reads off every gradient in one cheap multiply. Same destination, vastly cheaper road, and the cost is now linear in the number of parameters instead of quadratic.

Check your understanding

Backprop and the slow chain-rule method give the same gradients. So what is the actual advantage of backprop?

Show answer ▸

Only the cost of computing them. Both produce identical gradients, but the slow method redoes shared work for every parameter, while backprop computes each layer's error once in a single backward pass and reuses it, dropping the cost from roughly quadratic to linear in the number of parameters.

14. The Algorithm, Start to Finish

Here is everything assembled, the loop that trains a network.

For each training example (x,y)(\mathbf{x}, \mathbf{y}):

1. Forward pass. Set a(0)=x\mathbf{a}^{(0)} = \mathbf{x}. For =1,2,,L\ell = 1, 2, \ldots, L, compute and store

z()=W()a(1)+b(),a()=σ(z()).\mathbf{z}^{(\ell)} = W^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)}).

Keep every z()\mathbf{z}^{(\ell)} and a()\mathbf{a}^{(\ell)}, because the backward pass needs them.

2. Output error. Compute, by BP1,

δ(L)=a(L)Cσ(z(L)).\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} C \,\odot\, \sigma'(\mathbf{z}^{(L)}).

3. Backpropagate. For =L1,L2,,1\ell = L-1, L-2, \ldots, 1, compute, by BP2,

δ()=((W(+1))δ(+1))σ(z()).\boldsymbol{\delta}^{(\ell)} = \left( (W^{(\ell+1)})^\top \boldsymbol{\delta}^{(\ell+1)} \right) \odot \sigma'(\mathbf{z}^{(\ell)}).

4. Read off the gradients. For every layer, by BP3 and BP4-vec,

Cb()=δ(),CW()=δ()(a(1)).\frac{\partial C}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}, \qquad \frac{\partial C}{\partial W^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} (\mathbf{a}^{(\ell-1)})^\top.

5. Update. For a mini-batch, average the gradients over its examples, then step each parameter against its gradient with learning rate η\eta:

W()W()ηCW(),b()b()ηCb().W^{(\ell)} \leftarrow W^{(\ell)} - \eta \, \frac{\partial C}{\partial W^{(\ell)}}, \qquad \mathbf{b}^{(\ell)} \leftarrow \mathbf{b}^{(\ell)} - \eta \, \frac{\partial C}{\partial \mathbf{b}^{(\ell)}}.

Repeat over many mini-batches and many epochs, and the network learns. That update step is the gradient descent from Chapter 1; backpropagation is simply the efficient engine that supplies the gradients it runs on.

Check your understanding

During the forward pass, why do we store every z()\mathbf{z}^{(\ell)} and a()\mathbf{a}^{(\ell)} instead of discarding them?

Show answer ▸

Because the backward pass needs them. BP2 needs each layer's z to evaluate sigma'(z), and BP4 needs each layer's incoming activation a^(l-1) to form the weight gradient. Recomputing them would waste the very work the forward pass already did.