ML Bible · Chapter 2
Math of Neural Networks
The calculus underneath Chapter 1 — gradients, Jacobians, the chain rule, and the four equations of backpropagation, derived from scratch.
1. What This Chapter Is For
In Chapter 1 we built the whole picture of how a network learns, and we stated the rules without proving them. We said the error at the output is the prediction minus the truth, that it propagates backward through the transposed weights, that the weight gradient is an outer product of an error signal and an activation. We used those facts. We never earned them.
This chapter earns them. Everything here is the calculus underneath Chapter 1, derived step by step so that by the end the four equations of backpropagation are not magic words you memorize but results you could rederive on a napkin. We are not going to re-explain what a neuron is or why gradient descent works; you have that. We are going deeper into the math of how.
The plan, in order. First we fix the notation, which matters more here than in any other subject because of the swarm of indices. Then we review the two pieces of calculus we lean on: gradients and Jacobians, and the chain rule that ties them together. Then we write forward propagation as clean math. Then we differentiate it the slow, honest way for a tiny network, watch a pattern appear, and watch that slow way fail to scale. Finally we fix the scaling problem by naming one reusable quantity and propagating it backward, which is backpropagation, and we derive its four equations and prove they match the slow way.
A quick word on what you need. From Chapter 1 you already have derivatives (a slope, a rate of change), partial derivatives (the slope with respect to one variable while the rest are frozen), the gradient (all the partials stacked into a vector), the dot product, and the basic shape of a matrix. We will build on those rather than restate them. The genuinely new piece of machinery is the Jacobian, and we will take that one slowly.
Check your understanding
What is the goal of this chapter, as opposed to Chapter 1?
Chapter 1 gave the intuition and stated the rules of forward propagation and backpropagation. This chapter derives those rules from calculus, so the four backprop equations become results you can prove rather than facts you accept.
2. Notation, Set Up Carefully
Most confusion in this subject is not confusion about ideas. It is misreading an index. So before any derivation, here is the bookkeeping, and it is worth fixing in your head now.
We use four typographic conventions. A lowercase italic letter like $z$, $a$, or $b$ is a single number, a scalar. A lowercase bold letter like $\mathbf{z}$, $\mathbf{a}$, or $\mathbf{b}$ is a vector, an ordered list of numbers, which we always treat as a column unless we say otherwise. An uppercase italic letter like $W$ is a matrix, a grid of numbers. And a superscript in parentheses, like $\mathbf{a}^{(l)}$ or $W^{(2)}$, is a layer label, not an exponent; $\mathbf{a}^{(l)}$ means "the activations of layer $l$," and the parentheses are there precisely so you never mistake it for raising something to the power $l$.
Subscripts pick out a component. The one to internalize is the weight index. We write
$$w^{(l)}_{jk}$$
for the weight in layer $l$ on the connection going into the $j$-th neuron of layer $l$, from the $k$-th neuron of layer $l-1$. Read the order out loud: destination first ($j$), source second ($k$). This feels backward, because when you draw an arrow you naturally think source-then-destination. There is a concrete payoff for the inversion, and we will cash it in when we write forward propagation in section 6: it is exactly what makes the layer's matrix multiplication work with no stray transposes.
Two more symbols carry the whole story. We write $z^{(l)}_j$ for the weighted input to neuron $j$ in layer $l$, the value before the activation function, and $a^{(l)}_j$ for the activation, the value after it. The cost is $C$. And one quantity we will define carefully later, the error of a neuron, is
$$\delta^{(l)}_j = \frac{\partial C}{\partial z^{(l)}_j}$$
You do not need to understand that last line yet. Just register that $\delta$ (delta) lives at the heart of backpropagation and is defined as a partial derivative of the cost with respect to a neuron's weighted input. Keep this notation nearby. When something looks impenetrable later, nine times out of ten it is a misread index, not a hard idea.
Check your understanding
In $w^{(l)}_{jk}$, which index is the destination neuron and which is the source?
j is the destination (the neuron in layer l that the connection feeds into), and k is the source (the neuron in layer l-1 the connection comes from). Destination first, source second.
3. Gradients and Jacobians
Start with what you know and stretch it by one step. A gradient is the object you get when a function takes in a vector and returns a single number. Take $f: \mathbb{R}^n \to \mathbb{R}$, read as "a function from $n$-dimensional vectors to single numbers" (the symbol $\mathbb{R}$ means the ordinary real numbers, and $\mathbb{R}^n$ means a list of $n$ of them). For example, $f(x_1, x_2) = x_1^2 + 3x_1x_2$. You can take a partial derivative with respect to each input, and the gradient simply stacks them into a column:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$
The symbol $\nabla$ (read "del" or "gradient of") just means "collect all the partials." For the example, $\nabla f = [\,2x_1 + 3x_2,\ \ 3x_1\,]^T$, where the small $T$ ("transpose") is flipping the row I wrote on the page into the column it should be. We met the geometric meaning in Chapter 1: the gradient points in the direction of steepest ascent, and the negative gradient points downhill, which is why we walk against it. We will treat the gradient as a column throughout, because it makes the conventions line up cleanly for the Jacobian, which is the new idea we turn to now.
What if the function returns not one number but several? That is the common case in a network: a layer eats a vector and produces a vector. Take $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, a vector in and a vector out, so $\mathbf{f}$ has $m$ output components, each depending on all $n$ inputs. Stack the gradient of each output as a row and you get the Jacobian matrix:
$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}$$
There is one rule to hold onto, and it governs everything: rows are outputs, columns are inputs. Entry $(i, j)$ of the Jacobian is $\partial f_i / \partial x_j$, the sensitivity of output $i$ to input $j$. So row $i$ is the gradient of the $i$-th output, and the matrix is $m$ rows tall (one per output) by $n$ columns wide (one per input). That shape, output-by-input, is what makes the dimensions click together when we chain things in section 5.
The gradient is just a Jacobian in disguise. If $m = 1$, meaning the function returns a single number, the Jacobian collapses to a single row. Transpose it and you have the column-vector gradient from before. So gradients and Jacobians are the same animal; the gradient is the Jacobian of a function whose output happens to be one number.
One special shape shows up constantly in networks and is worth memorizing, because it makes the algebra later evaporate. Suppose $\mathbf{f}$ acts elementwise, meaning each output depends only on the input in the same position: $f_i(\mathbf{x}) = g(x_i)$ for some single-variable function $g$. Activation functions are exactly this; $\sigma$ hits each entry on its own. Then output $i$ does not depend on input $j$ at all when $i \neq j$, so every off-diagonal partial is zero, and the Jacobian is diagonal, carrying $g'(x_i)$ down the diagonal:
$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \mathrm{diag}\big(g'(x_1), \dots, g'(x_n)\big)$$
A diagonal matrix is one whose only nonzero entries sit on the top-left-to-bottom-right diagonal. Why care? Because a diagonal Jacobian, when it multiplies a vector inside a chain rule, behaves exactly like multiplying entry by entry. If $D = \mathrm{diag}(\mathbf{d})$ and $\mathbf{v}$ is any vector, then $D\mathbf{v} = \mathbf{d} \odot \mathbf{v}$, where $\odot$ is the Hadamard product, plain element-wise multiplication. That single fact is what turns the intimidating matrix expressions in backprop into the friendly $\odot\, \sigma'(\mathbf{z})$ you saw in Chapter 1.
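The diagonal collapse is easy to confirm numerically. The sketch below uses only the standard library; the input values are made up for illustration, and the sigmoid stands in for the elementwise function $g$. It builds the full diagonal Jacobian, multiplies it by a vector the long way, and compares the result with the element-wise shortcut.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = [0.5, -1.0, 2.0]   # illustrative weighted inputs (assumed values)
v = [1.0, 2.0, 3.0]    # any vector arriving through the chain rule
n = len(z)

# Full Jacobian of the elementwise sigmoid: g'(z_i) on the diagonal, zeros elsewhere.
jacobian = [[sigmoid_prime(z[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]

# Matrix-vector product, done the long way.
matvec = [sum(jacobian[i][j] * v[j] for j in range(n)) for i in range(n)]

# Hadamard shortcut: multiply entry by entry, no matrix needed.
hadamard = [sigmoid_prime(z[i]) * v[i] for i in range(n)]

print(matvec)
print(hadamard)
```

The two printed lists agree, and the shortcut never stores the $n \times n$ matrix at all, which is the practical reason backprop is written with $\odot$.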
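A concrete worked example to make the rows-are-outputs rule stick. Let $\mathbf{f}(x_1, x_2) = \big(x_1^2 x_2,\ \ 5x_1 + \sin x_2\big)$, with two inputs and two outputs. Differentiate each output by each input:
$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} 2x_1x_2 & x_1^2 \\ 5 & \cos x_2 \end{bmatrix}$$
Row 1 is the gradient of $f_1$ (its partials are $2x_1x_2$ and $x_1^2$), and row 2 is the gradient of $f_2$ (its partials are $5$ and $\cos x_2$). Rows are outputs, columns are inputs. Always.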
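One way to check a hand-computed Jacobian is to compare it against finite differences. The sketch below does that for a hypothetical two-input, two-output function chosen for illustration, $\mathbf{f}(x_1, x_2) = (x_1^2 x_2,\ 5x_1 + \sin x_2)$; the helper name `numeric_jacobian`, the step size, and the test point are assumptions of this example.

```python
import math

def f(x):
    """Illustrative function: two inputs, two outputs."""
    x1, x2 = x
    return [x1 * x1 * x2, 5.0 * x1 + math.sin(x2)]

def analytic_jacobian(x):
    x1, x2 = x
    return [[2.0 * x1 * x2, x1 * x1],        # row 1: gradient of f_1
            [5.0,           math.cos(x2)]]   # row 2: gradient of f_2

def numeric_jacobian(func, x, h=1e-6):
    """Central differences; entry [i][j] approximates d f_i / d x_j."""
    m, n = len(func(x)), len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):                       # one column per input
        up = list(x); up[j] += h
        dn = list(x); dn[j] -= h
        f_up, f_dn = func(up), func(dn)
        for i in range(m):                   # one row per output
            jac[i][j] = (f_up[i] - f_dn[i]) / (2.0 * h)
    return jac

point = [1.5, 0.7]
J_exact = analytic_jacobian(point)
J_approx = numeric_jacobian(f, point)
for row_exact, row_approx in zip(J_exact, J_approx):
    print(row_exact, row_approx)
```

Note how the loops mirror the rule: the outer loop walks inputs (columns) and the inner loop fills outputs (rows).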
Check your understanding
Why is the Jacobian of an elementwise function (like an activation) diagonal?
Because each output depends only on the input in the same position, so the partial of output i with respect to input j is zero whenever i and j differ. Only the same-position partials survive, and they sit on the diagonal as g'(x_i).
4. The Chain Rule, as a Sum Over Paths
You know the basic chain rule from Chapter 1: if $y = f(u)$ and $u = g(x)$, then $\dfrac{dy}{dx} = \dfrac{dy}{du}\,\dfrac{du}{dx}$. Sensitivities multiply along a chain. Before we lift this to vectors, there is a slightly richer version that makes the vector case feel inevitable instead of surprising.
Suppose $y$ depends on $x$ through two intermediate variables, both of which are functions of $x$: $y = f\big(u_1(x), u_2(x)\big)$. Now $x$ has two separate routes to influence $y$, one through $u_1$ and one through $u_2$. The total derivative adds up both routes:
$$\frac{dy}{dx} = \frac{\partial y}{\partial u_1}\frac{du_1}{dx} + \frac{\partial y}{\partial u_2}\frac{du_2}{dx}$$
Read the structure, not just the symbols. Each term is one path from $x$ to $y$: travel from $x$ to $u_1$ (that is $du_1/dx$), then from $u_1$ to $y$ (that is $\partial y/\partial u_1$), and multiply the two sensitivities along that path. Do the same for the path through $u_2$. Then sum over all the paths. This "multiply along a path, sum over paths" idea is the entire content of the multivariable chain rule, and once it is in your head the matrix version in the next section is just bookkeeping for doing many paths at once.
This matters for networks specifically because a single neuron's output usually fans out to every neuron in the next layer. So when we ask how wiggling one neuron changes the cost, there is not one path back to the cost, there are many, one through each downstream neuron, and we will be summing over exactly those paths. That sum is where the matrix transpose in backprop comes from. Keep the picture handy.
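To see sum-over-paths produce a real number, here is a small check on a made-up example: $y = u_1 u_2$ with $u_1 = x^2$ and $u_2 = \sin x$. The functions and the test point are assumptions chosen for illustration; the only claim is that the two path products, added together, match a finite-difference estimate of $dy/dx$.

```python
import math

def y_of_x(x):
    """Illustrative composite: y = u1 * u2 with u1 = x**2 and u2 = sin(x)."""
    return (x ** 2) * math.sin(x)

def dy_dx_paths(x):
    u1, u2 = x ** 2, math.sin(x)
    path_1 = u2 * (2.0 * x)      # (dy/du1) * (du1/dx): the route through u1
    path_2 = u1 * math.cos(x)    # (dy/du2) * (du2/dx): the route through u2
    return path_1 + path_2       # sum over paths

x0 = 1.3
h = 1e-6
numeric = (y_of_x(x0 + h) - y_of_x(x0 - h)) / (2.0 * h)
print(dy_dx_paths(x0), numeric)
```

Dropping either path gives the wrong slope, which is exactly the mistake a network makes if it forgets that a neuron feeds several downstream neurons.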
Check your understanding
If $x$ influences $y$ through two intermediate variables, how do you combine the two routes?
Multiply the sensitivities along each path from x to y, then add the path totals together. Two routes means two products summed. This sum-over-paths is the multivariable chain rule.
5. The Jacobian Chain Rule
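Now the generalization that runs the whole machine. Suppose $\mathbf{u} = \mathbf{g}(\mathbf{x})$ and $\mathbf{y} = \mathbf{f}(\mathbf{u})$, so a vector $\mathbf{x}$ produces a vector $\mathbf{u}$, which produces a vector $\mathbf{y}$. The chain rule says: multiply the Jacobians.
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}}\,\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$$
The only thing to check is that the shapes fit, and they do, by design. Say $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{u} \in \mathbb{R}^k$, and $\mathbf{y} \in \mathbb{R}^m$. The first Jacobian $\partial \mathbf{y}/\partial \mathbf{u}$ has rows for $\mathbf{y}$ and columns for $\mathbf{u}$, so it is $m \times k$. The second, $\partial \mathbf{u}/\partial \mathbf{x}$, is $k \times n$. A matrix product is allowed exactly when the inner dimensions match, and they do (both are $k$), and the result is $m \times n$, which is precisely the shape $\partial \mathbf{y}/\partial \mathbf{x}$ should have, with rows for $\mathbf{y}$ and columns for $\mathbf{x}$. The output-by-input convention from section 3 is what guarantees this lines up.
Write out a single entry of that product and the sum-over-paths idea from the previous section walks right back in:
$$\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)_{ij} = \sum_{p=1}^{k} \frac{\partial y_i}{\partial u_p}\,\frac{\partial u_p}{\partial x_j}$$
Each intermediate component $u_p$ is one path from input $x_j$ to output $y_i$, the term $\partial u_p/\partial x_j$ is the first leg and $\partial y_i/\partial u_p$ is the second, and we sum over all $k$ paths. Matrix multiplication is quite literally an automated way of doing sum-over-paths for every input-output pair at once.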
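For a deeper stack you just keep multiplying. If $\mathbf{u}_1 = \mathbf{f}_1(\mathbf{x})$, $\mathbf{u}_2 = \mathbf{f}_2(\mathbf{u}_1)$, and $\mathbf{y} = \mathbf{f}_3(\mathbf{u}_2)$, then
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}_2}\,\frac{\partial \mathbf{u}_2}{\partial \mathbf{u}_1}\,\frac{\partial \mathbf{u}_1}{\partial \mathbf{x}}$$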
That is, in one line, what backpropagation does. A network is a deep composition of functions, and computing how the cost depends on an early weight means multiplying a string of Jacobians together as you travel from the output back to that weight. Everything from here is figuring out what those particular Jacobians are and how to multiply them in an order that avoids redoing work.
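A short numerical check of the Jacobian chain rule, under stated assumptions: the two functions below ($\mathbf{g}: \mathbb{R}^2 \to \mathbb{R}^3$ and $\mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2$) are invented for illustration, and `matmul` and `numeric_jacobian` are helper names introduced here. The product of the two hand-written Jacobians should match a finite-difference Jacobian of the composition.

```python
import math

def g(x):                       # R^2 -> R^3
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1 ** 2]

def f(u):                       # R^3 -> R^2
    u1, u2, u3 = u
    return [u1 + u2 * u3, math.sin(u1)]

def jac_g(x):                   # shape 3 x 2 (outputs u by inputs x)
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0], [2.0 * x1, 0.0]]

def jac_f(u):                   # shape 2 x 3 (outputs y by inputs u)
    u1, u2, u3 = u
    return [[1.0, u3, u2], [math.cos(u1), 0.0, 0.0]]

def matmul(a, b):
    """Plain matrix product; entry (i, j) sums over the shared inner index p."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def numeric_jacobian(func, x, h=1e-6):
    m, n = len(func(x)), len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(x); up[j] += h
        dn = list(x); dn[j] -= h
        f_up, f_dn = func(up), func(dn)
        for i in range(m):
            jac[i][j] = (f_up[i] - f_dn[i]) / (2.0 * h)
    return jac

x0 = [0.8, -0.4]
chain = matmul(jac_f(g(x0)), jac_g(x0))            # (2 x 3) times (3 x 2) = 2 x 2
direct = numeric_jacobian(lambda x: f(g(x)), x0)   # differentiate the composition
print(chain)
print(direct)
```

The inner sum in `matmul` is the sum over intermediate components, so the code is the entry-wise formula written out.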
Check your understanding
When you chain $\partial \mathbf{y}/\partial \mathbf{u}$ (size $m \times k$) with $\partial \mathbf{u}/\partial \mathbf{x}$ (size $k \times n$), what size is the result and why?
It is m-by-n. The inner dimensions (both k) match and cancel in the matrix product, leaving rows from the first matrix (m, the outputs y) and columns from the second (n, the inputs x), which is exactly the shape of the Jacobian of y with respect to x.
6. Forward Propagation, Written Out
We covered the forward pass conceptually in Chapter 1. Here we write it precisely, with the indexing from section 2, because the derivations later depend on getting these expressions exactly right.
A single neuron takes its inputs, forms a weighted sum, adds a bias, and applies the activation:
$$z = \mathbf{w} \cdot \mathbf{x} + b, \qquad a = \sigma(z)$$
Here $\mathbf{w} \cdot \mathbf{x}$ is the dot product (multiply matching entries, add them up), $b$ is the bias scalar, $z$ is the weighted input, and $a$ is the activation after the nonlinearity $\sigma$. Nothing new yet; this is Chapter 1 in symbols.
Now the indexing payoff promised earlier. Put every weight of layer $l$ into a matrix $W^{(l)}$ whose entry in row $j$, column $k$ is $w^{(l)}_{jk}$. Because we indexed destination-first, the $j$-th row of $W^{(l)}$ is exactly the list of weights belonging to neuron $j$. So the matrix-vector product $W^{(l)}\mathbf{a}^{(l-1)}$ produces, in its $j$-th entry, the dot product of neuron $j$'s weights with the previous layer's activations, which is precisely the weighted sum neuron $j$ wants. No transposes, no rearranging. That is the entire reason for the destination-first convention. The matrix has dimensions $n_l \times n_{l-1}$ (number of neurons in this layer, by number in the previous layer), and the bias $\mathbf{b}^{(l)}$ is a column with one entry per neuron. A whole layer is then:
$$\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma\big(\mathbf{z}^{(l)}\big)$$
where $\sigma$ applied to a vector means $\sigma$ applied to each entry. To run the whole network you set $\mathbf{a}^{(0)} = \mathbf{x}$, the raw input, and apply this pair of equations for $l = 1, 2, \dots, L$, and the final activation $\mathbf{a}^{(L)}$ is the prediction $\hat{\mathbf{y}}$.
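A small worked example to see the matrix do its job. Say layer $l-1$ has 3 neurons and layer $l$ has 2, so $W^{(l)}$ is $2 \times 3$:
$$W^{(l)} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}, \qquad \mathbf{a}^{(l-1)} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}$$
The product is
$$W^{(l)}\mathbf{a}^{(l-1)} = \begin{bmatrix} w_{11}a_1 + w_{12}a_2 + w_{13}a_3 \\ w_{21}a_1 + w_{22}a_2 + w_{23}a_3 \end{bmatrix}$$
and each row is one neuron's weighted sum. Add the biases, apply $\sigma$, and you have $\mathbf{a}^{(l)}$.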
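Stacking all the layers, the entire network is one deeply nested function:
$$\hat{\mathbf{y}} = \sigma\Big(W^{(L)}\,\sigma\big(W^{(L-1)} \cdots \sigma\big(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\big) \cdots + \mathbf{b}^{(L-1)}\big) + \mathbf{b}^{(L)}\Big)$$
This is the object we are about to differentiate. It looks fearsome, but it is just our layer equation wrapped around itself $L$ times, and the Jacobian chain rule from the last section is built exactly for peeling apart compositions like this.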
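The layer equation translates almost line for line into code. This is a minimal sketch in plain Python, assuming a sigmoid activation and a hypothetical 3-2-1 network whose weights are made-up numbers; the names `layer_forward` and `forward` are introduced here for illustration. Each weight matrix is stored as a list of rows, so row `j` holds the weights of destination neuron `j`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, a_prev, b):
    """One layer: z = W a_prev + b, then a = sigma(z) applied entry by entry."""
    z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
         for j in range(len(W))]          # row j of W belongs to neuron j
    a = [sigmoid(zj) for zj in z]
    return z, a

def forward(weights, biases, x):
    """Run every layer in order; the last activation is the prediction."""
    a = x                                 # a^(0) is the raw input
    for W, b in zip(weights, biases):
        _, a = layer_forward(W, a, b)
    return a

# Hypothetical 3 -> 2 -> 1 network (all values are illustrative).
weights = [[[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],   # 2 x 3
           [[0.7, -0.8]]]                           # 1 x 2
biases = [[0.0, 0.1], [-0.1]]

z1, a1 = layer_forward(weights[0], [1.0, 2.0, 3.0], biases[0])
prediction = forward(weights, biases, [1.0, 2.0, 3.0])
print(z1, a1, prediction)
```

No transpose appears anywhere in `layer_forward`, which is the destination-first convention paying off.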
Check your understanding
Why does the destination-first weight indexing let us write the layer as $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ with no transpose?
Because indexing destination-first makes row j of W the weights belonging to neuron j. The matrix-vector product then puts neuron j's weighted sum in entry j automatically, which is exactly what we want, so no rearranging is needed.
7. The Cost Function
The cost is the single number we minimize, and we treat it, as in Chapter 1, as a function of all the weights and biases. Collect them into $\theta$. The network defines $\hat{\mathbf{y}} = f(\mathbf{x}; \theta)$, and we have $N$ training pairs $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$ for $i = 1, \dots, N$, where here the superscript $(i)$ indexes training examples, not layers. The default cost for this chapter is mean squared error:
$$C(\theta) = \frac{1}{2N}\sum_{i=1}^{N} \big\|\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)}\big\|^2 = \frac{1}{2N}\sum_{i=1}^{N}\sum_{j} \big(\hat{y}^{(i)}_j - y^{(i)}_j\big)^2$$
Unpacking the symbols: the double bars $\|\cdot\|^2$ mean the squared length of a vector, which is just the sum of the squares of its entries, which is why the right-hand form expands into a sum over the output components $j$. The outer sum averages over all $N$ training examples, and the $\tfrac{1}{2}$ out front is pure convenience: when we differentiate a square we get a factor of $2$, and the $\tfrac{1}{2}$ is there to cancel it and keep the algebra tidy. It changes nothing about where the minimum is.
For the rest of the derivations we make one simplifying move: pretend there is a single training example, so $N = 1$ and $C = \tfrac{1}{2}\|\mathbf{a}^{(L)} - \mathbf{y}\|^2$. The full-dataset cost is just the average of the single-example costs, and the gradient of an average is the average of the gradients, so the structure of every derivative we find is identical; we would just average at the end. Carrying the sum around would only clutter the page.
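In code the cost is a few lines. The sketch below assumes predictions and targets are plain lists of numbers; the function names `example_cost` and `mse_cost` are made up for this illustration. It separates the single-example cost from the dataset average, mirroring the simplification above.

```python
def example_cost(prediction, target):
    """Half the squared length of (prediction - target) for one example."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(prediction, target))

def mse_cost(predictions, targets):
    """Average of the single-example costs over the whole dataset."""
    n = len(predictions)
    return sum(example_cost(p, t) for p, t in zip(predictions, targets)) / n

single = example_cost([1.0, 2.0], [0.0, 0.0])          # 0.5 * (1 + 4)
dataset = mse_cost([[1.0], [3.0]], [[0.0], [1.0]])     # (0.5 + 2.0) / 2
print(single, dataset)
```

Because `mse_cost` is literally an average of `example_cost` values, any gradient we derive for one example can be averaged in the same way.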
Check your understanding
Why is there a factor of $\tfrac{1}{2}$ in front of the mean squared error?
Pure convenience. Differentiating the square produces a factor of 2, and the 1/2 cancels it so the derivatives stay clean. It does not change where the minimum is.
8. Differentiating a Neuron's Operations
To differentiate the whole cost we first need the derivatives of the small operations a neuron performs, because the chain rule will glue these together. A layer does three things in sequence: it forms weighted sums (a matrix-vector product, or for one neuron a dot product, or for one weight a single multiplication), it adds the bias (a vector addition), and it applies the activation (an elementwise function). Let us find the Jacobian of each.
Begin with elementwise operations, since the activation is one and they have that pleasant diagonal structure. A binary elementwise function combines two vectors entry by entry: $y_i = f(u_i, v_i)$. The Hadamard product $\mathbf{y} = \mathbf{u} \odot \mathbf{v}$ (entry-wise multiplication) is the standard example. Because output $y_i$ depends only on $u_i$ and $v_i$ and nothing else, the Jacobian with respect to either input is diagonal. For the Hadamard product specifically, $y_i = u_i v_i$, so $\partial y_i/\partial u_i = v_i$ and $\partial y_i/\partial v_i = u_i$, giving
$$\frac{\partial \mathbf{y}}{\partial \mathbf{u}} = \mathrm{diag}(\mathbf{v}), \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{v}} = \mathrm{diag}(\mathbf{u})$$
In words, the derivative with respect to one operand is the diagonal matrix of the other operand. And recall the collapse from section 3: a diagonal Jacobian times a vector is just an element-wise product. So when a Hadamard product appears in a chain rule, the incoming gradient simply gets multiplied entry-by-entry by the other operand. This is exactly why the backprop equations are dotted with $\odot$ rather than carrying around full matrices.
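Next, addition, which is the bias step. Take $\mathbf{y} = \mathbf{u} + \mathbf{v}$, so $y_i = u_i + v_i$. Each output depends on its own $u_i$ and $v_i$ with derivative $1$ and on nothing else, so both Jacobians are the identity matrix (ones on the diagonal, zeros elsewhere):
$$\frac{\partial \mathbf{y}}{\partial \mathbf{u}} = I, \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{v}} = I$$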
This is the lazy case: addition passes gradients straight through, unchanged. Whenever you see a sum in the forward pass, the matching Jacobian is an identity, and identities do nothing in a product, so you can mentally skip them.
Finally, the lone sum that collapses a vector to a scalar, as in $z = \sum_k w_k x_k + b$ for one neuron. Treating "sum" as the function $s(\mathbf{u}) = \sum_i u_i$, its derivative with respect to each input is $1$, so its gradient is the all-ones vector $\mathbf{1}$. Differentiating a sum just adds up the upstream contributions equally. Pulling these together for a single neuron's $z = \sum_k w_k x_k + b$, the three derivatives we will reach for again and again are
$$\frac{\partial z}{\partial w_k} = x_k, \qquad \frac{\partial z}{\partial b} = 1, \qquad \frac{\partial z}{\partial x_k} = w_k$$
Each reads naturally: the weighted input is sensitive to a weight in proportion to the input riding on it ($x_k$), sensitive to the bias with rate $1$, and sensitive to an input in proportion to that input's weight ($w_k$).
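That leaves the activation. For a single neuron $a = \sigma(z)$, the derivative is just $da/dz = \sigma'(z)$. For a whole layer, $\sigma$ is applied elementwise, so by the diagonal rule the Jacobian is
$$\frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}} = \mathrm{diag}\big(\sigma'(\mathbf{z}^{(l)})\big)$$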
The single most useful instance is the sigmoid, because its derivative is unusually clean. With $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, differentiating gives
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\big(1 - \sigma(z)\big)$$
That last equality is worth savoring. The derivative of the sigmoid is expressible entirely in terms of the sigmoid's own value. So during the backward pass, if you already computed $\sigma(z)$ in the forward pass, you get its derivative almost for free, just multiply by $1 - \sigma(z)$, with no exponentials to recompute. For ReLU the derivative is even simpler: $1$ when $z > 0$ and $0$ when $z < 0$ (it is undefined right at $z = 0$, a single point we ignore in practice).
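The reuse trick is worth seeing concretely. In this sketch the derivative is computed from the stored activation alone and compared with a finite-difference slope of the sigmoid. The function names are illustrative, and returning `0.0` for the ReLU derivative at exactly zero is a convention assumed here, not something the math dictates.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime_from_activation(a):
    """Sigmoid derivative from its already-computed output a = sigmoid(z)."""
    return a * (1.0 - a)

def relu_prime(z):
    # Undefined at z == 0; returning 0.0 there is a convention for this sketch.
    return 1.0 if z > 0 else 0.0

h = 1e-6
checks = []
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    a = sigmoid(z)                                    # forward-pass value, stored
    closed_form = sigmoid_prime_from_activation(a)    # no new exponential needed
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
    checks.append((z, closed_form, numeric))
    print(z, closed_form, numeric)
```

The largest value in the table sits at $z = 0$, where the derivative equals one quarter, and it shrinks quickly on either side.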
Check your understanding
The sigmoid's derivative is $\sigma(z)\big(1 - \sigma(z)\big)$. Why is that convenient during the backward pass?
Because sigma(z) was already computed in the forward pass, you can get its derivative just by multiplying by (1 - sigma(z)), with no need to recompute any exponentials. The forward computation is reused in the backward pass.
9. The Cost Derivative for a Tiny Network
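Now we differentiate the cost by hand, on the smallest network that still shows the structure: one neuron per layer, two layers. Watching this done explicitly is what makes the eventual backprop equations feel obvious rather than arbitrary. The forward pass is
$$z_1 = w_1 x + b_1, \quad a_1 = \sigma(z_1), \quad z_2 = w_2 a_1 + b_2, \quad a_2 = \sigma(z_2), \quad C = \tfrac{1}{2}(a_2 - y)^2$$
We want $\partial C/\partial w_2$ and $\partial C/\partial w_1$.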
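Start with the last-layer weight, which is the short walk. The cost depends on $w_2$ only through $z_2$, then through $a_2$. So the chain rule runs three links deep:
$$\frac{\partial C}{\partial w_2} = \frac{\partial C}{\partial a_2}\,\frac{\partial a_2}{\partial z_2}\,\frac{\partial z_2}{\partial w_2}$$
Each link we already know how to compute. Differentiating $C = \tfrac{1}{2}(a_2 - y)^2$ with respect to $a_2$ gives $(a_2 - y)$. The activation link is $\sigma'(z_2)$. And since $z_2 = w_2 a_1 + b_2$, the last link is $a_1$. Multiply them:
$$\boxed{\frac{\partial C}{\partial w_2} = (a_2 - y)\,\sigma'(z_2)\,a_1}$$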
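Now the first-layer weight, the longer walk. The cost depends on $w_1$ through $z_1$, then $a_1$, then $z_2$, then $a_2$, so the chain is five links:
$$\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial a_2}\,\frac{\partial a_2}{\partial z_2}\,\frac{\partial z_2}{\partial a_1}\,\frac{\partial a_1}{\partial z_1}\,\frac{\partial z_1}{\partial w_1}$$
The new links: $\partial z_2/\partial a_1 = w_2$ (because $z_2 = w_2 a_1 + b_2$), and $\partial a_1/\partial z_1 = \sigma'(z_1)$, and finally $\partial z_1/\partial w_1 = x$. Multiply the whole string:
$$\boxed{\frac{\partial C}{\partial w_1} = (a_2 - y)\,\sigma'(z_2)\,w_2\,\sigma'(z_1)\,x}$$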
Stare at those two boxed results, because the patterns in them are the whole of backpropagation in miniature.
First, reuse. The first two factors of $\partial C/\partial w_1$, namely $(a_2 - y)\,\sigma'(z_2)$, are exactly the leading factors of $\partial C/\partial w_2$. We recomputed them. If instead we had saved that piece, we could have reached the first-layer gradient by just tacking on $w_2\,\sigma'(z_1)\,x$. That saved, reused quantity is the seed of the error signal $\delta$.
Second, the role of $\sigma'$. Every layer the chain passes through contributes one factor of $\sigma'$. For the sigmoid, $\sigma'$ never exceeds $\tfrac{1}{4}$, so in a deep network you are multiplying many numbers each at most a quarter, and the product collapses toward zero. That is the vanishing gradient problem from Chapter 1, now visible as a literal product of small factors stacking up.
Third, a structural rhythm. Both formulas share one skeleton: an error at the output, $(a_2 - y)$, propagated backward through the network, and then, at the very last step, multiplied by the input that fed the weight in question, which is $a_1$ for the second-layer weight and $x$ for the first. That skeleton, error-from-the-end times input-that-fed-the-weight, is precisely the form the four equations will take.
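The bias derivatives confirm the rhythm. Same network, but now the final link changes, because $\partial z_2/\partial b_2 = 1$ replaces the $a_1$, and $\partial z_1/\partial b_1 = 1$ replaces the $x$:
$$\frac{\partial C}{\partial b_2} = (a_2 - y)\,\sigma'(z_2), \qquad \frac{\partial C}{\partial b_1} = (a_2 - y)\,\sigma'(z_2)\,w_2\,\sigma'(z_1)$$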
So the bias gradient is the weight gradient with the trailing input factor stripped off. Said another way, the bias gradient is the propagated error at that neuron, which is the strongest hint yet that we should give that propagated error a name. We are about to.
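A hand derivation is only as good as its check, so here is one. The sketch below implements the two-layer, one-neuron-per-layer network with a sigmoid activation, evaluates the four hand-derived gradients, and compares each against a finite-difference slope of the cost. The parameter values, the input, and the target are made-up numbers, and the helper names are introduced for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w1, b1, w2, b2, x, y):
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return 0.5 * (a2 - y) ** 2

def hand_gradients(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    shared = (a2 - y) * sigmoid_prime(z2)      # factors common to every gradient
    back = shared * w2 * sigmoid_prime(z1)     # the shared piece pushed back a layer
    return {"w2": shared * a1, "b2": shared, "w1": back * x, "b1": back}

params = {"w1": 0.5, "b1": -0.3, "w2": -1.2, "b2": 0.8}   # illustrative values
x, y = 1.5, 1.0
analytic = hand_gradients(x=x, y=y, **params)

def numeric_gradient(name, h=1e-6):
    up = dict(params); up[name] += h
    dn = dict(params); dn[name] -= h
    return (cost(x=x, y=y, **up) - cost(x=x, y=y, **dn)) / (2.0 * h)

numeric = {name: numeric_gradient(name) for name in params}
for name in params:
    print(name, analytic[name], numeric[name])
```

Notice that `hand_gradients` already computes `shared` once and reuses it four times, which is the reuse pattern the section points out.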
Check your understanding
In $\partial C/\partial w_1$, which factors did we also already compute for $\partial C/\partial w_2$, and why does that matter?
The leading factors (a2 - y) and sigma'(z2) appear in both. Recomputing them is wasted work; if we save that shared quantity and reuse it, we get the earlier-layer gradient almost for free. That reusable quantity becomes the error signal delta in backprop.
10. Why the Naive Way Doesn't Scale
We just differentiated a network with two parameters by hand. The obvious thought is to do the same for a real network: write a chain rule for every weight and grind through it. Let us see exactly why that is hopeless, because the failure points straight at the fix.
First, what we are even computing. For a weight matrix $W^{(l)}$, the gradient $\partial C/\partial W^{(l)}$ is itself a matrix the same shape as $W^{(l)}$, whose $(j, k)$ entry is $\partial C/\partial w^{(l)}_{jk}$. So we need one partial derivative per weight, arranged in a grid.
Now the cost of getting them the naive way. A network with $L$ layers and roughly $n$ neurons per layer has on the order of $Ln^2$ weights. For each one, a chain rule walk back to the output passes through on the order of $Ln^2$ intermediate quantities. Multiply: the total work is on the order of $L^2n^4$ per training example, quadratic in the number of parameters. For anything beyond a toy, that is absurd, and worse, it is absurd while being wildly redundant, because, as we saw in the last section, the walks for different weights share almost all of their factors and we would be recomputing the same shared pieces over and over.
There is also an ugliness of representation lurking. The moment you try to write the Jacobian of a layer's activation vector with respect to its weight matrix, you are differentiating a vector with respect to a matrix, which is a three-dimensional array of numbers, a rank-3 tensor. The notation and bookkeeping get genuinely unpleasant fast.
Both problems have the same cure, and the last section already whispered it. Identify the one quantity that is reused across all of a layer's weight gradients, compute it once per layer, and propagate it backward instead of redoing full chain walks. That quantity is the error of a neuron. Naming it and propagating it turns the cost from quadratic in the number of parameters down to roughly the cost of a single forward pass, linear in the number of parameters. That move is backpropagation, and it is the rest of the chapter.
(For completeness: once we have all the gradients, we feed them to gradient descent, which we covered fully in Chapter 1. The one-line recap is that we update each parameter by stepping against its gradient, $\theta \leftarrow \theta - \eta\,\nabla_\theta C$ with learning rate $\eta$, and in practice we average the gradient over a small mini-batch of examples per step rather than the whole dataset, repeating over many epochs. Nothing about that changes here; backprop is just the efficient way to get the $\nabla_\theta C$ that gradient descent consumes.)
Check your understanding
What single idea turns the naive quadratic cost into something roughly as cheap as one forward pass?
Identify the quantity that all of a layer's weight gradients share (the neuron's error signal), compute it once per neuron in a single backward pass, and reuse it, instead of re-walking a full chain rule for every weight. That reuse is backpropagation.
11. The Error of a Node
Here is the quantity that fixes everything. For each neuron in each layer, define its error as the sensitivity of the cost to that neuron's weighted input:
$$\delta^{(l)}_j = \frac{\partial C}{\partial z^{(l)}_j}$$
In words, $\delta^{(l)}_j$ asks: if I wiggle the weighted input of neuron $j$ in layer $l$ by a hair, how much does the final cost move? It is a single number per neuron, and we gather a layer's worth into a column vector $\boldsymbol{\delta}^{(l)}$.
Why define the error at $z$, the weighted input, rather than at $a$, the activation, or at the weights directly? Because $z$ sits at the perfect hinge. Everything upstream of $z$, namely the weights $W$, the bias $\mathbf{b}$, and the incoming activations, feeds into $z$ through plain linear operations, so once we know how sensitive the cost is to $z$, the sensitivities to those upstream parameters are a trivial extra step. And everything downstream of $z$, the entire rest of the network up to the cost, is already bundled inside $\delta$ by definition. So $\delta$ cleanly separates the easy local part from the hard global part, and the global part is exactly what we will pass backward.
The strategy is now sharp. Compute $\boldsymbol{\delta}^{(L)}$ at the output layer. Then use it to compute $\boldsymbol{\delta}^{(L-1)}$, then $\boldsymbol{\delta}^{(L-2)}$, and so on backward to $\boldsymbol{\delta}^{(1)}$. Then, for every layer, read off the weight and bias gradients from $\boldsymbol{\delta}^{(l)}$ in one cheap step. Four equations make this concrete: one to start at the output, one to pass it back a layer, and two to read off the parameter gradients. We derive them next.
Check your understanding
Why define the error at the weighted input $z$ rather than at the activation $a$?
Because z is the hinge between the linear part (weights, bias, incoming activations all feed z linearly) and the nonlinear part (the activation and everything after). Knowing the cost's sensitivity to z makes the weight and bias gradients a trivial next step, while everything downstream is already captured inside delta.
12. The Four Equations of Backpropagation
These are the four equations Chapter 1 handed you as facts. Now we derive each one.
BP1: the error at the output layer
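The cost depends on the output neuron's weighted input $z^{(L)}_j$ only through its activation $a^{(L)}_j$. So a two-link chain rule gives
$$\delta^{(L)}_j = \frac{\partial C}{\partial z^{(L)}_j} = \frac{\partial C}{\partial a^{(L)}_j}\,\sigma'\big(z^{(L)}_j\big)$$
Stacking this over all output neurons $j$, and using the diagonal-Jacobian collapse (an elementwise activation contributes a diagonal Jacobian, which acts as element-wise multiplication), the vector form is
$$\boxed{\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} C \odot \sigma'\big(\mathbf{z}^{(L)}\big)}$$
Here $\nabla_{\mathbf{a}} C$ is the gradient of the cost with respect to the output activations, and $\odot$ is the element-wise product. For the mean squared error cost, $\nabla_{\mathbf{a}} C = \mathbf{a}^{(L)} - \mathbf{y}$, so $\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(L)})$. Notice this matches the leading factors from our hand derivation in section 9. Good sign.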
BP2: the error of any earlier layer
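This is the keystone, the equation that does the actual propagating. We want $\boldsymbol{\delta}^{(l)}$ assuming we already have $\boldsymbol{\delta}^{(l+1)}$ from the layer ahead.
Start from the definition and ask how $z^{(l)}_j$ reaches the cost. It does so only through the next layer, and specifically through every neuron $k$ of the next layer, because neuron $j$'s activation fans out to all of them. That fan-out is the sum-over-paths situation from section 4. So
$$\delta^{(l)}_j = \frac{\partial C}{\partial z^{(l)}_j} = \sum_k \frac{\partial C}{\partial z^{(l+1)}_k}\,\frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} = \sum_k \delta^{(l+1)}_k\,\frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j}$$
where we recognized $\partial C/\partial z^{(l+1)}_k$ as exactly $\delta^{(l+1)}_k$, the error we already have. Now we just need $\partial z^{(l+1)}_k/\partial z^{(l)}_j$. Write out the next layer's weighted input:
$$z^{(l+1)}_k = \sum_i w^{(l+1)}_{ki}\,\sigma\big(z^{(l)}_i\big) + b^{(l+1)}_k$$
Differentiate with respect to $z^{(l)}_j$. Only the $i = j$ term in the sum depends on it, and it brings down $w^{(l+1)}_{kj}$ times the activation's derivative:
$$\frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} = w^{(l+1)}_{kj}\,\sigma'\big(z^{(l)}_j\big)$$
Substitute back, and pull the $\sigma'(z^{(l)}_j)$ out of the sum since it does not depend on $k$:
$$\delta^{(l)}_j = \sigma'\big(z^{(l)}_j\big)\sum_k w^{(l+1)}_{kj}\,\delta^{(l+1)}_k$$
Look closely at that remaining sum, $\sum_k w^{(l+1)}_{kj}\,\delta^{(l+1)}_k$. In the original matrix $W^{(l+1)}$, the row index is the destination and the column index is the source. Here we are summing over the destination $k$ and the free index is the source $j$, which means $j$ is now playing the role of a row. That is exactly the transpose $(W^{(l+1)})^T$. So the sum is the $j$-th entry of $(W^{(l+1)})^T\boldsymbol{\delta}^{(l+1)}$, and the whole thing becomes
$$\boxed{\boldsymbol{\delta}^{(l)} = \Big(\big(W^{(l+1)}\big)^T\boldsymbol{\delta}^{(l+1)}\Big) \odot \sigma'\big(\mathbf{z}^{(l)}\big)}$$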
This is the equation that propagates the error backward. Read it as two moves. The transposed weight matrix $(W^{(l+1)})^T$ takes the error from the layer ahead and sends it back through the linear connections, handing each upstream neuron a share of the blame in proportion to how strongly it fed forward. Then the element-wise $\sigma'(\mathbf{z}^{(l)})$ scales each neuron's share by how responsive its activation actually was at that point. And here is the vanishing gradient made precise: if a neuron sat on a flat tail of the sigmoid, $\sigma'(z)$ is nearly zero, so it throttles that neuron's error toward nothing, and every neuron further back gets starved of signal. The transpose is the same matrix doing the forward fan-out, now run in reverse.
BP3: the gradient with respect to any bias
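Easy now that we have $\delta$. The bias $b^{(l)}_j$ enters the cost only through $z^{(l)}_j$, and $\partial z^{(l)}_j/\partial b^{(l)}_j = 1$ from section 8. So
$$\frac{\partial C}{\partial b^{(l)}_j} = \frac{\partial C}{\partial z^{(l)}_j}\,\frac{\partial z^{(l)}_j}{\partial b^{(l)}_j} = \delta^{(l)}_j$$
or in vector form,
$$\boxed{\frac{\partial C}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}}$$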
The bias gradient simply is the error. This formalizes the hint we noticed at the end of section 9.
BP4: the gradient with respect to any weight
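The weight $w^{(l)}_{jk}$ enters the cost only through $z^{(l)}_j$, and since $z^{(l)}_j = \sum_m w^{(l)}_{jm}\,a^{(l-1)}_m + b^{(l)}_j$, only the $m = k$ term depends on it, giving $\partial z^{(l)}_j/\partial w^{(l)}_{jk} = a^{(l-1)}_k$. So
$$\frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{\partial C}{\partial z^{(l)}_j}\,\frac{\partial z^{(l)}_j}{\partial w^{(l)}_{jk}} = a^{(l-1)}_k\,\delta^{(l)}_j$$
or, putting destination and source in their natural reading order,
$$\boxed{\frac{\partial C}{\partial w^{(l)}_{jk}} = \delta^{(l)}_j\,a^{(l-1)}_k}$$
This is the structural rhythm from section 9, now exact and general: a weight's gradient is the error of the neuron it feeds ($\delta^{(l)}_j$) times the activation it receives ($a^{(l-1)}_k$). Two numbers, multiplied.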
Vectorizing BP4
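Equation BP4 gives one matrix entry at a time. To get the gradient of the whole matrix in a single expression, notice that $\partial C/\partial W^{(l)}$ has the same shape as $W^{(l)}$, that is $n_l \times n_{l-1}$ (where $n_l$ is the number of neurons in layer $l$), with $(j, k)$ entry equal to $\delta^{(l)}_j\,a^{(l-1)}_k$. A matrix whose $(j, k)$ entry is the product of the $j$-th entry of one vector and the $k$-th entry of another is exactly an outer product:
$$\boxed{\frac{\partial C}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)}\big(\mathbf{a}^{(l-1)}\big)^T}$$
The outer product of a column $\mathbf{u}$ and a row $\mathbf{v}^T$ is the matrix whose $(j, k)$ entry is $u_j v_k$. Here $\boldsymbol{\delta}^{(l)}$ is $n_l \times 1$ and $(\mathbf{a}^{(l-1)})^T$ is $1 \times n_{l-1}$, so their product is $n_l \times n_{l-1}$, matching $W^{(l)}$ exactly. Shapes check. Those four boxed equations are the complete mathematical content of backpropagation.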
BP4-vec: ∂C/∂W = δ aᵀ — every entry is one δ times one activation.
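The four equations fit in one short function. What follows is a sketch under stated assumptions, not a reference implementation: it uses a sigmoid activation and the single-example squared-error cost, stores each weight matrix as a list of rows, and uses zero-based Python indices, so `weights[0]` plays the role of $W^{(1)}$. The network sizes and all numbers are made up. One weight gradient is then checked against a finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return every weighted input zs and every activation (acts[0] is x)."""
    acts, zs = [x], []
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bj
             for row, bj in zip(W, b)]
        zs.append(z)
        acts.append([sigmoid(v) for v in z])
    return zs, acts

def cost(weights, biases, x, y):
    _, acts = forward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(acts[-1], y))

def backprop(weights, biases, x, y):
    zs, acts = forward(weights, biases, x)
    sp = [[sigmoid(v) * (1.0 - sigmoid(v)) for v in z] for z in zs]  # sigma'(z)
    # BP1: output error, (a - y) times sigma'(z), entry by entry.
    delta = [(a - t) * s for a, t, s in zip(acts[-1], y, sp[-1])]
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads_b[l] = list(delta)                                   # BP3
        grads_W[l] = [[d * a for a in acts[l]] for d in delta]     # BP4, outer product
        if l > 0:
            W = weights[l]
            # BP2: sum over destination k (transpose), then scale by sigma'(z).
            delta = [sum(W[k][j] * delta[k] for k in range(len(W))) * sp[l - 1][j]
                     for j in range(len(W[0]))]
    return grads_W, grads_b

# Hypothetical 2 -> 3 -> 1 network with illustrative numbers.
weights = [[[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]], [[0.6, -0.1, 0.3]]]
biases = [[0.1, -0.2, 0.0], [0.05]]
x, y = [1.0, -1.0], [1.0]
gW, gb = backprop(weights, biases, x, y)

h = 1e-6                      # finite-difference check on one first-layer weight
weights[0][1][0] += h
c_up = cost(weights, biases, x, y)
weights[0][1][0] -= 2.0 * h
c_dn = cost(weights, biases, x, y)
weights[0][1][0] += h         # restore the original value
numeric = (c_up - c_dn) / (2.0 * h)
print(gW[0][1][0], numeric)
```

The index swap `W[k][j]` inside the BP2 line is the transpose: the sum runs over rows (destinations) while the free index is the column (source).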
Check your understanding
In BP2, why does the weight matrix appear transposed?
Because we sum over the destination index k of the next layer while the free index is the source j. In W the row is the destination and the column is the source, so summing over the destination and keeping the source as the row is exactly the transpose. Intuitively, the same weights that fanned the signal forward now route the error backward.
13. Tying It Back Together
Before trusting the four equations, let us confirm they reproduce the answers we got the slow way in section 9, on the two-layer, one-neuron network. (For a single neuron per layer, every vector is just a number and every matrix transpose is itself, so the equations look scalar.)
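From BP1: $\delta_2 = (a_2 - y)\,\sigma'(z_2)$.
From BP4: $\partial C/\partial w_2 = \delta_2\,a_1 = (a_2 - y)\,\sigma'(z_2)\,a_1$. That is exactly the boxed $\partial C/\partial w_2$ from section 9.
From BP2: $\delta_1 = w_2\,\delta_2\,\sigma'(z_1) = (a_2 - y)\,\sigma'(z_2)\,w_2\,\sigma'(z_1)$.
From BP4 again: $\partial C/\partial w_1 = \delta_1\,x = (a_2 - y)\,\sigma'(z_2)\,w_2\,\sigma'(z_1)\,x$. Exactly the section 9 result.
And from BP3, $\partial C/\partial b_2 = \delta_2$ and $\partial C/\partial b_1 = \delta_1$, matching the bias derivatives we found by hand.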
Identical answers. So what did backprop actually buy us, if the results are the same? The computation. The slow way rederived each parameter's gradient from scratch with a fresh long chain walk, repeating shared factors every time. Backprop computes $\delta_2$ and $\delta_1$ once each, in a single backward sweep, then reads off every gradient in one cheap multiply. Same destination, vastly cheaper road, and the cost is now linear in the number of parameters instead of quadratic.
Check your understanding
Backprop and the slow chain-rule method give the same gradients. So what is the actual advantage of backprop?
Only the cost of computing them. Both produce identical gradients, but the slow method redoes shared work for every parameter, while backprop computes each layer's error once in a single backward pass and reuses it, dropping the cost from roughly quadratic to linear in the number of parameters.
14. The Algorithm, Start to Finish
Here is everything assembled, the loop that trains a network.
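For each training example $(\mathbf{x}, \mathbf{y})$:
1. Forward pass. Set $\mathbf{a}^{(0)} = \mathbf{x}$. For $l = 1, 2, \dots, L$, compute and store
$$\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma\big(\mathbf{z}^{(l)}\big)$$
Keep every $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$, because the backward pass needs them.
2. Output error. Compute, by BP1,
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} C \odot \sigma'\big(\mathbf{z}^{(L)}\big)$$
3. Backpropagate. For $l = L-1, L-2, \dots, 1$, compute, by BP2,
$$\boldsymbol{\delta}^{(l)} = \Big(\big(W^{(l+1)}\big)^T\boldsymbol{\delta}^{(l+1)}\Big) \odot \sigma'\big(\mathbf{z}^{(l)}\big)$$
4. Read off the gradients. For every layer, by BP3 and BP4-vec,
$$\frac{\partial C}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}, \qquad \frac{\partial C}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)}\big(\mathbf{a}^{(l-1)}\big)^T$$
5. Update. For a mini-batch, average the gradients over its examples, then step each parameter against its gradient with learning rate $\eta$:
$$W^{(l)} \leftarrow W^{(l)} - \eta\,\frac{\partial C}{\partial W^{(l)}}, \qquad \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta\,\frac{\partial C}{\partial \mathbf{b}^{(l)}}$$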
Repeat over many mini-batches and many epochs, and the network learns. That update step is the gradient descent from Chapter 1; backpropagation is simply the efficient engine that supplies the gradients it runs on.
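The five steps can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the sigmoid activation, the squared-error cost, a hypothetical 2-2-1 network with hand-picked starting weights, a toy dataset (the logical OR of two inputs), a learning rate of 0.5, and a single batch holding all four examples. The function names are introduced for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    acts, zs = [x], []                           # step 1: store every z and a
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bj
             for row, bj in zip(W, b)]
        zs.append(z)
        acts.append([sigmoid(v) for v in z])
    return zs, acts

def backprop(weights, biases, x, y):
    zs, acts = forward(weights, biases, x)
    sp = [[sigmoid(v) * (1.0 - sigmoid(v)) for v in z] for z in zs]
    delta = [(a - t) * s for a, t, s in zip(acts[-1], y, sp[-1])]    # step 2 (BP1)
    gW, gb = [None] * len(weights), [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        gb[l] = list(delta)                                          # step 4 (BP3)
        gW[l] = [[d * a for a in acts[l]] for d in delta]            # step 4 (BP4)
        if l > 0:                                                    # step 3 (BP2)
            W = weights[l]
            delta = [sum(W[k][j] * delta[k] for k in range(len(W))) * sp[l - 1][j]
                     for j in range(len(W[0]))]
    return gW, gb

def batch_cost(weights, biases, batch):
    total = 0.0
    for x, y in batch:
        out = forward(weights, biases, x)[1][-1]
        total += 0.5 * sum((a - t) ** 2 for a, t in zip(out, y))
    return total / len(batch)

def train_step(weights, biases, batch, eta):
    """Step 5: average the per-example gradients, then step against them."""
    n = len(batch)
    sum_W = [[[0.0] * len(row) for row in W] for W in weights]
    sum_b = [[0.0] * len(b) for b in biases]
    for x, y in batch:
        gW, gb = backprop(weights, biases, x, y)
        for l in range(len(weights)):
            for j in range(len(gb[l])):
                sum_b[l][j] += gb[l][j]
                for k in range(len(gW[l][j])):
                    sum_W[l][j][k] += gW[l][j][k]
    for l in range(len(weights)):
        for j in range(len(biases[l])):
            biases[l][j] -= eta * sum_b[l][j] / n
            for k in range(len(weights[l][j])):
                weights[l][j][k] -= eta * sum_W[l][j][k] / n

# Toy data: logical OR. One batch containing all four examples.
batch = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
         ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
weights = [[[0.3, -0.2], [-0.4, 0.5]], [[0.2, -0.3]]]   # illustrative start
biases = [[0.0, 0.0], [0.0]]

cost_before = batch_cost(weights, biases, batch)
for epoch in range(500):
    train_step(weights, biases, batch, eta=0.5)
cost_after = batch_cost(weights, biases, batch)
print(cost_before, cost_after)
```

A practical implementation would use a numerical library and vectorize the loops, but the structure stays the same: one forward sweep, one backward sweep, one update.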
Check your understanding
During the forward pass, why do we store every $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$ instead of discarding them?
Because the backward pass needs them. BP2 needs each layer's z to evaluate sigma'(z), and BP4 needs each layer's incoming activation a^(l-1) to form the weight gradient. Recomputing them would waste the very work the forward pass already did.