Backpropagation Explained using English Words*

Propagating you all way the from the back.

Mathematics

Creating Models

*Most words are in English.

Author

Salman Naqvi

Published

Monday, 07 August 2023

This post was edited on Wednesday, 9 August 2023

This guide assumes a basic understanding of derivatives and matrices.

Backpropagation sounds and looks daunting. It doesn’t need to be. In fact, backpropagation is really just a fancy word for the chain rule. Implementing a backpropagation algorithm is simply implementing one big fat chain rule equation.

Let’s remind ourselves of the chain rule. The chain rule lets us figure out how much a given variable indirectly changes with respect to another variable. Take the example below.

\[ \begin{align} y &= 3u \\ u &= 7 + x^2 \end{align} \]

We want to figure out how much \(y\) changes with each increment in \(x\). The problem is that \(x\) doesn’t direcly change \(y\). Rather, \(x\) changes \(u\) which in turn changes \(y\).

The chain rule allows us to solve this problem. In this case, the chain rule tells us that we can figure out how much \(x\) indirecly changes \(y\) by multiplying the derivative of \(y\) with respect to \(u\), and the derivative of \(u\) with respect to \(x\).

\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \]

Aaand I’ve just described backpropagation in a nutshell. That’s all there really is to it. The only difference is that in a neural network there are many more intermediate variables and functions, and that we want to find out how the weights indirectly change the loss.

Let’s see this tangibly in action.

We have the following neural network comprised of two layers: the first layer contains the affine function¹ together with the ReLU, while the second layer contains only the affine function. The loss, which is MSE (Mean Squared Error), will then be calculated from the output of the second layer.

¹ Affine function is a fancy name for the linear function

flowchart LR
  subgraph A [Layer 1]
    direction LR
    id1[Affine Function] --> id2[ReLU]
  end

  subgraph B [Layer 2]
    direction LR
    id2 --> id3[Affine Function]
  end
  
  subgraph C [Loss Function]
    direction LR
    id3 --> id4[MSE]
  end

Mathematically speaking, the first layer with a single sample \(x\) looks like this.

\[ \text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \]

The second layer looks like this.

\[ \text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 \]

And the loss function looks like this.

\[ \frac{(y - (\text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2}{2} \]

MSE in its most basic form looks like this. \[ \text{MSE} = \frac{(y - x)^2}{1} \]

If we have multiple data points, then it looks like this.

\[ \text{MSE} = \frac{(y_1 - x_1)^2+(y_2 - x_2)^2+(y_3 - x_3)^2}{3} \]

However, when working with multiple samples, the mean squared error comes out looking like this, where \(N\) represents the total number of samples.

\[ \frac{(\vec{\rm{y}}_1 - (\text{max}(0, \vec{\rm{x}}_1 \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 + (\vec{\rm{y}}_2 - (\text{max}(0, \vec{\rm{x}}_2 \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 + \cdots + (\vec{\rm{y}}_N - (\text{max}(0, \vec{\rm{x}}_N \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2}{N} \]

Or more simply…²

² \(\sum\) is known as the summation or sigma operator. If we have the equation \(\sum^4_{x=1}2x\), it means sum the equation \(2x\) for all values of \(x\) from 1 to 4. Find out more here.

\[ \frac{\sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2}{N} \]

…or even more simply.

\[ \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 \]

Our goal for the rest of this guide is to derive the gradients of \(w_1\).

The equation above looks quite the mouthful though. One might even say scary. How would you even apply the chain rule here? How would you use the chain rule to derive the gradients of the weights and biases?

Let’s simplify things by introducing a bunch of intermediate variables. We’ll begin by substituting the innermost pieces of the equation, and then gradually make our way out.

\[ \begin{align} u_1 &= \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 \\ u_2 &= \text{max}(0, u_1) \\ u_3 &= u_2 \cdot \vec{\rm{w}}_2 + b_2 \\ u_4 &= \vec{\rm{y}}_i - u_3 \\ u_5 &= u_4^2 \end{align} \]

The menacing equation above now gradually simplifies into the cute equation below.

\[ \begin{align} \text{MSE} &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 \\ &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, u_1) \cdot \vec{\rm{w}}_2 + b_2))^2 \\ &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (u_2 \cdot \vec{\rm{w}}_2 + b_2))^2 \\ &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - u_3)^2 \\ &= \frac{1}{N} \sum^N_{i=1} (u_4)^2 \\ &= \frac{1}{N} \sum^N_{i=1} u_5 \end{align} \]

Very cute, hey?

In this cuter version of the equation, it is visible that incrementing \(\vec{\rm{w}}_1\) does not directly change the MSE. Rather, incrementing \(\vec{\rm{w}}_1\) changes \(u_1\), which changes \(u_2\), which changes \(u_3\), which changes \(u_4\), which in turn changes \(u_5\).

\[ \frac{\partial}{\partial \vec{\rm{w}}_1} \text{MSE} = \frac{\partial}{\partial \vec{\rm{w}}_1} \frac{1}{N} \sum^N_{i=1} u_5 = \frac{1}{N} \sum^N_{i=1} \frac{\partial u^5}{\partial \vec{\rm{w}}_1} = \frac{1}{N} \sum^N_{i=1} \frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1} \cdot \frac{\partial u_1}{\partial \vec{\rm{w}}_1} \]

See? Just a big, fat, and simple chain rule problem.

\(\partial\) is a curly “d” and can be read as “curly d”, or simply as “d”. \(\partial\) notation will be used below, due to a concept known as partial derivatives. We will not go into this concept here, however, this is a great brief rundown on partial derivatives.

Now we can tackle finding the gradients for \(w_1\). To do so, let’s find the gradients of each intermediate variable.³ ⁴

³ If needed, get a refresher of the derivative rules here.

⁴ \(\begin{cases}\end{cases}\) denotes a piecewise function. The most simplest piecewise function returns one calculation if a condition is met, and another calculation if the condition is not met. It can be thought of as an if-else statement in programming. Find out more here.

\[ \begin{align*} \text{gradient of } u_4 &= \frac{\partial u_5}{\partial u_4} &&= \frac{\partial}{\partial u_4} u_4^2 &&&= 2u_4 \\ \text{gradient of } u_3 &= \frac{\partial u_4}{\partial u_3} &&= \frac{\partial}{\partial u_3} \vec{\rm{y}}_i - u_3 &&&= -1 \\ \text{gradient of } u_2 &= \frac{\partial u_3}{\partial u_2} &&= \frac{\partial}{\partial u_2} u_2 \cdot \vec{\rm{w}}_2 + b_2 &&&= \vec{\rm{w}}^T_2 \\ \text{gradient of } u_1 &= \frac{\partial u_2}{\partial u_1} &&= \frac{\partial}{\partial u_1} \text{max}(0, u_1) &&&= \begin{cases} 0 & u_1 ≤ 0 \\ 1 & u_1 > 0 \end{cases} \\ \text{gradient of } \vec{\rm{w}}_1 &= \frac{\partial u_1}{\partial \vec{\rm{w}}_1} &&= \frac{\partial}{\partial w_1} \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 &&&= \vec{\rm{x}}^T_i \end{align*} \]

Now we multiply everything together.

\[ \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{\partial u_5}{\partial \vec{\rm{w}}_1} = \frac{1}{N} \sum^N_{i=1} (2u_4) \cdot (-1) \cdot \left(\vec{\rm{w}}^T_2\right) \cdot \left(\begin{cases} 0 & u_1 ≤ 0 \\ 1 & u_1 > 0 \end{cases}\right) \cdot \left(\vec{\rm{x}}^T_i\right) \]

And it all eventually expands out to the following.

Expansion

\[ \begin{align} \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{\partial u_5}{\partial \vec{\rm{w}}_1} &= \frac{1}{N}\sum^N_{i=1} (2u_4) \cdot (-1) \cdot (\vec{\rm{w}}^T_2) \cdot \begin{cases}0 & u_1 ≤ 0 \\ 1 & u_1 > 0\end{cases} \cdot \vec{\rm{x}}_i^T \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2u_4 \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - u_3) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (u_2 \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (\text{max}(0, u_1) \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0\end{cases} \end{align} \]

\[ \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ \frac{1}{N} \sum^N_{i=1} -2(\vec{\rm{y}}_i - \text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2) \cdot \vec{\rm{w}}^T_2 \cdot\vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \]

We can further simplify by taking \(-1\) and \(2\) common.

\[ \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ \frac{2}{N} \sum^N_{i=1} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \]

We can simplify even further, by letting \(e_i = \text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i\). The \(e\) stands for “error”.

\[ \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ \frac{2}{N} \sum^N_{i=1} e_i \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \]

And there you go! We’ve derived the formula that will allow us to calculate the gradients of \(\vec{\rm{w}}_1\).

When implementing backpropagation in a program, it is often better to implement the entire equation in pieces, as opposed to a single line of code, through storing the result of each intermediate gradient. These intermediate gradients can be reused to calculate the gradients of another variable, such as the bias \(b_1\).

Instead of implementing the following in a single line of code.

\[ \frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1} \cdot \frac{\partial u_1}{\partial \vec{w}_1} \]

We can instead first calculate the gradients of \(u_4\).

\[ u_{4_g} = \frac{\partial u_5}{\partial u_4} \]

Then calculate the gradients of \(u_3\) and multiply it with it with the gradients of \(u_4\).

\[ u_{3_g} = u_{4_g} \cdot \frac{\partial u_4}{\partial u_3} = \left(\frac{\partial u_5}{\partial u_4}\right) \cdot \frac{\partial u_4}{\partial u_3} \]

Then multiply the product above with the gradients of \(u_2\).

\[ u_{2_g} = u_{3_g} \cdot \frac{\partial u_3}{\partial u_2} = \left(\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3}\right) \cdot \frac{\partial u_3}{\partial u_2} \]

Then multiply the product above with the gradients of \(u_1\).

\[ u_{1_g} = u_{2_g} \cdot \frac{\partial u_2}{\partial u_1} = \left(\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2}\right) \cdot \frac{\partial u_2}{\partial u_1} \]

And finally multiply the product above with the gradients of \(\vec{\rm{w}}_1\)

\[ \vec{\rm{w}}_{1_g} = u_{1_g} \cdot \frac{\partial u_1}{\partial \vec{w}_1} = \left(\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1}\right) \cdot \frac{\partial u_1}{\partial \vec{w}_1} \]

Let’s see this using Python instead.

The following is our neural network.

1l1 = relu(lin(trn_x, w1, b1))
2l2 = lin(l1, w2, b2)
3loss = mse(l2, trn_y)

1: l1 is the first layer, \(\text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1)\).
2: l2 is the second layer, \(\text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 = l_1 \cdot \vec{\rm{w}}_2 + b_2\).
3: loss is the MSE, \(\frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 = \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - l_2)^2\).

Substitutions

\[ \begin{align} u_1 &= \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 \\ u_2 &= \text{max}(0, u_1) \\ u_3 &= u_2 \cdot \vec{\rm{w}}_2 + b_2 \\ u_4 &= \vec{\rm{y}}_i - u_3 \\ u_5 &= u_4^2 \end{align} \]

Derivatives of the Substitutions

\[ \begin{alignat}{3} \text{gradient of } u_4 &= \frac{\partial u_5}{\partial u_4} &&= \frac{\partial}{\partial u_4} u_4^2 &&&= 2u_4 \\ \text{gradient of } u_3 &= \frac{\partial u_4}{\partial u_3} &&= \frac{\partial}{\partial u_3} \vec{\rm{y}}_i - u_3 &&&= -1 \\ \text{gradient of } u_2 &= \frac{\partial u_3}{\partial u_2} &&= \frac{\partial}{\partial u_2} u_2 \cdot \vec{\rm{w}}_2 + b_2 &&&= \vec{\rm{w}}^T_2 \\ \text{gradient of } u_1 &= \frac{\partial \vec{\rm{u}}_2}{\partial \vec{\rm{u}}_1} &&= \frac{\partial}{\partial u_1} \text{max}(0, u_1) &&&= \begin{cases} 0 & \vec{\rm{u}}_1 ≤ 0 \\ 1 & \vec{\rm{u}}_1 > 0 \end{cases} \\ \text{gradient of } \vec{\rm{w}}_1 &= \frac{\partial u_1}{\partial \vec{\rm{w}}_1} &&= \frac{\partial}{\partial w_1} \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 &&&= \vec{\rm{x}}^T_i \end{alignat} \]

First we need to calculate the gradients of \(u_4\).

diff = (trn_y - l2)
1loss.g = (2/trn_x.shape[0]) * diff

1: trn_x.shape[0], in this case, returns the total number of samples.

Next are the gradients of \(u_3\)

diff.g = loss.g * -1

Then the gradients of \(u_2\)

l2.g = diff.g @ w2.T

Then the gradients of \(u_1\)

l1.g = l2.g * (l1 > 0).float()

And finally the gradients of \(\vec{\rm{w}}_1\).

w1.g = (l1.g * trn_x).sum()

The equation for the gradient of \(b_1\) is almost the same as the equation for the gradients of \(w_1\), save for the last line where we do not have to matrix multiply with \(\vec{\rm{x}}_i\). Therefore, we can reuse all previous gradient calculations to find the gradient of \(b_1\).

b1.g = (l1.g * 1).sum()

When multiplying various tensors together, make sure their shapes are compatible. Shape manipulations have been omitted above for simplicity.

Conclusion

And that’s all there really is to backpropagation; think of it a one big chain rule problem.

To make sure you’ve got it hammered down, get out a pen and paper and derivate the equations that would compute the gradients of \(\vec{\rm{x}}_i\), \(b_1\), \(\vec{\rm{w}}_2\), and \(u_2\) respectively with respect to the MSE.

And if you really want to hammer down your understanding on what’s happening, then I highly recommend reading The Matrix Calculus You Need For Deep Learning. I’ve also compiled backpropagation practice questions from this paper!

Answers

\[ \begin{align} \frac{\partial \text{MSE}}{\partial b_1} &= \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ \frac{2}{N} \sum^N_{i=1} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \vec{\rm{w}}_2^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \\ \frac{\partial \text{MSE}}{\partial \vec{\rm{x}}_i} &= \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ \frac{2}{N} \sum^N_{i=1} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{w}}_1^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \\ \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_2} &= \frac{2}{N} \sum^N_{i=1} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \\ \frac{\partial \text{MSE}}{\partial b_2} &= \frac{2}{N} \sum^N_{i=1} \text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i \end{align} \]

If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them down in the comment section below!