Deep Learning Fundamentals - Classic Edition

Backpropagation explained | Part 4 - Calculating the gradient

Hey, what's going on everyone? In this episode, we're finally going to see how backpropagation calculates the gradient of the loss function with respect to the weights in a neural network.

So let's get to it.

Our Task

We're now on episode number four in our journey through understanding backpropagation. In the last episode, we focused on how we can mathematically express certain facts about the training process.

Now we're going to be using these expressions to help us differentiate the loss of the neural network with respect to the weights.

Recall from the episode that covered the intuition for backpropagation that, for stochastic gradient descent to update the weights of the network, it first needs to calculate the gradient of the loss with respect to these weights.

Calculating this gradient is exactly what we'll be focusing on in this episode.

We're first going to start out by checking out the equation that backprop uses to differentiate the loss with respect to weights in the network.

Then, we'll see that this equation is made up of multiple terms. This will allow us to break down and focus on each of these terms individually.

Lastly, we'll take the results from each term and combine them to obtain the final result, which will be the gradient of the loss function.

Alright, let's begin.

Derivative of the Loss Function with Respect to the Weights

Let's look at a single weight that connects node \(2\) in layer \(L-1\) to node \(1\) in layer \(L\).

This weight is denoted as

\[ w_{12}^{\left( L\right) }\text{.} \]

The derivative of the loss \( C_{0} \) with respect to this particular weight \( w_{12}^{(L)} \) is denoted as

\[ \frac{\partial C_{0}}{\partial w_{12}^{(L)}}\text{.} \]

Since \( C_{0} \) depends on \( a_{1}^{\left( L\right) }\text{,} \) and \( a_{1}^{\left( L\right) } \) depends on \( z_{1}^{(L)}\text{,}\) and \( z_{1}^{(L)} \) depends on \( w_{12}^{(L)}\text{,} \) the chain rule tells us that to differentiate \( C_{0} \) with respect to \( w_{12}^{(L)}\text{,} \) we take the product of the derivatives of the composed function.

This is expressed as

\[ \frac{\partial C_{0}}{\partial w_{12}^{(L)}} = \left(\frac{\partial C_{0}}{\partial a_{1}^{(L)}} \right) \left(\frac{\partial a_{1}^{(L)}}{\partial z_{1}^{(L)}} \right) \left(\frac{\partial z_{1}^{(L)}}{\partial w_{12}^{(L)}}\right) \text{.} \]

Let's break down each term from the expression on the right hand side of the above equation.
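Before we break the terms down, here's a quick symbolic sanity check of this factorization, written as a sketch in Python with sympy. The sigmoid activation and the single-connection setup are assumptions made purely for illustration; they aren't part of the derivation itself.

```python
# A minimal sketch verifying the chain-rule factorization symbolically.
# Assumptions (for illustration only): a sigmoid activation and a single
# incoming connection, so z depends on w through just one product term.
import sympy as sp

w, a_prev, y = sp.symbols('w a_prev y', real=True)
z_sym, a_sym = sp.symbols('z a', real=True)

z_expr = w * a_prev                  # z_1^(L): only the w_12 term varies with w
a_expr = 1 / (1 + sp.exp(-z_sym))    # a_1^(L) = g(z_1^(L)), g assumed sigmoid
C0_expr = (a_sym - y) ** 2           # node 1's contribution to the loss

# The three factors of the chain rule, composed back onto w
dC_da = sp.diff(C0_expr, a_sym).subs(a_sym, a_expr.subs(z_sym, z_expr))
da_dz = sp.diff(a_expr, z_sym).subs(z_sym, z_expr)
dz_dw = sp.diff(z_expr, w)

# Differentiating the fully composed loss directly with respect to w
C0_composed = C0_expr.subs(a_sym, a_expr).subs(z_sym, z_expr)
direct = sp.diff(C0_composed, w)

print(sp.simplify(direct - dC_da * da_dz * dz_dw))  # prints 0
```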

The first term: \( \frac{\partial C_{0}}{\partial a_{1}^{(L)}} \)

We know that

\[ C_{0}=\sum_{j=0}^{n-1}\left( a_{j}^{(L)}-y_{j}\right) ^{2}\text{.} \]

Therefore,

\[ \frac{\partial C_{0}}{\partial a_{1}^{(L)}} = \frac{\partial }{\partial a_{1}^{(L)}} \left( \sum_{j=0}^{n-1}\left( a_{j}^{(L)}-y_{j}\right)^{2}\right) \text{.} \]

Expanding the sum (in this example, the output layer \( L \) has four nodes, so \( j \) runs from \( 0 \) to \( 3 \)), we see

\[ \begin{eqnarray*} \frac{\partial}{\partial a_{1}^{(L)}} \left( \sum_{j=0}^{n-1}\left( a_{j}^{(L)}-y_{j}\right)^{2}\right) &=& \frac{\partial }{\partial a_{1}^{(L)}} \left( \left( a_{0}^{(L)}-y_{0}\right)^{2} + \left( a_{1}^{(L)}-y_{1}\right)^{2} + \left( a_{2}^{(L)}-y_{2}\right)^{2} + \left( a_{3}^{(L)}-y_{3}\right)^{2} \right) \\&=& \frac{\partial }{\partial a_{1}^{(L)}}\left( \left( a_{0}^{(L)}-y_{0}\right)^{2}\right) + \frac{\partial }{\partial a_{1}^{(L)}} \left( \left( a_{1}^{(L)}-y_{1}\right)^{2}\right) + \frac{\partial }{\partial a_{1}^{(L)}}\left( \left( a_{2}^{(L)}-y_{2}\right)^{2}\right) +\frac{\partial }{\partial a_{1}^{(L)}}\left( \left( a_{3}^{(L)}-y_{3}\right)^{2}\right) \\&=&2 \left( a_{1}^{\left( L\right) }-y_{1}\right) \text{.} \end{eqnarray*} \]

Observe that the loss for a single input sample responds to a small change in the activation output of node \( 1 \) in layer \( L \) at a rate equal to two times the difference between the activation output \( a_{1}^{(L)} \) and the desired output \( y_{1} \) for that node.
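As a quick numerical sanity check of this result, the sketch below (with a made-up four-node output layer, matching the expansion above) compares \( 2\left( a_{1}^{(L)}-y_{1}\right) \) against a finite-difference estimate of \( \frac{\partial C_{0}}{\partial a_{1}^{(L)}} \).

```python
# Sketch: finite-difference check of dC0/da_1 = 2 * (a_1 - y_1).
# The activation and label values below are made up for illustration.
import numpy as np

a = np.array([0.25, 0.80, 0.10, 0.55])   # activation outputs a_j^(L)
y = np.array([0.00, 1.00, 0.00, 0.00])   # desired outputs y_j

def loss(a_out):
    return np.sum((a_out - y) ** 2)      # C_0 for this single sample

analytic = 2 * (a[1] - y[1])

eps = 1e-6
a_plus, a_minus = a.copy(), a.copy()
a_plus[1] += eps
a_minus[1] -= eps
numeric = (loss(a_plus) - loss(a_minus)) / (2 * eps)

print(analytic, numeric)                 # both approximately -0.4
```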

The second term: \( \frac{\partial a_{1}^{(L)}}{\partial z_{1}^{(L)}} \)

We know that for each node \( j \) in the output layer \( L \), we have

\[ a_{j}^{\left( L\right) }=g^{\left( L\right) }\left( z_{j}^{\left( L\right)}\right) \text{,} \]

and since \( j=1 \), we have

\[ a_{1}^{\left( L\right) }=g^{\left( L\right) }\left( z_{1}^{\left( L\right)}\right) \text{.} \]

Therefore,

\[ \begin{eqnarray*} \frac{\partial a_{1}^{(L)}}{\partial z_{1}^{(L)}} &=& \frac{\partial }{\partial z_{1}^{(L)}}\left( g^{\left( L\right) }\left( z_{1}^{\left(L\right) }\right) \right) \\ &=& g^{\prime \left( L\right) }\left( z_{1}^{\left( L\right) }\right) \text{.} \end{eqnarray*} \]

So this term is simply the derivative of the activation function \( g^{\left( L\right) } \) evaluated at \( z_{1}^{(L)} \), since \( a_{1}^{(L)} \) is a direct function of \( z_{1}^{(L)} \).
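To make this concrete, the sketch below assumes a sigmoid for \( g^{\left( L\right) } \) (just one possible choice; the derivation holds for any differentiable activation) and checks its derivative against a finite difference.

```python
# Sketch: the second term for a concrete activation choice.
# Sigmoid is assumed here only for illustration; any differentiable
# activation g^(L) works in the derivation.
import numpy as np

def g(z):
    return 1 / (1 + np.exp(-z))          # sigmoid activation

def g_prime(z):
    return g(z) * (1 - g(z))             # its derivative

z_1 = 0.7                                # made-up input z_1^(L)

eps = 1e-6
numeric = (g(z_1 + eps) - g(z_1 - eps)) / (2 * eps)

print(g_prime(z_1), numeric)             # da_1/dz_1, computed two ways
```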

The third term: \( \frac{\partial z_{1}^{(L)}}{\partial w_{12}^{(L)}} \)

We know that, for each node \( j \) in the output layer \( L \), we have

\[ z_{j}^{(L)}=\sum_{k=0}^{n-1}w_{jk}^{(L)}a_{k}^{(L-1)} \text{.} \]

Since \( j=1 \), we have

\[ z_{1}^{(L)}=\sum_{k=0}^{n-1}w_{1k}^{(L)}a_{k}^{(L-1)} \text{.} \]

Therefore,

\[ \frac{\partial z_{1}^{(L)}}{\partial w_{12}^{(L)}} = \frac{\partial }{\partial w_{12}^{(L)}}\left( \sum_{k=0}^{n-1}w_{1k}^{(L)}a_{k}^{(L-1)}\right) \text{.} \]

Expanding the sum (in this example, layer \( L-1 \) has six nodes, so \( k \) runs from \( 0 \) to \( 5 \)), we see

\[ \begin{eqnarray*} \frac{\partial }{\partial w_{12}^{(L)}}\left(\sum_{k=0}^{n-1}w_{1k}^{(L)}a_{k}^{(L-1)}\right) &=& \frac{\partial }{\partial w_{12}^{(L)}} \left( w_{10}^{(L)}a_{0}^{(L-1)} + w_{11}^{(L)}a_{1}^{(L-1)} + w_{12}^{(L)}a_{2}^{(L-1)} + \cdots + w_{15}^{(L)}a_{5}^{(L-1)} \right) \\ &=& \frac{\partial }{\partial w_{12}^{(L)}} w_{10}^{(L)}a_{0}^{(L-1)} + \frac{\partial }{\partial w_{12}^{(L)}} w_{11}^{(L)}a_{1}^{(L-1)} + \frac{\partial }{\partial w_{12}^{(L)}}w_{12}^{(L)}a_{2}^{(L-1)} + \cdots + \frac{\partial }{\partial w_{12}^{(L)}}w_{15}^{(L)}a_{5}^{(L-1)} \\ &=& a_{2}^{(L-1)} \end{eqnarray*} \]

The input for node \( 1 \) in layer \( L \) responds to a change in the weight \( w_{12}^{(L)} \) at a rate equal to the activation output of node \( 2 \) in the previous layer, \( L-1 \).
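The sketch below checks this numerically: with six nodes in layer \( L-1 \) (matching the expansion above) and made-up weights and activations, the finite-difference estimate of \( \frac{\partial z_{1}^{(L)}}{\partial w_{12}^{(L)}} \) matches \( a_{2}^{(L-1)} \), regardless of the values of the other weights.

```python
# Sketch: finite-difference check that dz_1/dw_12 = a_2^(L-1).
# Weights and activations are made-up values for illustration.
import numpy as np

rng = np.random.default_rng(0)
w_1 = rng.normal(size=6)                 # weights w_1k^(L) into node 1
a_prev = rng.uniform(size=6)             # activations a_k^(L-1)

def z_1(weights):
    return np.dot(weights, a_prev)       # z_1^(L) = sum_k w_1k * a_k

eps = 1e-6
w_plus, w_minus = w_1.copy(), w_1.copy()
w_plus[2] += eps                         # perturb only w_12^(L)
w_minus[2] -= eps
numeric = (z_1(w_plus) - z_1(w_minus)) / (2 * eps)

print(a_prev[2], numeric)                # the two values agree
```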

Combining terms

Combining all terms, we have

\[ \begin{eqnarray*} \frac{\partial C_{0}}{\partial w_{12}^{(L)}} &=& \left( \frac{\partial C_{0}}{\partial a_{1}^{(L)}}\right) \left( \frac{\partial a_{1}^{(L)}}{\partial z_{1}^{(L)}}\right) \left( \frac{\partial z_{1}^{(L)}}{\partial w_{12}^{(L)}}\right) \\ &=& 2 \left( a_{1}^{\left( L\right) }-y_{1}\right) \left( g^{\prime \left(L\right) }\left( z_{1}^{\left( L\right) }\right) \right) \left( a_{2}^{(L-1)}\right) \end{eqnarray*} \]
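Putting it all together in code, the sketch below runs a tiny forward pass for the output layer (sigmoid activation and random values are assumptions for illustration) and compares the chain-rule product with a finite-difference estimate of \( \frac{\partial C_{0}}{\partial w_{12}^{(L)}} \).

```python
# Sketch: the full chain-rule product versus a finite-difference gradient.
# Sigmoid activation and random values are assumptions for illustration.
import numpy as np

def g(z):
    return 1 / (1 + np.exp(-z))

def g_prime(z):
    return g(z) * (1 - g(z))

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))              # weights w_jk^(L): 4 output nodes, 6 inputs
a_prev = rng.uniform(size=6)             # activations a_k^(L-1)
y = np.array([0.0, 1.0, 0.0, 0.0])       # desired outputs y_j

def loss(weights):
    z = weights @ a_prev                 # z_j^(L)
    a = g(z)                             # a_j^(L)
    return np.sum((a - y) ** 2)          # C_0

# Analytic gradient for the single weight w_12^(L)
z_1 = W[1] @ a_prev
a_1 = g(z_1)
analytic = 2 * (a_1 - y[1]) * g_prime(z_1) * a_prev[2]

# Finite-difference estimate of the same partial derivative
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(analytic, numeric)                 # the two values agree
```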

We Conclude

We've seen how to calculate the derivative of the loss with respect to one individual weight for one individual training sample.

To calculate the derivative of the total loss \( C \) with respect to this same weight, \( w_{12}^{(L)} \), for all \( n \) training samples, we average the derivative of the per-sample loss \( C_{i} \) over those samples.

This can be expressed as

\[ \frac{\partial C}{\partial w_{12}^{(L)}} = \frac{1}{n}\sum_{i=0}^{n-1}\frac{\partial C_{i}}{\partial w_{12}^{(L)}} \text{.} \]

We would then do this same process for each weight in the network to calculate the derivative of \( C \) with respect to each weight.
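As a final sketch, averaging hypothetical per-sample derivatives for this one weight over a small set of training samples would look like this (the per-sample quantities are made up for illustration):

```python
# Sketch: averaging the per-sample derivative over n training samples.
# The per-sample quantities below are made up; per_sample_grad plays the
# role of dC_i/dw_12^(L) derived above.
import numpy as np

def per_sample_grad(a_1, y_1, g_prime_z_1, a_2_prev):
    # dC_i/dw_12^(L) = 2 * (a_1 - y_1) * g'(z_1) * a_2^(L-1)
    return 2 * (a_1 - y_1) * g_prime_z_1 * a_2_prev

# Made-up per-sample values (a_1, y_1, g'(z_1), a_2^(L-1)) for n = 3 samples
samples = [
    (0.80, 1.0, 0.16, 0.35),
    (0.15, 0.0, 0.13, 0.60),
    (0.55, 1.0, 0.25, 0.10),
]

grads = [per_sample_grad(*s) for s in samples]
dC_dw = np.mean(grads)                   # dC/dw_12^(L), averaged over samples
print(dC_dw)
```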

Wrapping Up

At this point, we should now understand mathematically how backpropagation calculates the gradient of the loss with respect to the weights in the network.

We should also have a solid grip on all of the intermediate steps needed to do this calculation, and we should now be able to generalize the result we obtained for a single weight and a single sample to all the weights in the network for all training samples.

Now, we still haven't fully driven the point home, since we haven't yet discussed the math that underlies the backwards movement of backpropagation that we saw when we covered the intuition for backpropagation. Don't worry, we'll be doing that in the next episode. Thanks for reading. See you next time.
