Deep Learning Fundamentals - Classic Edition

Vanishing & Exploding Gradient explained | A problem resulting from backpropagation

Vanishing & Exploding Gradient

What's going on, everyone?! In this episode, we're going to discuss a problem that creeps up time and time again during the training process of an artificial neural network.

This is the problem of unstable gradients and is most popularly referred to as the vanishing gradient problem. Without further ado, let's get to it.

Setting Things Up

What do we already know about gradients as they pertain to neural networks?

Well, for one, when we use the word gradient by itself, we're typically referring to the gradient of the loss function with respect to the weights in the network.

We also know how this gradient is calculated, using backpropagation, which we covered in our earlier episodes dedicated solely to backprop.

Finally, as we saw in the episode that demonstrates how a neural network learns, we know what to do with this gradient after it's calculated. We update our weights with it!

Well, we don't do it ourselves per se, but stochastic gradient descent does, with the goal of finding the optimal weight for each connection, the one that minimizes the total loss of the network.

With this understanding, we're now going to talk about the vanishing gradient problem.

We're first going to answer, well, what the heck is the vanishing gradient problem anyway?

Here, we'll cover the idea conceptually. Then, we'll move our discussion to talking about how this problem occurs, and with the understanding that we'll have developed up to this point, we'll discuss the problem of exploding gradients.

We'll see that the exploding gradient problem is actually very similar to the vanishing gradient problem, and so we'll be able to take what we learned about that problem and apply it to this new one.

Alright, let's begin.

What is the Vanishing Gradient Problem?

What is the vanishing gradient problem anyway?

In general, the vanishing gradient problem is a problem that causes major difficulty when training a neural network. More specifically, this is a problem that involves weights in earlier layers of the network.

Recall that, during training, stochastic gradient descent (or SGD) works to calculate the gradient of the loss with respect to weights in the network.

Now, sometimes the gradient with respect to weights in earlier layers of the network becomes really small, like vanishingly small. Hence, vanishing gradient.
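To see this concretely, here's a rough sketch, not the episode's code, that pushes a single input through a deep stack of sigmoid layers, backpropagates a dummy upstream gradient, and compares the size of the gradient at the last layer versus the first layer. All of the layer sizes and random values here are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 30, 16
weights = [rng.normal(0.0, 0.5, (width, width)) for _ in range(n_layers)]

# Forward pass, keeping each layer's activations.
activations = [rng.normal(0.0, 1.0, width)]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass with a dummy upstream gradient of ones standing in
# for the gradient of the loss with respect to the network's output.
grad = np.ones(width)
grad_sizes = []
for l in reversed(range(n_layers)):
    grad = grad * activations[l + 1] * (1 - activations[l + 1])        # through the sigmoid
    grad_sizes.append(np.linalg.norm(np.outer(grad, activations[l])))  # size of the gradient w.r.t. layer l's weights
    grad = weights[l].T @ grad                                         # through the weights

print("gradient size at the last layer :", grad_sizes[0])
print("gradient size at the first layer:", grad_sizes[-1])
```

Running something like this, the first-layer gradient typically comes out many orders of magnitude smaller than the last-layer gradient.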

Ok, what's the big deal with a small gradient?

Small Gradients

Well, once SGD calculates the gradient with respect to a particular weight, it uses this value to update that weight, and the weight gets updated in some way that is proportional to the gradient. If the gradient is vanishingly small, then this update is, in turn, going to be vanishingly small as well.

Therefore, if this newly updated value of the weight has just barely moved from its original value, then it's not really doing much for the network. The change is too small to carry through the network and help reduce the loss, because the weight has barely budged from where it was before the update occurred.

As a result, this weight becomes kind of stuck, never really updating enough to even get close to its optimal value. This has implications for the remainder of the network to the right of this one weight, and it impairs the ability of the network to learn well.

This is the problem, and now that we know what this problem is, how exactly does this problem occur?

Well, we know from what we learned about backpropagation that the gradient of the loss with respect to any given weight is going to be the product of some derivatives that depend on components that reside later in the network.

Given this, we can deduce that the earlier in the network a weight lives, the more terms will be needed in the product we just mentioned to get the gradient of the loss with respect to this weight.
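As a rough sketch in our own notation (not anything shown in the episode), for a weight $w^{(1)}$ in the first layer of an $L$-layer network, that product looks something like

$$
\frac{\partial \, \text{loss}}{\partial w^{(1)}}
= \frac{\partial \, \text{loss}}{\partial a^{(L)}}
\cdot \frac{\partial a^{(L)}}{\partial a^{(L-1)}}
\cdots
\frac{\partial a^{(2)}}{\partial a^{(1)}}
\cdot \frac{\partial a^{(1)}}{\partial w^{(1)}}
$$

where $a^{(l)}$ stands for the output of layer $l$. A weight living in a later layer would need far fewer of these factors.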

The key now is to understand what happens if the terms in this product, or at least some of them, are small. And by small, we mean less than one.

Well, the product of a bunch of numbers less than one is going to give us an even smaller number, right?

Ok, cool.

As we mentioned earlier, we now take this result, the small number, and update our weight with it. Recall that we do this update by first multiplying this number by our learning rate, which itself is a small number, usually ranging between 0.01 and 0.0001.

Now the result of this product is an even smaller number. After this smaller number is obtained, we subtract it from the weight, and the final result of this difference is going to be the value of the updated weight.

Stuck Weights

Now, think about it: if the gradient we obtain with respect to this weight is already really small, i.e., vanishing, then by the time we multiply it by the learning rate, the product is going to be even smaller, and so when we subtract this teeny tiny number from the weight, it's just barely going to move the weight at all.
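Here's a minimal sketch of that arithmetic, with all of the numbers made up purely for illustration.

```python
learning_rate = 0.01
weight = 0.80
vanishing_gradient = 1e-7                  # an already tiny gradient

update = learning_rate * vanishing_gradient
new_weight = weight - update

print(update)        # ~1e-09, even smaller than the gradient itself
print(new_weight)    # ~0.799999999, the weight has barely moved at all
```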

Essentially, the weight gets into this kind of stuck state. Not moving, not learning, and therefore not really helping to meet the overall objective of minimizing the loss of the network.

We can see why earlier weights are subject to this problem: as we said, the earlier in the network the weight resides, the more terms are going to be included in the product that calculates the gradient.

The more terms we're multiplying together that are less than one, the quicker the gradient is going to vanish.
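As a rough illustration with a made-up factor, here's what happens to a product of terms that are all less than one as more of them get included:

```python
factor = 0.25    # stands in for one of the derivative terms in the product

for n_terms in (5, 10, 20):
    gradient = factor ** n_terms
    print(n_terms, gradient)

# 5  terms -> ~9.8e-04
# 10 terms -> ~9.5e-07
# 20 terms -> ~9.1e-13   the gradient vanishes quickly as terms are added
```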

Exploding Gradient

Now let's talk about this problem in the opposite direction. Not a gradient that vanishes, but rather, a gradient that explodes.

Think about the conversation we just had about how the vanishing gradient problem occurs for weights early in the network, due to a product in which at least some of the terms are relatively small.

Now think about calculating the gradient with respect to the same weight, but instead of really small terms, what if they were large? And by large, we mean greater than one.

Well, if we multiply a bunch of terms together that are all greater than one, we're going to get something greater than one, and perhaps even a lot greater than one.

The same argument we made for the vanishing gradient holds here: the earlier in the network a weight lives, the more terms will be needed in the product we just mentioned.

As a result, we can see that the more of these larger valued terms we have being multiplied together, the larger the gradient is going to be, thus essentially exploding in size.
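Here's the same rough sketch as before, this time with a made-up factor greater than one:

```python
factor = 1.9    # stands in for one of the derivative terms, now greater than one

for n_terms in (5, 10, 20):
    gradient = factor ** n_terms
    print(n_terms, gradient)

# 5  terms -> ~25
# 10 terms -> ~613
# 20 terms -> ~376,000   the gradient explodes as terms are added
```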

With this gradient, we go through the same process we talked about earlier to proportionally update our weight.

However, this time, instead of barely moving our weight with this update, we're going to move it greatly. So much so, in fact, that the optimal value for this weight won't be achieved, because the amount by which the weight gets updated with each epoch is just too large and continues to move the weight further and further away from its optimal value.
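Here's a toy sketch of that behavior. It assumes a simple made-up quadratic loss whose optimal weight value is 0.5, and it pretends that backprop hands us a gradient blown up by a large product of terms.

```python
optimal_weight = 0.5
weight = 0.6
learning_rate = 0.01
explosion_factor = 400    # stands in for a product of terms greater than one

for epoch in range(4):
    gradient = explosion_factor * 2 * (weight - optimal_weight)    # gradient of (weight - 0.5)^2, blown up
    weight = weight - learning_rate * gradient
    print(epoch, weight)

# The weight overshoots 0.5 and lands further away from it on every epoch,
# rather than settling in at its optimal value.
```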

Conclusion

A main takeaway that we should gain from this discussion is that the problems of vanishing gradients and exploding gradients are both instances of a more general problem of unstable gradients.
