## Weight Initialization explained | A way to reduce the vanishing gradient problem


### Weight initialization explained

In this episode, we'll talk about how the weights in an artificial neural network are initialized, how this initialization affects the training process, and what we can do about it!

In an artificial neural network, we know that weights are what connect the nodes between layers. To kick off our discussion on weight initialization, we're first going to discuss how these weights are initialized, and how these initialized values might negatively affect the training process.

With this in mind, we'll then explore what we can do to influence how this initialization occurs. Then, we'll see how we can specify how the weights for a given model are initialized in code using Keras.

### How are weights initialized?

So, how are weights even initialized in the first place? We briefly touched on this concept in our episodes on backpropagation. Recall that there, we mentioned that weights are *randomly initialized*.

To elaborate, whenever we build and compile a network, the values for the weights are set to random numbers, one random number per weight. Typically, these random numbers are normally distributed with a mean of \(0\) and a standard deviation of \(1\).

So, how does this random initialization impact training? To see this, let's consider the following example.

#### Random initialization example

Suppose that our neural network's input layer has \(250\) nodes, and for simplicity, suppose that the value of each of these \(250\) nodes is \(1\).

Now, let's focus only on the weights that connect the input layer to a single node in the first hidden layer. In total, there will be \(250\) weights connecting this node in our first hidden layer to all the nodes in the input layer. (See the corresponding video for an illustration.)

Now, each of these weights was randomly generated from a normal distribution with a mean of \(0\) and a standard deviation of \(1\). So, what does this mean for the weighted sum, \(z\), that this node accepts as input?

Note, in our case, all the input nodes have a value of \(1\), so each weight in \(z\) will be multiplied by a \(1\), so \(z\) becomes just a sum of the weights.

So back to how this random initialization affects \(z\), more specifically we want to know what this means for the variance of \(z\). (Stick with me for a sec, we'll see why we care about this in a minute.)

Well, \(z\), as a sum of normally distributed numbers with a mean of \(0\), will also be normally distributed around \(0\), but its variance, and similarly its derived standard deviation, will be larger than \(1\).

That's because the variance of a sum of independent random numbers is the sum of the variances of each of these numbers. So, since the variance for each of our random numbers is \(1\), that means the variance of \(z\), which is the sum of these \(250\) numbers, is \(250\).

Taking the square root of this value, we see that \(z\) has a standard deviation of \(\sqrt{250} \approx 15.811\).
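To check this arithmetic empirically, here is a minimal simulation sketch using only the Python standard library. The function name `simulate_weighted_sums`, the number of trials, and the seed are illustrative assumptions, not part of the original episode.

```python
import math
import random
import statistics

def simulate_weighted_sums(n_inputs=250, n_trials=2000, weight_std=1.0, seed=0):
    """Sample z = sum of n_inputs random weights, n_trials times.

    Every input is assumed to equal 1 (as in the example above),
    so the weighted sum is just the sum of the weights.
    """
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    return [
        sum(rng.gauss(0.0, weight_std) for _ in range(n_inputs))
        for _ in range(n_trials)
    ]

z_samples = simulate_weighted_sums()
empirical_std = statistics.pstdev(z_samples)
theoretical_std = math.sqrt(250)  # square root of the summed variances

print(f"empirical std of z:   {empirical_std:.2f}")
print(f"theoretical std of z: {theoretical_std:.3f}")
```

The empirical value should land close to the theoretical one, with some sampling noise that shrinks as the number of trials grows.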

So looking at the normal distribution of \(z\), we see that it's much broader than a normal distribution with a standard deviation of \(1\).

With this larger standard deviation, the value of \(z\) is more likely to be significantly larger than \(1\) or significantly smaller than \(-1\). When we pass this value to our activation function, then, if we're using sigmoid, for example, we know that most positive inputs, especially those that are significantly larger than \(1\), will be mapped to a value very close to \(1\). Similarly, most negative inputs will be mapped to a value very close to \(0\). In both cases, we say the activation has *saturated*.
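To make the saturation concrete, the sketch below evaluates sigmoid at a few values, including \(\pm 15.8\), roughly one standard deviation of \(z\) in our example. The helper `sigmoid` and the chosen inputs are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Inputs near 0 give outputs in the middle of the range;
# inputs around +/-15.8 are pushed almost all the way to 1 or 0.
for z in (-15.8, -1.0, 0.0, 1.0, 15.8):
    print(f"z = {z:6.1f} -> sigmoid(z) = {sigmoid(z):.7f}")
```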

If you're a little shaky on the concept of activation functions, be sure to check out the earlier episode that covers this concept in detail.

#### Problems with random initialization

If the desired output from our activation function is on the opposite side from where it saturated, then during training, when SGD updates the weights in an attempt to influence the activation output, it will only make very small changes in the value of this activation output, barely moving it in the right direction. That's because the sigmoid curve is nearly flat in its saturated regions, so the gradient there is close to \(0\).

Thus, the network's ability to learn becomes hindered, and training is stuck running in this slow and inefficient state.
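One way to see why learning slows down is to look at the slope of sigmoid, which is \(\sigma(z)(1 - \sigma(z))\). The sketch below compares that slope near \(0\) with the slope at a saturated input. The helper names and sample inputs are illustrative assumptions, and this only looks at the activation's own derivative, not a full backpropagation pass.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid with respect to z: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The slope peaks at z = 0 and collapses toward 0 as |z| grows,
# so gradient-based updates through a saturated node become tiny.
for z in (0.0, 2.0, 5.0, 15.8):
    print(f"z = {z:5.1f} -> slope = {sigmoid_grad(z):.2e}")
```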

These problems that we've discussed so far with weight initialization also contribute to the vanishing and exploding gradient problem.

Given that this random weight initialization causes issues and instabilities with training, can we do anything to help ourselves out here? Can we change how weights are initialized?

I'm glad you asked. Yes, yes we can.

### Xavier initialization

In hindsight, we can trace the problems we've discussed back to the weighted sum taking on a variance that is far from \(1\), and in our example, much larger than \(1\). So to tackle this problem, what we can do is force this variance to be smaller, bringing it back toward \(1\).

How do we do this?

Well, since the variance of the input for a given node is determined by the variance of the weights connected to this node from the previous layer, we need to shrink the variance of these weights, which will shrink the variance of the weighted sum.

Some researchers identified a value for the variance of the weights that seems to work pretty well to mitigate the earlier problems we discussed. The value for the variance of the weights connected to a given node is \(1/n\), where \(n\) is the number of weights connected to this node from the previous layer.

So, rather than the distribution of these weights be centered around \(0\) with a variance of \(1\), which is what we had earlier, they are now still centered around \(0\), but with a significantly smaller variance, \(1/n\).

It turns out that, to get these weights to have this variance of \(1/n\), what we do is, after randomly generating the weights centered around \(0\) with variance \(1\), we multiply each of them by \(\sqrt{1/n}\). Doing this causes the variance of these weights to shift from \(1\) to \(1/n\). This type of initialization is referred to as *Xavier initialization*, also known as *Glorot initialization*.
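The scaling step works because multiplying a random number by a constant \(c\) multiplies its variance by \(c^2\). With \(c = \sqrt{1/n}\), each weight has variance \(1/n\), and the sum of \(n\) such weights has variance \(n \cdot 1/n = 1\). Here is a minimal simulation sketch of that, reusing the 250-input example. The function name `weighted_sum_std`, the trial count, and the seed are illustrative assumptions.

```python
import math
import random
import statistics

def weighted_sum_std(n_inputs, scale, n_trials=2000, seed=0):
    """Std of z = sum of weights, where each weight is N(0, 1) times `scale`.

    All inputs are assumed to equal 1, as in the earlier example.
    """
    rng = random.Random(seed)
    sums = []
    for _ in range(n_trials):
        weights = [rng.gauss(0.0, 1.0) * scale for _ in range(n_inputs)]
        sums.append(sum(weights))
    return statistics.pstdev(sums)

n = 250
std_plain = weighted_sum_std(n, scale=1.0)                  # variance-1 weights
std_xavier = weighted_sum_std(n, scale=math.sqrt(1.0 / n))  # variance-1/n weights

print(f"std of z, unscaled weights: {std_plain:.2f}")
print(f"std of z, scaled weights:   {std_xavier:.2f}")
```

With the scaled weights, the standard deviation of \(z\) should come out near \(1\), which keeps sigmoid away from its saturated regions far more often.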

It's important to note that actually, if we're using relu as our activation function, which is highly likely, then this ideal value for the variance is \(2/n\) rather than \(1/n\). Besides that, everything else stated so far for this solution is the same. This value just happens to be what works better for relu.
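A common intuition for the factor of \(2\), offered here as a sketch rather than a derivation from this episode, is that relu zeroes out roughly half of its inputs, which halves the average squared activation. Doubling the weight variance compensates for that. This variant is often called He initialization. The function name `mean_squared_relu`, the trial count, and the seed below are illustrative assumptions.

```python
import math
import random

def mean_squared_relu(n_inputs, weight_var, n_trials=3000, seed=0):
    """Estimate the average of relu(z)**2, where z is the sum of n_inputs weights.

    All inputs are assumed to equal 1, so the average squared input is 1.
    """
    rng = random.Random(seed)
    weight_std = math.sqrt(weight_var)
    total = 0.0
    for _ in range(n_trials):
        z = sum(rng.gauss(0.0, weight_std) for _ in range(n_inputs))
        total += max(0.0, z) ** 2  # relu, then square
    return total / n_trials

n = 250
ms_xavier = mean_squared_relu(n, weight_var=1.0 / n)  # signal shrinks to about half
ms_relu = mean_squared_relu(n, weight_var=2.0 / n)    # signal stays near 1

print(f"mean squared activation with variance 1/n: {ms_xavier:.2f}")
print(f"mean squared activation with variance 2/n: {ms_relu:.2f}")
```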

Also, note that, given how we defined \(n\) as being the number of weights connected to a given node from the previous layer, we can see that this weight initialization process occurs on a per-layer basis.

Another thing worth noting is that when Xavier initialization was originally introduced, it was suggested to use \(2/(n_{in} + n_{out})\) as the variance, where \(n_{in}\) is the number of weights coming into this neuron, and \(n_{out}\) is the number of weights coming out of this neuron. You may still see this value referenced in some places.

Now, we've talked a lot about Xavier initialization. Aside from this one, there are other initialization techniques that you can explore, but Xavier initialization is currently one of the most popular, and it aims to reduce the vanishing and exploding gradient problem.

### Weight initialization in Keras

Let's see now how we can specify a weight initializer for our own model in code using Keras.

We're going to use this arbitrary model that has two hidden `Dense` layers and an output layer with two nodes.

```
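from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    # First hidden layer: no initializer given, so the Keras default is used
    Dense(16, input_shape=(1,5), activation='relu'),
    # Second hidden layer: Xavier (Glorot) uniform initialization, set explicitly
    Dense(32, activation='relu', kernel_initializer='glorot_uniform'),
    # Output layer with two nodes
    Dense(2, activation='softmax')
])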
```

Now, almost everything shown in this model has been covered in previous episodes of this series, so we won't touch on these items individually. Instead, we'll focus on the one thing we haven't seen before, which is the `kernel_initializer` parameter in the second hidden layer.

This is the parameter we use to specify what type of weight initialization we want to use for a given layer. Here, I've set the value equal to `glorot_uniform`. This is Xavier initialization using a uniform distribution. You can also use `glorot_normal` for Xavier initialization using a normal distribution.
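For the uniform variant, Keras documents `glorot_uniform` as drawing from a uniform distribution on \([-limit, limit]\) with \(limit = \sqrt{6/(fan_{in} + fan_{out})}\). A uniform distribution on that range has variance \(limit^2/3\), which works out to the \(2/(n_{in} + n_{out})\) value from the original Xavier proposal. The sketch below checks this with the standard library only. It does not call Keras, and the helper `glorot_uniform_limit`, the sample size, and the seed are illustrative assumptions.

```python
import math
import random
import statistics

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the uniform range that gives variance 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

# Sizes taken from the second hidden layer of the model above:
# 16 incoming connections per node, 32 nodes in the layer.
fan_in, fan_out = 16, 32
limit = glorot_uniform_limit(fan_in, fan_out)
target_var = 2.0 / (fan_in + fan_out)

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
samples = [rng.uniform(-limit, limit) for _ in range(20000)]
empirical_var = statistics.pvariance(samples)

print(f"limit:              {limit:.4f}")
print(f"target variance:    {target_var:.5f}")
print(f"empirical variance: {empirical_var:.5f}")
```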

Now actually, if you specify nothing at all, by default Keras initializes the weights in each layer with the `glorot_uniform` initialization. This is true for other layer types as well, not just `Dense`. Convolutional layers, for example, also use the `glorot_uniform` initializer by default.

There are several other initializers that Keras supports besides the two we just mentioned, and they're all documented in the Keras docs.

So, for each layer, we can just choose to leave the initializer as this default one, or, if we don't want to use `glorot_uniform`, we can explicitly state the value for the `kernel_initializer` parameter that we want to use, like `glorot_normal`, for example.

So hey, lucky us! Keras has been initializing these weights for us using Xavier initialization this whole time without us even knowing.

### Wrapping up

What we can draw from this entire discussion is that weight initialization plays a key role in how well and how quickly we can train our networks. It all links back to the idea of the inputs to our neurons having large variance when we just randomly generate a normally distributed set of weights.

So, to combat this, we can just shrink this variance, and we saw that doing this wasn't actually that bad. In Keras, it's not bad at all because it's already done for us!

So what do you think of this concept of weight initialization? After our discussion, is it clear how this can negatively impact training? See ya in the next one!
