## Batch Normalization (“batch norm”) explained

### Introducing Batch Normalization (Batch Norm)

In this post, we'll be discussing *batch normalization*, otherwise known as *batch norm*, and how it applies to training artificial neural networks.

We'll also see how to implement batch norm in code with Keras.

### Normalization techniques

Before getting to the details of batch normalization, let's first quickly discuss regular normalization techniques.

$$z=\frac{x-mean}{std}$$

Generally speaking, when training a neural network, we want to normalize or standardize our data in some way ahead of time as part of the pre-processing step. This is the step where we prepare our data to get it ready for training.

Normalization and standardization have the same objective of transforming the data to put all the data points on the same scale.

A typical normalization process consists of scaling numerical data down to be on a scale from zero to one. A typical standardization process consists of subtracting the mean of the data set from each data point, and then dividing that difference by the data set's standard deviation.

This forces the standardized data to take on a mean of zero and a standard deviation of one. In practice, this standardization process is often just referred to as normalization as well.
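To make the two processes concrete, here is a minimal sketch in plain Python. The function names and the sample values are made up for illustration, and the sketch assumes the population standard deviation is the one intended.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    # Note: this divides by zero if every value is identical.
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data on very different magnitudes.
data = [1000.0, 20000.0, 45000.0, 100000.0]

normalized = min_max_normalize(data)
standardized = standardize(data)

print(normalized)    # every value now lies between 0 and 1
print(standardized)  # values now have mean 0 and standard deviation 1
```

Both functions put the data on a known scale. They differ only in which scale they target.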

### Why use normalization techniques?

In general, this all boils down to putting our data on some type of known or standard scale. Why do we do this?

Well, if we didn't normalize our data in some way, we can imagine that we may have some numerical data points in our data set that are very high, and others that are very low.

For example, suppose we have data on the number of miles individuals have driven a car over the last `5` years. We may have someone who has driven `100,000` miles total, and we may have someone else who's only driven `1,000` miles total. This data has a relatively wide range and isn't necessarily on the same scale.

Additionally, each one of the features for each of our samples could vary widely as well. If we have one feature which corresponds to an individual's age and the other feature corresponds to the number of miles that individual has driven a car over the last five years, then, again, we can see that these two pieces of data, age and miles driven, will not be on the same scale.

The larger data points in these non-normalized data sets can cause instability in neural networks. The relatively large inputs can cascade down through the layers in the network, which may cause imbalanced gradients, which may in turn cause the famous exploding gradient problem. This topic is covered here.

For now, understand that this imbalanced, non-normalized data may cause problems with our network that make it drastically harder to train. Additionally, non-normalized data can significantly decrease our training speed.

When we normalize our inputs, however, we put all of our data on the same scale in an attempt to increase training speed and to avoid the problem we just discussed, since we no longer have this relatively wide range between data points.

This is good, but there is another problem that can arise even with normalized data. From our previous post on how a neural network learns, we know that the weights in our model are updated over each epoch during training via the process of stochastic gradient descent.

#### Weights that tip the scale

What if, during training, one of the weights ends up becoming drastically larger than the other weights?

Well, this large weight will then cause the output from its corresponding neuron to be extremely large, and this imbalance will, again, continue to cascade through the network, causing instability. This is where batch normalization comes into play.
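As a rough illustration of that imbalance, the sketch below computes one neuron's weighted sum twice: once with balanced weights and once with a single weight that is drastically larger. All of the numbers and names are hypothetical.

```python
def neuron_output(inputs, weights):
    """Weighted sum of the inputs: the value a neuron passes to its activation function."""
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical inputs that are already on a small, normalized scale.
inputs = [0.2, -0.5, 0.9]

balanced_weights = [0.4, 0.3, 0.6]
skewed_weights = [0.4, 0.3, 60.0]  # the last weight is 100 times larger

balanced_out = neuron_output(inputs, balanced_weights)
skewed_out = neuron_output(inputs, skewed_weights)

print(balanced_out)
print(skewed_out)  # dominated almost entirely by the single large weight
```

Even though the inputs are normalized, the one large weight dominates the output, and that large value is what the next layer receives as input.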

### Applying batch norm to a layer

Batch norm is applied to layers that we choose within our network.

When applying batch norm to a layer, the first thing batch norm does is normalize the output from the activation function. Recall from our post on activation functions that the output from a layer is passed to an activation function, which transforms the output in some way depending on the function itself, before being passed to the next layer as input.

After normalizing the output from the activation function, batch norm multiplies this normalized output by some arbitrary parameter and then adds another arbitrary parameter to this resulting product.

| Step | Expression | Description |
|---|---|---|
| 1 | $$z=\frac{x-mean}{std}$$ | Normalize output \(x\) from activation function. |
| 2 | $$z*g$$ | Multiply normalized output \(z\) by arbitrary parameter \(g\). |
| 3 | $$(z*g) + b$$ | Add arbitrary parameter \(b\) to resulting product \((z*g)\). |
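The three steps in the table can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: it handles the outputs of a single unit across one batch, it uses the population standard deviation, and the values chosen for `g` and `b` are arbitrary. The function name `batch_norm` is made up for this sketch and is not a Keras function.

```python
from statistics import mean, pstdev


def batch_norm(activations, g, b):
    """Apply the three batch norm steps to one unit's outputs across a batch."""
    mu = mean(activations)
    sigma = pstdev(activations)
    # Step 1: normalize the activation outputs.
    # Library implementations typically add a small constant to the denominator
    # to avoid dividing by zero; it is left out here to match the formula above.
    z = [(x - mu) / sigma for x in activations]
    # Step 2: multiply by g. Step 3: add b.
    return [(zi * g) + b for zi in z]


# Hypothetical activation outputs for one unit over a batch of four samples.
activations = [0.0, 2.0, 4.0, 10.0]

output = batch_norm(activations, g=2.0, b=0.5)
print(output)
```

In a real network, `g` and `b` would not be fixed by hand like this. They would be adjusted during training along with the weights.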

#### Trainable parameters

This calculation with the two arbitrary parameters sets a new standard deviation and mean for the data. The two arbitrarily set parameters, \(g\) and \(b\), are trainable, meaning that they are learned and optimized during the training process.

| Parameter | Trainable |
|---|---|
| \(g\) | Yes |
| \(b\) | Yes |
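To see why multiplying by \(g\) and adding \(b\) sets a new mean and standard deviation, recall that the normalized output \(z\) has a mean of zero and a standard deviation of one. Writing \(y\) for the final result (a symbol introduced here only for illustration), a short derivation gives:

$$y=(z*g)+b$$

$$mean(y)=g*mean(z)+b=b$$

$$std(y)=|g|*std(z)=|g|$$

So \(b\) becomes the new mean, and the magnitude of \(g\) becomes the new standard deviation.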

This process makes it so that the weights within the network don't become imbalanced with extremely high or low values since the normalization is included in the gradient process.

Adding batch norm to our model can greatly increase the speed at which training occurs and reduce the ability of outlying large weights to over-influence the training process.

When we spoke about normalizing our input data in the pre-processing step before training occurs, that normalization happened to the data before it was passed to the input layer.

With batch norm, we can normalize the output data from the activation functions for individual layers within our model as well. This means we have normalized data coming in, and we also have normalized data within the model.

#### Normalizing per batch

Everything we just mentioned about the batch normalization process occurs on a per-batch basis, hence the name *batch norm*.

These batches are determined by the batch size we set when we train our model. If we're not yet familiar with training batches or batch size, check out this post on the topic.
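As a small sketch of what "per batch" means, the code below splits some hypothetical activation outputs into batches and normalizes each batch using only that batch's own mean and standard deviation. The helper names and data are made up for illustration.

```python
from statistics import mean, pstdev


def normalize(batch):
    """Normalize one batch using that batch's own mean and standard deviation."""
    mu, sigma = mean(batch), pstdev(batch)
    return [(x - mu) / sigma for x in batch]


def make_batches(values, batch_size):
    """Yield consecutive slices of values, each batch_size long (the last may be shorter)."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


# Hypothetical activation outputs; the second half is ten times larger than the first.
outputs = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

per_batch = [normalize(batch) for batch in make_batches(outputs, batch_size=4)]

for normalized_batch in per_batch:
    print(normalized_batch)
```

Because the statistics are computed separately for each batch, both batches end up on the same scale even though their raw values differ by a factor of ten.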

Now that we have an understanding of batch norm, let's look at how we can add batch norm to a model in code using Keras.

### Working with Code in Keras

Here, we've just copied the code for a model that we've built in a previous post.

```
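model = Sequential([
    # First hidden layer, which also declares the input shape
    Dense(units=16, input_shape=(1,5), activation='relu'),
    # Second hidden layer
    Dense(units=32, activation='relu'),
    # Normalize the activation output of the preceding layer
    BatchNormalization(axis=1),
    # Output layer with two output categories
    Dense(units=2, activation='softmax')
])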
```

We have a model with two hidden layers with `16` and `32` nodes respectively, both using `relu()` as their activation functions, and an output layer with two output categories using the `softmax()` activation function.

The only difference here is the line between the last hidden layer and the output layer.

```
BatchNormalization(axis=1)
```

This is how we specify batch normalization in Keras. Following the layer for which we want the activation output normalized, we specify a `BatchNormalization` object. To do this, we need to import `BatchNormalization` from Keras, as shown below.

```
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization
```

The only parameter that we're specifying for `BatchNormalization` is the `axis` parameter, which specifies the axis of the data that should be normalized. This is typically the features axis.

There are several other parameters that we can optionally specify, including two called `beta_initializer` and `gamma_initializer`. These are the initializers for the arbitrarily set parameters that we mentioned when we were describing how batch norm works.

By default, Keras sets these to `0` and `1`, respectively, but we can optionally change them, along with several other optional parameters.
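To get a feel for what those default initial values mean, here is a small plain-Python sketch. It assumes that beta plays the role of \(b\) and gamma the role of \(g\) from the steps described earlier. The function `scale_and_shift` is a made-up helper for illustration, not part of Keras.

```python
def scale_and_shift(z, g=1.0, b=0.0):
    """Steps 2 and 3 of batch norm: multiply by g, then add b.

    The defaults mirror the initial values of 1 for gamma and 0 for beta.
    """
    return [(zi * g) + b for zi in z]


# Hypothetical, already-normalized activation outputs.
z = [-1.5, -0.5, 0.5, 1.5]

initial = scale_and_shift(z)               # g=1, b=0: output equals the input
adjusted = scale_and_shift(z, g=2.0, b=0.5)  # values training might later learn

print(initial)
print(adjusted)
```

With the initial values, the scale-and-shift step leaves the normalized output unchanged, so the layer starts out as plain normalization and only departs from it as the two parameters are learned.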

### Wrapping up

That's all there is for implementing batch norm in Keras. We should now have an understanding of what batch norm is, how it works, and why it makes sense to apply it to a neural network. I'll see ya in the next one!
