## PyTorch Dataset Normalization - torchvision.transforms.Normalize()


Welcome to deeplizard. My name is Chris. In this episode, we're going to learn how to normalize a dataset. We'll see how dataset normalization is carried out in code, and we'll see how normalization affects the neural network training process.

Without further ado, let's get started.

### Data Normalization

Data normalization is a general concept that refers to the act of transforming the original values of a dataset to new values. The new values are typically encoded relative to the dataset itself and are scaled in some way.

#### Feature scaling

For this reason, another name for data normalization that is sometimes used is feature scaling. This term refers to the fact that when normalizing data, we often transform different features of a given dataset to a similar scale.

In this case, we are not just thinking of a dataset of values but rather a dataset of elements that have multiple features, each with its own value.

Suppose, for example, that we are dealing with a dataset of people, and we have two relevant features in our dataset, age and weight. In this case, we can observe that the magnitudes or scales of these two feature sets are different, i.e., the weights on average are larger than the ages.

This difference in magnitude can be problematic when comparing or computing with these values in machine learning algorithms. Hence, this *can be* one reason we might want to scale the values of these features to some similar scale via feature scaling.

#### Normalization Example

When we normalize a dataset, we said that we typically encode some form of information about each particular value relative to the dataset at large and rescale the data. Let's consider an example.

Suppose we have a set \(S\) of positive numbers. Now, suppose we choose a random value \(x\) from the set \(S\) and ask the following question:

Is \(x\) the largest member of the set \(S\)?

In this case, the answer is that **we don't know**. We simply don't have enough information to answer the question.

However, let's suppose now that we are told that the set \(S\) has been normalized by dividing every value by the largest value inside the set. Given this normalization process, the information of which value is largest has been encoded and the data has been rescaled.

The largest member of the set is \(1\), and the data has been scaled to the interval \([0,1]\).
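As a minimal sketch of this divide-by-the-largest-value normalization, the example below uses a small made-up list of positive numbers. The function name `max_normalize` and the sample values are assumptions chosen purely for illustration.

```python
def max_normalize(values):
    """Divide every value by the largest value in the collection."""
    largest = max(values)
    return [v / largest for v in values]

# Hypothetical set S of positive numbers
S = [2.0, 5.0, 10.0, 4.0]
normalized = max_normalize(S)

# After normalization the largest member is 1 and every value lies in [0, 1]
print(normalized)
```

With the normalized values in hand, the question from before becomes answerable: a value is the largest member of the set exactly when it equals \(1\).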

#### What is Standardization

Data standardization is a specific type of normalization technique. It is sometimes referred to as *z-score normalization*. The z-score, a.k.a. standard score, is the transformed value for each data point.

To normalize a dataset using standardization, we take every value \(x\) inside the dataset and transform it to its corresponding \(z\) value using the following formula:

\[ z = \frac{x - \text{mean}}{\text{std}} \]

After performing this computation on every \(x\) value inside our dataset, we have a new normalized dataset of \(z\) values. The mean and standard deviation values are with respect to the dataset as a whole.

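Suppose that a given set \(S\) of numbers has \(n\) members, \(x_1, x_2, \ldots, x_n\).

The mean of the set \(S\) is given by the following equation:

\[ \text{mean} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

The standard deviation of the set \(S\) is given by the following equation:

\[ \text{std} = \sqrt{\frac{(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \cdots + (x_n - \text{mean})^2}{n}} \]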

We have seen how normalizing by dividing by the largest value had the effect of transforming the largest value to \(1\). Similarly, this standardization process transforms the dataset's mean value to \(0\) and its standard deviation to \(1\).
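To check this claim on a small case, here is a sketch in plain Python that standardizes a made-up list of numbers. The function name `standardize` and the sample values are assumptions for illustration, and the standard deviation divides by \(n\), matching the description above.

```python
import math

def standardize(values):
    """Return the z-scores of values, together with the mean and std used."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    z_values = [(x - mean) / std for x in values]
    return z_values, mean, std

# Hypothetical dataset chosen so the arithmetic is easy to follow by hand
S = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_values, mean, std = standardize(S)

# Recompute the statistics on the standardized values
z_mean = sum(z_values) / len(z_values)
z_std = math.sqrt(sum((z - z_mean) ** 2 for z in z_values) / len(z_values))
print(mean, std, z_mean, z_std)
```

Here the original mean is \(5\) and the standard deviation is \(2\), while the standardized values have mean \(0\) and standard deviation \(1\).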

It's important to note that when we normalize a dataset, we typically group these operations by feature. This means that the mean and standard deviation values are relative to each feature set that's being normalized. If we are working with images, the features are the RGB color channels, so we normalize each color channel with respect to the mean and standard deviation values calculated across all pixels in every image for the respective color channel.
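The per-channel grouping can be sketched with plain tensor operations. The random batch below is a hypothetical stand-in for a set of RGB images with shape `(N, C, H, W)`, and the variable names are assumptions for illustration. The standard deviation is computed with `unbiased=False` so that it divides by \(n\).

```python
import torch

torch.manual_seed(0)
# Hypothetical batch of 8 RGB images, 16x16 pixels, shape (N, C, H, W)
images = torch.rand(8, 3, 16, 16)

# One mean and one standard deviation per color channel:
# reduce over the batch, height, and width dimensions, keep the channel dimension
channel_mean = images.mean(dim=(0, 2, 3))
channel_std = images.std(dim=(0, 2, 3), unbiased=False)

# Broadcast the per-channel statistics over every pixel of every image
normalized = (images - channel_mean[None, :, None, None]) / channel_std[None, :, None, None]

print(channel_mean.shape, channel_std.shape)
```

Each channel of `normalized` then has a mean of roughly \(0\) and a standard deviation of roughly \(1\), independently of the other channels.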

### Normalize a Dataset in Code

Let's jump into a code example. The first step is to initialize our dataset, so in this example we'll use the Fashion MNIST dataset that we've been working with up to this point in the series.

    import torch
    import torchvision
    import torchvision.transforms as transforms

    train_set = torchvision.datasets.FashionMNIST(
        root='./data'
        ,train=True
        ,download=True
        ,transform=transforms.Compose([
            transforms.ToTensor()
        ])
    )

PyTorch allows us to normalize our dataset using the standardization process we've just seen by passing in the mean and standard deviation values for each color channel to the `Normalize()` transform.

    torchvision.transforms.Normalize(
          [meanOfChannel1, meanOfChannel2, meanOfChannel3]
        , [stdOfChannel1, stdOfChannel2, stdOfChannel3]
    )

Since the images inside our dataset only have a single channel, we only need to pass in a single mean value and a single standard deviation value. In order to do this, we first need to calculate these values. Sometimes the values might be posted online somewhere, so we can get them that way. However, when in doubt, we can just calculate them manually.

There are two ways this can be done: the easy way and the harder way. The easy way can be used if the dataset is small enough to fit into memory all at once. Otherwise, we have to iterate over the data, which is slightly harder.

#### Calculating `mean` and `std` the Easy Way

The easy way is easy. All we have to do is load the dataset using the data loader and get a single batch tensor that contains all the data. To do this we set the batch size to be equal to the training set length.

    from torch.utils.data import DataLoader

    # A single batch that holds the entire training set
    loader = DataLoader(train_set, batch_size=len(train_set), num_workers=1)
    data = next(iter(loader))
    data[0].mean(), data[0].std()

    (tensor(0.2860), tensor(0.3530))

Here, we can obtain the mean and standard deviation values by simply using the corresponding PyTorch tensor methods.

#### Calculating `mean` and `std` the Hard Way

The hard way is hard because we need to manually implement the formulas for the mean and standard deviation and iterate over smaller batches of the dataset.

First, we create a data loader with a smaller batch size.

    loader = DataLoader(train_set, batch_size=1000, num_workers=1)

Then, we calculate our \(n\) value or total number of pixels:

    num_of_pixels = len(train_set) * 28 * 28

Note that the \(28 * 28\) comes from the height and width of the images inside our dataset. Now, we sum the pixel values by iterating over each batch, and we calculate the mean by dividing this sum by the total number of pixels.

    total_sum = 0
    for batch in loader:
        total_sum += batch[0].sum()
    mean = total_sum / num_of_pixels

Next, we calculate the sum of the squared errors by iterating through each batch. This allows us to calculate the standard deviation by dividing the sum of the squared errors by the total number of pixels and taking the square root of the result.

    sum_of_squared_error = 0
    for batch in loader:
        sum_of_squared_error += ((batch[0] - mean).pow(2)).sum()
    std = torch.sqrt(sum_of_squared_error / num_of_pixels)

This gives us:

    mean, std

    (tensor(0.2860), tensor(0.3530))
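The two loops above read the dataset twice. As an alternative sketch (not the method used in this episode), both statistics can be accumulated in a single pass by tracking the sum and the sum of squares, then using the identity \(\text{std}^2 = \text{mean of } x^2 - \text{mean}^2\). The random tensor below is a hypothetical stand-in for the image data, and the accumulators use 64-bit floats because this identity is more sensitive to rounding than the two-pass version.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical stand-in for the dataset: 500 single-channel 28x28 images in [0, 1)
images = torch.rand(500, 1, 28, 28)
labels = torch.zeros(500, dtype=torch.long)
loader = DataLoader(TensorDataset(images, labels), batch_size=100)

num_of_pixels = 0
total_sum = torch.tensor(0.0, dtype=torch.float64)
total_sum_sq = torch.tensor(0.0, dtype=torch.float64)

for batch_images, _ in loader:
    batch = batch_images.double()       # accumulate in float64
    num_of_pixels += batch.numel()      # count pixels as we go
    total_sum += batch.sum()
    total_sum_sq += batch.pow(2).sum()

mean = total_sum / num_of_pixels
std = torch.sqrt(total_sum_sq / num_of_pixels - mean.pow(2))
print(mean, std)
```

Whether the single pass is worth it depends on how expensive it is to read the data. The two-pass version is simpler to reason about and less prone to rounding issues.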

#### Using the `mean` and `std` Values

Our task is to use these values to transform the pixel values inside our dataset to their corresponding standardized values. To do this, we create a new `train_set`, only this time we pass a normalization transform to the transforms composition.

    train_set_normal = torchvision.datasets.FashionMNIST(
        root='./data'
        ,train=True
        ,download=True
        ,transform=transforms.Compose([
              transforms.ToTensor()
            , transforms.Normalize(mean, std)
        ])
    )

Note that the order of the transforms matters inside the composition. The images are loaded as Python PIL objects, so we must add the `ToTensor()` transform before the `Normalize()` transform because the `Normalize()` transform expects a tensor as input.

Now that our dataset has a `Normalize()` transform, the data will be normalized when it is loaded by the data loader. Remember, for each image the following transform will be applied to every pixel in the image:

\[ z = \frac{x - \text{mean}}{\text{std}} \]

This has the effect of rescaling our data relative to the mean and standard deviation of the dataset. Let's see this in action by recalculating these values.

    loader = DataLoader(
          train_set_normal
        , batch_size=len(train_set)
        , num_workers=1
    )
    data = next(iter(loader))
    data[0].mean(), data[0].std()

    (tensor(1.2368e-05), tensor(1.0000))

Here, we can see that the mean value is now effectively \(0\) (the tiny residual comes from floating-point arithmetic) and the standard deviation value is now \(1\).
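To make the per-pixel effect concrete, here is a small sketch in plain Python that applies the standardization formula to a few individual pixel values, using the Fashion MNIST mean and standard deviation calculated earlier. The helper name `standardize_pixel` is an assumption for illustration. The inputs rely on the fact that `ToTensor()` scales pixel values to the range \([0, 1]\).

```python
# Fashion MNIST statistics calculated earlier in this episode
mean, std = 0.2860, 0.3530

def standardize_pixel(x, mean, std):
    """Apply z = (x - mean) / std to a single pixel value."""
    return (x - mean) / std

black = standardize_pixel(0.0, mean, std)     # darkest possible pixel after ToTensor()
average = standardize_pixel(mean, mean, std)  # a pixel exactly at the dataset mean
white = standardize_pixel(1.0, mean, std)     # brightest possible pixel after ToTensor()

print(black, average, white)
```

Pixels darker than the dataset mean map to negative values, a pixel at the mean maps to \(0\), and brighter pixels map to positive values measured in standard deviations.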

### Training with Normalized Data

Let's see now how training with and without normalized data affects the training process. To test this, we'll run \(20\) epochs under each condition.

Let's create a dictionary of training sets that we can use to run the test in the framework that we've been building throughout the course.

    trainsets = {
        'not_normal': train_set
        ,'normal': train_set_normal
    }

Now, we can add these two train_sets to our configuration and access the values inside our runs loop.

    from collections import OrderedDict
    import torch.nn.functional as F
    import torch.optim as optim

    # RunManager, RunBuilder, and Network come from earlier episodes in the series
    params = OrderedDict(
          lr = [.01]
        , batch_size = [1000]
        , num_workers = [1]
        , device = ['cuda']
        , trainset = ['not_normal', 'normal']
    )
    m = RunManager()
    for run in RunBuilder.get_runs(params):

        device = torch.device(run.device)
        network = Network().to(device)
        loader = DataLoader(
              trainsets[run.trainset]
            , batch_size=run.batch_size
            , num_workers=run.num_workers
        )
        optimizer = optim.Adam(network.parameters(), lr=run.lr)

        m.begin_run(run, network, loader)
        for epoch in range(20):
            m.begin_epoch()
            for batch in loader:

                images = batch[0].to(device)
                labels = batch[1].to(device)
                preds = network(images) # Pass Batch
                loss = F.cross_entropy(preds, labels) # Calculate Loss
                optimizer.zero_grad() # Zero Gradients
                loss.backward() # Calculate Gradients
                optimizer.step() # Update Weights

                m.track_loss(loss, batch)
                m.track_num_correct(preds, labels)
            m.end_epoch()
        m.end_run()
    m.save('results')

#### Training Results

Let's sort the results by accuracy.

    import pandas as pd

    pd.DataFrame.from_dict(m.run_data).sort_values('accuracy', ascending=False)

| run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | num_workers | device | trainset |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 20 | 0.2240 | 0.9148 | 9.56 | 198.35 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 19 | 0.2267 | 0.9142 | 9.63 | 188.69 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 18 | 0.2310 | 0.9126 | 9.58 | 178.96 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 17 | 0.2332 | 0.9122 | 9.68 | 169.29 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 16 | 0.2338 | 0.9121 | 9.77 | 159.52 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 20 | 0.2425 | 0.9102 | 7.08 | 148.37 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 15 | 0.2410 | 0.9099 | 9.67 | 149.66 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 14 | 0.2453 | 0.9085 | 9.71 | 139.90 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 19 | 0.2477 | 0.9084 | 7.04 | 141.20 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 17 | 0.2557 | 0.9058 | 7.01 | 126.94 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 16 | 0.2579 | 0.9057 | 7.04 | 119.84 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 18 | 0.2540 | 0.9055 | 7.03 | 134.06 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 13 | 0.2555 | 0.9041 | 9.70 | 130.10 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 14 | 0.2652 | 0.9021 | 7.02 | 105.58 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 15 | 0.2647 | 0.9018 | 7.06 | 112.72 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 12 | 0.2633 | 0.9011 | 9.71 | 120.29 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 13 | 0.2707 | 0.8997 | 7.01 | 98.48 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 10 | 0.2704 | 0.8986 | 9.62 | 100.62 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 11 | 0.2668 | 0.8986 | 9.78 | 110.49 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 12 | 0.2766 | 0.8983 | 7.13 | 91.39 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 11 | 0.2863 | 0.8942 | 7.05 | 84.18 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 9 | 0.2810 | 0.8937 | 9.67 | 90.90 | 0.01 | 1000 | 1 | cuda | normal |
| 2 | 8 | 0.2925 | 0.8904 | 9.71 | 81.14 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 10 | 0.2981 | 0.8904 | 6.99 | 77.03 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 9 | 0.3036 | 0.8888 | 7.01 | 69.96 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 7 | 0.3011 | 0.8864 | 9.65 | 71.33 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 8 | 0.3164 | 0.8848 | 7.08 | 62.87 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 6 | 0.3109 | 0.8835 | 9.68 | 61.58 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 7 | 0.3330 | 0.8780 | 7.04 | 55.70 | 0.01 | 1000 | 1 | cuda | not_normal |
| 1 | 6 | 0.3414 | 0.8756 | 7.00 | 48.57 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 5 | 0.3353 | 0.8734 | 9.73 | 51.81 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 5 | 0.3687 | 0.8662 | 7.07 | 41.48 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 4 | 0.3637 | 0.8632 | 9.89 | 41.98 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 4 | 0.4049 | 0.8516 | 7.37 | 34.33 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 3 | 0.4033 | 0.8501 | 9.96 | 31.99 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 3 | 0.4736 | 0.8242 | 7.09 | 26.88 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 2 | 0.4837 | 0.8186 | 10.00 | 21.89 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 2 | 0.5890 | 0.7730 | 7.01 | 19.70 | 0.01 | 1000 | 1 | cuda | not_normal |
| 2 | 1 | 0.8644 | 0.6671 | 9.72 | 11.80 | 0.01 | 1000 | 1 | cuda | normal |
| 1 | 1 | 1.0888 | 0.5851 | 8.20 | 12.59 | 0.01 | 1000 | 1 | cuda | not_normal |

Here, we can see that after \(20\) epochs our network has higher accuracy when using the normalized data (\(0.9148\) versus \(0.9102\)), and the normalized run is ahead at every epoch along the way. This is sometimes referred to as faster convergence of the network.

It's important to note that it's not always better to normalize our data. A good rule of thumb is to try it both ways when in doubt.

Also, have you ever heard of batch normalization? Well, batch normalization, or batch norm, is this same process performed inside the network's layers on the output activations from each layer. Cool, right? I'll see you in the next one!
