CNN Training Loop Refactoring - Simultaneous Hyperparameter Testing
Refactoring the CNN Training Loop
Welcome to this neural network programming series. In this episode, we will see how we can experiment with large numbers of hyperparameter values easily while still keeping our training loop and our results organized.

Without further ado, let's get started.
Cleaning Up the Training Loop and Extracting Classes
When we left off with our training loop a couple of episodes back, we built out quite a lot of functionality that allowed us to experiment with many different parameters and values, and we also made the calls needed inside our training loop to get our results into TensorBoard.
All of this work has helped, but our training loop is pretty crowded now. In this episode, we're going to clean up our training loop and set the stage for more experimentation by using the RunBuilder class that we built last time and by building a new class called RunManager.
Our goal is to be able to add parameters and values at the top, and have all the values tested or tried during multiple training runs.
For example, in this case, we are saying that we want to use two parameters, lr and batch_size, and for the batch_size we want to try two different values. This gives us a total of two training runs. Both runs will have the same learning rate while the batch size varies.
```python
params = OrderedDict(
    lr = [.01],
    batch_size = [1000, 2000]
)
```
For the results, we'd like to see and be able to compare both runs.
run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size |
---|---|---|---|---|---|---|---|
1 | 1 | 0.983 | 0.618 | 48.697 | 50.563 | 0.01 | 1000 |
1 | 2 | 0.572 | 0.777 | 19.165 | 69.794 | 0.01 | 1000 |
1 | 3 | 0.468 | 0.827 | 19.366 | 89.252 | 0.01 | 1000 |
1 | 4 | 0.428 | 0.843 | 18.840 | 108.176 | 0.01 | 1000 |
1 | 5 | 0.389 | 0.857 | 19.082 | 127.320 | 0.01 | 1000 |
2 | 1 | 1.271 | 0.528 | 18.558 | 19.627 | 0.01 | 2000 |
2 | 2 | 0.623 | 0.757 | 19.822 | 39.520 | 0.01 | 2000 |
2 | 3 | 0.526 | 0.791 | 21.101 | 60.694 | 0.01 | 2000 |
2 | 4 | 0.478 | 0.814 | 20.332 | 81.110 | 0.01 | 2000 |
2 | 5 | 0.440 | 0.835 | 20.413 | 101.600 | 0.01 | 2000 |
The Two Classes We Will Build
To do this, we need to build two new classes. We built the first class, called RunBuilder, in the last episode. It's being called at the top:

```python
for run in RunBuilder.get_runs(params):
```
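As a quick refresher, here is a minimal sketch of what RunBuilder.get_runs() does: it takes the OrderedDict of parameters and returns one namedtuple per combination of parameter values. This mirrors the class we built last time, so treat it as a reminder rather than the definitive implementation.

```python
from collections import namedtuple, OrderedDict
from itertools import product

class RunBuilder():
    @staticmethod
    def get_runs(params):
        # A Run namedtuple whose fields are the parameter names
        Run = namedtuple('Run', params.keys())

        # One Run per element of the cartesian product of the value lists
        runs = []
        for v in product(*params.values()):
            runs.append(Run(*v))

        return runs

params = OrderedDict(lr=[.01], batch_size=[1000, 2000])
RunBuilder.get_runs(params)
# [Run(lr=0.01, batch_size=1000), Run(lr=0.01, batch_size=2000)]
```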
Now, we need to build the RunManager class that will allow us to manage each run inside our run loop. The RunManager instance will allow us to pull out a lot of the tedious TensorBoard calls and will allow us to add additional functionality as well.
We'll see that as our number of parameters and runs gets larger, TensorBoard will start to break down as a viable solution for reviewing our results.
The RunManager will be invoked at different stages inside each of our runs. We'll have calls at the start and end of both the run and the epoch phases. We'll also have calls to track the loss and the number of correct predictions inside each epoch. Finally, at the end, we'll save the run results to disk.
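Putting those stages together, here is a rough skeleton of the interface we are about to build; the method bodies are filled in over the rest of this episode, so treat this as an outline rather than working code.

```python
class RunManager():
    def begin_run(self, run, network, loader): ...  # called once at the start of each run
    def end_run(self): ...                          # close TensorBoard and reset the epoch count
    def begin_epoch(self): ...                      # reset the per-epoch counters
    def end_epoch(self): ...                        # compute, display, and record the results
    def track_loss(self, loss, batch): ...          # called after every batch
    def track_num_correct(self, preds, labels): ... # called after every batch
    def save(self, fileName): ...                   # write run_data to csv and json
```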
Let's see how to build this RunManager class.
Building the RunManager for Training Loop Runs
Let's kick things off with our imports:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from IPython.display import display, clear_output

import pandas as pd
import time
import json

from itertools import product
from collections import namedtuple
from collections import OrderedDict
```
First, we declare the class using the class keyword.
```python
class RunManager():
```
Next, we'll define the class constructor.
```python
def __init__(self):
    self.epoch_count = 0
    self.epoch_loss = 0
    self.epoch_num_correct = 0
    self.epoch_start_time = None

    self.run_params = None
    self.run_count = 0
    self.run_data = []
    self.run_start_time = None

    self.network = None
    self.loader = None
    self.tb = None
```
For now, we'll take no arguments in the constructor, and we'll just define some attributes that will enable us to keep track of data across runs and across epochs.
We'll track the following:
- The number of epochs.
- The running loss for an epoch.
- The number of correct predictions for an epoch.
- The start time of the epoch.
Remember we saw that the RunManager class has two methods with epoch in the name: begin_epoch() and end_epoch(). These two methods will allow us to manage these values across the epoch lifecycle.
Now, we have some attributes for the runs. We have an attribute called run_params. This is the run definition in terms of the run parameters. Its value will be one of the runs returned by the RunBuilder class.
Next, we have attributes to track the run_count and the run_data. The run_count gives us the run number, and the run_data is a list we'll use to keep track of the parameter values and the results of each epoch for each run, so we'll see that we add a value to this list for each epoch. Then, we have the run start time, which will be used to calculate the run duration.
Alright, next we will save the network and the data loader that are being used for the run, as well as a SummaryWriter that we can use to save data for TensorBoard.
What are Code Smells?
Do you smell that? There's something that doesn't smell right about this code. Have you heard of code smells before? Have you smelled them? A code smell is a term used to describe a condition where something about the code in front of our eyes doesn't seem right. It's like a gut feeling for software developers.
A code smell doesn't mean that something is definitely wrong. A code smell does not mean the code is incorrect. It just means that there is likely a better way. In this case, the code smell is the fact that we have several variable names that have a prefix. The use of the prefix here indicates that the variables somehow belong together.
Anytime we see this, we need to be thinking about removing these prefixes. Data that belongs together should be together. This is done by encapsulating the data inside of a class. After all, if the data belongs together, object oriented languages give us the ability to express this fact using classes.
Refactoring by Extracting a Class
It's fine to leave this code in for now, but later we might want to refactor this code by doing what is referred to as extracting a class. This is a refactoring technique where we remove these prefixes and create a class called Epoch that has these attributes: count, loss, num_correct, and start_time.
```python
class Epoch():
    def __init__(self):
        self.count = 0
        self.loss = 0
        self.num_correct = 0
        self.start_time = None
```
Then, we'll replace these class variables with an instance of the Epoch class. We might even change the count variable to have a more intuitive name, like say number or id. The reason we can leave this for now is that refactoring is an iterative process, and this is our first iteration.
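To make the extraction concrete, here is a sketch of how the RunManager constructor might look after this refactoring, reusing the Epoch class above. We won't actually apply this change in this episode, so take it only as an illustration of the technique.

```python
class RunManager():
    def __init__(self):
        # The epoch-related values now live together inside one Epoch object,
        # so the rest of the class would refer to self.epoch.count,
        # self.epoch.loss, and so on.
        self.epoch = Epoch()

        self.run_params = None
        self.run_count = 0
        self.run_data = []
        self.run_start_time = None

        self.network = None
        self.loader = None
        self.tb = None
```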
Extracting Classes Creates Layers of Abstraction
Actually, what we are doing now by building this class is extracting a class from our main training loop program. The code smell that we were addressing is the fact that our loop was becoming cluttered and beginning to appear overly complex.
When we write a main program and then refactor it, we can think of this as creating layers of abstraction that make the main program more and more readable and easier to understand. Each part of the program should be very easy to understand in its own right.
When we extract code into its own class or method, we are creating additional layers of abstraction, and if we want to understand the implementation details of any of the layers, we dive in so to speak.
In an iterative way, we can think of starting with one single program, and then, later, extracting code that creates deeper and deeper layers. The process can be viewed as a branching tree-like structure.
Beginning a Training Loop Run
Anyway, let's look at the first method of this class, which extracts the code needed to begin a run.
```python
def begin_run(self, run, network, loader):
    self.run_start_time = time.time()

    self.run_params = run
    self.run_count += 1

    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')

    images, labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)

    self.tb.add_image('images', grid)
    self.tb.add_graph(self.network, images)
```
First, we capture the start time for the run. Then, we save the passed-in run parameters and increment the run count by one. After this, we save our network and our data loader, and then we initialize a SummaryWriter for TensorBoard. Notice how we are passing our run as the comment argument. This will allow us to uniquely identify our run inside TensorBoard.
Alright, next we just have some TensorBoard calls that we made in our training loop before. These calls add our network and a batch of images to TensorBoard.
When we end a run, all we have to do is close the TensorBoard handle and set the epoch count back to zero to be ready for the next run.
```python
def end_run(self):
    self.tb.close()
    self.epoch_count = 0
```
For starting an epoch, we first save the start time. Then, we increment the epoch_count by one and set the epoch_loss and epoch_num_correct to zero.
```python
def begin_epoch(self):
    self.epoch_start_time = time.time()

    self.epoch_count += 1
    self.epoch_loss = 0
    self.epoch_num_correct = 0
```
Now, let's look at where the bulk of the action occurs, which is ending an epoch.
```python
def end_epoch(self):
    epoch_duration = time.time() - self.epoch_start_time
    run_duration = time.time() - self.run_start_time

    loss = self.epoch_loss / len(self.loader.dataset)
    accuracy = self.epoch_num_correct / len(self.loader.dataset)

    self.tb.add_scalar('Loss', loss, self.epoch_count)
    self.tb.add_scalar('Accuracy', accuracy, self.epoch_count)

    for name, param in self.network.named_parameters():
        self.tb.add_histogram(name, param, self.epoch_count)
        self.tb.add_histogram(f'{name}.grad', param.grad, self.epoch_count)
    ...
```
We start by calculating the epoch duration and the run duration. Since we are at the end of an epoch, the epoch duration is final, but the run duration here represents the running time of the current run. The value will keep running until the run ends. However, we'll still save it with each epoch.
Next, we compute the loss and the accuracy for the epoch, and we do it relative to the size of the training set. This gives us the average loss per sample. Then, we pass both of these values to TensorBoard.
Next, we pass our network's weights and gradient values to TensorBoard like we did before.
Tracking Our Training Loop Performance
We're ready now for what's new in this processing. This is the part that we are adding to give us additional insight when we perform large numbers of runs. We're going to save all of the data ourselves so we can analyze it outside of TensorBoard.
```python
def end_epoch(self):
    ...
    results = OrderedDict()
    results["run"] = self.run_count
    results["epoch"] = self.epoch_count
    results["loss"] = loss
    results["accuracy"] = accuracy
    results["epoch duration"] = epoch_duration
    results["run duration"] = run_duration
    for k, v in self.run_params._asdict().items():
        results[k] = v
    self.run_data.append(results)
    df = pd.DataFrame.from_dict(self.run_data, orient='columns')
    ...
```
Here, we are building a dictionary that contains the keys and values we care about for our run. We add in the run_count, the epoch_count, the loss, the accuracy, the epoch_duration, and the run_duration.
Then, we iterate over the keys and values inside our run parameters adding them to the results dictionary. This will allow us to see the parameters that are associated with the performance results.
Finally, we append the results to the run_data list.
Once the data is added to the list, we turn the data list into a pandas DataFrame so we can have formatted output.
The next two lines are specific to Jupyter notebooks. We clear the current output and display the new data frame.
```python
clear_output(wait=True)
display(df)
```
Alright, that ends an epoch. One thing you may be wondering is how the epoch_loss and epoch_num_correct values were tracked. Well, we have two methods just below for that.
```python
def track_loss(self, loss, batch):
    self.epoch_loss += loss.item() * batch[0].shape[0]

def track_num_correct(self, preds, labels):
    self.epoch_num_correct += self._get_num_correct(preds, labels)
```
We have a method called track_loss() and a method called track_num_correct(). These methods are called inside the training loop after each batch. The loss is passed into the track_loss() method, and the predictions and labels are passed into the track_num_correct() method. Note that the loss is multiplied by the batch size because cross_entropy returns the mean loss over the batch; weighting by the batch size gives us the batch's total loss, which is what we want to accumulate before dividing by the dataset size at the end of the epoch.
To calculate the number of correct predictions, we are using the same get_num_correct() function that we defined in previous episodes. The difference here is that the function is now encapsulated inside our RunManager class as the _get_num_correct() method.
```python
def _get_num_correct(self, preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()
```
Lastly, we have a method called save() that saves the run_data in two formats, json and csv. This output goes to disk and makes the data available for other applications to consume. For example, we can open the csv file in Excel, or we can even build our own, even better TensorBoard with the data.
```python
def save(self, fileName):
    pd.DataFrame.from_dict(
        self.run_data, orient='columns'
    ).to_csv(f'{fileName}.csv')

    with open(f'{fileName}.json', 'w', encoding='utf-8') as f:
        json.dump(self.run_data, f, ensure_ascii=False, indent=4)
```
That's it. Now, we can use this RunManager class inside our training loop.
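Here is a sketch of how the training loop looks with the RunManager calls in place. The Network class and the train_set are the ones from earlier episodes, so take those two names as assumptions carried over from before; everything else uses only what we built above.

```python
params = OrderedDict(
    lr = [.01],
    batch_size = [1000, 2000]
)

m = RunManager()
for run in RunBuilder.get_runs(params):

    network = Network()  # assumed: the CNN built in earlier episodes
    loader = DataLoader(train_set, batch_size=run.batch_size)  # assumed: train_set from earlier episodes
    optimizer = optim.Adam(network.parameters(), lr=run.lr)

    m.begin_run(run, network, loader)
    for epoch in range(5):
        m.begin_epoch()
        for batch in loader:
            images, labels = batch

            preds = network(images)                # forward pass
            loss = F.cross_entropy(preds, labels)  # calculate the loss
            optimizer.zero_grad()                  # zero the gradients
            loss.backward()                        # calculate the gradients
            optimizer.step()                       # update the weights

            m.track_loss(loss, batch)
            m.track_num_correct(preds, labels)

        m.end_epoch()
    m.end_run()

m.save('results')
```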
If we use the following parameters:
```python
params = OrderedDict(
    lr = [.01],
    batch_size = [1000, 2000],
    shuffle = [True, False]
)
```
These are the results we get:
run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | shuffle |
---|---|---|---|---|---|---|---|---|
1 | 1 | 0.979 | 0.624 | 20.056 | 22.935 | 0.01 | 1000 | True |
1 | 2 | 0.514 | 0.805 | 19.786 | 43.141 | 0.01 | 1000 | True |
1 | 3 | 0.425 | 0.843 | 20.117 | 63.342 | 0.01 | 1000 | True |
1 | 4 | 0.378 | 0.861 | 19.556 | 82.969 | 0.01 | 1000 | True |
1 | 5 | 0.342 | 0.872 | 18.706 | 101.752 | 0.01 | 1000 | True |
2 | 1 | 0.965 | 0.632 | 18.846 | 19.390 | 0.01 | 1000 | False |
2 | 2 | 0.503 | 0.806 | 20.276 | 39.758 | 0.01 | 1000 | False |
2 | 3 | 0.409 | 0.849 | 19.741 | 59.579 | 0.01 | 1000 | False |
2 | 4 | 0.360 | 0.866 | 19.358 | 79.015 | 0.01 | 1000 | False |
2 | 5 | 0.330 | 0.877 | 19.523 | 98.616 | 0.01 | 1000 | False |
3 | 1 | 1.298 | 0.513 | 18.831 | 20.039 | 0.01 | 2000 | True |
3 | 2 | 0.665 | 0.745 | 18.872 | 38.988 | 0.01 | 2000 | True |
3 | 3 | 0.548 | 0.789 | 18.947 | 58.012 | 0.01 | 2000 | True |
3 | 4 | 0.485 | 0.819 | 19.325 | 77.416 | 0.01 | 2000 | True |
3 | 5 | 0.443 | 0.838 | 19.629 | 97.121 | 0.01 | 2000 | True |
4 | 1 | 1.305 | 0.497 | 19.242 | 20.465 | 0.01 | 2000 | False |
4 | 2 | 0.693 | 0.727 | 18.858 | 39.406 | 0.01 | 2000 | False |
4 | 3 | 0.572 | 0.777 | 18.839 | 58.321 | 0.01 | 2000 | False |
4 | 4 | 0.503 | 0.809 | 18.774 | 77.168 | 0.01 | 2000 | False |
4 | 5 | 0.462 | 0.831 | 19.028 | 96.274 | 0.01 | 2000 | False |
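Since the results are also saved to disk, we can pull them back into pandas outside of the notebook and compare the runs however we like. For example, assuming we called save('results'), a quick sort by accuracy looks like this:

```python
import pandas as pd

# Load the saved results and show the best-performing epochs first
df = pd.read_csv('results.csv')
print(df.sort_values('accuracy', ascending=False).head())
```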
What does it Feel Like to be Wrong?
Whoa. Like jinkies. Hello. Don't mind me. I've just been here refactoring code and pondering this question. Oh. You are wondering what the question is. Well, the question is this. What does it feel like to be wrong?
Maybe we might describe it as feeling bad. Or, maybe we might describe it as embarrassing, or humiliating.
Well, no. Actually, this is not the way it feels to be wrong. These are the feelings we feel after we know we are wrong, in which case we are no longer wrong.
From this fact, we can deduce what it actually feels like to be wrong. That is, what it feels like to be wrong, before we realize it, is what it feels like to be right.