PyTorch - Python Deep Learning Neural Network API

Deep Learning Course 4 of 6 - Level: Intermediate

CNN Training Loop Refactoring - Simultaneous Hyperparameter Testing



Refactoring the CNN Training Loop

Welcome to this neural network programming series. In this episode, we will see how we can experiment with large numbers of hyperparameter values easily while still keeping our training loop and our results organized.


Without further ado, let's get started.

Cleaning Up the Training Loop and Extracting Classes

When we left off with our training loop a couple of episodes back, we had built out quite a lot of functionality that allowed us to experiment with many different parameters and values, and we had also added the calls needed inside our training loop to get our results into TensorBoard.

All of this work has helped, but our training loop is pretty crowded now. In this episode, we're going to clean up our training loop and set the stage for more experimentation by using the RunBuilder class that we built last time and by building a new class called RunManager.

Our goal is to be able to add parameters and values at the top, and have all the values tested or tried during multiple training runs.

For example, in this case, we are saying that we want to use two parameters, lr and batch_size, and for the batch_size we want to try two different values. This gives us a total of two training runs. Both runs will have the same learning rate while the batch size varies.

params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000, 2000]
)

For the results, we'd like to see and be able to compare both runs.

run  epoch  loss   accuracy  epoch duration  run duration  lr    batch_size
1    1      0.983  0.618     48.697          50.563        0.01  1000
1    2      0.572  0.777     19.165          69.794        0.01  1000
1    3      0.468  0.827     19.366          89.252        0.01  1000
1    4      0.428  0.843     18.840          108.176       0.01  1000
1    5      0.389  0.857     19.082          127.320       0.01  1000
2    1      1.271  0.528     18.558          19.627        0.01  2000
2    2      0.623  0.757     19.822          39.520        0.01  2000
2    3      0.526  0.791     21.101          60.694        0.01  2000
2    4      0.478  0.814     20.332          81.110        0.01  2000
2    5      0.440  0.835     20.413          101.600       0.01  2000

The Two Classes We Will Build

To do this, we need two classes. We built the first class, RunBuilder, in the last episode. It's called at the top of the training loop.

for run in RunBuilder.get_runs(params):
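RunBuilder comes from the previous episode and is not shown here. As a reminder, below is a minimal sketch of what it might look like, assuming it builds one named tuple per combination of values using product and namedtuple. The implementation from the earlier episode may differ in its details.

```python
from collections import OrderedDict, namedtuple
from itertools import product

class RunBuilder():
    @staticmethod
    def get_runs(params):
        # One field per parameter name, e.g. Run(lr=..., batch_size=...)
        Run = namedtuple('Run', params.keys())

        # The Cartesian product yields every combination of parameter values
        return [Run(*values) for values in product(*params.values())]

params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000, 2000]
)

runs = RunBuilder.get_runs(params)
for run in runs:
    print(run.lr, run.batch_size)
```

With one learning rate and two batch sizes, the product contains 1 x 2 = 2 runs, which matches the two runs in the results above.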

Now, we need to build this RunManager class that will allow us to manage each run inside our run loop. The RunManager instance will allow us to pull out a lot of the tedious TensorBoard calls and allow us to add additional functionality as well.

We'll see that as our number of parameters and runs gets larger, TensorBoard will start to break down as a viable solution for reviewing our results.

The RunManager will be invoked at different stages inside each of our runs. We'll have calls at the start and end of both the run and the epoch phases. We'll also have calls to track the loss and the number of correct predictions inside each epoch. Finally, at the end, we'll save the run results to disk.
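To make the order of these calls concrete, here is a rough sketch that needs no PyTorch at all. RecordingManager is a hypothetical stand-in that shares the RunManager method names but only records which method was called. The runs, the loader, and the None arguments are placeholders, not real training objects.

```python
class RecordingManager():
    # Hypothetical stand-in: same hook names as RunManager, but each hook
    # only records that it was called.
    def __init__(self):
        self.calls = []

    def begin_run(self, run, network, loader):
        self.calls.append('begin_run')

    def end_run(self):
        self.calls.append('end_run')

    def begin_epoch(self):
        self.calls.append('begin_epoch')

    def end_epoch(self):
        self.calls.append('end_epoch')

    def track_loss(self, loss, batch):
        self.calls.append('track_loss')

    def track_num_correct(self, preds, labels):
        self.calls.append('track_num_correct')

    def save(self, fileName):
        self.calls.append('save')

m = RecordingManager()
runs = ['run-1', 'run-2']              # placeholder run definitions
loader = [('images', 'labels')] * 2    # placeholder: two batches per epoch

for run in runs:
    m.begin_run(run, network=None, loader=loader)
    for epoch in range(2):
        m.begin_epoch()
        for batch in loader:
            # the forward pass, loss, and optimizer step would go here
            m.track_loss(loss=None, batch=batch)
            m.track_num_correct(preds=None, labels=None)
        m.end_epoch()
    m.end_run()
m.save('results')

print(m.calls[:7])
```

The training loop itself only decides when each stage happens. What happens at each stage lives inside the manager.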

Let's see how to build this RunManager class.

Building the RunManager for Training Loop Runs

Let's kick things off with our imports:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from IPython.display import display, clear_output
import pandas as pd
import time
import json

from itertools import product
from collections import namedtuple
from collections import OrderedDict

First, we declare the class using the class keyword.

class RunManager():

Next, we'll define the class constructor.

def __init__(self):

    self.epoch_count = 0
    self.epoch_loss = 0
    self.epoch_num_correct = 0
    self.epoch_start_time = None

    self.run_params = None
    self.run_count = 0
    self.run_data = []
    self.run_start_time = None

    self.network = None
    self.loader = None
    self.tb = None

For now, we'll take no arguments in the constructor, and we'll just define some attributes that will enable us to keep track of data across runs and across epochs.

We'll track the following:

  • The number of epochs.
  • The running loss for an epoch.
  • The number of correct predictions for an epoch.
  • The start time of the epoch.

Remember we saw that the RunManager class has two methods with epoch in the name. We have begin_epoch() and end_epoch(). These two methods will allow us to manage these values across the epoch lifecycle.

Now, next we have some attributes for the runs. We have an attribute called run_params. This is the run definition in terms of the run parameters. Its value will be one of the runs returned by the RunBuilder class.

Next, we have attributes to track the run_count and the run_data. The run_count gives us the run number, and the run_data is a list we'll use to keep track of the parameter values and the results of each epoch for each run. We'll see that we add a value to this list for each epoch. Then, we have the run start time, which will be used to calculate the run duration.

Alright, next we will save the network and the data loader that are being used for the run, as well as a SummaryWriter that we can use to save data for TensorBoard.

What are Code Smells?

Do you smell that? There's something that doesn't smell right about this code. Have you heard of code smells before? Have you smelled them? A code smell is a term used to describe a condition where something about the code in front of our eyes doesn't seem right. It's like a gut feeling for software developers.

A code smell doesn't mean that something is definitely wrong. A code smell does not mean the code is incorrect. It just means that there is likely a better way. In this case, the code smell is the fact that we have several variable names that have a prefix. The use of the prefix here indicates that the variables somehow belong together.

Anytime we see this, we need to be thinking about removing these prefixes. Data that belongs together should be together. This is done by encapsulating the data inside of a class. After all, if the data belongs together, object-oriented languages give us the ability to express this fact using classes.

Refactoring by Extracting a Class

It's fine to leave this code in now, but later we might want to refactor this code by doing what is referred to as extracting a class. This is a refactoring technique where we remove these prefixes and create a class called Epoch, that has these attributes, count, loss, num_correct, and start_time.

class Epoch():
    def __init__(self):
        self.count = 0
        self.loss = 0
        self.num_correct = 0
        self.start_time = None 

Then, we'll replace these attributes with an instance of the Epoch class. We might even change the count variable to have a more intuitive name, like say number or id. The reason we can leave this for now is that refactoring is an iterative process, and this is our first iteration.
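As a rough sketch of where that later refactoring could lead, the manager would hold a single Epoch instance in place of the four prefixed attributes. RunManagerSketch is a hypothetical, trimmed-down class used only for illustration. It is not the RunManager built in this episode.

```python
import time

class Epoch():
    def __init__(self):
        self.count = 0
        self.loss = 0
        self.num_correct = 0
        self.start_time = None

class RunManagerSketch():
    # Hypothetical, trimmed-down manager that shows only the epoch state
    def __init__(self):
        self.epoch = Epoch()   # replaces the four epoch_* attributes

    def begin_epoch(self):
        self.epoch.start_time = time.time()
        self.epoch.count += 1
        self.epoch.loss = 0
        self.epoch.num_correct = 0

    def end_run(self):
        # Reset the counter so the next run starts at epoch one
        self.epoch.count = 0

m = RunManagerSketch()
m.begin_epoch()
m.begin_epoch()
print(m.epoch.count)
```

The prefix epoch_ turns into the attribute path epoch., so the grouping that the names only hinted at is now expressed by the structure of the code.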

Extracting Classes Creates Layers of Abstraction

Actually, what we are doing now by building this class is extracting a class from our main training loop program. The code smell that we were addressing is the fact that our loop was becoming cluttered and beginning to appear overly complex.

When we write a main program and then refactor it, we can think of this as creating layers of abstraction that make the main program more and more readable and easier to understand. Each part of the program should be very easy to understand in its own right.

When we extract code into its own class or method, we are creating additional layers of abstraction, and if we want to understand the implementation details of any of the layers, we dive in so to speak.

In an iterative way, we can think of starting with one single program, and then later extracting the code that creates deeper and deeper layers. The process can be viewed as a branching tree-like structure.

Beginning a Training Loop Run

Anyway, let's look at the first method of this class which extracts the code needed to begin a run.

def begin_run(self, run, network, loader):

    self.run_start_time = time.time()

    self.run_params = run
    self.run_count += 1

    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')

    images, labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)

    self.tb.add_image('images', grid)
    self.tb.add_graph(self.network, images)

First, we capture the start time for the run. Then, we save the passed in run parameters and increment the run count by one. After this, we save our network and our data loader, and then, we initialize a SummaryWriter for TensorBoard. Notice how we are passing our run as the comment argument. This will allow us to uniquely identify our run inside TensorBoard.
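To see what the comment string looks like, here is a small sketch that assumes the run is a named tuple with the fields lr and batch_size (the Run name is an assumption for illustration). SummaryWriter appends the comment to its default log directory name, so each run should end up in its own directory.

```python
from collections import namedtuple

# Assumed shape of a run: a named tuple with one field per parameter
Run = namedtuple('Run', ['lr', 'batch_size'])
run = Run(lr=0.01, batch_size=1000)

comment = f'-{run}'
print(comment)  # -Run(lr=0.01, batch_size=1000)
```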

Alright, next we just have some TensorBoard calls that we made in our training loop before. These calls add our network and a batch of images to TensorBoard.

When we end a run, all we have to do is close the TensorBoard handle and set the epoch count back to zero to be ready for the next run.

def end_run(self):
    self.tb.close()
    self.epoch_count = 0

For starting an epoch, we first save the start time. Then, we increment the epoch_count by one and set the epoch_loss and epoch_num_correct to zero.

def begin_epoch(self):
    self.epoch_start_time = time.time()

    self.epoch_count += 1
    self.epoch_loss = 0
    self.epoch_num_correct = 0

Now, let's look at where the bulk of the action occurs which is ending an epoch.

def end_epoch(self):

    epoch_duration = time.time() - self.epoch_start_time
    run_duration = time.time() - self.run_start_time

    loss = self.epoch_loss / len(self.loader.dataset)
    accuracy = self.epoch_num_correct / len(self.loader.dataset)

    self.tb.add_scalar('Loss', loss, self.epoch_count)
    self.tb.add_scalar('Accuracy', accuracy, self.epoch_count)

    for name, param in self.network.named_parameters():
        self.tb.add_histogram(name, param, self.epoch_count)
        self.tb.add_histogram(f'{name}.grad', param.grad, self.epoch_count)


We start by calculating the epoch duration and the run duration. Since we are at the end of an epoch, the epoch duration is final, but the run duration here represents the running time of the current run. The value will keep running until the run ends. However, we'll still save it with each epoch.

Next, we compute the epoch_loss and accuracy, and we do it relative to the size of the training set. This gives us the average loss per sample. Then, we pass both of these values to TensorBoard.

Next, we pass our network's weights and gradient values to TensorBoard like we did before.

Tracking Our Training Loop Performance

We're ready now for what's new in this processing. This is the part that we are adding to give us additional insight when we perform large numbers of runs. We're going to save all of the data ourselves so we can analyze it outside of TensorBoard.

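def end_epoch(self):

    # ... duration, loss, accuracy, and TensorBoard calls from above ...

    results = OrderedDict()
    results['run'] = self.run_count
    results['epoch'] = self.epoch_count
    results['loss'] = loss
    results['accuracy'] = accuracy
    results['epoch duration'] = epoch_duration
    results['run duration'] = run_duration
    for k, v in self.run_params._asdict().items():
        results[k] = v
    self.run_data.append(results)

    df = pd.DataFrame.from_dict(self.run_data, orient='columns')

    # Jupyter-specific: redraw the results table in place
    clear_output(wait=True)
    display(df)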


Here, we are building a dictionary that contains the keys and values we care about for our run. We add in the run_count, the epoch_count, the loss, the accuracy, the epoch_duration, and the run_duration.

Then, we iterate over the keys and values inside our run parameters adding them to the results dictionary. This will allow us to see the parameters that are associated with the performance results.

Finally, we append the results to the run_data list.

Once the data is added to the list, we turn the data list into a pandas data frame so we can have formatted output.

The next two lines are specific to Jupyter notebook. We clear the current output and display the new data frame.
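The dictionary-building step can be tried on its own without pandas or a notebook. The sketch below assumes the run parameters are a named tuple (the Run name is an assumption) and uses the values from the first row of the results table shown earlier.

```python
from collections import OrderedDict, namedtuple

Run = namedtuple('Run', ['lr', 'batch_size'])   # assumed shape of a run
run_params = Run(lr=0.01, batch_size=1000)
run_data = []

# Values copied from the first row of the results table
results = OrderedDict()
results['run'] = 1
results['epoch'] = 1
results['loss'] = 0.983
results['accuracy'] = 0.618
results['epoch duration'] = 48.697
results['run duration'] = 50.563

# _asdict() turns the named tuple into a field-name-to-value mapping
for k, v in run_params._asdict().items():
    results[k] = v

run_data.append(results)
print(list(results.keys()))
```

Because the parameters are merged in last, they show up as the rightmost columns, and every row carries the settings that produced it.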


Alright, that ends an epoch. One thing you may be wondering is how the epoch_loss and epoch_num_correct values were tracked. Well, we have two methods just below for that.

def track_loss(self, loss, batch):
    self.epoch_loss += loss.item() * batch[0].shape[0]

def track_num_correct(self, preds, labels):
    self.epoch_num_correct += self._get_num_correct(preds, labels)

We have a method called track_loss() and a method called track_num_correct(). These methods are called inside the training loop after each batch. The loss is passed into the track_loss() method and the predictions and labels are passed into the track_num_correct() method.
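Why does track_loss() multiply by the batch size? The sketch below assumes the loss function returns the mean over the batch, as PyTorch's cross-entropy does by default. Multiplying by the number of samples recovers the batch's summed loss, so dividing the accumulated total by the dataset size in end_epoch() gives a true per-sample average even when the last batch is smaller. The numbers are made up for illustration.

```python
# (mean loss for the batch, number of samples in the batch)
# Hypothetical values; note the smaller final batch.
batches = [(0.50, 1000), (0.40, 1000), (1.00, 500)]

dataset_size = sum(size for _, size in batches)   # 2500 samples

epoch_loss = 0
for mean_loss, batch_size in batches:
    epoch_loss += mean_loss * batch_size   # what track_loss() accumulates

per_sample_loss = epoch_loss / dataset_size   # what end_epoch() computes

# Averaging the batch means directly would over-weight the small batch
naive_loss = sum(mean_loss for mean_loss, _ in batches) / len(batches)

print(per_sample_loss, naive_loss)
```

Here the weighted result is 0.56, while the naive average of the three batch means comes out higher, at roughly 0.63.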

To calculate the number of correct predictions, we are using the same get_num_correct() function that we defined in previous episodes. The difference here is that the function is now encapsulated inside our RunManager class as the method _get_num_correct().

def _get_num_correct(self, preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()
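For readers who want to see the counting logic without tensors, here is a dependency-free sketch of the same idea on plain Python lists. This is an illustration only, not how the RunManager does it, and the scores and labels are made up.

```python
def get_num_correct(preds, labels):
    # preds: one list of class scores per sample
    # labels: the true class index for each sample
    predicted = [row.index(max(row)) for row in preds]   # argmax over dim 1
    return sum(int(p == t) for p, t in zip(predicted, labels))

preds = [
    [0.1, 0.7, 0.2],   # highest score at index 1
    [0.8, 0.1, 0.1],   # highest score at index 0
    [0.3, 0.3, 0.4],   # highest score at index 2
]
labels = [1, 2, 2]

print(get_num_correct(preds, labels))  # 2
```

The first and third samples are predicted correctly, and the second is not, so the count is two.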

Lastly, we have a method called save() that saves the run_data in two formats, JSON and CSV. This output goes to disk and makes it available for other apps to consume. For example, we can open the CSV file in Excel, or we can even build our own, even better TensorBoard with the data.

def save(self, fileName):

    pd.DataFrame.from_dict(
        self.run_data, orient='columns'
    ).to_csv(f'{fileName}.csv')

    with open(f'{fileName}.json', 'w', encoding='utf-8') as f:
        json.dump(self.run_data, f, ensure_ascii=False, indent=4)

That's it. Now, we can use this RunManager class inside our training loop.

If we use the following parameters:

params = OrderedDict(
    lr = [.01]
    ,batch_size = [1000, 2000]
    ,shuffle = [True, False]
)

These are the results we get:

run  epoch  loss   accuracy  epoch duration  run duration  lr    batch_size  shuffle
1    1      0.979  0.624     20.056          22.935        0.01  1000        True
1    2      0.514  0.805     19.786          43.141        0.01  1000        True
1    3      0.425  0.843     20.117          63.342        0.01  1000        True
1    4      0.378  0.861     19.556          82.969        0.01  1000        True
1    5      0.342  0.872     18.706          101.752       0.01  1000        True
2    1      0.965  0.632     18.846          19.390        0.01  1000        False
2    2      0.503  0.806     20.276          39.758        0.01  1000        False
2    3      0.409  0.849     19.741          59.579        0.01  1000        False
2    4      0.360  0.866     19.358          79.015        0.01  1000        False
2    5      0.330  0.877     19.523          98.616        0.01  1000        False
3    1      1.298  0.513     18.831          20.039        0.01  2000        True
3    2      0.665  0.745     18.872          38.988        0.01  2000        True
3    3      0.548  0.789     18.947          58.012        0.01  2000        True
3    4      0.485  0.819     19.325          77.416        0.01  2000        True
3    5      0.443  0.838     19.629          97.121        0.01  2000        True
4    1      1.305  0.497     19.242          20.465        0.01  2000        False
4    2      0.693  0.727     18.858          39.406        0.01  2000        False
4    3      0.572  0.777     18.839          58.321        0.01  2000        False
4    4      0.503  0.809     18.774          77.168        0.01  2000        False
4    5      0.462  0.831     19.028          96.274        0.01  2000        False
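Once the results are on disk, comparing runs no longer depends on TensorBoard. As a rough sketch, the code below writes a JSON file the same way save() does, reads it back, and ranks the runs by final accuracy. The rows are the final-epoch rows copied from the table above, and the temporary file location is an assumption made only to keep the example self-contained.

```python
import json
import os
import tempfile

# Final-epoch rows copied from the results table above
run_data = [
    {'run': 1, 'epoch': 5, 'accuracy': 0.872, 'batch_size': 1000, 'shuffle': True},
    {'run': 2, 'epoch': 5, 'accuracy': 0.877, 'batch_size': 1000, 'shuffle': False},
    {'run': 3, 'epoch': 5, 'accuracy': 0.838, 'batch_size': 2000, 'shuffle': True},
    {'run': 4, 'epoch': 5, 'accuracy': 0.831, 'batch_size': 2000, 'shuffle': False},
]

# Write the file the same way save() does, into a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'results.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(run_data, f, ensure_ascii=False, indent=4)

# Any other program can now read the results back
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

# Rank the runs by accuracy, best first
ranked = sorted(loaded, key=lambda row: row['accuracy'], reverse=True)
best = ranked[0]
print(best['run'], best['accuracy'], best['batch_size'], best['shuffle'])
```

In these results, both runs with a batch size of 1000 finish ahead of both runs with a batch size of 2000 after five epochs.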

What does it Feel Like to be Wrong?

Whoa. Like jinkies. Hello. Don't mind me. I've just been here refactoring code and pondering this question. Oh. You are wondering what the question is. Well, the question is this. What does it feel like to be wrong?

Maybe we might describe it as feeling bad. Or, maybe we might describe it as embarrassing, or humiliating.

Well, no. Actually, this is not the way it feels to be wrong. These are the feelings we feel after we know we are wrong, in which case we are no longer wrong.

From this fact, we can deduce what it actually feels like to be wrong. That is, what it feels like to be wrong, before we realize it, is what it feels like to be right.

