PyTorch - Python Deep Learning Neural Network API

Deep Learning Course 4 of 6 - Level: Intermediate

PyTorch on the GPU - Training Neural Networks with CUDA



Run PyTorch Code on a GPU - Neural Network Programming Guide

Welcome to deeplizard. My name is Chris. In this episode, we're going to learn how to use the GPU with PyTorch. We'll see how to use the GPU in general, and we'll see how to apply these general techniques to training our neural network.

Without further ado, let's get started.

Using a GPU for Deep Learning

If you haven't seen the episode on why deep learning and neural networks use GPUs, be sure to review that episode alongside this one to get the best understanding of these concepts.

For now, we're going to hit the ground running with a PyTorch GPU example.

PyTorch GPU Example

PyTorch allows us to seamlessly move data to and from our GPU as we perform computations inside our programs.

When we go to the GPU, we can use the cuda() method, and when we go to the CPU, we can use the cpu() method.

We can also use the to() method. To go to the GPU, we write to('cuda') and to go to the CPU, we write to('cpu'). The to() method is the preferred way mainly because it is more flexible. We'll see one example using the first two, and then we'll default to always using the to() variant.

cpu() cuda()
to('cpu') to('cuda')

To make use of our GPU during the training process, there are two essential requirements: the data must be moved to the GPU, and the network must be moved to the GPU.

  1. Data on the GPU
  2. Network on the GPU

By default, when a PyTorch tensor or a PyTorch neural network module is created, the corresponding data is initialized on the CPU. Specifically, the data exists inside the CPU's memory.
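This default can be checked with a minimal sketch. Note that nn.Linear is only a stand-in here for the series' Network class, which isn't defined in this post, so treat it as an assumption:

```python
import torch
import torch.nn as nn

# A freshly created tensor lives in CPU memory unless a device is specified.
t = torch.ones(1, 1, 28, 28)
print(t.device)

# nn.Linear is a stand-in for the series' Network class (an assumption).
layer = nn.Linear(4, 2)
print(layer.weight.device)
```

Both lines should report the cpu device on a default PyTorch setup.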

Now, let's create a tensor and a network, and see how we make the move from CPU to GPU.

Here, we create a tensor and a network:

t = torch.ones(1,1,28,28)
network = Network()

Now, we call the cuda() method and reassign the tensor and network to returned values that have been copied onto the GPU:

t = t.cuda()
network = network.cuda()

Next, we can get a prediction from the network and see that the prediction tensor's device attribute confirms that the data is on cuda, which is the GPU:

> gpu_pred = network(t)
> gpu_pred.device

device(type='cuda', index=0)

Likewise, we can go in the opposite way:

> t = t.cpu()
> network = network.cpu()

> cpu_pred = network(t)
> cpu_pred.device

device(type='cpu')

This is, in a nutshell, how we can utilize the GPU capabilities of PyTorch. What we should turn to now are some important details that are lurking beneath the surface of the code we've just seen.

For example, although we've used the cuda() and cpu() methods, they aren't actually our best options. Furthermore, how do these methods differ between the network instance and the tensor instance? These are, after all, different object types, which means the two methods are different. Finally, we want to integrate this code into a working example and do a performance test.

General Idea of Using a GPU

The main takeaway at this point is that our network and our data must both exist on the GPU in order to perform computations using the GPU, and this applies to any programming language or framework.

As we'll see in our next demonstration, this is also true for the CPU. GPUs and CPUs are compute devices that compute on data, so any two values that are used directly with one another in a computation must exist on the same device.

PyTorch Tensor Computations on a GPU

Let's dive deeper by demonstrating some tensor computations.

We'll start by creating two tensors:

t1 = torch.tensor([
    [1,2],
    [3,4]
])

t2 = torch.tensor([
    [5,6],
    [7,8]
])

Now, we'll check which device these tensors were initialized on by inspecting the device attribute:

> t1.device, t2.device

(device(type='cpu'), device(type='cpu'))

As we'd expect, we see that, indeed, both tensors are on the same device, which is the CPU. Let's move the first tensor t1 to the GPU.

> t1 = t1.to('cuda')
> t1.device

device(type='cuda', index=0)

We can see that this tensor's device has been changed to cuda, the GPU. Note the use of the to() method here. Instead of calling a particular method to move to a device, we call the same method and pass an argument that specifies the device. Using the to() method is the preferred way of moving data to and from devices.

Also, note the reassignment. The operation is not in-place, and so the reassignment is required.
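The reassignment rule can be illustrated without a GPU. This is a minimal sketch that uses a dtype conversion instead of a device move, on the assumption that the same to() method follows the same rule in both cases: it returns a converted tensor and leaves the original untouched.

```python
import torch

t = torch.ones(2, 2)           # float32 tensor on the CPU

# to() returns a new tensor when a conversion is needed; t itself is unchanged.
t.to(torch.float64)            # result discarded
print(t.dtype)                 # torch.float32

# Reassigning keeps the converted tensor.
t = t.to(torch.float64)
print(t.dtype)                 # torch.float64

# When nothing needs to change, to() hands back the same tensor object.
same = t.to(torch.float64)
print(same is t)               # True
```

Forgetting the reassignment is an easy mistake to make, because the call itself succeeds silently.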

Let's try an experiment. I'd like to test what we discussed earlier by attempting to perform a computation on these two tensors, t1 and t2, that we now know to be on different devices.

Since we expect an error, we'll wrap the call in a try and catch the exception:

try:
    t1 + t2
except Exception as e:
    print(e)

expected device cuda:0 but got device cpu

By reversing the order of the operation, we can see that the error also changes:

try:
    t2 + t1
except Exception as e:
    print(e)

expected device cpu but got device cuda:0

Both of these errors are telling us that the binary plus operation expects the second argument to have the same device as the first argument. Understanding the meaning of this error can help when debugging these types of device mismatches.
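When debugging a mismatch like this, it can help to list the devices involved before running the operation. Here is a small sketch; devices_of is a hypothetical helper and not part of PyTorch, and the mismatch is only produced on machines where CUDA is available:

```python
import torch

def devices_of(*tensors):
    """Return the set of devices the given tensors live on (hypothetical helper)."""
    return {t.device for t in tensors}

a = torch.ones(2, 2)
b = torch.ones(2, 2)
if torch.cuda.is_available():
    b = b.to('cuda')   # only creates a mismatch when a GPU is present

found = devices_of(a, b)
if len(found) > 1:
    print('device mismatch:', found)
else:
    print('all tensors on one device:', found)
```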

Finally, for completeness, let's move the second tensor to the cuda device to see the operation succeed.

> t2 = t2.to('cuda')
> t1 + t2

tensor([[ 6,  8],
        [10, 12]], device='cuda:0')

PyTorch nn.Module Computations on a GPU

We've just seen how tensors can be moved to and from devices. Now, let's see how this is done with PyTorch nn.Module instances.

More generally, we are interested in understanding how and what it means for a network to be on a device like a GPU or CPU. PyTorch aside, this is the essential issue.

We put a network on a device by moving the network's parameters to that device. Let's create a network and take a look at what we mean.

network = Network()

Now, let's look at the network's parameters:

for name, param in network.named_parameters():
    print(name, '\t\t', param.shape)

conv1.weight        torch.Size([6, 1, 5, 5])
conv1.bias          torch.Size([6])
conv2.weight        torch.Size([12, 6, 5, 5])
conv2.bias          torch.Size([12])
fc1.weight          torch.Size([120, 192])
fc1.bias            torch.Size([120])
fc2.weight          torch.Size([60, 120])
fc2.bias            torch.Size([60])
out.weight          torch.Size([10, 60])
out.bias            torch.Size([10])

Here, we've created a PyTorch network, and we've iterated through the network's parameters. As we can see, the network's parameters are the weights and biases inside the network.

In other words, these are simply tensors that live on a device like we have already seen. Let's verify this by checking the device of each of the parameters.

for n, p in network.named_parameters():
    print(p.device, '', n)

cpu  conv1.weight
cpu  conv1.bias
cpu  conv2.weight
cpu  conv2.bias
cpu  fc1.weight
cpu  fc1.bias
cpu  fc2.weight
cpu  fc2.bias
cpu  out.weight
cpu  out.bias

This shows us that all the parameters inside the network are, by default, initialized on the CPU.

An important consequence is that this explains why nn.Module instances like networks don't actually have a device. It's not the network that lives on a device, but the tensors inside the network that live on a device.
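Since a module has no device of its own, one common workaround is to ask one of its parameters. The sketch below assumes that all parameters sit on the same device and that the module has at least one parameter; module_device is a hypothetical helper, and nn.Linear stands in for the series' Network class.

```python
import torch
import torch.nn as nn

def module_device(module):
    """Device of the module's first parameter (hypothetical helper)."""
    return next(module.parameters()).device

net = nn.Linear(4, 2)             # stand-in for Network (an assumption)
print(hasattr(net, 'device'))     # False: the module itself has no device
print(module_device(net))         # device of the first parameter

# A sample can then be sent to wherever the module's parameters live.
sample = torch.ones(1, 4).to(module_device(net))
print(net(sample).shape)
```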

Let's see what happens when we ask a network to be moved to() the GPU:

network.to('cuda')

Network(
    (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=192, out_features=120, bias=True)
    (fc2): Linear(in_features=120, out_features=60, bias=True)
    (out): Linear(in_features=60, out_features=10, bias=True)
)

Note here that a reassignment was not required. This is because the operation is in-place as far as the network instance is concerned. However, this operation can be used as a reassignment operation. This is preferred for consistency between nn.Module instances and PyTorch tensors.

Here, we can see that all the network parameters now have a device of cuda:

for n, p in network.named_parameters():
    print(p.device, '', n)

cuda:0  conv1.weight
cuda:0  conv1.bias
cuda:0  conv2.weight
cuda:0  conv2.bias
cuda:0  fc1.weight
cuda:0  fc1.bias
cuda:0  fc2.weight
cuda:0  fc2.bias
cuda:0  out.weight
cuda:0  out.bias
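The in-place behavior of to() on a module can be checked directly. In this sketch, nn.Linear again stands in for the series' Network class, and the device falls back to the CPU when CUDA isn't available so that the code runs anywhere:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

net = nn.Linear(4, 2)          # stand-in for Network (an assumption)
returned = net.to(device)      # moves the parameters in place and returns the module

print(returned is net)         # True: same module object, so reassignment is optional
print(next(net.parameters()).device)
```

Writing net = net.to(device) is therefore harmless, and it keeps the module code looking the same as the tensor code.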

Passing a Sample to the Network

Let's round off this demonstration by passing a sample to the network.

sample = torch.ones(1,1,28,28)
sample.shape

torch.Size([1, 1, 28, 28])

This gives us a sample tensor we can pass like so:

try:
    network(sample)
except Exception as e:
    print(e)

Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _thnn_conv2d_forward

Since our network is on the GPU and this newly created sample is on the CPU by default, we are getting an error. The error is telling us that the CPU tensor was expected to be a GPU tensor when calling the forward method of the first convolutional layer. This is precisely what we saw before when adding two tensors directly.

We can fix this issue by sending our sample to the GPU like so:

try:
    pred = network(sample.to('cuda'))
    print(pred)
except Exception as e:
    print(e)

tensor([[-0.0685,  0.0201,  0.1223,  0.1075,  0.0810,  0.0686, -0.0336, -0.1088, -0.0995,  0.0639]]
, device='cuda:0'
, grad_fn=<AddmmBackward>)

Finally, everything works as expected, and we get a prediction.

Writing Device Agnostic PyTorch Code

Before we wrap up, we need to talk about writing device agnostic code. The term device agnostic means that our code doesn't depend on the underlying device. You may come across this terminology when reading PyTorch documentation.

For example, suppose we write code that uses the cuda() method everywhere, and then, we give the code to a user who doesn't have a GPU. This won't work. Don't worry. We've got options!

Remember earlier when we saw the cuda() and cpu() methods?

Well, one of the reasons that the to() method is preferred is that it is parameterized, which makes it easier to alter the device we are choosing, i.e. it's flexible!

For example, a user could pass in cpu or cuda as an argument to a deep learning program, and this would allow the program to be device agnostic.

Allowing the user of a program to pass an argument that determines the program's behavior is perhaps the best way to make a program device agnostic. However, we can also use PyTorch to check for a supported GPU, and set our devices that way.

torch.cuda.is_available()

Like, if cuda is available, then use it!
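Both options can be combined in a few lines. The following is a minimal sketch rather than the series' own code: pick_device and the --device flag are hypothetical names, and the empty argument list passed to parse_args() is only there so the example ignores the real command line.

```python
import argparse
import torch

def pick_device(requested=None):
    """Return the requested device, or cuda if available, else cpu (hypothetical helper)."""
    if requested is not None:
        return torch.device(requested)
    return torch.device('cuda' if torch.cuda.is_available() else 'cpu')

parser = argparse.ArgumentParser()
parser.add_argument('--device', default=None, choices=['cpu', 'cuda'])
args = parser.parse_args([])   # empty list: ignore the real command line in this sketch

device = pick_device(args.device)
t = torch.ones(2, 2).to(device)
print(t.device)
```

Because every later call uses to(device), the rest of the program never needs to know which device was chosen.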

PyTorch GPU Training Performance Test

Let's see now how to add the use of a GPU to the training loop. We're going to be doing this addition with the code we've been developing so far in the series.

This will allow us to easily compare times, CPU vs GPU.

Refactoring the RunManager Class

Before we update the training loop, we need to update the RunManager class. Inside the begin_run() method, we need to modify the device of the images tensor that is passed to the add_graph method.

It should look like this:

def begin_run(self, run, network, loader):

    self.run_start_time = time.time()

    self.run_params = run
    self.run_count += 1

    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')

    images, labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)

    self.tb.add_image('images', grid)
    self.tb.add_graph(
        self.network
        ,images.to(getattr(run, 'device', 'cpu'))
    )

Here, we are using the getattr() built in function to get the value of the device on the run object. If the run object doesn't have a device, then cpu is returned. This makes the code backward compatible. It will still work if we don't specify a device for our run.
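The fallback behavior of getattr() is easy to see in isolation. This sketch assumes the run objects are namedtuples; the two Run definitions below are hypothetical examples, one without a device field and one with it:

```python
from collections import namedtuple

# Hypothetical run definitions: an older one without a device, a newer one with it.
OldRun = namedtuple('Run', ['lr', 'batch_size'])
NewRun = namedtuple('Run', ['lr', 'batch_size', 'device'])

old_run = OldRun(lr=0.01, batch_size=1000)
new_run = NewRun(lr=0.01, batch_size=1000, device='cuda')

# The third argument is returned when the attribute doesn't exist.
print(getattr(old_run, 'device', 'cpu'))   # cpu
print(getattr(new_run, 'device', 'cpu'))   # cuda
```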

Note that the network doesn't need to be moved to a device because its device was set before being passed in. However, the images tensor is obtained from the loader, so it still needs to be moved.

Refactoring the Training Loop

We'll set our configuration parameters to have a device. The two logical options here are cuda and cpu.

params = OrderedDict(
    lr = [.01]
    , batch_size = [1000, 10000, 20000]
    , num_workers = [0, 1]
    , device = ['cuda', 'cpu']
)

With these device values added to our configuration, they'll now be available to be accessed inside our training loop.

At the top of our run, we'll create a device that will be passed around inside the run and inside the training loop.

device = torch.device(run.device)

The first place we'll use this device is when initializing our network.

network = Network().to(device)

This will ensure that the network is moved to the appropriate device. Finally, we'll update our images and labels tensors by unpacking them separately and sending them to the device like so:

images = batch[0].to(device)
labels = batch[1].to(device)
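Put together, the device-related lines sit in the loop roughly as follows. This is only a sketch under stated assumptions: a small nn.Sequential model, an SGD optimizer, and random synthetic batches stand in for the series' Network, optimizer, and data loader, and the device falls back to the CPU when CUDA isn't available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in model, optimizer, and synthetic batches (assumptions, not the series' code).
network = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.SGD(network.parameters(), lr=0.01)
loader = [(torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))) for _ in range(3)]

for batch in loader:
    images = batch[0].to(device)   # data must be on the same device as the network
    labels = batch[1].to(device)

    preds = network(images)
    loss = F.cross_entropy(preds, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(preds.device, loss.item())
```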

That's all there is to it. We're ready to run this code and see the results.

run  epoch  loss  accuracy  epoch duration  run duration  lr    batch_size  num_workers  device
1    1      1.08  0.59      7.50            9.67          0.01  1000        0            cuda
2    1      1.04  0.60      20.83           21.88         0.01  1000        0            cpu
3    1      1.03  0.61      7.84            10.69         0.01  1000        1            cuda
4    1      1.02  0.61      16.49           19.21         0.01  1000        1            cpu
5    1      2.10  0.24      7.69            12.30         0.01  10000       0            cuda
6    1      2.09  0.24      19.89           28.85         0.01  10000       0            cpu
7    1      2.11  0.25      8.05            15.21         0.01  10000       1            cuda
8    1      2.17  0.20      17.09           28.68         0.01  10000       1            cpu
9    1      2.28  0.21      9.65            17.56         0.01  20000       0            cuda
10   1      2.29  0.10      19.63           36.19         0.01  20000       0            cpu
11   1      2.29  0.14      8.18            19.59         0.01  20000       1            cuda
12   1      2.29  0.12      17.68           38.08         0.01  20000       1            cpu

Here, we can see that the cuda device significantly outperformed the cpu, completing each epoch roughly 2x to 3x faster. Results may vary.
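The speedup can be worked out from the epoch durations in the table above. The sketch below simply pairs each cuda run with the cpu run that has the same batch_size and num_workers and divides the two durations:

```python
# (batch_size, num_workers): (cuda epoch duration, cpu epoch duration), copied from the table
epoch_durations = {
    (1000, 0): (7.50, 20.83),
    (1000, 1): (7.84, 16.49),
    (10000, 0): (7.69, 19.89),
    (10000, 1): (8.05, 17.09),
    (20000, 0): (9.65, 19.63),
    (20000, 1): (8.18, 17.68),
}

# Speedup = cpu time / cuda time for the same configuration.
speedups = {cfg: cpu / cuda for cfg, (cuda, cpu) in epoch_durations.items()}
for cfg, s in speedups.items():
    print(cfg, f'{s:.2f}x')

print(f'min {min(speedups.values()):.2f}x, max {max(speedups.values()):.2f}x')
# min 2.03x, max 2.78x
```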

