Neural Network Programming - Deep Learning with PyTorch
Deep Learning Course 3 of 4 - Level: Intermediate
PyTorch on the GPU - Training Neural Networks with CUDA
Run PyTorch Code on a GPU - Neural Network Programming Guide
Welcome to deeplizard. My name is Chris. In this episode, we're going to learn how to use the GPU with PyTorch. We'll see how to use the GPU in general, and we'll see how to apply these general techniques to training our neural network.

Without further ado, let's get started.
Using a GPU for Deep Learning
If you haven't seen the episode on why deep learning and neural networks use GPUs, be sure to review that episode alongside this one to get the best understanding of these concepts.
For now, we're going to hit the ground running with a PyTorch GPU example.
PyTorch GPU Example
PyTorch allows us to seamlessly move data to and from our GPU as we perform computations inside our programs.
When we go to the GPU, we can use the cuda() method, and when we go to the CPU, we can use the cpu() method.
We can also use the to() method. To go to the GPU, we write to('cuda'), and to go to the CPU, we write to('cpu'). The to() method is the preferred way mainly because it is more flexible. We'll see one example using the first two, and then we'll default to always using the to() variant.
CPU | GPU |
---|---|
cpu() | cuda() |
to('cpu') | to('cuda') |
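As a quick check of the table above, here is a minimal sketch showing that each specific method and its to() counterpart land on the same device. The CUDA half is guarded by torch.cuda.is_available() so that the snippet also runs on a machine without a GPU. The variable names are illustrative only.

```python
import torch

t = torch.ones(2, 2)

# cpu() and to('cpu') both leave us with a tensor on the CPU
cpu_match = t.cpu().device == t.to('cpu').device

# Only try the GPU variants when a CUDA device is actually present
if torch.cuda.is_available():
    gpu_match = t.cuda().device == t.to('cuda').device
else:
    gpu_match = None  # nothing to compare on a CPU-only machine

print(cpu_match, gpu_match)
```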
To make use of our GPU during the training process, there are two essential requirements: the data must be moved to the GPU, and the network must be moved to the GPU.
- Data on the GPU
- Network on the GPU
By default, when a PyTorch tensor or a PyTorch neural network module is created, the corresponding data is initialized on the CPU. Specifically, the data exists inside the CPU's memory.
Now, let's create a tensor and a network, and see how we make the move from CPU to GPU.
Here, we create a tensor and a network:
t = torch.ones(1,1,28,28)
network = Network()
Now, we call the cuda() method and reassign the tensor and network to returned values that have been copied onto the GPU:
t = t.cuda()
network = network.cuda()
Next, we can get a prediction from the network and see that the prediction tensor's device attribute confirms that the data is on cuda, which is the GPU:
> gpu_pred = network(t)
> gpu_pred.device
device(type='cuda', index=0)
Likewise, we can go in the opposite way:
> t = t.cpu()
> network = network.cpu()
> cpu_pred = network(t)
> cpu_pred.device
device(type='cpu')
This is, in a nutshell, how we can utilize the GPU capabilities of PyTorch. What we should turn to now are some important details that are lurking beneath the surface of the code we've just seen.
For example, although we've used the cuda() and cpu() methods, they actually aren't our best options. Furthermore, what's the difference between the methods on the network instance and those on the tensor instance? These are, after all, different object types, which means the two methods are different. Finally, we want to integrate this code into a working example and do a performance test.
General Idea of Using a GPU
The main takeaway at this point is that our network and our data must both exist on the GPU in order to perform computations using the GPU, and this applies to any programming language or framework.

As we'll see in our next demonstration, this is also true for the CPU. GPUs and CPUs are compute devices that compute on data, and so any two values that are used directly with one another in a computation must exist on the same device.
PyTorch Tensor Computations on a GPU
Let's dive deeper by demonstrating some tensor computations.
We'll start by creating two tensors:
t1 = torch.tensor([
    [1,2],
    [3,4]
])
t2 = torch.tensor([
    [5,6],
    [7,8]
])
Now, we'll check which device these tensors were initialized on by inspecting the device attribute:
> t1.device, t2.device
(device(type='cpu'), device(type='cpu'))
As we'd expect, we see that, indeed, both tensors are on the same device, which is the CPU. Let's move the first tensor t1 to the GPU.
> t1 = t1.to('cuda')
> t1.device
device(type='cuda', index=0)
We can see that this tensor's device has been changed to cuda, the GPU. Note the use of the to() method here. Instead of calling a particular method to move to a device, we call the same method and pass an argument that specifies the device. Using the to() method is the preferred way of moving data to and from devices.
Also, note the reassignment. The operation is not in-place, and so the reassignment is required.
Let's try an experiment. I'd like to test what we discussed earlier by attempting to perform a computation on these two tensors, t1 and t2, that we now know to be on different devices.
Since we expect an error, we'll wrap the call in a try and catch the exception:
try:
    t1 + t2
except Exception as e:
    print(e)

expected device cuda:0 but got device cpu
By reversing the order of the operation, we can see that the error also changes:
try:
    t2 + t1
except Exception as e:
    print(e)

expected device cpu but got device cuda:0
Both of these errors are telling us that the binary plus operation expects the second argument to have the same device as the first argument. Understanding the meaning of this error can help when debugging these types of device mismatches.
Finally, for completeness, let's move the second tensor to the cuda device to see the operation succeed.
> t2 = t2.to('cuda')
> t1 + t2
tensor([[ 6,  8],
        [10, 12]], device='cuda:0')
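One way to avoid this class of error is to move one operand to the other operand's device before combining them. The helper below is a hypothetical convenience function, not part of PyTorch, and the snippet falls back to the CPU when no GPU is available so it can be run anywhere.

```python
import torch

def add_on_same_device(a, b):
    """Move b to a's device, then add (hypothetical helper)."""
    return a + b.to(a.device)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

t1 = torch.tensor([[1, 2], [3, 4]]).to(device)
t2 = torch.tensor([[5, 6], [7, 8]])  # created on the CPU by default

result = add_on_same_device(t1, t2)
print(result.device, result.tolist())
```

The result lives on the device of the first argument, which mirrors the way the error messages above name the first argument's device as the expected one.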
PyTorch nn.Module Computations on a GPU
We've just seen how tensors can be moved to and from devices. Now, let's see how this is done with PyTorch nn.Module instances.
More generally, we are interested in understanding how a network comes to be on a device like a GPU or CPU, and what that means. PyTorch aside, this is the essential issue.
We put a network on a device by moving the network's parameters to that device. Let's create a network and take a look at what we mean.
network = Network()
Now, let's look at the network's parameters:
for name, param in network.named_parameters():
    print(name, '\t\t', param.shape)

conv1.weight    torch.Size([6, 1, 5, 5])
conv1.bias      torch.Size([6])
conv2.weight    torch.Size([12, 6, 5, 5])
conv2.bias      torch.Size([12])
fc1.weight      torch.Size([120, 192])
fc1.bias        torch.Size([120])
fc2.weight      torch.Size([60, 120])
fc2.bias        torch.Size([60])
out.weight      torch.Size([10, 60])
out.bias        torch.Size([10])
Here, we've created a PyTorch network, and we've iterated through the network's parameters. As we can see, the network's parameters are the weights and biases inside the network.
In other words, these are simply tensors that live on a device like we have already seen. Let's verify this by checking the device of each of the parameters.
for n, p in network.named_parameters():
    print(p.device, '', n)

cpu  conv1.weight
cpu  conv1.bias
cpu  conv2.weight
cpu  conv2.bias
cpu  fc1.weight
cpu  fc1.bias
cpu  fc2.weight
cpu  fc2.bias
cpu  out.weight
cpu  out.bias
This shows us that all the parameters inside the network are, by default, initialized on the CPU.
An important consequence of this is that it explains why nn.Module instances like networks don't actually have a device. It's not the network that lives on a device, but the tensors inside the network that live on a device.
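Since a module has no device attribute of its own, a common workaround is to look at the device of one of its parameters. The function name module_device is an assumption made for this sketch, and it presumes that all parameters live on the same device, which is the usual case but not guaranteed.

```python
import torch
import torch.nn as nn

def module_device(module):
    """Return the device of the module's first parameter (hypothetical helper)."""
    return next(module.parameters()).device

layer = nn.Linear(in_features=4, out_features=2)

print(hasattr(layer, 'device'))  # a plain nn.Module has no device attribute
print(module_device(layer))      # parameters start out on the CPU
```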
Let's see what happens when we ask a network to be moved to() the GPU:
> network.to('cuda')
Network(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=192, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=60, bias=True)
  (out): Linear(in_features=60, out_features=10, bias=True)
)
Note here that a reassignment was not required. This is because the operation is in-place as far as the network instance is concerned. However, this operation can be used as a reassignment operation. This is preferred for consistency between nn.Module instances and PyTorch tensors.
Here, we can see that now, all the network parameters have a device of cuda:
for n, p in network.named_parameters():
    print(p.device, '', n)

cuda:0  conv1.weight
cuda:0  conv1.bias
cuda:0  conv2.weight
cuda:0  conv2.bias
cuda:0  fc1.weight
cuda:0  fc1.bias
cuda:0  fc2.weight
cuda:0  fc2.bias
cuda:0  out.weight
cuda:0  out.bias
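The in-place behavior described above can be checked directly: calling to() on an nn.Module returns the very same module object, which is why the reassignment is optional for networks. This sketch uses a small nn.Linear layer as a stand-in for the network and falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

layer = nn.Linear(4, 2)  # stand-in for a full network
returned = layer.to(device)

# The module is modified in place and returns itself
same_object = returned is layer
param_devices = {p.device.type for p in layer.parameters()}
print(same_object, param_devices)
```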
Passing a Sample to the Network
Let's round off this demonstration by passing a sample to the network.
> sample = torch.ones(1,1,28,28)
> sample.shape
torch.Size([1, 1, 28, 28])
This gives us a sample tensor we can pass like so:
try:
    network(sample)
except Exception as e:
    print(e)

Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _thnn_conv2d_forward
Since our network is on the GPU and this newly created sample is on the CPU by default, we are getting an error. The error is telling us that the CPU tensor was expected to be a GPU tensor when calling the forward method of the first convolutional layer. This is precisely what we saw before when adding two tensors directly.
We can fix this issue by sending our sample to the GPU like so:
try:
    pred = network(sample.to('cuda'))
    print(pred)
except Exception as e:
    print(e)

tensor([[-0.0685,  0.0201,  0.1223,  0.1075,  0.0810,  0.0686, -0.0336, -0.1088, -0.0995,  0.0639]],
       device='cuda:0', grad_fn=<AddmmBackward>)
Finally, everything works as expected, and we get a prediction.
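For readers who want to run this section on its own, here is a sketch of a Network class whose layer shapes match the parameter listing printed earlier. The forward pass (ReLU activations and 2x2 max pooling) is an assumption, since it isn't shown in this episode. It is chosen so that a 28x28 input flattens to the 192 features that fc1 expects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 12, kernel_size=5)
        self.fc1 = nn.Linear(12 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 60)
        self.out = nn.Linear(60, 10)

    def forward(self, t):
        # Assumed forward pass: conv -> relu -> 2x2 max pool, twice
        t = F.max_pool2d(F.relu(self.conv1(t)), kernel_size=2, stride=2)
        t = F.max_pool2d(F.relu(self.conv2(t)), kernel_size=2, stride=2)
        t = t.flatten(start_dim=1)  # (batch, 12 * 4 * 4) = (batch, 192)
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        return self.out(t)

# Fall back to the CPU when no GPU is present
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
network = Network().to(device)
sample = torch.ones(1, 1, 28, 28)

# The sample must be on the same device as the network's parameters
pred = network(sample.to(device))
print(pred.shape, pred.device)
```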
Writing Device Agnostic PyTorch Code
Before we wrap up, we need to talk about writing device agnostic code. The term device agnostic means that our code doesn't depend on the underlying device. You may come across this terminology when reading PyTorch documentation.
For example, suppose we write code that uses the cuda() method everywhere, and then we give the code to a user who doesn't have a GPU. This won't work. Don't worry. We've got options!
Remember earlier when we saw the cuda() and cpu() methods?
Well, one of the reasons that the to() method is preferred is that it is parameterized, which makes it easier to alter the device we are choosing, i.e. it's flexible!
For example, a user could pass in cpu or cuda as an argument to a deep learning program, and this would allow the program to be device agnostic.
Allowing the user of a program to pass an argument that determines the program's behavior is perhaps the best way to make a program be device agnostic. However, we can also use PyTorch to check for a supported GPU, and set our devices that way.
> torch.cuda.is_available()
True
Like, if cuda is available, then use it!
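Combining both ideas, a program can accept an optional device from the user and fall back to whatever is available. The function name pick_device and its fallback policy are assumptions made for illustration.

```python
import torch

def pick_device(requested=None):
    """Return a torch.device, preferring the user's request when usable."""
    if requested is None:
        # No preference given: use the GPU if one is available
        requested = 'cuda' if torch.cuda.is_available() else 'cpu'
    elif requested.startswith('cuda') and not torch.cuda.is_available():
        # Asked for a GPU that isn't there: fall back to the CPU
        requested = 'cpu'
    return torch.device(requested)

device = pick_device()
t = torch.ones(2, 2).to(device)
print(device, t.device)
```

Silently falling back to the CPU is only one possible policy. Raising an error when the requested device is missing may be the better choice when a slow CPU run would be an unwelcome surprise.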
PyTorch GPU Training Performance Test
Let's see now how to add the use of a GPU to the training loop. We're going to be doing this addition with the code we've been developing so far in the series.
This will allow us to easily compare times, CPU vs GPU.
Refactoring the RunManager Class
Before we update the training loop, we need to update the RunManager class. Inside the begin_run() method, we need to modify the device of the images tensor that is passed to the add_graph method.
It should look like this:
def begin_run(self, run, network, loader):

    self.run_start_time = time.time()

    self.run_params = run
    self.run_count += 1

    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')

    images, labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)

    self.tb.add_image('images', grid)
    self.tb.add_graph(
        self.network,
        images.to(getattr(run, 'device', 'cpu'))
    )
Here, we are using the getattr() built-in function to get the value of the device on the run object. If the run object doesn't have a device, then cpu is returned. This makes the code backward compatible. It will still work if we don't specify a device for our run.
Note that the network doesn't need to be moved to a device because its device was set before being passed in. However, the images tensor is obtained from the loader, so it must be moved explicitly.
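The fallback behavior of getattr() can be seen in isolation with two stand-in run objects. The namedtuple definitions here are hypothetical and exist only to show a run with and without a device field.

```python
from collections import namedtuple

# Hypothetical run definitions, one without and one with a device field
OldRun = namedtuple('Run', ['lr', 'batch_size'])
NewRun = namedtuple('Run', ['lr', 'batch_size', 'device'])

old_run = OldRun(lr=0.01, batch_size=1000)
new_run = NewRun(lr=0.01, batch_size=1000, device='cuda')

# The third argument is returned only when the attribute is missing
old_device = getattr(old_run, 'device', 'cpu')
new_device = getattr(new_run, 'device', 'cpu')
print(old_device, new_device)
```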
Refactoring the Training Loop
We'll set our configuration parameters to have a device. The two logical options here are cuda and cpu.
params = OrderedDict(
    lr = [.01],
    batch_size = [1000, 10000, 20000],
    num_workers = [0, 1],
    device = ['cuda', 'cpu']
)
With these device values added to our configuration, they'll now be available to be accessed inside our training loop.
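Each combination of these values defines one run. The sketch below expands the configuration with itertools.product. It stands in for the run-building helper used earlier in the series, whose exact code is not shown in this episode, so treat the function name build_runs as an assumption.

```python
from collections import OrderedDict, namedtuple
from itertools import product

def build_runs(params):
    """Expand a dict of value lists into one namedtuple per combination."""
    Run = namedtuple('Run', params.keys())
    return [Run(*values) for values in product(*params.values())]

params = OrderedDict(
    lr=[.01],
    batch_size=[1000, 10000, 20000],
    num_workers=[0, 1],
    device=['cuda', 'cpu'],
)

runs = build_runs(params)
print(len(runs))  # 1 * 3 * 2 * 2 combinations
print(runs[0])
```

Because the last key varies fastest, the runs alternate between cuda and cpu for otherwise identical settings, which makes side-by-side timing comparisons straightforward.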
At the top of our run, we'll create a device that will be passed around inside the run and inside the training loop.
device = torch.device(run.device)
The first place we'll use this device is when initializing our network.
network = Network().to(device)
This will ensure that the network is moved to the appropriate device. Finally, we'll update our images and labels tensors by unpacking them separately and sending them to the device like so:
images = batch[0].to(device)
labels = batch[1].to(device)
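Putting these pieces together, a single training step could look like the following sketch. The tiny linear model, the random batch, and the SGD optimizer are stand-ins chosen to keep the example self-contained. They are not the network or data used in the series.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in model and batch (assumptions for illustration only)
network = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.SGD(network.parameters(), lr=0.01)
batch = (torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,)))

# Move both halves of the batch to the same device as the network
images = batch[0].to(device)
labels = batch[1].to(device)

preds = network(images)
loss = F.cross_entropy(preds, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```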
That's all there is to it. We're ready to run this code and see the results.
run | epoch | loss | accuracy | epoch duration | run duration | lr | batch_size | num_workers | device |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1.08 | 0.59 | 7.50 | 9.67 | 0.01 | 1000 | 0 | cuda |
2 | 1 | 1.04 | 0.60 | 20.83 | 21.88 | 0.01 | 1000 | 0 | cpu |
3 | 1 | 1.03 | 0.61 | 7.84 | 10.69 | 0.01 | 1000 | 1 | cuda |
4 | 1 | 1.02 | 0.61 | 16.49 | 19.21 | 0.01 | 1000 | 1 | cpu |
5 | 1 | 2.10 | 0.24 | 7.69 | 12.30 | 0.01 | 10000 | 0 | cuda |
6 | 1 | 2.09 | 0.24 | 19.89 | 28.85 | 0.01 | 10000 | 0 | cpu |
7 | 1 | 2.11 | 0.25 | 8.05 | 15.21 | 0.01 | 10000 | 1 | cuda |
8 | 1 | 2.17 | 0.20 | 17.09 | 28.68 | 0.01 | 10000 | 1 | cpu |
9 | 1 | 2.28 | 0.21 | 9.65 | 17.56 | 0.01 | 20000 | 0 | cuda |
10 | 1 | 2.29 | 0.10 | 19.63 | 36.19 | 0.01 | 20000 | 0 | cpu |
11 | 1 | 2.29 | 0.14 | 8.18 | 19.59 | 0.01 | 20000 | 1 | cuda |
12 | 1 | 2.29 | 0.12 | 17.68 | 38.08 | 0.01 | 20000 | 1 | cpu |
Here, we can see that the cuda device significantly outperformed the cpu. In terms of epoch duration, cuda was roughly 2x to 3x faster in every configuration. Results may vary.
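To make the comparison concrete, the speedup for each configuration can be computed from the epoch durations in the table above by dividing the cpu time by the cuda time.

```python
# (batch_size, num_workers): (cuda epoch duration, cpu epoch duration)
epoch_durations = {
    (1000, 0): (7.50, 20.83),
    (1000, 1): (7.84, 16.49),
    (10000, 0): (7.69, 19.89),
    (10000, 1): (8.05, 17.09),
    (20000, 0): (9.65, 19.63),
    (20000, 1): (8.18, 17.68),
}

# Speedup = cpu time / cuda time for the same configuration
speedups = {
    config: cpu / cuda
    for config, (cuda, cpu) in epoch_durations.items()
}

for config, speedup in speedups.items():
    print(config, f'{speedup:.2f}x')
```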
updates
935f3b3: The to() method on a tensor is not an in-place operation, so a reassignment is required. Calling the to() method on an nn.Module is an in-place operation as far as the network is concerned, so a reassignment is not required. However, we still prefer to do a reassignment for consistency.
Committed May 19, 2020