Neural Network Programming - Deep Learning with PyTorch

with deeplizard.

PyTorch DataLoader num_workers - Deep Learning Speed Limit Increase

September 28, 2019 by deeplizard


PyTorch DataLoader num_workers Test - Speed Things Up

Welcome to this neural network programming series. In this episode, we will see how we can speed up the neural network training process by utilizing the multiple process capabilities of the PyTorch DataLoader class.


Without further ado, let's get started.

Speeding Up the Training Process

To speed up the training process, we will make use of the num_workers optional attribute of the DataLoader class.

The num_workers attribute tells the data loader instance how many sub-processes to use for data loading. By default, the num_workers value is set to zero, and a value of zero tells the loader to load the data inside the main process.
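
For example, here is a minimal sketch of creating a DataLoader with a worker process (the Fashion-MNIST dataset and the batch size are just illustrative choices):

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Example dataset: Fashion-MNIST with a basic tensor transform.
train_set = torchvision.datasets.FashionMNIST(
    root='./data',
    train=True,
    download=True,
    transform=transforms.Compose([transforms.ToTensor()])
)

# num_workers=0 (the default) loads every batch in the main process.
# num_workers=1 spawns one worker process that loads batches in the background.
train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=100,
    shuffle=True,
    num_workers=1
)
```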

This means that data loading happens sequentially inside the main process: after a batch is consumed during training and another one is needed, the main process reads the next batch's data from disk.

Now, if we have a worker process, we can make use of the fact that our machine has multiple cores. This means that the next batch can already be loaded and ready to go by the time the main process is ready for another batch. This is where the speed up comes from. The batches are loaded using additional worker processes and are queued up in memory.

Optimal Value for the num_workers Attribute

The natural question that arises is, how many worker processes should we add? There are a lot of factors that can affect the optimal number here, so the best way to find out is to test.

Testing Values for the num_workers Attribute

To set up this test, we'll create a list of num_workers values to try. We'll try the following values:

  • 0 (default)
  • 1
  • 2
  • 4
  • 8
  • 16

For each of these values, we'll vary the batch size by trying the following values:

  • 100
  • 1000
  • 10000

For the learning rate, we'll keep it constant at 0.01 for all of the runs.

The last thing to mention about the setup is that we're only running a single epoch for each of the runs.
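
Here is a rough sketch of the test loop, assuming a Network class and the Fashion-MNIST train_set from earlier episodes in this series; the epoch duration is what gets timed for each run:

```python
import time

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

num_workers_values = [0, 1, 2, 4, 8, 16]
batch_sizes = [100, 1000, 10000]
lr = 0.01

# Note: with num_workers > 0, wrap this in `if __name__ == '__main__':` on
# platforms that use the spawn start method (e.g. Windows).
for num_workers in num_workers_values:
    for batch_size in batch_sizes:
        loader = DataLoader(train_set, batch_size=batch_size,
                            shuffle=True, num_workers=num_workers)
        network = Network()  # fresh model for each run (assumed from prior episodes)
        optimizer = torch.optim.Adam(network.parameters(), lr=lr)

        start = time.time()
        for images, labels in loader:  # a single epoch
            preds = network(images)
            loss = F.cross_entropy(preds, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        epoch_duration = time.time() - start

        print(f'num_workers={num_workers:>2}  batch_size={batch_size:>5}  '
              f'epoch_duration={epoch_duration:.2f}s')
```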

Alright, let's see what we get.

Different num_workers Values: Results

Alright, the results are shown below. We completed a total of eighteen runs: three groups of differing batch sizes, and inside each group, we varied the number of worker processes.

run  epoch  loss      accuracy  epoch duration (s)  run duration (s)  lr    batch_size  num_workers
1    1      0.566253  0.782583  23.281029           23.374832         0.01  100         0
2    1      0.573350  0.783917  18.125359           18.965940         0.01  100         1
3    1      0.574852  0.782133  18.161020           19.037995         0.01  100         2
4    1      0.593246  0.775067  18.637056           19.669869         0.01  100         4
5    1      0.587598  0.777500  18.631994           20.123626         0.01  100         8
6    1      0.596401  0.775983  20.110439           22.930428         0.01  100         16
7    1      1.105825  0.577500  21.254815           21.941008         0.01  1000        0
8    1      1.013017  0.612267  15.961835           17.457127         0.01  1000        1
9    1      0.881558  0.666200  16.060656           17.614599         0.01  1000        2
10   1      1.034153  0.606767  16.206196           17.883490         0.01  1000        4
11   1      0.963817  0.626400  16.700765           18.882340         0.01  1000        8
12   1      1.046822  0.601683  17.912993           21.747298         0.01  1000        16
13   1      2.173913  0.265983  22.219368           27.145123         0.01  10000       0
14   1      2.156031  0.191167  16.563987           23.368729         0.01  10000       1
15   1      2.182048  0.210250  16.128202           23.030015         0.01  10000       2
16   1      2.245768  0.200683  16.248334           22.108252         0.01  10000       4
17   1      2.177970  0.206483  16.921782           23.897321         0.01  10000       8
18   1      2.153342  0.208017  18.555999           26.654219         0.01  10000       16

The main takeaway from these results is that, across all three batch sizes, having a single worker process in addition to the main process resulted in a speed-up of about twenty percent.

20% Faster!

Additionally, adding more worker processes beyond the first one didn't show any further improvement.

Interpreting the Results

The twenty percent speed-up that we see after adding a single worker process makes sense because the main process has less work to do.

While the main process is busy performing the forward and backward passes, the worker process is loading the next batch. By the time the main process is ready for another batch, the worker process already has it queued up in memory.

As a result, the main process doesn't have to read the data from disk. Instead, the data is already in memory, and this gives us the twenty percent speed up.
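
The idea can be illustrated with a small producer/consumer sketch (this is not the actual DataLoader implementation, just the pattern it relies on): a background process keeps a short queue of batches filled while the main process trains.

```python
import multiprocessing as mp
import time

import torch

def load_batches(queue, num_batches, batch_size):
    # Producer: stand-in for reading a batch from disk and applying transforms.
    for _ in range(num_batches):
        batch = torch.randn(batch_size, 1, 28, 28)
        queue.put(batch)
    queue.put(None)  # sentinel: no more batches

if __name__ == '__main__':
    queue = mp.Queue(maxsize=2)  # a couple of batches prefetched ahead of time
    worker = mp.Process(target=load_batches, args=(queue, 10, 100))
    worker.start()

    while True:
        batch = queue.get()      # the next batch is usually already waiting in memory
        if batch is None:
            break
        time.sleep(0.05)         # stand-in for the forward and backward passes

    worker.join()
```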

Now, why are we not seeing additional speed ups after adding more workers?

Make It Go Faster with More Workers?

Well, if one worker is enough to keep the queue full of data for the main process, then adding more batches of data to the queue isn't going to do anything. This is what I think we are seeing here.

Just because we are adding more batches to the queue doesn't mean the batches are being processed faster. Thus, we are bounded by the time it takes to forward and backward propagate a given batch.

We can even see that things start bogging down as we get to 16 workers, likely due to the overhead of managing the additional processes.

Hope this helps speed you up!

Description

From the PyTorch documentation: the DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning. num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)