How to increase GPU training throughput

This is a guide to increasing GPU training throughput across a batch of training runs, as opposed to increasing the instantaneous throughput of a single run. The latter can of course also raise overall throughput, but it is outside the scope of this guide.

Use case: suppose we are sweeping hyper-parameters and there are N training runs. We want to finish these N runs as fast as possible.

Technique - Improve disk IO #

If the training data is a large set of files, e.g. the ~150GB ImageNet dataset, read it from a local drive instead of a network drive. The same applies if we write a lot of data to disk.
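
As a concrete example, a common pattern is to copy the dataset from the network drive to local scratch space once at the start of the job, then point the data loader at the local copy. Here is a minimal sketch in Python; the paths are hypothetical and should be adjusted to your own mounts:

```python
import shutil
from pathlib import Path

# Hypothetical paths: a network mount and a local scratch disk.
NETWORK_DATA = Path("/mnt/network_share/imagenet")
LOCAL_DATA = Path("/local_scratch/imagenet")


def stage_data_locally() -> Path:
    """Copy the dataset to the local drive once, then reuse it for every run."""
    if not LOCAL_DATA.exists():
        shutil.copytree(NETWORK_DATA, LOCAL_DATA)
    return LOCAL_DATA


if __name__ == "__main__":
    data_dir = stage_data_locally()
    print(f"point the training script's data directory at {data_dir}")
```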

Technique - Parallelize training #

If the machine is not resource-bound, then running another training run in parallel should increase throughput. This is especially true when GPU utilization is not close to 100% and GPU memory can fit another copy of our model. Other resources to be mindful of are CPU and memory, disk IO, and network IO. If any one of these resources is close to 100% utilization, parallelizing training is probably not going to increase throughput. See the sections below for how to check each resource's utilization.

There are simple and advanced ways to parallelize training. Here are two:

  1. Parameterize our Python script (see argparse). Run the script multiple times, each with different parameters; a minimal sketch of this approach follows this list.

  2. Use a more generic hyper-parameter search library such as Ray Tune.
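
For option 1, here is a minimal sketch of what the parameterized script and a small launcher could look like. The file names (train.py, launch.py), the flags, and the GPU count are illustrative assumptions, not a fixed recipe:

```python
# train.py: a training script parameterized with argparse (hypothetical flags).
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="One training run")
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--batch-size", type=int, default=128)
    args = parser.parse_args()
    # ... build the model and train it with args.lr and args.batch_size ...
    print(f"training with lr={args.lr}, batch_size={args.batch_size}")


if __name__ == "__main__":
    main()
```

```python
# launch.py: run several copies of train.py in parallel, one per hyper-parameter setting.
import os
import subprocess

learning_rates = [1e-2, 1e-3, 1e-4]
num_gpus = 2  # hypothetical; set to the number of GPUs on the machine

procs = []
for i, lr in enumerate(learning_rates):
    env = dict(os.environ)
    # Pin each run to a GPU so the runs spread across the available devices.
    env["CUDA_VISIBLE_DEVICES"] = str(i % num_gpus)
    procs.append(subprocess.Popen(["python", "train.py", "--lr", str(lr)], env=env))

# Wait for every run to finish.
for p in procs:
    p.wait()
```

If we use Ray Tune instead (option 2), the analogous knob is the per-trial GPU resource request; fractional values let multiple trials share one GPU. The exact API depends on the Ray Tune version, so check its documentation.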

How to check CPU and memory utilization? #

Read CPU and memory utilization by running htop at the terminal.

![htop]({{ "/img/2021-10-31-how-to-increase-gpu-training-throughput/htop.png" | relative_url }})

To install htop #

CentOS:

sudo yum install htop

Ubuntu:

sudo apt-get install htop

How to check GPU and GPU memory utilization? #

Read utilization by running nvidia-smi at the terminal.

![nvidia-smi]({{ "/img/2021-10-31-how-to-increase-gpu-training-throughput/nvidia_smi.png" | relative_url }})
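
nvidia-smi prints a single snapshot. To keep the readings refreshing, one option is to run it under watch:

watch -n 1 nvidia-smi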

How to check disk IO utilization? #

One option is to run iostat (part of the sysstat package) at the terminal, e.g.

iostat -x 1

which reports per-device utilization and throughput once per second.

How to check network IO utilization? #

One option is to run sar (also part of the sysstat package) at the terminal, e.g.

sar -n DEV 1

which reports per-interface receive and transmit rates once per second. Tools such as nload or iftop give a similar live view.

Summary #

By reading data from a local drive and parallelizing training runs, we can improve GPU training throughput.