Horovod – Machine Learning

How to configure Horovod in cluster Kabré

High performance computing has demonstrated the strength of Deep Learning through tools that focus on improving code performance. However, users often spend most of their time figuring out how to adapt their code to a new work environment. Kabré provides this manual to save that time for users who are getting started with Horovod.

There are two ways to run Horovod across nodes: with GPUs or with CPUs only. Before starting this tutorial, it is recommended to read Kabré’s User Guide; check the sections Tule Nodes and Zárate Nodes for information about the available GPUs and CPUs.

GPUs

After setting the job name and queue, continue with the PBS file. You must configure how many nodes, processes per node, and GPUs will be needed.

#PBS -l nodes=1:ppn=1
#PBS -l nodes=tule-01.cnca:ppn=4
#PBS -l nodes=tule-01:ppn=4+tule-01:ppn=4

The first line shows a generic way to request one node with one GPU and one process per node. You can also specify the name of the node, as shown on the second line. Requesting more processes, as in the example on the third line, improves the performance of a parallel neural network.

Required libraries

We recommend loading the following modules when using GPUs.

module load cuda/10.1.105
module load cudnn-9/7.0.4
module load intelpython/3.5
module load hdf5/1.10.0-patch1
module load openmpi/4.0.1
module load gcc/7.2.0
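Putting the resource request and the modules together, a GPU job script could look like the sketch below. The job name, queue name, and script name test.py are placeholders for illustration; replace them with your own values and with the command line described in the next section.

```shell
#!/bin/bash
# Hypothetical GPU job script for Kabré (job name and queue are placeholders)
#PBS -N horovod-gpu-test
#PBS -q gpu
#PBS -l nodes=tule-01.cnca:ppn=4

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Load the recommended modules for GPUs
module load cuda/10.1.105
module load cudnn-9/7.0.4
module load intelpython/3.5
module load hdf5/1.10.0-patch1
module load openmpi/4.0.1
module load gcc/7.2.0

# Launch one MPI process per requested GPU
mpirun -np 4 \
  -bind-to none -map-by slot -x HOROVOD_MPI_THREADS_DISABLE=1 \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -mca pml ob1 -mca btl ^openib python3.5 test.py
```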

Command line to execute the code

The Horovod command line is one of the most important steps, because one small mistake here can cost you hours of searching and debugging.

For a normal execution of your neural network with Horovod, we recommend the following command line:

mpirun -np 2 \
-bind-to none -map-by slot -x HOROVOD_MPI_THREADS_DISABLE=1 \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib python3.5 test.py

In this context, the flag -np sets the number of processes (one per GPU) that will be used. For a detailed explanation of all the flags and parameters, see the mpirun documentation.

Note:
Do not use the flag “-H“ to specify the nodes or the particular CPUs and GPUs that will be used. The PBS file already does this job by assigning specific nodes to the program; when Horovod runs, it takes that information through MPI. Passing -H as well can cause a conflict.
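To see why -H is unnecessary: PBS writes the allocated hosts, one line per requested process, to the file named in $PBS_NODEFILE, and an Open MPI built with PBS support reads that file automatically. A small sketch, simulating the node file outside the cluster:

```shell
# Simulate the node file PBS would create for nodes=tule-01.cnca:ppn=2
PBS_NODEFILE=$(mktemp)
printf 'tule-01.cnca\ntule-01.cnca\n' > "$PBS_NODEFILE"

# mpirun reads this file by itself; here we just count the slots it would see
NP=$(grep -c . "$PBS_NODEFILE")
echo "slots allocated: $NP"
rm -f "$PBS_NODEFILE"
```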

Horovod provides some environment variables (exported through mpirun’s -x flag) to observe the behavior of your code:

-x HOROVOD_TIMELINE=/yourhome/where/you/want/file_timeline.json
-x HOROVOD_TIMELINE_MARK_CYCLES=1

The first variable writes a timeline profile of the operations Horovod executes to the given JSON file. The second one adds a marker for each Horovod cycle to that timeline.
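For example, to enable the timeline on the recommended command line, add the two variables with -x (the output path below is a placeholder; point it at a directory in your home):

```shell
mpirun -np 2 \
  -bind-to none -map-by slot -x HOROVOD_MPI_THREADS_DISABLE=1 \
  -x HOROVOD_TIMELINE=$HOME/horovod_timeline.json \
  -x HOROVOD_TIMELINE_MARK_CYCLES=1 \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -mca pml ob1 -mca btl ^openib python3.5 test.py
```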

CPUs

After setting the job name and queue, continue with the PBS file. You must configure how many nodes and processes per node will be needed.

#PBS -l nodes=1:ppn=1
#PBS -l nodes=zarate-1a.cnca
#PBS -l nodes=zarate-1a.cnca+zarate-1c.cnca

The first line shows a generic way to choose one node with one process. You can also specify the node name, as shown on the second line. Adding more than one node, as in the example on the third line, improves the performance of a parallel neural network.

The maximum value of ppn is 64. If the user does not specify it, the default is 64.
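As a quick sanity check (a sketch, not Kabré-specific), the total number of processes you can pass to mpirun -np is nodes × ppn:

```shell
# Two Zárate nodes at the ppn ceiling of 64 give 128 MPI slots
NODES=2
PPN=64
TOTAL=$((NODES * PPN))
echo "max -np value: $TOTAL"
```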

Required libraries

The following are the basic modules for optimal performance with CPUs.

module load intelpython/3.5
module load hdf5/1.10.0-patch1
module load openmpi/4.0.1

Command line to execute the code

mpirun -np 2 \
-bind-to none -map-by slot -x HOROVOD_MPI_THREADS_DISABLE=1 \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib python3.5 test.py

In this context, the flag -np sets the number of CPU processes that will be used. For more information on flags and parameters, see the mpirun documentation.

Neural networks on CPUs use almost the same flags as on GPUs. Check the HOROVOD_TIMELINE variable explained in the GPUs section.