Run distributed training jobs
What is a distributed experiment?
A distributed experiment is a single machine learning run executed across multiple nodes or multiple GPUs. The results of a distributed experiment consist of logs, metrics, and artifacts for each worker, which you can find under the corresponding tabs.
Environment variables
VESSL automatically sets the following environment variables based on the experiment configuration.
NUM_NODES
: Number of workers
NUM_TRAINERS
: Number of GPUs per node
RANK
: The global rank of the node
MASTER_ADDR
: The address of the master node service
MASTER_PORT
: The port number on the master address
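As a reference, the sketch below shows one way a PyTorch training script might consume these variables to initialize a process group; the NCCL backend and the LOCAL_RANK variable are assumptions, not part of the VESSL configuration.

```python
import os
import torch.distributed as dist

# Environment variables injected by VESSL (see the list above).
num_nodes = int(os.environ["NUM_NODES"])         # number of workers
gpus_per_node = int(os.environ["NUM_TRAINERS"])  # GPUs per node
node_rank = int(os.environ["RANK"])              # global rank of this node

# Assumption: the per-process local rank is available as LOCAL_RANK
# (torch.distributed.launch provides it when started with --use_env).
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

global_rank = node_rank * gpus_per_node + local_rank
world_size = num_nodes * gpus_per_node

# Initialize the default process group against the master node.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://{}:{}".format(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]),
    rank=global_rank,
    world_size=world_size,
)
```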
Creating a distributed experiment
Using Web Console
Running a distributed experiment on the web console is similar to running a single-node experiment. To create a distributed experiment, you only need to specify the number of workers; the other options are the same as for a single-node experiment.
Using CLI
To run a distributed experiment using the CLI, set the number of nodes to an integer greater than one.
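For illustration only, an invocation might look like the sketch below; the option names --num-nodes and --command are assumptions here, so check vessl experiment create --help for the flags your CLI version actually accepts.

```bash
# Hypothetical sketch: the flag names below are assumptions; run
# `vessl experiment create --help` to confirm the exact options.
vessl experiment create \
    --num-nodes 2 \
    --command "python -m torch.distributed.launch ... train.py"
```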
Examples: Distributed CIFAR
You can find the full example code here.
Step 1: Prepare CIFAR-10 dataset
Download the CIFAR-10 dataset with a script like the one below and add a vessl type dataset to your organization.
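As a stand-in for that script, the sketch below downloads CIFAR-10 with torchvision; the ./data/cifar-10 output path is just an example and is not a VESSL-specific location.

```python
from torchvision.datasets import CIFAR10

# Download both splits of CIFAR-10 to a local directory.
# "./data/cifar-10" is an arbitrary example path.
CIFAR10(root="./data/cifar-10", train=True, download=True)
CIFAR10(root="./data/cifar-10", train=False, download=True)
```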
Or, you can simply add an AWS S3 type dataset to your organization with the following public bucket URI.
Step 2: Create a distributed experiment
To run a distributed experiment, we recommend using the torch.distributed.launch package. An example start command that runs on two nodes with one GPU per node is as follows.
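In the sketch below, the entry-point name train.py and the absence of script arguments are assumptions; the environment variables are the ones listed earlier on this page.

```bash
# Sketch only: train.py is a placeholder for your actual training script.
python -m torch.distributed.launch \
    --nnodes=$NUM_NODES \
    --nproc_per_node=$NUM_TRAINERS \
    --node_rank=$RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py
```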
VESSL automatically sets the environment variables that correspond to the --node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes options, so you can pass them directly to the launcher as shown above.
Files
In a distributed experiment, all workers share the same output storage. Be aware that files written by one worker can be overwritten by another when they use the same output path.
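One simple way to avoid collisions is to write each worker's files under a rank-specific subdirectory, as in the sketch below; the /output base path and the metrics.txt file name are assumptions for illustration.

```python
import os

# Give each worker its own subdirectory inside the shared output storage
# so workers do not overwrite each other's files. "/output" is an example path.
rank = os.environ.get("RANK", "0")
worker_dir = os.path.join("/output", "worker-{}".format(rank))
os.makedirs(worker_dir, exist_ok=True)

with open(os.path.join(worker_dir, "metrics.txt"), "w") as f:
    f.write("artifact written by worker {}\n".format(rank))
```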