A distributed experiment is a single machine learning run across multiple nodes or multiple GPUs. The results of a distributed experiment consist of logs, metrics, and artifacts for each worker, which you can find under the corresponding tabs.
VESSL automatically sets the following environment variables based on the configuration; a sketch of how a training script can consume them follows the list.
NUM_NODES
: Number of workers
NUM_TRAINERS
: Number of GPUs per node
RANK
: The global rank of the node
MASTER_ADDR
: The address of the master node service
MASTER_PORT
: The port number on the master address
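For example, a PyTorch training script can read these variables to initialize its process group. The sketch below is a minimal illustration under an assumed one-process-per-GPU setup with the NCCL backend; it is not VESSL-provided code, and how LOCAL_RANK is obtained is an assumption noted in the comments.

```python
import os

import torch
import torch.distributed as dist

# Values injected by VESSL for each worker.
num_nodes = int(os.environ["NUM_NODES"])         # number of worker nodes
gpus_per_node = int(os.environ["NUM_TRAINERS"])  # GPUs per node
node_rank = int(os.environ["RANK"])              # global rank of this node

# LOCAL_RANK is an assumption here: it is set by launchers such as
# torch.distributed.launch --use_env or torchrun, not by VESSL itself.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

global_rank = node_rank * gpus_per_node + local_rank
world_size = num_nodes * gpus_per_node

dist.init_process_group(
    backend="nccl",
    init_method="tcp://{}:{}".format(
        os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]
    ),
    rank=global_rank,
    world_size=world_size,
)
torch.cuda.set_device(local_rank)
```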
Running a distributed experiment on the web console is similar to running a single-node experiment: you only need to specify the number of workers, and all other options are the same.
To run a distributed experiment using the CLI, the number of nodes must be set to an integer greater than one (see the sketch below).
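A hypothetical invocation might look like the following. The subcommand and flag names are assumptions for illustration only; check `vessl experiment create --help` for the options available in your CLI version.

```bash
# Hypothetical example; the flag names below are assumptions, not verified
# CLI options. Setting the number of nodes to 2 (any integer greater than
# one) makes the run a distributed experiment.
vessl experiment create --num-nodes 2 --command "python main.py"
```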
You can find the full example code here.
Download the CIFAR dataset with the script below, and add a vessl type dataset to your organization.
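The original helper script is not reproduced here; as one hedged alternative, the dataset can be fetched with torchvision. The output directory ./cifar is an arbitrary choice.

```python
from torchvision.datasets import CIFAR10

# Download CIFAR-10 into ./cifar (directory name is an arbitrary choice);
# upload this directory as a vessl type dataset afterwards.
CIFAR10(root="./cifar", train=True, download=True)
CIFAR10(root="./cifar", train=False, download=True)
```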
Or, you can simply add an AWS S3 type dataset to your organization with the following public bucket URI.
To run a distributed experiment, we recommend using the torch.distributed.launch package. An example start command that runs on two nodes with one GPU per node is as follows.
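The sketch below passes the VESSL-provided environment variables to the launcher's flags; main.py is a placeholder for your own training script.

```bash
# Example start command: 2 nodes, 1 GPU per node.
# main.py is a placeholder for your training script.
python -m torch.distributed.launch \
  --nnodes=$NUM_NODES \
  --nproc_per_node=$NUM_TRAINERS \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  main.py
```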
VESSL automatically sets the environment variables that correspond to --node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes.
In a distributed experiment, all workers share the same output storage. Be aware that files can be overwritten by other workers when you use the same output path; one way to avoid collisions is sketched below.
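A common approach is to include the worker's rank in each output file name. This is a minimal sketch; the /output mount point is an assumption, and your experiment's actual output path may differ.

```python
import os

import torch

# Assumed output mount point; replace with your experiment's actual output path.
OUTPUT_DIR = "/output"


def save_checkpoint(model, step):
    # Namespace files by node rank so workers do not overwrite each other.
    rank = os.environ.get("RANK", "0")
    path = os.path.join(OUTPUT_DIR, f"checkpoint-rank{rank}-step{step}.pt")
    torch.save(model.state_dict(), path)
    return path
```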