Currently, only the PyTorch framework is supported for distributed experiments.
What is a distributed experiment?
A distributed experiment is a single machine learning run that spans multiple nodes or multiple GPUs. The results of a distributed experiment consist of logs, metrics, and artifacts for each worker, which you can find under the corresponding tabs.

Environment variables
VESSL automatically sets the environment variables below based on the experiment configuration. A sketch of how a training script might use them follows the list.

NUM_NODES: Number of workers
NUM_TRAINERS: Number of GPUs per node
RANK: The global rank of the node
MASTER_ADDR: The address of the master node service
MASTER_PORT: The port number on the master address
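For example, a training script can consume these variables directly when initializing the process group. The snippet below is an illustrative sketch only, not VESSL's or PyTorch's prescribed setup: it assumes one process per GPU spawned by the script itself, and the backend choice and training code are placeholders. If you use torch.distributed.launch as recommended later, the launcher derives the per-process rank for you instead.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Values injected by VESSL for each worker, as described above.
NUM_NODES = int(os.environ["NUM_NODES"])        # number of workers (nodes)
NUM_TRAINERS = int(os.environ["NUM_TRAINERS"])  # GPUs per node
NODE_RANK = int(os.environ["RANK"])             # global rank of this node


def init_process(local_rank: int) -> None:
    # Derive the global process rank from the node rank and the local GPU index.
    global_rank = NODE_RANK * NUM_TRAINERS + local_rank
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        world_size=NUM_NODES * NUM_TRAINERS,
        rank=global_rank,
    )
    # ... build the model, wrap it in DistributedDataParallel, and train ...


if __name__ == "__main__":
    # Launch one process per GPU on this node.
    mp.spawn(init_process, nprocs=NUM_TRAINERS)
```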
Creating a distributed experiment
Using Web Console
Running a distributed experiment on the web console is similar to running a single-node experiment. To create a distributed experiment, you only need to specify the number of workers; the other options are the same as those of a single-node experiment.

Using CLI
To run a distributed experiment using the CLI, the number of nodes must be set to an integer greater than one, as in the sketch below.
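The command below is a sketch only: the option names (for example --num-nodes and --command) are assumptions that may differ by CLI version, so check `vessl experiment create --help` for the options available to you.

```sh
# Sketch only: option names such as --num-nodes and --command are assumptions
# and may differ by CLI version; the remaining options match a single-node run.
vessl experiment create \
  --num-nodes 2 \
  --command "python -m torch.distributed.launch ..."
```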
Examples: Distributed CIFAR

You can find the full example code here.

Step 1: Prepare the CIFAR-10 dataset
Download the CIFAR-10 dataset with a script such as the one below, and add a vessl-type dataset to your organization.
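For instance, the following sketch (assuming torchvision is installed) downloads CIFAR-10 into a local ./data directory, which you can then register as a vessl-type dataset in your organization.

```python
# Illustrative only: download CIFAR-10 locally with torchvision,
# then register the ./data directory as a VESSL dataset.
from torchvision.datasets import CIFAR10

CIFAR10(root="./data", train=True, download=True)   # training split
CIFAR10(root="./data", train=False, download=True)  # test split
```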
Step 2: Create a distributed experiment

To run a distributed experiment, we recommend using the torch.distributed.launch package. The example below is a start command that runs on two nodes with one GPU per node.
The launcher takes the options --node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes, which map directly to the environment variables that VESSL sets.
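A sketch of such a start command is shown below; examples/cifar/main.py is a placeholder for your own training entry point.

```sh
# Sketch of a start command for 2 nodes x 1 GPU per node; the entry point
# (examples/cifar/main.py) is a placeholder for your own script.
python -m torch.distributed.launch \
  --nnodes=$NUM_NODES \
  --nproc_per_node=$NUM_TRAINERS \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  examples/cifar/main.py
```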

