Distributed Experiments
Early access feature
Currently, only the PyTorch framework is supported for distributed experiments.
A distributed experiment is a single machine learning run on top of multiple nodes or multiple GPUs. The results of a distributed experiment consist of logs, metrics, and artifacts for each worker, which you can find under the corresponding tabs.
Caveats
Multi-node training is not always the optimal solution. We recommend running a few experiments with a small number of epochs to see whether multi-node training is the right choice for you.
VESSL automatically sets the environment variables below based on the experiment configuration.
NUM_NODES: Number of workers
NUM_TRAINERS: Number of GPUs per node
RANK: The global rank of the node
MASTER_ADDR: The address of the master node service
MASTER_PORT: The port number on the master address
Running a distributed experiment on the web console is similar to running a single-node experiment. To create a distributed experiment, you only need to specify the number of workers; the other options are the same as those of a single-node experiment.
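The variables listed above can be read directly from the environment inside each worker container. The snippet below is a minimal sketch using only the Python standard library; the print format is purely illustrative.
import os

num_nodes = int(os.environ["NUM_NODES"])        # number of workers
num_trainers = int(os.environ["NUM_TRAINERS"])  # number of GPUs per node
rank = int(os.environ["RANK"])                  # global rank of this node
master_addr = os.environ["MASTER_ADDR"]         # address of the master node service
master_port = os.environ["MASTER_PORT"]         # port number on the master address

print(f"node {rank} of {num_nodes}, {num_trainers} GPUs per node, master {master_addr}:{master_port}")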
To run a distributed experiment using the CLI, the number of workers must be set to an integer greater than one.
vessl experiment create --worker-count 2 --framework-type pytorch
Download the CIFAR-10 dataset with the commands below, and add a vessl type dataset to your organization.
wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvzf cifar-10-python.tar.gz
Or, you can simply add an AWS S3 type dataset to your organization with the following public bucket URI.
s3://savvihub-public-apne2/cifar-10
To run a distributed experiment, we recommend using the torch.distributed.launch package. The example start command below runs on two nodes with one GPU per node.
python -m torch.distributed.launch \
--nnodes=$NUM_NODES \
--nproc_per_node=$NUM_TRAINERS \
--node_rank=$RANK \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
examples/distributed_cifar/pytorch/main.py
VESSL automatically sets the environment variables passed to --node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes.
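For reference, the distributed setup inside the training script typically picks up these values through the env:// initialization method. The following is a minimal sketch of what a script such as examples/distributed_cifar/pytorch/main.py might do; the actual example code may differ, and the model here is only a placeholder.
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
args = parser.parse_args()

# The launcher exports MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for each
# process, so init_process_group can read them via the env:// method.
dist.init_process_group(backend="nccl", init_method="env://")

torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

model = torch.nn.Linear(3 * 32 * 32, 10).to(device)  # placeholder model, not the actual CIFAR example model
model = DistributedDataParallel(model, device_ids=[args.local_rank])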
In a distributed experiment, all workers share an output storage. Please be aware that files can be overwritten by other workers when you use the same output path.
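One way to avoid collisions is to include the worker's rank in the output path. The sketch below assumes /output as the output directory and uses a placeholder model and file name for illustration.
import os

import torch

rank = os.environ.get("RANK", "0")  # global rank of this node, set by VESSL
output_dir = os.path.join("/output", f"worker-{rank}")  # /output is an assumed output path
os.makedirs(output_dir, exist_ok=True)

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder model
torch.save(model.state_dict(), os.path.join(output_dir, "model.pt"))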