NUM_NODES
: Number of worker nodes
NUM_TRAINERS
: Number of GPUs per node
RANK
: The global rank of the node
MASTER_ADDR
: The address of the master node service
MASTER_PORT
: The port number on the master address
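As a minimal sketch of how a training script might consume these variables, the snippet below reads them into typed values. The helper name `read_launch_env` is hypothetical, and the derived `world_size` assumes one training process per GPU, which the definitions above do not state.

```python
import os


def read_launch_env(env=None):
    """Parse the launch variables into a dict of typed values.

    `env` defaults to the process environment; passing a plain dict
    makes the function easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    cfg = {
        "num_nodes": int(env["NUM_NODES"]),        # number of worker nodes
        "num_trainers": int(env["NUM_TRAINERS"]),  # GPUs per node
        "rank": int(env["RANK"]),                  # global rank of this node
        "master_addr": env["MASTER_ADDR"],         # address of the master node service
        "master_port": int(env["MASTER_PORT"]),    # port on the master address
    }
    # Assumption: one training process per GPU, so the total process
    # count is nodes multiplied by GPUs per node.
    cfg["world_size"] = cfg["num_nodes"] * cfg["num_trainers"]
    return cfg
```

A missing variable raises `KeyError`, which surfaces a misconfigured job at startup rather than partway through training.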
torch.distributed.launch package. An example start command that runs on two nodes with one GPU per node is as follows.
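A minimal sketch of such a launch, written in Python so that each argument is explicit. The helper name `build_launch_command`, the master address `10.0.0.1`, the port `29500`, and the script name `train.py` are hypothetical placeholders, not values from this document. The same command is run once on each node, with only the node rank changing.

```python
import shlex
import sys


def build_launch_command(node_rank, master_addr, master_port,
                         nproc_per_node=1, nnodes=2, script="train.py"):
    """Return the argv list for one node of a torch.distributed.launch job.

    The defaults describe two nodes with one GPU (one process) each.
    `script` is a placeholder for the user's training entry point.
    """
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Print the command line to run on each of the two nodes.
    for node_rank in range(2):
        print(shlex.join(build_launch_command(node_rank, "10.0.0.1", 29500)))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.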
--node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes.