Leverage the power of GPUs to efficiently train models with batch runs
resources.preset=v1.v100-1.mem-52
requests a V100 GPU instance. Next, the nvidia-smi
command is executed to display the
NVIDIA System Management Interface output, after which the run would normally terminate.
Setting termination_protect
protects the container from termination after the nvidia-smi
command finishes.
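A minimal sketch of what this part of the configuration could look like is shown below. The name and cluster values, and the exact placement of keys such as run and termination_protect, are assumptions here and should be checked against the VESSL Run YAML reference.

```yaml
# Sketch of a GPU-backed run that prints nvidia-smi output.
# "quickstart" and "aws-apne2" are placeholder values, not taken from this example.
name: quickstart
resources:
  cluster: aws-apne2            # placeholder cluster name
  preset: v1.v100-1.mem-52      # request a V100 GPU instance
run:
  - command: nvidia-smi         # display the NVIDIA System Management Interface output
termination_protect: true       # keep the container from being terminated after the command
```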
The nvcr.io/nvidia/pytorch:21.05-py3
image is used, and a V100 GPU (resources.preset=v1.v100-1.mem-52
) is allocated for the run, ensuring that the training job runs on a V100 GPU.
The model and scripts used in this run are fetched from a GitHub repository (/root/examples: git://github.com/vessl-ai/examples
).
The commands executed in the run first install the requirements and then train the model using the run.py
script.
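Putting the pieces together, the whole run could be sketched roughly as follows. Only resources.preset, the image, the import mapping, and the commands come from this example; the name, cluster, and workdir values are illustrative assumptions.

```yaml
# Rough sketch of the training run described above; field names and values other than
# resources.preset, image, the import mapping, and the commands are assumptions.
name: gpu-batch-training                  # placeholder run name
resources:
  cluster: aws-apne2                      # placeholder cluster name
  preset: v1.v100-1.mem-52                # allocate a V100 GPU for the run
image: nvcr.io/nvidia/pytorch:21.05-py3   # NVIDIA PyTorch container image
import:
  /root/examples: git://github.com/vessl-ai/examples   # fetch model and scripts
run:
  - workdir: /root/examples               # placeholder working directory
    command: |
      pip install -r requirements.txt     # install the requirements
      python run.py                       # train the model
```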
This example demonstrates how you can set up a batch run for GPU-backed training of a machine learning model with a single YAML configuration.