Launch batch jobs on GPUs
Leverage the power of GPUs to efficiently run batch training jobs
Batch Run
Batch runs are designed to execute a series of commands defined in your YAML configuration and then terminate. Batch runs are suitable for large-scale, long-running tasks such as model training, where GPU acceleration can significantly shorten training times.
A Simple Batch Run
Here is an example of a simple batch run YAML configuration. It specifies the Docker image to be used, the resources required for the run, and the commands to be executed during the run.
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - command: |
      nvidia-smi
In this example, resources.preset=gpu-l4-small requests an L4 GPU instance on the vessl-gcp-oregon cluster. The nvidia-smi command is then executed to display the NVIDIA System Management Interface output, after which the run terminates.
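Once you have saved this configuration to a file, you can submit it as a batch run from your terminal. The sketch below assumes the VESSL CLI is installed and configured for your organization and project, and that the file is named gpu-batch-run.yaml; check vessl run create --help for the exact flags in your CLI version.

# Sketch: submit the batch run defined above (file name is an assumption)
pip install vessl          # install the VESSL CLI/SDK
vessl configure            # authenticate and select an organization and project
vessl run create -f gpu-batch-run.yaml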
Termination Protection
You can also define termination protection in a batch run. Termination protection keeps your run active for a specified duration even after your commands have finished executing. This can be useful for debugging or retrieving intermediate files.
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - command: |
      nvidia-smi
termination_protect: true
In this example, termination_protect prevents the container from terminating after the nvidia-smi command finishes, so you can still access it for debugging or to retrieve intermediate files.
Train a Thin-Plate Spline Motion Model with GPU resources
Now let's dive into a more complex batch run configuration. This configuration file describes a batch run for training a Thin-Plate Spline Motion Model on an L4 GPU.
name: Thin-Plate-Spline-Motion-Model
description: "Animate your own image in the desired way with a batch run on VESSL."
image: nvcr.io/nvidia/pytorch:21.05-py3
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - workdir: /root/examples/deprecated/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt
      python run.py --config config/vox-256.yaml --device_ids 0
import:
  /root/examples: git://github.com/vessl-ai/examples
  /root/examples/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/
In this batch run, the Docker image nvcr.io/nvidia/pytorch:21.05-py3 is used, and an L4 GPU (resources.preset=gpu-l4-small) is allocated, ensuring that the training job runs on GPU hardware.
The model and training scripts are fetched from a GitHub repository (/root/examples: git://github.com/vessl-ai/examples), and the vox dataset is mounted from S3 (/root/examples/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/).
The commands executed in the run first install the requirements and then train the model using the run.py script.
This example demonstrates how you can set up a batch run for GPU-backed training of a machine learning model with a single YAML configuration.
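If you want to sanity-check the image and command pair locally before submitting the batch run, a rough Docker approximation is sketched below. This is an assumption-heavy sketch, not how VESSL executes the run: it assumes Docker with the NVIDIA container runtime and a local clone of the examples repository, and it does not mount the vox dataset that the VESSL configuration imports from S3, so the training itself will not complete without that data.

# Rough local approximation of the batch run's image + command (sketch only;
# the S3-imported vox dataset is not mounted here)
git clone https://github.com/vessl-ai/examples.git
docker run --rm --gpus all \
  -v "$PWD/examples:/root/examples" \
  -w /root/examples/deprecated/thin-plate-spline-motion-model \
  nvcr.io/nvidia/pytorch:21.05-py3 \
  bash -c "pip install -r requirements.txt && python run.py --config config/vox-256.yaml --device_ids 0"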
What’s Next
For more advanced configurations and examples, please visit VESSL Hub.
VESSL Hub
A variety of YAML examples that you can use as references