This example fine-tunes Llama2-7B with a code instruction dataset, illustrating how VESSL AI offloads the infrastructural challenges of large-scale AI workloads and helps you train multi-billion-parameter models in hours, not weeks.

This is the most compute-intensive workload yet, but you will see how VESSL AI’s efficient training stack lets you seamlessly scale and execute multi-node training. For a more in-depth guide, refer to our blog post.

What you will do

  • Fine-tune an LLM with zero-to-minimum setup
  • Mount a custom dataset
  • Store and export model artifacts

Writing the YAML

Let’s fill in the llama2_fine-tuning.yml file.

1

Spin up a training job

Let’s spin up an instance.

name: Llama2-7B fine-tuning
description: Fine-tune Llama2-7B with instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: quay.io/vessl-ai/torch:2.1.0-cuda12.2-r3
2

Mount the code, model, and dataset

Here, in addition to our GitHub repo and Hugging Face model, we are also mounting a Hugging Face dataset.

As with our HF model, mounting data is as simple as referencing a URL beginning with the hf:// scheme. The same goes for other cloud storage, for example s3:// for Amazon S3 (a hypothetical S3 mount is sketched after the YAML below).

name: llama2-finetuning
description: Fine-tune Llama2-7B with instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: quay.io/vessl-ai/torch:2.1.0-cuda12.2-r3
import:
  /model/: hf://huggingface.co/VESSL/llama2
  /code/:
    git:
      url: https://github.com/vessl-ai/hub-model
      ref: main
  /dataset/: hf://huggingface.co/datasets/VESSL/code_instructions_small_alpaca
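For reference, mounting from Amazon S3 only changes the URL scheme. The bucket and prefix below are hypothetical placeholders:

import:
  /dataset/: s3://my-bucket/code-instructions/  # hypothetical bucket and prefix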
3

Write the run commands

Now that we have the three pillars of model development mounted on our remote workload, we are ready to define the run command. Let’s install additional Python dependencies and run finetuning.py, which picks up our HF model and dataset from the config.yaml file.

name: llama2-finetuning
description: Fine-tune Llama2-7B with instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: quay.io/vessl-ai/torch:2.1.0-cuda12.2-r3
import:
  /model/: hf://huggingface.co/VESSL/llama2
  /code/:
    git:
      url: https://github.com/vessl-ai/hub-model
      ref: main
  /dataset/: hf://huggingface.co/datasets/VESSL/code_instructions_small_alpaca
run:
  - command: |-
      pip install -r requirements.txt
      python finetuning.py
    workdir: /code/llama2-finetuning
4

Export a model artifact

You can keep track of model checkpoints by dedicating an export volume to the workload. After training finishes, the trained model is uploaded to the artifacts folder as model checkpoints.

name: llama2-finetuning
description: Fine-tune Llama2-7B with instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: quay.io/vessl-ai/torch:2.1.0-cuda12.2-r3
import:
  /model/: hf://huggingface.co/VESSL/llama2
  /code/:
    git:
      url: https://github.com/vessl-ai/hub-model
      ref: main
  /dataset/: hf://huggingface.co/datasets/VESSL/code_instructions_small_alpaca
run:
  - command: |-
      pip install -r requirements.txt
      python finetuning.py
    workdir: /code/llama2-finetuning
export:
  /artifacts/: vessl-artifact://

Running the workload

Run the workload with the command below. Once it completes, follow the link printed in the terminal to find the output files, including the model checkpoints, under Files.

vessl run create -f llama2_fine-tuning.yml

Behind the scenes

With VESSL AI, you can launch a full-scale LLM fine-tuning workload on any cloud, at any scale, without worrying about the underlying systems listed below.

  • Model checkpointing — VESSL AI stores .pt files to mounted volumes or the model registry and ensures seamless checkpointing of fine-tuning progress.
  • GPU failovers — VESSL AI autonomously detects GPU failures, recovers failed containers, and automatically re-assigns workloads to other GPUs.
  • Spot instances — Spot instances on VESSL AI work with model checkpointing and export volumes, safely saving and resuming the progress of interrupted workloads.
  • Distributed training — VESSL AI comes with native support for PyTorch DistributedDataParallel and simplifies the process for setting up multi-cluster, multi-node distributed training.
  • Autoscaling — As more GPUs are released from other tasks, you can dedicate more GPUs to fine-tuning workloads. On VESSL AI, you can do this by updating the resources section of your existing fine-tuning YAML, as sketched below.
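The snippet below is a minimal sketch of requesting a larger resource preset for the same workload. The preset name is hypothetical, and the presets and autoscaling options actually available depend on your cluster, so check the VESSL documentation for the exact fields.

resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-large  # hypothetical larger preset; replace with one available on your cluster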

Tips & tricks

In addition to the model checkpoints, you can track key metrics and parameters with the vessl.log Python SDK. Here’s a snippet from finetuning.py.

import vessl
from transformers import TrainerCallback


class VesslLogCallback(TrainerCallback):
    """Forwards Hugging Face Trainer logs to VESSL AI for metric tracking."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None:
            return
        if "eval_loss" in logs:
            payload = {
                "eval_loss": logs["eval_loss"],
            }
            vessl.log(step=state.global_step, payload=payload)
        elif "loss" in logs:
            payload = {
                "train_loss": logs["loss"],
                "learning_rate": logs["learning_rate"],
            }
            vessl.log(step=state.global_step, payload=payload)
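To activate the callback, pass an instance to the callbacks argument when constructing the Hugging Face Trainer, for example Trainer(..., callbacks=[VesslLogCallback()]); finetuning.py in the repository is assumed to wire it up this way.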

Using our web interface

You can repeat the same process on the web. Head over to your Organization, select a project, and create a New run.

What’s next?

We shared how you can use VESSL AI to go from a simple Python container to a full-scale AI workload. We hope these guides give you a glimpse of what you can achieve with VESSL AI. For more resources, follow along with our example models or use cases.