Llama 3.1 fine-tuning
Fine-tune Llama 3.1 8B with instruction datasets
This example fine-tunes Llama-3.1-8B with an instruction dataset, illustrating how VESSL AI offloads the infrastructural challenges of large-scale AI workloads and helps you train multi-billion-parameter models in hours, not weeks.
This is the most compute-intensive workload yet, but you will see how VESSL AI’s training stack enables you to seamlessly scale and execute multi-node training. For a more in-depth introduction, refer to our blog post.
For non-gated quantized models, please refer to the relevant links provided below.
* https://huggingface.co/unsloth
* https://huggingface.co/casperhansen
* https://huggingface.co/TheBloke
What you will do
- Fine-tune an LLM with zero-to-minimum setup
- Mount a custom dataset
- Store and export model artifacts
Writing the YAML
Let’s fill in the llama-3-1-fine-tuning.yaml file.
Spin up a training job
Let’s spin up an instance.
name: Llama-3.1-fine-tuning
description: Fine-tune Llama-3.1-8B on instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small-spot
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
Mount the code, model, and dataset
Here, in addition to our GitHub repo, we are also mounting a Hugging Face dataset. Mounting data is as simple as referencing a URL that begins with the hf:// scheme, and the same goes for other cloud storage, for example s3:// for Amazon S3.
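The scheme prefix is what tells the platform where an import source lives. As a rough illustration only (VESSL AI resolves these URIs itself, and the helper names and scheme table below are hypothetical), this sketch splits a source string into its scheme and, for a Hugging Face dataset, the Hub repo id it points to:

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-backend table for illustration; not VESSL AI's code.
SCHEME_TO_BACKEND = {
    "hf": "Hugging Face",
    "s3": "Amazon S3",
}


def classify_import_source(source: str) -> str:
    """Return the storage backend implied by the URI scheme of an import source."""
    scheme = urlparse(source).scheme
    if scheme not in SCHEME_TO_BACKEND:
        raise ValueError(f"unsupported scheme {scheme!r} in {source!r}")
    return SCHEME_TO_BACKEND[scheme]


def hf_dataset_repo_id(source: str) -> str:
    """Extract the Hub repo id ("owner/name") from an hf:// dataset URL."""
    parsed = urlparse(source)
    prefix = "/datasets/"
    if parsed.scheme != "hf" or not parsed.path.startswith(prefix):
        raise ValueError(f"not an hf:// dataset URL: {source!r}")
    return parsed.path[len(prefix):].rstrip("/")


if __name__ == "__main__":
    src = "hf://huggingface.co/datasets/Amod/mental_health_counseling_conversations"
    print(classify_import_source(src))  # Hugging Face
    print(hf_dataset_repo_id(src))  # Amod/mental_health_counseling_conversations
```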
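As referenced in main.py, we are using Meta’s Llama 3.1 hosted on Hugging Face. Since Meta’s original checkpoint is a gated model, you need to authenticate the run with a Hugging Face token under env. The unsloth/Meta-Llama-3.1-8B-bnb-4bit checkpoint used in the run command below is not gated, so the token is only required if you switch to Meta’s original checkpoint.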
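If you want the same script to work with or without a token, one option is to fall back to a non-gated checkpoint when no token is present. This is a minimal sketch, not what main.py does: choose_model is a hypothetical helper, and meta-llama/Llama-3.1-8B is an assumed repo id for Meta’s gated checkpoint. HF_TOKEN is the environment variable the huggingface_hub library reads for authentication.

```python
import os
from typing import Mapping, Optional

# Assumed repo id of Meta's gated checkpoint (needs an accepted license and a token).
GATED_MODEL = "meta-llama/Llama-3.1-8B"
# Non-gated 4-bit checkpoint that this example's run command already uses.
NON_GATED_MODEL = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"


def choose_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: use the gated model only if HF_TOKEN is set.

    A token being present does not guarantee the account has accepted the
    model license; the download itself would still fail in that case.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return GATED_MODEL if token else NON_GATED_MODEL


if __name__ == "__main__":
    print(choose_model({}))  # unsloth/Meta-Llama-3.1-8B-bnb-4bit
    print(choose_model({"HF_TOKEN": "hf_example"}))  # meta-llama/Llama-3.1-8B
```

Silently falling back can hide a misconfigured secret, so a stricter variant would raise an error instead when the gated model is requested explicitly.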
name: Llama-3.1-fine-tuning
description: Fine-tune Llama-3.1-8B on instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small-spot
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
import:
  /code/:
    git:
      url: https://github.com/vessl-ai/examples
      ref: main
  /dataset/: hf://huggingface.co/datasets/Amod/mental_health_counseling_conversations
Write the run commands
Now that we have the three pillars of model development mounted on our remote workload, we are ready to define the run command. Let’s install additional Python dependencies and run main.py.
name: Llama-3.1-fine-tuning
description: Fine-tune Llama-3.1-8B on instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small-spot
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
import:
  /code/:
    git:
      url: https://github.com/vessl-ai/examples
      ref: main
  /dataset/: hf://huggingface.co/datasets/Amod/mental_health_counseling_conversations
run:
  - command: |-
      pip install -r requirements.txt
      pip install flash-attn==2.6.3 --no-build-isolation
      python main.py \
        --model_name_or_path unsloth/Meta-Llama-3.1-8B-bnb-4bit \
        --dataset_name /dataset/ \
        --output_dir outputs \
        --max_seq_length 2048 \
        --num_train_epochs 1 \
        --logging_steps 5 \
        --bf16 True \
        --learning_rate 1e-5 \
        --weight_decay 1e-2 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 2 \
        --gradient_checkpointing True \
        --peft_type LORA \
        --load_in_4bit True \
        --bnb_4bit_use_double_quant True
    workdir: /code/runs/finetune-llms
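A quick back-of-the-envelope check helps explain why these flags fit on a single L4 GPU (24 GB). The sketch assumes the gpu-l4-small-spot preset provides one GPU and that the model has roughly 8 billion parameters; it counts quantized weights only, so LoRA adapters, optimizer state, activations, and quantization constants all add to the real footprint.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # Values from the run command; num_gpus=1 is an assumption about the preset.
    batch = effective_batch_size(per_device=4, grad_accum=2, num_gpus=1)
    print(batch)  # 8
    # Upper bound on tokens per optimizer step at --max_seq_length 2048.
    print(batch * 2048)  # 16384
    print(f"{weight_memory_gib(8e9, 4):.2f} GiB in 4-bit")  # 3.73 GiB in 4-bit
    print(f"{weight_memory_gib(8e9, 16):.2f} GiB in bf16")  # 14.90 GiB in bf16
```

Adding GPUs multiplies the effective batch size, which is worth keeping in mind when comparing runs across presets.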
Export a model artifact
You can keep track of model checkpoints by dedicating an export volume to the workload. Here, --output_dir points to /artifacts/, which is exported to vessl-artifact://, so once training is finished, the trained models are uploaded to the artifact folder as model checkpoints.
name: Llama-3.1-fine-tuning
description: Fine-tune Llama-3.1-8B on instruction datasets
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small-spot
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
import:
  /code/:
    git:
      url: https://github.com/vessl-ai/examples
      ref: main
  /dataset/: hf://huggingface.co/datasets/Amod/mental_health_counseling_conversations
run:
  - command: |-
      pip install -r requirements.txt
      pip install flash-attn==2.6.3 --no-build-isolation
      python main.py \
        --model_name_or_path unsloth/Meta-Llama-3.1-8B-bnb-4bit \
        --dataset_name /dataset/ \
        --output_dir /artifacts/ \
        --max_seq_length 2048 \
        --num_train_epochs 1 \
        --logging_steps 5 \
        --bf16 True \
        --learning_rate 1e-5 \
        --weight_decay 1e-2 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 2 \
        --gradient_checkpointing True \
        --peft_type LORA \
        --load_in_4bit True \
        --bnb_4bit_use_double_quant True
    workdir: /code/runs/finetune-llms
export:
  /artifacts/: vessl-artifact://
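The export only picks up files written under /artifacts/, so --output_dir has to land inside that path; the earlier value, outputs, would not be uploaded by this export. The sketch below is a hypothetical pre-flight check for that mistake (is_exported and resolve_output_dir are illustrative names), and it assumes a relative --output_dir is resolved against workdir, the directory the command runs in.

```python
from pathlib import PurePosixPath
from typing import Sequence


def resolve_output_dir(workdir: str, output_dir: str) -> PurePosixPath:
    """Resolve --output_dir relative to workdir.

    Joining an absolute path discards workdir, so "/artifacts/" stays as-is.
    """
    return PurePosixPath(workdir) / output_dir


def is_exported(workdir: str, output_dir: str, export_paths: Sequence[str]) -> bool:
    """Hypothetical check: is output_dir inside one of the export mount paths?"""
    resolved = resolve_output_dir(workdir, output_dir)
    # is_relative_to requires Python 3.9+; it is True for the path itself too.
    return any(resolved.is_relative_to(PurePosixPath(p)) for p in export_paths)


if __name__ == "__main__":
    workdir = "/code/runs/finetune-llms"
    exports = ["/artifacts/"]
    print(is_exported(workdir, "outputs", exports))  # False
    print(is_exported(workdir, "/artifacts/", exports))  # True
```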
Running the workload
Launch the workload with the VESSL CLI:
vessl run create -f llama-3-1-fine-tuning.yaml
Once the workload is completed, follow the link in the terminal to find the output files, including the model checkpoints, under Files.
Behind the scenes
With VESSL AI, you can launch a full-scale LLM fine-tuning workload on any cloud, at any scale, without worrying about these underlying system backends.
- Model checkpointing: VESSL AI stores .pt files to mounted volumes or the model registry and ensures seamless checkpointing of fine-tuning progress.
- GPU failovers: VESSL AI can autonomously detect GPU failures, recover failed containers, and automatically reassign workloads to other GPUs.
- Spot instances: Spot instances on VESSL AI work with model checkpointing and export volumes, safely saving and resuming the progress of interrupted workloads.
- Distributed training: VESSL AI comes with native support for PyTorch DistributedDataParallel and simplifies setting up multi-cluster, multi-node distributed training.
- Autoscaling: As more GPUs are released from other tasks, you can dedicate more GPUs to fine-tuning workloads. You can do this on VESSL AI by adding an autoscaling configuration to your existing fine-tuning YAML.
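Resuming after a spot interruption depends on the training script picking up the last saved checkpoint. The Transformers Trainer writes checkpoints as checkpoint-&lt;step&gt; folders under output_dir, and Trainer.train accepts a resume_from_checkpoint argument. Below is a minimal stdlib sketch of finding the newest checkpoint; latest_checkpoint is a hypothetical helper (Transformers ships a similar one, transformers.trainer_utils.get_last_checkpoint), and this is not a claim about how this example’s main.py resumes.

```python
import re
from pathlib import Path
from typing import Optional

# Trainer-style checkpoint folder names, e.g. "checkpoint-500".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint folder with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    steps = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            steps.append((int(match.group(1)), child))
    if not steps:
        return None
    # Compare step numbers as integers: a string sort would rank
    # "checkpoint-500" above "checkpoint-1000".
    _, newest = max(steps, key=lambda item: item[0])
    return str(newest)


if __name__ == "__main__":
    # Passing None to resume_from_checkpoint starts training from scratch, e.g.
    # trainer.train(resume_from_checkpoint=latest_checkpoint("/artifacts/"))
    print(latest_checkpoint("/artifacts/"))
```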
Tips and Tricks
This example uses callbacks based on the Hugging Face Transformers library (see here), so no additional logging code is required. VESSL AI integrates with the Transformers library through a callback in the VESSL Python SDK: metrics are plotted automatically, and you can optionally upload models to VESSL Models. For detailed usage, refer to the Transformers integration page.
Using our web interface
You can repeat the same process on the web. Head over to your Organization, select a project, and create a New run.
What’s next?
Next, let’s see how you can serve and deploy your fine-tuned Llama 3.1 model to the cloud and create a text-generation API endpoint.