Train nanoGPT from scratch
Launch batch jobs to train nanoGPT from scratch and track experiments with vessl.log()
NanoGPT is a simple codebase for training and fine-tuning small-sized GPTs. This example explains how you can launch batch jobs with VESSL Run to train nanoGPT from scratch, and use VESSL AI’s Python SDK to log & track the training progress.
Try it on VESSL Hub
Try out the Quickstart example with a single click on VESSL Hub.
See the final code
See the completed YAML file and final code for this example.
What you will do
- Write YAML to train nanoGPT from scratch
- Run batch jobs with different hyperparameters using VESSL Run
- Use `vessl.log()` to log the model's key metrics to VESSL AI during training
Using our Python SDK
As we previously covered here, you can log metrics like accuracy and loss during each epoch with `vessl.log()`. In this example, the training loop is defined in `train.py`. For every iteration, we log the learning rate and the loss on the training and validation datasets as `lr`, `train_loss`, and `val_loss`.
```python
vessl.log(
    step=iter_num,
    payload={
        "train_loss": losses["train"],
        "val_loss": losses["val"],
        "lr": lr,
    },
)
```
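To see the shape of this call in isolation, here is a condensed, offline-runnable sketch. `estimate_loss()`, the fixed `lr`, and the `log()` stand-in (which imitates `vessl.log`'s `step`/`payload` signature) are illustrative, not nanoGPT's actual code:

```python
# Condensed sketch of the logging pattern; log() records what vessl.log
# would send to VESSL AI, so the loop runs anywhere.
history = []

def estimate_loss(iter_num):
    # Placeholder losses that decay as training "progresses".
    return {"train": 1.0 / (iter_num + 1), "val": 1.2 / (iter_num + 1)}

def log(step, payload):
    history.append((step, payload))  # stand-in for vessl.log(step=..., payload=...)

for iter_num in range(5):
    losses = estimate_loss(iter_num)
    lr = 1e-3  # a real loop would compute this from the lr schedule
    log(step=iter_num, payload={
        "train_loss": losses["train"],
        "val_loss": losses["val"],
        "lr": lr,
    })
```

Each entry in `history` corresponds to one point on the metric charts in the VESSL AI tracking dashboard.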
Let’s also edit the code so that it accepts three hyperparameters, `batch_size`, `block_size`, and `learning_rate`, as environment variables. Later, you will see how you can quickly edit these values directly in the YAML for the batch jobs.
```python
batch_size = int(os.environ.get("batch_size", globals()["batch_size"]))
block_size = int(os.environ.get("block_size", globals()["block_size"]))
learning_rate = float(os.environ.get("learning_rate", globals()["learning_rate"]))
```
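To illustrate the override behavior, here is a self-contained sketch (the defaults and the values being set are illustrative) showing that an environment variable, when present, takes precedence over the in-script default:

```python
import os

# Simulate the env section of a VESSL Run YAML (values are illustrative).
os.environ["batch_size"] = "8"
os.environ["block_size"] = "128"

# Defaults as they might appear in the training script.
batch_size, block_size, learning_rate = 64, 256, 1e-3

# Environment variables override the defaults; unset ones fall through.
batch_size = int(os.environ.get("batch_size", batch_size))
block_size = int(os.environ.get("block_size", block_size))
learning_rate = float(os.environ.get("learning_rate", learning_rate))

print(batch_size, block_size, learning_rate)  # 8 128 0.001
```

Since environment variables are strings, the `int()`/`float()` conversions are what let the quoted YAML values (e.g. `"8"`) flow into numeric hyperparameters.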
Writing the YAML
Here’s the completed `nanogpt-1.yaml` file for training nanoGPT from scratch. It is the familiar YAML definition we previously covered in our Get Started guides. As a quick recap, we are
- launching a GPU instance on our managed Google Cloud cluster,
- setting up a runtime with NVIDIA’s PyTorch NGC Docker image,
- mounting our GitHub codebase, and
- exporting the results, in this case the generated text, to VESSL AI’s managed artifacts.
```yaml
name: nanogpt-1
description: The fastest way to build your own storyteller with VESSL AI
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: nvcr.io/nvidia/pytorch:22.03-py3
import:
  /root/examples/: git://github.com/vessl-ai/examples.git
export:
  /out-shakespeare-char/: vessl-artifact://
run:
  - command: |-
      pip install torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html
      pip install transformers datasets tiktoken tqdm
      python data/shakespeare_char/prepare.py
      python train.py config/train_shakespeare_char.py
      python sample.py --out_dir=out-shakespeare-char
    workdir: /root/examples/nanogpt
```
In the `run` section we defined a series of commands to start training.
- `pip install` — We install dependencies like `torchaudio`, `tiktoken`, and `tqdm` in addition to our base Docker image.
- `prepare.py` — This step preprocesses the data for training.
- `train.py` — The command runs `train.py` using `train_shakespeare_char.py` as the configuration file. The configuration file defines parameters necessary for training, such as the learning rate, number of epochs, and batch size.
- `sample.py` — The command generates sample outputs to the specified directory during the run.
You can launch this run with the `vessl run` command.

```sh
vessl run -f nanogpt-1.yaml
```
Running batch jobs
You can set hyperparameters through environment variables with the YAML file we defined above simply by adding an `env` section. Here, we created a new file called `nanogpt-2.yaml` where we changed the `batch_size` and `block_size`.
```yaml
name: nanogpt-2
description: The fastest way to build your own storyteller with a batch run on VESSL.
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: nvcr.io/nvidia/pytorch:22.03-py3
import:
  /root/examples/: git://github.com/vessl-ai/examples.git
export:
  /out-shakespeare-char/: vessl-artifact://
run:
  - command: |-
      pip install torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html
      pip install transformers datasets tiktoken tqdm
      python data/shakespeare_char/prepare.py
      python train.py config/train_shakespeare_char.py
      python sample.py --out_dir=out-shakespeare-char
    workdir: /root/examples/nanogpt
env:
  batch_size: "8"
  block_size: "128"
```
Running batch jobs with different hyperparameters is then just a matter of running multiple `vessl run` commands.
```sh
vessl run -f nanogpt-1.yaml # block_size 64
vessl run -f nanogpt-2.yaml # block_size 128
vessl run -f nanogpt-3.yaml # block_size 256
```
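If you are sweeping more than a handful of values, you can stamp out the YAML files programmatically. This is a small, hypothetical helper (not part of the example repo) that writes one run spec per `block_size`, mirroring the YAML above:

```python
# Hypothetical sweep helper: writes nanogpt-1.yaml, nanogpt-2.yaml, ...
# with a different block_size each, ready for `vessl run -f <file>`.
from pathlib import Path

TEMPLATE = """\
name: nanogpt-{idx}
description: nanoGPT batch run (block_size={block_size})
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: nvcr.io/nvidia/pytorch:22.03-py3
import:
  /root/examples/: git://github.com/vessl-ai/examples.git
export:
  /out-shakespeare-char/: vessl-artifact://
run:
  - command: |-
      pip install torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html
      pip install transformers datasets tiktoken tqdm
      python data/shakespeare_char/prepare.py
      python train.py config/train_shakespeare_char.py
      python sample.py --out_dir=out-shakespeare-char
    workdir: /root/examples/nanogpt
env:
  batch_size: "8"
  block_size: "{block_size}"
"""

for idx, block_size in enumerate([64, 128, 256], start=1):
    Path(f"nanogpt-{idx}.yaml").write_text(
        TEMPLATE.format(idx=idx, block_size=block_size)
    )
```

The quoted `"{block_size}"` keeps the env values as strings, matching the `env` section format shown earlier.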
Tracking progress
The tracking dashboard on VESSL AI allows real-time monitoring of the key metrics logged with `vessl.log()`. Refer to our guide on `vessl.log()` to learn more.
Once you set up a dashboard, you can also set configurations for each chart.