NanoGPT is a simple codebase for training and fine-tuning small GPTs. This example explains how you can launch batch jobs with VESSL Run to train nanoGPT from scratch, and how to use VESSL AI’s Python SDK to log and track the training progress.

What you will do

  • Write YAML to train nanoGPT from scratch
  • Run batch jobs with different hyperparameters using VESSL Run
  • Use vessl.log() to log your model’s key metrics to VESSL AI during training

Using our Python SDK

As we previously covered here, you can log metrics like accuracy and loss during each epoch with vessl.log(). In this example, the training loop is defined in train.py. At every evaluation interval, we log the learning rate and the loss on the training and validation datasets under the keys lr, train_loss, and val_loss.

vessl.log(
    step=iter_num,
    payload={
        "train_loss": losses["train"],
        "val_loss": losses["val"],
        "lr": lr,
    },
)
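
To see the pattern end to end, here is a minimal, self-contained sketch. It is meant to be executed inside a VESSL Run so that vessl.log is wired to a run; the dummy estimate_loss() merely stands in for nanoGPT’s real evaluation helper, and the numbers are arbitrary:

import random
import vessl

def estimate_loss():
    # Stand-in for nanoGPT's helper, which averages the loss over a few
    # batches from each split.
    return {"train": random.random(), "val": random.random()}

max_iters, eval_interval, lr = 100, 20, 6e-4
for iter_num in range(0, max_iters + 1, eval_interval):
    losses = estimate_loss()
    vessl.log(
        step=iter_num,
        payload={
            "train_loss": losses["train"],
            "val_loss": losses["val"],
            "lr": lr,
        },
    )

Each key in payload shows up as its own metric on the tracking dashboard, plotted against step.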

Let’s also edit the code so that it accepts three hyperparameters, batch_size, block_size, and learning_rate, as environment variables. Later, you will see how you can quickly override these values directly in the YAML for your batch jobs.

import os

# Fall back to the defaults from the config file when the environment
# variables are not set.
batch_size = int(os.environ.get("batch_size", globals()["batch_size"]))
block_size = int(os.environ.get("block_size", globals()["block_size"]))
learning_rate = float(os.environ.get("learning_rate", globals()["learning_rate"]))
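
If you prefer not to repeat the get-and-cast pattern for each variable, it can be factored into a small helper. This is not part of nanoGPT, just a sketch: the defaults below are illustrative, and each override is coerced with the type of its default.

import os

# Illustrative defaults, standing in for the values from the config file.
batch_size, block_size, learning_rate = 64, 256, 1e-3

def override_from_env(*names):
    # Replace each module-level default with the environment value,
    # if one is set, coerced to the default's type.
    for name in names:
        default = globals()[name]
        globals()[name] = type(default)(os.environ.get(name, default))

override_from_env("batch_size", "block_size", "learning_rate")

With either version, batch_size=8 python train.py ... on the command line behaves the same as setting batch_size in the YAML’s env section.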

Writing the YAML

Here’s the completed nanogpt-1.yaml file for training nanoGPT from scratch. It uses the familiar YAML definition we previously covered in our Get Started guides. As a quick recap, we are

  • launching a GPU instance on our managed Google Cloud cluster,
  • setting up a runtime with NVIDIA’s PyTorch NGC Docker image,
  • mounting our GitHub codebase, and
  • exporting the results, in this case the generated text, to VESSL AI’s managed artifacts.

name: nanogpt-1
description: The fastest way to build your own storyteller with VESSL AI
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: nvcr.io/nvidia/pytorch:22.03-py3
import:
  /root/examples/: git://github.com/vessl-ai/examples.git
export:
  /out-shakespeare-char/: vessl-artifact://
run:
  - command: |-
      pip install torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html
      pip install transformers datasets tiktoken tqdm
      python data/shakespeare_char/prepare.py
      python train.py config/train_shakespeare_char.py
      python sample.py --out_dir=out-shakespeare-char
    workdir: /root/examples/nanogpt

In the run section, we defined a series of commands to start training.

  • pip install — We install dependencies like torchaudio, tiktoken, and tqdm in addition to our base Docker image.
  • prepare.py — This step preprocesses the data for training.
  • train.py — The command runs train.py with train_shakespeare_char.py as the configuration file, which defines the parameters necessary for training, such as the learning rate, number of iterations, and batch size.
  • sample.py — Once training is complete, the command samples from the trained model and writes the generated text to the specified output directory.

You can launch this run with the vessl run command.

vessl run -f nanogpt-1.yaml

Running batch jobs

You can set hyperparameters through environment variables in the YAML file we defined above simply by adding an env section. Here, we created a new file called nanogpt-2.yaml where we changed batch_size and block_size.

name: nanogpt-2
description: The fastest way to build your own storyteller with a batch run on VESSL.
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
image: nvcr.io/nvidia/pytorch:22.03-py3
import:
  /root/examples/: git://github.com/vessl-ai/examples.git
export:
  /out-shakespeare-char/: vessl-artifact://
run:
  - command: |-
      pip install torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html
      pip install transformers datasets tiktoken tqdm
      python data/shakespeare_char/prepare.py
      python train.py config/train_shakespeare_char.py
      python sample.py --out_dir=out-shakespeare-char
    workdir: /root/examples/nanogpt
env:
  batch_size: "8"
  block_size: "128"

Running batch jobs with different hyperparameters is then just a matter of running multiple vessl run commands.

vessl run -f nanogpt-1.yaml # block_size 64
vessl run -f nanogpt-2.yaml # block_size 128
vessl run -f nanogpt-3.yaml # block_size 256
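
If editing three nearly identical files by hand gets tedious, you could also generate and submit them with a short script. The sketch below assumes PyYAML is installed, reuses nanogpt-1.yaml as the template, and mirrors the file names and block sizes from the commands above:

import subprocess
import yaml

# Load the base run definition once, then write one variant per block_size.
with open("nanogpt-1.yaml") as f:
    base = yaml.safe_load(f)

for i, block_size in enumerate([64, 128, 256], start=1):
    run = dict(base)
    run["name"] = f"nanogpt-{i}"
    # env values are strings, matching the quoted values in the YAML above.
    run["env"] = {"block_size": str(block_size)}
    path = f"nanogpt-{i}.yaml"
    with open(path, "w") as f:
        yaml.safe_dump(run, f, sort_keys=False)
    subprocess.run(["vessl", "run", "-f", path], check=True)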

Tracking progress

The tracking dashboard on VESSL AI allows real-time monitoring of the key metrics logged with vessl.log(). Refer to our guide on vessl.log to learn more.

Once you set up a dashboard, you can also configure each chart.