This example deploys a text generation API using Llama3-8B and vLLM. It illustrates how VESSL AI handles the common steps of model deployment, from launching a GPU-accelerated service workload to establishing an API server.

Upon deployment, VESSL AI also offloads the challenges of managing production models while ensuring availability, scalability, and reliability:

  • Autoscaling the model to handle peak loads and scale to zero when it’s not being used.
  • Routing traffic efficiently across different model versions.
  • Providing real-time monitoring of predictions and performance metrics through comprehensive dashboards and logs.

Read our announcement post for more details.

What you will do

  • Define a text generation API and create a model endpoint
  • Define service specifications
  • Deploy the model to VESSL AI's managed GPU cloud

1. Set up your environment

We’ll start with the LLaMA 3 example, which demonstrates how to deploy an AI service using a single YAML file. Follow these steps to prepare:

# Clone the example repository
git clone

# Install and configure vessl
pip install vessl
vessl configure

2. Deploy a vLLM LLaMA3 Server with VESSL Serve

Configure the resources and environment for the vLLM Llama 3 server in a YAML file as follows.

# quickstart.yaml
# Section keys below follow the VESSL YAML schema; confirm them against the schema reference.
name: vllm-llama-3-server
message: Quickstart to serve Llama3 model with vllm.
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small-spot
run:
- command: |
    pip install vllm
    python -m vllm.entrypoints.openai.api_server --model $MODEL_NAME
env:
  MODEL_NAME: casperhansen/llama-3-8b-instruct-awq
ports:
- port: 8000
service:
  autoscaling:
    max: 2
    metric: cpu
    min: 1
    target: 50
  monitoring:
    - port: 8000
      path: /metrics
  expose: 8000

For YAML manifest details, refer to the YAML schema reference.

Deploy the server from this YAML configuration with the VESSL CLI:

cd examples/serve-quickstart
vessl serve create -f quickstart.yaml

Upon activation, access your model via the provided endpoint, as depicted below:

3. Explore the API Documentation

Access the API documentation by appending /docs to your endpoint URL:

4. Test the API with an OpenAI Client

For compatibility with OpenAI clients, install the OpenAI Python package:

pip install openai

Test your deployed API using the script. Replace YOUR-SERVICE-ENDPOINT with your actual endpoint and execute the command below:

python \
  --base-url "https://{YOUR-SERVICE-ENDPOINT}" \
  --prompt "Can you explain the background concept of LLM?"
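
The test script takes a base URL and a prompt. As a rough sketch of what such a client might look like (this is not the repository's actual script), the following uses the `OpenAI` client from the `openai` package against the OpenAI-compatible `/v1` routes that vLLM serves. The helper names `to_base_url` and `ask` are hypothetical, the model name is assumed to match `MODEL_NAME` from the YAML above, and `api_key="EMPTY"` is a placeholder on the assumption that the server was started without an API key.

```python
def to_base_url(endpoint: str) -> str:
    """Turn a service endpoint into an OpenAI-style base URL ending in /v1."""
    url = endpoint.strip().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    if not url.endswith("/v1"):
        url += "/v1"
    return url


def ask(endpoint: str, prompt: str,
        model: str = "casperhansen/llama-3-8b-instruct-awq") -> str:
    """Send one user prompt to the chat completions route and return the reply text."""
    # Imported here so to_base_url() can be used without the package installed.
    from openai import OpenAI

    # Assumption: the server does not check the API key, so any non-empty value works.
    client = OpenAI(base_url=to_base_url(endpoint), api_key="EMPTY")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Once the service is running, calling `ask("https://{YOUR-SERVICE-ENDPOINT}", "Can you explain the background concept of LLM?")` should return the model's reply as a string.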


  • "NotFound (404): Requested entity not found." error while creating Revisions or Gateways via CLI:
    • Use the vessl whoami command to confirm that the default organization matches the one where the Service exists.
    • You can use the vessl configure --reset command to change the default organization.
    • Ensure that the Service was created within the selected default organization.
  • What’s the difference between Gateway and Endpoint?
    • There is no difference between the two terms; they refer to the same concept.
    • To prevent confusion, these terms will be unified under “Endpoint” in the future.
  • HPA Scale-in/Scale-out Approach:
    • Currently, VESSL Serve operates based on Kubernetes’ Horizontal Pod Autoscaler (HPA) and uses its algorithms as is. For detailed information, refer to the Kubernetes documentation.
    • As an example of how it works based on CPU metrics:
      • Desired replicas = ceil[current replicas * ( current CPU metric value / desired CPU metric value )]
      • The HPA constantly monitors this metric and adjusts the current replicas within the [min, max] range.
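
The replica formula above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the Kubernetes implementation: the function name `desired_replicas` is hypothetical, and the sketch leaves out details of the real HPA such as its tolerance band and stabilization behavior.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Apply ceil(current * current_metric / target_metric), clamped to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Keep the result inside the configured autoscaling range.
    return max(min_replicas, min(max_replicas, raw))


# With min=1, max=2, target=50 (the autoscaling values from quickstart.yaml):
print(desired_replicas(1, 90, 50, 1, 2))   # 1 replica at 90% CPU
print(desired_replicas(2, 20, 50, 1, 2))   # 2 replicas at 20% CPU
print(desired_replicas(2, 120, 50, 1, 2))  # 2 replicas at 120% CPU
```

With one replica at 90% CPU the formula gives ceil(1.8) = 2, so the service scales out. With two replicas at 20% it gives ceil(0.8) = 1, so it scales in. At 120% the raw value is 5, which is capped at the configured maximum of 2.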

Let’s go ahead and deploy the model.

vessl serve create -f service.yaml -a

Once deployed, you can check the status of the model, including the endpoint, logs, and metrics under Services.

For our llama-3-textgen service, you can query the model with the following curl command. Make sure to replace ENDPOINT_URL and API_KEY with your own.

curl "${ENDPOINT_URL}" \
    -H "Content-Type: application/json" \
    -H "X-AUTH-KEY: ${API_KEY}" \
    -d '{
      "messages": [
        {
          "role": "system",
          "content": "You are a pirate chatbot who always responds in pirate speak!"
        },
        {
          "role": "user",
          "content": "Who are you?"
        }
      ]
    }'

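The same request can be sent from Python using only the standard library. The sketch below is an assumption-laden illustration rather than an official client: `build_chat_request` and `send` are hypothetical helper names, `endpoint_url` is taken to be the full URL you would pass to curl, and the response is assumed to be JSON.

```python
import json
import urllib.request


def build_chat_request(endpoint_url: str, api_key: str,
                       user_message: str) -> urllib.request.Request:
    """Build a POST request carrying the system and user messages as JSON."""
    payload = {
        "messages": [
            {"role": "system",
             "content": "You are a pirate chatbot who always responds in pirate speak!"},
            {"role": "user", "content": user_message},
        ]
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-AUTH-KEY": api_key},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and parse the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_chat_request(ENDPOINT_URL, API_KEY, "Who are you?"))` with your own values mirrors the curl command above.
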
What’s next?

Next, let's see how you can serve your model in serverless mode with Text Generation Inference (TGI).