With the serverless mode of VESSL Serve, you can launch a fully operational inference server within minutes using a simple configuration file and instances from VESSL-managed Cloud. These instances are billed on demand.

In this example, we will use the serverless mode of VESSL Serve to launch a server running Llama 3 with Text Generation Inference (TGI). This example can be easily adapted to deploy your own model for inference.

What you will do

  • Create a new service with a serverless mode
  • Create a new service revision
  • Send an HTTP request to the service
  • Make an asynchronous request and get the queued result

Create a new service with a serverless mode

Serverless Mode is only available in VESSL-managed cloud clusters.
Select your organization and click the “Services” tab. Click the “New Service” button on the right side of the “Services” page. There, fill in the basic information for your service:

  • Name: Set your service name
  • Description: You can add any description for your service.
  • Cluster: The cluster in which your service is physically located.

Then, toggle “Serverless” to enable serverless mode and click “Create”. Once the service is created, you are automatically guided to create your first “Revision”, where you set up your container environment.

Create a new service revision

You can set your container settings on this page.

  • Resources: Select the GPU resource (GPU)gpu-l4-small. This instance provides one NVIDIA L4 GPU and 42GB of RAM.
  • Container image: We will use a pre-created TGI docker image. Click on the “Custom” button and type ghcr.io/huggingface/text-generation-inference:2.0.2.
  • Commands: This is the bash command that runs in the container. Use the following command:
text-generation-launcher \
  --model-id $MODEL_ID \
  --port 8000 \
  --max-total-tokens 8192
  • Port: This is the HTTP port that the container exposes. Set the protocol to HTTP, the port number to 8000, and the name to default.
  • Advanced Options:
    • Variable: You can set environment variables and secret values that the container can use. Click “Add variables or secrets”, and add the following name/value pair.
      • Name: MODEL_ID
      • Value: casperhansen/llama-3-8b-instruct-awq
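
The launcher command refers to $MODEL_ID, which the shell fills in from the environment variable defined under Advanced Options. As a rough local illustration only (this is not how VESSL runs the container, and the helper name render_command is hypothetical), the substitution can be reproduced in Python to check which arguments the launcher would receive:

```python
import shlex
from string import Template

# The command from the "Commands" field, written on one line for brevity.
COMMAND = "text-generation-launcher --model-id $MODEL_ID --port 8000 --max-total-tokens 8192"


def render_command(command: str, env: dict) -> list:
    """Substitute $NAME placeholders from env and split the result into an argv list."""
    expanded = Template(command).substitute(env)
    return shlex.split(expanded)


argv = render_command(COMMAND, {"MODEL_ID": "casperhansen/llama-3-8b-instruct-awq"})
print(argv)
```

If MODEL_ID is missing from the mapping, Template.substitute raises a KeyError, which mirrors the kind of misconfiguration worth checking before creating the revision.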

Click the “Create” button in the right corner to create your first VESSL Serve revision.

Once the revision update is complete, your inference server will be ready to go.

Send an HTTP request to the service

You can find your Service overview in the “Overview” Tab.

Click the “Request” button in the upper right to see how to send an inference request to your service.

There you can find HTTP request information such as the inference endpoint, the authorization token, and a sample request to the inference server. For more information, please refer to the Serverless API documentation.

Below is an example of Python code for requesting your inference results.

base_url = "{your-service-endpoint}"
token = "{your-token}"

import requests
response = requests.post(
    f"{base_url}/request/generate",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 1000}
    }
)
print(f"response: {response}")
print(f"response json: {response.json()}")
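
A successful call to the TGI generate route normally returns a JSON object with a generated_text field. Below is a minimal sketch for pulling the text out, assuming that response shape. The helper extract_text and the sample payload are illustrative and are not real server output.

```python
def extract_text(payload: dict) -> str:
    """Return the generated text from a TGI-style /generate response body."""
    if "generated_text" not in payload:
        # Assumption: error bodies carry an "error" field; fall back to the whole payload.
        raise ValueError(f"unexpected response: {payload.get('error', payload)}")
    return payload["generated_text"]


# Hand-written sample standing in for response.json(); not real server output.
sample = {"generated_text": " Deep learning is a subset of machine learning."}
print(extract_text(sample).strip())
```

In the request example, this would be called as extract_text(response.json()) after checking the HTTP status code.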

Since TGI exposes an OpenAI-compatible interface, you can also use the OpenAI Python client to access the server.

base_url = "{your-service-endpoint}"
token = "{your-token}"

from openai import OpenAI
client = OpenAI(
    base_url=f"{base_url}/request/v1",
    api_key=token,
)

chat_completion = client.chat.completions.create(
    model="casperhansen/llama-3-8b-instruct-awq",
    messages=[
        {"role": "user", "content": "What is Deep Learning?"}
    ],
)
print(chat_completion)

When the service is in a cold state (i.e. there are no running replicas due to service idleness) and a new request is made, a new replica will be started immediately.

In such a case, the first few requests may be aborted due to timeouts until the replica is up and running. Please check your HTTP client’s timeout configuration.
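
One way to cope with cold starts on the client side is to retry with a growing delay. The sketch below is an assumption about client behavior, not a VESSL feature: the helper call_with_retry, its defaults, and the fake request are all hypothetical, and the request itself is passed in as a callable so the retry logic stays independent of the HTTP library.

```python
import time


def call_with_retry(send, attempts=5, base_delay=2.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call send() until it succeeds, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            return send()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ... with the defaults


# Stand-in for a real request: fails twice (replica still starting), then succeeds.
calls = {"count": 0}


def fake_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("replica not ready yet")
    return "ok"


result = call_with_retry(fake_send, sleep=lambda seconds: None)
print(result, calls["count"])
```

With the requests library, send could wrap requests.post(..., timeout=30), and retry_on could be set to (requests.exceptions.Timeout, requests.exceptions.ConnectionError).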

Make an asynchronous request

Sometimes you may want to have requests processed asynchronously, for example:

  • to process a large amount of data in a batch, where it is infeasible or inefficient to make requests one by one;
  • to ensure all requests are eventually processed and are not interrupted by network timeouts;
  • when the caller can periodically poll for results and an immediate HTTP response is not required.

To make asynchronous requests, you will use a different pair of APIs: one to create the request and another to fetch the result.

First, create an asynchronous request:

base_url = "{your-service-endpoint}"
token = "{your-token}"

import requests
r = requests.post(
    f"{base_url}/async",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "method": "POST",
        "path": "/generate",
        "data": { # payload to send to actual service
            "inputs": "What is Deep Learning?",
            "parameters": {"max_new_tokens": 1000}
        }
    }
)
assert r.status_code == 201

request_id = r.json()["id"]
print("Successfully created an async request.")
print(f"ID: {request_id}")
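
For the batch use case, the same envelope can be generated for many prompts. This is a minimal sketch that only builds the request bodies, using the method, path, and data fields from the example above. The helper build_async_body and the second prompt are illustrative.

```python
def build_async_body(prompt: str, max_new_tokens: int = 1000) -> dict:
    """Wrap a TGI /generate payload in the async request envelope."""
    return {
        "method": "POST",
        "path": "/generate",
        "data": {  # payload forwarded to the actual service
            "inputs": prompt,
            "parameters": {"max_new_tokens": max_new_tokens},
        },
    }


prompts = ["What is Deep Learning?", "What is a GPU?"]
bodies = [build_async_body(p) for p in prompts]
print(len(bodies), bodies[0]["data"]["inputs"])
```

Each body can then be posted to {base_url}/async as in the example above, collecting the returned IDs for later polling.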

Then, periodically poll for the result of the request.

import time

while True:
    # Fetch the current state of the asynchronous request.
    r = requests.get(
        f"{base_url}/async/{request_id}/output",
        headers={"Authorization": f"Bearer {token}"}
    )
    assert r.status_code == 200
    resp = r.json()

    status = resp["status"]
    if status == "pending":
        print("Request is waiting in the queue.")
    elif status == "in_progress":
        print("Request is being processed.")
    elif status == "completed":
        print(f"Request complete! (status code: {resp['status_code']})")
        print()
        print("Output (JSON):")
        print(resp["output"])
        print()
        print("Output (raw):")
        print(resp["raw_output"])
        break
    elif status == "failed":
        print(f"Request failed! (status code: {resp['status_code']})")
        print()
        print("Reason:")
        print(resp["fail_reason"])
        print()
        print("Response body:")
        print(resp["raw_output"])
        break

    # Wait a moment before polling again.
    time.sleep(1)
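
The loop above polls until the request finishes, however long that takes. In practice you may want to give up after some deadline. The following sketch shows one way to do that, assuming the same status values. The helper wait_for_result, its defaults, and the canned responses are hypothetical, and the HTTP call is passed in as a callable.

```python
import time


def wait_for_result(fetch, timeout=300.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until status is 'completed' or 'failed', or the deadline passes."""
    deadline = clock() + timeout
    while True:
        resp = fetch()
        if resp["status"] in ("completed", "failed"):
            return resp
        if clock() >= deadline:
            raise TimeoutError("async request did not finish in time")
        sleep(interval)


# Canned responses standing in for GET {base_url}/async/{request_id}/output.
canned = iter([
    {"status": "pending"},
    {"status": "in_progress"},
    {"status": "completed", "status_code": 200, "output": {"generated_text": "..."}},
])
final = wait_for_result(lambda: next(canned), sleep=lambda seconds: None)
print(final["status"])
```

In real use, fetch would perform the requests.get call from the polling example and return r.json().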