This example deploys a text generation API using Llama3-8B and vLLM. It illustrates how VESSL AI handles the common steps of model deployment, from launching a GPU-accelerated service workload to establishing an API server.
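At its core, the service wraps vLLM's generation engine around the Llama3-8B weights. A minimal sketch of that generation step, outside of any serving infrastructure, looks roughly like this (the model ID and sampling values are illustrative):

```python
# Minimal vLLM text generation with Llama3-8B (illustrative values).
from vllm import LLM, SamplingParams

# Loading the model requires a GPU with enough memory and access to the weights.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Explain what vLLM does in one paragraph."], params)
print(outputs[0].outputs[0].text)
```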

Upon deployment, VESSL AI also takes on the challenges of managing production models while ensuring availability, scalability, and reliability:

  • Autoscaling the model to handle peak loads and scaling to zero when it’s not in use.
  • Routing traffic efficiently across different model versions.
  • Providing real-time monitoring of predictions and performance metrics through comprehensive dashboards and logs.

Read our announcement post for more details.

What you will do

  • Register a fine-tuned model to a model registry
  • Define a text generation API and create a model endpoint
  • Define service specifications
  • Deploy the model to VESSL AI’s managed GPU cloud

Registering model
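Assuming the fine-tuned checkpoint is available locally or as an artifact from a training run, registering it to the model registry might look like the sketch below. The SDK calls and arguments shown (`vessl.create_model_repository`, `vessl.create_model`) are assumptions, not exact signatures; check the VESSL SDK and CLI reference for the current interface.

```python
# Hypothetical sketch: registering a fine-tuned checkpoint to the VESSL model registry.
# Function names and arguments below are assumptions, not the exact SDK signatures.
import vessl

vessl.configure()  # authenticate with your VESSL organization

# Create (or reuse) a repository for the model family, then register a new version
# from the local checkpoint directory produced by fine-tuning.
vessl.create_model_repository(name="llama3-8b-finetuned")
vessl.create_model(
    repository_name="llama3-8b-finetuned",
    model_name="v1",
    paths=["./checkpoints/llama3-8b-finetuned"],
)
```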

Defining the API
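One way to define the text generation API is a small FastAPI app that wraps vLLM and exposes a single generation route. The route path, request schema, and port below are illustrative choices, not a fixed VESSL requirement:

```python
# api.py -- minimal text generation API wrapping vLLM (illustrative route and schema).
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # loaded once at startup


class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


@app.post("/generate")
def generate(request: GenerationRequest) -> dict:
    params = SamplingParams(temperature=request.temperature, max_tokens=request.max_tokens)
    outputs = llm.generate([request.prompt], params)
    return {"text": outputs[0].outputs[0].text}
```

Locally this can be served with `uvicorn api:app --host 0.0.0.0 --port 8000`. If you prefer not to write the route yourself, vLLM also ships an OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server --model ...`).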

Writing the service spec
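The service spec is a YAML file describing the container image, GPU resources, ports, run command, and scaling behavior. The sketch below captures the general shape; the exact keys (e.g. `resources.preset`, `autoscaling`) are illustrative and should be matched against the current VESSL service schema:

```yaml
# service.yaml -- illustrative service spec; key names are assumptions, not the exact schema.
name: llama3-8b-text-generation
image: nvcr.io/nvidia/pytorch:23.10-py3
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
ports:
  - 8000
run:
  - command: |
      pip install vllm fastapi uvicorn
      uvicorn api:app --host 0.0.0.0 --port 8000
autoscaling:
  min: 0          # scale to zero when idle
  max: 2          # scale out under peak load
  metric: nvidia.com/gpu
  target: 60
```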

Deploying the model
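With the model registered and the spec written, deployment is typically a single CLI call against that file. The exact subcommand name below is an assumption; check `vessl --help` for the serve/deploy command in your CLI version.

```sh
# Illustrative deployment command; the subcommand may differ in your CLI version.
vessl serve create -f service.yaml
```

Once the endpoint is live, requests hit the `/generate` route defined above, and the dashboards and logs mentioned earlier cover monitoring of predictions and performance.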

What’s next?

We shared how you can use VESSL AI to go from a simple Python container to a full-scale AI deployment. We hope these guides give you a glimpse of what you can achieve with VESSL AI. For more resources, follow along with our example models or use cases.