This document is a quickstart guide to VESSL Serve. It covers managing revisions and the gateway using YAML manifests.

1. Prepare a Model to Serve

Prepare the model and service for deployment. In this document, we will use the MNIST example where you can train a model and register it to the VESSL Model Registry.

Use the following command in the CLI to proceed:

# Clone the example repository
git clone git@github.com:vessl-ai/examples.git
cd examples/mnist/pytorch

# Train the model and save the checkpoint to ./output
pip install -r requirements.txt
python main.py --output-path ./output --save-model

# Register the trained model in the mnist-example model repository
python model.py --checkpoint ./output/model.pt --model-repository mnist-example

For more detailed information about the VESSL Model Registry, please refer to the Model Registry section.

2. Create a Serving Instance

Create a serving instance for deployment. Navigate to the ‘Serving’ section in the VESSL Web Console, click the ‘New Serving’ button, and create a serving named mnist-example.

  1. Write a manifest file for the serving revision

Create a new serving revision. Save the following content as a file named serve-revision.yaml:

message: VESSL Serve example
image: quay.io/vessl-ai/kernels:py38-202308150329
resources:
  name: v1.cpu-2.mem-6
run: vessl model serve mnist-example 1 --install-reqs
autoscaling:
  min: 1
  max: 3
  metric: cpu
  target: 60
ports:
  - port: 8000
    name: fastapi
    type: http

You can easily deploy the revision defined in YAML using the VESSL CLI as shown below:

vessl serve revision create --serving mnist-example -f serve-revision.yaml

Refer to the YAML schema reference for detailed information on the YAML manifest schema.

Ensure that you specify a container image with the same Python version as used during model creation. For instance, if you trained the model with Python 3.8, it’s recommended to use an image containing Python 3.8, such as quay.io/vessl-ai/kernels:py38-202308150329.
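
The version-matching advice above can be checked before deploying. The following is a minimal sketch, assuming the image tag encodes the Python version as py<major><minor>, the way the example tag py38-202308150329 does. The helper names image_python_version and image_matches_python are hypothetical and are not part of the VESSL CLI or SDK.

```python
import re
import sys


def image_python_version(image):
    """Extract (major, minor) from a tag such as 'kernels:py38-202308150329'.

    Assumes the 'py<major><minor>' convention seen in the example tag;
    returns None when the tag does not follow it.
    """
    tag = image.rsplit(":", 1)[-1]
    match = re.match(r"py(\d)(\d+)", tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def image_matches_python(image, version=None):
    """True when the image tag's Python version equals the given (major, minor)."""
    if version is None:
        # Default to the interpreter running this script,
        # e.g. the one used to train the model.
        version = sys.version_info[:2]
    return image_python_version(image) == tuple(version)


image = "quay.io/vessl-ai/kernels:py38-202308150329"
print(image_python_version(image))          # (3, 8)
print(image_matches_python(image, (3, 8)))  # True
```

Run it in the environment where the model was trained and omit the second argument to compare against the current interpreter.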

3. Create an Endpoint

To perform inference with the created revision, you need to expose it to the external network. In VESSL Serve, the Gateway (Endpoint) determines how traffic is routed and to which revision and port it is distributed.

First, create a YAML file that defines the Gateway. Save the following content as a file named serve-gateway.yaml:

enabled: true
targets:
  - number: 1   # Use the revision number you got in the previous step
    port: 8000
    weight: 100

The Gateway can be easily deployed using the VESSL CLI, as shown below:

vessl serve gateway update --serving mnist-example -f serve-gateway.yaml

To check the status of the deployed Gateway, use the vessl serve gateway show command.

vessl serve gateway show --serving mnist-example

The output looks like the following:

 Enabled True
 Status success
 Endpoint model-service-gateway-xyzyxyxx.managed-cluster-apne2.vessl.ai
 Ingress Class nginx
 Annotations (empty)
 Traffic Targets
 - ########## 100%:  1 (port 8000)
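
Once the Gateway reports a success status, the endpoint hostname can be called over HTTP. The following is a minimal sketch of building such a request with the Python standard library. The https scheme, the /predict path, and the payload layout are all assumptions made for illustration; the actual route and input format depend on the serving code registered with the model.

```python
import json
import urllib.request


def build_inference_request(endpoint, payload, path="/"):
    """Build (but do not send) a JSON POST request for a deployed endpoint.

    The scheme, path, and payload layout are assumptions: use whatever
    routes your serving code actually exposes.
    """
    url = f"https://{endpoint}{path}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Endpoint copied from the sample status output above; yours will differ.
endpoint = "model-service-gateway-xyzyxyxx.managed-cluster-apne2.vessl.ai"
# Hypothetical payload: one flattened 28x28 MNIST image of zeros.
request = build_inference_request(endpoint, {"inputs": [[0.0] * 784]}, path="/predict")
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a live endpoint:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read().decode("utf-8"))
```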

4. Dividing Traffic Among Multiple Revisions

To deploy a new version of the model without interrupting the service, deploy the new version first and then shift traffic to it gradually.

In VESSL Serve, the Gateway (Endpoint) provides the capability to distribute traffic across multiple Revisions.

Begin by defining the new Revision in serve-revision.yaml and deploying it.

message: Revision v2
image: quay.io/vessl-ai/kernels:py38-202308150329
resources:
  name: v1.cpu-2.mem-6
run: vessl model serve mnist-example 2 --install-reqs # New model version
autoscaling:
  min: 1
  max: 3
  metric: cpu
  target: 60
ports:
  - port: 8000
    name: fastapi
    type: http

vessl serve revision create --serving mnist-example -f serve-revision.yaml

Successfully created revision in serving mnist-example.

 Number 2
 Status pending
 Message Revision v2

Next, modify serve-gateway.yaml to send part of the traffic to the new Revision.

enabled: true
targets:
  - number: 1
    port: 8000
    weight: 90
  - number: 2
    port: 8000
    weight: 10

Update the Gateway configuration with the settings above:

vessl serve gateway update --serving mnist-example -f serve-gateway.yaml

Running this command displays the Gateway’s status, including the distribution of traffic across the specified Revisions.

Successfully update gateway of serving mnist-example.

 Enabled True
 Status success
 Endpoint model-service-gateway-xyzyxyxx.managed-cluster-apne2.vessl.ai
 Ingress Class nginx
 Annotations (empty)
 Traffic Targets
 - #########  90 %:  1 (port 8000)
 - #          10 %:  2 (port 8000)
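
To build intuition for what the weights mean, the sketch below simulates requests being assigned to revisions in proportion to their weight. It only illustrates the proportions; it is not how the Gateway routes traffic internally, and the function name simulate_split is hypothetical.

```python
import random
from collections import Counter


def simulate_split(targets, requests=10_000, seed=0):
    """Pick a revision for each simulated request in proportion to its weight."""
    rng = random.Random(seed)
    numbers = [t["number"] for t in targets]
    weights = [t["weight"] for t in targets]
    picks = rng.choices(numbers, weights=weights, k=requests)
    return Counter(picks)


# Same targets as in serve-gateway.yaml above.
targets = [
    {"number": 1, "port": 8000, "weight": 90},
    {"number": 2, "port": 8000, "weight": 10},
]

counts = simulate_split(targets)
for number, count in sorted(counts.items()):
    print(f"revision {number}: {count / 10_000:.1%} of requests")
```

With these weights, roughly one request in ten lands on revision 2, which keeps the impact small if the new model version misbehaves.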

5. Helpful Tips for Using VESSL Serve

Simultaneously Update Revisions and Endpoint Configurations

After defining a Revision using YAML, you can create the revision and launch the gateway simultaneously by providing parameters directly in the CLI. Here’s an example of the CLI command:

vessl serve revision create --serving serve-example -f serve-example.yaml \
  --update-gateway --enable-gateway-if-off --update-gateway-port 8000 --update-gateway-weight 100

With the --update-gateway option, you can update the gateway (endpoint) at the same time as you create a revision. The following options can be used with it:

  • --enable-gateway-if-off: This option changes the gateway’s status to “enabled” if it’s currently disabled.
  • --update-gateway-port: Specify the port to be used by the newly created revision. This should be used in conjunction with --update-gateway-weight below.
  • --update-gateway-weight: Define how much traffic should be distributed to the newly created revision. This should be used alongside the --update-gateway-port option mentioned above.
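
When scripting deployments, the command can be assembled programmatically. The following is a minimal sketch that uses only the flags listed above and enforces the rule that the port and weight options go together. The function name revision_create_command is hypothetical, and the sketch only builds the argument list without calling the CLI.

```python
import shlex


def revision_create_command(serving, manifest, port=None, weight=None, enable_if_off=True):
    """Assemble the 'vessl serve revision create' argv using only the flags listed above."""
    if (port is None) != (weight is None):
        raise ValueError(
            "--update-gateway-port and --update-gateway-weight must be given together"
        )
    cmd = ["vessl", "serve", "revision", "create", "--serving", serving, "-f", manifest]
    if port is not None:
        cmd.append("--update-gateway")
        if enable_if_off:
            cmd.append("--enable-gateway-if-off")
        cmd += ["--update-gateway-port", str(port), "--update-gateway-weight", str(weight)]
    return cmd


cmd = revision_create_command("mnist-example", "serve-revision.yaml", port=8000, weight=100)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```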

Troubleshooting

  • “NotFound (404): Requested entity not found.” error when creating Revisions or Gateways via the CLI:
    • Use the vessl whoami command to confirm that the default organization is the one where the Serving exists.
    • You can use the vessl configure --reset command to change the default organization.
    • Ensure that the Serving has been created within the selected default organization.
  • What’s the difference between Gateway and Endpoint?
    • There is no difference between the two terms; they refer to the same concept.
    • To prevent confusion, these terms will be unified under “Endpoint” in the future.
  • HPA Scale-in/Scale-out Approach:
    • VESSL Serve currently relies on the Kubernetes Horizontal Pod Autoscaler (HPA) and uses its algorithm as is. For detailed information, refer to the Kubernetes documentation.
    • As an example of how it works based on CPU metrics:
      • Desired replicas = ceil[current replicas * ( current CPU metric value / desired CPU metric value )]
      • The HPA constantly monitors this metric and adjusts the current replicas within the [min, max] range.
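
The formula above can be worked through with the autoscaling values from the example manifest (min 1, max 3, CPU target 60). This is a simplified sketch: it applies only the quoted formula and the [min, max] clamp, and leaves out other HPA behavior such as tolerance and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu, min_replicas, max_replicas):
    """Apply the HPA formula quoted above, then clamp to the [min, max] range."""
    raw = math.ceil(current_replicas * (current_cpu / target_cpu))
    return max(min_replicas, min(max_replicas, raw))


# Values from the manifest above: min 1, max 3, CPU target 60 (%).
print(desired_replicas(2, 90, 60, 1, 3))  # 3: ceil(2 * 1.5) = 3
print(desired_replicas(3, 90, 60, 1, 3))  # 3: ceil(4.5) = 5, capped at max
print(desired_replicas(2, 20, 60, 1, 3))  # 1: ceil(0.67) = 1
```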