> ## Documentation Index
> Fetch the complete documentation index at: https://docs.vessl.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Phi-4-mini-reasoning deployment

> Serve & deploy vLLM-accelerated Phi-4-mini-reasoning

This example deploys a text generation API using [Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) and [vLLM](https://github.com/vllm-project/vllm). It illustrates how VESSL AI facilitates the common logics of model deployment from launching a GPU-accelerated service workload to establishing an API server.

Upon deployment, VESSL also offloads the challenges in managing production models while ensuring availability, scalability, and reliability.

VESSL guides you to **smooth** and **seamless performance** with the following items:

* Autoscaling the model to handle peak loads and scale to zero when it's not being used.
* Routing traffic efficiently across different model versions.
* Providing a real-time monitoring of predictions and performance metrics through comprehensive dashboards and logs.

Read our [announcement post](https://blog.vessl.ai/en/posts/vessl-serve) for more details.

<CardGroup cols={1}>
  <Card icon="book" title="YAML definition" href="https://docs.vessl.ai/reference/yaml/serve-yaml">
    See the completed YAML definition for VESSL Service.
  </Card>
</CardGroup>

## What you will do

<img style={{ borderRadius: '0.5rem' }} src="https://mintcdn.com/vesslai/q8PNhBb-7_q5awBv/images/get-started/deployment-title.png?fit=max&auto=format&n=q8PNhBb-7_q5awBv&q=85&s=832f404299c585ca56f199090b8e0649" width="3204" height="2404" data-path="images/get-started/deployment-title.png" />

* Define a text generation API and create a model endpoint
* Define service specifications
* Deploy model to VESSL managed GPU cloud

## Set up your environment

We'll start with the [Phi-4-mini-reasoning example](https://github.com/vessl-ai/examples/tree/main/services/service-quickstart), which demonstrates how to deploy an AI service using a single YAML file. Follow these steps to prepare:

```sh theme={null}
# Clone the example repository
git clone https://github.com/vessl-ai/examples.git

## Install and configure vessl
pip install vessl
vessl configure
```

## Deploy a vLLM Phi-4-mini-reasoning Server with VESSL Service

Configure resource and environment to run vLLM Phi-4-mini-reasoning server through YAML file as follows.

```yaml theme={null}
# quickstart.yaml
message: Quickstart to serve Phi-4-mini-reasoning model with vllm.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-oci-sanjose
  preset: gpu-a10-small
run: |-
  apt update && apt install -y libgl1
  pip install --upgrade vllm accelerate https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl --no-build-isolation
  vllm serve $MODEL_NAME --max-model-len 32768
env:
  MODEL_NAME: microsoft/Phi-4-mini-reasoning
ports:
- port: 8000
service:
  autoscaling:
    max: 2
    metric: cpu
    min: 1
    target: 50
  monitoring:
    - port: 8000
      path: /metrics
  expose: 8000
```

For YAML manifest details, refer to the [YAML schema reference](/reference/yaml/serve-yaml).

Deploy your server easily using the YAML configuration and VESSL CLI with the following command:

```sh theme={null}
cd examples/services/service-quickstart
vessl service create -f quickstart.yaml
```

<img style={{ borderRadius: '0.5rem' }} src="https://mintcdn.com/vesslai/q8PNhBb-7_q5awBv/images/get-started/deployment-warp.png?fit=max&auto=format&n=q8PNhBb-7_q5awBv&q=85&s=24e4cc2c06e9906625ededb796f25263" width="1208" height="344" data-path="images/get-started/deployment-warp.png" />

Upon activation, access your model via the provided endpoint, as depicted below:

<img style={{ borderRadius: '0.5rem' }} src="https://mintcdn.com/vesslai/q8PNhBb-7_q5awBv/images/get-started/deployment-endpoint.png?fit=max&auto=format&n=q8PNhBb-7_q5awBv&q=85&s=34b04d94b7c5ac4c441ac43d9260278b" width="1157" height="170" data-path="images/get-started/deployment-endpoint.png" />

<Note>Due to compatibility issues between Python and VESSL CLI, executing the command (`vessl service create -f quickstart.yaml`) may temporarily result in unexpected errors. **<u>If this occurs, please use VESSL CLI with Python 3.12 for the time being.</u>** We are working on it.</Note>

## Explore the API Documentation

Access the API documentation by appending `/docs` to your endpoint URL:

<img style={{ borderRadius: '0.5rem' }} src="https://mintcdn.com/vesslai/U31B4TuGi_otqUlh/images/service/quickstart/3_fastapi.png?fit=max&auto=format&n=U31B4TuGi_otqUlh&q=85&s=6abeabd0b8daf97eccf9d9502602839c" width="1281" height="503" data-path="images/service/quickstart/3_fastapi.png" />

## Test the API with an OpenAI Client

For compatibility with OpenAI clients, install the OpenAI Python package:

```python theme={null}
pip install openai
```

Test your deployed API using the `api-test.py` script. Replace `YOUR-SERVICE-ENDPOINT` with your actual endpoint and execute the command below:

```python theme={null}
python api-test.py \
  --base-url "https://{YOUR-SERVICE-ENDPOINT}" \
  --prompt "Can you explain the background concept of LLM?"
```

<img style={{ borderRadius: '0.5rem' }} src="https://mintcdn.com/vesslai/U31B4TuGi_otqUlh/images/service/quickstart/4_response.png?fit=max&auto=format&n=U31B4TuGi_otqUlh&q=85&s=cf7467ed498c43d50808b9e1f6919377" width="1278" height="1091" data-path="images/service/quickstart/4_response.png" />

<Tip>
  ### Troubleshooting

  * **NotFound (404): Requested entity not found** error while creating Revisions or Gateways via CLI:
    * Use the `vessl whoami` command to confirm if the default organization matches the one where Service exists.
    * You can use the `vessl configure --reset` command to change the default organization.
    * Ensure that Service is properly created within the selected default organization.

  * **What's the difference between Gateway and Endpoint?**
    * There is no difference between the two terms; they refer to the same concept.
    * To prevent confusion, these terms will be unified under "Endpoint" in the future.

  * **HPA Scale-in/Scale-out Approach:**
    * Currently, VESSL Service operates based on Kubernetes' Horizontal Pod Autoscaler (HPA) and uses its algorithms as is. For detailed information, refer to the [Kubernetes documentation](https://kubernetes.io/ko/docs/tasks/run-application/horizontal-pod-autoscale/).
    * As an example of how it works based on CPU metrics:
      * Desired replicas = `ceil[current replicas * ( current CPU metric value / desired CPU metric value )]`
      * The HPA constantly monitors this metric and adjusts the current replicas within the `[min, max]` range.
</Tip>

## What's next?

Next, let's see how you can serve your model with serverless mode with Text Generation Inference(TGI).

<CardGroup cols={1}>
  <Card icon="bolt-lightning" title="Enable Serverless Mode" href="serverless-deployment">
    Deploy with VESSL Service Serverless mode
  </Card>
</CardGroup>
