On-premises clusters
In the background, VESSL Clusters leverages GPU-accelerated Docker containers and Kubernetes pods. It abstracts the complex compute backends and system details of Kubernetes-backed GPU infrastructure into an easy-to-use web interface and simple CLI commands. Data Scientists and Machine Learning Researchers without any software or DevOps backgrounds can use VESSL’s single-line CURL command to set up and configure on-premise GPU servers for ML.
VESSL’s cluster integration is composed of four primitives.
- VESSL API Server — Enables communication between the user and the GPU clusters, through which users can launch containerized ML workloads.
- VESSL Cluster Agent — Sends information about the clusters and workloads running on the cluster such as the node specifications and model metrics.
- Control plane node — Acts as the 🔗 cluster-wide control tower and orchestrates subsidiary worker nodes.
- Worker nodes — Run specified ML workloads based on the runtime spec and environment received from the control plane node.
Integrating more powerful, multi-node GPU clusters for your team is as easy as integrating your personal laptop. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server.
Step-by-step Guide
(1) Prerequisites
Note that Ubuntu 18.04 or CentOS 7.9 or higher Linux OS is installed on your server.
Install dependencies
You can install all the dependencies required for cluster integration using a single-line curl
command. The command
- Installs 🔗 Docker if it’s not already installed.
- Installs and configures 🔗 NVIDIA container runtime.
- Installs 🔗 k0s, a lightweight Kubernetes distribution, and designates and configures a control plane node.
- Generates a token and a command for connecting worker nodes to the control plane node configured above.
If you wish to use your control plane solely for the control plane node — meaning not running any ML workloads on the control plane node and only using it for admin and monitoring purposes — add a --taint-controller
flag at the end of the command.
Upon installing all the dependencies, the command returns a follow-up command with a token. You can use this to add worker nodes to the control plane. If you don’t want to add an additional worker node you can skip to the next step.
You can confirm that your control plane and worker node have been successfully configured using a k0s
command.
(2) VESSL integration
You are now ready to integrate the Kubernetes cluster with VESSL. Make sure you have VESSL Client installed on the server and configured for your organization.
The following single-line command connects your Kubernetes-backed GPU cluster to VESSL.
Follow through prompting your configurtaion options. You can press Enter
to use the default values.
By this point, you have successfully completed the integration.
(3) Confirm integration
You can use VESSL CLI command or visit 🗂️ Clusters to confirm your integration.
Common troubleshooting commands
Here are common problems that our users face as they integrate on-premise Clusters. You can use the journalctl
command to get a more detailed log of the issue. Please share this log as you reach out for support.
VesslApiException: PermissionDenied (403): Permission denied.
It’s likely that you don’t have full access to install VESSL cluster agent on the server. Contact your organization’s cluster and infrastructure administrator to receive help.
VesslApiException: NotFound (404) Requested entity not found.
Try again after running the following command:
Invalid value: “k0s-ctrl-[HOSTNAME]”
There is an ongoing 🔗 issue related to Kubernetes hostname containing capital letters. Your hostname must be in lowercase alphanumeric characters.
You can solve this issue by contacting your organization’s cluster and infrastructure administrator to change your hostname, or by changing your hostname yourself using the following sudo
command:
Troubleshooting
If you’re experiencing issues with your on-premises cluster, or can’t figure out what’s causing them, try VESSL Flare.