
On-premise Clusters

Scale up your ML infra with on-premise GPUs



Overview

In the background, VESSL Clusters leverages GPU-accelerated Docker containers and Kubernetes pods. It abstracts the complex compute backends and system details of Kubernetes-backed GPU infrastructure into an easy-to-use web interface and simple CLI commands. Data scientists and machine learning researchers without a software or DevOps background can use VESSL's single-line curl command to set up and configure on-premise GPU servers for ML.

VESSL’s cluster integration is composed of four primitives.

  • VESSL API Server — Enables communication between the user and the GPU clusters, through which users can launch containerized ML workloads.

  • VESSL Cluster Agent — Sends information about the cluster and the workloads running on it, such as node specifications and model metrics.

  • Control plane node — Acts as the cluster-wide control tower and orchestrates subsidiary worker nodes.

  • Worker nodes — Run specified ML workloads based on the runtime spec and environment received from the control plane node.

Integrating more powerful, multi-node GPU clusters for your team is as easy as integrating your personal laptop. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server.

Step-by-step Guide

(1) Prerequisites

Make sure that your server is running Ubuntu 18.04 or later, or CentOS 7.9 or later.
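If you're unsure what your server is running, a couple of standard Linux commands will tell you before you start. These are generic checks, not part of the VESSL installer:

# Confirm the OS release (Ubuntu 18.04+ or CentOS 7.9+ expected)
cat /etc/os-release

# On GPU nodes, confirm the NVIDIA driver is installed and the GPUs are visible
nvidia-smi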

Install dependencies

You can install all the dependencies required for cluster integration using a single-line curl command. The command:

  • Installs Docker if it’s not already installed.

  • Installs and configures the NVIDIA container runtime.

  • Installs k0s, a lightweight Kubernetes distribution, and designates and configures a control plane node.

  • Generates a token and a command for connecting worker nodes to the control plane node configured above.

If you wish to use your control plane node solely for admin and monitoring purposes (that is, not running any ML workloads on it), add a --taint-controller flag at the end of the command.

curl -sSLf https://install.vessl.ai/bootstrap-cluster/bootstrap-cluster.sh | sudo bash -s -- --role=controller

Once all the dependencies are installed, the command returns a follow-up command with a token, which you can run on other machines to add them as worker nodes to the control plane. If you don't want to add additional worker nodes, skip to the next step.

curl -sSLf https://install.vessl.ai/bootstrap-cluster/bootstrap-cluster.sh | sudo bash -s -- --role worker --token '[TOKEN_HERE]'

You can confirm that your control plane and worker nodes have been successfully configured using the following k0s command.

sudo k0s kubectl get nodes
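A healthy setup lists one entry per machine with a Ready status. The output below is only an illustrative example (hostnames, ages, and versions will differ on your cluster):

NAME             STATUS   ROLES           AGE   VERSION
gpu-control-01   Ready    control-plane   5m    v1.27.4+k0s
gpu-worker-01    Ready    <none>          2m    v1.27.4+k0s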

(2) VESSL integration

You are now ready to integrate the Kubernetes cluster with VESSL. Make sure you have the VESSL client installed on the server and configured for your organization.

pip install vessl --upgrade
vessl configure
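If you want to confirm which version of the client is installed before proceeding, a plain pip query is enough (a generic pip command, not part of the VESSL CLI):

pip show vessl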

The following single-line command connects your Kubernetes-backed GPU cluster to VESSL.

vessl cluster create

Follow the prompts to set your configuration options. You can press Enter to use the default values.

At this point, you have successfully completed the integration.

(3) Confirm integration

You can use the VESSL CLI command below or visit 🗂️ Clusters to confirm your integration.

vessl cluster list
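If you also want to verify from the Kubernetes side, you can check that the VESSL agent workloads have started. The command below assumes the agent runs in the vessl namespace, matching the helm command in the troubleshooting commands further down this page:

# The VESSL cluster agent pods should reach the Running state
sudo k0s kubectl get pods -n vessl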

Common troubleshooting commands

Here are common problems that our users face as they integrate on-premise clusters. You can use the journalctl command to get a more detailed log of the issue. Please share this log when you reach out for support.

sudo journalctl -u k0scontroller | tail -n 20
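On worker nodes the relevant systemd unit is typically k0sworker rather than k0scontroller, and following the logs live while you reproduce the issue is often more useful than tailing them:

# Recent logs on a worker node
sudo journalctl -u k0sworker | tail -n 20

# Stream logs in real time (replace the unit name on worker nodes)
sudo journalctl -u k0scontroller -f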

VesslApiException: PermissionDenied (403): Permission denied.

kernel_cluster.py:111] VESSL cluster agent installed. Waiting for the agent to be connected with VESSL...
_base.py:107] VesslApiException: PermissionDenied (403): Permission denied.

It's likely that you don't have sufficient permissions to install the VESSL cluster agent on the server. Contact your organization's cluster and infrastructure administrator for help.

VesslApiException: NotFound (404) Requested entity not found.

kernel_cluster.py:289] Existing VESSL cluster installation found! getting cluster information...
_base.py:107] VesslApiException: NotFound (404) Requested entity not found.

Try again after running the following command:

sudo helm uninstall vessl -n vessl --kubeconfig="/var/lib/k0s/pki/admin.conf"
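To confirm the release is actually gone before retrying vessl cluster create, you can list the remaining Helm releases in the same namespace (same namespace and kubeconfig path assumptions as the command above):

sudo helm list -n vessl --kubeconfig="/var/lib/k0s/pki/admin.conf"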

Invalid value: "k0s-ctrl-[HOSTNAME]"

leaderelection.go:334] error initially creating leader election record: Lease.coordination.k8s.io "k0s-ctrl-[HOSTNAME]" is invalid: metadata.name: Invalid value: "k0s-ctrl-[HOSTNAME]": a lowercase RFC 1123 subdomain must consist of lowercase alphanumeric characters.

You can solve this issue by contacting your organization's cluster and infrastructure administrator to change your hostname, or by changing your hostname yourself using the following sudo command:

Changing your hostname may have unexpected side effects, and might be prohibited depending on your organization's IT policy.

sudo hostname [HOSTNAME]
sudo systemctl restart k0scontroller
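Note that sudo hostname only changes the hostname for the current session on most distributions. If you want the change to persist across reboots, hostnamectl is the usual systemd alternative (shown here with the same [HOSTNAME] placeholder):

sudo hostnamectl set-hostname [HOSTNAME]
sudo systemctl restart k0scontroller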

Troubleshooting

There is an ongoing issue related to Kubernetes hostnames containing capital letters. Please make sure your machine's hostname consists only of lowercase alphanumeric characters.

If you're experiencing issues with your on-premise cluster, or can't figure out what's causing them, try VESSL Flare.