
On-premise Clusters

Scale up your ML infra with on-premise GPUs



Overview

In the background, VESSL Clusters leverages GPU-accelerated Docker containers and Kubernetes pods. It abstracts the complex compute backends and system details of Kubernetes-backed GPU infrastructure into an easy-to-use web interface and simple CLI commands. Data scientists and machine learning researchers without a software or DevOps background can use VESSL's single-line curl command to set up and configure on-premise GPU servers for ML.

VESSL’s cluster integration is composed of four primitives.

  • VESSL API Server — Enables communication between the user and the GPU clusters, through which users can launch containerized ML workloads.

  • VESSL Cluster Agent — Sends information about the cluster and the workloads running on it, such as node specifications and model metrics.

  • Control plane node — Acts as the cluster-wide control tower and orchestrates subsidiary worker nodes.

  • Worker nodes — Run specified ML workloads based on the runtime spec and environment received from the control plane node.

Integrating more powerful, multi-node GPU clusters for your team is as easy as integrating your personal laptop. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server.

Step-by-step Guide

(1) Prerequisites

Make sure that your server is running Ubuntu 18.04 or later, or CentOS 7.9 or later.
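If you're unsure what your server is running, a couple of standard Linux commands will tell you before you start. These are generic checks, not part of the VESSL installer:

# Confirm the OS release (Ubuntu 18.04+ or CentOS 7.9+ expected)
cat /etc/os-release

# On GPU nodes, confirm the NVIDIA driver is installed and the GPUs are visible
nvidia-smi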

Install dependencies

You can install all the dependencies required for cluster integration using a single-line curl command. The command:

  • Installs Docker if it’s not already installed.

  • Installs and configures the NVIDIA container runtime.

  • Installs k0s, a lightweight Kubernetes distribution, and designates and configures a control plane node.

  • Generates a token and a command for connecting worker nodes to the control plane node configured above.

If you wish to use your control plane node solely for admin and monitoring purposes (that is, not running any ML workloads on it), add a --taint-controller flag at the end of the command.

curl -sSLf https://install.vessl.ai/bootstrap-cluster/bootstrap-cluster.sh | sudo bash -s -- --role=controller

Once all the dependencies are installed, the command returns a follow-up command with a token, which you can run on other machines to add them as worker nodes to the control plane. If you don't want to add additional worker nodes, skip to the next step.

curl -sSLf https://install.vessl.ai/bootstrap-cluster/bootstrap-cluster.sh | sudo bash -s -- --role worker --token '[TOKEN_HERE]'

You can confirm that your control plane and worker nodes have been successfully configured using the following k0s command.

sudo k0s kubectl get nodes
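A healthy setup lists one entry per machine with a Ready status. The output below is only an illustrative example (hostnames, ages, and versions will differ on your cluster):

NAME             STATUS   ROLES           AGE   VERSION
gpu-control-01   Ready    control-plane   5m    v1.27.4+k0s
gpu-worker-01    Ready    <none>          2m    v1.27.4+k0s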

(2) VESSL integration

You are now ready to integrate the Kubernetes cluster with VESSL. Make sure you have the VESSL client installed on the server and configured for your organization.

pip install vessl --upgrade
vessl configure
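If you want to confirm which version of the client is installed before proceeding, a plain pip query is enough (a generic pip command, not part of the VESSL CLI):

pip show vessl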

The following single-line command connects your Kubernetes-backed GPU cluster to VESSL.

vessl cluster create

Follow the prompts to set your configuration options. You can press Enter to use the default values.

At this point, you have successfully completed the integration.

(3) Confirm integration

You can use the VESSL CLI command below or visit 🗂️ Clusters to confirm your integration.

vessl cluster list
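If you also want to verify from the Kubernetes side, you can check that the VESSL agent workloads have started. The command below assumes the agent runs in the vessl namespace, matching the helm command in the troubleshooting commands further down this page:

# The VESSL cluster agent pods should reach the Running state
sudo k0s kubectl get pods -n vessl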

Common troubleshooting commands

Here are common problems that our users face as they integrate on-premise clusters. You can use the journalctl command to get a more detailed log of the issue. Please share this log when you reach out for support.

sudo journalctl -u k0scontroller | tail -n 20
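On worker nodes the relevant systemd unit is typically k0sworker rather than k0scontroller, and following the logs live while you reproduce the issue is often more useful than tailing them:

# Recent logs on a worker node
sudo journalctl -u k0sworker | tail -n 20

# Stream logs in real time (replace the unit name on worker nodes)
sudo journalctl -u k0scontroller -f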

VesslApiException: PermissionDenied (403): Permission denied.

kernel_cluster.py:111] VESSL cluster agent installed. Waiting for the agent to be connected with VESSL...
_base.py:107] VesslApiException: PermissionDenied (403): Permission denied.

It's likely that you don't have sufficient permissions to install the VESSL cluster agent on the server. Contact your organization's cluster and infrastructure administrator for help.

VesslApiException: NotFound (404) Requested entity not found.

kernel_cluster.py:289] Existing VESSL cluster installation found! getting cluster information...
_base.py:107] VesslApiException: NotFound (404) Requested entity not found.

Try again after running the following command:

sudo helm uninstall vessl -n vessl --kubeconfig="/var/lib/k0s/pki/admin.conf"
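To confirm the release is actually gone before retrying vessl cluster create, you can list the remaining Helm releases in the same namespace (same namespace and kubeconfig path assumptions as the command above):

sudo helm list -n vessl --kubeconfig="/var/lib/k0s/pki/admin.conf"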

Invalid value: "k0s-ctrl-[HOSTNAME]"

leaderelection.go:334] error initially creating leader election record: Lease.coordination.k8s.io "k0s-ctrl-[HOSTNAME]" is invalid: metadata.name: Invalid value: "k0s-ctrl-[HOSTNAME]": a lowercase RFC 1123 subdomain must consist of lowercase alphanumeric characters.

You can solve this issue by contacting your organization's cluster and infrastructure administrator to change your hostname, or by changing your hostname yourself using the following sudo command:

Changing your hostname may have unexpected side effects, and might be prohibited depending on your organization's IT policy.

sudo hostname [HOSTNAME]
sudo systemctl restart k0scontroller
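Note that sudo hostname only changes the hostname for the current session on most distributions. If you want the change to persist across reboots, hostnamectl is the usual systemd alternative (shown here with the same [HOSTNAME] placeholder):

sudo hostnamectl set-hostname [HOSTNAME]
sudo systemctl restart k0scontroller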

Troubleshooting

There is an ongoing issue related to Kubernetes hostnames containing capital letters. Please make sure your machine's hostname consists only of lowercase alphanumeric characters.

If you're experiencing issues with your on-premise cluster, or can't figure out what's causing them, try VESSL Flare.