
Cluster Monitoring

Monitor cluster usage and status down to each node


Last updated 2 years ago

Overview

VESSL Clusters comes with a built-in cluster dashboard that visualizes cluster usage and status down to each node and workload. This is enabled by the VESSL Cluster Agent, which sends real-time information about the clusters and the workloads running on them, such as node specifications and model metrics.

The dashboard is set up automatically when you integrate your cloud or on-prem servers using the vessl cluster create command.

Users on the Enterprise plan can use VESSL's custom cluster agent to route monitoring information to their own monitoring tools, such as Datadog and Grafana.

Contact us at support@vessl.ai or through our community Slack for more details.

Cluster-level Monitoring

Multi-cluster monitoring of resource usage and ongoing workloads is available under 🗂️ Clusters. Here, you can get an overview of the integrated clusters.

  • Cluster status — Connection and incident status of a cluster.

  • Available nodes — Available number of worker nodes.

  • Real-time resource usage — Real-time resource usage of CPU cores, RAM, and GPUs.

  • Ongoing workloads by type — The number of running notebook servers (Workspaces) and training jobs (Experiments).

Clicking a cluster takes you to its Summary tab, which holds more detailed information about the cluster.

(1) Summary

The Summary section presents basic information about the cluster, including its connection and incident status.

(2) Quotas & Usage

Quotas & Usage shows the organization-wide and personal resource quotas for the cluster, including the number of GPU hours and occupiable GPUs and CPUs. These quotas are set by the organization admin. For more on VESSL Clusters' administration features, refer to the next section of the documentation, Cluster Administration.
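As a rough illustration of how a GPU-hour quota constrains a planned workload, here is a minimal Python sketch. It is not part of the VESSL SDK: the function name and parameters are hypothetical, and it assumes GPU hours accrue as the number of GPUs multiplied by wall-clock hours, which this page does not specify.

```python
def fits_gpu_hour_quota(gpus: int, hours: float,
                        used_gpu_hours: float, quota_gpu_hours: float) -> bool:
    """Return True if a planned workload stays within a GPU-hour quota.

    Hypothetical helper. Assumes GPU hours = (number of GPUs) x (hours run);
    the actual accounting used by VESSL may differ.
    """
    planned_gpu_hours = gpus * hours
    return used_gpu_hours + planned_gpu_hours <= quota_gpu_hours


# Example: 85 of 100 GPU hours already used, planning 2 GPUs for 10 hours.
print(fits_gpu_hour_quota(gpus=2, hours=10, used_gpu_hours=85, quota_gpu_hours=100))
```

The actual quota values and current usage are the ones shown in the Quotas & Usage section of the dashboard.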

(3) Resource Statistics

This section shows how much CPU, GPU, and memory has been requested (and allocated) and how much is currently being used.

Note that when you use VESSL Workspaces (notebook servers), you may be occupying a node without actively using its resources: the resources are in active use only while a cell is running.
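The gap between requested and actively used resources can be expressed as a simple ratio. The sketch below is illustrative only: the ResourceStat class and its fields are hypothetical, not a VESSL API, and the numbers are made up to mirror an idle notebook server holding GPUs.

```python
from dataclasses import dataclass


@dataclass
class ResourceStat:
    """Hypothetical snapshot of one resource type (e.g. GPUs) on a cluster."""
    requested: float  # amount reserved by (allocated to) workloads
    used: float       # amount actively in use right now

    def idle_fraction(self) -> float:
        """Share of the requested amount that is reserved but not in use."""
        if self.requested == 0:
            return 0.0
        return max(self.requested - self.used, 0.0) / self.requested


# Example: a workspace holds 4 GPUs, but only 1 is busy running a cell.
gpu = ResourceStat(requested=4, used=1)
print(f"{gpu.idle_fraction():.0%} of the requested GPUs are idle")
```

A high idle fraction on a workspace usually means the node is occupied but no cell is running.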

(4) Workloads

This section shows all ongoing workloads on the cluster, with information on the occupying node, resource consumption, creator, and creation date. If you are an organization admin, clicking a workload name takes you to the detailed workload page under 🗂️ Projects or 🗂️ Workspaces.

Node-level Monitoring

Under Nodes, you can view all the worker nodes tied to the cluster with their real-time CPU, memory, and GPU usage, ongoing workloads by type, and incident status. Select a node's checkbox to get more in-depth information.

(1) Metadata

(2) System metrics

(3) Workloads

(4) Issues

Workload-level Monitoring

Under Workloads, you can view a log of the workloads related to the cluster, with each workload's current status, occupying node, resource consumption, and a visualization of its usage history.

If you are on the Enterprise plan and wish to send the cluster information collected by the VESSL Cluster Agent to your central infrastructure monitoring tool, such as Datadog or Grafana, contact us at support@vessl.ai.
