Distributed Experiments

Early access feature


Currently, only the PyTorch framework is supported for distributed experiments.

What is a distributed experiment?

A distributed experiment is a single machine learning run across multiple nodes or multiple GPUs. The results of a distributed experiment consist of logs, metrics, and artifacts for each worker, which you can find under the corresponding tabs.


Environment variables

VESSL automatically sets the environment variables below based on the configuration.

NUM_NODES: Number of workers

NUM_TRAINERS: Number of GPUs per node

RANK: The global rank of the node

MASTER_ADDR: The address of the master node service

MASTER_PORT: The port number on the master address
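For instance, a training script can use these variables to derive each GPU process's global rank and the world size before initializing the process group (a minimal sketch; `derive_ranks` and the `local_rank` parameter are illustrative helpers, not part of VESSL):

```python
import os

def derive_ranks(local_rank, env=os.environ):
    """Illustrative helper: compute the global rank and world size
    of one GPU process from the variables VESSL sets."""
    num_nodes = int(env["NUM_NODES"])        # number of workers
    num_trainers = int(env["NUM_TRAINERS"])  # GPUs per node
    node_rank = int(env["RANK"])             # global rank of this node

    world_size = num_nodes * num_trainers
    global_rank = node_rank * num_trainers + local_rank
    return global_rank, world_size

# The result can then be passed to, e.g.:
# torch.distributed.init_process_group(
#     backend="nccl", init_method="env://",
#     rank=global_rank, world_size=world_size)
```

With `init_method="env://"`, PyTorch reads `MASTER_ADDR` and `MASTER_PORT` from the environment, which VESSL has already set.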

Creating a distributed experiment

Using Web Console

Running a distributed experiment on the web console is similar to a single node experiment. To create a distributed experiment, you only need to specify the number of workers. Other options are the same as those of a single node experiment.

Creating an Experiment

Using CLI

To run a distributed experiment using CLI, the number of nodes must be set to an integer greater than one.

Examples: Distributed CIFAR

You can find the full example code here.

Step 1: Prepare CIFAR-10 dataset

Download the CIFAR dataset with the script below, then add a vessl type dataset to your organization.

Adding New Datasets

Or, you can simply add an AWS S3 type dataset to your organization with the following public bucket URI.

Step 2: Create a distributed experiment

To run a distributed experiment, we recommend using the torch.distributed.launch package. The following example start command runs on two nodes with one GPU per node.
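As a sketch, such a start command might look like the following (the entry-point script name `main.py` is hypothetical; the flags are the standard torch.distributed.launch arguments, filled from the environment variables VESSL sets):

```shell
python -m torch.distributed.launch \
    --nnodes=$NUM_NODES \
    --nproc_per_node=$NUM_TRAINERS \
    --node_rank=$RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    main.py
```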

VESSL automatically sets the environment variables corresponding to --node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes.

Files

In a distributed experiment, all workers share a single output storage. Be aware that files can be overwritten by other workers when you use the same output path.
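One common way to avoid such collisions is to include the node's rank in each worker's output path (a minimal sketch; `worker_output_path` is an illustrative helper, not a VESSL API):

```python
import os

def worker_output_path(base_dir, filename, env=os.environ):
    """Illustrative helper: place each worker's files under a
    rank-specific subdirectory of the shared output storage so
    workers do not overwrite one another's files."""
    rank = env.get("RANK", "0")  # set by VESSL on each worker
    return os.path.join(base_dir, f"rank_{rank}", filename)
```

For example, a worker whose RANK is 2 saving `model.pt` under `/output` would write to `/output/rank_2/model.pt`.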
