Volumes in workspaces provide powerful data management capabilities for interactive development. This guide walks through practical examples of using volumes to enhance your data science and machine learning workflows.

Example 1: Data Science Project Setup

This example shows how to set up a workspace with multiple data sources for a typical data science project.

Web Console Setup

When creating a new workspace, configure the following volumes:
  1. Import code repository
    • Type: Import
    • Source: git://github.com/myorg/customer-analytics
    • Target path: /workspace/code
  2. Import training dataset
    • Type: Import
    • Source: VESSL Dataset myorg/customer-data-train
    • Target path: /data/train
  3. Mount shared results storage
    • Type: Mount
    • Source: VESSL Storage shared-experiments
    • Target path: /results

CLI Configuration

vessl workspace create \
  --name "customer-analytics-workspace" \
  --cluster "vessl-gcp-oregon" \
  --preset "gpu-l4-small" \
  --image "quay.io/vessl-ai/torch:2.3.1-cuda12.1" \
  --import "/workspace/code:git://github.com/myorg/customer-analytics" \
  --import "/data/train:vessl-dataset://myorg/customer-data-train" \
  --import "/data/validation:vessl-dataset://myorg/customer-data-val" \
  --mount "/results:volume://vessl-storage/shared-experiments"

Workspace Structure

After startup, your workspace will have:
/workspace/code/          # Latest code from Git repository
├── notebooks/
├── src/
└── requirements.txt

/data/train/              # Training dataset
├── features.csv
└── labels.csv

/data/validation/         # Validation dataset
├── features.csv
└── labels.csv

/results/                 # Shared storage for experiments
├── models/
├── logs/
└── metrics/

/root/                    # Persistent personal workspace
├── .jupyter/
├── experiments/          # Your personal experiments
└── notebooks/            # Your development notebooks
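
Once the workspace is running, a quick check from a notebook confirms that every volume landed where you expect. A minimal sketch using only the standard library (the paths match the CLI command above; adjust them if your targets differ):
import os

# Target paths configured above, plus the persistent /root volume
expected = ['/workspace/code', '/data/train', '/data/validation', '/results', '/root']
for path in expected:
    status = 'ok' if os.path.isdir(path) else 'MISSING'
    print(f"{path:<20} {status}")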

Benefits

  • Automatic setup: All data is ready when workspace starts
  • Team collaboration: Shared results storage enables team access to experiments
  • Version control: Code automatically synced from Git
  • Organized structure: Clear separation of code, data, and results

Example 2: Large Dataset Workflow

This example demonstrates working with large datasets that exceed workspace disk capacity.

Problem

Your workspace has 50GB disk space, but you need to work with a 100GB dataset.

Solution: Mount the Dataset

vessl workspace create \
  --name "large-dataset-analysis" \
  --cluster "vessl-gcp-oregon" \
  --preset "memory-optimized-large" \
  --image "quay.io/vessl-ai/python:3.11" \
  --import "/code:git://github.com/myorg/big-data-analysis" \
  --mount "/data/large-dataset:vessl-dataset://myorg/large-customer-dataset" \
  --mount "/shared/models:volume://vessl-storage/pretrained-models"

Working with Mounted Data

In your Jupyter notebook:
import pandas as pd
import dask.dataframe as dd

# Read from mounted dataset (doesn't consume workspace disk)
df = dd.read_parquet('/data/large-dataset/data.parquet')

# Process data lazily in partition-sized chunks (preprocess is your own function)
processed_df = df.map_partitions(preprocess)

# Save results to persistent /root directory
processed_df.to_parquet('/root/processed_data/')

# Or save to shared storage for team access
processed_df.to_parquet('/shared/models/processed_data/')
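
To confirm that reads from the mounted path are not consuming the 50GB workspace disk, a quick check along these lines helps (reusing dd from the snippet above):
import shutil

def free_gb(path):
    # Free space on the filesystem backing `path`, in GB
    return shutil.disk_usage(path).free / 1e9

before = free_gb('/root')
_ = dd.read_parquet('/data/large-dataset/data.parquet').head(1000)
after = free_gb('/root')
print(f"Workspace disk free before/after read: {before:.1f} GB / {after:.1f} GB")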

Example 3: Multi-Modal AI Development

This example shows setting up a workspace for multi-modal AI development with various data types.

CLI Setup

vessl workspace create \
  --name "multimodal-ai-workspace" \
  --cluster "vessl-gcp-oregon" \
  --preset "gpu-a100-large" \
  --image "quay.io/vessl-ai/torch:2.3.1-cuda12.1" \
  --import "/code:git://github.com/myorg/multimodal-ai" \
  --import "/data/images:vessl-dataset://myorg/image-dataset" \
  --import "/data/text:vessl-dataset://myorg/text-corpus" \
  --import "/models/pretrained:vessl-model://myorg/clip-base/v1" \
  --mount "/storage/experiments:volume://vessl-storage/multimodal-experiments" \
  --mount "/storage/checkpoints:volume://vessl-storage/model-checkpoints"

Development Workflow

  1. Data Exploration: Browse mounted datasets without consuming workspace disk
  2. Model Development: Access pre-trained models immediately
  3. Experiment Tracking: Save experiments to shared storage
  4. Checkpoint Management: Keep checkpoints in persistent storage shared across the team

# Example notebook usage
import torch
import pandas as pd
from glob import glob
from transformers import CLIPModel, CLIPProcessor

# Load pre-trained model (available immediately)
model = CLIPModel.from_pretrained('/models/pretrained/')
processor = CLIPProcessor.from_pretrained('/models/pretrained/')

# Work with datasets (mounted, no disk usage)
image_paths = glob('/data/images/**/*.jpg', recursive=True)
text_data = pd.read_csv('/data/text/corpus.csv')

# Save experiments to shared storage (train_model is your own training loop)
experiment_results = train_model(model, image_paths, text_data)
torch.save(experiment_results, '/storage/experiments/experiment_001.pt')

# Save checkpoints to persistent storage
torch.save(model.state_dict(), '/storage/checkpoints/model_epoch_10.pth')
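
Before launching a full training run, it is worth confirming that the mounted data and the imported CLIP weights work together. A minimal sanity check, assuming Pillow is available in the image and /data/images contains at least one JPEG:
from PIL import Image

# Embed one image/text pair with the model and processor loaded above
image = Image.open(image_paths[0])
inputs = processor(text=["an example caption"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image)  # image-text similarity logits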

Example 4: External Cloud Storage Integration

This example demonstrates integrating external cloud storage for seamless data access.

Prerequisites

  1. Set up AWS credentials in organization settings
  2. Ensure S3 bucket has appropriate permissions

Setup with External Storage

vessl workspace create \
  --name "cloud-data-workspace" \
  --cluster "vessl-aws-us-east" \
  --preset "gpu-l4-medium" \
  --image "quay.io/vessl-ai/python:3.11" \
  --import "/code:git://github.com/myorg/ml-pipeline" \
  --mount "/data/raw:s3://my-company-data/raw-data/" \
  --mount "/data/processed:s3://my-company-data/processed/" \
  --export "/results:s3://my-company-results/experiments/"

Working with Cloud Data

# Direct access to S3-mounted data
import pandas as pd

# Read directly from mounted S3 (real-time access)
raw_data = pd.read_csv('/data/raw/latest_data.csv')

# Process data (preprocess is your own cleaning function)
processed_data = preprocess(raw_data)

# Save to the mounted S3 processed folder
processed_data.to_csv('/data/processed/processed_data.csv', index=False)

# Files written to /results are exported to S3 when the workspace terminates
import joblib
model_results = train_model(processed_data)  # train_model is your own training function
joblib.dump(model_results, '/results/model_v1.joblib')
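
For larger files on the S3 mount, you may prefer to stream them in chunks rather than load everything at once; a sketch using pandas' chunked reader (the path is the same illustrative file as above):
# Stream the mounted CSV in 100k-row chunks to keep memory usage flat
results = []
for chunk in pd.read_csv('/data/raw/latest_data.csv', chunksize=100_000):
    results.append(preprocess(chunk))  # preprocess is your own function

# Concatenate the (typically smaller) processed chunks
processed_data = pd.concat(results)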

Example 5: Hugging Face Integration

This example shows how to efficiently work with Hugging Face datasets and models.

Setup

vessl workspace create \
  --name "huggingface-workspace" \
  --cluster "vessl-gcp-oregon" \
  --preset "gpu-l4-small" \
  --image "quay.io/vessl-ai/torch:2.3.1-cuda12.1" \
  --import "/models/bert:hf://models/bert-base-uncased" \
  --import "/datasets/squad:hf://datasets/squad" \
  --import "/code:git://github.com/myorg/nlp-experiments" \
  --mount "/shared/models:volume://vessl-storage/fine-tuned-models"

Development

from transformers import AutoTokenizer, AutoModel
from datasets import load_from_disk

# Models and datasets are pre-downloaded and ready
tokenizer = AutoTokenizer.from_pretrained('/models/bert/')
model = AutoModel.from_pretrained('/models/bert/')

# Load pre-downloaded dataset
dataset = load_from_disk('/datasets/squad/')

# Fine-tune model (fine_tune is your own training routine)
fine_tuned_model = fine_tune(model, dataset)

# Save to shared storage for team access
fine_tuned_model.save_pretrained('/shared/models/bert-squad-finetuned/')
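
Because /shared/models is a mounted storage volume, any teammate whose workspace mounts the same volume can load the fine-tuned weights directly; for example:
from transformers import AutoModel

# Load the weights saved above from the shared volume
model = AutoModel.from_pretrained('/shared/models/bert-squad-finetuned/')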

Best Practices for Volume Usage

1. Choose the Right Volume Type

Use Import for:
  • Small to medium datasets (< 10GB)
  • Code repositories
  • Pre-trained models
  • Data that doesn’t change during development
Use Mount for:
  • Large datasets (> 10GB)
  • Frequently updated data
  • Shared storage across workspaces
  • Real-time data pipelines

2. Organize Your Data

/code/              # Imported code repositories
/data/              # All datasets (imported or mounted)
  ├── raw/          # Raw, unprocessed data
  ├── processed/    # Processed datasets
  └── external/     # External data sources
/models/            # Pre-trained models
  ├── pretrained/   # Downloaded models
  └── checkpoints/  # Training checkpoints
/results/           # Shared results storage
/root/              # Personal persistent storage
  ├── experiments/  # Personal experiments
  └── notebooks/    # Development notebooks

3. Performance Optimization

  • Co-locate storage and compute: Use storage in the same region as your cluster
  • Cache frequently accessed data: Copy small, frequently used files to /root (see the sketch after this list)
  • Use appropriate storage types: SSDs for random access, object storage for large sequential reads
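
For the caching tip above, copying a hot subset into /root once per session is usually enough. A sketch, assuming a small directory of lookup tables lives on the mounted dataset (the paths are illustrative):
import os
import shutil

src = '/data/large-dataset/lookup_tables'  # on the mounted volume (illustrative path)
dst = '/root/cache/lookup_tables'          # on the persistent workspace disk

# Copy once; later reads hit the local copy instead of remote storage
if not os.path.isdir(dst):
    shutil.copytree(src, dst)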

4. Cost Optimization

  • Import small data: For datasets under 1GB, import is usually more cost-effective
  • Mount large data: Avoid duplicating large datasets across workspaces
  • Clean up exports: Regularly clean up exported data to avoid storage costs

Common Troubleshooting

Volume Mount Failures

# Check if source exists and is accessible
vessl dataset list
vessl storage list

# Verify organization integrations
vessl organization integration list

# Check workspace logs for detailed error messages
vessl workspace logs my-workspace

Performance Issues

# Monitor disk usage
import shutil
total, used, free = shutil.disk_usage('/root')
print(f"Disk usage: {used/total*100:.1f}%")

# Check mounted volume accessibility
import os
print("Mounted datasets:", os.listdir('/data'))

Access Permission Issues

# Check volume permissions in workspace
ls -la /data/
ls -la /models/

# Verify organization has access to external storage
vessl organization integration test aws-s3-integration

Learn more about workspace volumes

Explore the complete guide to workspace volume configuration and advanced usage patterns.