Volumes in workspaces provide powerful data management capabilities for interactive development. This guide walks through practical examples of using volumes to enhance your data science and machine learning workflows.

Example 1: Data Science Project Setup

This example shows how to set up a workspace with multiple data sources for a typical data science project.

Web Console Setup

When creating a new workspace, configure the following volumes:
  1. Import code repository
    • Type: Import
    • Source: git://github.com/myorg/customer-analytics
    • Target path: /workspace/code
  2. Import training dataset
    • Type: Import
    • Source: VESSL Dataset myorg/customer-data-train
    • Target path: /data/train
  3. Mount shared results storage
    • Type: Mount
    • Source: VESSL Storage shared-experiments
    • Target path: /results

CLI Configuration

vessl workspace create \
  --name "customer-analytics-workspace" \
  --cluster "vessl-gcp-oregon" \
  --preset "gpu-l4-small" \
  --image "quay.io/vessl-ai/torch:2.3.1-cuda12.1" \
  --import "/workspace/code:git://github.com/myorg/customer-analytics" \
  --import "/data/train:vessl-dataset://myorg/customer-data-train" \
  --import "/data/validation:vessl-dataset://myorg/customer-data-val" \
  --mount "/results:volume://vessl-storage/shared-experiments"

Workspace Structure

After startup, your workspace will have:
/workspace/code/          # Latest code from Git repository
├── notebooks/
├── src/
└── requirements.txt

/data/train/              # Training dataset
├── features.csv
└── labels.csv

/data/validation/         # Validation dataset
├── features.csv
└── labels.csv

/results/                 # Shared storage for experiments
├── models/
├── logs/
└── metrics/

/root/                    # Persistent personal workspace
├── .jupyter/
├── experiments/          # Your personal experiments
└── notebooks/            # Your development notebooks
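
Once the workspace is running, a quick check from a notebook confirms that every volume landed where you expect. A minimal sketch using only the standard library (the paths match the CLI command above; adjust them if your targets differ):
import os

# Target paths configured above, plus the persistent /root volume
expected = ['/workspace/code', '/data/train', '/data/validation', '/results', '/root']
for path in expected:
    status = 'ok' if os.path.isdir(path) else 'MISSING'
    print(f"{path:<20} {status}")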

Benefits

  • Automatic setup: All data is ready when workspace starts
  • Team collaboration: Shared results storage enables team access to experiments
  • Version control: Code automatically synced from Git
  • Organized structure: Clear separation of code, data, and results

Example 2: Large Dataset Workflow

This example demonstrates working with large datasets that exceed workspace disk capacity.

Problem

Your workspace has 50GB disk space, but you need to work with a 100GB dataset.

Solution: Mount the Dataset

vessl workspace create \
  --name "large-dataset-analysis" \
  --cluster "vessl-gcp-oregon" \
  --preset "memory-optimized-large" \
  --image "quay.io/vessl-ai/python:3.11" \
  --import "/code:git://github.com/myorg/big-data-analysis" \
  --mount "/data/large-dataset:vessl-dataset://myorg/large-customer-dataset" \
  --mount "/shared/models:volume://vessl-storage/pretrained-models"

Working with Mounted Data

In your Jupyter notebook:
import pandas as pd
import dask.dataframe as dd

# Read from mounted dataset (doesn't consume workspace disk)
df = dd.read_parquet('/data/large-dataset/data.parquet')

# Process data lazily in partition-sized chunks (preprocess is your own function)
processed_df = df.map_partitions(preprocess)

# Save results to persistent /root directory
processed_df.to_parquet('/root/processed_data/')

# Or save to shared storage for team access
processed_df.to_parquet('/shared/models/processed_data/')
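
To confirm that reads from the mounted path are not consuming the 50GB workspace disk, a quick check along these lines helps (reusing dd from the snippet above):
import shutil

def free_gb(path):
    # Free space on the filesystem backing `path`, in GB
    return shutil.disk_usage(path).free / 1e9

before = free_gb('/root')
_ = dd.read_parquet('/data/large-dataset/data.parquet').head(1000)
after = free_gb('/root')
print(f"Workspace disk free before/after read: {before:.1f} GB / {after:.1f} GB")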

Example 3: Multi-Modal AI Development

This example shows setting up a workspace for multi-modal AI development with various data types.

CLI Setup

vessl workspace create \
  --name "multimodal-ai-workspace" \
  --cluster "vessl-gcp-oregon" \
  --preset "gpu-a100-large" \
  --image "quay.io/vessl-ai/torch:2.3.1-cuda12.1" \
  --import "/code:git://github.com/myorg/multimodal-ai" \
  --import "/data/images:vessl-dataset://myorg/image-dataset" \
  --import "/data/text:vessl-dataset://myorg/text-corpus" \
  --import "/models/pretrained:vessl-model://myorg/clip-base/v1" \
  --mount "/storage/experiments:volume://vessl-storage/multimodal-experiments" \
  --mount "/storage/checkpoints:volume://vessl-storage/model-checkpoints"

Development Workflow

  1. Data Exploration: Browse mounted datasets without consuming workspace disk
  2. Model Development: Access pre-trained models immediately
  3. Experiment Tracking: Save experiments to shared storage
  4. Checkpoint Management: Keep checkpoints in persistent storage shared across the team

# Example notebook usage
import torch
import pandas as pd
from glob import glob
from transformers import CLIPModel, CLIPProcessor

# Load pre-trained model (available immediately)
model = CLIPModel.from_pretrained('/models/pretrained/')
processor = CLIPProcessor.from_pretrained('/models/pretrained/')

# Work with datasets (mounted, no disk usage)
image_paths = glob('/data/images/**/*.jpg', recursive=True)
text_data = pd.read_csv('/data/text/corpus.csv')

# Save experiments to shared storage (train_model is your own training loop)
experiment_results = train_model(model, image_paths, text_data)
torch.save(experiment_results, '/storage/experiments/experiment_001.pt')

# Save checkpoints to persistent storage
torch.save(model.state_dict(), '/storage/checkpoints/model_epoch_10.pth')
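
Before launching a full training run, it is worth confirming that the mounted data and the imported CLIP weights work together. A minimal sanity check, assuming Pillow is available in the image and /data/images contains at least one JPEG:
from PIL import Image

# Embed one image/text pair with the model and processor loaded above
image = Image.open(image_paths[0])
inputs = processor(text=["an example caption"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image)  # image-text similarity logits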

Example 4: External Cloud Storage Integration

This example demonstrates integrating external cloud storage for seamless data access.

Prerequisites

  1. Set up AWS credentials in organization settings
  2. Ensure S3 bucket has appropriate permissions

Setup with External Storage

vessl workspace create \
  --name "cloud-data-workspace" \
  --cluster "vessl-aws-us-east" \
  --preset "gpu-l4-medium" \
  --image "quay.io/vessl-ai/python:3.11" \
  --import "/code:git://github.com/myorg/ml-pipeline" \
  --mount "/data/raw:s3://my-company-data/raw-data/" \
  --mount "/data/processed:s3://my-company-data/processed/" \
  --export "/results:s3://my-company-results/experiments/"

Working with Cloud Data

# Direct access to S3-mounted data
import pandas as pd

# Read directly from mounted S3 (real-time access)
raw_data = pd.read_csv('/data/raw/latest_data.csv')

# Process data (preprocess is your own cleaning function)
processed_data = preprocess(raw_data)

# Save to the mounted S3 processed folder
processed_data.to_csv('/data/processed/processed_data.csv', index=False)

# Files written to /results are exported to S3 when the workspace terminates
import joblib
model_results = train_model(processed_data)  # train_model is your own training function
joblib.dump(model_results, '/results/model_v1.joblib')
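
For larger files on the S3 mount, you may prefer to stream them in chunks rather than load everything at once; a sketch using pandas' chunked reader (the path is the same illustrative file as above):
# Stream the mounted CSV in 100k-row chunks to keep memory usage flat
results = []
for chunk in pd.read_csv('/data/raw/latest_data.csv', chunksize=100_000):
    results.append(preprocess(chunk))  # preprocess is your own function

# Concatenate the (typically smaller) processed chunks
processed_data = pd.concat(results)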

Example 5: Hugging Face Integration

This example shows how to efficiently work with Hugging Face datasets and models.

Setup

vessl workspace create \
  --name "huggingface-workspace" \
  --cluster "vessl-gcp-oregon" \
  --preset "gpu-l4-small" \
  --image "quay.io/vessl-ai/torch:2.3.1-cuda12.1" \
  --import "/models/bert:hf://models/bert-base-uncased" \
  --import "/datasets/squad:hf://datasets/squad" \
  --import "/code:git://github.com/myorg/nlp-experiments" \
  --mount "/shared/models:volume://vessl-storage/fine-tuned-models"

Development

from transformers import AutoTokenizer, AutoModel
from datasets import load_from_disk

# Models and datasets are pre-downloaded and ready
tokenizer = AutoTokenizer.from_pretrained('/models/bert/')
model = AutoModel.from_pretrained('/models/bert/')

# Load pre-downloaded dataset
dataset = load_from_disk('/datasets/squad/')

# Fine-tune model (fine_tune is your own training routine)
fine_tuned_model = fine_tune(model, dataset)

# Save to shared storage for team access
fine_tuned_model.save_pretrained('/shared/models/bert-squad-finetuned/')
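
Because /shared/models is a mounted storage volume, any teammate whose workspace mounts the same volume can load the fine-tuned weights directly; for example:
from transformers import AutoModel

# Load the weights saved above from the shared volume
model = AutoModel.from_pretrained('/shared/models/bert-squad-finetuned/')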

Best Practices for Volume Usage

1. Choose the Right Volume Type

Use Import for:
  • Small to medium datasets (< 10GB)
  • Code repositories
  • Pre-trained models
  • Data that doesn’t change during development
Use Mount for:
  • Large datasets (> 10GB)
  • Frequently updated data
  • Shared storage across workspaces
  • Real-time data pipelines

2. Organize Your Data

/code/              # Imported code repositories
/data/              # All datasets (imported or mounted)
  ├── raw/          # Raw, unprocessed data
  ├── processed/    # Processed datasets
  └── external/     # External data sources
/models/            # Pre-trained models
  ├── pretrained/   # Downloaded models
  └── checkpoints/  # Training checkpoints
/results/           # Shared results storage
/root/              # Personal persistent storage
  ├── experiments/  # Personal experiments
  └── notebooks/    # Development notebooks

3. Performance Optimization

  • Co-locate storage and compute: Use storage in the same region as your cluster
  • Cache frequently accessed data: Copy small, frequently used files to /root (see the sketch after this list)
  • Use appropriate storage types: SSDs for random access, object storage for large sequential reads
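
For the caching tip above, copying a hot subset into /root once per session is usually enough. A sketch, assuming a small directory of lookup tables lives on the mounted dataset (the paths are illustrative):
import os
import shutil

src = '/data/large-dataset/lookup_tables'  # on the mounted volume (illustrative path)
dst = '/root/cache/lookup_tables'          # on the persistent workspace disk

# Copy once; later reads hit the local copy instead of remote storage
if not os.path.isdir(dst):
    shutil.copytree(src, dst)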

4. Cost Optimization

  • Import small data: For datasets under 1GB, import is usually more cost-effective
  • Mount large data: Avoid duplicating large datasets across workspaces
  • Clean up exports: Regularly clean up exported data to avoid storage costs

Common Troubleshooting

Volume Mount Failures

# Check if source exists and is accessible
vessl dataset list
vessl storage list

# Verify organization integrations
vessl organization integration list

# Check workspace logs for detailed error messages
vessl workspace logs my-workspace

Performance Issues

# Monitor disk usage
import shutil
total, used, free = shutil.disk_usage('/root')
print(f"Disk usage: {used/total*100:.1f}%")

# Check mounted volume accessibility
import os
print("Mounted datasets:", os.listdir('/data'))

Access Permission Issues

# Check volume permissions in workspace
ls -la /data/
ls -la /models/

# Verify organization has access to external storage
vessl organization integration test aws-s3-integration

Learn more about workspace volumes

Explore the complete guide to workspace volume configuration and advanced usage patterns.