A cluster is a set of nodes connected to VESSL that runs containerized machine learning workloads. You can either use VESSL’s managed clusters, or bring your own on-premise or cloud cluster. VESSL provides number of useful cluster management features for the clusters plugged in:
  • Running VESSL-managed, fault-tolerant ML workloads on the clusters
  • Monitoring & tracking system metrics for cluster, nodes and workloads
  • Watching cluster health and issue notifications, e.g. network failure or disk pressure
  • Managing resource spec and quota for resource governance across multiple teams & users