As an organization manager, you can define custom resource presets under Resource specs for users to select when launching ML workloads. You can also assign a priority to each preset.
For example, when you define resource specs as described above, users launching a Run or Workspace can only choose from the three predefined options, as shown in the image above.
These default options can help admins optimize resource usage by (1) preventing a single user from occupying an excessive number of GPUs and (2) preventing unbalanced resource requests that cause skewed resource usage. As a result, average users can simply proceed with their jobs without having to figure out the exact number of CPU cores or the amount of memory to request.
Take a quick 2-minute tour of Resource specs using the demo below.
Click New resource spec and define the following parameters.
Name — Set a name for the preset. Use a descriptive name that reflects the preset's resources, such as a100-2.mem-16.cpu-6.
Processor type — Select the processor type for the preset, either CPU or GPU.
CPU limit — Enter the number of CPUs. For a100-2.mem-16.cpu-6, enter 6.
Memory limit — Enter the amount of memory in GB. For a100-2.mem-16.cpu-6, the number would be 16.
Priority — Assigning different priority values disables the First In, First Out (FIFO) scheduler and executes workloads in order of priority, with lower priority values processed first. In the example preset above, workloads running on cpu-medium are always prioritized over workloads on other presets. To view the priority assigned to each resource spec, click the “Edit” button under Resource specs.
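To illustrate the difference, here is a minimal Python sketch contrasting FIFO order with priority order using the standard heapq module; the preset names and priority values are made-up assumptions for illustration:

import heapq

# Hypothetical (preset, priority) pairs, in submission order;
# lower priority values are processed first.
submitted = [("a100-2.mem-16.cpu-6", 3), ("cpu-medium", 1), ("a100-4", 2)]

# FIFO: workloads run in submission order.
fifo_order = [name for name, _ in submitted]

# Priority scheduling: pop workloads in ascending priority-value order.
heap = [(prio, name) for name, prio in submitted]
heapq.heapify(heap)
priority_order = [heapq.heappop(heap)[1] for _ in range(len(heap))]

print(fifo_order)      # ['a100-2.mem-16.cpu-6', 'cpu-medium', 'a100-4']
print(priority_order)  # ['cpu-medium', 'a100-4', 'a100-2.mem-16.cpu-6']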
GPU type — Specify the GPU model you are using. You can check the model by running the nvidia-smi command on your server; in the example below, the GPU type is a100-sxm-80gb.
nvidia-smi
Thu Jan 19 17:44:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   40C    P0    64W / 275W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
GPU limit — Enter the number of GPUs. For a100-2.mem-16.cpu-6, enter 2. You can also enter decimal values if you are using Multi-Instance GPUs (MIG).
Available workloads — Select the types of workloads that can use the preset. For example, you can guide users toward Experiment by preventing them from running a Workspace with 4 or 8 GPUs.
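Putting the parameters together, a preset like a100-2.mem-16.cpu-6 can be summarized as a simple mapping. The Python sketch below is purely illustrative; the field names are assumptions, not VESSL's actual schema:

# Illustrative summary of the example preset; field names are assumed,
# not VESSL's actual API schema.
preset = {
    "name": "a100-2.mem-16.cpu-6",
    "processor_type": "GPU",
    "cpu_limit": 6,               # number of CPU cores
    "memory_limit_gb": 16,        # memory in GB
    "gpu_type": "a100-sxm-80gb",  # from nvidia-smi
    "gpu_limit": 2,               # decimals allowed with MIG
    "priority": 1,                # assumed value; lower runs first
    "available_workloads": ["Run", "Workspace"],
}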
Tolerations allow workloads to be scheduled on nodes with specific taints by matching the taints' conditions. They consist of two key components: Operator and Effect. The available options are explained below; a short code sketch follows the list.
Equal
The Toleration is applied only if both the Key and Value match the node’s taint exactly.
Example: If a node has a taint key=value, the Toleration must also specify key=value to allow scheduling.
Exists
The Toleration is applied if the Key exists, regardless of the Value.
Example: If a node has a taint whose key is key, regardless of the taint's value, a Toleration that specifies only key allows scheduling.
NoExecute
Workloads that do not tolerate this taint will be evicted immediately from the node. Additionally, they cannot be scheduled onto the node.
NoSchedule
Workloads that do not tolerate this taint will not be scheduled on the node. However, any workloads already running on the node will remain unaffected.
PreferNoSchedule
Kubernetes will attempt to avoid scheduling workloads on nodes with this taint if they do not have a matching Toleration. However, it is not strictly enforced, and workloads may still be scheduled if necessary.
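To make these pieces concrete, here is a minimal sketch using the official Kubernetes Python client (the kubernetes package); the key and value shown are the placeholder key=value from the examples above, not real taints:

from kubernetes import client

# Tolerate the taint key=value with effect NoSchedule.
# Operator "Equal" requires both key and value to match exactly;
# "Exists" matches any value for the given key (omit `value` in that case).
toleration = client.V1Toleration(
    key="key",
    operator="Equal",
    value="value",
    effect="NoSchedule",  # or "NoExecute" / "PreferNoSchedule"
)

pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name="workload")],
    tolerations=[toleration],
)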
Enhanced scheduling control: Tolerations work with taints to provide fine-grained control over where workloads can and cannot run, allowing for sophisticated scheduling policies.
Workload isolation: By tolerating specific taints, workloads can be isolated to certain nodes, enhancing security and performance.
Node maintenance and stability: Taints and Tolerations help manage node availability and workload eviction during maintenance or when nodes exhibit issues, improving cluster stability.
Resource optimization: They enable better resource utilization by ensuring that workloads are scheduled on appropriate nodes that meet their operational requirements.
Node Selectors allow you to control where workloads are scheduled by matching specific labels on nodes. They are a simple key-value mechanism that constrains workloads to run only on nodes meeting certain criteria; a short sketch follows the Key and Value descriptions below.
Key
Specifies the label key on the node that the workload should match.
Example: vessl.ai/role
Value
Specifies the corresponding value of the key. The workload will only be scheduled on nodes where the label matches this value.
Example: gpu-worker
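As a minimal sketch with the same Kubernetes Python client, the Node Selector from the example above maps to the node_selector field of a pod spec:

from kubernetes import client

# Schedule the workload only on nodes labeled vessl.ai/role=gpu-worker.
pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name="workload")],
    node_selector={"vessl.ai/role": "gpu-worker"},
)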