GPU Instances

Configure Cast AI Autoscaler to scale your cluster using GPU-optimized instances across AWS EKS, GCP GKE, and Azure AKS with NVIDIA GPU support.

Autoscaling using GPU instances

The Cast AI Autoscaler can scale the cluster using GPU-optimized instances. This guide describes the steps needed to configure the cluster so that GPU nodes can join it.

Supported providers

| Provider | GPUs supported |
| --- | --- |
| AWS EKS | NVIDIA |
| GCP GKE | NVIDIA |
| Azure AKS * | NVIDIA |

* Please contact Cast AI support to enable this feature for your organization.

How does it work?

Once activated, Cast AI's Autoscaler detects workloads that require GPU resources and starts provisioning GPU nodes for them.

Provisioning a GPU node involves three steps:

  • Choose a GPU instance type, or attach a GPU to an instance type
  • Install GPU drivers
  • Expose the GPU to Kubernetes as a consumable resource

Cast AI ensures that the correct GPU instance type is selected. All you have to do is define GPU resources and add a GPU or node template toleration. You can also target specific GPU characteristics using node selectors or affinities on the following GPU labels:

| Label | Value example | Description |
| --- | --- | --- |
| nvidia.com/gpu | true | Node has an NVIDIA GPU attached |
| nvidia.com/gpu.name | nvidia-tesla-t4 | Attached GPU type |
| nvidia.com/gpu.count | 1 | Attached GPU count |
| nvidia.com/gpu.memory | 15258 | Available memory of a single GPU, in MiB |

Tainting of GPU nodes

A GPU node added using the "default-by-castai" node template will have the taint nvidia.com/gpu=true:NoSchedule applied to it. By contrast, a GPU node added using a custom node template will not be tainted unless a taint is specified in the template definition.
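
As a rough illustration of the taint described above, the first fragment below shows how nvidia.com/gpu=true:NoSchedule would appear on a Node object, and the second shows a pod toleration that matches it exactly. The node name is hypothetical, and both documents are trimmed to the relevant fields rather than being complete manifests.

```yaml
# Fragment of a Node object carrying the default GPU taint.
# The node name is a hypothetical placeholder.
apiVersion: v1
kind: Node
metadata:
  name: example-gpu-node
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Fragment of a Pod spec that tolerates exactly that taint.
# Using `operator: Exists` without a value would match any value of the key.
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-pod
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
```

A toleration only permits scheduling onto the tainted node; the pod still needs a GPU resource request for the Autoscaler to treat it as a GPU workload.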

GPU sharing

Cast AI supports GPU sharing so that multiple workloads can use GPU resources more efficiently. Multiple sharing methods are available, each optimized for different use cases.

For a detailed comparison and guidance on choosing the right method, see the GPU sharing overview.

GPU drivers

After creating a node of an instance type with a GPU, the node becomes part of the cluster, but GPU resources are not immediately usable. To make GPUs accessible to Kubernetes, you must install the GPU drivers on the node.

GPU driver plugins help achieve this goal. The installation of GPU driver plugins on the cluster/node varies depending on the cloud provider or desired behavior.

Cast AI verifies that the driver plugin exists on the cluster before performing any autoscaling. If it doesn't detect the driver plugin, it creates a pod event with details on how to resolve the problem.

Driver detection

Cast AI assumes that the GPU driver plugin is installed if it finds a DaemonSet that matches the plugin's characteristics, and a pod created from that DaemonSet can run on a node.

Cast AI supports all default GPU driver plugins that match specific name patterns. You can also mark any custom plugin as supported by adding the label nvidia-device-plugin: "true".

A DaemonSet whose name matches one of the following patterns is considered a known official GPU driver plugin:

  • *nvidia-device-plugin*
  • *nvidia-gpu-device-plugin*
  • *nvidia-driver-installer*
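
To illustrate the custom-plugin label, here is a minimal sketch of a DaemonSet whose name matches none of the patterns above and is instead marked with nvidia-device-plugin: "true". The DaemonSet name, namespace, app label, and image are hypothetical, and placing the label in the DaemonSet's own metadata.labels is an assumption; check with Cast AI where the label is expected.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-gpu-plugin              # hypothetical name, matches no known pattern
  namespace: kube-system           # hypothetical namespace
  labels:
    nvidia-device-plugin: "true"   # assumption: label is read from DaemonSet metadata
spec:
  selector:
    matchLabels:
      app: my-gpu-plugin
  template:
    metadata:
      labels:
        app: my-gpu-plugin
    spec:
      # Run only on GPU nodes and tolerate the default GPU taint,
      # so that a pod from this DaemonSet can actually run on a GPU node.
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: plugin
          image: registry.example.com/my-gpu-plugin:1.0.0   # hypothetical image
```

The node selector and toleration matter because detection also requires that a pod created from the DaemonSet can run on a node.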

GPU drivers on AWS (EKS)

By default, EKS clusters do not come with a GPU device plugin installed. There are several ways to add a GPU plugin to a cluster:

During Cast AI onboarding

Cast AI allows you to enable the GPU device plugin at the cluster onboarding stage. By ticking the checkbox in the UI, you can install plugins automatically during onboarding.

Manually installing device plugin

Alternatively, you can manually install the plugin from the NVIDIA Helm repository.

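# Add the NVIDIA device plugin chart repository and refresh the local index
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# noglob is a zsh modifier that stops zsh from expanding the [n] indexes below;
# omit it when running this command in bash
noglob helm upgrade -i nvdp nvdp/nvidia-device-plugin -n castai-agent \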
    --set-string nodeSelector."nvidia\.com/gpu"=true \
    --set \
tolerations[0].key=CriticalAddonsOnly,tolerations[0].operator=Exists,\
tolerations[1].effect=NoSchedule,tolerations[1].key="nvidia\.com/gpu",tolerations[1].operator=Exists,\
tolerations[2].key="scheduling\.cast\.ai/spot",tolerations[2].operator=Exists,\
tolerations[3].key="scheduling\.cast\.ai/scoped-autoscaler",tolerations[3].operator=Exists,\
tolerations[4].key="scheduling\.cast\.ai/node-template",tolerations[4].operator=Exists

Custom GPU device plugin

Cast AI assumes that a custom plugin controls the driver installation process and node management. To autoscale GPU nodes using a custom plugin, the plugin must be detectable by Cast AI (see Driver detection).

NVIDIA drivers and Amazon Machine Images (AMIs)

The NVIDIA device plugin requires that the NVIDIA drivers and the nvidia-container-toolkit already exist on the machine; otherwise it will fail to start or fail to expose GPU resources properly. By default, Cast AI detects GPU-enabled nodes and uses Amazon's EKS-optimized, GPU-enabled AMI, which already bundles these. If you use custom AMIs, the installation of these prerequisites must be included in the AMI build process or in the node's user data scripts.

Amazon Linux 2023 AMI

Amazon Linux 2023 (AL2023) AMIs support GPU time-slicing. When using AL2023 for GPU nodes with time-slicing enabled, Cast AI automatically configures the required NVIDIA device plugin settings.

See the GPU sharing with time-slicing documentation for detailed AL2023 configuration instructions.

Bottlerocket AMI considerations

Bottlerocket AMIs come with pre-installed NVIDIA drivers. Do not install additional NVIDIA device plugins on clusters where you plan to use Bottlerocket GPU nodes, as they may conflict with the pre-installed drivers and fail to start. If you do install one, make sure its DaemonSet pods are not scheduled on the Bottlerocket nodes, for example by adjusting the DaemonSet's tolerations.

When using Bottlerocket for GPU nodes, the OS handles driver management automatically. For more information, see Bottlerocket support for NVIDIA GPUs.

See the Node Configuration documentation for more details on AMI choice.

Known issues

NVIDIA dropped support for Kepler architecture GPUs after driver version 470. Because Cast AI uses AMIs that bundle newer driver versions, these AMIs cannot be used with GPU instances that use Kepler GPUs (P2 instance types). To use those instance types, set an older GPU-enabled AMI or a custom AMI in the node configuration.

GPU drivers on GCP (GKE)

GKE clusters come with NVIDIA driver plugins preinstalled by default. If Cast AI finds the default plugins, it uses them and instructs them to install the default NVIDIA driver version for the cluster version, GPU, and instance type.

Alternatively, you can manually install driver plugins or use custom drivers. If Cast AI finds a custom or manually installed plugin, it takes priority over the preinstalled drivers.

Manually installing driver plugin

A manually installed driver plugin is used to install the GPU drivers, while the preinstalled GPU plugin manages both the GPU and the node. Use this command to install the drivers:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Custom GPU driver plugin

Cast AI assumes that a custom plugin has complete control over driver installation and over compatibility with the preinstalled plugins. To let Cast AI autoscale GPU nodes using a custom plugin, the plugin must be detectable by Cast AI.

GPU drivers on Azure (AKS)

Install the device plugin daemonset:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin.yml

Verify that the DaemonSet's pods are up and running on the GPU nodes.

You can then verify that the GPU is usable by running an NVIDIA plugin job or an Azure GPU job.
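
As a generic stand-in for such a verification job (this is not the NVIDIA or Azure job referred to above), a one-off pod that requests a single GPU and runs nvidia-smi can serve as a smoke test. The pod name is hypothetical and the image tag is an assumption; substitute any CUDA base image available to your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test             # hypothetical name
spec:
  restartPolicy: Never             # run once and stop
  tolerations:
    - key: nvidia.com/gpu          # tolerate the default GPU taint, if present
      operator: Exists
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumption: any CUDA base image works
      command: ["nvidia-smi"]      # prints the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1        # request one GPU from the device plugin
```

If the device plugin is working, the pod should be scheduled on a GPU node and complete, with the attached GPU listed in its logs. A pod that stays Pending usually means the nvidia.com/gpu resource is not yet exposed on any node.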

Workload configuration examples

The following examples show common patterns for configuring GPU workloads. For GPU sharing-specific configurations, see the time-slicing and MIG documentation.

Request one GPU and tolerate the default GPU taint:

spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

Target a node template by name and tolerate that template's custom taint:

spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-node-template"
  tolerations:
    - key: "gpu-node-template"
      value: "template-affinity"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

Select a specific GPU type by its name label:

spec:
  nodeSelector:
    nvidia.com/gpu.name: "nvidia-tesla-t4"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

Combine a node template with a specific GPU type:

spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-node-template"
    nvidia.com/gpu.name: "nvidia-tesla-p4"
  tolerations:
    - key: "scheduling.cast.ai/node-template"
      value: "gpu-node-template"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

Require more than 10000 MiB of memory per GPU using node affinity:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.memory
            operator: Gt
            values:
            - "10000"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1