GPU instances

Autoscaling using GPU instances

The CAST AI Autoscaler can scale the cluster using GPU-optimized instances. This guide describes the steps needed to configure the cluster so that GPU nodes can join it.

Supported providers

Provider  | GPUs supported
AWS EKS   | NVIDIA
GCP GKE   | NVIDIA
Azure AKS | Coming soon

How does it work?

Once activated, the CAST AI Autoscaler detects workloads that request GPU resources and starts provisioning GPU nodes for them.

To enable the provisioning of GPU nodes, you need a few things:

  • Choose a GPU instance type or attach a GPU to the instance type;
  • Install GPU drivers;
  • Expose GPU to Kubernetes as a consumable resource.

CAST AI ensures that the correct GPU instance type is selected; all you have to do is define GPU resources and add a GPU or node template toleration. You can also target specific GPU characteristics using node selectors or affinities on the GPU labels below.

Label                 | Value example   | Description
nvidia.com/gpu        | true            | Node has an NVIDIA GPU attached
nvidia.com/gpu.name   | nvidia-tesla-t4 | Attached GPU type
nvidia.com/gpu.count  | 1               | Attached GPU count
nvidia.com/gpu.memory | 15258           | Available single-GPU memory in MiB

📘

Tainting of GPU nodes

A GPU node added using the "default-by-castai" node template will have the taint nvidia.com/gpu=true:NoSchedule applied to it. In contrast, a GPU node added using a custom node template will not be tainted unless taints are specified in the template definition.

Workload configuration examples

# Example: GPU workload using the default node template; tolerates the nvidia.com/gpu taint
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

# Example: GPU workload on a custom node template that defines a custom taint (gpu-node-template=template-affinity)
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-node-template"
  tolerations:
    - key: "gpu-node-template"
      value: "template-affinity"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

# Example: GPU workload requesting a specific GPU type via the nvidia.com/gpu.name label
spec:
  nodeSelector:
    nvidia.com/gpu.name: "nvidia-tesla-t4"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

# Example: GPU workload on a custom node template, also requesting a specific GPU type
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-node-template"
    nvidia.com/gpu.name: "nvidia-tesla-p4"
  tolerations:
    - key: "scheduling.cast.ai/node-template"
      value: "gpu-node-template"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

# Example: GPU workload requiring more than 10000 MiB of GPU memory, via node affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.memory
            operator: Gt
            values:
            - "10000"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1

GPU drivers

After a node of a GPU instance type is created, it joins the cluster, but its GPU resources are not immediately usable. To make the GPUs accessible to Kubernetes, you need to install GPU drivers on the node.

GPU driver plugins take care of this. How the plugin is installed on the cluster/node varies by cloud provider and desired behavior.

CAST AI validates that the driver plugin exists on the cluster before performing any autoscaling. If it doesn't detect the plugin, it creates a pod event with details on how to solve the problem.
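
If a GPU workload stays pending because no driver plugin was detected, you can read that event with standard kubectl commands (the pod name and namespace below are placeholders):

kubectl describe pod <gpu-pod-name> -n <namespace>
# or list only the events related to that pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<gpu-pod-name>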

Driver detection

CAST AI assumes that the GPU driver plugin is installed if it finds a daemonset that matches plugin characteristics and a pod created from that daemonset can run on a node.

CAST AI supports all default GPU driver plugins that match specific name patterns. You can also mark a custom plugin as supported by adding the label nvidia-device-plugin: "true" to its daemonset (see the sketch after the list below).

A daemonset whose name matches one of the following patterns is considered a known official GPU driver plugin:

  • *nvidia-device-plugin*
  • *nvidia-gpu-device-plugin*
  • *nvidia-driver-installer*
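
If your custom plugin's daemonset name doesn't match any of these patterns, you can label it so that CAST AI detects it. A minimal sketch, assuming a hypothetical daemonset name and image:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-gpu-driver-plugin        # hypothetical name
  namespace: kube-system
  labels:
    nvidia-device-plugin: "true"    # marks this custom daemonset as a GPU driver plugin for CAST AI
spec:
  selector:
    matchLabels:
      name: my-gpu-driver-plugin
  template:
    metadata:
      labels:
        name: my-gpu-driver-plugin
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"      # run only on GPU nodes
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
      containers:
        - name: driver-plugin
          image: my-registry/gpu-driver-plugin:latest   # hypothetical image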

GPU drivers on AWS (EKS)

By default, EKS clusters come without a GPU driver plugin installed. There are several ways to add one to the cluster:

During CAST AI onboarding

CAST AI can enable GPU drivers during cluster onboarding. To install the plugin automatically, tick the checkbox in the UI or set INSTALL_NVIDIA_DEVICE_PLUGIN=true when onboarding through Terraform.

Manually installing driver plugin

Alternatively, you can install the plugin manually from the NVIDIA Helm repository.

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# noglob prevents zsh from trying to glob-expand the bracketed [0] indexes; it can be omitted in bash
noglob helm upgrade -i nvdp nvdp/nvidia-device-plugin -n castai-agent \
    --set-string nodeSelector."nvidia\.com/gpu"=true \
    --set \
tolerations[0].key=CriticalAddonsOnly,tolerations[0].operator=Exists,\
tolerations[1].effect=NoSchedule,tolerations[1].key="nvidia\.com/gpu",tolerations[1].operator=Exists,\
tolerations[2].key="scheduling\.cast\.ai/spot",tolerations[2].operator=Exists,\
tolerations[3].key="scheduling\.cast\.ai/scoped-autoscaler",tolerations[3].operator=Exists,\
tolerations[4].key="scheduling\.cast\.ai/node-template",tolerations[4].operator=Exists
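
After the release is installed, you can check that the plugin daemonset was created and that its pods run on the GPU nodes:

kubectl get daemonsets -n castai-agent
kubectl get pods -n castai-agent -o wide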

Custom GPU driver plugin

CAST AI assumes that a custom plugin has full control over the driver installation process and node management. If you wish to autoscale GPU nodes using a custom plugin, the plugin needs to be detectable by CAST AI (see Driver detection above).

GPU drivers on GCP (GKE)

By default, GKE clusters come with NVIDIA driver plugins preinstalled. If CAST AI finds the default plugins, it uses them and instructs them to install the default NVIDIA driver version based on the cluster version, GPU, and instance type.

Alternatively, you can manually install driver plugins or use custom drivers. If CAST AI finds a custom or manually installed plugin, it takes priority over the preinstalled drivers.

Manually installing driver plugin

A manually installed driver plugin is used to install the GPU drivers, while the preinstalled GPU plugin still manages both the GPU and the node. Use this command to install the drivers:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
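
Once the installer has run on a GPU node, the GPU should appear among the node's allocatable resources; a quick way to check (the node name is a placeholder):

kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"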

Custom GPU driver plugin

CAST AI assumes that a custom plugin has complete control over driver installation and over compatibility with the preinstalled plugins. To allow CAST AI to autoscale GPU nodes using a custom plugin, the plugin needs to be detectable by CAST AI.

GPU drivers on Azure (AKS)

Install the NVIDIA device plugin daemonset:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

Verify that the daemonset pods are up and running on the GPU nodes.
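
For example (the manifest above installs the daemonset into the kube-system namespace):

kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin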

You can verify the GPU by running an NVIDIA device plugin example job or an Azure GPU job.
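
For illustration, here is a minimal smoke-test job that requests one GPU and runs nvidia-smi (the job name and image tag are assumptions; use a CUDA image available to your cluster):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
      containers:
        - name: nvidia-smi
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed tag
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1

If the GPU is exposed correctly, kubectl logs job/gpu-smoke-test prints the nvidia-smi device table.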