GPU Instances
Configure Cast AI Autoscaler to scale your cluster using GPU-optimized instances across AWS EKS, GCP GKE, and Azure AKS with NVIDIA GPU support.
Autoscaling using GPU instances
The Cast AI Autoscaler can scale the cluster using GPU-optimized instances. This guide describes the steps needed to configure the cluster so that GPU nodes can join it.
Supported providers
| Provider | GPUs supported |
|---|---|
| AWS EKS | NVIDIA |
| GCP GKE | NVIDIA |
| Azure AKS * | NVIDIA |
* - Please contact Cast AI support to enable this feature for your organization.
How does it work?
Once activated, Cast AI's Autoscaler detects workloads requiring GPU resources and starts provisioning them.
To enable the provisioning of GPU nodes, you need a few things:
- Choose a GPU instance type or attach a GPU to the instance type
- Install the NVIDIA device plugin (needed for EKS and AKS)
- Expose the GPU to Kubernetes as a consumable resource
Cast AI ensures that the correct GPU instance type is selected; all you have to do is define GPU resources and add a GPU toleration or a node template toleration. You can also target specific GPU characteristics using node selectors or affinities on the GPU labels.
| Label | Value Example | Description |
|---|---|---|
| nvidia.com/gpu | true | Node has an NVIDIA GPU attached |
| nvidia.com/gpu.name | nvidia-tesla-t4 | Attached GPU type |
| nvidia.com/gpu.count | 1 | Attached GPU count |
| nvidia.com/gpu.memory | 15258 | Available memory of a single GPU, in MiB |
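To check which of these labels a node carries, you can list them with kubectl; the `-L` flag prints the given label values as extra columns:

```shell
# Show GPU-related labels for all nodes in the cluster
kubectl get nodes -L nvidia.com/gpu,nvidia.com/gpu.name,nvidia.com/gpu.count,nvidia.com/gpu.memory
```

Nodes without a GPU simply show empty values in those columns.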
Tainting of GPU nodes
GPU nodes will have the nvidia.com/gpu=true:NoSchedule taint applied automatically, except when the Node Template has GPU constraints configured or limits instance types to GPU instances only. In such cases, the taint is not applied, since the template is specifically designed for GPU workloads only.
GPU sharing
Cast AI supports GPU sharing to enable multiple workloads to utilize GPU resources more efficiently. Multiple sharing methods are available, each optimized for different use cases:
- Time-slicing - Share GPUs through rapid context switching
- Multi-Instance GPU (MIG) - Partition GPUs with hardware-level isolation
- Fractional GPUs - Pre-partitioned GPU portions from AWS (1/8 to full GPU)
For a detailed comparison and guidance on choosing the right method, see the GPU sharing overview.
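As a sketch of the first method, time-slicing is configured through the NVIDIA device plugin's sharing configuration. The replica count below is an arbitrary example; the Cast AI-specific setup is covered in the GPU sharing documentation:

```yaml
# NVIDIA device plugin config sketch: advertise each physical GPU
# as 4 schedulable nvidia.com/gpu resources via time-slicing
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```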
NVIDIA device plugin
After creating a node of an instance type with a GPU, the node becomes part of the cluster, but GPU resources are not immediately usable. To make GPUs accessible to Kubernetes, you must install the NVIDIA device plugin on the node.
Cast AI validates that the NVIDIA device plugin exists on a cluster before performing any kind of autoscaling. If it doesn't detect the NVIDIA device plugin, it creates a pod event with details on how to resolve the problem.
NVIDIA device plugin detection
Cast AI assumes that the NVIDIA device plugin is installed if it finds a DaemonSet that matches the plugin's characteristics, and a pod created from that DaemonSet can run on a node.
Cast AI supports all default NVIDIA device plugins that match specific name patterns. It also allows tagging custom plugins as supported with the label nvidia-device-plugin: "true".
A DaemonSet whose name matches one of the following patterns is considered a known official NVIDIA device plugin:
- `*nvidia-device-plugin*`
- `*nvidia-gpu-device-plugin*`
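For example, a custom plugin can be tagged as supported by adding that label to its DaemonSet metadata; the name and namespace below are illustrative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-custom-gpu-plugin      # illustrative name
  namespace: kube-system          # illustrative namespace
  labels:
    nvidia-device-plugin: "true"  # marks the plugin as supported for Cast AI detection
# spec omitted for brevity
```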
NVIDIA device plugin requires successful pod scheduling
Cast AI considers an NVIDIA device plugin present only if a pod from the matching DaemonSet can actually run on the GPU node. A DaemonSet that exists in the cluster but whose pods cannot be scheduled onto a node (for example, because the node has a taint that the DaemonSet does not tolerate) will not satisfy NVIDIA device plugin detection. This is a common source of the `NVIDIA Device Plugin is required` error message.

If your Node Template includes custom taints, the NVIDIA device plugin DaemonSet must have a matching toleration for each of those taints. Without it, the device plugin pod will not be scheduled on newly provisioned GPU nodes, and Cast AI will report a detection failure even if the plugin is otherwise correctly installed.

If you see pod events with the `NVIDIA Device Plugin is required` error message, check that the NVIDIA device plugin DaemonSet has tolerations for:
- `nvidia.com/gpu:NoSchedule` (applied automatically by Cast AI on most GPU nodes)
- All Cast AI system taints: `scheduling.cast.ai/spot`, `scheduling.cast.ai/scoped-autoscaler`, `scheduling.cast.ai/node-template`
- Any custom taints defined in your Node Templates

Example: if your Node Template has the taint `team: ml-workloads:NoSchedule`, add a specific toleration:

```yaml
tolerations:
  - key: "team"
    operator: "Equal"
    value: "ml-workloads"
    effect: "NoSchedule"
```

Alternatively, you can use a wildcard toleration that matches all taints:

```yaml
tolerations:
  - operator: Exists # Tolerates ALL taints
```

Note that the wildcard is permissive: it allows the pod to be scheduled on any node regardless of its taints. Use specific tolerations in production where possible.
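To verify which tolerations your device plugin DaemonSet actually carries, you can inspect its pod template directly; adjust the DaemonSet name and namespace to match your installation:

```shell
# Print the tolerations from the device plugin's pod template
kubectl get daemonset nvidia-device-plugin -n kube-system \
  -o jsonpath='{.spec.template.spec.tolerations}'
```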
Installing the NVIDIA device plugin
By default, EKS and AKS clusters do not come with an NVIDIA device plugin installed. There are several ways to add an NVIDIA device plugin to a cluster:
GKE clusters come with the NVIDIA device plugin pre-installed by default. No manual installation is required unless you want to use a custom NVIDIA device plugin.
During Cast AI onboarding (only for EKS)
Cast AI allows you to enable the GPU device plugin at the cluster onboarding stage. By ticking the checkbox in the UI, you can install plugins automatically during onboarding.
Manually installing NVIDIA device plugin
Alternatively, you can manually install the plugin from the NVIDIA Helm repository.
```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
```

Install the plugin using a values file. Copy the command below, which pipes the values directly via stdin:

```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace castai-agent \
  --create-namespace \
  -f - <<'EOF'
nodeSelector:
  nvidia.com/gpu: "true"
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: scheduling.cast.ai/spot
    operator: Exists
  - key: scheduling.cast.ai/scoped-autoscaler
    operator: Exists
  - key: scheduling.cast.ai/node-template
    operator: Exists
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Discrete GPU nodes (PCI vendor ID 10de = NVIDIA)
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
            - key: nvidia.com/gpu.dra
              operator: NotIn
              values:
                - "true"
        # Tegra-based systems (CPU vendor NVIDIA)
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - "NVIDIA"
            - key: nvidia.com/gpu.dra
              operator: NotIn
              values:
                - "true"
        # Manually labeled GPU nodes
        - matchExpressions:
            - key: nvidia.com/gpu.present
              operator: In
              values:
                - "true"
            - key: nvidia.com/gpu.dra
              operator: NotIn
              values:
                - "true"
EOF
```

Custom GPU device plugin
Cast AI assumes a custom plugin controls the driver installation process and node management. If you wish to autoscale GPU nodes using a custom plugin, it must be detectable by Cast AI.
NVIDIA drivers and Amazon Machine Images (AMIs)
The NVIDIA device plugin requires that NVIDIA drivers and the nvidia-container-toolkit already exist on the machine; otherwise, it will fail to start or to expose GPU resources properly. By default, Cast AI detects GPU-enabled nodes and uses Amazon's EKS-optimized, GPU-enabled AMI, which already bundles these prerequisites. However, if custom AMIs are used, the installation of these prerequisites must be included in the AMI build process or in the node's user data scripts.
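As an illustration only, the container-toolkit half of that prerequisite can be handled in user data on an RPM-based image by following NVIDIA's documented repository setup. The driver installation step is distribution- and driver-version-specific and is intentionally not shown:

```shell
# Sketch for an RPM-based custom AMI (e.g. Amazon Linux):
# register NVIDIA's container toolkit repository and install the toolkit.
# NVIDIA driver installation is omitted; it depends on the distribution
# and on the driver version your GPUs require.
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
```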
Amazon Linux 2023 AMI
Amazon Linux 2023 (AL2023) AMIs support GPU time-slicing. When using AL2023 for GPU nodes with time-slicing enabled, Cast AI automatically configures the required NVIDIA device plugin settings.
See the GPU sharing with time-slicing documentation for detailed AL2023 configuration instructions.
Bottlerocket AMI considerations
Bottlerocket AMIs come with a pre-installed NVIDIA device plugin. Do not install additional NVIDIA device plugins on clusters where you plan to use Bottlerocket GPU nodes, as they may conflict with the pre-installed plugin and fail to start. If you must run another device plugin in the same cluster, ensure its DaemonSet does not schedule onto Bottlerocket nodes, for example by not tolerating those nodes' taints or by using node affinity.
When using Bottlerocket for GPU nodes, the OS handles driver management automatically. For more information, see Bottlerocket support for NVIDIA GPUs.
See Node Configuration documentation for more details on AMI choice.
Known issues
NVIDIA dropped support for Kepler-architecture GPUs after driver version 470. Since Cast AI uses AMIs that bundle newer driver versions, these AMIs cannot be used with instances that carry such GPUs (P2 instance types). To use those instance types, set an older GPU-enabled AMI or a custom AMI in the node configuration.
Workload configuration examples
The following examples show common patterns for configuring GPU workloads. For GPU sharing-specific configurations, see the time-slicing and MIG documentation.
Pod requesting GPU resources:

```yaml
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
```

Pod targeting a GPU node template:

```yaml
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-node-template"
  tolerations:
    - key: "scheduling.cast.ai/node-template"
      value: "gpu-node-template"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
```

Pod requesting a specific GPU type:

```yaml
spec:
  nodeSelector:
    nvidia.com/gpu.name: "nvidia-tesla-t4"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
```

Pod combining a node template with a specific GPU type:

```yaml
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-node-template"
    nvidia.com/gpu.name: "nvidia-tesla-p4"
  tolerations:
    - key: "scheduling.cast.ai/node-template"
      value: "gpu-node-template"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
```

Pod requiring a minimum amount of GPU memory:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.memory
                operator: Gt
                values:
                  - "10000"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: Exists
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
```
