Dynamic Resource Allocation (DRA)
Configure Cast AI Autoscaler to use Kubernetes Dynamic Resource Allocation (DRA) for flexible resource management
Cast AI supports Dynamic Resource Allocation (DRA), a powerful feature that enables flexible resource allocation for hardware accelerators like GPUs. DRA provides more sophisticated resource management compared to traditional device plugin approaches, with support for resource sharing, fine-grained device selection, and simplified workload configuration.
What is Dynamic Resource Allocation?
Dynamic Resource Allocation is a Kubernetes feature that enables pods to request and share specialized hardware resources in a flexible manner. DRA works through a driver-based architecture where:
- DRA device drivers advertise available hardware resources to the cluster
- DeviceClasses categorize and define resource types (like GPUs) that can be allocated
- ResourceClaims/ResourceClaimTemplates represent specific resource requests from workloads
- ResourceSlices advertise available resources from nodes to the API server
Unlike traditional device plugins that require explicit resource quantities in each container, DRA allows:
- Flexible device selection using attribute-based filtering
- Easy resource sharing between containers and pods
- Simplified pod specifications with centralized resource definitions
- Better resource visibility through structured resource advertisement
For more information on DRA architecture, see the Kubernetes DRA documentation.
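To make these building blocks concrete, the sketch below shows roughly what a DeviceClass object looks like. This is an illustration only; in practice the DeviceClass is created for you by the DRA driver (for NVIDIA GPUs it is named gpu.nvidia.com, as used in the examples later in this guide), so you normally never write one yourself:
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpu-class
spec:
  selectors:
    # Match every device advertised by a particular DRA driver
    - cel:
        expression: device.driver == 'gpu.example.com'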
Supported providers and drivers
| Provider | DRA Support | Supported Drivers | Notes |
|---|---|---|---|
| AWS EKS | ✓ | NVIDIA GPU | Kubernetes 1.34+ required. Bottlerocket not yet supported |
| GCP GKE | ✓ | NVIDIA GPU | Kubernetes 1.34+ required |
| Azure AKS | Coming soon | - | Contact Cast AI support for updates |
Prerequisites for NVIDIA GPU DRA
Before using DRA with Cast AI, ensure you have:
Required versions:
- Kubernetes version: 1.34 or higher
- Cast AI agent: v0.109.0 or higher (helm chart 0.132.0+)
- Cluster: GKE or EKS cluster connected to Cast AI
Tools required:
- kubectl CLI
- helm CLI (for driver installation)
You can verify your Kubernetes version:
kubectl version
Expected output should show a server version of v1.34.x or higher.
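You can also confirm that the DRA API group is served by your cluster (it is enabled by default in Kubernetes 1.34, where DRA is generally available); this is one way to check:
kubectl api-versions | grep resource.k8s.io
The output should include resource.k8s.io/v1.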
Installing NVIDIA DRA driver
DRA requires a driver to be installed in your cluster to manage resource allocation. NVIDIA provides a DRA driver for GPUs.
Important: The NVIDIA device plugin and the NVIDIA DRA kubelet plugin cannot run on the same GPU nodes. If you have an NVIDIA device plugin installed, you must update its configuration to exclude nodes with the nvidia.com/gpu.dra=true label before installing the DRA driver.
Note for GKE users: This is not required for GKE clusters using the default NVIDIA device plugin. Cast AI automatically creates DRA GPU nodes with the gke-no-default-nvidia-gpu-device-plugin label, which prevents the default NVIDIA device plugin from running on those nodes.
See the updating existing NVIDIA device plugin section below for detailed configuration instructions.
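If you are unsure whether a separately installed NVIDIA device plugin is present, one way to check is to look for its DaemonSet (the DaemonSet name varies by installation method, so adjust the filter if needed):
kubectl get daemonsets -A | grep -i nvidia-device-plugin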
Updating existing NVIDIA device plugin
If you have an existing NVIDIA device plugin installed in your cluster (common in EKS clusters), you must update its affinity configuration to prevent it from scheduling on DRA-enabled GPU nodes.
The NVIDIA device plugin and NVIDIA DRA kubelet plugin cannot coexist on the same GPU nodes. Add the following affinity configuration to your existing NVIDIA device plugin installation to exclude nodes with the nvidia.com/gpu.dra=true label:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
# Discrete GPU label
- key: feature.node.kubernetes.io/pci-10de.present
operator: In
values:
- "true"
# Exclude DRA GPU nodes
- key: nvidia.com/gpu.dra
operator: NotIn
values:
- "true"
- matchExpressions:
# Tegra / CPU vendor NVIDIA
- key: feature.node.kubernetes.io/cpu-model.vendor_id
operator: In
values:
- "NVIDIA"
# Exclude DRA GPU nodes
- key: nvidia.com/gpu.dra
operator: NotIn
values:
- "true"
- matchExpressions:
# Forced GPU label
- key: "nvidia.com/gpu.present"
operator: In
values:
- "true"
# Exclude DRA GPU nodes
- key: nvidia.com/gpu.dra
operator: NotIn
values:
- "true"Installing NVIDIA DRA driver on GKE
The following installation process:
- Installs GPU drivers on nodes via DaemonSet
- Creates the nvidia namespace for DRA driver components
- Applies ResourceQuota
- Installs NVIDIA DRA driver via Helm (controller + kubelet plugin)
#!/bin/bash
# Install GPU drivers (GKE documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
# Install NVIDIA DRA driver
kubectl create namespace nvidia
kubectl apply -n nvidia -f - << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
name: nvidia
spec:
hard:
pods: 100
scopeSelector:
matchExpressions:
- operator: In
scopeName: PriorityClass
values:
- system-node-critical
- system-cluster-critical
EOF
cat <<EOF > dra_values.yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /home/kubernetes/bin/nvidia/
controller:
affinity: null
kubeletPlugin:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
nvidia.com/gpu.present: "true"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "nvidia.com/gpu.present"
operator: "Exists"
- key: nvidia.com/gpu.dra
operator: In
values:
- "true"
EOF
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.8.0" \
--namespace nvidia \
-f dra_values.yaml

Installing NVIDIA DRA driver on EKS
The following installation process:
- Creates the nvidia namespace for DRA driver components
- Applies ResourceQuota
- Installs NVIDIA DRA driver
#!/bin/bash
kubectl create ns nvidia
kubectl apply -n nvidia -f - << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
name: nvidia
spec:
hard:
pods: 100
scopeSelector:
matchExpressions:
- operator: In
scopeName: PriorityClass
values:
- system-node-critical
- system-cluster-critical
EOF
cat <<EOF > dra_values.yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
controller:
affinity: null
kubeletPlugin:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
nvidia.com/gpu.present: "true"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "nvidia.com/gpu.present"
operator: "Exists"
- key: nvidia.com/gpu.dra
operator: In
values:
- "true"
EOF
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.8.0" \
--namespace nvidia \
-f dra_values.yaml

Verifying DRA driver installation
After installation, verify the DRA driver is functioning:
Check DeviceClasses:
kubectl get DeviceClasses
Expected output:
NAME AGE
gpu.nvidia.com Xs
Check DRA driver pods:
kubectl get pods -n nvidia
You should see running pods including:
- nvidia-dra-driver-gpu-controller-* (controller pod)
- nvidia-dra-driver-gpu-kubelet-plugin-* (kubelet plugin DaemonSet)
View controller logs (optional):
kubectl logs -n nvidia deployment/nvidia-dra-driver-gpu-controller

Configuring workloads with DRA
DRA workloads use ResourceClaims and ResourceClaimTemplates to request access to hardware resources. Both reference a DeviceClass and specify the device requirements: a ResourceClaim is a single claim that pods can share, while a ResourceClaimTemplate generates a separate claim for each pod that uses it.
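The examples in this section use ResourceClaims. For comparison, a minimal ResourceClaimTemplate carrying the same request could look like the sketch below; the namespace and name are illustrative, not Cast AI defaults:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-workload
  name: single-gpu-template
spec:
  spec:
    devices:
      requests:
        - name: gpu
          firstAvailable:
            - name: nvidia-gpu
              deviceClassName: gpu.nvidia.com
A pod references it through resourceClaimTemplateName: single-gpu-template instead of resourceClaimName, so each replica receives its own dedicated claim rather than sharing one.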
Supported GPU attributes for DRA
Cast AI supports the following GPU attributes for use with DRA ResourceClaims, allowing you to specify precise GPU requirements:
- productName - Specific GPU model
- device count - Number of GPU devices required
- architecture - GPU architecture (e.g., "Ampere", "Hopper")
- brand - GPU brand classification (e.g., "Tesla", "Nvidia")
- type - GPU type category
- memory - GPU memory capacity
These attributes can be used in ResourceClaim selectors to target specific GPU characteristics for your workloads.
Note: Multi-Instance GPU (MIG) is not currently supported with DRA.
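As a sketch of how these attributes combine with a device count, the claim below requests two GPUs of a specific model; the namespace, claim name, and productName value are illustrative examples, not values Cast AI requires:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  namespace: gpu-workload
  name: two-specific-gpus
spec:
  devices:
    requests:
      - name: gpus
        firstAvailable:
          - name: nvidia-gpu
            deviceClassName: gpu.nvidia.com
            count: 2
            selectors:
              - cel:
                  expression: device.attributes['gpu.nvidia.com'].productName == 'NVIDIA L4'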
Basic workload configuration
Here's a simple example with one deployment requesting dedicated GPU access:
---
apiVersion: v1
kind: Namespace
metadata:
name: gpu-workload
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
namespace: gpu-workload
name: single-gpu
spec:
devices:
requests:
- name: gpu
firstAvailable:
- name: nvidia-gpu
deviceClassName: gpu.nvidia.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: gpu-workload
name: gpu-deployment
labels:
app: gpu-deployment
spec:
replicas: 1
selector:
matchLabels:
app: gpu-deployment
template:
metadata:
labels:
app: gpu-deployment
spec:
containers:
- name: gpu-container
image: ubuntu:22.04
command: ["bash", "-c"]
args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
resources:
claims:
- name: gpu
resourceClaims:
- name: gpu
resourceClaimName: single-gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"GPU sharing across multiple pods
DRA also allows multiple pods to share the same GPU by referencing a common ResourceClaim:
---
apiVersion: v1
kind: Namespace
metadata:
name: multi-pod-shared
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
namespace: multi-pod-shared
name: global-shared-gpu
spec:
devices:
requests:
- name: gpu
firstAvailable:
- name: nvidia-gpu
deviceClassName: gpu.nvidia.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: multi-pod-shared
name: gpu-deployment
spec:
replicas: 1
selector:
matchLabels:
app: gpu-deployment
template:
metadata:
labels:
app: gpu-deployment
spec:
containers:
- name: container
image: ubuntu:22.04
command: ["bash", "-c"]
args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
resources:
claims:
- name: shared-gpu
resourceClaims:
- name: shared-gpu
resourceClaimName: global-shared-gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
---
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: multi-pod-shared
name: deployment-2
spec:
replicas: 1
selector:
matchLabels:
app: deployment-2
template:
metadata:
labels:
app: deployment-2
spec:
containers:
- name: container
image: ubuntu:22.04
command: ["bash", "-c"]
args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
resources:
claims:
- name: shared-gpu
resourceClaims:
- name: shared-gpu
resourceClaimName: global-shared-gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"Advanced GPU selection using CEL expressions
DRA supports using CEL (Common Expression Language) expressions in ResourceClaims and ResourceClaimTemplates to filter GPUs based on multiple attributes simultaneously. This example demonstrates selecting GPUs by brand, architecture, and memory capacity:
---
apiVersion: v1
kind: Namespace
metadata:
name: gpu-cel-example
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
namespace: gpu-cel-example
name: gpu-cel-filter
spec:
devices:
requests:
- name: gpu
firstAvailable:
- name: nvidia-gpu
deviceClassName: gpu.nvidia.com
selectors:
# Filter by brand, architecture, and memory using combined CEL expression
- cel:
expression: |
device.attributes['gpu.nvidia.com'].brand == 'Tesla' &&
device.attributes['gpu.nvidia.com'].architecture == 'Volta' &&
device.capacity['gpu.nvidia.com'].memory.isGreaterThan(quantity('10Gi'))
---
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: gpu-cel-example
name: gpu-workload
labels:
app: gpu-cel-example
spec:
replicas: 1
selector:
matchLabels:
app: gpu-cel-example
template:
metadata:
labels:
app: gpu-cel-example
spec:
containers:
- name: gpu-workload
image: ubuntu:22.04
command: ["bash", "-c"]
args:
- |
echo "GPU allocated using CEL expression filter:"
nvidia-smi -L
nvidia-smi --query-gpu=name,architecture,memory.total,uuid --format=csv
sleep 3600
resources:
claims:
- name: gpu-claim
resourceClaims:
- name: gpu-claim
resourceClaimName: gpu-cel-filter
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"Monitoring and verification
Watching Cast AI provision GPU nodes
After deploying DRA workloads, monitor as Cast AI detects requirements and provisions nodes:
# Watch pod status
kubectl get pods -n <namespace> -w
# Watch for new GPU nodes
kubectl get nodes -w

Verifying ResourceSlices
After GPU nodes join the cluster, verify ResourceSlices are created:
kubectl get ResourceSlices
You should see ResourceSlice objects advertising available GPUs from provisioned nodes.
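It can also be useful to confirm that the ResourceClaims created by your workloads have been allocated; a simple check (use whichever namespace your workload runs in):
kubectl get resourceclaims -n <namespace>
If a claim never gets allocated, see the troubleshooting steps below.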
Troubleshooting
Pods stuck in Pending state
Check pod events:
kubectl describe pod <pod-name> -n <namespace>
Common issues:
- Cast AI is still provisioning GPU nodes (wait a few minutes)
- DRA driver not installed or not functioning
- Insufficient quota for GPU instances in your cloud provider
- DeviceClass not found or misconfigured
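Inspecting the claim referenced by the pod can also help narrow this down; a sketch (substitute your claim's name and namespace):
kubectl describe resourceclaim <claim-name> -n <namespace>
A claim that remains unallocated while GPU nodes exist usually points at the DRA driver not publishing ResourceSlices, or at selectors that no advertised device matches.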
Cast AI not provisioning GPU nodes
- Verify Cast AI autoscaler is enabled:
kubectl get deployment -n castai-agent
- Check Cast AI agent logs:
kubectl logs -n castai-agent deployment/castai-agent
- Ensure your Cast AI configuration includes GPU instance types in allowed instance families
ResourceSlices not appearing
- Verify GPU nodes for NVIDIA DRA are running:
kubectl get nodes -l nvidia.com/gpu.dra=true
- Check DRA kubelet plugin is running on GPU nodes:
kubectl get pods -n nvidia -o wide
- Check kubelet plugin logs:
kubectl logs -n nvidia <nvidia-dra-driver-gpu-kubelet-plugin-pod>

DeviceClass not found
Verify the DRA driver created the DeviceClass:
kubectl get DeviceClasses
If no DeviceClasses exist, reinstall the DRA driver and check controller logs for errors.