Troubleshooting node autoscaling
Solutions for resolving node provisioning, pod scheduling, and cloud provider-specific issues with the Cast AI autoscaler.
EKS: Max pod count on AWS CNI
Some AWS instance types have a surprisingly low maximum pod count, e.g., 58 for c6g.2xlarge. The complete list of ENI limits per instance type is available in eni-max-pods.txt.
This can be mitigated in two ways:
- Setting Min CPU constraints in Node Templates to 16 CPUs, as the issue only exists on nodes with fewer CPUs (e.g., 8-CPU nodes)
- Increasing the pods-per-node limit by enabling prefix delegation. You can do this by executing the following within your cluster context:

```shell
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```

For details, see the AWS documentation: Amazon VPC CNI plugin increases pods per node limits, and Increase the amount of available IP addresses for your Amazon EC2 nodes.
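The 58-pod figure above follows from the AWS VPC CNI pod-limit formula. As a sketch (ENI and IP counts for c6g.2xlarge taken from eni-max-pods.txt; the 110-pod cap is the AWS recommendation for instances with fewer than 30 vCPUs, so treat these exact numbers as assumptions):

```shell
# c6g.2xlarge: 4 ENIs, 15 IPv4 addresses per ENI (per eni-max-pods.txt)
enis=4
ips_per_eni=15

# Default mode: each pod gets one secondary IP; each ENI reserves its primary IP,
# and 2 is added for host-networking pods
max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "default max pods: $max_pods"   # 58

# With ENABLE_PREFIX_DELEGATION=true, each secondary IP slot holds a /28 prefix (16 IPs)
prefix_pods=$(( enis * (ips_per_eni - 1) * 16 + 2 ))
capped=$(( prefix_pods > 110 ? 110 : prefix_pods ))
echo "with prefix delegation (capped): $capped"   # 110
```

Enabling prefix delegation therefore lifts the effective limit on this instance type from 58 to the recommended 110-pod ceiling.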
EKS: Missing instance profile IAM role from aws-auth ConfigMap
For clusters that use the ConfigMap access mode, nodes require an entry in the `aws-auth` ConfigMap with the IAM instance profile role to access the cluster properly. If nodes remain unhealthy because the kubelet cannot reach the API server, check whether the role is present.
Sharing the instance profile and role for Cast AI-managed nodes with EKS-managed node groups is not recommended. This is because deleting all managed node groups that use an instance profile removes the instance profile role from aws-auth, which can break Cast AI-managed nodes that utilize the role. The observed symptom is nodes becoming NotReady shortly after, with kubelet receiving unauthorized errors when accessing the api-server.
To resolve the problem after it appears:
- For clusters using the `ConfigMap` access mode, add the role back to `aws-auth`.
- For clusters using the `EKS API and ConfigMap` access mode, adding an access entry either in the EKS access entry list or in the `aws-auth` ConfigMap is sufficient.
- Clusters using only the `EKS API` access mode should not be affected, as EKS does not delete entries in the EKS API.
Sample aws-auth entry:
```yaml
- "groups":
    - "system:bootstrappers"
    - "system:nodes"
  "rolearn": "arn:aws:iam::account-id:role/instance-profile-role"
  "username": "system:node:{{EC2PrivateDNSName}}"
```
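For reference, a complete `aws-auth` ConfigMap containing such an entry might look like the following sketch (the account ID and role name are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - "groups":
        - "system:bootstrappers"
        - "system:nodes"
      "rolearn": "arn:aws:iam::111122223333:role/instance-profile-role"
      "username": "system:node:{{EC2PrivateDNSName}}"
```

You can inspect the live object with `kubectl get configmap aws-auth -n kube-system -o yaml`.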
EKS: Warm pool nodes are not supported
Cast AI does not support deleting nodes that were provisioned from AWS autoscaling warm pools. Attempting to delete warm pool nodes through the Cast AI console will result in a deletion failure with an error message.
What are warm pools? AWS autoscaling warm pools are pre-initialized EC2 instances that remain in a stopped or hibernated state to reduce scale-out latency when scaling your cluster.
Resolution: Warm pool nodes must be managed directly through the AWS console or CLI rather than through Cast AI. If you need to remove warm pool nodes, use AWS tools to manage the autoscaling group configuration.
Related: If you're using warm pools and experiencing node management issues, consider using standard EKS managed node groups or self-managed node groups without warm pools for nodes that Cast AI will manage.
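If you decide to remove a warm pool entirely, a sketch using the AWS CLI (the Auto Scaling group name is a placeholder; run against your own account and group):

```shell
# Inspect the warm pool attached to the Auto Scaling group
aws autoscaling describe-warm-pool --auto-scaling-group-name my-node-group-asg

# Delete the warm pool; --force-delete also terminates its pre-initialized instances
aws autoscaling delete-warm-pool --auto-scaling-group-name my-node-group-asg --force-delete
```

Without `--force-delete`, the warm pool is drained gracefully before removal.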
GKE: Pod startup failures with PD Standard on C3/C3D nodes
Pods fail to start when PersistentVolumeClaims using pd-standard storage are scheduled onto Nodes running C3 or C3D machine types. These instance families do not support pd-standard disks, which prevents volume attachment.
Symptoms
Pods remain in ContainerCreating state and display this warning:
```
Warning  FailedAttachVolume  attachdetach-controller
AttachVolume.Attach failed for volume "pvc-example" :
rpc error: code = InvalidArgument desc = Failed to Attach:
failed cloud service attach disk call: googleapi:
Error 400: pd-standard disk type cannot be used by c3d-standard-90 machine type., badRequest
```

Root cause
When your default StorageClass is configured with volumeBindingMode: Immediate and type: pd-standard, Kubernetes provisions the PersistentVolume (PV) before it schedules the Pod. This creates a timing issue:
1. Kubernetes creates the PV as `pd-standard`
2. The scheduler places the Pod on an available Node (potentially C3/C3D)
3. The Node attempts to attach the volume, but cannot support `pd-standard`
4. Volume attachment fails, and the Pod cannot start
C3 and C3D machine types support only pd-balanced, pd-ssd, and pd-extreme disk types.
Solution
Prevent Pods requiring pd-standard storage from scheduling onto incompatible Nodes by implementing both configuration changes below.
1. Update StorageClass volume binding mode
Change your StorageClass to use WaitForFirstConsumer. This ensures Kubernetes provisions the PV only after the Pod is scheduled, allowing the scheduler to consider Node constraints, including disk type compatibility. Note that volumeBindingMode is immutable on an existing StorageClass, so you may need to delete and recreate the StorageClass to apply this change.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
```

2. Add Node Selector to Pods using PD Standard
Add this nodeSelector to Pod specifications that use pd-standard volumes:

```yaml
nodeSelector:
  volume.scheduling.cast.ai/pd-standard: "true"
```

This ensures these Pods only schedule onto NodePools that support pd-standard disks.
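Put together, a minimal sketch of a PVC and Pod using this selector (the names, image, and claim size are illustrative; the `standard` StorageClass is the one defined above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc              # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard  # pd-standard class with WaitForFirstConsumer
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  nodeSelector:
    volume.scheduling.cast.ai/pd-standard: "true"
  containers:
    - name: app
      image: nginx            # illustrative image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
```

With this spec, the scheduler first picks a Node matching the selector, and only then is the pd-standard volume provisioned and attached.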
Why both changes are required
| Configuration | Purpose |
|---|---|
| `WaitForFirstConsumer` | Delays PV provisioning until Kubernetes selects a Node, enabling the scheduler to evaluate disk type compatibility |
| Node selector | Restricts Pod placement to Nodes that support pd-standard, preventing scheduling onto C3/C3D machine types |
With this configuration:
- Pods using PD Standard Volumes schedule only on compatible Node types
- Kubernetes provisions volumes after selecting a valid Node
- Cast AI's autoscaler respects these constraints when placing workloads
- Volume attachment failures are prevented
For details on disk type compatibility with GCP machine types, see Google Cloud documentation on persistent disk types.
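To verify the resulting configuration in your cluster, you can list your StorageClasses with their binding modes and disk types, for example:

```shell
# Show each StorageClass with its volumeBindingMode and GCE disk type
kubectl get storageclass \
  -o custom-columns=NAME:.metadata.name,BINDING:.volumeBindingMode,TYPE:.parameters.type
```

Classes backing pd-standard volumes should report `WaitForFirstConsumer` in the BINDING column.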