Troubleshooting node autoscaling

Solutions for resolving node provisioning, pod scheduling, and cloud provider-specific issues with the Cast AI autoscaler.

EKS: Max pod count on AWS CNI

Some AWS instance types come with a low upper limit on max pod count, e.g., 58 for c6g.2xlarge. The complete list of ENI-based limits per instance type is available in eni-max-pods.txt.

This can be mitigated in two ways:

  1. Setting a Min CPU constraint of 16 CPUs in Node Templates, as the issue only affects nodes with fewer CPUs (e.g., 8-CPU nodes)
  2. Increasing the pods-per-node limit within your cluster context
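The specific command for raising the limit is not shown here. As an illustrative sketch, one widely used approach on the AWS VPC CNI is enabling prefix delegation on the aws-node DaemonSet (an assumption, not necessarily the original guidance; only nodes launched after the change, with a correspondingly raised kubelet --max-pods, benefit from it):

```shell
# Enable VPC CNI prefix delegation so each ENI IP slot serves a /28 prefix,
# raising the achievable pods-per-node ceiling on Nitro-based instances.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# Verify the environment variable is now set on the DaemonSet.
kubectl describe daemonset aws-node -n kube-system | grep ENABLE_PREFIX_DELEGATION
```

Note that existing nodes keep their original pod limit; the higher ceiling applies to nodes provisioned after the change.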

EKS: Missing instance profile IAM role from aws-auth ConfigMap

For clusters that utilize ConfigMap access mode, nodes require an entry in aws-auth ConfigMap with the IAM instance profile role to properly access the cluster. If nodes remain unhealthy because kubelet cannot access the api-server, check if the role is present.

Sharing the instance profile and role for Cast AI-managed nodes with EKS-managed node groups is not recommended. This is because deleting all managed node groups that use an instance profile removes the instance profile role from aws-auth, which can break Cast AI-managed nodes that utilize the role. The observed symptom is nodes becoming NotReady shortly after, with kubelet receiving unauthorized errors when accessing the api-server.

To resolve the problem after it appears:

  • For clusters utilizing ConfigMap access mode, add the role back to the aws-auth ConfigMap.
  • For clusters utilizing EKS API and ConfigMap access mode, adding an access entry either in the EKS access entry list or in the aws-auth ConfigMap is sufficient.
  • Clusters utilizing only the EKS API access mode should not be affected, as EKS does not delete EKS API access entries.

Sample aws-auth entry:

- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "arn:aws:iam::account-id:role/instance-profile-role"
  "username": "system:node:{{EC2PrivateDNSName}}"
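To restore an entry like the one above, the ConfigMap can be edited in place (a generic sketch; substitute your actual instance profile role ARN under the mapRoles section):

```shell
# Open the aws-auth ConfigMap for editing and re-add the missing
# instance profile role entry under mapRoles.
kubectl edit configmap aws-auth -n kube-system
```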

EKS: Warm pool nodes are not supported

Cast AI does not support deleting nodes that were provisioned from AWS autoscaling warm pools. Attempting to delete warm pool nodes through the Cast AI console will result in a deletion failure with an error message.

What are warm pools? AWS autoscaling warm pools are pre-initialized EC2 instances that remain in a stopped or hibernated state to reduce scale-out latency when scaling your cluster.

Resolution: Warm pool nodes must be managed directly through the AWS console or CLI rather than through Cast AI. If you need to remove warm pool nodes, use AWS tools to manage the autoscaling group configuration.
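For example, the warm pool itself can be inspected and removed with the AWS CLI (a sketch assuming an Auto Scaling group named my-asg):

```shell
# Inspect the warm pool attached to the Auto Scaling group.
aws autoscaling describe-warm-pool --auto-scaling-group-name my-asg

# Remove the warm pool; --force-delete also terminates its instances.
aws autoscaling delete-warm-pool --auto-scaling-group-name my-asg --force-delete
```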

Related: If you're using warm pools and experiencing node management issues, consider using standard EKS managed node groups or self-managed node groups without warm pools for nodes that Cast AI will manage.


GKE: Pod startup failures with PD Standard on C3/C3D nodes

Pods fail to start when PersistentVolumeClaims using pd-standard storage are scheduled onto Nodes running C3 or C3D machine types. These instance families do not support pd-standard disks, which prevents volume attachment.

Symptoms

Pods remain in ContainerCreating state and display this warning:

Warning  FailedAttachVolume  attachdetach-controller  
AttachVolume.Attach failed for volume "pvc-example" : 
rpc error: code = InvalidArgument desc = Failed to Attach: 
failed cloud service attach disk call: googleapi: 
Error 400: pd-standard disk type cannot be used by c3d-standard-90 machine type., badRequest

Root cause

When your default StorageClass is configured with volumeBindingMode: Immediate and type: pd-standard, Kubernetes provisions the PersistentVolume (PV) before it schedules the Pod. This creates a timing issue:

  1. Kubernetes creates the PV as pd-standard
  2. The scheduler places the Pod on an available Node (potentially C3/C3D)
  3. The Node attempts to attach the volume, but cannot support pd-standard
  4. Volume attachment fails, and the Pod cannot start

C3 and C3D machine types support only pd-balanced, pd-ssd, and pd-extreme disk types.

Solution

Prevent Pods requiring pd-standard storage from scheduling onto incompatible Nodes by implementing both configuration changes below.

1. Update StorageClass volume binding mode

Change your StorageClass to use WaitForFirstConsumer. This ensures Kubernetes provisions the PV only after the Pod is scheduled, allowing the scheduler to consider Node constraints, including disk type compatibility.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete

2. Add Node Selector to Pods using PD Standard

Add this NodeSelector to Pod specifications that use pd-standard volumes:

nodeSelector:
  volume.scheduling.cast.ai/pd-standard: "true"

This ensures these Pods only schedule onto NodePools that support pd-standard disks.
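Put together, a Pod consuming a pd-standard-backed PVC would look roughly like this (a minimal sketch; the Pod name, image, and claim name are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pd-standard-consumer            # placeholder name
spec:
  nodeSelector:
    volume.scheduling.cast.ai/pd-standard: "true"   # keep the Pod off C3/C3D nodes
  containers:
    - name: app
      image: nginx                      # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-example          # PVC provisioned by the pd-standard StorageClass
```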

Why both changes are required

  • WaitForFirstConsumer: delays PV provisioning until Kubernetes selects a Node, enabling the scheduler to evaluate disk type compatibility.
  • Node selector: restricts Pod placement to Nodes that support pd-standard, preventing scheduling onto C3/C3D machine types.

With this configuration:

  • Pods using PD Standard Volumes schedule only on compatible Node types
  • Kubernetes provisions volumes after selecting a valid Node
  • Cast AI's autoscaler respects these constraints when placing workloads
  • Volume attachment failures are prevented

For details on disk type compatibility with GCP machine types, see Google Cloud documentation on persistent disk types.