Troubleshooting node autoscaling

Solutions for resolving node provisioning, pod scheduling, and cloud provider-specific issues with the Cast AI autoscaler.

EKS: Max pod count on AWS CNI

Some AWS instance types come with a low upper limit on max pod count, e.g., 58 for c6g.2xlarge. The complete list of ENI-based limits per instance type is available in eni-max-pods.txt.

This can be mitigated in two ways:

  1. Setting a Min CPU constraint of 16 CPUs in Node Templates, as the issue only affects nodes with fewer CPUs (e.g., 8-CPU nodes)
  2. Increasing the pods-per-node limit within your cluster context
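The specific command for raising the limit is not shown here. As an illustrative sketch, one widely used approach on the AWS VPC CNI is enabling prefix delegation on the aws-node DaemonSet (an assumption, not necessarily the original guidance; only nodes launched after the change, with a correspondingly raised kubelet --max-pods, benefit from it):

```shell
# Enable VPC CNI prefix delegation so each ENI IP slot serves a /28 prefix,
# raising the achievable pods-per-node ceiling on Nitro-based instances.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# Verify the environment variable is now set on the DaemonSet.
kubectl describe daemonset aws-node -n kube-system | grep ENABLE_PREFIX_DELEGATION
```

Note that existing nodes keep their original pod limit; the higher ceiling applies to nodes provisioned after the change.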

EKS: Missing instance profile IAM role from aws-auth ConfigMap

For clusters that utilize ConfigMap access mode, nodes require an entry in aws-auth ConfigMap with the IAM instance profile role to properly access the cluster. If nodes remain unhealthy because kubelet cannot access the api-server, check if the role is present.

Sharing the instance profile and role for Cast AI-managed nodes with EKS-managed node groups is not recommended. This is because deleting all managed node groups that use an instance profile removes the instance profile role from aws-auth, which can break Cast AI-managed nodes that utilize the role. The observed symptom is nodes becoming NotReady shortly after, with kubelet receiving unauthorized errors when accessing the api-server.

To resolve the problem after it appears:

  • For clusters utilizing ConfigMap access mode, add the role back to the aws-auth ConfigMap.
  • For clusters utilizing EKS API and ConfigMap access mode, adding an access entry either in the EKS access entry list or in the aws-auth ConfigMap is sufficient.
  • Clusters utilizing only the EKS API access mode should not be affected, as EKS does not delete EKS API access entries.

Sample aws-auth entry:

- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "arn:aws:iam::account-id:role/instance-profile-role"
  "username": "system:node:{{EC2PrivateDNSName}}"
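To restore an entry like the one above, the ConfigMap can be edited in place (a generic sketch; substitute your actual instance profile role ARN under the mapRoles section):

```shell
# Open the aws-auth ConfigMap for editing and re-add the missing
# instance profile role entry under mapRoles.
kubectl edit configmap aws-auth -n kube-system
```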

EKS: Warm pool nodes are not supported

Cast AI does not support deleting nodes that were provisioned from AWS autoscaling warm pools. Attempting to delete warm pool nodes through the Cast AI console will result in a deletion failure with an error message.

What are warm pools? AWS autoscaling warm pools are pre-initialized EC2 instances that remain in a stopped or hibernated state to reduce scale-out latency when scaling your cluster.

Resolution: Warm pool nodes must be managed directly through the AWS console or CLI rather than through Cast AI. If you need to remove warm pool nodes, use AWS tools to manage the autoscaling group configuration.
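For example, the warm pool itself can be inspected and removed with the AWS CLI (a sketch assuming an Auto Scaling group named my-asg):

```shell
# Inspect the warm pool attached to the Auto Scaling group.
aws autoscaling describe-warm-pool --auto-scaling-group-name my-asg

# Remove the warm pool; --force-delete also terminates its instances.
aws autoscaling delete-warm-pool --auto-scaling-group-name my-asg --force-delete
```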

Related: If you're using warm pools and experiencing node management issues, consider using standard EKS managed node groups or self-managed node groups without warm pools for nodes that Cast AI will manage.


GKE: Pod startup failures with PD Standard on C3/C3D nodes

Pods fail to start when PersistentVolumeClaims using pd-standard storage are scheduled onto Nodes running C3 or C3D machine types. These instance families do not support pd-standard disks, which prevents volume attachment.

Symptoms

Pods remain in ContainerCreating state and display this warning:

Warning  FailedAttachVolume  attachdetach-controller  
AttachVolume.Attach failed for volume "pvc-example" : 
rpc error: code = InvalidArgument desc = Failed to Attach: 
failed cloud service attach disk call: googleapi: 
Error 400: pd-standard disk type cannot be used by c3d-standard-90 machine type., badRequest

Root cause

When your default StorageClass is configured with volumeBindingMode: Immediate and type: pd-standard, Kubernetes provisions the PersistentVolume (PV) before it schedules the Pod. This creates a timing issue:

  1. Kubernetes creates the PV as pd-standard
  2. The scheduler places the Pod on an available Node (potentially C3/C3D)
  3. The Node attempts to attach the volume, but cannot support pd-standard
  4. Volume attachment fails, and the Pod cannot start

C3 and C3D machine types support only pd-balanced, pd-ssd, and pd-extreme disk types.

Solution

Prevent Pods requiring pd-standard storage from scheduling onto incompatible Nodes by implementing both configuration changes below.

1. Update StorageClass volume binding mode

Change your StorageClass to use WaitForFirstConsumer. This ensures Kubernetes provisions the PV only after the Pod is scheduled, allowing the scheduler to consider Node constraints, including disk type compatibility.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete

2. Add Node Selector to Pods using PD Standard

Add this NodeSelector to Pod specifications that use pd-standard volumes:

nodeSelector:
  volume.scheduling.cast.ai/pd-standard: "true"

This ensures these Pods only schedule onto NodePools that support pd-standard disks.
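Put together, a Pod consuming a pd-standard-backed PVC would look roughly like this (a minimal sketch; the Pod name, image, and claim name are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pd-standard-consumer            # placeholder name
spec:
  nodeSelector:
    volume.scheduling.cast.ai/pd-standard: "true"   # keep the Pod off C3/C3D nodes
  containers:
    - name: app
      image: nginx                      # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-example          # PVC provisioned by the pd-standard StorageClass
```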

Why both changes are required

  • WaitForFirstConsumer: delays PV provisioning until Kubernetes selects a Node, enabling the scheduler to evaluate disk type compatibility.
  • Node selector: restricts Pod placement to Nodes that support pd-standard, preventing scheduling onto C3/C3D machine types.

With this configuration:

  • Pods using PD Standard Volumes schedule only on compatible Node types
  • Kubernetes provisions volumes after selecting a valid Node
  • Cast AI's autoscaler respects these constraints when placing workloads
  • Volume attachment failures are prevented

For details on disk type compatibility with GCP machine types, see Google Cloud documentation on persistent disk types.