Cloud Provider Troubleshooting

This guide is designed to assist users encountering issues specific to cloud providers, such as EKS, GKE, or AKS clusters, when using CAST AI.

AKS

AKS Versions not supported

During cluster onboarding to CAST AI managed mode, the onboarding process creates a new Node Pool. Microsoft Azure enforces certain restrictions on Node Pool creation:

  • A Node Pool can NOT be newer than your AKS cluster control plane version.
  • Microsoft supports only a small number of minor/patch Kubernetes versions for Node Pool creation. See the Azure documentation for details.

You can check the list of supported AKS versions in your region:

❯ az aks get-versions --location eastus --output table
KubernetesVersion    Upgrades                 
-------------------  -----------------------  
1.29.2               None available           
1.29.0               1.29.2                   
1.28.5               1.29.0, 1.29.2           
1.28.3               1.28.5, 1.29.0, 1.29.2   
1.27.9               1.28.3, 1.28.5           
1.27.7               1.27.9, 1.28.3, 1.28.5   
1.26.12              1.27.7, 1.27.9           
1.26.10              1.26.12, 1.27.7, 1.27.9  

If your AKS cluster control plane runs a version that is no longer supported for Node Pool creation (for example, 1.24.8), no new Node Pools can be created, whether by CAST AI or anything else. To continue CAST AI onboarding, upgrade the AKS control plane to the nearest supported patch version, say 1.24.9 or 1.24.10 (at the time of writing), and re-run the onboarding script. There is no need to upgrade your existing nodes, just the control plane.
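As a minimal sketch, the control plane alone can be upgraded with the Azure CLI; the resource group, cluster name, and target version below are placeholders, and the version should be picked from the az aks get-versions output above:

❯ az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.24.9 --control-plane-only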


AKS fails to pull images from Azure Container Registry to Azure Kubernetes Service cluster

If the cluster is already attached to an ACR, then after onboarding on CAST AI, the Service Principal created to manage the cluster might not have the correct permissions to pull images from the private ACRs. This may result in errors such as "failed to pull and unpack image" and "failed to fetch oauth token: unexpected status: 401 Unauthorized" when creating new nodes.

Microsoft has detailed documentation on troubleshooting and fixing the issue: Fail to pull images from Azure Container Registry to Azure Kubernetes Service cluster

In most cases, Solution 1 ("Ensure AcrPull role assignment is created for identity") is enough to resolve the issue.
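As an example, one way to restore that role assignment is to re-attach the registry to the cluster with the Azure CLI, which grants AcrPull to the cluster's kubelet identity. This is a sketch; the resource group, cluster name, and registry name are placeholders:

❯ az aks update --resource-group <resource-group> --name <cluster-name> --attach-acr <acr-name>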


EKS

Max Pod Count on AWS CNI

There are situations where you can get VMs in AWS that have a low upper limit on the maximum pod count, e.g., 58 for c6g.2xlarge. The full list of ENI limits per instance type is available in eni-max-pods.txt.

This can be mitigated in two ways:

  1. Set Min CPU constraints in Node Templates to 16 CPUs, as the issue only exists on nodes with fewer CPUs (e.g., 8-CPU nodes).
  2. Increase the pods-per-node limit by executing a command within your cluster context, as shown in the sketch below.
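A minimal sketch of one way to do this, assuming the AWS VPC CNI on Nitro-based instance types: enabling prefix delegation raises the per-node pod limit. This is an illustrative approach, not CAST AI-specific tooling:

❯ kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
❯ kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1

New nodes typically pick up the higher limit; existing nodes keep their original max-pods value until replaced.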

Missing instance profile IAM role from aws-auth ConfigMap

For clusters that utilize ConfigMap access mode, nodes require an entry in the aws-auth ConfigMap with the IAM instance profile role to properly access the cluster. If nodes remain unhealthy because the kubelet cannot access the api-server, check whether the role is present.
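For example, a simple way to inspect the mapped roles in the ConfigMap:

❯ kubectl get configmap aws-auth -n kube-system -o yaml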

Sharing the instance profile and role between CAST AI-managed nodes and EKS-managed node groups is not recommended. Deleting all managed node groups that use an instance profile removes the instance profile role from aws-auth, which can break CAST AI-managed nodes that use the role. The observed symptom is nodes becoming NotReady shortly afterwards, with the kubelet receiving unauthorized errors when accessing the api-server.

To resolve the problem after it appears:

  • For clusters utilizing ConfigMap access mode, add the role back to aws-auth.
  • For clusters utilizing EKS API and ConfigMap access mode, adding an access entry either to the EKS access entry list or to the aws-auth ConfigMap is sufficient.
  • Clusters utilizing only EKS API access mode should not be affected, as EKS does not delete entries in the EKS API.

Sample aws-auth entry:

- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "arn:aws:iam::account-id:role/instance-profile-role"
  "username": "system:node:{{EC2PrivateDNSName}}"

GKE