Cloud Provider Troubleshooting

This guide helps users troubleshoot issues specific to cloud-provider clusters (EKS, GKE, or AKS) when using CAST AI.

AKS

AKS Versions not supported

When a cluster is onboarded to CAST AI managed mode, the onboarding process creates a new Node Pool. Microsoft Azure enforces certain restrictions on Node Pool creation:

  • A Node Pool can NOT run a Kubernetes version newer than your AKS cluster control plane version.
  • Microsoft supports only a small number of minor/patch Kubernetes versions for Node Pool creation. See the Azure documentation.

You can check the list of supported AKS versions in your region:

❯ az aks get-versions --location eastus --output table
KubernetesVersion    Upgrades                 
-------------------  -----------------------  
1.29.2               None available           
1.29.0               1.29.2                   
1.28.5               1.29.0, 1.29.2           
1.28.3               1.28.5, 1.29.0, 1.29.2   
1.27.9               1.28.3, 1.28.5           
1.27.7               1.27.9, 1.28.3, 1.28.5   
1.26.12              1.27.7, 1.27.9           
1.26.10              1.26.12, 1.27.7, 1.27.9  

If your AKS control plane runs a version that is not in the supported list (for example, 1.24.8), no new Node Pools can be created, with or without CAST AI. To continue CAST AI onboarding, upgrade the AKS control plane to the nearest supported patch version (1.24.9 or 1.24.10 at the time of writing) and re-run the onboarding script. Only the control plane needs to be upgraded; your existing nodes can stay on their current version.
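The check described above can be sketched in shell. This is a minimal sketch, not part of the onboarding script: `version_supported` is a hypothetical helper, and `my-rg`, `my-cluster`, and the target version are placeholders you would replace with your own values.

```shell
# Hypothetical helper: succeeds if the version given as $1 appears in the
# first column of `az aks get-versions --output table` output read from stdin.
# The first two lines (column header and dashes) are skipped.
version_supported() {
  awk -v v="$1" 'NR > 2 && $1 == v { found = 1 } END { exit !found }'
}

# Usage against a real cluster (resource names are placeholders):
#   current=$(az aks show --resource-group my-rg --name my-cluster \
#               --query kubernetesVersion --output tsv)
#   az aks get-versions --location eastus --output table \
#     | version_supported "$current" || echo "control plane upgrade needed"
#
# Upgrade only the control plane, leaving existing nodes untouched:
#   az aks upgrade --resource-group my-rg --name my-cluster \
#     --kubernetes-version <supported-version> --control-plane-only
```

The `--control-plane-only` flag keeps the upgrade limited to the control plane, which matches the advice above that existing nodes do not need to be upgraded.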



AKS fails to pull images from Azure Container Registry

If the cluster is already attached to an ACR, then after onboarding to CAST AI the Service Principal created to manage the cluster might not have the permissions required to pull images from private ACRs. When new nodes are created, this may result in errors such as "failed to pull and unpack image" and "failed to fetch oauth token: unexpected status: 401 Unauthorized".

Microsoft has detailed documentation on troubleshooting and fixing the issue: Fail to pull images from Azure Container Registry to Azure Kubernetes Service cluster.

In most cases, applying "Solution 1: Ensure AcrPull role assignment is created for identity" from that document is enough to resolve the issue.
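As a rough illustration of what creating an AcrPull role assignment looks like with the Azure CLI, here is a minimal sketch. `grant_acr_pull` is a hypothetical helper, and which identity needs the role depends on your setup; follow the Microsoft document to identify it before running anything.

```shell
# Hypothetical helper: grant the AcrPull role on a registry to an identity.
#   $1 = ACR name
#   $2 = object ID or client ID of the identity that pulls images (assumption:
#        you have already identified it as described in the Microsoft document)
grant_acr_pull() {
  # Resolve the full resource ID of the registry to use as the role scope.
  acr_id=$(az acr show --name "$1" --query id --output tsv) || return 1
  az role assignment create --assignee "$2" --role AcrPull --scope "$acr_id"
}

# Usage (placeholder values):
#   grant_acr_pull myregistry 00000000-0000-0000-0000-000000000000
```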



EKS

Max Pod Count on AWS CNI

Some AWS instance types have a low upper limit on the number of pods per node, e.g. 58 for c6g.2xlarge. The full list of ENI-based limits per instance type is available in eni-max-pods.txt.
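The figure of 58 follows from the formula that AWS documents for the default VPC CNI mode: number of ENIs × (IPv4 addresses per ENI − 1) + 2. A minimal sketch of that arithmetic follows; `max_pods` is a hypothetical helper name, and the ENI figures for c6g.2xlarge (4 ENIs with 15 IPv4 addresses each) are taken from the AWS instance limits.

```shell
# max_pods ENIS IPS_PER_ENI
# One address on each ENI is reserved for the ENI itself; the extra 2
# accounts for pods that use host networking.
max_pods() {
  echo $(( $1 * ($2 - 1) + 2 ))
}

max_pods 4 15   # c6g.2xlarge: prints 58
```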

This can be mitigated in two ways:

  1. Set the Min CPU constraint in Node Templates to 16 CPUs, as the issue only affects nodes with fewer CPUs (e.g. 8-CPU nodes).
  2. Increase the pods-per-node limit by executing the following within your cluster context:
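A minimal sketch of one way to raise the limit, assuming the cluster uses the Amazon VPC CNI (the `aws-node` daemonset) on Nitro-based instances and that prefix delegation is the intended mechanism. That is an assumption; the text above does not name the mechanism. `enable_prefix_delegation` is a hypothetical wrapper name.

```shell
# Assumption: prefix delegation is the mechanism used to raise the limit.
# It assigns /28 prefixes to ENI slots instead of single addresses, so each
# node can hold many more pod IPs.
enable_prefix_delegation() {
  kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
}

# Usage, with kubectl pointed at the affected cluster:
#   enable_prefix_delegation
```

The kubelet max-pods value is set when a node boots, so typically only nodes created after the change pick up the higher limit.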


GKE