Cloud Provider Troubleshooting
This guide helps users troubleshoot issues specific to managed Kubernetes services such as EKS, GKE, or AKS when using CAST AI.
AKS
AKS Versions not supported
During cluster onboarding to CAST AI managed mode, the onboarding process creates a new Node Pool. Microsoft Azure enforces certain restrictions on Node Pool creation:
- A Node Pool can NOT be newer than your AKS cluster control plane version.
- Microsoft supports only a small number of minor/patch Kubernetes versions for Node Pool creation. See the Azure documentation.
You can check the list of supported AKS versions in your region:
❯ az aks get-versions --location eastus --output table
KubernetesVersion    Upgrades
-------------------  -----------------------
1.29.2               None available
1.29.0               1.29.2
1.28.5               1.29.0, 1.29.2
1.28.3               1.28.5, 1.29.0, 1.29.2
1.27.9               1.28.3, 1.28.5
1.27.7               1.27.9, 1.28.3, 1.28.5
1.26.12              1.27.7, 1.27.9
1.26.10              1.26.12, 1.27.7, 1.27.9
If your AKS cluster control plane version is not on the supported list (for example, 1.24.8), no new Node Pools can be created (CAST AI or not). To continue CAST AI onboarding, upgrade the AKS control plane to the nearest supported patch version, for example 1.24.9 or 1.24.10 (at the time of writing), and re-run the onboarding script. There is no need to upgrade your existing nodes, only the control plane.
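The version check described above can be scripted. The following is a minimal sketch, assuming the table format shown earlier; the helper name version_is_listed and the my-rg and my-aks names in the usage comments are hypothetical placeholders, not part of CAST AI or the Azure CLI.

```shell
# Succeeds if the given version appears in the first column of the
# `az aks get-versions --output table` listing read from stdin.
version_is_listed() {
  # $1: Kubernetes version to look for, e.g. 1.28.5
  awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Usage (my-rg and my-aks are placeholders for your own names):
#   current=$(az aks show --resource-group my-rg --name my-aks \
#     --query kubernetesVersion --output tsv)
#   if az aks get-versions --location eastus --output table \
#       | version_is_listed "$current"; then
#     echo "Node Pools can be created on $current"
#   else
#     echo "Upgrade the control plane before onboarding"
#   fi
#
# To upgrade only the control plane and leave existing nodes untouched:
#   az aks upgrade --resource-group my-rg --name my-aks \
#     --kubernetes-version <supported-version> --control-plane-only
```

The helper only compares the first column, so it should keep working if the CLI adds further columns to the table.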
AKS fails to pull images from Azure Container Registry
If the cluster is already attached to an ACR, then after onboarding to CAST AI the Service Principal created to manage the cluster might not have the correct permissions to pull images from private ACRs. This may result in the error "failed to pull and unpack image, failed to fetch oauth token: unexpected status: 401 Unauthorized" when creating new nodes.
Microsoft has detailed documentation on troubleshooting and fixing the issue: Fail to pull images from Azure Container Registry to Azure Kubernetes Service cluster
In most cases, "Solution 1: Ensure AcrPull role assignment is created for identity" is enough to resolve it.
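As a rough illustration of what creating the AcrPull role assignment can look like with the Azure CLI, here is a minimal sketch. The function name grant_acr_pull and all IDs and names in the usage comments are hypothetical placeholders; which identity needs the role depends on your setup, so follow the Microsoft article to identify it.

```shell
# Grants the AcrPull role on a registry to an identity.
grant_acr_pull() {
  # $1: client ID or object ID of the identity that pulls images
  # $2: name of the Azure Container Registry
  acr_id=$(az acr show --name "$2" --query id --output tsv) || return 1
  az role assignment create --assignee "$1" --role AcrPull --scope "$acr_id"
}

# Usage (placeholder values):
#   grant_acr_pull 00000000-0000-0000-0000-000000000000 myregistry
#
# Afterwards, connectivity from the cluster to the registry can be checked with:
#   az aks check-acr --resource-group my-rg --name my-aks \
#     --acr myregistry.azurecr.io
```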
EKS
Max Pod Count on AWS CNI
In some situations, AWS VMs have a low upper limit on the max pod count, e.g. 58 for c6g.2xlarge. The full list of ENI limitations per instance type is available at eni-max-pods.txt.
This can be mitigated in two ways:
- Setting the Min CPU constraint in Node Templates to 16 CPUs, as the issue only exists on nodes with fewer CPUs (e.g. 8 CPU nodes).
- Increasing the pods-per-node limit by running the following within your cluster context:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
For more details, see:
- Amazon VPC CNI plugin increases pods per node limits
- Increase the amount of available IP addresses for your Amazon EC2 nodes
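To see where a figure like 58 comes from, the sketch below reproduces the commonly documented AWS VPC CNI arithmetic. The function names are hypothetical, and the 110/250 caps in the prefix delegation case reflect what AWS describes for its max pods calculator; treat them as an assumption to verify against the current AWS documentation.

```shell
# Max pods when every pod takes one secondary IP address:
#   ENIs * (IPv4 addresses per ENI - 1) + 2
max_pods_secondary_ip() {
  # $1: max ENIs for the instance type, $2: IPv4 addresses per ENI
  echo $(( $1 * ($2 - 1) + 2 ))
}

# Max pods with prefix delegation: each secondary slot holds a /28 prefix
# (16 addresses). The result is capped at 110 below 30 vCPUs, else 250
# (assumed caps, see the note above).
max_pods_prefix_delegation() {
  # $1: max ENIs, $2: IPv4 addresses per ENI, $3: vCPU count
  raw=$(( $1 * ($2 - 1) * 16 + 2 ))
  if [ "$3" -lt 30 ]; then cap=110; else cap=250; fi
  if [ "$raw" -gt "$cap" ]; then echo "$cap"; else echo "$raw"; fi
}

# c6g.2xlarge: 4 ENIs, 15 IPv4 addresses per ENI, 8 vCPUs
max_pods_secondary_ip 4 15          # prints 58
max_pods_prefix_delegation 4 15 8   # prints 110

# To see the limit each node currently reports:
#   kubectl get nodes \
#     -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods
```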
Missing instance profile IAM role from aws-auth ConfigMap
For clusters that utilize ConfigMap access mode, nodes require an entry in the aws-auth ConfigMap with the IAM instance profile role to properly access the cluster. If nodes remain unhealthy because kubelet cannot access the api-server, check if the role is present.
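One way to check whether the role is present is to read the mapRoles section of the ConfigMap and look for the role ARN. This is a minimal sketch; the helper name role_in_map_roles is hypothetical, and the ARN in the usage comment is the placeholder from the sample entry in this section.

```shell
# Succeeds if a line of the mapRoles text read from stdin ends with the
# given role ARN (surrounding double quotes are ignored).
role_in_map_roles() {
  # $1: IAM role ARN of the node instance profile
  awk -v arn="$1" '{ gsub(/"/, ""); if ($NF == arn) found = 1 } END { exit !found }'
}

# Usage:
#   kubectl get configmap aws-auth -n kube-system \
#     -o jsonpath='{.data.mapRoles}' \
#     | role_in_map_roles "arn:aws:iam::account-id:role/instance-profile-role" \
#     && echo "role present" || echo "role missing"
```

Matching the whole ARN rather than a substring avoids a false positive when one role name is a prefix of another.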
Sharing the instance profile and role for Cast AI-managed nodes with EKS-managed node groups is not recommended. This is because deleting all managed node groups that use an instance profile removes the instance profile role from aws-auth, which can break Cast AI-managed nodes that utilize the role. The observed symptom is nodes becoming NotReady shortly after, with kubelet receiving unauthorized errors when accessing the api-server.
To resolve the problem after it appears:
- For clusters utilizing ConfigMap access mode, add the role back in the aws-auth ConfigMap.
- For clusters utilizing EKS API and ConfigMap access mode, adding an access entry either in the EKS access entry list or in the aws-auth ConfigMap is sufficient.
- Clusters utilizing only EKS API access mode should not be affected, as EKS does not delete entries in EKS API.
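The three cases above can be told apart by the cluster's authentication mode, which the AWS CLI reports as CONFIG_MAP, API_AND_CONFIG_MAP, or API. The sketch below maps each value to the fix listed above; the function name remedy_for_auth_mode and the my-cluster name are hypothetical placeholders.

```shell
# Prints the fix suggested above for a given EKS authentication mode.
remedy_for_auth_mode() {
  case "$1" in
    CONFIG_MAP)         echo "add the role back to the aws-auth ConfigMap" ;;
    API_AND_CONFIG_MAP) echo "add an EKS access entry or an aws-auth entry" ;;
    API)                echo "no action expected" ;;
    *)                  echo "unknown authentication mode: $1" >&2; return 1 ;;
  esac
}

# Usage (my-cluster is a placeholder):
#   mode=$(aws eks describe-cluster --name my-cluster \
#     --query 'cluster.accessConfig.authenticationMode' --output text)
#   remedy_for_auth_mode "$mode"
#
# Creating an access entry for a node role, if you choose the EKS API route:
#   aws eks create-access-entry --cluster-name my-cluster \
#     --principal-arn arn:aws:iam::account-id:role/instance-profile-role \
#     --type EC2_LINUX
```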
Sample aws-auth entry:
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "arn:aws:iam::account-id:role/instance-profile-role"
  "username": "system:node:{{EC2PrivateDNSName}}"
GKE