Troubleshooting cluster onboarding

Solutions for resolving issues when connecting clusters to Cast AI, including connectivity problems, authentication errors, and cloud provider-specific onboarding issues.

Your cluster does not appear on the Connect Cluster screen

If the cluster does not appear on the Connect your cluster screen after you've run the connection script, perform the following steps:

1. Check agent container logs:

```shell
kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-agent -c agent
```

2. You might get output similar to this:

```text
time="2021-05-06T14:24:03Z" level=fatal msg="agent failed: registering cluster: getting cluster name: describing instance_id=i-026b5fadab5b69d67: UnauthorizedOperation: You are not authorized to perform this operation.\n\tstatus code: 403, request id: 2165c357-b4a6-4f30-9266-a51f4aaa7ce7"
```
or

```text
time="2021-05-06T14:24:03Z" level=fatal msg="agent failed: getting provider: configuring aws client: NoCredentialProviders: no valid providers in chain"
```
or

```text
time="2023-08-18T18:44:49Z" level=error msg="agent failed: getting provider: configuring aws client: getting instance region: EC2MetadataRequestError: failed to get EC2 instance identity document\ncaused by: RequestError: send request failed\ncaused by: Get \"http://169.254.169.254/latest/dynamic/instance-identity/document\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
```

These errors indicate that the Cast AI agent failed to reach the AWS API. This usually happens when your cluster's nodes or workloads run with custom-constrained IAM permissions, or when the IAM roles have been removed entirely.

The Cast AI agent requires read-only access to the AWS EC2 API to correctly identify some properties of your EKS cluster. Access to the AWS EC2 metadata endpoint is optional, but if it is blocked, the variables normally discovered from that endpoint must be provided explicitly.

The Cast AI agent uses the official AWS SDK, which supports all of the standard environment variables for customizing authentication, as described in the SDK documentation.

Provide cluster metadata by adding these environment variables to the Cast AI agent deployment:

```yaml
- name: EKS_ACCOUNT_ID
  value: "000000000000"    # your AWS account ID
- name: EKS_REGION
  value: "eu-central-1"    # your EKS cluster region
- name: EKS_CLUSTER_NAME
  value: "staging-example" # your EKS cluster name
```
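As a sketch, assuming the agent runs as the `castai-agent` deployment in the `castai-agent` namespace (the defaults created by the connection script), these variables can be set without editing the manifest by hand:

```shell
# Values below are placeholders; substitute your own account ID, region, and cluster name
kubectl set env deployment/castai-agent -n castai-agent -c agent \
  EKS_ACCOUNT_ID="000000000000" \
  EKS_REGION="eu-central-1" \
  EKS_CLUSTER_NAME="staging-example"
```

The deployment controller rolls out a new agent pod with the updated environment.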

If you're instead using GCP GKE, you can provide the following environment variables to overcome the lack of access to VM metadata:

```yaml
- name: GKE_PROJECT_ID
  value: your_project_id
- name: GKE_CLUSTER_NAME
  value: your_cluster_name
- name: GKE_REGION
  value: your_cluster_region
- name: GKE_LOCATION
  value: your_cluster_az
```
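As a sketch, assuming the default `castai-agent` deployment name and namespace, these GKE variables can likewise be set with kubectl:

```shell
# Placeholder values; substitute your own project, cluster name, region, and zone
kubectl set env deployment/castai-agent -n castai-agent -c agent \
  GKE_PROJECT_ID="your_project_id" \
  GKE_CLUSTER_NAME="your_cluster_name" \
  GKE_REGION="your_cluster_region" \
  GKE_LOCATION="your_cluster_az"
```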

The Cast AI agent requires only read-only permissions, so the AWS-managed AmazonEC2ReadOnlyAccess policy is sufficient. Provide AWS API access by adding these variables to the Cast AI agent secret:

```text
AWS_ACCESS_KEY_ID = xxxxxxxxxxxxxxxxxxxx
AWS_SECRET_ACCESS_KEY = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
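As a sketch, assuming the agent's secret is named `castai-agent` in the `castai-agent` namespace (check which secret your agent deployment actually references), the keys can be added like this:

```shell
# Secret name is an assumption; the dry-run/apply pipeline upserts the secret
kubectl create secret generic castai-agent -n castai-agent \
  --from-literal=AWS_ACCESS_KEY_ID="xxxxxxxxxxxxxxxxxxxx" \
  --from-literal=AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
  --dry-run=client -o yaml | kubectl apply -f -
```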

Alternatively, if you use IAM roles for service accounts (IRSA), you can annotate the castai-agent service account with your IAM role instead of providing AWS credentials:

```shell
kubectl annotate serviceaccount -n castai-agent castai-agent eks.amazonaws.com/role-arn="arn:aws:iam::111122223333:role/iam-role-name"
```

Disconnected or Not responding cluster

If the cluster has a Not responding status, the Cast AI agent deployment is most likely missing. Press Reconnect and follow the instructions provided.

The Not responding state is temporary; unless the issue is fixed, the cluster will enter the Disconnected state. If your cluster is disconnected, you can reconnect it or delete it from the console.

The delete action only removes the cluster from the Cast AI console, leaving it running in the cloud service provider.


TLS handshake timeout issue

In some edge cases, due to a specific cluster network setup, the agent might fail with the following message in the agent container logs:

```text
time="2021-11-13T05:19:54Z" level=fatal msg="agent failed: registering cluster: getting namespace \"kube-system\": Get \"https://100.10.1.0:443/api/v1/namespaces/kube-system\": net/http: TLS handshake timeout" provider=eks version=v0.22.1
```

You can resolve this issue by deleting the castai-agent pod; the deployment will recreate it.
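Using the same label selector as in the log-checking step above, the pod can be deleted with:

```shell
kubectl delete pod -n castai-agent -l app.kubernetes.io/name=castai-agent
```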


Refused connection to control plane

When enabling automated cluster optimization for the first time, the user runs a pre-generated script to grant Cast AI the required permissions. The error message `No access to Kubernetes API server, please check your firewall settings` indicates that a firewall prevents communication between the control plane and Cast AI.

To solve this issue, allow access from the Cast AI IP address 35.221.40.21 to your cluster's API server, then enable automated optimization again.
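For example, if you run AKS and restrict API server access with authorized IP ranges, the Cast AI IP could be appended to the allow list (resource group and cluster names below are placeholders, and you must keep your existing ranges in the list):

```shell
# Placeholders: my-resource-group, my-cluster, <your-existing-ranges>
az aks update \
  --resource-group my-resource-group \
  --name my-cluster \
  --api-server-authorized-ip-ranges "<your-existing-ranges>,35.221.40.21/32"
```

Other platforms and firewall products have their own mechanisms; the key point is that inbound access from 35.221.40.21 to the API server must be allowed.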


AKS versions not supported

During cluster onboarding to Cast AI managed mode, the onboarding process creates a new Node Pool. Microsoft Azure enforces certain restrictions on Node Pool creation:

- A Node Pool's Kubernetes version cannot be newer than your AKS cluster control plane version.
- Microsoft supports only a limited set of minor/patch Kubernetes versions for Node Pool creation (see the Azure documentation).

You can check the list of supported AKS versions in your region:

```shell
❯ az aks get-versions --location eastus --output table
KubernetesVersion    Upgrades
-------------------  -----------------------
1.29.2               None available
1.29.0               1.29.2
1.28.5               1.29.0, 1.29.2
1.28.3               1.28.5, 1.29.0, 1.29.2
1.27.9               1.28.3, 1.28.5
1.27.7               1.27.9, 1.28.3, 1.28.5
1.26.12              1.27.7, 1.27.9
1.26.10              1.26.12, 1.27.7, 1.27.9
```

If your AKS cluster control plane version is not in the supported list, no new Node Pools can be created, whether by Cast AI or otherwise. To continue Cast AI onboarding, upgrade the AKS control plane to a supported version and re-run the onboarding script. There is no need to upgrade your existing nodes, only the control plane.
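As a sketch (resource group, cluster name, and target version below are placeholders), the control plane alone can be upgraded with the Azure CLI:

```shell
# --control-plane-only upgrades the control plane without upgrading node pools
az aks upgrade \
  --resource-group my-resource-group \
  --name my-cluster \
  --kubernetes-version 1.28.5 \
  --control-plane-only
```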


AKS fails to pull images from Azure Container Registry to Azure Kubernetes Service cluster

If the cluster is already attached to the ACR, the Service Principal created during Cast AI onboarding to manage the cluster might not have the permissions required to pull images from private ACRs. This can result in errors such as `failed to pull and unpack image` and `failed to fetch oauth token: unexpected status: 401 Unauthorized` when creating new nodes.

Microsoft has detailed documentation on troubleshooting and fixing the issue: Fail to pull images from Azure Container Registry to Azure Kubernetes Service cluster

In most cases, Solution 1 ("Ensure AcrPull role assignment is created for identity") is enough to resolve it.
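A sketch of that solution using the Azure CLI; the resource group, cluster, and registry names are placeholders, and the identity and registry IDs are looked up from your own resources:

```shell
# Look up the kubelet identity that the cluster's nodes use to pull images
ASSIGNEE=$(az aks show -g my-resource-group -n my-cluster \
  --query identityProfile.kubeletidentity.clientId -o tsv)

# Look up the registry's resource ID
ACR_ID=$(az acr show -g my-resource-group -n myregistry --query id -o tsv)

# Grant the AcrPull role on the registry to that identity
az role assignment create --assignee "$ASSIGNEE" --role AcrPull --scope "$ACR_ID"
```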