Common Deployment Challenges

Learn how to troubleshoot Kubernetes Pod scheduling issues and common deployment challenges when using CAST AI.

Topology spread constraints

Kubernetes Scheduler

Pods might not get scheduled as intended due to conflicting or incorrectly configured topology spread constraints.
Here's what you can do to troubleshoot this issue:

  1. Review constraints: ensure your topology spread constraints are configured correctly in the Pod specs. Verify the maxSkew, topologyKey, and whenUnsatisfiable parameters.
  2. Verify node labels: ensure your nodes have accurate and required labels. If topologyKey references node labels, these must be consistent across all nodes.
  3. Check the scheduler logs: inspect the Kubernetes scheduler logs for any issues or warnings related to Pod scheduling. Use kubectl describe pod <pod-name> to see events and errors.
  4. Inspect resource utilization: ensure that your nodes have enough resources to accommodate the Pods following spread constraints. Insufficient resources or tight constraints might lead to scheduling issues.
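The maxSkew check from step 1 can be sketched as follows: the scheduler compares the difference between the most and least loaded topology domains against maxSkew. The zone Pod counts below are hypothetical stand-ins for what you would gather from your cluster (e.g. with kubectl get pods -o wide):

```shell
# Sketch of the maxSkew check (zone Pod counts are hypothetical)
zone_counts="10 9 10"   # Pods per zone
max_skew=1              # from the Pod's topologySpreadConstraints
max=0; min=999999
for c in $zone_counts; do
  if [ "$c" -gt "$max" ]; then max=$c; fi
  if [ "$c" -lt "$min" ]; then min=$c; fi
done
skew=$((max - min))
echo "skew=$skew"
if [ "$skew" -le "$max_skew" ]; then
  echo "constraint satisfied"
else
  echo "constraint violated"
fi
```

With counts of 10, 9, and 10, the skew is 1, so a constraint with maxSkew: 1 is still satisfied; a fourth Pod landing in an already-full zone would push the skew to 2 and block scheduling under DoNotSchedule.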

Event: Cannot Satisfy Zone Topology Spread

topology.kubernetes.io/zone is a supported topology spread constraint key in the Node Placer, allowing you to deploy Pods in a highly available way that takes advantage of cloud zone redundancy.

The Kubernetes Scheduler will take all cluster nodes and extract availability zones (AZs), which are used for distributing workloads across cluster nodes and utilizing topology spread constraints.
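As a sketch of that extraction, each node's zone lives in its topology.kubernetes.io/zone label. The node names and zones below are hypothetical stand-ins for output such as kubectl get nodes -L topology.kubernetes.io/zone:

```shell
# Hypothetical node-to-zone mapping (in a real cluster, list it with:
#   kubectl get nodes -L topology.kubernetes.io/zone)
node_zones="node-1 us-east-1a
node-2 us-east-1b
node-3 us-east-1a"
# Collect the distinct AZs available for spreading workloads
zones=$(printf '%s\n' "$node_zones" | awk '{print $2}' | sort -u)
echo "$zones"
```

Here three nodes yield two distinct zones, which is the set the scheduler and the Autoscaler reason about when evaluating zone topology spread constraints.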

CAST AI's Autoscaler won't scale up Pods when it detects that it can't create new nodes in AZs the cluster isn't configured for, even if the cluster already contains nodes in those non-configured zones. Instead, it will add a Cannot Satisfy Zone Topology Spread event to the affected Pods.

Here's what you can do to troubleshoot this issue:

  1. Check which AZs the cluster is configured for: if the cluster has nodes in AZs that aren't configured, remove those nodes from the cluster.
  2. Configure the cluster to use the required AZs for nodes.
  3. Configure stricter node affinity on the workload, specifying only the configured AZs.

After implementing the troubleshooting steps, initiate reconciliation on the updated cluster.
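For step 3, stricter node affinity can be expressed directly in the workload's Pod template spec. A minimal sketch, assuming two hypothetical configured zones (us-east-1a and us-east-1b):

```yaml
# Hypothetical snippet for a Pod template spec; replace the zone values
# with the AZs your cluster is actually configured for.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a
                - us-east-1b
```

This keeps the scheduler from ever considering nodes in zones the cluster isn't configured for.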

Example Deployment

This example deployment is expected to spawn at least 3 nodes in different zones, each of them running 10 Pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-topology-spread
  name: az-topology-spread
spec:
  replicas: 30
  selector:
    matchLabels:
      app: az-topology-spread
  template:
    metadata:
      labels:
        app: az-topology-spread
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - az-topology-spread
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              cpu: 1000m

Custom secret management

There are many technologies for managing Secrets in GitOps. Some store the encrypted secret data in a Git repository and use a cluster add-on to decrypt the data during deployment. Others use a reference to an external secret manager or vault.

The agent helm chart provides the parameter apiKeySecretRef to enable the use of CAST AI with custom secret managers.

# Name of the secret containing the token used to authorize agent access to the API.
# apiKey and apiKeySecretRef are mutually exclusive.
# The referenced secret must provide the token in .data["API_KEY"].
apiKeySecretRef: ""
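The referenced Secret must therefore carry the token under the API_KEY key. A minimal sketch of generating such a manifest, using a hypothetical secret name (castai-agent-token) and a placeholder token:

```shell
# Build a Secret manifest whose .data["API_KEY"] holds the base64-encoded
# token, as apiKeySecretRef expects. The name and token are placeholders.
token="my-castai-api-key"
encoded=$(printf '%s' "$token" | base64)
cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: castai-agent-token
  namespace: castai-agent
data:
  API_KEY: $encoded
EOF
```

You would then pass the secret's name to the chart, e.g. --set apiKeySecretRef=castai-agent-token.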

An example of the CAST AI agent

Here's an example of using a CAST AI agent helm chart with a custom secret:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-agent castai-helm/castai-agent -n castai-agent \
  --set apiKeySecretRef=<your-custom-secret> \
  --set clusterID=<your-cluster-id>

An example of the CAST AI cluster controller

Here's an example of using the CAST AI cluster controller helm chart with a custom secret:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
  --set castai.apiKeySecretRef=<your-custom-secret> \
  --set castai.clusterID=<your-cluster-id>

TLS handshake timeout issue

In some edge cases, due to a specific cluster network setup, the agent might fail with the following message in the agent container logs:

time="2021-11-13T05:19:54Z" level=fatal msg="agent failed: registering cluster: getting namespace \"kube-system\": Get \"\": net/http: TLS handshake timeout" provider=eks version=v0.22.1

You can resolve this issue by deleting the castai-agent Pod. The Deployment will recreate the Pod and resolve the issue.

Refused connection to control plane

When enabling automated cluster optimization for the first time, the user runs a pre-generated script to grant required permissions to CAST AI. The error message No access to Kubernetes API server, please check your firewall settings indicates that a firewall prevents communication between the control plane and CAST AI.

To solve this issue, allow access to the CAST AI IP addresses in your firewall settings, and then enable automated optimization again.