Common deployment challenges

Topology spread constraints

Kubernetes Scheduler

Pods might not get scheduled as intended due to conflicting or incorrectly configured topology spread constraints.
Here's what you can do to troubleshoot this issue:

Review constraints: ensure your topology spread constraints are configured correctly in the Pod specs. Verify the maxSkew, topologyKey, and whenUnsatisfiable parameters.
Verify node labels: ensure your nodes have accurate and required labels. If topologyKey references node labels, these must be consistent across all nodes.
Check the scheduler logs: inspect the Kubernetes scheduler logs for any issues or warnings related to Pod scheduling. Use kubectl describe pod <pod-name> to see events and errors.
Inspect resource utilization: ensure your nodes have enough resources to accommodate the Pods following spread constraints. Insufficient resources or tight constraints might lead to scheduling issues.
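The checks above can be sketched as a few small shell helpers around standard kubectl commands. The function names are hypothetical and exist only for illustration; the kubectl subcommands and the topology.kubernetes.io/zone label are the ones referenced in this section.

```shell
# Hypothetical helper names; each wraps a standard kubectl command.

# List every node with its zone label. An empty ZONE column means the node
# is missing the label that topologyKey refers to.
show_node_zones() {
  kubectl get nodes -L topology.kubernetes.io/zone
}

# Show the events recorded for one Pod, including scheduling failures.
# Usage: show_pod_events <pod-name> [namespace]
show_pod_events() {
  kubectl describe pod "$1" -n "${2:-default}"
}

# Show capacity, allocatable resources, and current requests for one node.
# Usage: show_node_allocation <node-name>
show_node_allocation() {
  kubectl describe node "$1"
}
```

For example, running show_pod_events with the name of a pending Pod prints its event list, where scheduling failures related to spread constraints usually appear.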

Event: Cannot satisfy Zone Topology Spread

topology.kubernetes.io/zone is a supported topology spread constraint key in the Node Placer, allowing you to deploy Pods in a highly available way to take advantage of the cloud zone redundancy.

The Kubernetes Scheduler reads the availability zones (AZs) from the labels of all cluster nodes and uses them to distribute workloads across nodes according to the topology spread constraints.

Cast AI's Autoscaler won't scale up for Pods when doing so would require creating nodes in AZs the cluster isn't configured for, even if the cluster already contains nodes in those non-configured zones. Instead, it adds a Cannot Satisfy Zone Topology Spread event to the affected Pods.

Here's what you can do to troubleshoot this issue:

Check which AZs the cluster is configured for. If the cluster has nodes in AZs that aren't configured, remove those nodes from the cluster.
Configure the cluster to use the required AZs for nodes.
Configure stricter node affinity on the workload, specifying only the configured AZs.

After implementing the troubleshooting steps, initiate reconciliation on the updated cluster.
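As a rough sketch of the node affinity step above, the helpers below generate a patch that restricts a Deployment to a given list of zones and apply it with kubectl patch. The function names and the zone values you pass in are assumptions for illustration; substitute the AZs your cluster is actually configured for.

```shell
# Hypothetical helpers; zone names are supplied by the caller.

# Print a YAML patch that requires nodes in one of the given zones.
# Usage: zone_affinity_patch <zone> [<zone>...]
zone_affinity_patch() {
  cat <<'EOF'
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
EOF
  # Append one list entry per zone, indented under "values:".
  for zone in "$@"; do
    printf '                      - %s\n' "$zone"
  done
}

# Apply the patch to a Deployment. Note that this replaces any required
# node affinity terms the Deployment already defines.
# Usage: pin_deployment_to_zones <deployment> <namespace> <zone> [<zone>...]
pin_deployment_to_zones() {
  deployment="$1"; namespace="$2"; shift 2
  kubectl patch deployment "$deployment" -n "$namespace" \
    --patch "$(zone_affinity_patch "$@")"
}
```

Printing the patch with zone_affinity_patch before applying it makes it easy to review the generated affinity rule first.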

Example Deployment

This example Deployment is expected to spawn nodes in at least three different zones. With three zones configured, maxSkew: 1 places 10 of the 30 replicas in each zone.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-topology-spread
  name: az-topology-spread
spec:
  replicas: 30
  selector:
    matchLabels:
      app: az-topology-spread
  template:
    metadata:
      labels:
        app: az-topology-spread
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - az-topology-spread
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              cpu: 1000m

Custom secret management

There are many technologies for managing Secrets in GitOps. Some store the encrypted secret data in a git repository and use a cluster add-on to decrypt it during deployment. Others use a reference to an external secret manager/vault.

The agent Helm chart provides the apiKeySecretRef parameter so that you can use Cast AI with custom secret managers.

# Name of secret with Token to be used for authorizing agent access to the API
# apiKey and apiKeySecretRef are mutually exclusive
# The referenced secret must provide the token in .data["API_KEY"]
apiKeySecretRef: ""
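If you are not using an external secret manager and simply want to test the parameter, a Secret with the expected API_KEY key could be created by hand along these lines. The function name is hypothetical, and the castai-agent default namespace is taken from the Helm commands in this section; the namespace must already exist.

```shell
# Hypothetical helper: create a Secret that stores the token under the
# API_KEY key, which is the key the chart comment above requires.
# Usage: create_castai_api_secret <secret-name> <api-token> [namespace]
create_castai_api_secret() {
  kubectl create secret generic "$1" \
    --namespace "${3:-castai-agent}" \
    --from-literal=API_KEY="$2"
}
```

Keep in mind that passing the token as an argument leaves it in your shell history, so a secret manager remains the better choice outside of quick tests.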

An example of the Cast AI agent

Here's an example of using a Cast AI agent helm chart with a custom secret:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-agent castai-helm/castai-agent -n castai-agent \
  --set apiKeySecretRef=<your-custom-secret> \
  --set clusterID=<your-cluster-id>

An example of the Cast AI cluster controller

Here's an example of using the Cast AI cluster controller helm chart with a custom secret:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
  --set castai.apiKeySecretRef=<your-custom-secret> \
  --set castai.clusterID=<your-cluster-id>

TLS handshake timeout issue

In some edge cases, due to a specific cluster network setup, the agent might fail with the following message in the agent container logs:

time="2021-11-13T05:19:54Z" level=fatal msg="agent failed: registering cluster: getting namespace \"kube-system\": Get \"https://100.10.1.0:443/api/v1/namespaces/kube-system\": net/http: TLS handshake timeout" provider=eks version=v0.22.1

You can resolve this issue by deleting the castai-agent Pod. The Deployment then recreates the Pod, which typically clears the error.
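A minimal sketch of that restart, assuming the agent runs in the castai-agent namespace used by the Helm commands in this document. The function names are hypothetical; look up the exact Pod name first, since it carries a generated suffix.

```shell
# Hypothetical helpers; the namespace defaults to castai-agent.

# List the Pods in the agent namespace to find the agent Pod's full name.
list_agent_pods() {
  kubectl get pods -n "${1:-castai-agent}"
}

# Delete the named Pod so that its Deployment creates a fresh one.
# Usage: restart_agent_pod <pod-name> [namespace]
restart_agent_pod() {
  kubectl delete pod "$1" -n "${2:-castai-agent}"
}
```

After the deletion, running list_agent_pods again should show a new Pod starting up.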

Refused connection to control plane

When enabling automated cluster optimization for the first time, you run a pre-generated script to grant Cast AI the required permissions. The error message "No access to Kubernetes API server, please check your firewall settings" indicates that a firewall is blocking communication between the control plane and Cast AI.

To solve this issue, allow the Cast AI IP address 35.221.40.21 through the firewall and then enable automated optimization again.