Common Deployment Challenges
Learn how to troubleshoot Kubernetes Pod scheduling issues and common deployment challenges when using CAST AI.
Topology spread constraints
Kubernetes Scheduler
Pods might not get scheduled as intended due to conflicting or incorrectly configured topology spread constraints.
Here's what you can do to troubleshoot this issue:
- Review constraints: ensure your topology spread constraints are configured correctly in the Pod specs. Verify the `maxSkew`, `topologyKey`, and `whenUnsatisfiable` parameters.
- Verify node labels: ensure your nodes have accurate and required labels. If `topologyKey` references node labels, these must be consistent across all nodes (see the commands after this list).
- Check the scheduler logs: inspect the Kubernetes scheduler logs for any issues or warnings related to Pod scheduling. Use `kubectl describe pod <pod-name>` to see events and errors.
- Inspect resource utilization: ensure that your nodes have enough resources to accommodate the Pods subject to the spread constraints. Insufficient resources or overly tight constraints can lead to scheduling issues.
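For the node-label check, a couple of kubectl queries can confirm that the zone label referenced by `topologyKey` is present and consistent on every node. This is a minimal sketch; `<pod-name>` is a placeholder for a Pod that is stuck in `Pending`:

```shell
# List each node together with its zone label; the label should be set on every node
kubectl get nodes -L topology.kubernetes.io/zone

# Show all node labels for a broader consistency check
kubectl get nodes --show-labels

# Inspect scheduling events and errors for a Pod that isn't being placed
kubectl describe pod <pod-name>
```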
Event: Cannot Satisfy Zone Topology Spread
`topology.kubernetes.io/zone` is a supported topology spread constraint key in the Node Placer, allowing you to deploy Pods in a highly available way that takes advantage of cloud zone redundancy.
The Kubernetes Scheduler takes all cluster nodes and extracts their availability zones (AZs), which are then used to distribute workloads across nodes according to the topology spread constraints.
CAST AI's Autoscaler won't scale up Pods when it detects that it can't create new nodes in AZs the cluster isn't configured for, even if the cluster already contains nodes in those non-configured zones. Instead, it adds a `Cannot Satisfy Zone Topology Spread` event to the affected Pods.
Here's what you can do to troubleshoot this issue:
- Check for which AZs the cluster is configured: if the cluster has nodes in AZs that aren't configured, remove those nodes from the cluster.
- Configure the cluster to use the required AZs for nodes.
- Configure stricter node affinity on the workload, specifying only the configured AZs (see the sketch below).
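For the node-affinity option, a Pod spec fragment like the one below pins a workload to the configured zones. This is a minimal sketch: the zone names `us-east-1a` and `us-east-1b` are placeholders for whichever AZs your cluster is actually configured to use.

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  # Placeholder zone names - list only the AZs the cluster is configured for
                  - us-east-1a
                  - us-east-1b
```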
After implementing the troubleshooting steps, initiate reconciliation on the updated cluster.
Example Deployment
This example deployment is expected to spawn at least 3 nodes in different zones, each of them running 10 replicas.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-topology-spread
  name: az-topology-spread
spec:
  replicas: 30
  selector:
    matchLabels:
      app: az-topology-spread
  template:
    metadata:
      labels:
        app: az-topology-spread
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - az-topology-spread
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              cpu: 1000m
```
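Once the Deployment is applied, you can verify how the replicas were spread across zones. A rough sketch, assuming the manifest above is saved as `az-topology-spread.yaml`:

```shell
# Apply the example Deployment
kubectl apply -f az-topology-spread.yaml

# Show which node each replica landed on
kubectl get pods -l app=az-topology-spread -o wide

# Show each node's zone label to confirm the spread across AZs
kubectl get nodes -L topology.kubernetes.io/zone
```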
Custom secret management
There are many technologies for managing Secrets in GitOps. Some store the encrypted secret data in a Git repository and use a cluster add-on to decrypt the data during deployment. Others use a reference to an external secret manager or vault.
The agent helm chart provides the `apiKeySecretRef` parameter to enable the use of CAST AI with custom secret managers:
```yaml
# Name of secret with Token to be used for authorizing agent access to the API
# apiKey and apiKeySecretRef are mutually exclusive
# The referenced secret must provide the token in .data["API_KEY"]
apiKeySecretRef: ""
```
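Whatever tool manages the secret, it must end up in the cluster with the token stored under the `API_KEY` key. As a quick sketch of the required shape, this is how such a secret would look if created directly with kubectl (the name `castai-api-token` is a placeholder; any name works as long as it matches `apiKeySecretRef`):

```shell
# Create a secret holding the CAST AI API key under the API_KEY data key
kubectl create secret generic castai-api-token \
  --namespace castai-agent \
  --from-literal=API_KEY=<your-castai-api-key>
```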
An example of the CAST AI agent
Here's an example of using the CAST AI agent helm chart with a custom secret:
```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-agent castai-helm/castai-agent -n castai-agent \
  --set apiKeySecretRef=<your-custom-secret> \
  --set clusterID=<your-cluster-id>
```
An example of the CAST AI cluster controller
Here's an example of using the CAST AI cluster controller helm chart with a custom secret:
```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
  --set castai.apiKeySecretRef=<your-custom-secret> \
  --set castai.clusterID=<your-cluster-id>
```
TLS handshake timeout issue
In some edge cases, due to specific cluster network setup, the agent might fail with the following message in the agent container logs:
time="2021-11-13T05:19:54Z" level=fatal msg="agent failed: registering cluster: getting namespace \"kube-system\": Get \"https://100.10.1.0:443/api/v1/namespaces/kube-system\": net/http: TLS handshake timeout" provider=eks version=v0.22.1
You can resolve this issue by deleting the `castai-agent` Pod; the Deployment will recreate it and the issue will be resolved.
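A sketch of that fix, assuming the agent was installed into the default `castai-agent` namespace:

```shell
# Find the agent Pod
kubectl get pods -n castai-agent

# Delete it; the castai-agent Deployment recreates it automatically
kubectl delete pod -n castai-agent <castai-agent-pod-name>
```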
Refused connection to control plane
When enabling automated cluster optimization for the first time, the user runs a pre-generated script to grant the required permissions to CAST AI. The error message `No access to Kubernetes API server, please check your firewall settings` indicates that a firewall prevents communication between the control plane and CAST AI.
To solve this issue, allow access to the CAST AI IP address `35.221.40.21` and then enable automated optimization again.
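The exact steps depend on your cloud provider and on where the firewall sits. As one illustrative example only, on a GKE cluster that restricts control plane access with master authorized networks, the CAST AI IP could be appended to the allowed list (the cluster name and existing CIDRs are placeholders; the flag replaces the whole list, so keep your current entries):

```shell
# Add the CAST AI IP to the control plane's authorized networks.
# The flag overwrites the existing list, so include the CIDRs you already allow.
gcloud container clusters update <your-cluster-name> \
  --enable-master-authorized-networks \
  --master-authorized-networks <your-existing-cidrs>,35.221.40.21/32
```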