Risks and Detection

CAST AI manages Kubernetes cluster capacity automatically. Businesses rely on CAST AI to add the necessary capacity at critical moments: when scaling up, handling peaks, or riding out spot droughts.

Risks

Kubernetes cluster is not cost-optimized

If CAST AI becomes unavailable during scale-down or quiet periods, the risk to workload availability is minimal to none. Only the cluster's cost efficiency is affected.


Application pods can't start due to insufficient capacity

Workload Pods are stuck in the Pending phase, usually because HPA or KEDA has increased the number of Pod replicas. CAST AI is normally expected to add new Kubernetes Nodes within the next 60-120 seconds so that the Kubernetes scheduler can place all Pending Pods. If CAST AI is unavailable, that capacity is not added and the Pods stay Pending.


Detection

Pending Pod age

You can track the age of Pending Pods with your own metrics derived from the Kubernetes API.

It is recommended to alert when a Pod has been Pending for 15 minutes or more. Be aware that such alerts can add noise, because a misconfigured workload manifest will also leave a Pod in the Pending phase indefinitely.

kube_pod_status_phase{phase="Pending"}

Action: Treat this alert as a trigger for further investigation, because on its own it does not tell you whether the cause is application misconfiguration or another issue.
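The 15-minute check described above can be sketched in Python. This is a minimal illustration, not CAST AI tooling: it assumes you already have Pod objects as dictionaries in the shape returned by the Kubernetes API (for example, parsed JSON from a Pod list), and it approximates the pending age as the time since the Pod's creationTimestamp. The names long_pending_pods, parse_k8s_time, and PENDING_ALERT_THRESHOLD are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the recommendation above (assumed default for this sketch).
PENDING_ALERT_THRESHOLD = timedelta(minutes=15)


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as '2024-01-01T12:00:00Z' as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def long_pending_pods(pods, now, threshold=PENDING_ALERT_THRESHOLD):
    """Return (namespace, name, age) for each Pod Pending for at least `threshold`.

    Assumption: the pending age is approximated by the time elapsed since
    metadata.creationTimestamp.
    """
    results = []
    for pod in pods:
        # Skip Pods that are not in the Pending phase.
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        meta = pod["metadata"]
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if age >= threshold:
            results.append((meta["namespace"], meta["name"], age))
    return results
```

A caller would pass the current UTC time as now, for example datetime.now(timezone.utc), and forward any returned Pods to its own alerting channel. The result still needs manual triage, since a misconfigured manifest produces the same signal.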


K8s Cluster is not communicating with CAST AI

CAST AI's operations rely on the platform's components running in the managed Kubernetes cluster and on network communication with https://api.cast.ai.

The castai_autoscaler_agent_snapshots_received_total metric indicates that the CAST AI agent can deliver Kubernetes API changes ("snapshots") to the CAST AI SaaS and receive a delivery confirmation. If this metric drops to 0, either the CAST AI agent is not running or something else is preventing snapshot delivery.

Action: Check whether the CAST AI agent is running in the castai-agent namespace. If it is not, describe the pod and check its logs. The cause may be a network or firewall issue, or another reason why CAST AI is unreachable over the internet.


CAST AI receives data from the agent, but it doesn't process the snapshot

If castai_autoscaler_agent_snapshots_received_total is not 0 but castai_autoscaler_agent_snapshots_processed_total is 0, the issue is entirely on the CAST AI side.

The CAST AI team should already be aware of such an issue, but you can still raise an incident in the CAST AI console or report it on Slack.
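The two snapshot checks in this section can be combined into a single decision, sketched below in Python. This is an illustration under assumptions: the function name diagnose_snapshot_pipeline and the returned labels are hypothetical, and the inputs are whatever values your monitoring system reports for the two metrics over the window you are inspecting. How those values are queried is left out.

```python
def diagnose_snapshot_pipeline(received, processed):
    """Classify snapshot health from the two metric values described above.

    received:  observed value of castai_autoscaler_agent_snapshots_received_total
    processed: observed value of castai_autoscaler_agent_snapshots_processed_total
    The returned labels are hypothetical names chosen for this sketch.
    """
    if received == 0:
        # Snapshots are not being delivered: check the agent, network, firewall.
        return "agent_not_delivering"
    if processed == 0:
        # Snapshots are delivered but not processed: the issue is on the CAST AI side.
        return "cast_ai_not_processing"
    return "healthy"
```

The order of the checks matters: delivery is tested first, because a processed value of 0 only points at CAST AI once snapshots are known to be arriving.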