Risks and detection

Cast AI manages Kubernetes cluster capacity automatically. Businesses rely on Cast AI to add the necessary capacity at critical moments: when scaling up, handling peaks, or coping with spot droughts.

Risks

Kubernetes cluster is not cost-optimized

If Cast AI becomes unavailable during scale-down or quiet periods, the risk to workload availability is minimal to none. Only the cluster's cost efficiency is affected.


Application pods can't start due to insufficient capacity

Workload Pods become Pending, usually because HPA or KEDA has increased the number of Pod replicas. Cast AI is expected to add new Kubernetes Nodes within 60-120 seconds so that the Kubernetes scheduler can place all Pending Pods. If Cast AI is unavailable at that moment, no Nodes are added and the Pods stay Pending.


Detection

Pending Pod age

You can use metrics derived from the Kubernetes API (for example, those exposed by kube-state-metrics) to determine how long Pods have been Pending.

We recommend alerting when a Pod has been Pending for 15 minutes or more. Such alerts can add noise, however, because a misconfigured workload manifest leaves the Pod in the Pending phase indefinitely.

kube_pod_status_phase{phase="Pending"}

Action: Investigate further. The alert alone does not tell you whether the cause is an application misconfiguration or another issue.
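As a rough illustration of the 15-minute rule, the sketch below filters a list of Pod records for those that have been Pending too long. The record shape (name, phase, created) and the function name long_pending_pods are hypothetical; in a real cluster the values would come from each Pod's status.phase and metadata.creationTimestamp. It also assumes that time since creation is a good enough approximation of time spent Pending.

```python
from datetime import datetime, timedelta, timezone


def long_pending_pods(pods, now, threshold=timedelta(minutes=15)):
    """Return (name, age) pairs for Pods Pending at least `threshold`, oldest first.

    `pods` is a hypothetical list of dicts with "name", "phase" and "created"
    (an RFC 3339 UTC timestamp such as "2024-01-01T11:30:00Z").
    """
    stuck = []
    for pod in pods:
        if pod["phase"] != "Pending":
            continue
        # Normalize the trailing "Z" so fromisoformat accepts it on older Pythons.
        created = datetime.fromisoformat(pod["created"].replace("Z", "+00:00"))
        age = now - created
        if age >= threshold:
            stuck.append((pod["name"], age))
    return sorted(stuck, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        {"name": "web-1", "phase": "Pending", "created": "2024-01-01T11:30:00Z"},
        {"name": "web-2", "phase": "Running", "created": "2024-01-01T10:00:00Z"},
    ]
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for name, age in long_pending_pods(sample, now):
        print(f"{name} has been Pending for {age}")
```

The result still needs a human look: a Pod on this list may be waiting for capacity or may simply have a manifest that can never be scheduled.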

Kubernetes Cluster is not communicating with Cast AI

Cast AI relies on its components running in the managed Kubernetes cluster and on network connectivity from the cluster to https://api.cast.ai.

The castai_autoscaler_agent_snapshots_received_total metric indicates that the Cast AI agent can deliver Kubernetes API changes ("snapshots") to Cast AI SaaS and get a delivery confirmation. If this metric drops to 0, either the Cast AI agent is not running or something else is preventing snapshot delivery.

Action: Check whether the Cast AI agent is running in the castai-agent namespace. If it is not, describe the pod and check its logs. The cause may be a network or firewall issue, or anything else that makes Cast AI unreachable over the internet.
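The action above can be sketched as a small helper that turns a list of agent Pod statuses into the kubectl commands to run next. The function name agent_followups, the input shape, and the Pod names are hypothetical; only the castai-agent namespace comes from this page. A Pod in the Running phase may still be unhealthy, so treat an empty result as a hint rather than proof.

```python
NAMESPACE = "castai-agent"


def agent_followups(pods, namespace=NAMESPACE):
    """Suggest kubectl commands for agent Pods that are not in the Running phase.

    `pods` is a hypothetical list of dicts with "name" and "phase".
    """
    if not pods:
        # Nothing found at all: start by listing what exists in the namespace.
        return [f"kubectl get pods -n {namespace}"]
    commands = []
    for pod in pods:
        if pod["phase"] != "Running":
            commands.append(f"kubectl describe pod {pod['name']} -n {namespace}")
            commands.append(f"kubectl logs {pod['name']} -n {namespace}")
    return commands


if __name__ == "__main__":
    observed = [{"name": "castai-agent-example", "phase": "Pending"}]
    for command in agent_followups(observed):
        print(command)
```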

Cast AI receives data from the agent but doesn't process the snapshots

If castai_autoscaler_agent_snapshots_received_total is not 0 but castai_autoscaler_agent_snapshots_processed_total is 0, the issue is on the Cast AI side.

The Cast AI team should already be aware of it, but you can still raise an incident in the Cast AI console or report it on Slack.
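The two snapshot checks can be combined into one triage step, sketched below. It assumes both metrics are cumulative counters (as the _total suffix suggests), so "is 0" is read here as "did not grow over the observed window". The function names increase and triage and the returned labels are illustrative only.

```python
def increase(samples):
    """Growth of a cumulative counter across an ordered list of samples."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the counter was reset (for example, a restart),
            # so everything up to `cur` is new growth.
            total += cur
    return total


def triage(received, processed):
    """Classify the situation from samples of the two snapshot counters."""
    if increase(received) == 0:
        # No snapshots delivered: check the agent and network connectivity.
        return "agent-or-network"
    if increase(processed) == 0:
        # Snapshots arrive but are not processed: the problem is on the Cast AI side.
        return "cast-ai-side"
    return "healthy"


if __name__ == "__main__":
    print(triage(received=[40, 44, 49], processed=[38, 38, 38]))
```

Using growth over a window rather than the raw value avoids a false "healthy" reading when a counter is stuck at a non-zero number.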