Cluster and node status overview

This guide describes cluster status values, which define the current state of the cluster's connection to Cast AI, and node status values, which indicate the health of nodes and their readiness to accept pods.

Overview of cluster status values

The cluster's status in the Cast AI console defines the current state of its connection to Cast AI and indicates whether the platform can perform automated optimization actions on it.

Status

Explanation

Action

Connecting

Cluster is in the process of being connected to Cast AI in Read only mode.

OR

Cluster is transitioning from the Read only mode to Cast AI managed mode (where the customer can set up automation).

Read only

Cluster is connected to Cast AI in read-only mode. Reporting features are enabled.

Connected

Cluster is connected to Cast AI in managed mode. Reporting features are enabled, and automation can be set up.

Warning

The Cast AI-managed cluster has encountered a transient error and is attempting to recover from it automatically. Autoscaling is not working.

Not responding (Read only)

Cast AI has recently lost connectivity to a cluster that was previously connected in Read only mode. If the connection is not restored within 5 minutes, the status changes to Disconnected (Read only).

Check the status of the castai-agent pod in the castai-agent namespace.

Not responding

Cast AI has recently lost connectivity to a cluster. Autoscaling is not working.

Check the status of the castai-agent pod in the castai-agent namespace.

Failed

Cast AI has encountered an error and can't recover from it automatically. Autoscaling is not working.

Hover over the Status to view error details.

Check the status of Cast AI components in the castai-agent namespace.

Disconnecting

The cluster is being disconnected from Cast AI.

Disconnected

The cluster was previously connected to Cast AI and is now disconnected.

Hover over the Status to see when the cluster was disconnected.
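
The Action entries for the Not responding and Failed statuses ask you to check pods in the castai-agent namespace. The following is a minimal Python sketch of that check, assuming kubectl is installed and its current context points at the affected cluster. The function names are illustrative and are not part of any Cast AI tooling.

```python
import json
import shutil
import subprocess

NAMESPACE = "castai-agent"  # namespace named in the Action column above


def pod_status_command(namespace: str = NAMESPACE) -> list[str]:
    """Build the kubectl invocation that lists pods in the namespace as JSON."""
    return ["kubectl", "get", "pods", "--namespace", namespace, "--output", "json"]


def summarize_pods(pod_list_json: str) -> dict[str, str]:
    """Map each pod name to its phase (for example Running or Pending)."""
    pods = json.loads(pod_list_json).get("items", [])
    return {
        pod["metadata"]["name"]: pod.get("status", {}).get("phase", "Unknown")
        for pod in pods
    }


def unhealthy_pods(phases: dict[str, str]) -> list[str]:
    """Return names of pods whose phase is anything other than Running.

    Assumption: every pod in this namespace is long-running, so a phase such
    as Pending, Failed, or Succeeded is worth a closer look.
    """
    return sorted(name for name, phase in phases.items() if phase != "Running")


def check_agent_pods() -> dict[str, str]:
    """Run kubectl against the current context; requires kubectl on PATH."""
    if shutil.which("kubectl") is None:
        raise RuntimeError("kubectl was not found on PATH")
    result = subprocess.run(
        pod_status_command(), capture_output=True, text=True, check=True
    )
    return summarize_pods(result.stdout)
```

Calling `unhealthy_pods(check_agent_pods())` returns the pods that deserve inspection. An empty list suggests the pods themselves are running, so the connectivity problem may lie elsewhere.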

Overview of node status values

The node's status in the console indicates its health and readiness to accept pods.

Status

Explanation

Action

Cordoned

When a Kubernetes node is in the Cordoned state, scheduling new pods onto that node is temporarily disabled. A user or system might have cordoned a node in preparation for node deletion.

Cast AI also cordons a node and leaves it in the cluster if pods were not evicted during rebalancing (with the Graceful Rebalancing option turned on).

Inspect the node to understand the reason behind cordoning.

If a node was cordoned during rebalancing, adjust the pod disruption budget and uncordon the node.

Creating

Cast AI is in the process of creating a node.

Deleted

A short-term status indicating that a node was deleted.

Deleting

Cast AI is in the process of deleting the node.

Detached

A node that is still present in the cloud but has been detached from the Kubernetes cluster.

Inspect the node and delete it manually from the cloud.

Draining

The node is being drained; Kubernetes gracefully evicts existing pods from the node.

Interrupted

In all cases, Cast AI manages the interruption and prepares the necessary capacity. Any of the following scenarios can trigger this status on a spot node:

  • An interruption event is received from the cloud provider
  • A rebalancing recommendation is received from the cloud provider, indicating a possible interruption
  • Cast AI predicts a node interruption

Cast AI is handling the interruption and is preparing replacement capacity.

Lost

A node is no longer part of the Kubernetes cluster; however, Cast AI has not yet deleted it.

If a node is in this state for a prolonged period, contact Cast AI support to troubleshoot the issue.

Not ready

A node is temporarily unable to accept new workloads, either because Cast AI is still preparing it as part of the provisioning process or because it is experiencing issues, such as network problems or insufficient resources, that prevent it from communicating properly with the control plane.

If a node is in this state for a prolonged period, contact Cast AI support to troubleshoot the issue.

Ready

Node is fully operational and ready to accept pods.
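
Three of the statuses above (Cordoned, Not ready, and Ready) correspond to fields that Kubernetes itself exposes on each node, which helps when inspecting a node outside the console. The following is a minimal Python sketch that classifies nodes from the JSON printed by `kubectl get nodes -o json`. It reflects only the Kubernetes view (`spec.unschedulable` and the `Ready` condition) and is not Cast AI's own status logic. The function names and the choice to report Cordoned ahead of readiness are assumptions made for illustration.

```python
import json


def kubernetes_node_state(node: dict) -> str:
    """Classify one node object as "Cordoned", "Not ready", or "Ready".

    Assumption: a cordoned node is reported as Cordoned regardless of its
    readiness, because cordoning is usually what the operator needs to see.
    """
    if node.get("spec", {}).get("unschedulable", False):
        return "Cordoned"
    conditions = node.get("status", {}).get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
    return "Ready" if ready else "Not ready"


def classify_nodes(node_list_json: str) -> dict[str, str]:
    """Map each node name to its state, given `kubectl get nodes -o json` output."""
    nodes = json.loads(node_list_json).get("items", [])
    return {n["metadata"]["name"]: kubernetes_node_state(n) for n in nodes}
```

Once the reason for cordoning is understood, a node can be returned to service with `kubectl uncordon <node-name>`.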