Business Continuity
Fallback options in case CAST AI is not operational
All cloud providers can occasionally go down, and SaaS companies using their services could become temporarily non-operational.
This article outlines how you can minimize the operational impact on applications running in your Kubernetes cluster if CAST AI can't provide services.
Risks
CAST AI manages Kubernetes cluster capacity automatically. Businesses rely on CAST AI to add the necessary capacity at critical moments: when scaling up, handling peaks, or during spot instance droughts.
Kubernetes cluster is not cost-optimized
If CAST AI becomes unavailable during scale-down or quiet periods, it poses minimal to no risk to workload availability; the only impact is on the cluster's cost efficiency.
Application pods can't start due to insufficient capacity
Workload Pods are Pending, usually due to an increase in the number of Pod Replicas through the HPA or KEDA mechanisms. It is expected that CAST AI will be able to add new Kubernetes Nodes within the next 60-120 seconds so that the K8s Scheduler can schedule all Pending Pods.
Detection
Using an OpenMetrics monitoring solution like Prometheus is recommended to collect metrics from your Kubernetes cluster and to scrape complementary metrics from the CAST AI SaaS. It's also useful to set up notifications, for example, using Grafana Alerts.
Pending Pod age
You can use your own metrics from the Kubernetes API about Pending Pods' age.
It is recommended to set alerts if the pending period is 15 minutes or more. However, such notifications can add to alert noise, as a misconfiguration in the workload manifest will leave the Pod in the Pending phase indefinitely.
kube_pod_status_phase{phase="Pending"}
Action: This alert should trigger further investigation, as it does not tell whether the cause is an application misconfiguration or another issue.
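As a sketch, a Prometheus alerting rule built on this metric could look like the following (the alert name, severity label, and summary text are illustrative):

groups:
  - name: pending-pods
    rules:
      - alert: PodsPendingTooLong
        # Fires when the count of Pending Pods stays above 0 for 15 minutes,
        # matching the recommendation above.
        expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pods have been Pending for more than 15 minutes"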
K8s Cluster is not communicating with CAST AI
CAST AI's operations rely on the platform's components running in the subject Kubernetes cluster and network communication with https://api.cast.ai.
The castai_autoscaler_agent_snapshots_received_total metric indicates that the CAST AI agent can deliver Kubernetes API changes ("snapshots") to the CAST AI SaaS and get a delivery confirmation. If this metric drops to 0, the CAST AI agent is not running, or something else is preventing snapshot delivery.
Action: Check if the CAST AI agent is running in the castai-agent namespace. If it's not, describe the pod and check its logs. The cause can be a network/firewall issue or another reason why CAST AI is unreachable over the internet.
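A quick check could look like this (the label selector is an assumption; adjust it to match your agent deployment):

kubectl get pods -n castai-agent
kubectl describe pods -n castai-agent -l app.kubernetes.io/name=castai-agent
kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-agent --tail=100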
CAST AI receives data from the agent, but it doesn't process the snapshot
If castai_autoscaler_agent_snapshots_received_total is not 0 but castai_autoscaler_agent_snapshots_processed_total is 0, the issue is on the CAST AI side.
The CAST AI team should already be aware of such an issue, but you can raise an incident within the CAST AI console or report it on Slack.
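One way to catch this condition yourself is an alert expression along these lines (a sketch; the rate window is illustrative):

sum(rate(castai_autoscaler_agent_snapshots_received_total[10m])) > 0
  and
sum(rate(castai_autoscaler_agent_snapshots_processed_total[10m])) == 0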
Cluster Status in CAST AI console is not Healthy
CAST AI requires the agent and the cluster controller to be running at all times in order to add new Nodes. Various issues can prevent this, from cloud credentials no longer being valid to invalidated API tokens, etc.
Use the Notification Webhook functionality to get notified about a failed or warning cluster status.
Fallback scaling
Adding the necessary capacity to the Kubernetes cluster to make sure all critical applications are running (not pending) is a top priority. This emergency capacity can be added manually or triggered automatically after a time delay.
Manual
In GKE or AKS, the cluster autoscaler is integrated into the node pools. During CAST AI onboarding, the cluster autoscaler is usually disabled on the native node pool. In an emergency, you can re-enable it from the GCP or Azure portal in a web browser.
Set the AKS node pool "Scale" configuration to "Autoscale".
To enable the cluster autoscaler on the default node pool using az cli, use the following command (--min-count and --max-count are required when enabling it; adjust the limits to your environment):
az aks nodepool update --enable-cluster-autoscaler --min-count 1 --max-count 10 -g MyResourceGroup -n default --cluster-name MyCluster
In the GKE default node pool configuration, check "Enable cluster autoscaler".
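Alternatively, enabling it from the command line could look roughly like this (the cluster name, node pool name, zone, and node count limits are illustrative):

gcloud container clusters update my-cluster --enable-autoscaling --node-pool=default-pool --min-nodes=1 --max-nodes=10 --zone=us-central1-a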
In EKS, the autoscaler usually runs inside the cluster, and with CAST AI it is scaled down to 0 replicas. To fall back to it manually, scale the cluster autoscaler deployment back to 1 replica.
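Assuming the standard deployment name and namespace (adjust to your setup), that could be:

kubectl scale deployment cluster-autoscaler -n kube-system --replicas=1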
Enabling the cluster autoscaler again will reinflate the Auto Scaling Group/node pool based on Pending Pods.
Fallback when Node Templates are in use
Some workloads will have placement logic that uses nodeAffinity or nodeSelector to require nodes with specific labels, such as "mycompany=spark-app" or "pool=gpu".
To satisfy these placement requirements, users set up Node Templates. However, if you need to fall back to node pools or Auto Scaling groups, you need to configure labels similar to those used on the Node Templates. As in the above scenario, the cluster autoscaler should be enabled for these node pools or ASGs.
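For illustration, a workload pinned to GPU nodes might carry a selector like this (the names and label value are examples); the fallback node pool or ASG must then expose the same label:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-app                  # illustrative name
spec:
  nodeSelector:
    pool: gpu                    # the fallback node pool/ASG must carry this label
  containers:
    - name: app
      image: mycompany/gpu-app:latest   # illustrative image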
Automatic
The automatic fallback approach assumes that the cluster autoscaler is always on but configured to act with a delay. This gives priority to Nodes created by CAST AI; if the platform becomes unable to add capacity, the cluster autoscaler resumes creating Nodes after the delay.
Each Node Template would have a matching fallback ASG created with labels that align with the corresponding Node Template selectors, as described in the "Fallback when Node Templates are in use" section above.
Modifying or setting up a fallback autoscaler
Add extra flags to the container command section:
--new-pod-scale-up-delay=600s – to configure a global delay before the cluster autoscaler reacts to new Pending Pods.
If setting up a secondary cluster autoscaler:
--leader-elect-resource-name=cluster-autoscaler-fallback – a dedicated lease name so the fallback instance does not clash with the legacy autoscaler's leader election.
Make sure the cluster role used for the fallback autoscaler's RBAC permissions allows access to the new lease:
- apiGroups:
    - coordination.k8s.io
  resourceNames:
    - cluster-autoscaler
    - cluster-autoscaler-fallback
  resources:
    - leases
  verbs:
    - get
    - update
Example of a fallback autoscaler configuration:
spec:
  containers:
    - command:
        - ./cluster-autoscaler
        - --v=4
        - --leader-elect-resource-name=cluster-autoscaler-fallback
        - --new-pod-scale-up-delay=600s
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/v1
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.2