Pausing a cluster

CAST AI provides a set of CronJobs that pause and resume a Kubernetes cluster on a defined schedule. While the cluster is paused, CAST AI components continue to run on a single designated node, and the rest of the cluster capacity is removed. When the cluster resumes, CAST AI uses its standard Autoscaler capabilities to provision the most cost-efficient nodes for the pending pods.

Pausing and resuming a cluster relies on two CronJobs:

Hibernate-pause Job

Disable the Unscheduled Pod Policy (to prevent the cluster from growing)
Prepare the Hibernation node (the node that keeps hosting essential components)
Mark essential Deployments with the Hibernation toleration
Delete all other nodes (only the Hibernation node should stay running)

Hibernate-resume Job

Re-enable the Unscheduled Pod Policy to allow the cluster to expand to the needed size

Override default hibernate-node size

Set the HIBERNATE_NODE environment variable to override the default node sizing selections. Make sure the size selected is appropriate for your cloud.
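
One possible way to set this variable on an installed CronJob is `kubectl set env`. The sketch below is a minimal example under several assumptions that are not confirmed by this guide: the CronJob is named `hibernate-pause`, it lives in the `castai-agent` namespace, and `HIBERNATE_NODE` accepts an instance type name. Check the names with `kubectl get cronjobs -A` before relying on it.

```shell
# Assumed names; adjust to match your installation.
NAMESPACE="${NAMESPACE:-castai-agent}"
PAUSE_CRONJOB="${PAUSE_CRONJOB:-hibernate-pause}"
# Set KUBECTL=echo to preview the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Set HIBERNATE_NODE on the pause CronJob's pod template.
# $1 = node size (hypothetical value, must be valid for your cloud).
set_hibernate_node() {
  $KUBECTL set env "cronjob/$PAUSE_CRONJOB" -n "$NAMESPACE" "HIBERNATE_NODE=$1"
}
```

For example, `set_hibernate_node m5.large` would request that size on EKS; the instance type here is only an illustration.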

Install hibernate

Run this command to install the Hibernate CronJobs:

kubectl apply -f https://raw.githubusercontent.com/castai/hibernate/main/deploy.yaml
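
After applying the manifest, it may help to confirm that the CronJobs exist, and optionally to trigger one by hand instead of waiting for its schedule. The helpers below are a sketch only: the `castai-agent` namespace and the `hibernate-pause` / `hibernate-resume` names are assumptions, and the function names are hypothetical.

```shell
# Assumed namespace; adjust to match your installation.
NAMESPACE="${NAMESPACE:-castai-agent}"
# Set KUBECTL=echo to preview commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# List the CronJobs in the namespace to confirm the install.
list_hibernate_cronjobs() {
  $KUBECTL get cronjobs -n "$NAMESPACE"
}

# Create a one-off Job from a CronJob.
# $1 = CronJob name (e.g. hibernate-pause, an assumed name).
run_cronjob_now() {
  $KUBECTL create job "$1-manual-$(date +%s)" --from="cronjob/$1" -n "$NAMESPACE"
}
```

Note that running the pause job manually removes cluster capacity right away, so `run_cronjob_now hibernate-pause` is best tried on a non-production cluster first.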

Change API key

Create an API token with Full Access permissions and encode it as base64:

echo -n "98349587234524jh523452435kj2h4k5h2k34j5h2kj34h5k23h5k2345jhk2" | base64

Use this value to update the Secret:

apiVersion: v1
kind: Secret
metadata:
  name: castai-hibernate
  namespace: castai-agent
type: Opaque
data:
  API_KEY: >-
    CASTAI-API-KEY-REPLACE-ME-WITH-ABOVE==

Or, for convenience, use a one-liner:

kubectl get secret castai-hibernate -n castai-agent -o json | jq --arg API_KEY "$(echo -n 9834958-CASTAI-API-KEY-REPLACE-ME-5k2345jhk2 | base64)" '.data["API_KEY"]=$API_KEY' | kubectl apply -f -
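
One detail worth guarding against: GNU `base64` wraps its output at 76 characters by default, so a long API key can encode to more than one line. The sketch below strips the newlines and renders the same Secret manifest shown above. The function names are hypothetical helpers, not part of the Hibernate tooling.

```shell
# Encode an API key as single-line base64.
# Newlines are stripped because GNU base64 wraps long output.
encode_api_key() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Print a Secret manifest for the given API key.
# $1 = plain-text API key
render_hibernate_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: castai-hibernate
  namespace: castai-agent
type: Opaque
data:
  API_KEY: $(encode_api_key "$1")
EOF
}
```

The output can be piped straight to the cluster, for example `render_hibernate_secret "$CASTAI_API_KEY" | kubectl apply -f -`, where `CASTAI_API_KEY` is an assumed environment variable holding the token.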

Set Cloud env variable

The "Cloud" env variable defaults to AKS. If your cluster runs on a different provider, change the variable in both CronJobs to the matching value from [EKS|GKE|AKS].
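
A minimal sketch of changing the variable in both CronJobs with `kubectl set env` follows. It assumes the CronJobs are named `hibernate-pause` and `hibernate-resume` in the `castai-agent` namespace, which this guide does not confirm; `set_cloud` is a hypothetical helper name.

```shell
# Assumed names; adjust to match your installation.
NAMESPACE="${NAMESPACE:-castai-agent}"
CRONJOBS="${CRONJOBS:-hibernate-pause hibernate-resume}"
# Set KUBECTL=echo to preview commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Set the "Cloud" env variable on every CronJob in $CRONJOBS.
# $1 = EKS, GKE, or AKS; any other value is rejected.
set_cloud() {
  case "$1" in
    EKS|GKE|AKS) ;;
    *) echo "Cloud must be EKS, GKE, or AKS" >&2; return 1 ;;
  esac
  for cj in $CRONJOBS; do
    $KUBECTL set env "cronjob/$cj" -n "$NAMESPACE" "Cloud=$1" || return 1
  done
}
```

Calling `set_cloud GKE` updates both CronJobs in one step, which avoids leaving the pause and resume jobs pointed at different providers.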

