Pausing a cluster

Legacy Feature

This page describes a legacy method for pausing clusters using CronJobs. This approach remains supported for existing implementations, but we recommend the improved Cluster Hibernation feature for new ones.

Cast AI provides a set of CronJobs that pause and resume Kubernetes clusters on a defined schedule. When the pause job runs, Cast AI components keep running on a single designated node while the rest of the cluster capacity is removed. When the cluster resumes, Cast AI uses its standard Autoscaler capabilities to provision the most cost-efficient nodes for the pending pods.

How It Works

The pausing mechanism consists of two main CronJobs.

Hibernate-pause Job

  • Disables the Unscheduled Pod Policy (to prevent growing the cluster)
  • Prepares a Hibernation node (the node that keeps hosting essential components)
  • Marks essential Deployments with the Hibernation toleration
  • Deletes all other nodes (only the hibernation node will stay running)

Hibernate-resume Job

  • Re-enables the Unscheduled Pod Policy to allow the cluster to expand to the needed size

Installation

Install via kubectl

Run this command to install the Hibernate CronJobs:

kubectl apply -f https://raw.githubusercontent.com/castai/hibernate/main/deploy.yaml

Configure API Key

Create an API token with Full Access permissions and encode it in base64:

echo -n "your-cast-ai-api-key" | base64

Use this value to update the Secret:

apiVersion: v1
kind: Secret
metadata:
  name: castai-hibernate
  namespace: castai-agent
type: Opaque
data:
  API_KEY: >-
    CASTAI-API-KEY-REPLACE-ME-WITH-ABOVE==

Alternatively, use this one-liner for convenience:

kubectl get secret castai-hibernate -n castai-agent -o json | jq --arg API_KEY "$(echo -n your-cast-ai-api-key | base64)" '.data["API_KEY"]=$API_KEY' | kubectl apply -f -
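Before updating the Secret, it can help to confirm that the encoded value decodes back to the original key. The following is a minimal sketch, not part of the Cast AI tooling: the function names encode_api_key and decode_api_key are hypothetical helpers, and the key shown is a placeholder.

```shell
# Encode a Cast AI API key for the Secret's data.API_KEY field.
# printf avoids the trailing newline a plain `echo` would add, and
# tr strips the line wrapping that some base64 implementations insert.
encode_api_key() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode a value to confirm it round-trips to the original key.
decode_api_key() {
  printf '%s' "$1" | base64 --decode
}

# Example (placeholder key, not a real token):
#   encoded="$(encode_api_key "your-cast-ai-api-key")"
#   [ "$(decode_api_key "$encoded")" = "your-cast-ai-api-key" ] && echo "round-trip ok"
```

If the round-trip check fails, the most likely cause is a stray newline or whitespace in the input.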

Install with Helm

Add the Cast AI Helm charts repository:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update

Install the hibernate component (set the cloud and apiKey values first):

helm upgrade -i castai-hibernate castai-helm/castai-hibernate -n castai-agent --set cloud=<AKS|EKS|GKE> --set apiKey=<CASTAI-API-KEY-REPLACE-ME-WITH-BASE64_ENCODE>

Configuration

Scheduling CronJobs

Update the hibernate-pause and hibernate-resume CronJob schedules according to your business needs. The schedules follow the Kubernetes CronJob syntax.

Default examples:

# Update hibernate-pause schedule according to business needs
pauseCronSchedule: "0 22 * * 1-5"

# Update hibernate-resume schedule according to business needs
resumeCronSchedule: "0 7 * * 1-5"

Starting with Kubernetes v1.25, you can define a time zone for a CronJob by setting .spec.timeZone to a valid time zone name. For example, .spec.timeZone: "Etc/UTC" makes Kubernetes interpret the schedule in Coordinated Universal Time (UTC). Valid names are listed in the tz database time zones list.
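One way to change the schedule and time zone of an installed CronJob is a merge patch with kubectl. The sketch below assumes the CronJobs are named hibernate-pause and hibernate-resume and live in the castai-agent namespace. The function names and the KUBECTL override are hypothetical conveniences, and "Europe/Vilnius" is only an example time zone.

```shell
# Build a JSON merge patch that sets a CronJob's schedule and time zone.
# $1 = cron expression, $2 = tz database name (for example "Etc/UTC").
cronjob_patch() {
  printf '{"spec":{"schedule":"%s","timeZone":"%s"}}' "$1" "$2"
}

# Apply the patch to a CronJob in the castai-agent namespace.
# $1 = CronJob name, $2 = cron expression, $3 = time zone name.
# Set KUBECTL=echo to print the kubectl arguments instead of running them.
set_cronjob_schedule() {
  ${KUBECTL:-kubectl} patch cronjob "$1" -n castai-agent \
    --type merge -p "$(cronjob_patch "$2" "$3")"
}

# Example (CronJob names are assumptions, adjust to your installation):
#   set_cronjob_schedule hibernate-pause  "0 22 * * 1-5" "Europe/Vilnius"
#   set_cronjob_schedule hibernate-resume "0 7 * * 1-5"  "Europe/Vilnius"
```

Running with KUBECTL=echo first is a cheap way to review the exact patch before it touches the cluster.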

Advanced Configuration Options

The pause and resume jobs support several configuration options through environment variables:

Override default hibernate-node size

Set the HIBERNATE_NODE environment variable to override the default node size selection. Make sure the size you choose is valid for your cloud provider.

Specify namespaces to preserve

Set the NAMESPACES_TO_KEEP environment variable to override the default preserved namespaces (defaults to "opa,istio"). The namespaces listed here will not be affected by the hibernation process.

Protect nodes marked as removal-disabled

Set PROTECT_EVICTION_DISABLED to "true" to prevent the removal of nodes that have the autoscaling.cast.ai/removal-disabled="true" label. This is useful for preserving nodes running critical or stateful workloads.
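For PROTECT_EVICTION_DISABLED to have any effect, the nodes to preserve must carry the label. A minimal sketch for applying and checking it follows. The function names and the KUBECTL override are hypothetical, and the node name in the example is a placeholder.

```shell
# Label a node so it is skipped when PROTECT_EVICTION_DISABLED is "true".
# $1 = node name. Set KUBECTL=echo to print the arguments instead of running them.
protect_node() {
  ${KUBECTL:-kubectl} label node "$1" \
    autoscaling.cast.ai/removal-disabled=true --overwrite
}

# List the nodes that currently carry the label.
list_protected_nodes() {
  ${KUBECTL:-kubectl} get nodes -l autoscaling.cast.ai/removal-disabled=true
}

# Example (placeholder node name):
#   protect_node my-stateful-node-1
#   list_protected_nodes
```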

Set API region

If you need to use a different API URL (e.g., for the Europe region), provide it via the API_URL environment variable:

API_URL: "https://api.eu.cast.ai"

If API_URL is not set, the default https://api.cast.ai is used.
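The environment variables above can be set on an installed CronJob with kubectl set env. This is a sketch under the assumption that the CronJobs are named hibernate-pause and hibernate-resume in the castai-agent namespace. The function name and the KUBECTL override are hypothetical, and the values are examples only.

```shell
# Set one or more environment variables on a hibernate CronJob.
# $1 = CronJob name; the remaining arguments are NAME=value pairs.
# Set KUBECTL=echo to print the kubectl arguments instead of running them.
set_hibernate_env() {
  cronjob="$1"
  shift
  ${KUBECTL:-kubectl} set env "cronjob/$cronjob" -n castai-agent "$@"
}

# Example (CronJob name is an assumption, values are illustrative):
#   set_hibernate_env hibernate-pause \
#     NAMESPACES_TO_KEEP="kube-system,monitoring" \
#     PROTECT_EVICTION_DISABLED="true" \
#     API_URL="https://api.eu.cast.ai"
```

If a setting such as API_URL should apply to both jobs, run the helper once for each CronJob.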

Advanced Use Cases

Selective Namespace Hibernation

If you need to pause and resume specific namespaces rather than the entire cluster, set the NAMESPACES_TO_KEEP environment variable to the namespaces that should be preserved during hibernation. All other namespaces are affected by the hibernation process.

Example configuration:

env:
- name: NAMESPACES_TO_KEEP
  value: "kube-system,monitoring,logging,istio-system"

Time-sharing GPU Resources

For clusters with GPU resources that you want to time-share among different teams or applications:

  1. Configure the hibernate CronJobs to run on a schedule that matches your time-sharing requirements
  2. Set NAMESPACES_TO_KEEP to include the namespaces that should always have access to the resources
  3. Configure separate hibernate CronJobs for different time slots if needed

For more fine-grained control over scaling specific workloads (rather than the entire cluster), you can combine this approach with kube-downscaler. When using kube-downscaler:

  1. Configure it to scale specific deployments to zero pods during off-hours
  2. Enable Cast AI's Evictor in aggressive mode to ensure the cluster shrinks after pods are removed

This approach provides more granular control than hibernating the entire cluster.

Upgrading

To upgrade the hibernate component to the latest version, run:

helm repo update castai-helm
helm upgrade castai-hibernate castai-helm/castai-hibernate --reuse-values -n castai-agent

Troubleshooting

Check the logs

To view the logs of the hibernate jobs:

kubectl logs -l app=castai-hibernate-pause -n castai-agent
kubectl logs -l app=castai-hibernate-resume -n castai-agent

Common issues

  1. Jobs not running at the scheduled time: Check the CronJob schedules and confirm they are defined for the intended time zone
  2. Nodes not being removed: Verify that no pods are blocking node draining, and check whether the nodes have the autoscaling.cast.ai/removal-disabled label
  3. Cluster not resuming properly: Check whether the Unscheduled Pod Policy is enabled in the Cast AI console
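When diagnosing these issues, it can be useful to inspect the CronJobs and to start a job manually instead of waiting for the next scheduled run. The sketch below assumes the CronJob names hibernate-pause and hibernate-resume. The function names, the Job name in the example, and the KUBECTL override are hypothetical. Note that starting the pause job manually removes nodes just as a scheduled run would.

```shell
# Show the CronJobs in the castai-agent namespace with their schedules
# and last scheduled time.
show_hibernate_cronjobs() {
  ${KUBECTL:-kubectl} get cronjobs -n castai-agent
}

# Start a one-off Job from a CronJob.
# $1 = CronJob name, $2 = name for the new Job.
run_hibernate_job_now() {
  ${KUBECTL:-kubectl} create job "$2" --from="cronjob/$1" -n castai-agent
}

# Print the logs of a Job ($1 = Job name).
hibernate_job_logs() {
  ${KUBECTL:-kubectl} logs "job/$1" -n castai-agent
}

# Example (names are assumptions, adjust to your installation):
#   show_hibernate_cronjobs
#   run_hibernate_job_now hibernate-resume manual-resume-1
#   hibernate_job_logs manual-resume-1
```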

For persistent issues, contact Cast AI support.

