Pausing a cluster
Legacy Feature
This page describes a legacy method for pausing clusters using CronJobs. While this approach continues to be supported for existing implementations, we recommend using our improved Cluster Hibernation feature for new implementations.
Cast AI provides a set of CronJobs that pause and resume Kubernetes clusters on a defined schedule. When the pause job runs, Cast AI components continue to run on a single designated node while the rest of the cluster capacity is removed. When the cluster resumes, Cast AI uses its standard Autoscaler capabilities to provision the most cost-efficient nodes for pending pods.
How It Works
The pausing mechanism consists of two main CronJobs.
Hibernate-pause Job
- Disables the Unscheduled Pod Policy (to prevent the cluster from growing)
- Prepares a Hibernation node (a node that remains running to host essential components)
- Marks essential Deployments with the Hibernation toleration
- Deletes all other nodes (only the Hibernation node stays running)
Hibernate-resume Job
- Re-enables the Unscheduled Pod Policy so the cluster can scale back up to the required size
Installation
Install via kubectl
Run this command to install the Hibernate CronJobs:
kubectl apply -f https://raw.githubusercontent.com/castai/hibernate/main/deploy.yaml
Configure API Key
Create an API token with Full Access permissions and encode it in base64:
echo -n "your-cast-ai-api-key" | base64
Use this value to update the Secret:
apiVersion: v1
kind: Secret
metadata:
  name: castai-hibernate
  namespace: castai-agent
type: Opaque
data:
  API_KEY: >-
    CASTAI-API-KEY-REPLACE-ME-WITH-ABOVE==
Alternatively, use this one-liner for convenience:
kubectl get secret castai-hibernate -n castai-agent -o json | jq --arg API_KEY "$(echo -n your-cast-ai-api-key | base64)" '.data["API_KEY"]=$API_KEY' | kubectl apply -f -
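A common pitfall with this step is encoding a trailing newline: without the `-n` flag, `echo` appends `\n`, and the key stored in the Secret becomes invalid. A quick sanity check, assuming a Linux (coreutils) `base64`:

```shell
# With -n: only the key bytes are encoded (correct)
GOOD=$(echo -n "your-cast-ai-api-key" | base64)
# Without -n: a trailing newline is encoded too (broken key)
BAD=$(echo "your-cast-ai-api-key" | base64)

echo "$GOOD"   # eW91ci1jYXN0LWFpLWFwaS1rZXk=
echo "$BAD"    # eW91ci1jYXN0LWFpLWFwaS1rZXkK  (note the extra "K" from the newline)

# Round trip: decoding GOOD returns exactly the original key
[ "$(echo -n "$GOOD" | base64 -d)" = "your-cast-ai-api-key" ] && echo "ok"
```

If the two encoded values in your terminal differ like this, the shorter, padded one is the value to put into the Secret.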
Install with Helm
Add Cast AI helm charts repository:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
Install the hibernate component (update the cloud and apiKey variables):
helm upgrade -i castai-hibernate castai-helm/castai-hibernate -n castai-agent --set cloud=<AKS|EKS|GKE> --set apiKey=<CASTAI-API-KEY-REPLACE-ME-WITH-BASE64_ENCODE>
Configuration
Scheduling CronJobs
Update the hibernate-pause and hibernate-resume CronJob schedules according to your business needs. The schedules follow the Kubernetes CronJob syntax.
Default examples:
# hibernate-pause: pause at 22:00, Monday through Friday
pauseCronSchedule: "0 22 * * 1-5"
# hibernate-resume: resume at 07:00, Monday through Friday
resumeCronSchedule: "0 7 * * 1-5"
Beginning with Kubernetes v1.25, you can define a time zone for a CronJob by setting a valid time zone name in .spec.timeZone. For example, setting .spec.timeZone: "Etc/UTC" interprets the schedule according to Coordinated Universal Time (UTC). You can find valid names in the tz database time zones list.
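As a sketch, the relevant part of the pause CronJob spec would look like this (the resource name here is an assumption; confirm it with kubectl get cronjobs -n castai-agent):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: castai-hibernate-pause   # assumed name; verify in your cluster
  namespace: castai-agent
spec:
  schedule: "0 22 * * 1-5"
  timeZone: "Etc/UTC"            # requires Kubernetes v1.25 or later
```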
Advanced Configuration Options
The pause and resume jobs support several configuration options through environment variables:
Override default hibernate-node size
Set the HIBERNATE_NODE environment variable to override the default node sizing selections. Make sure the size selected is appropriate for your cloud provider.
Specify namespaces to preserve
Set the NAMESPACES_TO_KEEP environment variable to override the default preserved namespaces (defaults to "opa,istio"). The namespaces listed here will not be affected by the hibernation process.
Protect nodes marked as removal-disabled
Set PROTECT_EVICTION_DISABLED to "true" to prevent the removal of nodes that have the autoscaling.cast.ai/removal-disabled="true" label. This is useful for preserving nodes running critical or stateful workloads.
Set API region
If you need to use a different API URL (e.g., for the Europe region), you can provide the URL via the environment variable:
API_URL: "https://api.eu.cast.ai"
The default is https://api.cast.ai.
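Putting these options together, the container env of the hibernate jobs might look like the following sketch (the instance type and region URL are illustrative values, not defaults):

```yaml
env:
  - name: HIBERNATE_NODE
    value: "e2-standard-2"            # example GKE machine type; pick one valid for your cloud
  - name: NAMESPACES_TO_KEEP
    value: "opa,istio"                # the default preserved namespaces
  - name: PROTECT_EVICTION_DISABLED
    value: "true"                     # keep nodes labeled removal-disabled
  - name: API_URL
    value: "https://api.eu.cast.ai"   # only needed for the Europe region
```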
Advanced Use Cases
Selective Namespace Hibernation
If you need to pause and resume specific namespaces rather than the entire cluster, you can:
- Set the NAMESPACES_TO_KEEP environment variable to specify which namespaces should be preserved during hibernation
- All other namespaces will be affected by the hibernation process
Example configuration:
env:
  - name: NAMESPACES_TO_KEEP
    value: "kube-system,monitoring,logging,istio-system"
Time-sharing GPU Resources
For clusters with GPU resources that you want to time-share among different teams or applications:
- Configure the hibernate CronJobs to run on a schedule that matches your time-sharing requirements
- Set NAMESPACES_TO_KEEP to include the namespaces that should always have access to the resources
- Configure separate hibernate CronJobs for different time slots if needed
For more fine-grained control over scaling specific workloads (rather than the entire cluster), you can combine this approach with kube-downscaler. When using kube-downscaler:
- Configure it to scale specific deployments to zero pods during off-hours
- Enable Cast AI's Evictor in aggressive mode to ensure the cluster shrinks after pods are removed
- This approach provides more granular control than completely hibernating the cluster
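For example, kube-downscaler supports per-workload uptime annotations. A sketch, assuming a hypothetical Deployment named batch-worker that should only run during business hours:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker   # illustrative workload name
  annotations:
    # Outside this window, kube-downscaler scales the Deployment to zero replicas;
    # Cast AI's Evictor in aggressive mode then removes the now-idle nodes.
    downscaler/uptime: "Mon-Fri 07:00-22:00 Etc/UTC"
```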
Upgrading
To upgrade the hibernate component to the latest version, run:
helm repo update castai-helm
helm upgrade castai-hibernate castai-helm/castai-hibernate --reuse-values -n castai-agent
Troubleshooting
Check the logs
To view the logs of the hibernate jobs:
kubectl logs -l app=castai-hibernate-pause -n castai-agent
kubectl logs -l app=castai-hibernate-resume -n castai-agent
Common issues
- Jobs not running at scheduled time: Check the CronJob schedules and ensure they are properly defined according to your timezone
- Nodes not being removed: Verify that there are no pods preventing node draining, or check whether nodes have the autoscaling.cast.ai/removal-disabled label
- Cluster not resuming properly: Check if the Unscheduled Pod Policy is enabled in the Cast AI console
For persistent issues, contact Cast AI support.