Cluster hibernation

Cluster hibernation allows you to optimize costs by temporarily scaling your cluster to zero nodes while preserving the control plane and cluster state. This feature is ideal for non-production environments, development clusters, or any cluster that doesn't need to run 24/7.

How Hibernation Works

When you hibernate a cluster:

  1. The cluster enters a "Hibernating" state
  2. All nodes are systematically removed from the cluster
  3. The control plane remains active, but no workloads run
  4. Costs are reduced to the control plane charges from your cloud provider, plus any persistent volumes that remain

When you resume a hibernated cluster:

  1. Temporary nodes are created to run essential system components
  2. Cast AI's autoscaler is reactivated
  3. Workloads are scheduled according to their requirements
  4. The cluster returns to its normal operational state

Prerequisites

Before using cluster hibernation, ensure your environment meets these requirements:

Required Component Tolerations

Cast AI components must have the following toleration to operate properly during hibernation cycles. Add this to both castai-agent and castai-cluster-controller components:

tolerations:
- effect: NoSchedule
  key: provisioning.cast.ai/temporary
  operator: Equal
  value: "resuming"

You can set this in the values.yaml file for each component.
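To confirm the toleration is in place, one option is to inspect the pod spec of each component. The sketch below only searches the tolerations text for the key and value shown above. The helper name is hypothetical, and the usage comment assumes both components run as Deployments in a namespace called castai-agent, which may not match your installation.

```shell
# Succeeds if the tolerations text on stdin mentions both the hibernation
# toleration key and its "resuming" value. This is a plain text match, not a
# full YAML/JSON parse, so treat a pass as a hint rather than proof.
has_resuming_toleration() {
  local text
  text=$(cat)
  printf '%s\n' "$text" | grep -qF 'provisioning.cast.ai/temporary' &&
    printf '%s\n' "$text" | grep -qF 'resuming'
}

# Hypothetical usage; the namespace and workload kind are assumptions:
# for name in castai-agent castai-cluster-controller; do
#   kubectl get deployment "$name" -n castai-agent \
#     -o jsonpath='{.spec.template.spec.tolerations}' |
#     has_resuming_toleration || echo "$name: toleration missing"
# done
```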

Feature Flag Enablement

Cluster hibernation is disabled by default for all organizations. Contact Cast AI support to enable this feature for your account.

Cloud-Specific Requirements

AWS Requirements

IAM Permissions

Your Cast AI role requires specific permissions to manage nodes during hibernation and resumption. Add the following permission blocks to your existing IAM role:

  1. Role Passing Permission for EKS

This allows Cast AI to pass IAM roles to the EKS service when creating node groups.

{
  "Sid": "PassRoleEKS",
  "Action": "iam:PassRole",
  "Effect": "Allow",
  "Resource": "arn:aws:iam::*:role/*",
  "Condition": {
    "StringEquals": {
      "iam:PassedToService": "eks.amazonaws.com"
    }
  }
}

  2. Autoscaling Group Management

These permissions enable Cast AI to create and manage the autoscaling groups needed for temporary nodes during hibernation operations.

{
  "Sid": "AutoscalingActionsTagRestriction",
  "Effect": "Allow",
  "Action": [
    "autoscaling:UpdateAutoScalingGroup",
    "autoscaling:CreateAutoScalingGroup",
    "autoscaling:DeleteAutoScalingGroup",
    "autoscaling:SuspendProcesses",
    "autoscaling:ResumeProcesses",
    "autoscaling:TerminateInstanceInAutoScalingGroup"
  ],
  "Resource": "arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/*",
  "Condition": {
    "StringEquals": {
      "autoscaling:ResourceTag/kubernetes.io/cluster/<cluster-name>": [
        "owned",
        "shared"
      ]
    }
  }
}

  3. EKS Node Group Management

These permissions allow Cast AI to create and manage EKS node groups during cluster resumption.

{
  "Sid": "EKS",
  "Effect": "Allow",
  "Action": [
    "eks:Describe*",
    "eks:List*",
    "eks:TagResource",
    "eks:UntagResource",
    "eks:CreateNodegroup",
    "eks:DeleteNodegroup"
  ],
  "Resource": [
    "arn:aws:eks:*:*:cluster/<cluster-name>",
    "arn:aws:eks:*:*:nodegroup/<cluster-name>/*/*"
  ]
}
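
The statements above contain a <cluster-name> placeholder. If you keep them in a template file, a small helper can substitute the real name before you attach the policy. A minimal sketch, assuming a hypothetical template file name and a cluster name made up only of letters, digits, hyphens, and underscores:

```shell
# Replace every <cluster-name> placeholder read from stdin with the given name.
# Assumes the name has no characters that are special to sed (/, &, \).
fill_cluster_name() {
  sed "s/<cluster-name>/$1/g"
}

# Hypothetical usage; hibernation-policy.template.json is an assumed file name:
# fill_cluster_name my-cluster < hibernation-policy.template.json > hibernation-policy.json
```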

Node Group Configuration

For EKS clusters, you need to configure a node group ARN that will be used for temporary nodes during hibernation and resumption. This requires:

  • Creating an EKS node group in your AWS account
  • Retrieving its IAM role ARN
  • Updating your Cast AI node configuration with this ARN

  1. First, get the IAM ARN from your pre-created node group:

# Get ARN of your pre-created node group
IAM_ARN=$(aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name> --query 'nodegroup.nodeRole' --output text)

  2. Then, update your Cast AI node configuration by adding the nodeGroupArn field to the EKS configuration:

# Update Cast AI node configuration
curl --request POST \
  --url https://api.cast.ai/v1/kubernetes/clusters/<cluster_id>/node-configurations/<node_config_id> \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'X-API-Key: <your-api-key>' \
  --data '{
    "eks": {
      "instanceProfileArn": "<instance_profile_arn>",
      "nodeGroupArn": "'"$IAM_ARN"'"
    }
  }'

📘

Note

The nodeGroupArn is a required field for EKS clusters that use the hibernation feature. Other configuration fields can remain unchanged.
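
Because the request body is assembled from a shell variable, it can help to build the JSON in one place and sanity-check the ARN before sending it. A minimal sketch; the function names are hypothetical, and the ARN check is only a loose pattern match:

```shell
# Loose check that a value looks like an IAM role ARN (pattern match only).
is_role_arn() {
  case "$1" in
    arn:aws*:iam::*:role/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the request body with both ARNs filled in.
# Assumes the ARNs contain no double quotes or backslashes.
node_config_payload() {
  printf '{"eks":{"instanceProfileArn":"%s","nodeGroupArn":"%s"}}\n' "$1" "$2"
}

# Hypothetical usage with the IAM_ARN variable from the previous step:
# is_role_arn "$IAM_ARN" && node_config_payload "<instance_profile_arn>" "$IAM_ARN"
```

Checking the value first catches the case where the describe-nodegroup query returned nothing useful, so an empty or malformed ARN is not sent to the API.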

Hibernating a Cluster

You can hibernate a cluster using the Cast AI Hibernate API:

curl --request POST \
     --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id>/hibernate \
     --header 'X-API-Key: <your-api-key>' \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '{}'

Replace <cluster_id> in the path with the ID of your cluster.

The cluster will transition to a "Hibernating" state, and the following actions will occur:

  1. The autoscaler will be disabled to prevent new nodes from being created
  2. All nodes will be systematically removed from the cluster in stages
    • First, regular workload nodes are removed
    • Temporary nodes keep essential services running until all other nodes are removed
    • Finally, temporary nodes are removed as well
  3. The control plane will remain active, but no workloads will run

📘

Note

During hibernation, you will still incur charges for the cloud provider's control plane and any persistent volumes that remain.
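
Hibernation is not instantaneous, since nodes are removed in stages. If a script needs to wait for it to finish, a generic polling loop is one way to do so. The sketch below makes no assumption about how the status is fetched: the caller supplies a command that prints the current status. The status string in the usage comment is also an assumption that you should confirm against the API reference.

```shell
# Poll until the supplied command prints the expected status.
# Usage: wait_for_status EXPECTED MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 once the status matches, 1 if it never does within MAX_TRIES.
wait_for_status() {
  local expected="$1"
  local tries="$2"
  local delay="$3"
  local i=0
  local status
  shift 3
  while [ "$i" -lt "$tries" ]; do
    status=$("$@") || status=""
    if [ "$status" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage; get_cluster_status is a command you would write yourself,
# and "Hibernated" is an assumed status value:
# wait_for_status Hibernated 60 30 get_cluster_status
```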

Resuming a Hibernated Cluster

To resume a hibernated cluster, use the Cast AI Resume API:

curl --request POST \
     --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id>/resume \
     --header 'X-API-Key: <your-api-key>' \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
  "instanceType": "<node-instance-type>"
}'

You must specify an instanceType parameter that's sufficient to run the cluster's essential components. The instance type requirements depend on your specific workloads and cloud service provider.

The resumption process works as follows:

  1. Temporary nodes are created using the instance type you specify, together with the rest of the node template configuration if one is provided
  2. These nodes run essential system components, including Cast AI agents
  3. The autoscaler is reactivated and takes over normal cluster scaling operations
  4. The cluster returns to its "Ready" state
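
For scripted use, the resume call can be wrapped in a small function so that the instance type is the only moving part. A minimal sketch; the function names and the CASTAI_API_KEY environment variable are hypothetical, and only the endpoint and the instanceType field come from the request above:

```shell
# Print the resume request body for the given instance type.
# Assumes the instance type contains no double quotes or backslashes.
resume_payload() {
  printf '{"instanceType":"%s"}\n' "$1"
}

# Send the resume request. Usage: resume_cluster CLUSTER_ID INSTANCE_TYPE
# Reads the API key from CASTAI_API_KEY (an assumed variable name).
resume_cluster() {
  curl --fail --silent --show-error --request POST \
    --url "https://api.cast.ai/v1/kubernetes/external-clusters/$1/resume" \
    --header "X-API-Key: $CASTAI_API_KEY" \
    --header 'accept: application/json' \
    --header 'content-type: application/json' \
    --data "$(resume_payload "$2")"
}
```

The --fail flag makes curl exit with a non-zero status on an HTTP error response, so a calling script can react to a rejected request.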

Troubleshooting Resume Operations

Insufficient Resources

If the resumption process fails, it's typically because:

  1. The specified instanceType cannot be provisioned in your cloud environment
  2. The instance type doesn't have enough resources to run all cluster-critical components

In these scenarios, manual intervention is required. To restore the cluster to a running state, you have two options:

Option 1: Try resuming again with a different instance type

curl --request POST \
     --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id>/resume \
     --header 'X-API-Key: <your-api-key>' \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '{
       "instanceType": "<larger-instance-type>"
     }'

Option 2: Add nodes manually using your cloud provider

  1. Create nodes large enough to run the critical components
  2. Ensure the nodes are properly joined to the cluster

Once you successfully provision nodes with sufficient capacity to schedule all cluster-critical components (either through the resume API or manually), Cast AI will detect this and automatically complete the cluster resumption process.
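
Option 1 can be scripted by walking through a list of candidate instance types, smallest first. The sketch below is deliberately generic: it assumes you supply a command that returns success only when a resume attempt with the given type has worked, which in practice may involve sending the request and then checking the cluster state. Both that command and the instance types in the usage comment are hypothetical.

```shell
# Try each candidate instance type in order with the supplied command.
# Usage: first_working_type COMMAND TYPE [TYPE...]
# Prints the first type for which COMMAND succeeds; returns 1 if none does.
first_working_type() {
  local attempt="$1"
  local candidate
  shift
  for candidate in "$@"; do
    if "$attempt" "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Hypothetical usage; try_resume is a command you would write yourself:
# first_working_type try_resume m5.large m5.xlarge m5.2xlarge
```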

🚧

Important

Choosing an appropriate instance type is crucial for successful cluster resumption. Requirements vary by cluster type and size: some clusters can operate with smaller instances (2 vCPU, 8 GB RAM), while other environments might require larger instances (16 vCPU, 64 GB RAM). When in doubt, we recommend erring on the side of larger instances. Once the cluster is resumed, the Cast AI Autoscaler automatically replaces these with more cost-efficient nodes based on your actual workload requirements.

API Reference

For detailed API specifications, see the Cast AI API reference documentation.

Comparison with Legacy Pause Feature

📘

Note

Cast AI previously offered a different mechanism for pausing clusters. The new hibernation feature provides improved reliability and better cost optimization. For information on the legacy pause feature, see Pausing a cluster.

Key differences between hibernation and the legacy pause feature:

Feature | Cluster Hibernation | Legacy Pause
Implementation | API-driven with automated resumption | Uses Kubernetes CronJobs
Setup complexity | Simple API calls with one-time configuration | Requires manual creation and scheduling of CronJobs
Cloud service provider (CSP) support | All major cloud providers | Limited provider support