Cluster hibernation
Cluster hibernation allows you to optimize costs by temporarily scaling your cluster to zero nodes while preserving the control plane and cluster state. This feature is ideal for non-production environments, development clusters, or any cluster that doesn't need to run 24/7.
How Hibernation Works
When you hibernate a cluster:
- The cluster enters a "Hibernating" state
- All nodes are systematically removed from the cluster
- The control plane remains active, but no workloads run
- Costs are minimized to just the control plane charges from your cloud provider
When you resume a hibernated cluster:
- Temporary nodes are created to run essential system components
- Cast AI's autoscaler is reactivated
- Workloads are scheduled according to their requirements
- The cluster returns to its normal operational state
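Both transitions are driven through the Cast AI API. As a hedged example, a cluster's current lifecycle state can be read back through the external-clusters API (the endpoint shape is assumed from the hibernate and resume URLs used in this guide; verify it against the API reference):

```shell
# Read the cluster object, which includes its current lifecycle state.
# Replace <cluster_id> and <your-api-key> with your own values.
curl -s \
  --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id> \
  --header 'X-API-Key: <your-api-key>' \
  --header 'accept: application/json'
```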
Prerequisites
Before using cluster hibernation, ensure your environment meets these requirements:
Required Component Tolerations
Cast AI components must have the following toleration to operate properly during hibernation cycles. Add this to both the `castai-agent` and `castai-cluster-controller` components:
```yaml
tolerations:
  - effect: NoSchedule
    key: provisioning.cast.ai/temporary
    operator: Equal
    value: "resuming"
```
You may configure this in the corresponding `values.yaml` file for each component.
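If the components were installed with Cast AI's public Helm charts, one way to roll out the toleration is to upgrade each release with its updated values file. This is a sketch: the release names, namespace, and values file names are assumptions about a default Helm-based install; adjust them to your setup.

```shell
# Assumed: components installed from Cast AI's public Helm chart repository.
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update

# Re-apply each component with the tolerations block added to its values file.
helm upgrade castai-agent castai-helm/castai-agent \
  --namespace castai-agent --reuse-values \
  -f agent-values.yaml        # contains the tolerations block above

helm upgrade cluster-controller castai-helm/castai-cluster-controller \
  --namespace castai-agent --reuse-values \
  -f controller-values.yaml   # contains the tolerations block above
```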
Feature Flag Enablement
Cluster hibernation is disabled by default for all organizations. Contact Cast AI support to enable this feature for your account.
Cloud-Specific Requirements
AWS Requirements
IAM Permissions
Your Cast AI role requires specific permissions to manage nodes during hibernation and resumption. Add the following permission blocks to your existing IAM role:
- Role Passing Permission for EKS
This allows Cast AI to pass IAM roles to the EKS service when creating node groups.
```json
{
  "Sid": "PassRoleEKS",
  "Action": "iam:PassRole",
  "Effect": "Allow",
  "Resource": "arn:aws:iam::*:role/*",
  "Condition": {
    "StringEquals": {
      "iam:PassedToService": "eks.amazonaws.com"
    }
  }
}
```
- Autoscaling Group Management
These permissions enable Cast AI to create and manage the autoscaling groups needed for temporary nodes during hibernation operations.
```json
{
  "Sid": "AutoscalingActionsTagRestriction",
  "Effect": "Allow",
  "Action": [
    "autoscaling:UpdateAutoScalingGroup",
    "autoscaling:CreateAutoScalingGroup",
    "autoscaling:DeleteAutoScalingGroup",
    "autoscaling:SuspendProcesses",
    "autoscaling:ResumeProcesses",
    "autoscaling:TerminateInstanceInAutoScalingGroup"
  ],
  "Resource": "arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/*",
  "Condition": {
    "StringEquals": {
      "autoscaling:ResourceTag/kubernetes.io/cluster/<cluster-name>": [
        "owned",
        "shared"
      ]
    }
  }
}
```
- EKS Node Group Management
These permissions allow Cast AI to create and manage EKS node groups during cluster resumption.
```json
{
  "Sid": "EKS",
  "Effect": "Allow",
  "Action": [
    "eks:Describe*",
    "eks:List*",
    "eks:TagResource",
    "eks:UntagResource",
    "eks:CreateNodegroup",
    "eks:DeleteNodegroup"
  ],
  "Resource": [
    "arn:aws:eks:*:*:cluster/<cluster-name>",
    "arn:aws:eks:*:*:nodegroup/<cluster-name>/*/*"
  ]
}
```
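The three statement blocks above can be combined into a single inline policy and attached with the AWS CLI. A sketch, assuming you attach it as an inline policy: the role name and policy name are examples, and `<cluster-name>` must be replaced before attaching.

```shell
# Write the combined policy document (placeholders left as-is; replace
# <cluster-name> before attaching).
cat > castai-hibernation-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PassRoleEKS",
      "Action": "iam:PassRole",
      "Effect": "Allow",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringEquals": { "iam:PassedToService": "eks.amazonaws.com" }
      }
    },
    {
      "Sid": "AutoscalingActionsTagRestriction",
      "Effect": "Allow",
      "Action": [
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:CreateAutoScalingGroup",
        "autoscaling:DeleteAutoScalingGroup",
        "autoscaling:SuspendProcesses",
        "autoscaling:ResumeProcesses",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/*",
      "Condition": {
        "StringEquals": {
          "autoscaling:ResourceTag/kubernetes.io/cluster/<cluster-name>": ["owned", "shared"]
        }
      }
    },
    {
      "Sid": "EKS",
      "Effect": "Allow",
      "Action": [
        "eks:Describe*",
        "eks:List*",
        "eks:TagResource",
        "eks:UntagResource",
        "eks:CreateNodegroup",
        "eks:DeleteNodegroup"
      ],
      "Resource": [
        "arn:aws:eks:*:*:cluster/<cluster-name>",
        "arn:aws:eks:*:*:nodegroup/<cluster-name>/*/*"
      ]
    }
  ]
}
EOF

# Sanity-check the document locally before attaching it.
python3 -c 'import json; json.load(open("castai-hibernation-policy.json"))' \
  && echo "policy document is valid JSON"

# Then attach it to the role Cast AI uses (requires AWS credentials;
# role and policy names below are examples):
# aws iam put-role-policy \
#   --role-name <castai-role-name> \
#   --policy-name CastaiHibernation \
#   --policy-document file://castai-hibernation-policy.json
```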
Node Group Configuration
For EKS clusters, you need to configure a node group ARN that will be used for temporary nodes during hibernation and resumption. This requires:
- Creating an EKS node group in your AWS account
- Retrieving its IAM role ARN
- Updating your Cast AI node configuration with this ARN
- First, get the IAM ARN from your pre-created node group:
```shell
# Get the IAM role ARN of your pre-created node group
IAM_ARN=$(aws eks describe-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --query 'nodegroup.nodeRole' \
  --output text)
```
- Then, update your Cast AI node configuration by adding the `nodeGroupArn` field to the EKS configuration:
```shell
# Update the Cast AI node configuration
curl --request POST \
  --url https://api.cast.ai/v1/kubernetes/clusters/<cluster_id>/node-configurations/<node_config_id> \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'X-API-Key: <your-api-key>' \
  --data '{
    "eks": {
      "instanceProfileArn": "<instance_profile_arn>",
      "nodeGroupArn": "'"$IAM_ARN"'"
    }
  }'
```
Note
The `nodeGroupArn` field is required for EKS clusters that use the hibernation feature. Other configuration fields can remain unchanged.
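To confirm the change took effect, the configuration can be read back with a GET request to the same endpoint. A hedged sketch; the response is assumed to echo the `nodeGroupArn` field:

```shell
# Read back the node configuration and show the nodeGroupArn value, if set.
curl -s \
  --url https://api.cast.ai/v1/kubernetes/clusters/<cluster_id>/node-configurations/<node_config_id> \
  --header 'accept: application/json' \
  --header 'X-API-Key: <your-api-key>' |
  grep -o '"nodeGroupArn":[^,}]*'
```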
Hibernating a Cluster
You can hibernate a cluster using the Cast AI Hibernate API:
```shell
curl --request POST \
  --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id>/hibernate \
  --header 'X-API-Key: <your-api-key>' \
  --header 'accept: application/json' \
  --data '{}'
```
Replace `<cluster_id>` in the path with the ID of your cluster.
The cluster will transition to a "Hibernating" state, and the following actions will occur:
- The autoscaler will be disabled to prevent new nodes from being created
- All nodes will be systematically removed from the cluster in stages:
  - First, regular workload nodes are removed
  - Temporary nodes keep essential services running until all other nodes are removed
  - Finally, the temporary nodes are removed as well
- The control plane will remain active, but no workloads will run
Note
During hibernation, you will still incur charges for the cloud provider's control plane and any persistent volumes that remain.
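After calling the hibernate endpoint, you can poll the cluster until the transition completes. A minimal sketch, assuming the external-clusters GET endpoint mirrors the hibernate/resume URLs above and returns a JSON `status` field; verify both assumptions, and the exact state names, against the API reference:

```shell
API_KEY="<your-api-key>"
CLUSTER_ID="<cluster_id>"

# Extract the "status" field from a JSON response on stdin.
parse_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Fetch the cluster object and report its current state.
cluster_status() {
  curl -s \
    --url "https://api.cast.ai/v1/kubernetes/external-clusters/${CLUSTER_ID}" \
    --header "X-API-Key: ${API_KEY}" \
    --header 'accept: application/json' | parse_status
}

# Poll until the cluster reaches the target state.
wait_for_status() {
  until [ "$(cluster_status)" = "$1" ]; do
    sleep 15
  done
}

# Example: wait_for_status "hibernated"   # target state name is an assumption
```

The same loop works for resumption by waiting for the "Ready" state instead.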
Resuming a Hibernated Cluster
To resume a hibernated cluster, use the Cast AI Resume API:
```shell
curl --request POST \
  --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id>/resume \
  --header 'X-API-Key: <your-api-key>' \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --data '{
    "instanceType": "<node-instance-type>"
  }'
```
You must specify an `instanceType` parameter that's sufficient to run the cluster's essential components. The instance type requirements depend on your specific workloads and cloud service provider.
The resumption process works as follows:
- Temporary nodes are created using your specified instance type, along with the rest of the node template configuration, if provided
- These nodes run essential system components, including Cast AI agents
- The autoscaler is reactivated and takes over normal cluster scaling operations
- The cluster returns to its "Ready" state
Troubleshooting Resume Operations
Insufficient Resources
If the resumption process fails, it's typically because:
- The specified `instanceType` cannot be provisioned in your cloud environment
- The instance type doesn't have enough resources to run all cluster-critical components
In these scenarios, manual intervention is required. You have two options to restore the cluster to a running state:
Option 1: Try resuming again with a different instance type
```shell
curl --request POST \
  --url https://api.cast.ai/v1/kubernetes/external-clusters/<cluster_id>/resume \
  --header 'X-API-Key: <your-api-key>' \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --data '{
    "instanceType": "<larger-instance-type>"
  }'
```
Option 2: Add nodes manually using your cloud provider
- Create nodes large enough to run the critical components
- Ensure the nodes are properly joined to the cluster
Once you successfully provision nodes with sufficient capacity to schedule all cluster-critical components (either through the resume API or manually), Cast AI will detect this and automatically complete the cluster resumption process.
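When a resume stalls, inspecting unschedulable pods usually shows which cluster-critical component still lacks capacity. For example, with kubectl:

```shell
# List pods that cannot be scheduled anywhere.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Inspect the scheduler's reason for a specific pod (look for
# "Insufficient cpu" / "Insufficient memory" events):
# kubectl describe pod <pod-name> -n <namespace>
```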
Important
Choosing an appropriate instance type is crucial for successful cluster resumption. Requirements vary by cluster type and size - while some clusters may operate with smaller instances (2 vCPU, 8GB RAM), other environments might require larger instances (16 vCPU, 64GB RAM). When in doubt, we recommend erring on the side of larger instances, as once the cluster is resumed, the Cast AI Autoscaler will automatically replace these with more cost-efficient nodes based on your actual workload requirements.
API Reference
For detailed API specifications, see our API reference documentation.
Comparison with Legacy Pause Feature
Note
Cast AI previously offered a different mechanism for pausing clusters. The new hibernation feature provides improved reliability and better cost optimization. For information on the legacy pause feature, see Pausing a cluster.
Key differences between hibernation and the legacy pause feature:
| Feature | Cluster Hibernation | Legacy Pause |
|---|---|---|
| Implementation | API-driven with automated resumption | Uses Kubernetes CronJobs |
| Setup complexity | Simple API calls with one-time configuration | Requires manual creation and scheduling of CronJobs |
| Cloud service provider (CSP) support | All major cloud providers | Limited provider support |