How it works
Rebalancing is a Cast AI feature that brings your cluster to its optimal, up-to-date state. During this process, suboptimal nodes are automatically replaced with new ones that are more cost-efficient and run the latest Node configuration settings.
Rebalancing takes all the workloads in your cluster and finds the most cost-efficient way to distribute them across the cheapest suitable nodes.
Rebalancing uses the same algorithms that drive the Cast AI Autoscaling engine to find optimal node configurations for your workloads. The only difference is that all workloads are run through them rather than just unschedulable pods.
Purpose
The rebalancing process has multiple purposes:
- Rebalance the cluster during the initial onboarding to immediately achieve cost savings. The rebalancer makes it easy to start using Cast AI by running your cluster through the Cast AI algorithms and reshaping it into an optimal state during onboarding.
- Remove fragmentation, which is a normal byproduct of everyday cluster execution. Autoscaling is a reactive process that aims to satisfy unschedulable pods, and as these reactive decisions accumulate, your cluster can become fragmented. Consider this example: you upscale a workload by one replica every hour, and each replica requests 6 CPUs. After a day, the cluster ends up with 24 new nodes with a capacity of 8 CPUs each, leaving 48 CPUs unused and fragmented. The rebalancer solves this by consolidating the workloads onto fewer, cheaper nodes, reducing waste.
- Replace specific nodes due to cost inefficiency or outdated Node configuration. During the rebalancing operation, targeted nodes are replaced with the most optimal set of nodes running the latest Node configuration settings.
Scope
You can rebalance the entire cluster or only a specific set of nodes.
- To rebalance the whole cluster, choose Cluster > Rebalance.
- To rebalance a subset of nodes, select them using Cluster > Node list and then choose Actions > Rebalance nodes.
After you define the operation's scope, generate a Rebalancing plan to review the planned changes and their effect on cluster composition and costs. Only nodes without problematic workloads are considered for rebalancing.
To reduce the number of problematic workloads and avoid service disruption, check the Preparation for rebalancing guide.
Execution
Rebalancing consists of three distinct phases:
- Create new optimal nodes.
- Drain old, suboptimal nodes.
- Delete old, suboptimal nodes. Nodes are deleted one by one as soon as they have been drained.
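If you want to follow these phases while a rebalancing run is in progress, one quick way is to watch node and pod state with kubectl. This is a minimal sketch using standard kubectl commands; <node-name> is a placeholder for a node listed in your rebalancing plan:

```shell
# Watch nodes during rebalancing: new nodes join first, old nodes switch to
# SchedulingDisabled while they are drained, then disappear once deleted.
kubectl get nodes --watch

# See which pods are still running on a node that is currently being drained
# (replace <node-name> with a node from the rebalancing plan).
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```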
Rebalancing with Workload Autoscaler Recommendations
Early Access
This feature is currently in early access. The user interface is still under development as we continue to refine this capability based on customer feedback.
Cast AI can incorporate Workload Autoscaler recommendations into the rebalancing process. This integration enables your rebalancing operations to consider the resource optimizations recommended by the Workload Autoscaler, creating a more comprehensive approach to cluster optimization.
How It Works
When rebalancing with Workload Autoscaler recommendations is enabled, the rebalancing process considers both traditional node optimization factors and workload-level resource requirements:
- The system retrieves current Workload Autoscaler recommendations for all workloads
- These recommendations are integrated into the rebalancing plan calculation
- New nodes are provisioned with capacity that matches both optimal node types and the recommended resource requirements
- Workloads are rescheduled accordingly
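As a rough way to see the effect of these recommendations, you can compare a workload's configured requests with its actual usage before and after a rebalancing run. This is a minimal sketch; <namespace>, <deployment>, and the app=<deployment> label selector are placeholders for your own workload, and kubectl top requires metrics-server:

```shell
# Current CPU/memory requests configured on the workload's containers.
kubectl get deployment <deployment> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[*].resources.requests}'

# Actual usage of the workload's pods (requires metrics-server).
kubectl top pods -n <namespace> -l app=<deployment>
```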
Considerations
Integrating Workload Autoscaler with rebalancing optimizes both nodes and workloads simultaneously. This unified approach often leads to greater cost savings, especially when workloads are recommended for downsizing. Your cluster can also proactively adjust resources before pending pods would trigger reactive autoscaling.
Keep in mind the following:
- Potential cost increases: If Workload Autoscaler recommends resource increases, rebalancing may increase costs while improving performance and stability
- Savings threshold: When using a savings threshold with scheduled rebalancing, operations that would increase costs due to workload optimization won't execute. See the scheduled rebalancing documentation.
- Version requirements: For optimal results, this feature works best with Workload Autoscaler version 0.31.0 or higher. See upgrading your Workload Autoscaler version. You can check which version is running in your cluster as shown below.
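A quick way to check the installed version is to read the Workload Autoscaler's container image tag. This is a minimal sketch that assumes a typical installation with a castai-workload-autoscaler deployment in the castai-agent namespace; adjust the names if your setup differs:

```shell
# Print the Workload Autoscaler image (the tag carries the version number).
kubectl get deployment castai-workload-autoscaler -n castai-agent \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```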
Configuration
Rebalancing with Workload Autoscaler recommendations is currently available as a feature flag option. To enable this functionality, please contact Cast AI support.
When enabled, you can verify the effectiveness of a rebalancing operation by monitoring your cluster's resource utilization trends afterward. You should observe visible changes in your resource metrics, typically a reduction in provisioned resources relative to your workload requirements, as shown in the example below:
In this example, the rebalancing with Workload Autoscaler recommendations was executed at the point indicated by the arrows, resulting in:
- A decrease in provisioned CPUs from ~180 to ~120 CPU cores
- A reduction in provisioned memory from ~600 to ~450 GiB
These changes should persist over time rather than triggering immediate autoscaling events, indicating that the new resource allocation matches the workload requirements set by the Workload Autoscaler recommendations.
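If you don't use an external monitoring stack, you can get a rough picture of provisioned versus requested capacity directly with kubectl. This is a minimal sketch; kubectl top nodes requires metrics-server:

```shell
# Per-node summary of how much CPU/memory pods request versus node capacity.
kubectl describe nodes | grep -A 8 "Allocated resources"

# Actual node-level usage (requires metrics-server).
kubectl top nodes
```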
Please contact Cast AI support for assistance with enabling or configuring this feature.
Rebalancing Timeout Behavior
During rebalancing, Cast AI enforces a timeout to ensure the process is completed in a timely manner. A hard 80-minute timeout applies to the entire rebalancing operation, including:
- Node creation
- Node draining
- Node deletion
If a node fails to drain within this non-configurable timeout period, it will be marked with the annotation rebalancing.cast.ai/status=drain-failed. The node will remain cordoned off but will not be deleted from the cluster.
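To find nodes that were marked this way, you can filter on that annotation. A minimal sketch, assuming jq is available on your machine:

```shell
# List nodes annotated as drain-failed by a previous rebalancing run.
kubectl get nodes -o json \
  | jq -r '.items[] | select(.metadata.annotations["rebalancing.cast.ai/status"] == "drain-failed") | .metadata.name'
```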
Handling Failed Node Drains
If a node becomes stuck in the draining state during rebalancing:
- Check for blockers:
  - Review Pod Disruption Budgets (PDBs) that might prevent pod eviction
  - Look for pods that cannot be rescheduled due to resource or other constraints
  - Check for pods with local storage or node affinity requirements
- Manual recovery steps (see the example commands after this list):
  - Remove the rebalancing.cast.ai/status=drain-failed annotation
  - Uncordon the node if you want it to remain active in the cluster
  - Or manually drain and delete the node if you still want to remove it
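The recovery steps above map to standard kubectl commands. A minimal sketch; <node-name> is a placeholder, and the drain flags may need adjusting for your workloads:

```shell
# Check for Pod Disruption Budgets that may be blocking eviction.
kubectl get pdb --all-namespaces

# Option 1: keep the node - clear the failure annotation and make it schedulable again.
kubectl annotate node <node-name> rebalancing.cast.ai/status-
kubectl uncordon <node-name>

# Option 2: still remove the node - drain it manually, then delete it.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```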