Scheduled rebalancing

How it works

Scheduled rebalancing runs a full or partial rebalancing process on a user-defined schedule, scope, and trigger to automate various use cases. Most commonly, scheduled rebalancing is used to:

  • identify the most expensive spot instances and replace them with cheaper alternatives,
  • perform full rebalancing of clusters on weekends,
  • roll old nodes,
  • periodically target and replace nodes with specific labels.

A rebalancing schedule is an organization-wide object that can be assigned to one or multiple clusters. Once triggered, it creates a rebalancing plan scoped by the parameters provided in the node selection preferences; problematic workloads are excluded automatically.

To set up scheduled rebalancing, navigate to the organization-level Optimization menu or open the Rebalancer view within the cluster.

Setup

Scheduled rebalancing has two main components:

  • Rebalancing job: a job that triggers a rebalancing schedule on an associated cluster.
  • Rebalancing schedule: an organization-wide schedule that can be used by multiple rebalancing jobs.

A rebalancing schedule consists of four parts:

  • Schedule: describes when to run the rebalancing (periodically, within the maintenance window, etc.).
  • Node selection preferences: rules for picking specific nodes, how many nodes to target, and similar rules that mimic the decisions made when manually rebalancing a cluster.
  • Trigger requirements: decision rules for executing the rebalancing plan. They include rules like the “Savings threshold,” which determines whether the generated plan should be executed.
  • Execution safeguard: CAST AI stops rebalancing before draining the original nodes if it finds that the user-specified minimum level of savings can't be achieved. This additional protection layer is required because planned nodes might be temporarily unavailable in the cloud provider's inventory during the rebalancing.
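The trigger requirement and the execution safeguard can be thought of as the same savings check applied at two different moments. The following Python sketch illustrates that idea only; it is not CAST AI's implementation. The `SavingsThreshold` class, its field names, both function names, and the choice to express savings as a percentage of the current cost are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SavingsThreshold:
    """Hypothetical model of the savings rules; names are illustrative only."""
    enabled: bool = True
    min_savings_percent: float = 0.0  # assumption: savings expressed in percent


def should_execute_plan(threshold: SavingsThreshold,
                        projected_savings_percent: float) -> bool:
    """Trigger requirement: checked once, when the plan has been generated."""
    if not threshold.enabled:
        # Threshold turned off: rebalance regardless of cost impact.
        return True
    return projected_savings_percent >= threshold.min_savings_percent


def should_continue_before_drain(threshold: SavingsThreshold,
                                 current_cost: float,
                                 new_nodes_cost: float) -> bool:
    """Execution safeguard: re-checked with the cost of the nodes actually created."""
    if not threshold.enabled:
        return True
    if current_cost <= 0:
        # No meaningful baseline to compare against, so do not proceed.
        return False
    achieved_percent = (current_cost - new_nodes_cost) / current_cost * 100
    return achieved_percent >= threshold.min_savings_percent
```

With a 15% threshold, a plan projecting 20% savings would be executed, but if the nodes actually created bring the cost from 100 to 90 (10% savings), the second check fails and the original nodes would not be drained.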

Settings

The following settings can be adjusted when setting up the scheduled rebalancing:

| Setting | Description |
| --- | --- |
| Specify resource offering | Spot, On-demand, or Any. |
| Target using labels | Key-value pairs provided as nodeSelector terms. In the UI, multiple terms are combined using AND logic, i.e., only nodes that satisfy all listed selector terms are targeted. For OR logic between label values, use the CAST AI API endpoint POST /v1/rebalancing-schedules: in the matchExpressions section of the request, use the In operator and specify multiple values for the same label key. This API-only feature targets nodes matching any of the specified label values. Note that the UI may not display all values correctly when this API method is used. |
| Minimum node age | Amount of time that must pass since node creation before the node can be considered for rebalancing. A value of 0 means a node of any age can be considered. |
| Evict nodes gracefully | Defines whether nodes that fail to drain within the predefined 20-minute timeout are kept with a rebalancing.cast.ai/status=drain-failed annotation instead of being forcefully drained. |
| Maximum batch size | Maximum number of nodes that will be selected for rebalancing. A value of 0 means all nodes in the cluster can be selected. |
| Sort selected nodes | The algorithm used to sort selected nodes. Highest normalized CPU price: sorts by the most expensive nodes based on the price of normalized CPU (node cost / CPU count). Highest requested resource price (CPU + RAM): sorts by the most expensive nodes based on the requested resource price, considering both CPU and RAM in a 1:7 ratio; this helps optimize costs for memory-focused workloads and node templates. Least utilized: sorts by the least utilized nodes first, regardless of price. |
| Aggressive mode | Rebalances problematic pods as well: pods without a controller, job pods, and pods with the removal-disabled annotation. |
| Savings threshold | The savings threshold can be turned off to initiate a rebalance regardless of cost impact (e.g., when nodes need to be replaced due to upgrades). Target savings: the minimum projected savings to be achieved; rebalancing is not executed if the plan does not project savings that meet or exceed the specified value. Guaranteed minimum savings: when capacity becomes unavailable between the time the plan is generated and the creation of nodes, the Rebalancer creates alternative nodes; if the new nodes cost more and therefore do not generate the required minimum savings, the newly created nodes are deleted and the rebalancing is aborted. |
| Execution times | Execution time is set by providing a timezone and a crontab expression. |
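To make the interaction of the node selection settings more concrete, here is a rough Python sketch that applies a minimum node age, sorts by highest normalized CPU price, and then caps the result at the maximum batch size. Only the formula (node cost / CPU count) and the meaning of 0 for age and batch size come from the settings above; the `Node` class, its fields, the `select_nodes` function, and the order in which the steps are applied are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Node:
    """Hypothetical node record; field names are illustrative only."""
    name: str
    hourly_cost: float
    cpu_count: int
    created_at: datetime


def select_nodes(nodes, min_age: timedelta, max_batch_size: int, now: datetime):
    """Filter by age, sort by normalized CPU price, then apply the batch limit."""
    # A minimum age of zero lets nodes of any age through.
    eligible = [n for n in nodes if now - n.created_at >= min_age]
    # Highest normalized CPU price first: node cost / CPU count.
    eligible.sort(key=lambda n: n.hourly_cost / n.cpu_count, reverse=True)
    # A batch size of 0 means every eligible node can be selected.
    if max_batch_size == 0:
        return eligible
    return eligible[:max_batch_size]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
nodes = [
    Node("a", hourly_cost=1.0, cpu_count=4, created_at=now - timedelta(days=10)),
    Node("b", hourly_cost=2.0, cpu_count=4, created_at=now - timedelta(hours=1)),
    Node("c", hourly_cost=3.0, cpu_count=8, created_at=now - timedelta(days=2)),
]
selected = select_nodes(nodes, min_age=timedelta(days=1), max_batch_size=1, now=now)
```

In this example node "b" has the highest normalized price but is too young, so with a one-day minimum age and a batch size of 1 only node "c" is selected.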

Using OR conditions for node labels in scheduled rebalancing

The console UI supports only AND logic for node labels. OR conditions are available only through the CAST AI API. Here's how:

  1. Use the appropriate CAST AI API endpoint for creating rebalancing schedules: POST /v1/rebalancing-schedules
  2. In your API request, focus on the matchExpressions section within the nodeSelectorTerms.
  3. Use the In operator and specify multiple values for the same label key to create an OR condition.

Example JSON payload snippet:

"matchExpressions": [
  {
    "key": "nodetemplate",
    "operator": "In",
    "values": [
      "customerA",
      "customerB",
      "customerC"
    ]
  }
]

This configuration targets nodes with the nodetemplate label matching any of the specified values.
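The semantics of such a selector can be checked locally before sending a request. The Python sketch below evaluates matchExpressions against a node's labels: a single In expression matches if the label has any of the listed values (OR), while several expressions must all match (AND). The helper names `matches_expression` and `matches_all` are hypothetical, only the In operator is handled, and the code does not call the CAST AI API.

```python
def matches_expression(labels: dict, expression: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    if expression["operator"] != "In":
        # Other operators are out of scope for this sketch.
        raise ValueError(f"unsupported operator: {expression['operator']}")
    # In: the label must exist and equal any one of the listed values (OR).
    return labels.get(expression["key"]) in expression["values"]


def matches_all(labels: dict, expressions: list) -> bool:
    """Several expressions are combined with AND: every one must match."""
    return all(matches_expression(labels, e) for e in expressions)


match_expressions = [
    {
        "key": "nodetemplate",
        "operator": "In",
        "values": ["customerA", "customerB", "customerC"],
    }
]

targeted = matches_all({"nodetemplate": "customerB"}, match_expressions)
skipped = matches_all({"nodetemplate": "customerD"}, match_expressions)
```

Here `targeted` is true because customerB is one of the listed values, while `skipped` is false; a node without the nodetemplate label would not match either.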

Warning: When using this API method, the UI may not display all values correctly. You'll see just one label in the UI, and you'll need to assign the specific cluster to it.

For more details, refer to the CAST AI API reference documentation for creating rebalancing schedules.