Scheduled rebalancing

How it works

Scheduled rebalancing runs full or partial rebalancing on a user-defined schedule, with a configurable scope and trigger, to automate various use cases. It is most commonly used to:

  • identify the most expensive Spot Instances and replace them with cheaper alternatives,
  • perform full rebalancing of clusters on weekends,
  • roll old nodes,
  • periodically target and replace nodes with specific labels.

A rebalancing schedule is an organization-wide object that can be assigned to one or multiple clusters. Once triggered, it creates a rebalancing plan scoped by the parameters provided in the node selection preferences, while problematic workloads are excluded automatically.

To set up scheduled rebalancing, navigate to the organizational-level Optimization menu or access the Rebalancer view within the cluster.

Setup

Scheduled rebalancing has two main components:

  • Rebalancing job - a job that triggers a rebalancing schedule on an associated cluster.
  • Rebalancing schedule - an organization-wide schedule that can be used by multiple rebalancing jobs.

The Rebalancing schedule consists of four parts:

  • The schedule describes when to run the rebalancing (periodically, within a maintenance window, etc.)
  • Node selection preferences contain rules for picking specific nodes, how many to target, and similar rules that mimic decision-making when manually rebalancing a cluster.
  • Trigger requirements are decision rules for executing the rebalancing plan. They include rules like the “Savings threshold,” determining whether the generated plan should be executed.
  • Execution safeguard: Cast AI stops rebalancing before draining the original nodes if it finds that the user-specified minimum level of savings can't be achieved. This additional protection layer is required as planned nodes might temporarily be unavailable in the cloud provider's inventory during the rebalancing.
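
Putting these parts together, a schedule created through the API (POST /v1/rebalancing-schedules, covered later in this guide) bundles the schedule, node selection preferences, and trigger requirements into a single payload. The sketch below is illustrative only: the nodeSelectorTerms/matchExpressions shape and the launchConfiguration.rebalancingOptions block appear elsewhere in this guide, while the remaining field names (schedule.cron, triggerConditions, nodeTtlSeconds, numTargetedNodes) are assumptions; consult the API reference for the exact schema.

```python
import json

# Sketch of a payload for POST /v1/rebalancing-schedules, combining the four
# parts above. Field names marked "(assumed)" are illustrative, not confirmed.
payload = {
    "schedule": {"cron": "0 2 * * 6"},               # run at 02:00 every Saturday
    "triggerConditions": {"savingsPercentage": 15},  # savings threshold (assumed)
    "launchConfiguration": {
        "nodeTtlSeconds": 3600,    # minimum node age (assumed)
        "numTargetedNodes": 10,    # maximum batch size (assumed)
        "selector": {
            "nodeSelectorTerms": [
                {"matchExpressions": [
                    {"key": "scheduling.cast.ai", "operator": "Exists"}
                ]}
            ]
        },
        "rebalancingOptions": {"aggressiveMode": False},
    },
}

body = json.dumps(payload, indent=2)
print(body)
# Send with any HTTP client, e.g.:
# requests.post("https://api.cast.ai/v1/rebalancing-schedules",
#               headers={"X-API-Key": "<token>"}, data=body)
```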

Settings

The following settings can be adjusted when setting up scheduled rebalancing:

  • Specify resource offering - Spot, On-demand, or Any.
  • Target using labels - Key-value pairs provided as nodeSelector terms. In the UI, when multiple terms are provided, they are combined with AND logic, i.e., only nodes that satisfy all listed selector terms will be targeted. For OR logic between label values, use the Cast AI API endpoint POST /v1/rebalancing-schedules: apply the In operator and specify multiple values for the same label key in the matchExpressions section. This API-only feature targets nodes matching any of the specified label values. Note that the UI may not display all values correctly when a schedule is created this way.
  • Minimum node age - The amount of time since node creation before a node can be considered for rebalancing. 0 means a node of any age can be considered.
  • Evict nodes gracefully - Defines whether nodes that fail to drain within the predefined 20-minute timeout are kept with a rebalancing.cast.ai/status=drain-failed annotation instead of being forcefully drained.
  • Maximum batch size - The maximum number of nodes that will be selected for rebalancing. 0 indicates that all nodes in the cluster can be selected.
  • Sort selected nodes - The algorithm used to sort the selected nodes: Highest normalized CPU price (most expensive nodes first, based on the price of normalized CPU, i.e., node cost / CPU count), Highest requested resource price (CPU + RAM) (most expensive nodes first, based on the requested resource price with CPU and RAM weighted in a 1:7 ratio, which helps optimize costs for memory-focused workloads and node templates), or Least utilized (least utilized nodes first, regardless of price).
  • Aggressive mode - Also rebalance problematic pods: pods without a controller, job pods, and pods with the removal-disabled annotation.
  • Savings threshold - Can be turned off to initiate a rebalance regardless of cost impact (e.g., when nodes need to be replaced due to upgrades). Two thresholds are available. Target savings is the minimum projected savings to be achieved: rebalancing will not be executed if the plan does not project savings that meet or exceed the specified value. Guaranteed minimum savings applies later: when capacity becomes unavailable between the time the plan was generated and the creation of nodes, the Rebalancer creates alternative nodes; if the new nodes' cost is higher, thus not generating the required minimum savings, the newly created nodes are deleted and the rebalancing is aborted.
  • Execution times - Adjusted by providing a timezone and a crontab expression.
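
The two savings checks described above can be expressed as simple decision rules. A minimal Python sketch (illustrative only, not Cast AI's actual implementation):

```python
def should_execute_plan(projected_savings: float, target_savings: float) -> bool:
    """Target savings check, applied when the plan is generated:
    skip execution if projected savings fall short of the target."""
    return projected_savings >= target_savings

def should_keep_new_nodes(actual_savings: float, guaranteed_minimum: float) -> bool:
    """Guaranteed minimum savings check, applied after alternative nodes are
    created (when the originally planned capacity became unavailable): if it
    fails, the new nodes are deleted and the rebalancing is aborted."""
    return actual_savings >= guaranteed_minimum

# A plan projecting $120/mo savings against a $100/mo target is executed...
print(should_execute_plan(120, 100))   # True
# ...but if fallback nodes realize only $40/mo against a $50/mo guaranteed
# minimum, they are rolled back.
print(should_keep_new_nodes(40, 50))   # False
```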

Node targeting

When configuring node targeting for scheduled rebalancing, you can choose between two methods: Labels or Fields. This selection determines how you'll identify nodes for the rebalancing operation.

Using Labels

To configure label-based targeting:

  1. Select "Labels" under "Target using"
  2. Configure your label-matching rules:
    • Key: Enter the Kubernetes label key (e.g., scheduling.cast.ai)
    • Operator: Choose from:
      • Exists: Match nodes where the label exists
      • DoesNotExist: Match nodes where the label is absent
      • In: Match nodes where the label value matches one in a list
      • NotIn: Match nodes where the label value doesn't match any in a list
    • Value: Enter the label value to match (required for In and NotIn operators)
  3. Add additional label rules using the "+ Add" button if needed

Multiple label rules are combined using AND logic - nodes must match all specified conditions to be selected for rebalancing.
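
To make the operator semantics concrete, here is a minimal Python sketch of how such label rules can be evaluated (illustrative only, not Cast AI's implementation). Note how the expressions combine with AND, while a multi-value In expression acts as an OR across its values:

```python
def matches(node_labels: dict, expressions: list) -> bool:
    """Evaluate matchExpressions-style label rules against a node's labels."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "Exists":
            ok = key in node_labels
        elif op == "DoesNotExist":
            ok = key not in node_labels
        elif op == "In":
            ok = node_labels.get(key) in values
        elif op == "NotIn":
            ok = node_labels.get(key) not in values
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False  # AND logic: every expression must match
    return True

rules = [
    {"key": "scheduling.cast.ai", "operator": "Exists"},
    {"key": "nodetemplate", "operator": "In", "values": ["customerA", "customerB"]},
]
print(matches({"scheduling.cast.ai": "true", "nodetemplate": "customerA"}, rules))  # True
print(matches({"nodetemplate": "customerA"}, rules))  # False: first rule fails
```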

Using Fields

Field-based targeting provides an alternative method when label-based selection isn't sufficient. When selecting Fields, you can target nodes based on their Kubernetes field values rather than labels.

Example

An example use case for field targeting is rebalancing cordoned (unschedulable) nodes. Here's how to configure this:

  1. Select "Fields" under "Target using"
  2. Configure the field selector:
    • Key: spec.unschedulable
    • Operator: In
    • Value: true

This configuration targets cordoned nodes in your cluster, allowing you to automatically rebalance workloads from these nodes to healthy ones.
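
Conceptually, a field selector resolves a dotted path on each node object and compares the resulting value. A minimal Python sketch of this idea (illustrative only; the node shape and the string comparison are assumptions):

```python
def field_value(node: dict, path: str):
    """Resolve a dotted field path such as "spec.unschedulable" on a node object."""
    value = node
    for part in path.split("."):
        value = value.get(part, {}) if isinstance(value, dict) else {}
    return value if value != {} else None

def is_targeted(node: dict) -> bool:
    # Key: spec.unschedulable, Operator: In, Value: "true". Field values are
    # compared as strings here, so a cordoned node's True becomes "true".
    return str(field_value(node, "spec.unschedulable")).lower() in {"true"}

cordoned = {"metadata": {"name": "node-a"}, "spec": {"unschedulable": True}}
healthy = {"metadata": {"name": "node-b"}, "spec": {}}
print(is_targeted(cordoned))  # True
print(is_targeted(healthy))   # False
```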

For more information on using field selectors, refer to Kubernetes documentation.

Using OR Conditions for Node Labels in Scheduled Rebalancing

While the console UI supports only AND logic for node labels, you can create OR conditions through the Cast AI API. Here's how:

  1. Use the appropriate Cast AI API endpoint for creating rebalancing schedules: POST /v1/rebalancing-schedules
  2. In your API request, focus on the matchExpressions section within the nodeSelectorTerms.
  3. Use the In operator and specify multiple values for the same label key to create an OR condition.

Example JSON payload snippet:

"matchExpressions": [
  {
    "key": "nodetemplate",
    "operator": "In",
    "values": [
      "customerA",
      "customerB",
      "customerC"
    ]
  }
]

This configuration targets nodes with the nodetemplate label matching any of the specified values.

🚧

Warning

When using this API method, the UI may not display all values correctly. You'll see just one label in the UI, and you'll need to assign the specific cluster to it.

For more details, refer to the Cast AI API reference documentation for creating rebalancing schedules.

Sort selected nodes

When configuring scheduled rebalancing, the Sort selected nodes setting determines how nodes are prioritized for rebalancing. Each algorithm serves different optimization goals.

Highest normalized CPU price

  • Sorts nodes by dividing the node's total cost by its CPU count, which prioritizes removing expensive nodes based on their provisioned (total available) CPUs.
  • Best for: Spot Instance clusters when you want to remove the most expensive instances or a partial rebalancing where you just want to remove the most expensive nodes

📘

Note

It doesn't consider actual resource utilization, only the raw cost per CPU.

Highest requested resource price (CPU + RAM)

  • Sorts nodes by calculating the cost of actually requested (used) resources, considering both CPU and memory in a 1:7 ratio. For example, a $5/hour node using 1 CPU and 7GB RAM has a resource price of $2.50/unit, while the same node using 4 CPU and 28GB RAM has a resource price of $0.63/unit.
  • Best for: Most scenarios, especially mixed workloads with varying memory and CPU needs, as it provides the best overall cost optimization by targeting nodes with the highest wasted resource cost.

Least utilized

  • Sorts nodes purely based on CPU utilization percentage, regardless of node size or cost. For example, a 2-core node with 1 core used (50% utilized) would be prioritized over a 64-core node with 20 cores used (31% utilized), even though the larger node has more wasted resources.
  • Best for: Specific use cases where you want to consolidate many small, underutilized nodes, regardless of cost impact

📘

Note

It may miss opportunities to remove larger, more expensive nodes that have higher percentage utilization but more wasted resources as a sum.

For most cost optimization scenarios, the Highest requested resource price (CPU + RAM) sorting algorithm provides the best balance for removing expensive and underutilized nodes.
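
The three sort keys reduce to simple arithmetic. The Python sketch below (illustrative only, not Cast AI's implementation) reproduces the example numbers from this section:

```python
def normalized_cpu_price(node_cost: float, cpu_count: int) -> float:
    """Highest normalized CPU price: node cost divided by provisioned CPUs."""
    return node_cost / cpu_count

def requested_resource_price(node_cost: float, used_cpu: float, used_ram_gb: float) -> float:
    """Highest requested resource price: cost per requested unit, where one
    unit is 1 CPU or 7 GB of RAM (the 1:7 ratio)."""
    units = used_cpu + used_ram_gb / 7
    return node_cost / units

def cpu_utilization(used_cpu: float, cpu_count: int) -> float:
    """Least utilized: plain CPU utilization percentage, ignoring cost."""
    return 100 * used_cpu / cpu_count

# A $5/hour node using 1 CPU + 7 GB RAM -> $2.50 per requested unit;
# the same node using 4 CPU + 28 GB RAM -> $0.625 (~$0.63) per unit.
print(requested_resource_price(5, 1, 7))   # 2.5
print(requested_resource_price(5, 4, 28))  # 0.625

# A 50%-utilized 2-core node sorts ahead of a 31%-utilized 64-core node.
print(cpu_utilization(1, 2))    # 50.0
print(cpu_utilization(20, 64))  # 31.25
```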

Aggressive Mode Configuration

When using aggressive mode in scheduled rebalancing, you can configure additional options to control how problematic workloads are handled.

Local Persistent Volumes Configuration

By default, even in aggressive mode, nodes with Local Persistent Volumes (LPVs) are marked as "Not ready" for rebalancing. This is because LPVs are tied to specific nodes through node affinity rules (typically using the kubernetes.io/hostname topology label) and cannot be migrated between nodes.

You can configure aggressive mode to ignore these local volume constraints.

Configuration via API

When creating or updating a rebalancing schedule through the API, include the aggressiveModeConfig in your request:

{
  "launchConfiguration": {
    "rebalancingOptions": {
      "aggressiveMode": true,
      "aggressiveModeConfig": {
        "ignoreLocalPersistentVolumes": true
      }
    }
  }
}

This configuration:

  • Requires aggressiveMode to be set to true
  • Allows rebalancing to include nodes with Local Persistent Volumes

🚧

Warning

Setting ignoreLocalPersistentVolumes to true will allow rebalancing to evict pods with local storage. This will result in data loss for applications using Local Persistent Volumes. Only use this option when data persistence is not critical.

For complete API documentation and the latest schema, refer to our Scheduled Rebalancing API Reference.

Configuration via Terraform

In Terraform, use the aggressive_mode_config block to configure Local Persistent Volumes handling:

resource "castai_rebalancing_schedule" "example" {
  # Required configuration elements...
  
  launch_configuration {
    # Enable aggressive mode
    aggressive_mode = true
    
    # Configure aggressive mode options
    aggressive_mode_config {
      ignore_local_persistent_volumes = true
    }
    
    # Other launch configuration settings...
  }
}

🚧

Warning

Setting ignore_local_persistent_volumes to true will allow rebalancing to evict pods with local storage. This will result in data loss for applications using Local Persistent Volumes. Only use this option when data persistence is not critical.

For more information about Terraform configuration options, refer to the Cast AI Terraform Provider documentation.