Continuous rebalancing
> **Early Access Feature:** This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
Continuous rebalancing is a Kentroller feature that optimizes your cluster on a recurring cycle without manual intervention. Rather than waiting for a scheduled trigger or a user-initiated rebalancing plan, Kentroller continuously monitors your cluster and replaces inefficient nodes with cheaper alternatives whenever it identifies a worthwhile opportunity.
This is distinct from scheduled rebalancing, which fires on a cron expression you define. Continuous rebalancing runs on its own polling interval and reacts to the cluster's current state at each cycle.
> **Note:** Continuous rebalancing operates on Karpenter-managed nodes only. Nodes not managed by Karpenter are excluded. See rebalancing scope for details.
How it works
On each cycle, Kentroller collects the current state of your cluster — nodes, pods, NodePools, NodeClaims, and PodDisruptionBudgets — and evaluates which nodes are candidates for replacement or removal. Only nodes that have been running for at least the configured minimum age are considered, which prevents churn on recently provisioned nodes.
After evaluating the cluster, Kentroller generates a RebalancePlan only if projected savings meet both configured thresholds (percentage and absolute monthly amount). The plan is executed automatically — no human approval required. Only one plan can be active at a time; if a plan is already running, the cycle waits until it completes.
If a cycle fails to produce a useful plan, Kentroller backs off automatically to avoid generating noise. Repeated failures result in longer backoff intervals before the next attempt.
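The backoff described above can be sketched as standard exponential backoff. This is an illustrative model, not Kentroller's actual implementation: only the 30-minute base is documented, while the doubling multiplier and the cap used here are assumptions.

```python
from datetime import timedelta

def next_backoff(base: timedelta, consecutive_failures: int,
                 cap: timedelta = timedelta(hours=8)) -> timedelta:
    """Exponential backoff: double the base delay for each consecutive failure."""
    delay = base * (2 ** max(consecutive_failures - 1, 0))
    return min(delay, cap)

# With the default 30m base, three consecutive failures wait 30m, 1h, then 2h.
print(next_backoff(timedelta(minutes=30), 3))  # 2:00:00
```

A successful cycle resets the failure count, so the next failure starts again from the base duration.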
Modes
Continuous rebalancing supports three modes that control what Kentroller is allowed to do each cycle. Modes form a hierarchy — more aggressive modes include the behaviors of simpler ones.
delete-empty
Removes nodes that have no running workloads. This is the most conservative mode and carries no risk of workload disruption. Kentroller deletes the NodeClaim for any empty node whose NodePool has consolidation enabled and that Karpenter has marked as consolidatable.
drain-only
Bin-packs pods from underutilized nodes onto other nodes with available capacity. Once a node is emptied, it is deleted. Replacement with new nodes does not occur — only existing capacity is used. If a pod cannot be rescheduled on existing nodes, the cycle produces no plan for that run.
This mode includes delete-empty behavior.
full
Replaces underutilized or overpriced nodes with new nodes running cheaper instance types. Kentroller creates replacement NodeClaims, drains the old nodes, and deletes them. This is the most aggressive mode and can achieve the greatest cost savings.
This mode includes both drain-only and delete-empty behaviors.
Enable Continuous Rebalancing
How you configure Continuous Rebalancing depends on which chart you installed.
Using the umbrella chart (kent mode)
Most Karpenter installations use the castai-umbrella chart with the kent profile, which bundles Kentroller as a subchart. Pass Kentroller values under the castai-kentroller key:
```yaml
castai-kentroller:
  castai:
    continuousRebalancing:
      enabled: true
      mode: full
```

Apply the change:

```shell
helm upgrade castai castai/castai-umbrella \
  --reuse-values \
  -f values.yaml
```

Using the standalone Kentroller chart
If you installed castai-kentroller directly, set values at the top level:
```yaml
castai:
  continuousRebalancing:
    enabled: true
    mode: full
```

Apply the change:

```shell
helm upgrade castai-kentroller castai/castai-kentroller \
  --reuse-values \
  -f values.yaml
```

Configuration reference
Helm values
The following values are available under castai.continuousRebalancing (or castai-kentroller.castai.continuousRebalancing when using the umbrella chart).
| Value | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable or disable Continuous Rebalancing. |
| `mode` | `full` | Rebalancing mode: `delete-empty`, `drain-only`, or `full`. |
| `cycleIntervalSeconds` | `60` | How often (in seconds) the rebalancing cycle runs. Lower values increase responsiveness but also churn. |
| `savingsThresholdPercentage` | `15` | Minimum projected savings percentage required before executing a plan. Applies to `full` mode only. |
| `minNodeAgeSeconds` | `300` | Minimum node age in seconds before a node is considered a candidate. Prevents rebalancing freshly provisioned nodes. |
| `evictionConfig` | `[]` | Ordered list of selector + settings pairs controlling pod and node eviction behavior. See Eviction config. |
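Taken together, a values file that sets every documented option explicitly might look like the following (standalone-chart layout; everything except `enabled` is shown at its documented default):

```yaml
castai:
  continuousRebalancing:
    enabled: true
    mode: full
    cycleIntervalSeconds: 60
    savingsThresholdPercentage: 15
    minNodeAgeSeconds: 300
    evictionConfig: []
```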
Advanced environment variables
For fine-grained control, you can set additional environment variables on the Kentroller deployment. These are typically configured at installation time.
| Variable | Default | Description |
|---|---|---|
| `CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER` | `1` | Minimum number of eligible nodes required to run a cycle. |
| `CONTINUOUS_REBALANCING_MAX_NODES_PER_ITERATION` | `100` | Maximum number of candidate nodes evaluated per cycle. |
| `CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY` | `50.0` | Minimum absolute monthly savings in USD required before executing a plan. |
| `CONTINUOUS_REBALANCING_FAILURE_BACKOFF_DURATION` | `30m` | Base backoff duration after a failed cycle. |
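No Helm key for these variables is documented here, so how you inject them depends on your chart version; check the chart's values for an `env`- or `extraEnv`-style hook. As a sketch, a container env fragment on the Kentroller deployment might look like this (the values shown are arbitrary examples, not recommendations):

```yaml
env:
  - name: CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER
    value: "3"
  - name: CONTINUOUS_REBALANCING_FAILURE_BACKOFF_DURATION
    value: "15m"
```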
Node exclusions
Protecting individual nodes
To prevent a specific node from being selected by Continuous Rebalancing, apply the autoscaling.cast.ai/removal-disabled label:
```shell
kubectl label node <node-name> autoscaling.cast.ai/removal-disabled=true
```

Remove the label to make the node eligible again:

```shell
kubectl label node <node-name> autoscaling.cast.ai/removal-disabled-
```

NodePool-level exclusions
Continuous rebalancing respects Karpenter's consolidation policies at the NodePool level:
- Static NodePools (fixed replica count) — nodes from these pools are never selected for rebalancing.
- `WhenEmpty` consolidation policy — only empty nodes are selected from NodePools with `consolidationPolicy: WhenEmpty`. Nodes with running workloads in such pools are excluded from `drain-only` and `full` modes.
```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```

NodePool disruption budgets
Continuous rebalancing respects the disruption.budgets configuration on your NodePools. A rebalancing plan can include as many nodes as needed to achieve cost savings, but the actual deletions are gated by each NodePool's budget at execution time.
During the deletion phase, Kentroller checks how many nodes in a given NodePool are already being disrupted (draining or NotReady) and compares that against the budget limit. If the budget is exhausted, those deletions are deferred. Kentroller re-checks every 30 seconds — as draining nodes complete and free up budget slots, the remaining deletions proceed automatically.
This means a single plan can safely replace a large number of nodes while still honoring the disruption rate limits you've configured in Karpenter. You don't need to size your rebalancing plans conservatively to avoid overwhelming a NodePool.
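For example, a NodePool budget that lets at most 10% of the pool's nodes be disrupted at once uses Karpenter's standard `disruption.budgets` syntax (fragment only; the rest of the NodePool spec is omitted):

```yaml
spec:
  disruption:
    budgets:
      - nodes: "10%"  # at most 10% of this pool's nodes disrupted concurrently
```

With this budget, a plan that replaces 30 nodes in a 100-node pool proceeds roughly 10 nodes at a time, with Kentroller's 30-second re-check releasing the next batch as draining nodes complete.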
Eviction config
Eviction config gives you fine-grained control over how Kentroller handles individual pods and nodes during rebalancing. It is an ordered list of selector + settings pairs — each entry matches pods, nodes, or both and applies an eviction policy to them.
By default, no eviction config is set and all pods are subject to standard eviction rules: PodDisruptionBudgets are respected, bare pods block node drain, and so on.
Behaviors
Each selector entry can enable one or more of the following settings:
- `removalDisabled` — Prevents Kentroller from evicting or draining matching pods or nodes. Use this for critical workloads or dedicated nodes that must never be disrupted.
- `aggressive` — Allows evicting pods that would ordinarily block node drain: single-replica pods and pods protected by a PodDisruptionBudget. Local persistent volumes are still respected.
- `disposable` — Removes all eviction guards. Matching pods can be evicted unconditionally regardless of PDBs, replica count, or local storage. Suitable for batch jobs and ephemeral workloads.
The strictness order is `removalDisabled` > `disposable` > `aggressive`. This order applies both when multiple behaviors are enabled on a single entry and when multiple entries match the same pod or node — the most restrictive matching setting always wins.
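As a sketch, resolving that precedence amounts to scanning the behaviors of all matching entries in strictness order. The names and structure here are illustrative, not Kentroller's internals:

```python
# Strictness order per the rule above: most restrictive first.
PRECEDENCE = ["removalDisabled", "disposable", "aggressive"]

def effective_behavior(matched_behaviors):
    """Return the behavior that wins when multiple entries match a pod or node."""
    for behavior in PRECEDENCE:
        if behavior in matched_behaviors:
            return behavior
    return None  # no eviction-config entry matched; standard eviction rules apply

# A pod matched by both an aggressive entry and a removalDisabled entry
# is never evicted.
print(effective_behavior(["aggressive", "removalDisabled"]))  # removalDisabled
```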
Selectors
Each entry targets pods, nodes, or both:
podSelector — Matches pods by one or more criteria:
| Field | Description |
|---|---|
| `namespace` | Namespace to match. Empty matches all namespaces. |
| `kind` | Owner kind to match, e.g. Deployment, StatefulSet, Job. Empty matches any owner. |
| `replicasMin` | Minimum desired replica count of the owning controller. 0 disables the filter. Does not apply to DaemonSet or Job pods. |
| `labelSelector` | Standard Kubernetes label selector (`matchLabels` / `matchExpressions`). |
nodeSelector — Matches nodes by:
| Field | Description |
|---|---|
| `labelSelector` | Standard Kubernetes label selector. |
Each entry must use either podSelector or nodeSelector, not both.
Configuration
Add the evictionConfig list under castai.continuousRebalancing. The following example prevents Kentroller from touching GPU nodes, and treats batch Jobs as fully disposable:
```yaml
castai:
  continuousRebalancing:
    enabled: true
    mode: full
    evictionConfig:
      - nodeSelector:
          labelSelector:
            matchLabels:
              dedicated: gpu
        settings:
          removalDisabled:
            enabled: true
      - podSelector:
          kind: Job
          namespace: batch
        settings:
          disposable:
            enabled: true
```

Use `matchExpressions` for more flexible matching. The following example aggressively evicts pods belonging to a Deployment with at least 3 replicas, as long as they are not labeled `environment: production`:
```yaml
evictionConfig:
  - podSelector:
      kind: Deployment
      replicasMin: 3
      labelSelector:
        matchExpressions:
          - key: environment
            operator: NotIn
            values:
              - production
    settings:
      aggressive:
        enabled: true
```

Apply the change:

```shell
helm upgrade castai-kentroller castai/castai-kentroller \
  --reuse-values \
  -f values.yaml
```

For the umbrella chart, nest the values under `castai-kentroller.castai.continuousRebalancing.evictionConfig`.
Savings thresholds
A rebalancing plan is only executed if projected savings meet both of the following thresholds:
- Percentage threshold — projected savings must be at or above `savingsThresholdPercentage` (default: 15%)
- Absolute monthly threshold — projected savings in USD must be at or above `CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY` (default: $50/month)
If either threshold is not met, no plan is created and the cycle waits for the next polling interval.
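The gate above is a simple AND of the two checks. A worked sketch (function name and argument layout are illustrative, defaults match the documented thresholds):

```python
def plan_executes(projected_monthly_savings: float,
                  current_monthly_cost: float,
                  pct_threshold: float = 15.0,
                  abs_threshold: float = 50.0) -> bool:
    """Both the percentage and absolute thresholds must be met."""
    pct = projected_monthly_savings / current_monthly_cost * 100
    return pct >= pct_threshold and projected_monthly_savings >= abs_threshold

# $120/month savings on a $1,000/month cluster is only 12% — below the 15%
# default, so no plan runs even though $120 clears the $50 floor.
print(plan_executes(120.0, 1000.0))  # False
print(plan_executes(200.0, 1000.0))  # True
```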
Troubleshooting
Continuous rebalancing is enabled but no plans are created
- Verify the feature is enabled and the mode is set. For the umbrella chart:

  ```shell
  helm get values castai | grep -A5 continuousRebalancing
  ```

  For the standalone chart:

  ```shell
  helm get values castai-kentroller | grep -A5 continuousRebalancing
  ```

- Check that enough candidate nodes exist. If fewer nodes are available than `CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER`, the cycle exits early without creating a plan.

- Verify the savings thresholds are reachable. If the cluster is already well-optimized, plans may be discarded because projected savings fall below `CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY`.

- Check Kentroller logs for cycle output:

  ```shell
  kubectl logs -n castai-agent -l app=castai-kentroller | grep "continuous-rebalancing"
  ```
A node is not being selected despite being underutilized
Check whether the node has the `autoscaling.cast.ai/removal-disabled=true` label:

```shell
kubectl get node <node-name> --show-labels
```

Also verify the node's NodePool does not use a `WhenEmpty` consolidation policy and that the node is older than the configured minimum node age (`minNodeAgeSeconds`).
The node may also be excluded due to problematic pods running on it. Common causes include:
- A PodDisruptionBudget that allows zero disruptions, blocking eviction of any pod on the node
- Pods without a controller (bare pods) that cannot be rescheduled
- Pods with local persistent volumes
To check for restrictive PDBs:
```shell
kubectl get pdb -A
```

Look for any PDB where `ALLOWED DISRUPTIONS` is 0 and whose selector matches pods on the node in question.
Continuous rebalancing is backing off
If the controller has encountered repeated plan failures, it applies exponential backoff. Check the logs for backoff messages:
```shell
kubectl logs -n castai-agent -l app=castai-kentroller | grep -i "backoff\|consecutive"
```

Backoff resets after a successful cycle.
Related resources
- The in-cluster controller that runs Continuous Rebalancing and coordinates with Karpenter.
- Cron-based rebalancing using Kubernetes-native CRDs.
- All optimization features available for Karpenter-managed clusters.
- How Cast AI extends Karpenter with optimization capabilities.