Kentroller
> **Early Access Feature:** This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
Kentroller is the in-cluster control-plane component that coordinates Cast AI's node rebalancing and automation features with Karpenter. It runs inside your cluster as a Kubernetes controller, watches Cast AI custom resources, and drives optimization workflows by interacting directly with Karpenter's NodeClaims and NodePools.
What Kentroller does
Kentroller is responsible for:
- Generating and reconciling rebalancing plans using CRDs in the `autoscaling.cast.ai/v1alpha` API group
- Handling claim- and schedule-driven rebalancing flows
- Running continuous in-cluster rebalancing to find cost-saving node replacements
- Replacing spot instances proactively based on interruption predictions
- Coordinating node consolidation with Karpenter NodeClaim deletion
How it works
Component interactions
Kentroller interacts with the following systems:
| System | Purpose |
|---|---|
| Kubernetes API server | Watches and updates Nodes, Pods, NodeClaims, NodePools, and Cast AI CRDs |
| Karpenter | Reads NodeClaims and NodePools; delegates node and instance termination to Karpenter |
| Cast AI API | Plan coordination, configuration, and audit logging |
| Cast AI ML API | Spot interruption prediction model queries |
| AWS EC2 and Pricing APIs | Instance inventory and cost-aware rebalancing decisions |
Kentroller maintains a persistent connection to the Cast AI backend for real-time plan coordination. This connection is automatically re-established if it drops.
CRDs managed by Kentroller
Kentroller defines and reconciles the following Custom Resource Definitions (CRDs) in the autoscaling.cast.ai/v1alpha API group:
| CRD | Short name | Scope | Description |
|---|---|---|---|
| RebalancePlanSchedule | rps | Cluster | Defines when rebalancing should run on a cron schedule |
| RebalancePlanClaim | rpc | Cluster | Represents a single rebalancing execution request |
| RebalancePlan | rp | Cluster | Holds the concrete plan: which NodeClaims to add and remove, and which pods to migrate |
| RebalanceMigrationPlan | — | Cluster | Coordinates container live migrations as part of a rebalancing plan |
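If the controller is installed, you can confirm the CRDs are present using the short names from the table above:

```shell
# List the cluster-scoped rebalancing resources by their short names
kubectl get rps,rpc,rp
# RebalanceMigrationPlan has no short name, so use the full resource name
kubectl get rebalancemigrationplans
```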
Feature gating
All Kentroller features can be enabled and disabled dynamically through the Cast AI console without restarting the controller.
Rebalancing workflows
Kentroller supports three ways to initiate rebalancing:
Schedule-driven rebalancing
When a RebalancePlanSchedule fires, Kentroller creates a RebalancePlanClaim from the schedule template and submits it to Cast AI for plan generation. For details, see Scheduled rebalancing for Karpenter clusters.
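A minimal manifest sketch of this flow. The API group, kind, and the cron-based scheduling concept come from this page; the spec field name below is an assumption, not the published schema, so verify with `kubectl explain rebalanceplanschedule.spec` before applying anything:

```yaml
# Illustrative sketch only: the spec field name is an assumption.
apiVersion: autoscaling.cast.ai/v1alpha
kind: RebalancePlanSchedule
metadata:
  name: nightly-rebalance
spec:
  schedule: "0 2 * * *"   # assumed cron field: run at 02:00 daily
```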
Claim-driven rebalancing
A RebalancePlanClaim can also be created manually without a schedule. This triggers a single on-demand rebalancing operation with the configuration specified in the claim's spec.
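A hedged sketch of a manually created claim. Only the API group, kind, and the settings this page references elsewhere (`minSavingsPercentage` under Troubleshooting, `executionPolicy.achievedSavingsPercentageThreshold` under Savings validation) are taken from the document; their exact placement in the spec is an assumption:

```yaml
# Illustrative sketch: field placement under spec is an assumption; verify
# with `kubectl explain rebalanceplanclaim.spec` in your cluster.
apiVersion: autoscaling.cast.ai/v1alpha
kind: RebalancePlanClaim
metadata:
  name: on-demand-rebalance
spec:
  minSavingsPercentage: 5
  executionPolicy:
    achievedSavingsPercentageThreshold: 5
```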
Continuous in-cluster rebalancing
The continuous rebalancing controller runs periodically inside the cluster and optimizes nodes without calling the Cast AI backend for plan generation. It works entirely from local cluster state and AWS pricing data:
- Resource collection — Kentroller reads all Nodes, NodeClaims, NodePools, Pods, DaemonSets, PodDisruptionBudgets, and EC2NodeClasses from the cluster.
- Node analysis — Nodes are classified as candidates for replacement based on age, workload compatibility, and NodePool constraints. Nodes with blocking workloads (such as local PVs or pods without a controller) are excluded unless aggressive mode is configured.
- Optimization search — Kentroller evaluates candidate node sets to find the largest subset of nodes that can be replaced for a net cost saving.
- Savings validation — A plan is only executed if it meets both a savings percentage threshold (default: 0%) and an absolute monthly savings threshold (default: $50/month).
- Plan creation — A `RebalancePlan` is created directly in the cluster without a `RebalancePlanClaim`. Kentroller then executes the plan by provisioning replacement NodeClaims and deleting the originals.
The cycle repeats on a configurable polling period (default: 10 seconds). Only one active plan can exist at a time — if a plan is already running, the cycle skips until it completes.
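The savings-validation step above reduces to a simple predicate: both the percentage threshold and the absolute monthly threshold must pass before a plan executes. The following Python sketch is illustrative only, not Kentroller's actual implementation; the function name and signature are invented for this example:

```python
def plan_meets_thresholds(current_monthly: float, projected_monthly: float,
                          min_percent: float = 0.0,
                          min_monthly_usd: float = 50.0) -> bool:
    """Return True only if a plan clears BOTH savings thresholds.

    Defaults mirror the documented values: 0% and $50/month.
    """
    if current_monthly <= 0:
        return False
    savings = current_monthly - projected_monthly
    savings_percent = savings / current_monthly * 100
    # Both conditions must hold: percentage AND absolute monthly savings.
    return savings_percent >= min_percent and savings >= min_monthly_usd

# $1,000/month of candidate nodes, replacements projected at $930/month:
print(plan_meets_thresholds(1000.0, 930.0))  # $70 saved, 7% -> True
print(plan_meets_thresholds(1000.0, 970.0))  # $30 saved, below $50 -> False
```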
Spot interruption prediction
The spot interruption prediction service polls the Cast AI ML API at a configurable interval (default: every 1 minute) to identify spot nodes at risk of interruption. When a node is predicted to be interrupted:
- Kentroller labels the node with `autoscaling.cast.ai/predicted-interruption=true`
- A `RebalancePlanClaim` is created with the at-risk node in its scope, triggering proactive replacement
- A `SpotInterruptionPredicted` event is emitted on the node
- After replacement, a cooldown period (default: 15 minutes) prevents the same node from being replaced again immediately
This gives workloads significantly more lead time before the actual interruption compared to AWS's standard two-minute warning.
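You can see which nodes are currently flagged by querying the label from the steps above:

```shell
# List spot nodes the prediction service has marked as at risk
kubectl get nodes -l autoscaling.cast.ai/predicted-interruption=true
```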
Rebalance execution
Plan lifecycle
Once a RebalancePlanClaim is created, it progresses through the following states:
| State | Description |
|---|---|
| Pending | Claim created, not yet submitted to Cast AI |
| Generating | Cast AI is computing the rebalancing plan |
| Ready | Plan generated, waiting for execution to begin |
| Executing | Rebalancing is actively running (nodes being replaced) |
| Completed | Rebalancing completed successfully |
| Failed | Rebalancing failed; see status.errorMessage for details |
Execution phases
When a RebalancePlan runs, Kentroller proceeds in two phases:
Creation phase — New NodeClaims are provisioned through Karpenter. If a provisioning attempt fails due to insufficient capacity, Kentroller retries with the next available instance type from the NodePool's requirements. Each attempt is recorded in status.nodesCreation:
| Field | Description |
|---|---|
| instanceType | The instance type attempted |
| nodeClaimName | NodeClaim name, including -attempt-N suffix for retries |
| status | InProgress, Success, or Failed |
| description | Details, such as InsufficientCapacityError |
Deletion phase — Old NodeClaims are drained and deleted. If container live migration is available, Kentroller creates a RebalanceMigrationPlan to move workloads without restarts before deleting the node. Deletion events are recorded in status.nodesDeletion:
| Status | Description |
|---|---|
LiveMigrationCreated | A RebalanceMigrationPlan was created for this node |
LiveMigrationInProgress | Live migration is running |
LiveMigrationSucceeded | Live migration completed; node is safe to delete |
LiveMigrationFailed | Live migration failed; falls back to standard eviction |
InProgress | NodeClaim deletion in progress |
Success | NodeClaim deleted successfully |
Failed | NodeClaim deletion failed |
Savings validation
If executionPolicy.achievedSavingsPercentageThreshold is set on the claim, Kentroller validates actual savings after new nodes are provisioned but before old nodes are deleted. If the realized savings do not meet the threshold, the plan is aborted with FailureReasonInsufficientSavings.
Validated savings are stored in status.validatedSavings on the RebalancePlan. If a fallback instance type was used due to insufficient capacity, the originally planned price is used as a conservative estimate.
Failure recovery
When a plan fails, Kentroller records which phase caused the failure in status.failurePhase to ensure correct cleanup after a controller restart:
| Phase | Meaning | Cleanup action |
|---|---|---|
Creation | Failed while provisioning new NodeClaims | Roll back by deleting newly created NodeClaims |
Deletion | Failed while draining or deleting old nodes | Uncordon new nodes so Karpenter can manage them normally |
Savings reporting
After plan generation, Cast AI populates status.savings on the RebalancePlan:
| Field | Description |
|---|---|
projectedSavingsPercent | Percentage savings relative to current node spend |
projectedSavingsCostMonthly | Estimated monthly savings |
currentMonthlyCost / projectedMonthlyCost | Before and after monthly cost for the rebalanced nodes |
currentClusterMonthlyCost / projectedClusterMonthlyCost | Before and after monthly cost for the entire cluster |
blueNodes / greenNodes | Per-node cost details for nodes being removed and added |
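The headline figures can be read directly from a plan's status using the field names in the table above (`<plan-name>` is a placeholder):

```shell
# Print projected savings percentage and monthly savings for a plan
kubectl get rebalanceplan <plan-name> \
  -o jsonpath='{.status.savings.projectedSavingsPercent} {.status.savings.projectedSavingsCostMonthly}'
```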
Node consolidation
The node consolidation controller watches Node objects and deletes the corresponding Karpenter NodeClaim when consolidation conditions are met. Deleting the NodeClaim delegates final node and cloud instance termination to Karpenter.
Consolidation is triggered through three paths:
| Path | Trigger condition |
|---|---|
| Expiration | NodeClaim.spec.expireAfter has elapsed |
| Evictor handoff | Node has evictor.cast.ai/eviction-status=done, autoscaling.cast.ai/draining=evictor, and spec.unschedulable=true |
| Empty node | Node is empty, NodePool has consolidation enabled, and the node has Consolidatable=True condition set by Karpenter |
Node consolidation is feature-gated and can be enabled or disabled dynamically without restarting Kentroller.
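For the evictor handoff path, the labels from the table above can be queried directly (note that `spec.unschedulable` is a node field, not a label, so it is not part of the selector):

```shell
# List nodes that the evictor has finished draining and handed off
kubectl get nodes -l evictor.cast.ai/eviction-status=done,autoscaling.cast.ai/draining=evictor
```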
Configuration
Required environment variables
| Variable | Description |
|---|---|
| CLUSTER_ID | Cast AI cluster identifier |
| API_KEY | Cast AI API key for authentication |
| API_URL | Cast AI REST API endpoint |
| GRPC_URL | Cast AI gRPC endpoint for plan coordination |
Optional environment variables
| Variable | Default | Description |
|---|---|---|
| GRPC_DISABLE_TLS | false | Disables TLS for gRPC connections |
| FEATURES_WATCH_NAMESPACES | castai-agent | Namespaces to watch for feature state ConfigMaps |
| DYNAMIC_CONFIG_MAP_NAME | — | ConfigMap name for dynamic configuration |
| NODE_CONSOLIDATION_MAX_CONCURRENT_RECONCILES | 10 | Parallel node consolidation operations |
| REBALANCER_MAX_CONCURRENT_RECONCILES | 10 | Parallel rebalancing plan operations |
Continuous rebalancing variables
| Variable | Default | Description |
|---|---|---|
| CONTINUOUS_REBALANCING_MIN_NODE_AGE | 5m | Minimum node age before a node is considered a rebalancing candidate |
| CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER | 1 | Minimum number of candidate nodes required to run a cycle |
| CONTINUOUS_REBALANCING_MAX_NODES_PER_ITERATION | 100 | Maximum candidates evaluated per run |
| CONTINUOUS_REBALANCING_MAX_BINARY_SEARCH_DEPTH | 20 | Maximum search depth |
| CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_PERCENT | 0.0 | Minimum savings percentage relative to replaced nodes |
| CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY | 50.0 | Minimum absolute monthly savings in USD |
| CONTINUOUS_REBALANCING_FAILURE_BACKOFF_DURATION | 30m | Backoff after a failed cycle |
| CONTINUOUS_REBALANCING_POLLING_PERIOD | 10s | How often the continuous rebalancing cycle runs |
Spot interruption prediction variables
| Variable | Default | Description |
|---|---|---|
| SPOT_INTERRUPTION_PREDICTION_MODEL_NAME | — | ML model name to use for predictions |
| SPOT_INTERRUPTION_PREDICTION_POLL_INTERVAL | 1m | How often to poll the Cast AI ML API |
| SPOT_INTERRUPTION_PREDICTION_SPOT_REPLACEMENT_COOLDOWN | 15m | Cooldown period before the same node can be replaced again |
| SPOT_INTERRUPTION_PREDICTION_INTERRUPTION_THRESHOLD | — | Probability threshold (0–1) above which a node is considered at risk |
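For reference, a container env fragment showing how these variables might be set. How they are injected (Helm values versus raw manifests) depends on your installation; only the variable names come from the tables above, and the threshold value is an example:

```yaml
# Illustration of the variable names only; check your installation method
# (typically a Helm chart) for where these values actually belong.
env:
  - name: SPOT_INTERRUPTION_PREDICTION_POLL_INTERVAL
    value: "1m"
  - name: SPOT_INTERRUPTION_PREDICTION_SPOT_REPLACEMENT_COOLDOWN
    value: "15m"
  - name: SPOT_INTERRUPTION_PREDICTION_INTERRUPTION_THRESHOLD
    value: "0.8"   # example value; defined above as a 0-1 probability
```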
View resources
List RebalancePlans
To list all rebalancing plans in the cluster:
```shell
kubectl get rebalanceplans
# or using the short name:
kubectl get rp
```

Example output:

```
NAME                           BEFORE    AFTER     SAVINGS%   STATE   AUTO-EXECUTED   AGE
nightly-rebalance-1735776000   1234.56   1111.11   10.0       Done    true            2h
```

List RebalancePlanClaims
To list all rebalancing plan claims:
```shell
kubectl get rebalanceplanclaims
# or using the short name:
kubectl get rpc
```

Example output:

```
NAME                                STATE       PLANID   NODES   SAVINGS%   REBALANCE   AGE
nightly-spot-rebalance-1735776000   Completed   abc123   5       12.4       rp-xyz      2h
```

Inspect a RebalancePlan
To view the full details of a specific plan:
```shell
kubectl describe rebalanceplan <plan-name>
```

The status section shows:

- `state` — current execution state: `Pending`, `Running`, `Done`, `DoneWithWarnings`, or `Failed`
- `savings` — cost details for current and projected node configurations
- `nodesCreation` — per-node creation attempt history
- `nodesDeletion` — per-node deletion event history
- `failurePhase` — if failed, which phase caused the failure (`Creation` or `Deletion`)
Troubleshooting
RebalancePlanClaim stuck in Pending or Generating
- Check the claim status:

  ```shell
  kubectl describe rpc <claim-name>
  ```

  Review `status.conditions` and `status.errorMessage`.

- Verify the Cast AI agent is connected and communicating with Cast AI.

- Check if `minSavingsPercentage` is set too high — if projected savings don't meet the threshold, the claim will fail without executing.
RebalancePlan stuck in Running
- Check for failed node creation attempts:

  ```shell
  kubectl describe rp <plan-name>
  ```

  Look at `status.nodesCreation` for `InsufficientCapacityError` on all instance types. If all attempts are exhausted, the plan transitions to `Failed` with `FailureReasonNodeCreateFailed`.

- Check `status.nodesDeletion` for stuck live migrations. If a `RebalanceMigrationPlan` is stuck, inspect it directly:

  ```shell
  kubectl get rebalancemigrationplans
  kubectl describe rebalancemigrationplan <name>
  ```
Continuous rebalancing not running
- Verify the continuous rebalancing feature is enabled in Cast AI.

- Check that enough candidate nodes exist — `CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER` must be met.

- Verify the savings thresholds are reachable. If the cluster is already well-optimized, plans may be generated but discarded because projected savings fall below `CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY`.

- Check Kentroller logs:

  ```shell
  kubectl logs -n castai-agent -l app=castai-kentroller | grep "continuous-rebalancing"
  ```
Node consolidation not triggering
- Verify the node consolidation feature is enabled in Cast AI.

- For the empty node path, check that the NodePool has consolidation enabled and that Karpenter is setting the `Consolidatable=True` condition on empty nodes.

- Check Kentroller logs:

  ```shell
  kubectl logs -n castai-agent -l app=castai-kentroller | grep "empty-node-deleter"
  ```
Spot interruption replacements not happening
- Verify the spot interruption prediction feature is enabled in Cast AI.

- Check that `SPOT_INTERRUPTION_PREDICTION_MODEL_NAME` is set and that Kentroller has connectivity to the Cast AI ML API.

- Look for `SpotInterruptionPredicted` events on spot nodes:

  ```shell
  kubectl get events --field-selector reason=SpotInterruptionPredicted
  ```
Related resources
- Set up cron-based rebalancing using Kubernetes-native CRDs.
- How Cast AI extends Karpenter with optimization capabilities.
- Available optimization features for Karpenter-managed clusters.
- All Cast AI in-cluster components and their roles.