Kentroller

> 📣 **Early Access Feature**
>
> This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.

Kentroller is the in-cluster control-plane component that coordinates Cast AI's node rebalancing and automation features with Karpenter. It runs inside your cluster as a Kubernetes controller, watches Cast AI custom resources, and drives optimization workflows by interacting directly with Karpenter's NodeClaims and NodePools.

What Kentroller does

Kentroller is responsible for:

  • Generating and reconciling rebalancing plans using CRDs in the autoscaling.cast.ai/v1alpha API group
  • Handling claim- and schedule-driven rebalancing flows
  • Running continuous in-cluster rebalancing to find cost-saving node replacements
  • Replacing spot instances proactively based on interruption predictions
  • Coordinating node consolidation with Karpenter NodeClaim deletion

How it works

Component interactions

Kentroller interacts with the following systems:

| System | Purpose |
| --- | --- |
| Kubernetes API server | Watches and updates Nodes, Pods, NodeClaims, NodePools, and Cast AI CRDs |
| Karpenter | Reads NodeClaims and NodePools; delegates node and instance termination |
| Cast AI API | Plan coordination, configuration, and audit logging |
| Cast AI ML API | Spot interruption prediction model queries |
| AWS EC2 and Pricing APIs | Instance inventory and cost-aware rebalancing decisions |

Kentroller maintains a persistent connection to the Cast AI backend for real-time plan coordination. This connection is automatically re-established if it drops.

CRDs managed by Kentroller

Kentroller defines and reconciles the following Custom Resource Definitions (CRDs) in the autoscaling.cast.ai/v1alpha API group:

| CRD | Short name | Scope | Description |
| --- | --- | --- | --- |
| RebalancePlanSchedule | rps | Cluster | Defines when rebalancing should run on a cron schedule |
| RebalancePlanClaim | rpc | Cluster | Represents a single rebalancing execution request |
| RebalancePlan | rp | Cluster | Holds the concrete plan: which NodeClaims to add and remove, and which pods to migrate |
| RebalanceMigrationPlan | — | Cluster | Coordinates container live migrations as part of a rebalancing plan |

Feature gating

All Kentroller features can be enabled and disabled dynamically through the Cast AI console, without restarting the controller.

Rebalancing workflows

Kentroller supports three ways to initiate rebalancing:

Schedule-driven rebalancing

When a RebalancePlanSchedule fires, Kentroller creates a RebalancePlanClaim from the schedule template and submits it to Cast AI for plan generation. For details, see Scheduled rebalancing for Karpenter clusters.
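A schedule resource might look like the following sketch. The exact spec layout is not documented here, so the `schedule` and `template` field names below are illustrative assumptions, not confirmed API fields; check the installed CRD schema (for example with `kubectl explain rebalanceplanschedule.spec`) for the authoritative shape:

```yaml
# Illustrative sketch only: field names under spec are assumptions.
apiVersion: autoscaling.cast.ai/v1alpha
kind: RebalancePlanSchedule
metadata:
  name: nightly-spot-rebalance
spec:
  schedule: "0 2 * * *"   # assumed cron field: fire at 02:00 daily
  template:               # assumed template copied into each generated RebalancePlanClaim
    spec: {}
```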

Claim-driven rebalancing

A RebalancePlanClaim can also be created manually without a schedule. This triggers a single on-demand rebalancing operation with the configuration specified in the claim's spec.
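An on-demand run could be started by applying a claim directly. Again a hedged sketch: everything under spec apart from executionPolicy.achievedSavingsPercentageThreshold (documented under Savings validation) is an assumption:

```yaml
# Illustrative sketch only: spec layout is an assumption.
apiVersion: autoscaling.cast.ai/v1alpha
kind: RebalancePlanClaim
metadata:
  name: manual-rebalance
spec:
  executionPolicy:
    achievedSavingsPercentageThreshold: 5  # abort if realized savings fall below 5%
```

Apply it with kubectl apply -f and watch progress with kubectl get rpc.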

Continuous in-cluster rebalancing

The continuous rebalancing controller runs periodically inside the cluster and optimizes nodes without calling the Cast AI backend for plan generation. It works entirely from local cluster state and AWS pricing data:

  1. Resource collection — Kentroller reads all Nodes, NodeClaims, NodePools, Pods, DaemonSets, PodDisruptionBudgets, and EC2NodeClasses from the cluster.
  2. Node analysis — Nodes are classified as candidates for replacement based on age, workload compatibility, and NodePool constraints. Nodes with blocking workloads (such as local PVs or pods without a controller) are excluded unless aggressive mode is configured.
  3. Optimization search — Kentroller evaluates candidate node sets to find the largest subset of nodes that can be replaced for a net cost saving.
  4. Savings validation — A plan is only executed if it meets both a savings percentage threshold (default: 0%) and an absolute monthly savings threshold (default: $50/month).
  5. Plan creation — A RebalancePlan is created directly in the cluster without a RebalancePlanClaim. Kentroller then executes the plan by provisioning replacement NodeClaims and deleting the originals.

The cycle repeats on a configurable polling period (default: 10 seconds). Only one active plan can exist at a time — if a plan is already running, the cycle skips until it completes.
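The savings gate in step 4 can be sketched as a small predicate; a minimal illustration of the documented rule (both thresholds must be met), with the function name and signature being my own:

```python
def plan_meets_thresholds(current_monthly_cost: float,
                          projected_monthly_cost: float,
                          min_savings_percent: float = 0.0,
                          min_savings_monthly: float = 50.0) -> bool:
    """Gate from step 4: a continuous-rebalancing plan executes only if it
    clears BOTH the percentage threshold and the absolute monthly threshold."""
    if current_monthly_cost <= 0:
        return False
    savings = current_monthly_cost - projected_monthly_cost
    savings_percent = savings / current_monthly_cost * 100.0
    return savings_percent >= min_savings_percent and savings >= min_savings_monthly
```

With the defaults, the absolute $50/month threshold dominates: any plan saving money passes the 0% check, but small savings are still discarded.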

Spot interruption prediction

The spot interruption prediction service polls the Cast AI ML API at a configurable interval (default: every 1 minute) to identify spot nodes at risk of interruption. When a node is predicted to be interrupted:

  1. Kentroller labels the node with autoscaling.cast.ai/predicted-interruption=true
  2. A RebalancePlanClaim is created with the at-risk node in its scope, triggering proactive replacement
  3. A SpotInterruptionPredicted event is emitted on the node
  4. After replacement, a cooldown period (default: 15 minutes) prevents the same node from being replaced again immediately

This gives workloads significantly more lead time before the actual interruption compared to AWS's standard two-minute warning.

Rebalance execution

Plan lifecycle

Once a RebalancePlanClaim is created, it progresses through the following states:

| State | Description |
| --- | --- |
| Pending | Claim created, not yet submitted to Cast AI |
| Generating | Cast AI is computing the rebalancing plan |
| Ready | Plan generated, waiting for execution to begin |
| Executing | Rebalancing is actively running (nodes being replaced) |
| Completed | Rebalancing completed successfully |
| Failed | Rebalancing failed; see status.errorMessage for details |

Execution phases

When a RebalancePlan runs, Kentroller proceeds in two phases:

Creation phase — New NodeClaims are provisioned through Karpenter. If a provisioning attempt fails due to insufficient capacity, Kentroller retries with the next available instance type from the NodePool's requirements. Each attempt is recorded in status.nodesCreation:

| Field | Description |
| --- | --- |
| instanceType | The instance type attempted |
| nodeClaimName | NodeClaim name, including -attempt-N suffix for retries |
| status | InProgress, Success, or Failed |
| description | Details, such as InsufficientCapacityError |
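The creation-phase retry loop can be pictured as walking the candidate instance types in order and recording every attempt, much like status.nodesCreation does. This is a simplified stand-in (the real controller provisions through Karpenter NodeClaims; the names here are illustrative):

```python
def provision_with_fallback(instance_types, try_provision):
    """Attempt each candidate instance type in order, recording every attempt
    the way status.nodesCreation does, and stop at the first success."""
    attempts = []
    for n, itype in enumerate(instance_types, start=1):
        succeeded = try_provision(itype)  # stand-in for a NodeClaim create
        attempts.append({
            "instanceType": itype,
            "nodeClaimName": f"demo-node-attempt-{n}",  # illustrative name only
            "status": "Success" if succeeded else "Failed",
            "description": "" if succeeded else "InsufficientCapacityError",
        })
        if succeeded:
            break
    return attempts  # if every attempt failed, the plan itself fails
```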

Deletion phase — Old NodeClaims are drained and deleted. If container live migration is available, Kentroller creates a RebalanceMigrationPlan to move workloads without restarts before deleting the node. Deletion events are recorded in status.nodesDeletion:

| Status | Description |
| --- | --- |
| LiveMigrationCreated | A RebalanceMigrationPlan was created for this node |
| LiveMigrationInProgress | Live migration is running |
| LiveMigrationSucceeded | Live migration completed; node is safe to delete |
| LiveMigrationFailed | Live migration failed; falls back to standard eviction |
| InProgress | NodeClaim deletion in progress |
| Success | NodeClaim deleted successfully |
| Failed | NodeClaim deletion failed |

Savings validation

If executionPolicy.achievedSavingsPercentageThreshold is set on the claim, Kentroller validates actual savings after new nodes are provisioned but before old nodes are deleted. If the realized savings do not meet the threshold, the plan is aborted with FailureReasonInsufficientSavings.

Validated savings are stored in status.validatedSavings on the RebalancePlan. If a fallback instance type was used due to insufficient capacity, the originally planned price is used as a conservative estimate.

Failure recovery

When a plan fails, Kentroller records which phase caused the failure in status.failurePhase to ensure correct cleanup after a controller restart:

| Phase | Meaning | Cleanup action |
| --- | --- | --- |
| Creation | Failed while provisioning new NodeClaims | Roll back by deleting newly created NodeClaims |
| Deletion | Failed while draining or deleting old nodes | Uncordon new nodes so Karpenter can manage them normally |

Savings reporting

After plan generation, Cast AI populates status.savings on the RebalancePlan:

| Field | Description |
| --- | --- |
| projectedSavingsPercent | Percentage savings relative to current node spend |
| projectedSavingsCostMonthly | Estimated monthly savings |
| currentMonthlyCost / projectedMonthlyCost | Before and after monthly cost for the rebalanced nodes |
| currentClusterMonthlyCost / projectedClusterMonthlyCost | Before and after monthly cost for the entire cluster |
| blueNodes / greenNodes | Per-node cost details for nodes being removed and added |

Node consolidation

The node consolidation controller watches Node objects and deletes the corresponding Karpenter NodeClaim when consolidation conditions are met. Deleting the NodeClaim delegates final node and cloud instance termination to Karpenter.

Consolidation is triggered through three paths:

| Path | Trigger condition |
| --- | --- |
| Expiration | NodeClaim.spec.expireAfter has elapsed |
| Evictor handoff | Node has evictor.cast.ai/eviction-status=done, autoscaling.cast.ai/draining=evictor, and spec.unschedulable=true |
| Empty node | Node is empty, NodePool has consolidation enabled, and the node has the Consolidatable=True condition set by Karpenter |

Node consolidation is feature-gated and can be enabled or disabled dynamically without restarting Kentroller.

Configuration

Required environment variables

| Variable | Description |
| --- | --- |
| CLUSTER_ID | Cast AI cluster identifier |
| API_KEY | Cast AI API key for authentication |
| API_URL | Cast AI REST API endpoint |
| GRPC_URL | Cast AI gRPC endpoint for plan coordination |

Optional environment variables

| Variable | Default | Description |
| --- | --- | --- |
| GRPC_DISABLE_TLS | false | Disables TLS for gRPC connections |
| FEATURES_WATCH_NAMESPACES | castai-agent | Namespaces to watch for feature state ConfigMaps |
| DYNAMIC_CONFIG_MAP_NAME | — | ConfigMap name for dynamic configuration |
| NODE_CONSOLIDATION_MAX_CONCURRENT_RECONCILES | 10 | Parallel node consolidation operations |
| REBALANCER_MAX_CONCURRENT_RECONCILES | 10 | Parallel rebalancing plan operations |

Continuous rebalancing variables

| Variable | Default | Description |
| --- | --- | --- |
| CONTINUOUS_REBALANCING_MIN_NODE_AGE | 5m | Minimum node age before a node is considered a rebalancing candidate |
| CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER | 1 | Minimum number of candidate nodes required to run a cycle |
| CONTINUOUS_REBALANCING_MAX_NODES_PER_ITERATION | 100 | Maximum candidates evaluated per run |
| CONTINUOUS_REBALANCING_MAX_BINARY_SEARCH_DEPTH | 20 | Maximum search depth |
| CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_PERCENT | 0.0 | Minimum savings percentage relative to replaced nodes |
| CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY | 50.0 | Minimum absolute monthly savings in USD |
| CONTINUOUS_REBALANCING_FAILURE_BACKOFF_DURATION | 30m | Backoff after a failed cycle |
| CONTINUOUS_REBALANCING_POLLING_PERIOD | 10s | How often the continuous rebalancing cycle runs |

Spot interruption prediction variables

| Variable | Default | Description |
| --- | --- | --- |
| SPOT_INTERRUPTION_PREDICTION_MODEL_NAME | — | ML model name to use for predictions |
| SPOT_INTERRUPTION_PREDICTION_POLL_INTERVAL | 1m | How often to poll the Cast AI ML API |
| SPOT_INTERRUPTION_PREDICTION_SPOT_REPLACEMENT_COOLDOWN | 15m | Cooldown period before the same node can be replaced again |
| SPOT_INTERRUPTION_PREDICTION_INTERRUPTION_THRESHOLD | — | Probability threshold (0–1) above which a node is considered at risk |
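In a Deployment or Helm values file, these variables would typically be set as container env entries. A config sketch using the documented defaults (the surrounding manifest structure is illustrative, not the actual chart layout):

```yaml
# Container env fragment -- illustrative placement, not a full manifest.
env:
  - name: CONTINUOUS_REBALANCING_POLLING_PERIOD
    value: "10s"   # default
  - name: CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_PERCENT
    value: "0.0"   # default
  - name: CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY
    value: "50.0"  # default
```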

View resources

List RebalancePlans

To list all rebalancing plans in the cluster:

kubectl get rebalanceplans
# or using the short name:
kubectl get rp

Example output:

NAME                          BEFORE     AFTER      SAVINGS%   STATE   AUTO-EXECUTED   AGE
nightly-rebalance-1735776000  1234.56    1111.11    10.0       Done    true            2h

List RebalancePlanClaims

To list all rebalancing plan claims:

kubectl get rebalanceplanclaims
# or using the short name:
kubectl get rpc

Example output:

NAME                                  STATE       PLANID   NODES   SAVINGS%   REBALANCE   AGE
nightly-spot-rebalance-1735776000     Completed   abc123   5       12.4       rp-xyz      2h

Inspect a RebalancePlan

To view the full details of a specific plan:

kubectl describe rebalanceplan <plan-name>

The status section shows:

  • state — current execution state: Pending, Running, Done, DoneWithWarnings, or Failed
  • savings — cost details for current and projected node configurations
  • nodesCreation — per-node creation attempt history
  • nodesDeletion — per-node deletion event history
  • failurePhase — if failed, which phase caused the failure (Creation or Deletion)

Troubleshooting

RebalancePlanClaim stuck in Pending or Generating

  1. Check the claim status:

    kubectl describe rpc <claim-name>

    Review status.conditions and status.errorMessage.

  2. Verify the Cast AI agent is connected and communicating with Cast AI.

  3. Check if minSavingsPercentage is set too high — if projected savings don't meet the threshold, the claim will fail without executing.

RebalancePlan stuck in Running

  1. Check for failed node creation attempts:

    kubectl describe rp <plan-name>

    Look at status.nodesCreation for InsufficientCapacityError on all instance types. If all attempts are exhausted, the plan transitions to Failed with FailureReasonNodeCreateFailed.

  2. Check status.nodesDeletion for stuck live migrations. If a RebalanceMigrationPlan is stuck, inspect it directly:

    kubectl get rebalancemigrationplans
    kubectl describe rebalancemigrationplan <name>

Continuous rebalancing not running

  1. Verify the continuous rebalancing feature is enabled in Cast AI.

  2. Check that enough candidate nodes exist — CONTINUOUS_REBALANCING_MIN_NODES_TO_CONSIDER must be met.

  3. Verify the savings thresholds are reachable. If the cluster is already well-optimized, plans may be generated but discarded because projected savings fall below CONTINUOUS_REBALANCING_SAVINGS_THRESHOLD_COST_MONTHLY.

  4. Check Kentroller logs:

    kubectl logs -n castai-agent -l app=castai-kentroller | grep "continuous-rebalancing"

Node consolidation not triggering

  1. Verify the node consolidation feature is enabled in Cast AI.

  2. For the empty node path, check that the NodePool has consolidation enabled and that Karpenter is setting the Consolidatable=True condition on empty nodes.

  3. Check Kentroller logs:

    kubectl logs -n castai-agent -l app=castai-kentroller | grep "empty-node-deleter"

Spot interruption replacements not happening

  1. Verify the spot interruption prediction feature is enabled in Cast AI.

  2. Check that SPOT_INTERRUPTION_PREDICTION_MODEL_NAME is set and that Kentroller has connectivity to the Cast AI ML API.

  3. Look for SpotInterruptionPredicted events on spot nodes:

    kubectl get events --field-selector reason=SpotInterruptionPredicted
