Getting started
Early Access Feature: This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
This guide walks you through connecting your Karpenter-managed cluster to Cast AI, deploying optimization features, and running your first rebalancing operation. By the end, you'll have Cast AI working alongside Karpenter with active optimization in place.
What you'll accomplish:
- Connect your EKS cluster running Karpenter to Cast AI
- Deploy the full Cast AI for Karpenter suite
- Run your first cluster rebalancing
- Enable workload autoscaling for continuous optimization
Prerequisites
Required tools
Ensure the following tools are installed and accessible:
- kubectl (v1.19+) configured to access your cluster – Install kubectl
- Helm (v3.0+) for deploying Cast AI components – Install Helm
The onboarding script handles component installation via Helm automatically.
Cluster requirements
Your cluster must meet these requirements:
- Amazon EKS cluster with Karpenter version 0.32.0 or later installed and operational
- kubectl access with cluster-admin or equivalent permissions to create namespaces and deploy workloads
- Outbound internet access to api.cast.ai and grpc.cast.ai
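To confirm the version requirement is met, you can check the Karpenter controller image tag. This assumes the default install location (the karpenter deployment in the karpenter namespace, the same one referenced in the troubleshooting steps below):

```
# the image tag shows the installed Karpenter version
kubectl get deployment karpenter -n karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```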
AWS permissions
The IAM principal running the onboarding script needs permissions to:
- Create IAM roles and policies
- Create instance profiles
- Manage EC2 security groups
For the minimal required permissions policy, see Cloud permissions.
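If you're unsure which principal your terminal session will use when running the script, you can check before onboarding:

```
# shows the IAM identity (user or assumed role) the script will run as
aws sts get-caller-identity
```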
Cast AI account
If you don't have a Cast AI account, sign up here. The savings report and basic features are available on the free tier.
Step 1: Connect cluster and deploy Cast AI
- Log in to the Cast AI console and click Connect cluster.

- In the connection modal, select EKS as your provider and Karpenter as your autoscaling tool, then click Generate script.

- Copy the generated script on the Enable automation screen and run it in your terminal.
The full Cast AI for Karpenter suite includes Rebalancer, Workload Autoscaler, Evictor, Spot interruption prediction, and Pod mutations. These components work alongside Karpenter without replacing it.

What components get installed
The onboarding script installs Cast AI components in the castai-agent namespace:
| Component | Purpose |
|---|---|
| castai-agent | Collects cluster metrics and workload data |
| castai-aws-node | AWS VPC CNI integration for container networking |
| castai-cluster-controller | Coordinates optimization actions |
| castai-evictor | Handles intelligent workload consolidation |
| castai-kentroller | Integrates Cast AI with Karpenter |
| castai-live-controller | Orchestrates container live migration |
| castai-live-daemon | Per-node agent for live migration operations |
| castai-live-patch-daemonset | Patches nodes for live migration support |
| castai-workload-autoscaler | Continuous workload rightsizing |
| castai-pod-mutator | Automates Pod spec adjustments |
For more details, see Hosted components.
The script automatically adds Helm repositories, installs all components, detects your cluster ID, and configures feature integrations. Deployment takes 2-3 minutes.
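Because the components are deployed through Helm, one quick way to see what the script installed is to list the releases in the castai-agent namespace:

```
# Helm releases in the Cast AI namespace
helm list -n castai-agent
```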
- After deployment completes, verify all pods are running:
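For example, using the castai-agent namespace where the components are installed:

```
kubectl get pods -n castai-agent
```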
All pods should show Running status.
Step 2: Review optimization potential
After closing the confirmation dialog, you'll land on the Get started page.
The Get started page is your optimization dashboard. At the top, you'll see:
- Potential savings – Total monthly savings available
- Current cluster cost vs. Optimal cluster cost – Where you are now vs. where you could be
Below that are feature cards showing each optimization opportunity, along with its projected savings and current status. The Rebalancer card shows your biggest immediate opportunity. This is what you'll run first.
What if features show 'Install' instead of data?
If you deselected features during connection, those cards will show an Install button instead of data. You can install them anytime:
- Click Install on the feature card
- Follow the installation prompts
- The feature will appear as active once installation completes
The Get started page always reflects what's currently deployed in your cluster.
Where did this data come from?
Cast AI analyzed your cluster after onboarding:
- Inventoried all Nodes, Pods, and workloads
- Measured actual CPU and memory usage
- Identified over-provisioned resources
- Compared your current instance types against more optimal alternatives
- Calculated potential consolidation opportunities
This analysis runs continuously, so the numbers stay current as your cluster changes.
You're now ready to either run Rebalancer immediately or configure features first. Most users start with Rebalancer to see immediate results.
Step 3: Configure features (optional)
Before running your first rebalancing, you may want to enable Evictor and review your Karpenter configuration in Cast AI.
Enable Evictor for ongoing consolidation
Evictor is disabled by default, allowing you to control when active Pod consolidation begins. To enable it, navigate to Settings and toggle Evictor under Additional features.
Container live migration with Evictor is automatically enabled, allowing workloads to move with potentially no downtime when supported. Evictor will now continuously monitor your cluster and consolidate workloads when opportunities arise, respecting Pod Disruption Budgets throughout.
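If you'd like to confirm from the command line that Evictor is active after toggling it on, you can check that its deployment has been scaled up (the same check used in the troubleshooting section below):

```
# should print 1 once Evictor is enabled
kubectl get deployment castai-evictor -n castai-agent -o jsonpath='{.spec.replicas}'
```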
Should I enable Evictor now or later?
Enable now if: You want fully automated, continuous optimization. Evictor runs safely in the background and only consolidates when it won't disrupt workloads.
Enable later if: You prefer to run Rebalancer manually first to see how Cast AI optimizes your cluster, then enable continuous consolidation once you're comfortable.
Either approach works—Rebalancer provides immediate, one-time optimization, while Evictor provides ongoing efficiency.
Review your Karpenter configuration
Cast AI automatically imported your Karpenter NodePools and NodeClasses. Navigate to Karpenter in the left sidebar to see them.
This page shows how Cast AI recognizes and preserves your existing Karpenter setup.
NodePools and NodeClasses are read-only in the Cast AI console. To modify your Karpenter configuration, use your existing GitOps workflows or kubectl commands. Cast AI respects your Karpenter CRDs as the source of truth.
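For example, you can inspect the same objects with kubectl. The EC2NodeClass kind below assumes a typical Karpenter-on-EKS setup:

```
kubectl get nodepools
kubectl get ec2nodeclasses
```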
What are Node overlays?
You'll see a Node overlays tab marked "Coming soon." Node overlays (a Karpenter 1.6+ feature) let clusters override scheduler decisions for special hardware, OS configs, or capacity types. Cast AI support for Node overlays is planned for a future release.
Step 4: Run your first rebalancing
Rebalancing replaces nodes with more cost-efficient alternatives. This is your first active optimization.
- Locate the Rebalancer card on the Get started page or the navigation sidebar.
- Review the projected savings and configuration comparison:
  - Current cluster cost and instance count
  - Rebalanced cluster cost and optimized instance count
- Click Rebalance now on the Rebalancer card.
- Review the rebalancing plan that appears:
  - Nodes to be replaced
  - Replacement instance types
  - Expected cost reduction
- Click Generate plan for Cast AI to generate a rebalancing plan for the selected Nodes.
What happens during rebalancing?
When you start rebalancing:
- Cast AI creates new NodeClaims through Karpenter for the optimized instance types
- Karpenter provisions the new nodes
- Workloads are drained using standard Kubernetes eviction
- Old nodes are deleted once empty
The entire process respects Pod Disruption Budgets and coordinates with Karpenter's native lifecycle management.
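If you want to watch this from the command line while a rebalancing runs, the Karpenter NodeClaims and the nodes themselves show the replacement happening:

```
# run each in its own terminal: new NodeClaims appear for the optimized
# instance types, then the old nodes are drained and removed
kubectl get nodeclaims -w
kubectl get nodes -w
```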
Monitor rebalancing progress
- Navigate to Rebalancer in the left sidebar.
- View the active rebalancing operation showing:
  - Nodes being replaced
  - Progress percentage
  - Time remaining
Tracking progress via CRDs
Currently, for Karpenter clusters, the Cast AI console may not show detailed progress during rebalancing. For full visibility, inspect the Rebalancing CRD directly:
kubectl get rebalancing -n castai-agent
kubectl describe rebalancing <name> -n castai-agent
The Rebalancing CRD contains:
- Current phase and stage
- Status messages
- Related events for each step
- Failure reasons if something goes wrong
This is the source of truth for rebalancing status.
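To follow status changes as the operation progresses, you can also watch the resource:

```
kubectl get rebalancing -n castai-agent -w
```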
Once rebalancing is complete, return to the Get started page to view your updated savings figures.
Step 5: Enable workload autoscaling
Workload Autoscaler continuously rightsizes your workloads based on actual resource usage, running in the background once enabled.
- Locate the Workload Autoscaler card on the Get started page.
- Review the projected savings:
  - CPU and memory over-provisioning to be reduced
  - Monthly cost savings from rightsizing
- Go to Workloads in the side navigation menu, then Scaling policies in the top right.
- Once there, toggle the default scaling policies to ON.
This enables automated rightsizing for your workloads with settings that are data-driven and proven across thousands of Cast AI clusters.
Workload Autoscaler will begin gradually generating resource recommendations and conservatively applying them to your workloads. As it gains more confidence from observing your workloads over the next 48 hours, it will likely reduce resource requests more significantly.
What happens after enabling
Workload Autoscaler:
- Begins monitoring resource usage across all workloads
- Generates rightsizing recommendations almost immediately
- Applies recommendations automatically (if configured) or when your Pods restart
- Updates continuously as usage patterns change
You can view recommendations and adjustments on the Workloads page.
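To spot-check the effect on a particular workload from the command line, you can compare its container resource requests before and after a recommendation is applied (replace <deployment-name> with one of your own workloads; this is an illustrative check, not a required step):

```
kubectl get deployment <deployment-name> \
  -o jsonpath='{.spec.template.spec.containers[0].resources.requests}'
```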
Next steps
You've successfully connected your Karpenter cluster and enabled active optimization. From here, you can:
Customize feature settings
Visit the Settings page to adjust advanced configurations for each feature.
Understand Karpenter integration
Learn more about how Cast AI works alongside Karpenter in Cast AI for Karpenter overview and Cast AI for Karpenter features.
Set up alerts and reporting
Configure Slack or email notifications for optimization events and schedule regular savings reports. See Notifications.
Troubleshooting
Components not starting
If pods in the castai-agent namespace are not reaching Running status:
- Check pod events:
  kubectl describe pod -n castai-agent <pod-name>
- Check component logs:
  kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-agent
- Verify IAM permissions were created correctly by the onboarding script.
Cluster not appearing in console
If your cluster doesn't appear after running the script:
- Verify the agent is running:
  kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-agent
- Check agent logs for connection errors:
  kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-agent -c agent
- Ensure your cluster has outbound internet access to api.cast.ai and grpc.cast.ai.
Karpenter not detected
If Cast AI doesn't detect Karpenter in your cluster:
- Verify Karpenter is installed:
  kubectl get pods -n karpenter
- Check that Karpenter CRDs exist:
  kubectl get crd | grep karpenter
If Karpenter is installed but not detected, contact Cast AI support with your cluster details.
Rebalancing not starting
If rebalancing won't start or shows errors:
- Verify the Kentroller component is running:
  kubectl get pods -n castai-agent -l instance=castai-kentroller
- Check Kentroller logs:
  kubectl logs -n castai-agent -l instance=castai-kentroller
- Ensure Karpenter is operational and able to provision nodes:
  kubectl get nodeclaims
Rebalancing requires both Cast AI and Karpenter to be fully operational.
NodeClaims stuck in Waiting or Provisioning
If NodeClaims remain in Waiting or Provisioning state for more than 10-15 minutes:
- Check NodePool CPU limits, which commonly block scaling:
  kubectl get nodepool -o yaml | grep -A5 "limits:"
- Review instance family and generation constraints. Overly strict filters cause most capacity failures:
  kubectl get nodepool -o yaml | grep -A10 "requirements:"
- For Spot capacity, check if the requested instance types have availability in your region.
- Inspect the NodeClaim for specific errors:
  kubectl describe nodeclaim <nodeclaim-name>
If NodeClaims remain stuck after checking these, contact Cast AI support.
Evictor not consolidating nodes
If Evictor isn't taking consolidation actions:
- Verify Evictor is running with replica count of 1:
  kubectl get deployment castai-evictor -n castai-agent -o jsonpath='{.spec.replicas}'
- Check that Cast AI labels are applied correctly to nodes:
  kubectl get nodes -l karpenter.sh/do-not-disrupt
- Review Evictor logs for blocked evictions:
  kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-evictor --tail=100
- Check if Evictor shows zero disruptable nodes:
  kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-evictor | grep -i "disruptable"
Note: Evictor respects Karpenter's consolidateAfter setting. If your NodePool has a long consolidation window (e.g., 30 minutes), Evictor will wait accordingly before consolidating.
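To see which consolidation window your NodePools are configured with, you can grep for the setting in the same way as the checks above:

```
kubectl get nodepool -o yaml | grep -i consolidateafter
```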
Spot interruption predictions not appearing
If the Spot interruption prediction model is enabled but no predictions appear during active Spot usage:
- Verify the interruption queue is configured in your Karpenter deployment:
  kubectl get deployment -n karpenter karpenter -o yaml | grep -i queue
- Check that IAM permissions for SQS access are correctly configured.
- Confirm the prediction model is enabled in Cast AI Settings.
- Verify Kentroller is running and connected:
  kubectl logs -n castai-agent -l instance=castai-kentroller --tail=50
If predictions still don't appear during active Spot usage, contact Cast AI support.
Rebalancing failed or stuck
If rebalancing operations fail or appear stuck:
- Check the Rebalancing CRD for status and failure reasons:
  kubectl get rebalancing -n castai-agent
  kubectl describe rebalancing <name> -n castai-agent
- Review Kentroller logs for detailed error information:
  kubectl logs -n castai-agent -l instance=castai-kentroller --tail=200
The Rebalancing CRD contains:
- Current phase and stage (including failures)
- Human-readable failure reasons
- Related events for each step
This CRD is the source of truth for diagnosing rebalancing issues. The UI may not show detailed progress, but the CRD provides full visibility.
Escalate to Cast AI support if:
- Rebalance plans fail repeatedly with the same error
- NodeClaims created during rebalancing are stuck for more than 15 minutes
- Kentroller logs show persistent connection or API errors
Unexpected node replacements
If nodes are being replaced unexpectedly:
- Check if Karpenter's drift detection triggered the replacement:
  kubectl get nodeclaim -o yaml | grep -A5 "disruption"
- Review whether node expiration (expireAfter) is configured in your NodePool:
  kubectl get nodepool -o yaml | grep -i expireafter
- Check Karpenter controller logs for consolidation or drift events:
  kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100 | grep -i "disruption\|drift\|consolidat"
Common causes of unexpected replacements:
- NodePool expireAfter setting triggering periodic node repaving
- Drift detection from AMI updates, security group changes, or label modifications
- Karpenter's native consolidation (if Evictor isn't marking nodes with do-not-disrupt)
Workload Autoscaler not generating recommendations
If Workload Autoscaler is enabled but not showing recommendations:
- Verify the Workload Autoscaler pod is running:
  kubectl get pods -n castai-agent -l app=castai-workload-autoscaler
- Check that metrics-server is installed and working:
  kubectl top nodes
  kubectl top pods -A
- Review Workload Autoscaler logs for errors:
  kubectl logs -n castai-agent -l app=castai-workload-autoscaler
Workload Autoscaler requires metrics-server for resource usage data. The onboarding script installs it automatically if not present.
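A quick way to confirm metrics-server is present is to check its deployment, which typically lives in the kube-system namespace:

```
kubectl get deployment metrics-server -n kube-system
```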
Related resources
- Cast AI for Karpenter overview – How Cast AI extends Karpenter
- Cast AI for Karpenter features – Detailed feature documentation
- Rebalancer – Understanding cluster rebalancing
- Workload Autoscaler – Workload rightsizing details