Getting started with Cast AI for Karpenter


This guide walks you through connecting your Karpenter-managed cluster to Cast AI, deploying optimization features, and running your first rebalancing operation. By the end, you'll have Cast AI working alongside Karpenter with active optimization in place.

What you'll accomplish:

  • Connect your EKS cluster running Karpenter to Cast AI
  • Deploy the full Karpenter Enterprise suite
  • Run your first cluster rebalancing
  • Enable workload autoscaling for continuous optimization

Prerequisites

Required tools

Ensure the following tools are installed and accessible:

The onboarding script handles component installation via Helm automatically.

Cluster requirements

Your cluster must meet these requirements:

  • Amazon EKS cluster with Karpenter version 0.32.0 or later installed and operational
  • kubectl access with cluster-admin or equivalent permissions to create namespaces and deploy workloads
  • Outbound internet access to api.cast.ai and grpc.cast.ai

AWS permissions

The IAM principal running the onboarding script needs permissions to:

  • Create IAM roles and policies
  • Create instance profiles
  • Manage EC2 security groups

For the minimal required permissions policy, see Cloud permissions.

Cast AI account

If you don't have a Cast AI account, sign up here. The savings report and basic features are available on the free tier.

Step 1: Connect cluster and deploy Cast AI

  1. Log in to the Cast AI console and click Connect cluster.
  2. In the connection modal, select EKS as your provider and Karpenter as your autoscaling tool, then click Generate script.
  3. Copy the generated script on the Enable automation screen and run it in your terminal.

The full Karpenter Enterprise suite includes Rebalancer, Workload Autoscaler, Evictor, Spot interruption prediction, and Pod mutations. These components work alongside Karpenter without replacing it.

What components get installed

The onboarding script installs Cast AI components in the castai-agent namespace:

Component                      Purpose
castai-agent                   Collects cluster metrics and workload data
castai-aws-node                AWS VPC CNI integration for container networking
castai-cluster-controller      Coordinates optimization actions
castai-evictor                 Handles intelligent workload consolidation
castai-kentroller              Integrates Cast AI with Karpenter
castai-live-controller         Orchestrates container live migration
castai-live-daemon             Per-node agent for live migration operations
castai-live-patch-daemonset    Patches nodes for live migration support
castai-workload-autoscaler     Continuous workload rightsizing
castai-pod-mutator             Automates Pod spec adjustments

For more details, see Hosted components.

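The generated command has the following form, with your own API token in place of <your-api-token>. Treat the token as a secret and avoid sharing or committing the command:

    CASTAI_API_TOKEN=<your-api-token> /bin/bash -c "$(curl -fsSL 'https://api.cast.ai/kent/v1alpha/install.sh')"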

The script automatically adds Helm repositories, installs all components, detects your cluster ID, and configures feature integrations. Deployment takes 2-3 minutes.
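If you'd rather not paste a command containing your token into the terminal (where it may end up in shell history), you can wrap the same install command in a small function. This is a sketch: castai_install is a hypothetical name, and it assumes, as the generated command implies, that install.sh reads the token from the CASTAI_API_TOKEN environment variable.

```shell
# Hypothetical helper: runs the Cast AI onboarding script without putting the
# API token on the command line. Uses $CASTAI_API_TOKEN if set, else prompts.
castai_install() {
  local token script
  token="${CASTAI_API_TOKEN:-}"
  if [ -z "$token" ]; then
    # -s: don't echo the input; -r: keep backslashes literally.
    read -rs -p "Cast AI API token: " token
    echo >&2
  fi
  if [ -z "$token" ]; then
    echo "error: no API token provided" >&2
    return 1
  fi
  # Download first so a failed download is reported instead of silently
  # running an empty script.
  script="$(curl -fsSL 'https://api.cast.ai/kent/v1alpha/install.sh')" || {
    echo "error: could not download install.sh" >&2
    return 1
  }
  # Same invocation as the generated command; the token is passed only in the
  # environment of the child shell.
  CASTAI_API_TOKEN="$token" /bin/bash -c "$script"
}
```

Run castai_install with no arguments and paste the token at the prompt; nothing is echoed to the screen.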

After deployment completes, verify that all pods are running:

    kubectl get pods -n castai-agent

Example output:

    NAME                          READY   STATUS    RESTARTS   AGE
    castai-agent                  2/2     Running   0          5m
    castai-aws-node               2/2     Running   0          7m
    castai-cluster-controller     2/2     Running   0          3m
    castai-evictor                1/1     Running   0          2m
    castai-kentroller             1/1     Running   0          6m
    castai-live-controller        1/1     Running   0          4m
    castai-live-daemon            1/1     Running   0          8m
    castai-live-patch-daemonset   1/1     Running   0          9m
    castai-workload-autoscaler    1/1     Running   0          1m

All pods should show Running status with every container ready, meaning the two numbers in the READY column match.
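For scripted checks (for example, in CI), a small filter can flag any pod that isn't fully ready. This is a sketch: check_pods_ready is a hypothetical name, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints every pod
# that is not Running or does not have all of its containers ready.
# Returns 1 if any such pod is found, 0 otherwise.
check_pods_ready() {
  awk '
    {
      split($2, ready, "/")   # READY column, e.g. "1/2" -> ready[1]=1, ready[2]=2
      if ($3 != "Running" || ready[1] + 0 != ready[2] + 0) {
        print "not ready: " $1 " (READY " $2 ", STATUS " $3 ")"
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

Usage: kubectl get pods -n castai-agent --no-headers | check_pods_ready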

Step 2: Review optimization potential

After the script finishes and you close the confirmation dialog in the console, you'll land on the Get started page.

The Get started page is your optimization dashboard. At the top, you'll see:

  • Potential savings – Total monthly savings available
  • Current cluster cost vs. Optimal cluster cost – Where you are now vs. where you could be

Below that, feature cards show each optimization opportunity with its projected savings and current status. The Rebalancer card usually shows your biggest immediate opportunity, so that's what you'll run first.

What if features show 'Install' instead of data?

If you deselected features during connection, those cards will show an Install button instead of data. You can install them anytime:

  1. Click Install on the feature card
  2. Follow the installation prompts
  3. The feature will appear as active once installation completes

The Get started page always reflects what's currently deployed in your cluster.

Where did this data come from?

Cast AI analyzed your cluster after onboarding:

  • Inventoried all Nodes, Pods, and workloads
  • Measured actual CPU and memory usage
  • Identified over-provisioned resources
  • Compared your current instance types against more optimal alternatives
  • Calculated potential consolidation opportunities

This analysis runs continuously, so the numbers stay current as your cluster changes.
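To spot-check over-provisioning yourself, you can compare a namespace's CPU requests with live usage from metrics-server. This is a rough sketch, not how Cast AI computes its figures: the function names are hypothetical, it covers CPU only, it handles only the common "m" and whole or decimal core notations, and a single kubectl top sample says nothing about peak usage.

```shell
# Convert a CPU quantity ("250m", "1", "0.5") to millicores (rounded).
cpu_millicores() {
  case "$1" in
    "") echo 0 ;;
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Print "<pod> requested=<N>m used=<usage>" for every pod in namespace $1.
cpu_request_vs_usage() {
  local ns="$1" pod reqs total q used
  # One line per pod: "<name> <req1>,<req2>," (a container without a CPU
  # request contributes an empty entry, counted as 0).
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.containers[*]}{.resources.requests.cpu}{","}{end}{"\n"}{end}' |
  while read -r pod reqs; do
    total=0
    for q in ${reqs//,/ }; do
      total=$(( total + $(cpu_millicores "$q") ))
    done
    # kubectl top prints "NAME CPU(cores) MEMORY(bytes)"; keep the CPU column.
    # stdin is redirected so the command can't consume the loop's input.
    used="$(kubectl top pod "$pod" -n "$ns" --no-headers </dev/null 2>/dev/null | awk '{ print $2 }')"
    printf '%s requested=%sm used=%s\n' "$pod" "$total" "${used:-n/a}"
  done
}
```

Usage: cpu_request_vs_usage my-namespace. A pod requesting 1000m while using 40m is the kind of gap rightsizing targets.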

You're now ready to either run Rebalancer immediately or configure features first. Most users start with Rebalancer to see immediate results.

Step 3: Configure features (optional)

Before running your first rebalancing, you may want to enable Evictor and review your Karpenter configuration in Cast AI.

Enable Evictor for ongoing consolidation

Evictor is disabled by default, allowing you to control when active Pod consolidation begins. To enable it, navigate to Settings and toggle Evictor under Additional features.

Enabling Evictor also enables container live migration automatically, so workloads can move between nodes with little or no downtime where live migration is supported. Evictor then continuously monitors your cluster and consolidates workloads when opportunities arise, respecting Pod Disruption Budgets throughout.
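Because Evictor respects Pod Disruption Budgets, a PDB is how you guarantee that a minimum number of a workload's pods stay available during consolidation. A minimal sketch: protect_workload is a hypothetical helper around the standard kubectl create poddisruptionbudget command, and it assumes your pods carry an app=<name> label (adjust the selector to match your own labels).

```shell
# Hypothetical helper: create a PodDisruptionBudget that keeps at least
# MIN_AVAILABLE pods (default 1) of a workload running during voluntary
# evictions such as consolidation or node drains.
# Assumes the workload's pods are labeled app=<name>.
protect_workload() {
  local ns="$1" app="$2" min_available="${3:-1}"
  kubectl create poddisruptionbudget "${app}-pdb" \
    --namespace "$ns" \
    --selector "app=${app}" \
    --min-available "$min_available"
}
```

Usage: protect_workload shop checkout 2 creates checkout-pdb in the shop namespace, keeping at least two checkout pods available.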

Should I enable Evictor now or later?

Enable now if: You want fully automated, continuous optimization. Evictor runs safely in the background and only consolidates when it won't disrupt workloads.

Enable later if: You prefer to run Rebalancer manually first to see how Cast AI optimizes your cluster, then enable continuous consolidation once you're comfortable.

Either approach works: Rebalancer provides immediate, one-time optimization, while Evictor provides ongoing efficiency.

Review your Karpenter configuration

Cast AI automatically imported your Karpenter NodePools and NodeClasses. Navigate to Karpenter in the left sidebar to see them.

This page shows how Cast AI recognizes and preserves your existing Karpenter setup.
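To cross-check what the Karpenter page shows against the cluster itself, the sketch below lists each NodePool with the EC2NodeClass it references and flags references to NodeClasses that don't exist. It assumes the NodePool and EC2NodeClass APIs available from Karpenter 0.32.0, with the referenced class name at spec.template.spec.nodeClassRef.name; check_nodepool_classes is a hypothetical name.

```shell
# Hypothetical helper: print each NodePool and its EC2NodeClass, and return 1
# if any NodePool references an EC2NodeClass that does not exist.
check_nodepool_classes() {
  local classes pool class missing=0
  classes="$(kubectl get ec2nodeclasses -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
  while IFS=$'\t' read -r pool class; do
    if printf '%s\n' "$classes" | grep -qxF -- "$class"; then
      echo "ok: NodePool $pool -> EC2NodeClass $class"
    else
      echo "missing: NodePool $pool -> EC2NodeClass $class"
      missing=1
    fi
  done < <(kubectl get nodepools \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.nodeClassRef.name}{"\n"}{end}')
  return "$missing"
}
```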

What are Node overlays?

You'll see a Node overlays tab marked "Coming soon." This feature will allow Cast AI to dynamically adjust NodePool configurations based on commitment utilization, Spot availability, and other optimization signals, without permanently modifying your NodePool definitions.

For now, Cast AI optimizes by working with your existing NodePools and NodeClasses as-is.

Once you've reviewed Settings and your Karpenter configuration, you're ready to run your first rebalancing.

Step 4: Run your first rebalancing

Rebalancing replaces nodes with more cost-efficient alternatives. This is your first active optimization.

  1. Locate the Rebalancer card on the Get started page, or select Rebalancer in the navigation sidebar.

  2. Review the projected savings and configuration comparison:

    • Current cluster cost and instance count
    • Rebalanced cluster cost and optimized instance count

  3. Click Rebalance now on the Rebalancer card.

  4. Review the rebalancing plan that appears:

    • Nodes to be replaced
    • Replacement instance types
    • Expected cost reduction

  5. Click Start rebalancing to begin the operation.

What happens during rebalancing

The rebalancing process:

  1. Creates new Nodes with more cost-efficient instance types
  2. Drains workloads from old Nodes (using live migration where supported)
  3. Deletes old Nodes once workloads are safely moved
  4. Respects Pod Disruption Budgets throughout

Rebalancing typically completes within 10-15 minutes, depending on cluster size.
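Because rebalancing respects Pod Disruption Budgets, a PDB that currently allows zero disruptions blocks eviction of the pods it covers until more replicas become available, which can slow node draining. A minimal pre-check sketch using the PDB status field disruptionsAllowed (blocking_pdbs is a hypothetical name):

```shell
# List PodDisruptionBudgets (namespace/name) that currently allow zero
# disruptions, i.e. whose pods cannot be evicted right now.
blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' |
  awk '$2 == "0" { print $1 }'
}
```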

Monitor rebalancing progress

You can monitor rebalancing from the Rebalancer page:

  1. Navigate to Rebalancer in the left sidebar.
  2. View the active rebalancing operation showing:
    • Nodes being replaced
    • Progress percentage
    • Time remaining

Once rebalancing is complete, return to the Get started page to view your updated savings figures.
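To see how the node mix changed, you can count nodes per instance type before and after rebalancing. A sketch based on the well-known node.kubernetes.io/instance-type node label; instance_type_counts is a hypothetical name, and nodes without the label are skipped.

```shell
# Print "<instance-type> <node-count>", most common type first.
instance_type_counts() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.node\.kubernetes\.io/instance-type}{"\n"}{end}' |
  awk 'NF { count[$1]++ } END { for (t in count) print t, count[t] }' |
  sort -k2,2nr -k1,1
}
```

Save the output before starting rebalancing and compare it with a second run afterwards.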

Step 5: Enable workload autoscaling

Workload Autoscaler rightsizes your workloads based on their actual resource usage and keeps running in the background as usage changes.

  1. Locate the Workload Autoscaler card on the Get started page.

  2. Review the projected savings:

    • CPU and memory over-provisioning to be reduced
    • Monthly cost savings from rightsizing

[SCREENSHOT: Workload Autoscaler card showing projected CPU and memory reductions]

  3. Click Configure on the Workload Autoscaler card.

[SCREENSHOT: Workload Autoscaler configuration page showing recommendation mode options]

  4. On the configuration page, review the default settings:
    • Recommendation application mode (automatic or manual)
    • Resource safety margins
    • Workload selection criteria

[SCREENSHOT: Close-up of recommendation settings with safety margins and workload filters]

  5. Keep the defaults and click Enable Workload Autoscaler.

What happens after enabling

Workload Autoscaler:

  • Begins monitoring resource usage across all workloads
  • Generates rightsizing recommendations within 24-48 hours
  • Applies recommendations automatically (if configured)
  • Updates continuously as usage patterns change

You can view recommendations and adjustments on the Workload Autoscaler page.

[SCREENSHOT: Workload Autoscaler page showing active recommendations and applied changes]

Next steps

You've successfully connected your Karpenter cluster and enabled active optimization. From here, you can:

Review the Cluster Scorecard
The Cluster Scorecard provides an overall health and efficiency rating, showing opportunities for further optimization.

Customize feature settings
Visit the Settings page to adjust advanced configurations for each feature.

Understand Karpenter integration
Learn more about how Cast AI works alongside Karpenter in Karpenter Enterprise overview and Karpenter Enterprise features.

Set up alerts and reporting
Configure Slack or email notifications for optimization events and schedule regular savings reports. See Notifications.

Troubleshooting

Components not starting

If pods in the castai-agent namespace are not reaching Running status:

  1. Check pod events:

    kubectl describe pod -n castai-agent <pod-name>
  2. Check the logs of the affected pod (add --previous to see output from a container that has restarted):

    kubectl logs -n castai-agent <pod-name> --all-containers
  3. Verify IAM permissions were created correctly by the onboarding script.

For more guidance, see Troubleshooting Cast AI components.

Cluster not appearing in console

If your cluster doesn't appear after running the script:

  1. Verify the agent is running:

    kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-agent
  2. Check agent logs for connection errors:

    kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-agent -c agent
  3. Ensure your cluster has outbound internet access to api.cast.ai and grpc.cast.ai.
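Checking connectivity from your workstation doesn't prove the cluster itself has egress. A sketch that tests from inside the cluster with a throwaway Pod; it assumes the public curlimages/curl image can be pulled, and check_castai_egress is a hypothetical name. curl exits non-zero on DNS, TCP, or TLS failures; any HTTP response counts as reachable. A failure can also stem from how an endpoint handles a plain HTTPS request, so read the curl error before concluding that egress is blocked.

```shell
# Hypothetical helper: run a short-lived curl Pod against each Cast AI
# endpoint and report whether it is reachable from inside the cluster.
check_castai_egress() {
  local host rc=0
  for host in api.cast.ai grpc.cast.ai; do
    # --rm deletes the Pod afterwards; with -i and --restart=Never, kubectl
    # returns the container's exit code.
    if kubectl run "castai-egress-check-$RANDOM" --rm -i --restart=Never \
         --image=curlimages/curl --command -- \
         curl -sS --max-time 10 -o /dev/null "https://${host}" >/dev/null; then
      echo "reachable: $host"
    else
      echo "unreachable: $host"
      rc=1
    fi
  done
  return "$rc"
}
```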

For EKS-specific issues, see EKS connection troubleshooting.

Karpenter not detected

If Cast AI doesn't detect Karpenter in your cluster:

  1. Verify Karpenter is installed (adjust the namespace if Karpenter runs elsewhere, for example in kube-system):

    kubectl get pods -n karpenter
  2. Check that Karpenter CRDs exist:

    kubectl get crd | grep karpenter
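A more specific check looks for the CRDs introduced with Karpenter 0.32.0, the minimum version this guide requires. Older releases used the Provisioner and AWSNodeTemplate CRDs instead, so finding only those suggests Karpenter needs upgrading. A sketch; check_karpenter_crds is a hypothetical name.

```shell
# Hypothetical helper: report which of the Karpenter (0.32.0+) CRDs exist.
# Returns 1 if any is missing.
check_karpenter_crds() {
  local crds crd missing=0
  crds="$(kubectl get crd -o name)"
  for crd in nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws; do
    if printf '%s\n' "$crds" | grep -qxF "customresourcedefinition.apiextensions.k8s.io/${crd}"; then
      echo "found: $crd"
    else
      echo "missing: $crd"
      missing=1
    fi
  done
  return "$missing"
}
```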

If Karpenter is installed but not detected, contact Cast AI support with your cluster details.

Rebalancing not starting

If rebalancing won't start or shows errors:

  1. Verify the Kentroller component is running:

    kubectl get pods -n castai-agent -l app=castai-kentroller
  2. Check Kentroller logs:

    kubectl logs -n castai-agent -l app=castai-kentroller
  3. Ensure Karpenter is operational and able to provision nodes:

    kubectl get nodeclaims

Rebalancing requires both Cast AI and Karpenter to be fully operational.
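To narrow down whether Karpenter can provision nodes, you can list NodeClaims whose Ready condition isn't True. NodeClaims that stay in this list suggest Karpenter is failing to launch or register nodes. A sketch; unready_nodeclaims is a hypothetical name.

```shell
# Print "<nodeclaim> <Ready status>" for every NodeClaim that is not Ready.
# A missing Ready condition is reported as "Unknown".
unready_nodeclaims() {
  kubectl get nodeclaims \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
  awk '$2 != "True" { print $1, ($2 == "" ? "Unknown" : $2) }'
}
```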

Workload Autoscaler not generating recommendations

If Workload Autoscaler is enabled but not showing recommendations after 48 hours:

  1. Verify the Workload Autoscaler pod is running:

    kubectl get pods -n castai-agent -l app=castai-workload-autoscaler
  2. Check that metrics-server is installed and working:

    kubectl top nodes
    kubectl top pods -A
  3. Review Workload Autoscaler logs for errors:

    kubectl logs -n castai-agent -l app=castai-workload-autoscaler

Workload Autoscaler requires metrics-server for resource usage data. The onboarding script installs it automatically if not present.
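Besides kubectl top, you can check whether the Kubernetes Metrics API served by metrics-server is registered and reports itself as available through its APIService object (v1beta1.metrics.k8s.io). A sketch; metrics_api_available is a hypothetical name.

```shell
# Return 0 if the v1beta1.metrics.k8s.io APIService is Available, 1 otherwise.
metrics_api_available() {
  local status
  status="$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)" || status=""
  if [ "$status" = "True" ]; then
    echo "metrics API available"
  else
    echo "metrics API not available (status: ${status:-not found})"
    return 1
  fi
}
```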

Related resources