Cast AI Operator

Automated lifecycle management for Cast AI components

The Cast AI Operator (castware-operator) automates the installation, configuration, and updating of Cast AI components in your Kubernetes cluster. The Operator reduces manual effort by managing component lifecycles through a single, self-updating control plane component.

What the Operator manages

The Cast AI Operator currently manages:

  • castai-agent: Cluster monitoring and data collection

Support for additional components, including cluster-controller, evictor, and spot-handler, is planned for future releases.

The Operator automatically:

  • Installs the latest versions of managed components
  • Applies updates that you trigger from the Cast AI console (semi-automated updates)
  • Self-upgrades when new Operator versions are released

Architecture

The Operator runs as a single pod in the castai-agent namespace and uses minimal resources:

  • Memory: Typically less than 256 MB
  • CPU: Typically less than 0.5 CPU cores

To confirm that the Operator pod is running:

kubectl get pods -n castai-agent -l app.kubernetes.io/name=castware-operator
NAME                                 READY   STATUS    RESTARTS   AGE
castware-operator-8667b6744d-kmqnn   1/1     Running   0          10m
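
To compare actual usage against the figures above, a sketch like the following may help. It assumes the cluster exposes the Kubernetes Metrics API (for example through metrics-server); the function name operator_usage is hypothetical, and the label selector is the one used in the Troubleshooting commands on this page.

```shell
# Sketch: show current CPU and memory usage of the Operator pod.
# Assumption: the Metrics API (e.g. metrics-server) is available in the cluster.
operator_usage() {
  kubectl top pod -n castai-agent -l app.kubernetes.io/name=castware-operator
}
```

Calling operator_usage prints one row per matching pod with its CPU and memory consumption.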

The Operator uses Kubernetes Custom Resources (CRs) to manage configuration:

  • Cluster CR: Global configuration shared across components (API URLs, cloud provider, cluster metadata)
  • Component CRs: Individual component configuration (version, Helm values, enablement status)
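
To inspect these Custom Resources on a running cluster, something like the sketch below may be useful. The resource names clusters.castware.cast.ai and components.castware.cast.ai are taken from the cleanup commands in the Troubleshooting section; the helper name list_castware_crs is hypothetical.

```shell
# Sketch: list the Operator's Cluster and Component CRs in one call.
# The optional first argument overrides the namespace (default: castai-agent).
list_castware_crs() {
  ns="${1:-castai-agent}"
  kubectl get clusters.castware.cast.ai,components.castware.cast.ai -n "$ns"
}
```

Running list_castware_crs with no arguments queries the castai-agent namespace used throughout this page.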

Installation

New cluster onboarding

For new clusters onboarded via the Cast AI Console, the Operator is installed automatically during Phase 1 (read-only) onboarding:

  1. The onboarding script installs castware-operator
  2. The Operator pod restarts once after generating certificates (this is expected behavior)
  3. The Operator then installs the latest version of castai-agent
  4. The cluster remains in Connecting... status until both components are successfully installed:
    kubectl get pods -n castai-agent -w
    NAME                                 READY   STATUS    RESTARTS   AGE
    castware-operator-8667b6744d-kmqnn   1/1     Running   0          93s
    castai-agent-74989f5596-knfrs        2/2     Running   0          38s
    castai-agent-74989f5596-vnqvk        2/2     Running   0          38s
    castai-agent-cpvpa-964fc94b6-n5dkq   1/1     Running   0          38s
📘

Note

Some onboarding journeys (such as AI Enabler or Cloud Connect) do not include the Operator by default. In these cases, castai-agent is installed using the traditional method, but you can enable the Operator later via Component Control.
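
For the standard onboarding flow, a script can wait for both components instead of watching the pod list by hand. The following is a minimal sketch; it assumes the Deployments are named castware-operator and castai-agent (inferred from the pod names shown above), and the function name wait_for_castware is hypothetical.

```shell
# Sketch: block until both Deployments report a completed rollout.
# Assumption: Deployment names match the pod name prefixes shown on this page.
# The optional first argument is the per-Deployment timeout (default: 300s).
wait_for_castware() {
  timeout="${1:-300s}"
  for deploy in castware-operator castai-agent; do
    kubectl rollout status "deployment/$deploy" -n castai-agent \
      --timeout="$timeout" || return 1
  done
}
```

The function returns a non-zero status as soon as one rollout fails or times out, which makes it easy to chain with other automation steps.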

Existing clusters

For clusters already onboarded to Cast AI, you can enable the Operator in three ways:

Option 1: Component Control (recommended)

  1. In the Cast AI console, select Manage Organization in the top right

  2. Navigate to Component Control in the left menu

  3. Click Enable Now → in the Operator widget

  4. Select the cluster where you want to install the Operator and click Enable

  5. Copy and run the provided installation script

The console-provided Helm script handles cluster registration and automatic migration of existing castai-agent installations.

Option 2: Manual installation via Helm

For manual installation or automation scenarios:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update castai-helm

helm upgrade --install castware-operator castai-helm/castware-operator \
  --namespace castai-agent \
  --create-namespace \
  --set apiKeySecret.apiKey="<your-api-key>" \
  --set defaultCluster.provider="<eks|gke|aks>" \
  --set defaultCluster.api.apiUrl="https://api.cast.ai" \
  --set defaultComponents.enabled=false \
  --atomic \
  --wait

Parameters:

  • apiKeySecret.apiKey: Your Cast AI Full Access API key (required)
  • defaultCluster.provider: Cloud provider, one of eks, gke, or aks (required)
  • defaultCluster.api.apiUrl: Cast AI API endpoint (use https://api.eu.cast.ai for the EU region)
  • defaultComponents.enabled: Whether to install castai-agent automatically after the Operator is installed (defaults to true; set to false on existing clusters that already run castai-agent)
  • defaultCluster.migrationMode: How existing components are migrated: write (migrate and manage), autoUpgrade (migrate, manage, and upgrade), or read (detect only, no management). Defaults to write
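
For automation, the parameters above can be wrapped in a small script. The sketch below is one possible shape, not an official installer: the function names, the region argument ("eu" or anything else), and the CASTAI_API_KEY environment variable are all assumptions made for illustration. The Helm values and the two API URLs are the ones listed on this page.

```shell
# Hypothetical helper: map a region name to the API endpoints listed above.
castai_api_url() {
  case "$1" in
    eu) echo "https://api.eu.cast.ai" ;;
    *)  echo "https://api.cast.ai" ;;
  esac
}

# Sketch: install the Operator on an existing cluster with an explicit
# migration mode. Arguments: provider, region (optional), mode (optional).
# Assumption: CASTAI_API_KEY holds your Cast AI Full Access API key.
install_castware_operator() {
  provider="$1"; region="${2:-us}"; mode="${3:-write}"
  case "$mode" in
    write|autoUpgrade|read) ;;
    *) echo "unknown migration mode: $mode" >&2; return 1 ;;
  esac
  if [ -z "${CASTAI_API_KEY:-}" ]; then
    echo "CASTAI_API_KEY is not set" >&2
    return 1
  fi
  helm upgrade --install castware-operator castai-helm/castware-operator \
    --namespace castai-agent \
    --create-namespace \
    --set apiKeySecret.apiKey="$CASTAI_API_KEY" \
    --set defaultCluster.provider="$provider" \
    --set defaultCluster.api.apiUrl="$(castai_api_url "$region")" \
    --set defaultCluster.migrationMode="$mode" \
    --set defaultComponents.enabled=false \
    --atomic
}
```

For example, install_castware_operator eks eu read would detect an existing agent on an EKS cluster in the EU region without taking over its management. Reading the key from an environment variable keeps it out of the script itself.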

Migrating existing castai-agent installations:

When you install the Operator on a cluster with an existing castai-agent, the Operator automatically detects and migrates the agent:

  • Helm-installed agents: Creates a Component CR matching the current version and takes over management
  • YAML manifest agents: Extracts configuration, creates a Component CR, and converts the agent to a Helm-managed release

The migration process:

  1. Detects the existing castai-agent deployment
  2. Extracts configuration (environment variables, resource settings, version)
  3. Creates a Component CR with detected configuration
  4. Updates the agent's API key secret reference to use the Operator-managed secret
  5. Takes over lifecycle management
💡

Migration impact

During migration, the agent may restart once, causing 1-2 cluster snapshots to be skipped. This is normal and does not affect cluster functionality.
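
Before installing the Operator, it can be helpful to check which of the two migration paths is likely to apply. The sketch below guesses the install method from the app.kubernetes.io/managed-by label. This is an assumption based on a common Helm chart convention; the Operator's own detection logic is not documented here and may differ. The function name agent_install_method is hypothetical.

```shell
# Sketch: guess how castai-agent was installed by reading a common Helm label.
# Assumption: Helm-installed agents carry app.kubernetes.io/managed-by=Helm.
# Prints "helm", "manifest", or "not-found".
agent_install_method() {
  managed_by="$(kubectl get deployment castai-agent -n castai-agent \
    -o jsonpath='{.metadata.labels.app\.kubernetes\.io/managed-by}' 2>/dev/null)" || {
    echo "not-found"
    return 0
  }
  if [ "$managed_by" = "Helm" ]; then
    echo "helm"
  else
    echo "manifest"
  fi
}
```

A result of "not-found" means the Deployment lookup failed, for example because no castai-agent Deployment exists in the castai-agent namespace.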

Option 3: During castai-agent update

When updating castai-agent, the console prompts you to install the Operator:

  1. Open the castai-agent update drawer
  2. Copy the update script (Operator installation is included by default)
  3. Run the script to install the Operator, which then upgrades castai-agent
📘

Opt-out option

The update drawer includes an option to opt out of Operator installation if needed.

Terraform installation

Operator installation via Terraform is coming soon. Terraform support will include:

  • Operator installation during cluster onboarding
  • castai-agent updates for Terraform-managed clusters

Updating components

Semi-automated updates

Once the Operator is installed, you can update both the Operator and castai-agent through the Cast AI console without running additional scripts.

Updating the Operator

  1. Navigate to Component Control
  2. Open the Cast AI Operator widget
  3. Click Update when a new version is available
  4. The Operator performs a self-upgrade

Updating castai-agent

  1. Navigate to Component Control
  2. Locate castai-agent in the component list
  3. Click Update
  4. The Operator handles the upgrade process automatically
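
To confirm from the command line that an update triggered in the console has rolled out, a check along these lines may help. It is a sketch that assumes the agent runs as a Deployment named castai-agent (the name used in the Troubleshooting section); the function name verify_agent_update is hypothetical.

```shell
# Sketch: wait for the castai-agent rollout, then print the images it runs.
# The optional first argument is the rollout timeout (default: 300s).
verify_agent_update() {
  kubectl rollout status deployment/castai-agent -n castai-agent \
    --timeout="${1:-300s}" || return 1
  kubectl get deployment castai-agent -n castai-agent \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
  echo   # jsonpath output has no trailing newline
}
```

The image tags in the output can then be compared with the version shown in Component Control.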

Removing the Operator

Before you remove

Removing the Operator stops automated component management. After removal:

  • castai-agent and other managed components continue running normally
  • You'll need to use manual Helm commands or console scripts to update components
  • Custom Resources (CRs) and Custom Resource Definitions (CRDs) are removed from the cluster
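
Because the CRs are removed along with the Operator, you may want to save a copy of them for reference first. The following is a minimal sketch; the function name backup_castware_crs and the default file name are hypothetical.

```shell
# Sketch: save the Cluster and Component CRs to a YAML file before uninstalling.
# The optional first argument is the output file (hypothetical default below).
backup_castware_crs() {
  out="${1:-castware-crs-backup.yaml}"
  kubectl get clusters.castware.cast.ai,components.castware.cast.ai \
    -n castai-agent -o yaml > "$out"
}
```

The saved file records the configuration the Operator was using, such as component versions and Helm values, which can help when managing components manually afterwards.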

Uninstall via Helm

To completely remove the Operator and its associated resources:

# Uninstall the Operator
helm uninstall castware-operator -n castai-agent
⚠️

Important

Uninstalling the Operator does not remove castai-agent or other Cast AI components. These components continue operating independently. If you want to fully remove all Cast AI components, see the cluster disconnection documentation.

Console behavior after removal

After manually uninstalling the Operator:

  • The console shows the Operator as enabled until the next sync cycle (~15 minutes)
  • Component update prompts may still reference the Operator temporarily
  • Use manual Helm commands, Terraform, or console-provided scripts to manage components
📘

Note

UI-based Operator disconnection and re-enablement are planned for future releases.

Troubleshooting

Cluster stuck in Connecting status

Symptom: Cluster remains in Connecting... status for more than 10 minutes after running the onboarding script.

Cause: The Operator may have installed successfully, but castai-agent failed to deploy.

Solution:

  1. Check the status of both the Operator and the agent:

    kubectl get pods -n castai-agent
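  2. Check Operator logs for errors (when a label selector is used, kubectl logs returns only the last 10 lines unless --tail is set):

    kubectl logs -l app.kubernetes.io/name=castware-operator -n castai-agent --tail=100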
  3. If castai-agent is not running, check the Component CR status:

    kubectl get component castai-agent -n castai-agent -o yaml
  4. If the status doesn't change after 10 minutes, click Enable in Component Control. If that fails, rerun the installation script

  5. If the issue persists, contact Cast AI Customer Success
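
When contacting support, it can save time to collect the basic diagnostics in one pass. The sketch below combines the commands from the steps above with a namespace event listing; the function name castware_diagnostics is hypothetical.

```shell
# Sketch: gather pod status, recent events, and Operator logs in one pass.
castware_diagnostics() {
  ns="castai-agent"
  kubectl get pods -n "$ns" -o wide
  # Events are sorted so the most recent ones appear last.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
  kubectl logs -l app.kubernetes.io/name=castware-operator -n "$ns" --tail=200
}
```

Redirecting the output to a file, for example castware_diagnostics > diagnostics.txt, produces a single attachment for a support request.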

Component updates failing

Symptom: Clicking Update for a component results in an error, or the update doesn't complete.

Cause: The component may require additional permissions that the Operator doesn't have, or there may be a configuration conflict.

Solution:

  1. Check the Component CR status for error messages:

    kubectl get component <component-name> -n castai-agent -o yaml
  2. Review Operator logs during the update attempt:

    kubectl logs -l app.kubernetes.io/name=castware-operator -n castai-agent --tail=50
  3. If permission errors appear, the Operator may need additional RBAC permissions. Contact Cast AI Customer Success
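
To narrow down a suspected permission problem before contacting support, you can ask the API server what the Operator's service account is allowed to do. This is a sketch under stated assumptions: the Deployment is assumed to be named castware-operator, your own user must be allowed to impersonate service accounts, and the function name operator_can_i is hypothetical.

```shell
# Sketch: check whether the Operator's service account may perform an action.
# Arguments: verb and resource, e.g. "update" "deployments".
# The service account name is read from the Deployment rather than hard-coded.
operator_can_i() {
  verb="$1"; resource="$2"
  sa="$(kubectl get deployment castware-operator -n castai-agent \
    -o jsonpath='{.spec.template.spec.serviceAccountName}')" || return 1
  kubectl auth can-i "$verb" "$resource" -n castai-agent \
    --as="system:serviceaccount:castai-agent:$sa"
}
```

For example, operator_can_i update deployments prints yes or no depending on the RBAC rules bound to that service account.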

Migration from existing agent not working

Symptom: After installing the Operator, the existing castai-agent is not migrated automatically.

Possible causes:

  • The agent deployment is missing the required labels or annotations, for example because it was modified after installation
  • The agent was installed using a non-standard method
  • The Operator doesn't have sufficient permissions to modify the existing deployment

Solution:

  1. Check if the agent has Helm labels:

    kubectl get deployment castai-agent -n castai-agent -o jsonpath='{.metadata.labels}'
  2. Check if a Component CR was created:

    kubectl get component castai-agent -n castai-agent
  3. If migration fails repeatedly, contact Cast AI Customer Success with the agent deployment details

Failed Operator uninstall

Symptom: Running helm uninstall castware-operator fails or hangs, preventing removal of the Operator.

Cause: Helm hooks may be blocking the uninstall process, or there may be finalizers preventing resource deletion.

Solution:

  1. Uninstall without running Helm hooks:

    helm -n castai-agent uninstall castware-operator --no-hooks
  2. Manually remove Custom Resources:

    kubectl delete clusters.castware.cast.ai --all -n castai-agent
    kubectl delete components.castware.cast.ai --all -n castai-agent
  3. Remove Custom Resource Definitions:

    kubectl delete crd clusters.castware.cast.ai
    kubectl delete crd components.castware.cast.ai
  4. Verify all resources are removed (the command prints nothing when no Operator resources remain):

    kubectl get all -n castai-agent | grep castware-operator
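
If deleting the Custom Resources in step 2 hangs, finalizers are a likely reason, since the Operator that would normally clear them is no longer running. The sketch below removes finalizers from every Component CR using a standard kubectl merge patch. Use it with care: it skips whatever cleanup the finalizer was guarding. The function name clear_component_finalizers is hypothetical.

```shell
# Sketch: clear finalizers on every Component CR so a pending delete can finish.
# Warning: this bypasses any cleanup the finalizers were meant to perform.
clear_component_finalizers() {
  for name in $(kubectl get components.castware.cast.ai -n castai-agent -o name); do
    kubectl patch "$name" -n castai-agent --type=merge \
      -p '{"metadata":{"finalizers":null}}'
  done
}
```

The same pattern applies to clusters.castware.cast.ai if the Cluster CR is also stuck in a terminating state.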
