Cast AI Operator
Automated lifecycle management for Cast AI components
The Cast AI Operator (castware-operator) automates the installation, configuration, and updating of Cast AI components in your Kubernetes cluster. The Operator reduces manual effort by managing component lifecycles through a single, self-updating control plane component.
What the Operator manages
The Cast AI Operator currently manages:
castai-agent: Cluster monitoring and data collection
Additional components are coming soon: cluster-controller, evictor, spot-handler, and other Cast AI components will be managed by the Operator in future releases.
The Operator automatically:
- Installs the latest versions of managed components
- Applies semi-automated updates through the Cast AI console
- Self-upgrades when new Operator versions are released
Architecture
The Operator runs as a single pod in the castai-agent namespace and uses minimal resources:
- Memory: Typically less than 256 MB
- CPU: Typically less than 0.5 CPU
kubectl get pods -n castai-agent | grep -i operator
NAME READY STATUS RESTARTS AGE
castware-operator-8667b6744d-kmqnn 1/1 Running 0 10m
The Operator uses Kubernetes Custom Resources (CRs) to manage configuration:
- Cluster CR: Global configuration shared across components (API URLs, cloud provider, cluster metadata)
- Component CRs: Individual component configuration (version, Helm values, enablement status)
Installation
New cluster onboarding
For new clusters onboarded via the Cast AI Console, the Operator is installed automatically during Phase 1 (read-only) onboarding:
- The onboarding script installs castware-operator
- The Operator pod restarts once after generating certificates (this is expected behavior)
- The Operator then installs the latest version of castai-agent
- The cluster remains in Connecting... status until both components are successfully installed:
kubectl get pods -n castai-agent -w
NAME READY STATUS RESTARTS AGE
castware-operator-8667b6744d-kmqnn 1/1 Running 0 93s
castai-agent-74989f5596-knfrs 2/2 Running 0 38s
castai-agent-74989f5596-vnqvk 2/2 Running 0 38s
castai-agent-cpvpa-964fc94b6-n5dkq 1/1 Running 0 38s
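For scripted onboarding, the readiness check can be automated rather than watched by eye. The following is a minimal sketch, not part of the Cast AI tooling: the function name all_pods_ready is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) shown in this section.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and succeeds
# only if at least one pod is listed and every pod is Running with all of its
# containers ready (for example 2/2).
all_pods_ready() {
  awk '
    NR == 1 && $1 == "NAME" { next }   # skip the header row
    NF >= 3 {
      split($2, ready, "/")            # READY column, e.g. "2/2"
      seen++
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { exit !(seen > 0 && bad == 0) }
  '
}
```

Usage, without the -w flag so that the command terminates: kubectl get pods -n castai-agent | all_pods_ready && echo "Operator and agent are ready"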
Note: Some onboarding journeys (such as AI Enabler or Cloud Connect) do not include the Operator by default. In these cases, castai-agent is installed using the traditional method, but you can enable the Operator later via Component Control.
Existing clusters
For clusters already onboarded to Cast AI, you can enable the Operator in three ways:
Option 1: Component Control (recommended)
- In the Cast AI console, select Manage Organization in the top right
- Navigate to Component control in the left menu
- Click Enable Now → in the Operator widget
- Select the cluster where you want to install the Operator and click Enable
- Copy and run the provided installation script
The console-provided Helm script handles cluster registration and automatic migration of existing castai-agent installations.
Option 2: Manual installation via Helm
For manual installation or automation scenarios:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update castai-helm
helm upgrade --install castware-operator castai-helm/castware-operator \
--namespace castai-agent \
--create-namespace \
--set apiKeySecret.apiKey="<your-api-key>" \
--set defaultCluster.provider="<eks|gke|aks>" \
--set defaultCluster.api.apiUrl="https://api.cast.ai" \
--set defaultComponents.enabled=false \
--atomic \
--wait
Parameters:
- apiKeySecret.apiKey: Your Cast AI Full Access API key (required)
- defaultCluster.provider: Cloud provider, one of eks, gke, or aks (required)
- defaultCluster.api.apiUrl: Cast AI API endpoint (use https://api.eu.cast.ai for the EU region)
- defaultComponents.enabled: Automatically install castai-agent after Operator installation (defaults to true; set to false for existing clusters that already have castai-agent installed)
- defaultCluster.migrationMode: How existing components should be migrated: write (migrate and manage), autoUpgrade (migrate, manage, and upgrade), or read (detect only, no management). Defaults to write
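In automation, it can help to validate these values before calling Helm. The sketch below is illustrative only: the function names and the "eu" region label are assumptions made for this example, while the chart values and allowed settings are the ones listed above.

```shell
# Hypothetical helper: maps a region label to the API endpoint.
# "eu" is an illustrative label; anything else falls back to the default URL.
castai_api_url() {
  case "${1:-}" in
    eu) echo "https://api.eu.cast.ai" ;;
    *)  echo "https://api.cast.ai" ;;
  esac
}

# Hypothetical helper: prints validated --set flags for the Operator chart.
# Arguments: provider (eks|gke|aks), region label, migration mode.
operator_set_flags() {
  osf_provider="${1:-}"
  osf_region="${2:-}"
  osf_mode="${3:-write}"   # write is the documented default
  case "$osf_provider" in
    eks|gke|aks) ;;
    *) echo "unsupported provider: $osf_provider" >&2; return 1 ;;
  esac
  case "$osf_mode" in
    write|autoUpgrade|read) ;;
    *) echo "unsupported migrationMode: $osf_mode" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "--set defaultCluster.provider=$osf_provider" \
    "--set defaultCluster.api.apiUrl=$(castai_api_url "$osf_region")" \
    "--set defaultCluster.migrationMode=$osf_mode"
}
```

For example, operator_set_flags gke eu read prints the three flags for a GKE cluster in the EU region with detect-only migration; the output can be appended to the helm upgrade --install command shown earlier.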
Migrating existing castai-agent installations:
When you install the Operator on a cluster with an existing castai-agent, the Operator automatically detects and migrates the agent:
- Helm-installed agents: Creates a Component CR matching the current version and takes over management
- YAML manifest agents: Extracts configuration, creates a Component CR, and converts to Helm-managed
The migration process:
- Detects the existing castai-agent deployment
- Extracts configuration (environment variables, resource settings, version)
- Creates a Component CR with detected configuration
- Updates the agent's API key secret reference to use the Operator-managed secret
- Takes over lifecycle management
Migration impact: During migration, the agent may restart once, causing 1-2 cluster snapshots to be skipped. This is normal and does not affect cluster functionality.
Option 3: During castai-agent update
When updating castai-agent, the console prompts you to install the Operator:
- Open the castai-agent update drawer
- Copy the update script (Operator installation is included by default)
- Run the script to install the Operator, which will in turn upgrade castai-agent
Opt-out option: The update drawer includes an option to opt out of Operator installation if needed.
Terraform installation
Operator installation via Terraform is coming soon. Terraform support will include:
- Operator installation during cluster onboarding
- castai-agent updates for Terraform-managed clusters
Updating components
Semi-automated updates
Once the Operator is installed, you can update both the Operator and castai-agent through the Cast AI console without running additional scripts.
Updating the Operator
- Navigate to Component Control
- Open the Cast AI Operator widget
- Click Update when a new version is available
- The Operator performs a self-upgrade
Updating castai-agent
- Navigate to Component Control
- Locate castai-agent in the component list
- Click Update
- The Operator handles the upgrade process automatically
Removing the Operator
Before you remove
Removing the Operator stops automated component management. After removal:
- castai-agent and other managed components continue running normally
- You'll need to use manual Helm commands or console scripts to update components
- Custom Resources (CRs) and Custom Resource Definitions (CRDs) are removed from the cluster
Uninstall via Helm
To completely remove the Operator and its associated resources:
# Uninstall the Operator
helm uninstall castware-operator -n castai-agent
Important: Uninstalling the Operator does not remove castai-agent or other Cast AI components. These components continue operating independently. If you want to fully remove all Cast AI components, see the cluster disconnection documentation.
Console behavior after removal
After manually uninstalling the Operator:
- The console shows the Operator as enabled until the next sync cycle (~15 minutes)
- Component update prompts may still reference the Operator temporarily
- Use manual Helm commands, Terraform, or console-provided scripts to manage components
Note: UI-based Operator disconnection and re-enablement are planned for future releases.
Troubleshooting
Cluster stuck in Connecting status
Symptom: Cluster remains in Connecting... status for more than 10 minutes after running the onboarding script.
Cause: The Operator may have installed successfully, but castai-agent failed to deploy.
Solution:
- Check the status of both the Operator and the agent:
kubectl get pods -n castai-agent
- Check Operator logs for errors:
kubectl logs -l app.kubernetes.io/name=castware-operator -n castai-agent
- If castai-agent is not running, check the Component CR status:
kubectl get component castai-agent -n castai-agent -o yaml
- If the status doesn't change after 10 minutes, click Enable in Component Control. If that fails, rerun the installation script
- If the issue persists, contact Cast AI Customer Success
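When contacting Customer Success, it is convenient to capture the output of the checks above in one pass. The following is a rough sketch under stated assumptions: the function name and the KUBECTL override variable are hypothetical conveniences, and the --tail=50 limit is borrowed from the log command used elsewhere in this guide.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands
# without touching a cluster. This variable is an assumption of the sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: runs the three diagnostic commands from this section.
collect_operator_diagnostics() {
  echo "== pods =="
  $KUBECTL get pods -n castai-agent
  echo "== operator logs =="
  $KUBECTL logs -l app.kubernetes.io/name=castware-operator -n castai-agent --tail=50
  echo "== castai-agent Component CR =="
  $KUBECTL get component castai-agent -n castai-agent -o yaml
}
```

Usage: collect_operator_diagnostics > castware-diagnostics.txt 2>&1 (the output file name is arbitrary).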
Component updates failing
Symptom: Clicking Update for a component results in an error, or the update doesn't complete.
Cause: The component may require additional permissions that the Operator doesn't have, or there may be a configuration conflict.
Solution:
- Check the Component CR status for error messages:
kubectl get component <component-name> -n castai-agent -o yaml
- Review Operator logs during the update attempt:
kubectl logs -l app.kubernetes.io/name=castware-operator -n castai-agent --tail=50
- If permission errors appear, the Operator may need additional RBAC permissions. Contact Cast AI Customer Success
Migration from existing agent not working
Symptom: After installing the Operator, the existing castai-agent is not migrated automatically.
Possible causes:
- The agent deployment doesn't have the required labels or annotations (due to modifications)
- The agent was installed using a non-standard method
- The Operator doesn't have sufficient permissions to modify the existing deployment
Solution:
- Check if the agent has Helm labels:
kubectl get deployment castai-agent -n castai-agent -o jsonpath='{.metadata.labels}'
- Check if a Component CR was created:
kubectl get component castai-agent -n castai-agent
- If migration fails repeatedly, contact Cast AI Customer Success with the agent deployment details
Failed Operator uninstall
Symptom: Running helm uninstall castware-operator fails or hangs, preventing removal of the Operator.
Cause: Helm hooks may be blocking the uninstall process, or there may be finalizers preventing resource deletion.
Solution:
- Uninstall without running Helm hooks:
helm -n castai-agent uninstall castware-operator --no-hooks
- Manually remove Custom Resources:
kubectl delete clusters.castware.cast.ai --all -n castai-agent
kubectl delete components.castware.cast.ai --all -n castai-agent
- Remove Custom Resource Definitions:
kubectl delete crd clusters.castware.cast.ai
kubectl delete crd components.castware.cast.ai
- Verify all resources are removed:
kubectl get all -n castai-agent | grep castware-operator
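Because these cleanup commands are destructive, it may be worth previewing them before running them. The sketch below wraps the same commands in a dry-run guard; the run helper, the DRY_RUN variable, and the function name are assumptions of this example rather than part of Helm, kubectl, or the Operator.

```shell
# DRY_RUN=1 (the default in this sketch) only prints each command;
# set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: prints or executes the command passed as arguments.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: the forced removal steps from this section, in order.
force_remove_operator() {
  run helm -n castai-agent uninstall castware-operator --no-hooks
  run kubectl delete clusters.castware.cast.ai --all -n castai-agent
  run kubectl delete components.castware.cast.ai --all -n castai-agent
  run kubectl delete crd clusters.castware.cast.ai
  run kubectl delete crd components.castware.cast.ai
}
```

Call force_remove_operator once to review the printed commands, then set DRY_RUN=0 and call it again to apply them.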
Related documentation
- Component Control: View and manage all Cast AI components
- Helm Charts: Manual component installation using Helm
- Cluster Onboarding: Connect new clusters to Cast AI
