Hosted components

Cast AI components hosted on customer clusters.

The Cast AI cluster connection process installs several components into a customer's cluster in phases, providing different levels of functionality:

  • Phase 1: Provides visibility into connected clusters without the ability to tune them. This phase operates in a read-only mode.
  • Phase 2: Enables full functionality of the Cast AI platform, primarily for cluster optimization. In this phase, Cast AI can instruct clusters and cloud providers to reorganize resources for optimal performance.

Phase 1 components

Phase 1 provides visibility into connected clusters, but does not allow for modification. This phase operates in read-only mode and installs the following components:

❯ kubectl get pods -n castai-agent
NAME                                         READY   STATUS    RESTARTS   AGE
castai-agent-7f9d7ff65b-8qm7p                1/1     Running   0          78m
castai-agent-cpvpa-56f749fb-n2wzp            1/1     Running   0          22d
castai-spot-handler-44shj                    1/1     Running   0          43m

  • The Cast AI Kubernetes Agent sends cluster state data (snapshots) to the Cast AI SaaS platform.
  • The Cluster Proportional Vertical Autoscaler adjusts allocated resources for castai-agent Pods based on a predefined formula.
  • The Spot Handler monitors Spot Instance interruption events from cloud providers and reports them to Cast AI. This data improves Cast AI's Spot reliability and interruption prediction models. Spot Handler does not take any action on nodes or workloads.
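A quick health check for these components can be scripted around the pod listing. The sketch below is not part of Cast AI tooling: `not_running` is a hypothetical helper name, and it assumes the default `kubectl get pods` table layout, where the third column is STATUS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cast AI tooling): reads the default
# `kubectl get pods` table on stdin and prints the name of every pod
# whose STATUS column (third field) is not "Running".
not_running() {
  # NR > 1 skips the header row.
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example usage against a live cluster:
#   kubectl get pods -n castai-agent | not_running
```

An empty result means every listed pod reports `Running`.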

Phase 2 autoscaling components

When a connected cluster is promoted to Phase 2 by enabling automation, Cast AI installs additional components to support automated cluster management and feature delivery:

❯ kubectl get pods -n castai-agent
NAME                                             READY   STATUS    RESTARTS   AGE
castai-agent-7f9d7ff65b-8qm7p                    1/1     Running   0          80m
castai-agent-7f9d7ff65b-kf2zp                    1/1     Running   0          5h7m
castai-agent-cpvpa-56f749fb-n2wzp                1/1     Running   0          22d
castai-cluster-controller-757997ff6c-r6x25       1/1     Running   0          27d
castai-cluster-controller-757997ff6c-xw54g       1/1     Running   0          27d
castai-evictor-5684748495-kl2q4                  1/1     Running   0          22d
castai-kvisor-787c5dd946-gmzs5                   1/1     Running   0          6d18h
castai-spot-handler-44shj                        1/1     Running   0          43m
castai-live-controller-6c89d5f7d9-xyz12          1/1     Running   0          2h15m
castai-pod-mutator-7b4f6d9c5a-abc23              1/1     Running   0          4h20m
castai-pod-pinner-56d9f8c7b2-def45               1/1     Running   0          3h15m

  • The Cluster Controller executes actions received from the central platform, such as accepting newly created nodes into the cluster and managing Container Live Migration operations.
  • The Evictor removes pods from underutilized nodes to reduce the overall number of cluster nodes. When Container Live Migration is enabled, Evictor automatically attempts to live-migrate eligible workloads before falling back to traditional eviction.
  • The Live Controller (AWS EKS only) manages Container Live Migration operations, including workload eligibility assessment, migration orchestration, and specialized VPC CNI management. This component is installed automatically during Phase 2 onboarding.
  • The Pod Mutator modifies pod specifications for improved efficiency, implementing optimizations like GPU driver injection and resource adjustments.
  • The Pod Pinner controls pod placement for optimal resource usage, ensuring workloads are placed on appropriate nodes.
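Because the Cluster Controller is installed only in Phase 2, its presence in the pod listing gives a rough indication of which phase a cluster is in. The sketch below relies on that heuristic, which is an assumption rather than an official check. `detect_phase` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic (an assumption, not an official Cast AI check):
# reads the default `kubectl get pods` table on stdin and prints "phase2"
# if any castai-cluster-controller pod is listed, otherwise "phase1".
detect_phase() {
  if grep -q '^castai-cluster-controller-'; then
    echo "phase2"
  else
    echo "phase1"
  fi
}

# Example usage against a live cluster:
#   kubectl get pods -n castai-agent | detect_phase
```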

Phase 2 workload autoscaling components

  • The Workload Autoscaler dynamically adjusts workload resource requests based on actual usage patterns.
  • The Workload Autoscaler Exporter collects workload metrics from your cluster to support recommendation generation. It is installed automatically alongside the Workload Autoscaler.

Phase 2 security components

  • Kvisor enables image vulnerability scanning, Kubernetes YAML manifest linting, and other security and networking features offered by Cast AI. See the Kvisor documentation for more information.
  • The Audit Logs Receiver captures cluster events for analysis and compliance reporting.

Additional components

AI Enabler

  • The AI Enabler Proxy routes LLM requests to the most appropriate provider based on cost and performance. See AI Enabler.

Database optimization

  • The DB Optimizer monitors database performance and provides cost optimization recommendations. See Database Optimizer.

Reporting

  • The GPU Metrics Exporter captures GPU usage metrics for specialized compute workloads.
  • The Egressd Exporter (deprecated) collects network traffic information for visibility and optimization. It has been replaced by Kvisor, which covers all of Egressd's capabilities and adds more.

OMNI

When OMNI is enabled for cluster extension to other regions and cloud providers, additional components are deployed in the castai-omni namespace:

  • OMNI Agent - Manages edge location connections and node provisioning
  • Liqo components - Enable multi-cluster topology and virtual node functionality
    • liqo-controller-manager
    • liqo-crd-replicator
    • liqo-fabric
    • liqo-ipam
    • liqo-metric-agent
    • liqo-proxy
    • liqo-webhook

See OMNI Overview for more details about extending your cluster to other regions and cloud providers.
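To confirm that the Liqo components listed above are present, a listing of the castai-omni namespace can be compared against the expected names. The sketch below uses a hypothetical helper, `missing_liqo_components`, and assumes that each pod name starts with its component name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a resource listing on stdin (for example the
# default `kubectl get pods -n castai-omni` table) and prints each Liqo
# component from the list above that has no matching line.
missing_liqo_components() {
  local listing expected name
  listing="$(cat)"
  expected="liqo-controller-manager liqo-crd-replicator liqo-fabric liqo-ipam liqo-metric-agent liqo-proxy liqo-webhook"
  for name in $expected; do
    # Assumption: pod names start with the component name.
    if ! printf '%s\n' "$listing" | grep -q "^${name}"; then
      echo "$name"
    fi
  done
}

# Example usage against a live cluster:
#   kubectl get pods -n castai-omni | missing_liqo_components
```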

Component upgrade methods

Cast AI components installed in your cluster are upgraded using different methods. Understanding which components upgrade automatically versus those requiring manual intervention helps maintain optimal cluster operation.

The table below outlines the upgrade method for each Cast AI component:

| Product | Component | Upgrade Method | Frequency | Description |
|---|---|---|---|---|
| Cluster Autoscaling | Agent | Manual* | N/A | Must be manually upgraded by running the upgrade script or the Helm command |
| | Evictor | Auto* | Upon new release | Automatically upgraded by Cast AI as soon as new versions are available |
| | Spot-handler | Manual* | N/A | Must be manually upgraded using the Helm command |
| | Cluster Controller | Manual* | Manual process | Cluster Controller updates are handled through a manual process by Cast AI |
| | Pod Pinner | Auto | Upon new release | Automatically upgraded by Cast AI as soon as new versions are available |
| | Pod Mutator | Manual | N/A | Must be manually upgraded using the Helm command |
| | Live Controller | Manual | N/A | Must be manually upgraded using the Helm command |
| Workload Autoscaling | Workload Autoscaler | Manual | N/A | Must be manually upgraded using the Helm command |
| | Workload Autoscaler Exporter | Manual | N/A | Upgraded together with the Workload Autoscaler using the Helm command |
| Security | kvisor | Manual | N/A | Must be manually upgraded using the Helm command |
| | audit-logs-receiver | Manual | N/A | Must be manually upgraded using the Helm command |
| Reporting | gpu-metrics-exporter | Manual | N/A | Must be manually upgraded using the Helm command |
| | Egressd exporter | Manual | N/A | Must be manually upgraded using the Helm command |
| AI Enabler | ai-optimizer-proxy | Manual | N/A | Must be manually upgraded using the Helm command |
| Database Optimization | db-optimizer | Manual | N/A | Must be manually upgraded using the Helm command |
| OMNI | omni-agent | Manual | N/A | Must be manually upgraded using the Helm command |
| | Liqo components | Auto | N/A | Liqo components, being OMNI dependencies, are updated automatically when the omni-agent is updated |

* See the "Automatic upgrades" section below.

Automatic upgrades

Components marked as "Auto" are automatically upgraded by Cast AI to ensure you always have the latest features and security updates. These upgrades typically occur shortly after a new version is released. Cluster administrators do not need to take any action for these components.

Although the cluster-controller can, in principle, update itself by receiving an update action from Cast AI, these updates are managed through a manual internal process. By default, it cannot update other components, such as castai-evictor, castai-spot-handler, or castai-agent. To allow this, you can explicitly bind a role, such as cluster-admin, to the castai-cluster-controller service account. The cluster-controller can then manage all other Cast AI components automatically. For more details, visit the Cluster controller auto-update documentation.
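The role binding described above could be expressed as a ClusterRoleBinding manifest. The sketch below prints one so it can be reviewed before being applied. It assumes the service account lives in the castai-agent namespace (where the pods in this document run). The binding name `castai-cluster-controller-admin` and the helper `render_cluster_admin_binding` are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Prints a ClusterRoleBinding manifest that grants cluster-admin to the
# castai-cluster-controller service account.
# Assumptions: the service account is in the castai-agent namespace unless
# another namespace is passed as $1; the binding name is a placeholder.
render_cluster_admin_binding() {
  local namespace="${1:-castai-agent}"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: castai-cluster-controller-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: castai-cluster-controller
    namespace: ${namespace}
EOF
}

# Review the output, then apply it:
#   render_cluster_admin_binding | kubectl apply -f -
```

Note that cluster-admin grants unrestricted access to the cluster, so check the binding against your security policy before applying it.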

Self-managed component options

For customers who prefer to manage their own update schedules, self-managed installation options are available for several components. Self-managed components can be updated using tools like Argo CD or Helm on your preferred schedule, giving you greater control over your infrastructure.

Manual upgrades

Components marked as "Manual" require cluster administrators to perform upgrades when new versions are released. These upgrades can typically be performed using Helm commands or upgrade scripts provided in the component documentation.

For detailed manual upgrade instructions, refer to each component's dedicated documentation section.
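As a rough illustration of the Helm-based path, the sketch below prints the commands for upgrading a single component rather than running them. It rests on several assumptions that should be checked against the component documentation. It assumes the chart repository is registered locally under the alias `castai-helm`, that the release and chart share the component's name, and that the release is installed in the castai-agent namespace. `print_upgrade_commands` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Prints (does not run) the Helm commands for upgrading one component.
# Assumptions: the chart repository alias is "castai-helm", and the release
# and chart are both named after the component. Confirm the actual release
# names with `helm list -n <namespace>` before running anything.
print_upgrade_commands() {
  local component="$1"
  local namespace="${2:-castai-agent}"
  echo "helm repo update"
  # --reuse-values keeps the values from the currently installed release.
  echo "helm upgrade ${component} castai-helm/${component} --namespace ${namespace} --reuse-values"
}

# Example usage:
#   print_upgrade_commands castai-spot-handler
```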

Note: Always check the release notes before manually upgrading components to understand potential impacts and required actions.