AKS via GitOps

Onboard an AKS cluster to Cast AI using the umbrella Helm chart and Terraform. Choose the operating mode that fits your needs and install directly into it.

This guide covers onboarding an AKS cluster to Cast AI using a GitOps approach with the umbrella Helm chart (castai-helm/castai).

The umbrella chart uses Helm tags to control which Cast AI components are installed. Each tag activates a different operating mode. For example, --set tags.readonly=true installs only observability components, while --set tags.full=true installs the complete suite for AKS. You switch between modes by flipping tags in a single helm upgrade command. For more background on the umbrella chart, see Terraform provider.

Choose the mode that fits your needs. If you start with a lighter mode, you can upgrade later without reinstalling (see Upgrading between modes).

Umbrella chart modes on AKS

The table below shows which components each mode installs on AKS.

Component                               readonly  workload-autoscaler  node-autoscaler  full
castai-agent                            Yes       Yes                  Yes              Yes
castai-spot-handler                     Yes       Yes                  Yes              Yes
castai-kvisor                           Yes       Yes                  Yes              Yes
castai-cluster-controller               -         Yes                  Yes              Yes
castai-evictor                          -         Yes                  Yes              Yes
castai-pod-mutator                      -         Yes                  Yes              Yes
castai-workload-autoscaler              -         Yes                  -                Yes
castai-workload-autoscaler-exporter     -         Yes                  -                Yes
castai-pod-pinner                       -         -                    Yes              Yes
castai-live (Container Live Migration)  -         -                    Yes              Yes

📘

Note

Only one mode tag should be true at a time. The full mode combines node autoscaling with Workload Autoscaler.
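
Because only one mode tag should be true at a time, it can help to generate the tag flags from a single mode name. The following is a minimal sketch, not part of the Cast AI tooling: castai_mode_flags is a hypothetical helper name, and it assumes the four tags listed in the table above are the only mode tags.

```shell
# Hypothetical helper (not shipped with the chart): prints --set flags that
# enable exactly one umbrella chart mode tag and explicitly disable the rest.
castai_mode_flags() {
  local target="$1" mode flags="" found=0
  # Tag names are taken from the mode table in this guide.
  for mode in readonly workload-autoscaler node-autoscaler full; do
    if [ "$mode" = "$target" ]; then
      flags="$flags --set tags.$mode=true"
      found=1
    else
      flags="$flags --set tags.$mode=false"
    fi
  done
  if [ "$found" -ne 1 ]; then
    echo "unknown mode: $target" >&2
    return 1
  fi
  # Strip the leading space before printing.
  echo "${flags# }"
}
```

The output can be appended to a helm command as $(castai_mode_flags full). Leave the substitution unquoted so the shell splits it into separate arguments.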

Prerequisites

All modes require:

  • A Cast AI account and an organization-level API key from console.cast.ai → Service Accounts
  • helm v3.14.0 or higher (required for the --reset-then-reuse-values flag)
  • The castai-helm Helm repository added:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update

Node autoscaler and full modes additionally require:

  • terraform v1.3.2 or higher
  • Azure CLI configured with permissions to create role assignments in your Azure subscription
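
The minimum versions above can be checked from a script before installing. This is a rough sketch: version_ge is a hypothetical helper name, and it assumes a sort implementation that supports version ordering with -V (GNU coreutils and recent BSD sort do).

```shell
# version_ge HAVE WANT: succeeds when version HAVE is at least WANT.
# A leading "v" (as printed by helm) is ignored on both arguments.
version_ge() {
  local have="${1#v}" want="${2#v}"
  # With version sort, the smaller of the two versions comes first;
  # HAVE is new enough when that smaller version is WANT.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example use (requires helm on PATH, so it is left commented out here):
#   helm_ver="$(helm version --template '{{.Version}}')"
#   version_ge "$helm_ver" 3.14.0 || echo "helm $helm_ver is older than 3.14.0" >&2
```

The same function can compare the Terraform version against 1.3.2 once you have extracted it from the output of terraform version.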

Installing in read-only mode (Helm only)

Read-only mode installs observability components that let Cast AI monitor your cluster without making any changes to workloads or nodes. No Terraform is needed.

helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
  --set global.castai.apiKey="<your-castai-api-key>" \
  --set global.castai.provider="aks" \
  --set tags.readonly=true

After the pods become ready, your cluster appears as Read only in the Cast AI console.

Verify the installation

kubectl get pods -n castai-agent

You should see pods for castai-agent, castai-spot-handler, and castai-kvisor in a Running state.
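
To compare the pod list against the components expected for a mode, a small filter can report anything that is absent. This is a sketch only: missing_components is a hypothetical helper, and it assumes each pod name starts with the component name followed by a hyphen, which is a heuristic rather than something the chart guarantees. It checks presence, not pod status.

```shell
# missing_components EXPECTED...: reads pod names on stdin (one per line) and
# prints every expected component that has no pod named "<component>-...".
# Caveat: a prefix match cannot tell castai-workload-autoscaler apart from
# castai-workload-autoscaler-exporter.
missing_components() {
  local pods component
  pods="$(cat)"
  for component in "$@"; do
    if ! printf '%s\n' "$pods" | grep -q "^${component}-"; then
      echo "$component"
    fi
  done
}

# Example use for read-only mode (requires cluster access, so commented out):
#   kubectl get pods -n castai-agent -o name | sed 's|^pod/||' \
#     | missing_components castai-agent castai-spot-handler castai-kvisor
```

Empty output means every listed component has at least one pod. Pass the component names from the mode table for the mode you installed.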

Installing in Workload Autoscaler mode (Helm only)

Workload Autoscaler mode automatically right-sizes workload CPU and memory requests based on actual usage. No Terraform is needed.

helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
  --set global.castai.apiKey="<your-castai-api-key>" \
  --set global.castai.provider="aks" \
  --set tags.workload-autoscaler=true

Configure scaling policies and workload-level settings from the Cast AI console or via annotations.

Verify the installation

kubectl get pods -n castai-agent

You should see the shared base components (castai-agent, castai-spot-handler, castai-kvisor) plus castai-cluster-controller, castai-evictor, castai-pod-mutator, castai-workload-autoscaler, and castai-workload-autoscaler-exporter.

Installing in node autoscaler or full mode (Terraform + Helm)

Node autoscaler and full modes enable node provisioning, bin-packing, Spot Instance handling, and workload eviction. These modes require Azure role assignments that grant Cast AI permissions for node management. Terraform creates them.

  • tags.node-autoscaler=true enables node autoscaling only (includes Container Live Migration).
  • tags.full=true enables node autoscaling plus Workload Autoscaler (includes Container Live Migration).

Create the Terraform configuration

Create the following files in your Terraform project directory.

versions.tf

terraform {
  required_version = ">= 1.3.2"
  required_providers {
    castai = {
      source  = "castai/castai"
      version = ">= 3.11.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = ">= 2.0"
    }
  }
}

providers.tf

provider "castai" {
  api_url   = var.castai_api_url
  api_token = var.castai_api_token
}

provider "azurerm" {
  subscription_id = var.azure_subscription_id
  features {}
}

provider "azuread" {}

variables.tf

variable "azure_subscription_id" {
  type        = string
  description = "Azure subscription ID where the AKS cluster is located."
}

variable "aks_cluster_name" {
  type        = string
  description = "AKS cluster name."
}

variable "aks_cluster_region" {
  type        = string
  description = "Region of the AKS cluster to be connected to Cast AI."
}

variable "aks_resource_group" {
  type        = string
  description = "Resource group containing the AKS cluster."
}

variable "aks_node_resource_group" {
  type        = string
  description = "Resource group containing the AKS node pool resources."
}

variable "delete_nodes_on_disconnect" {
  type        = bool
  description = "Optionally delete Cast AI created nodes when the cluster is destroyed."
  default     = false
}

variable "castai_api_token" {
  type        = string
  description = "Cast AI API token created in console.cast.ai API Access keys section."
}

variable "castai_api_url" {
  type        = string
  description = "Cast AI API URL."
  default     = "https://api.cast.ai"
}

castai.tf

resource "castai_aks_cluster" "this" {
  name                       = var.aks_cluster_name
  region                     = var.aks_cluster_region
  subscription_id            = var.azure_subscription_id
  tenant_id                  = data.azurerm_subscription.current.tenant_id
  client_id                  = azuread_application.castai.client_id
  client_secret              = azuread_application_password.castai.value
  node_resource_group        = var.aks_node_resource_group
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect
}

data "azurerm_subscription" "current" {}

resource "azuread_application" "castai" {
  display_name = "castai-${var.aks_cluster_name}"
}

resource "azuread_application_password" "castai" {
  application_id = azuread_application.castai.id
}

resource "azuread_service_principal" "castai" {
  client_id = azuread_application.castai.client_id
}

resource "azurerm_role_assignment" "castai_contributor" {
  scope                = "/subscriptions/${var.azure_subscription_id}/resourceGroups/${var.aks_node_resource_group}"
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.castai.object_id
}

outputs.tf

output "cluster_id" {
  value       = castai_aks_cluster.this.id
  description = "Cast AI cluster ID."
}

output "cluster_token" {
  value       = castai_aks_cluster.this.cluster_token
  description = "Cast AI cluster token used by Castware to authenticate to SaaS."
  sensitive   = true
}

tf.vars.example

castai_api_token         = "PLACEHOLDER"
azure_subscription_id    = "PLACEHOLDER"   # e.g. "12345678-1234-1234-1234-123456789012"
aks_cluster_name         = "PLACEHOLDER"
aks_cluster_region       = "PLACEHOLDER"   # e.g. "eastus"
aks_resource_group       = "PLACEHOLDER"
aks_node_resource_group  = "PLACEHOLDER"   # e.g. "MC_mygroup_mycluster_eastus"

Run Terraform

Copy the example variables file, fill in your values, and apply:

cp tf.vars.example terraform.tfvars
# Edit terraform.tfvars with your Azure subscription details
terraform init
terraform apply
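
Before running terraform apply, a small guard can catch values that were left unedited. This is a sketch under the assumption that unedited entries still contain the literal string PLACEHOLDER, as in tf.vars.example. check_tfvars is a hypothetical helper name.

```shell
# check_tfvars [FILE]: fails if FILE is unreadable or still assigns
# "PLACEHOLDER" to any variable. Defaults to terraform.tfvars.
check_tfvars() {
  local file="${1:-terraform.tfvars}"
  if [ ! -r "$file" ]; then
    echo "error: cannot read $file" >&2
    return 1
  fi
  # Print offending lines with their line numbers to stderr.
  if grep -n '= *"PLACEHOLDER"' "$file" >&2; then
    echo "error: replace the PLACEHOLDER values in $file" >&2
    return 1
  fi
}

# Example use (commented out because it needs a Terraform project):
#   check_tfvars terraform.tfvars && terraform apply
```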

Terraform registers the cluster with Cast AI and creates the Azure AD application and role assignment. Capture the outputs. You'll need them for the Helm release:

terraform output cluster_id
terraform output -raw cluster_token
⚠️

Warning

The cluster_token expires after a few hours if no Cast AI component connects. Run the Helm install promptly after this step.

Install the Helm release

For a fresh install, use helm upgrade -i with the mode tag that matches your needs:

helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
  --set global.castai.apiKey="<your-castai-api-key>" \
  --set global.castai.provider="aks" \
  --set tags.full=true

Replace tags.full=true with tags.node-autoscaler=true if you do not need Workload Autoscaler.

Verify the installation

kubectl get pods -n castai-agent

In full mode, you should see all components from the mode table running, including castai-pod-pinner, castai-live, and both Workload Autoscaler components in addition to the shared base.

For node-autoscaler mode, you should see the shared base components plus castai-cluster-controller, castai-evictor, castai-pod-mutator, castai-pod-pinner, and castai-live.

Upgrading between modes

If you installed a lighter mode and later want to move to a more capable one, flip the relevant tags and pass --reset-then-reuse-values. Any component overrides you previously set carry forward automatically.

📘

Note

Moving to node-autoscaler or full from a Helm-only mode requires completing the Terraform configuration first, since those modes need Azure IAM resources.

For example, upgrading from read-only to full:

helm upgrade castai castai-helm/castai -n castai-agent \
  --reset-then-reuse-values \
  --set tags.readonly=false \
  --set tags.full=true

The shared components that were already running from the previous mode are patched in place with no pod restarts. Only the new components required by the target mode are created.

Next steps