GKE via GitOps
Onboard a GKE cluster to Cast AI using the umbrella Helm chart and Terraform. Start in read-only mode with Helm alone and upgrade to node autoscaling when ready.
This guide covers onboarding a GKE cluster to Cast AI using a GitOps approach with the umbrella Helm chart (castai-helm/castai).
The umbrella chart uses Helm tags to control which Cast AI components are installed. Each tag activates a different operating mode. For example, `--set tags.readonly=true` installs only observability components, while `--set tags.full=true` installs the complete suite for GKE. You switch between modes by flipping tags in a single `helm upgrade` command. For more background on the umbrella chart, see Terraform provider.
Choose the mode that fits your needs. If you start with a lighter mode, you can upgrade later without reinstalling (see Upgrading between modes).
Umbrella chart modes on GKE
The table below shows which components each mode installs on GKE.
| Component | readonly | workload-autoscaler | node-autoscaler | full |
|---|---|---|---|---|
| castai-agent | ✅ | ✅ | ✅ | ✅ |
| castai-spot-handler | ✅ | ✅ | ✅ | ✅ |
| castai-kvisor | ✅ | ✅ | ✅ | ✅ |
| gpu-metrics-exporter | ✅ | ✅ | ✅ | ✅ |
| castai-cluster-controller | — | ✅ | ✅ | ✅ |
| castai-evictor | — | ✅ | ✅ | ✅ |
| castai-pod-mutator | — | ✅ | ✅ | ✅ |
| castai-workload-autoscaler | — | ✅ | — | ✅ |
| castai-workload-autoscaler-exporter | — | ✅ | — | ✅ |
| castai-pod-pinner | — | — | ✅ | ✅ |
Note: Only one mode tag should be `true` at a time. The `full` mode combines node autoscaling with Workload Autoscaler.
Prerequisites
All modes require:
- A Cast AI account and an organization-level API key from console.cast.ai → Service Accounts
- `helm` v3.14.0 or higher (required for the `--reset-then-reuse-values` flag)
- `kubectl` configured for your target GKE cluster
- The `castai-helm` Helm repository added:

```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
```

Node autoscaler and full modes additionally require:

- `terraform` v1.3.2 or higher
- `gcloud` configured with permissions to create IAM resources in your GCP project
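The version floors above can be checked up front. A minimal pre-flight sketch (the `version_ge` helper is hypothetical, not part of any Cast AI tooling):

```shell
#!/bin/sh
# version_ge succeeds when $1 >= $2, compared as semantic versions.
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Check helm against the v3.14.0 floor required for --reset-then-reuse-values.
if command -v helm >/dev/null 2>&1; then
  hv="$(helm version --template '{{.Version}}' | tr -d 'v')"
  version_ge "$hv" 3.14.0 \
    && echo "helm $hv OK" \
    || echo "helm $hv is too old (need >= 3.14.0)"
fi
```

The same helper works for the `terraform` v1.3.2 floor via `terraform version -json` if you want a fuller check.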
Installing in read-only mode (Helm only)
Read-only mode installs observability components that let you monitor your cluster on Cast AI without changing workloads or nodes. No Terraform is needed.
```shell
helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
  --set global.castai.apiKey="<your-castai-api-key>" \
  --set global.castai.provider="gke" \
  --set tags.readonly=true
```

After the pods become ready, your cluster appears as Read only in the Cast AI console.
Verify the installation
```shell
kubectl get pods -n castai-agent
```

You should see pods for `castai-agent`, `castai-spot-handler`, `castai-kvisor`, and `gpu-metrics-exporter` in a Running state.
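If you want to script the check, you can match the pod listing against each expected component. The `check_components` helper below is a sketch (not part of Cast AI tooling); it reads a captured pod listing and reports the first missing name:

```shell
#!/bin/sh
# check_components <pod-listing> <name>... : verifies each component
# name appears somewhere in the captured `kubectl get pods` output.
check_components() {
  pods="$1"; shift
  for c in "$@"; do
    case "$pods" in
      *"$c"*) ;;
      *) echo "missing: $c"; return 1 ;;
    esac
  done
  echo "all components present"
}

# Against a live cluster:
# check_components "$(kubectl get pods -n castai-agent)" \
#   castai-agent castai-spot-handler castai-kvisor gpu-metrics-exporter
```

The same helper can verify the larger component sets of the other modes by passing more names.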
Installing in Workload Autoscaler mode (Helm only)
Workload Autoscaler mode automatically right-sizes workload CPU and memory requests based on actual usage. No Terraform is needed.
```shell
helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
  --set global.castai.apiKey="<your-castai-api-key>" \
  --set global.castai.provider="gke" \
  --set tags.workload-autoscaler=true
```

Configure scaling policies and workload-level settings from the Cast AI console or via annotations.
Verify the installation
```shell
kubectl get pods -n castai-agent
```

You should see the shared base components (`castai-agent`, `castai-spot-handler`, `castai-kvisor`, `gpu-metrics-exporter`) plus `castai-cluster-controller`, `castai-evictor`, `castai-pod-mutator`, `castai-workload-autoscaler`, and `castai-workload-autoscaler-exporter`.
Installing in node autoscaler or full mode (Terraform + Helm)
Node autoscaler and full modes enable node provisioning, bin-packing, Spot Instance handling, and workload eviction. These modes require a GCP service account with IAM permissions for node management. Terraform creates it.
- `tags.node-autoscaler=true` for node autoscaling only.
- `tags.full=true` for node autoscaling plus Workload Autoscaler.
Create the Terraform configuration
Create the following files in your Terraform project directory.
versions.tf

```hcl
terraform {
  required_version = ">= 1.3.2"

  required_providers {
    castai = {
      source  = "castai/castai"
      version = ">= 3.11.0"
    }
  }
}
```

providers.tf
```hcl
provider "castai" {
  api_url   = var.castai_api_url
  api_token = var.castai_api_token
}
```

variables.tf
```hcl
variable "project_id" {
  type        = string
  description = "GCP project ID in which GKE cluster is located."
}

variable "cluster_name" {
  type        = string
  description = "GKE cluster name in GCP project."
}

variable "cluster_region" {
  type        = string
  description = "Region of the cluster to be connected to Cast AI."
}

variable "subnets" {
  type        = list(string)
  description = "Subnet IDs used by Cast AI to provision nodes."
  default     = []
}

variable "delete_nodes_on_disconnect" {
  type        = bool
  description = "Optionally delete Cast AI created nodes when the cluster is destroyed."
  default     = false
}

variable "castai_api_token" {
  type        = string
  description = "Cast AI API token created in console.cast.ai API Access keys section."
}

variable "castai_api_url" {
  type        = string
  description = "Cast AI API URL."
  default     = "https://api.cast.ai"
}
```

castai.tf
```hcl
resource "castai_gke_cluster" "this" {
  project_id                 = var.project_id
  location                   = var.cluster_region
  name                       = var.cluster_name
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect
  credentials_json           = module.castai-gke-iam.private_key
}

module "castai-gke-iam" {
  source  = "castai/gke-iam/castai"
  version = "~> 0.5"

  project_id       = var.project_id
  gke_cluster_name = var.cluster_name
}
```

outputs.tf
```hcl
output "cluster_id" {
  value       = castai_gke_cluster.this.id
  description = "Cast AI cluster ID."
}

output "cluster_token" {
  value       = castai_gke_cluster.this.cluster_token
  description = "Cast AI cluster token used by Castware to authenticate to SaaS."
  sensitive   = true
}
```

tf.vars.example
```hcl
castai_api_token = "PLACEHOLDER"
project_id       = "PLACEHOLDER"
cluster_region   = "PLACEHOLDER" # e.g. "us-central1" or "us-central1-a"
cluster_name     = "PLACEHOLDER"
subnets          = ["PLACEHOLDER"] # e.g. ["default"] (optional)
```

Run Terraform
Copy the example variables file, fill in your values, and apply:
```shell
cp tf.vars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project details

terraform init
terraform apply
```

Terraform registers the cluster with Cast AI and creates the GCP service account. Capture the outputs; you'll need them to configure the Helm release:

```shell
terraform output cluster_id
terraform output -raw cluster_token
```
Warning: The `cluster_token` expires after a few hours if no Cast AI component connects. Run the Helm install promptly after this step.
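The `subnets` variable declared earlier is not consumed by `castai_gke_cluster` itself; it is typically wired into a node configuration so Cast AI provisions nodes into specific subnets. A sketch only — verify the `castai_node_configuration` resource schema in the Cast AI Terraform provider documentation before using:

```hcl
resource "castai_node_configuration" "default" {
  cluster_id = castai_gke_cluster.this.id
  subnets    = var.subnets
}

resource "castai_node_configuration_default" "this" {
  cluster_id       = castai_gke_cluster.this.id
  configuration_id = castai_node_configuration.default.id
}
```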
Install the Helm release
For a fresh install, use helm upgrade -i with the mode tag that matches your needs:
```shell
helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
  --set global.castai.apiKey="<your-castai-api-key>" \
  --set global.castai.provider="gke" \
  --set tags.full=true
```

Replace `tags.full=true` with `tags.node-autoscaler=true` if you do not need Workload Autoscaler.
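If you prefer not to paste values by hand, the Terraform outputs captured earlier can feed the same command. The `build_helm_args` helper below is hypothetical, and it assumes the `cluster_token` output is accepted as the chart's API key; verify this against your chart values before relying on it:

```shell
#!/bin/sh
# build_helm_args prints the documented flags for a given API key and
# mode tag, so the full command can be reviewed before running helm.
build_helm_args() {
  api_key="$1"; mode="$2"
  printf '%s' "upgrade -i castai castai-helm/castai -n castai-agent --create-namespace "
  printf '%s' "--set global.castai.apiKey=${api_key} "
  printf '%s' "--set global.castai.provider=gke "
  printf '%s\n' "--set tags.${mode}=true"
}

# From the Terraform project directory:
# helm $(build_helm_args "$(terraform output -raw cluster_token)" full)
```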
Verify the installation
```shell
kubectl get pods -n castai-agent
```

In full mode, you should see all components from the mode table running, including `castai-pod-pinner` and both Workload Autoscaler components in addition to the shared base.
For `node-autoscaler` mode, you should see the shared base components plus `castai-cluster-controller`, `castai-evictor`, `castai-pod-mutator`, and `castai-pod-pinner`.
Upgrading between modes
If you installed a lighter mode and later want to move to a more capable one, flip the relevant tags and pass `--reset-then-reuse-values`. Any component overrides you previously set carry forward automatically.
Note: Moving to `node-autoscaler` or `full` from a Helm-only mode requires completing the Terraform configuration first, since those modes need GCP IAM resources.
For example, upgrading from read-only to full:
```shell
helm upgrade castai castai-helm/castai -n castai-agent \
  --reset-then-reuse-values \
  --set tags.readonly=false \
  --set tags.full=true
```

The shared components (`castai-agent`, `castai-kvisor`, `castai-spot-handler`, `gpu-metrics-exporter`) that were already running from the previous mode are patched in place with no pod restarts. Only the new components required by the target mode are created.
Next steps
- Configure scaling policies for Workload Autoscaler.
- Review node templates and node configuration for node autoscaling.
- Explore pod mutations for automated workload configuration.