Watchdog
Early Access Feature: This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
What is Watchdog?
Watchdog is a fallback mechanism built for Cast AI-managed Kubernetes clusters to ensure critical workload availability during Cast AI outages. Watchdog mitigates the risk of Cast AI outages by automatically provisioning emergency compute capacity during such events.
Why use Watchdog?
In rare cases when Cast AI is temporarily unavailable due to network issues, platform downtime, or misconfigurations, workloads might remain pending. Watchdog ensures your applications remain available by:
- Detecting Cast AI outages
- Watching pending pods requiring scheduling
- Provisioning fallback nodes through cloud provider APIs
- Enabling workload recovery independently of the Cast AI SaaS control plane
Watchdog is specifically designed to mitigate application availability risks, not cost-optimization concerns.
How it works
When Cast AI cannot provision nodes (during an outage), Watchdog:
- Bypasses placement preferences like node selectors, affinity rules, and taints/tolerations
- Selects On-Demand instances only (no Spot or GPU)
- Uses fallback node pools created ahead of time by Cast AI
Pods scheduled this way are prioritized for availability, not cost.
Key behaviors
- Only affects new pods created during an outage – running pods are unaffected.
- Works only with x86 nodes – workloads must have x86-compatible images.
- No GPU or Spot nodes supported.
- Fallback node pools are automatically created by Cast AI after Watchdog is enabled.
Installation
Limited Availability Feature: This feature is currently available through feature flags. Contact us to enable access for your organization.
Prerequisites
- Workload Identity must be configured for the castai-watchdog Service Account in the castai-agent namespace.
- Required GCP IAM permissions:
  - compute.instanceGroupManagers.get
  - compute.machineTypes.get
  - container.clusters.get
  - container.clusters.update
  - container.operations.get
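The prerequisites above could be satisfied in several ways. The following is a minimal sketch of one common approach: a custom IAM role holding the listed permissions, bound to a Google service account that the castai-watchdog Kubernetes Service Account impersonates through Workload Identity. The project ID, Google service account name, and role ID are placeholders chosen for illustration and are not taken from the Cast AI documentation. The script only prints the commands so they can be reviewed before running.

```shell
#!/usr/bin/env bash
# Sketch only: PROJECT_ID, GSA_NAME, and ROLE_ID are hypothetical placeholders.
PROJECT_ID="${PROJECT_ID:-my-gcp-project}"
GSA_NAME="${GSA_NAME:-castai-watchdog}"
ROLE_ID="${ROLE_ID:-castaiWatchdog}"
GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# The five permissions from the prerequisites, comma-joined for gcloud.
PERMISSIONS="compute.instanceGroupManagers.get,compute.machineTypes.get,container.clusters.get,container.clusters.update,container.operations.get"

# Print the commands instead of executing them, so nothing changes until you run them yourself.
print_watchdog_iam_commands() {
  cat <<EOF
gcloud iam roles create ${ROLE_ID} --project=${PROJECT_ID} --title="Cast AI Watchdog" --permissions=${PERMISSIONS}
gcloud iam service-accounts create ${GSA_NAME} --project=${PROJECT_ID}
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:${GSA_EMAIL}" --role="projects/${PROJECT_ID}/roles/${ROLE_ID}"
gcloud iam service-accounts add-iam-policy-binding ${GSA_EMAIL} --role=roles/iam.workloadIdentityUser --member="serviceAccount:${PROJECT_ID}.svc.id.goog[castai-agent/castai-watchdog]"
kubectl annotate serviceaccount castai-watchdog -n castai-agent iam.gke.io/gcp-service-account=${GSA_EMAIL}
EOF
}

print_watchdog_iam_commands
```

The final kubectl annotate command assumes the castai-watchdog Service Account already exists in the cluster. If the Helm chart creates it, run that step after installation.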
Enable fallback node pools
Contact Cast AI support to enable Watchdog for your cluster or organization. This will automatically create fallback node pools in your GKE environment for Watchdog to use.
API Key
Watchdog requires a Cast AI API key. You can reuse the key from the castai-cluster-controller Secret, or use a Cast AI Service Account key with the Viewer role.
Note: If you see a 403 error in the Watchdog logs with your current API key, you may need to generate a new API key. The Watchdog API key requires a permission that was added in July 2025.
Install using Helm
helm repo add castai-helm https://castai.github.io/helm-charts
helm upgrade --install castai-watchdog castai-helm/castai-watchdog \
  -n castai-agent \
  --set castai.apiKeySecretRef=castai-cluster-controller \
  --set castai.clusterID=<cluster-id> \
  --set castai.organizationID=<organization-id> \
  --set gcp.project=<gcp-project-id> \
  --set gcp.clusterName=<gke-cluster-name> \
  --set gcp.location=<gke-cluster-location>
Simulating an outage
To verify that the Watchdog is working properly, you can simulate an outage on your side by scaling down the castai-agent Deployment to zero pods. After a few minutes, Cast AI will mark the cluster as unhealthy, Watchdog will see a failing health check, and start provisioning emergency compute capacity for pending pods.
- Scale down the castai-agent Deployment to zero pods:
kubectl scale -n castai-agent deployment castai-agent --replicas 0
- Create a pending pod with a node selector that no node satisfies. This verifies that Watchdog strips node constraints from the pod and provisions a node for it:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: watchdog-test
  name: watchdog-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: watchdog-test
  template:
    metadata:
      labels:
        app: watchdog-test
    spec:
      nodeSelector:
        watchdog: testing
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              cpu: 1
EOF
- Wait around 5–7 minutes until Watchdog detects the failing health check and starts working. You should see the following line in the Watchdog logs:
starting autoscaling
The pending pod should be recreated without the node selector, and a new node should be created for it.
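The wait and the cleanup afterwards can be scripted. The sketch below polls the Watchdog logs for the start line quoted above, then removes the test Deployment and scales castai-agent back up. The function names, the 600-second timeout, the 30-second polling interval, and the restored replica count of 1 are assumptions for illustration. Adjust them to match your cluster.

```shell
#!/usr/bin/env bash
# Succeeds if the log text on stdin contains the Watchdog start line.
has_autoscaling_started() {
  grep -q "starting autoscaling"
}

# Poll the Watchdog logs until the start line appears or the timeout expires.
# Defaults (600s timeout, 30s interval) are assumptions sized for the 5-7 minute wait.
wait_for_watchdog() {
  local timeout="${1:-600}" interval="${2:-30}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # --tail=-1 returns all log lines; with a label selector kubectl shows only 10 by default.
    if kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-watchdog --tail=-1 \
        | has_autoscaling_started; then
      echo "Watchdog started autoscaling after about ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Timed out after ${timeout}s" >&2
  return 1
}

# Undo the simulation: delete the test Deployment and bring castai-agent back.
# The default of 1 replica is an assumption; pass the count your cluster ran before.
cleanup_simulation() {
  kubectl delete deployment watchdog-test --ignore-not-found
  kubectl scale -n castai-agent deployment castai-agent --replicas "${1:-1}"
}
```

A typical run would be `wait_for_watchdog && cleanup_simulation`. Run the cleanup even if the wait times out, so the cluster does not stay disconnected from Cast AI.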
Webhook Notifications
Watchdog supports sending HTTP Webhook Notifications when it starts and stops working. You can provide a Go template for the payload to support different receivers (Slack, Discord, etc.).
Add the following values to the Watchdog Helm chart to set up a webhook notification:
webhooks:
  - url: https://hooks.slack.com/services/...
    method: POST
    headers:
      x-header-name: header-value
    payloadTemplate: |
      {{- $e := .Event -}}
      {
        "attachments": [
          {
            {{- if eq $e.Type "StartAutoscaling" }}
            "color": "#a63636",
            "title": "Watchdog started autoscaling",
            "text": "Watchdog detected Cast AI is not available and started working.",
            {{- else if eq $e.Type "StopAutoscaling" }}
            "color": "#36a64f",
            "title": "Watchdog stopped autoscaling",
            "text": "Watchdog detected that Cast AI is working again.",
            {{- else }}
            "color": "#cccccc",
            "title": "{{ $e.Type }}",
            {{- end }}
            "fields": [
              {
                "title": "Time",
                "value": "{{ $e.Timestamp.Format "2006-01-02 15:04:05 MST" }}",
                "short": true
              },
              {
                "title": "Reason",
                "value": "{{ $e.Reason }}",
                "short": false
              },
              {
                "title": "Cluster",
                "value": "<your-cluster>",
                "short": false
              }
            ]
          }
        ]
      }
Limitations
- Only supports GKE clusters (with fallback node pool creation).
- No GPU or Spot node provisioning.
- Only provisions On-Demand compute.
- Ignores all workload placement preferences (labels, affinities, taints).
- Supports x86 nodes only (workloads and container images must be compatible).
Troubleshooting
Verify Watchdog deployment
kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-watchdog
Check logs
kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-watchdog
Common issues
- No fallback node pool present – check if Cast AI support has enabled Watchdog.
- Permissions error – verify GCP IAM roles and Workload Identity binding are set up properly.
- Pending pods not handled – ensure Watchdog is running, check the logs.
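When working through the issues above, it can help to collect the relevant facts in one pass. The following sketch reuses the two kubectl commands from this section and adds a count of Pending pods. The function names and the 50-line log tail are illustrative choices, not part of Watchdog.

```shell
#!/usr/bin/env bash
# Count Pending pods in `kubectl get pods -A --no-headers` output read from stdin.
# With -A the columns are NAMESPACE NAME READY STATUS RESTARTS AGE, so STATUS is field 4.
count_pending() {
  awk '$4 == "Pending" { n++ } END { print n + 0 }'
}

# Print the Watchdog pod status, the number of Pending pods, and recent Watchdog logs.
watchdog_triage() {
  echo "Watchdog pods:"
  kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-watchdog
  echo "Pending pods: $(kubectl get pods -A --no-headers | count_pending)"
  echo "Recent Watchdog log lines:"
  kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-watchdog --tail=50
}
```

Calling `watchdog_triage` against the affected cluster produces output that can be attached to a support request.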
For additional help, contact Cast AI support or visit our community Slack channel.
