📣
Early Access Feature
This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.

What is Watchdog?

Watchdog is a fallback mechanism for Cast AI-managed Kubernetes clusters that keeps critical workloads available during Cast AI outages. It does so by automatically provisioning emergency compute capacity while Cast AI is unavailable.

Why use Watchdog?

In rare cases when Cast AI is temporarily unavailable due to network issues, platform downtime, or misconfigurations, workloads might remain pending. Watchdog ensures your applications remain available by:

- Detecting Cast AI outages
- Watching for pending pods that need scheduling
- Provisioning fallback nodes through cloud provider APIs
- Enabling workload recovery independently of the Cast AI SaaS control plane

Watchdog is specifically designed to mitigate application availability risks, not cost-optimization concerns.

How it works

When Cast AI cannot provision nodes (during an outage), Watchdog:

- Bypasses placement preferences like node selectors, affinity rules, and taints/tolerations
- Selects On-Demand instances only (no Spot or GPU)
- Uses fallback node pools created ahead of time by Cast AI

Pods scheduled this way are prioritized for availability, not cost.

Key behaviors

- Only affects new pods created during an outage – running pods are unaffected.
- Works only with x86 nodes – workloads must have x86-compatible images.
- No GPU or Spot nodes supported.
- Fallback node pools are automatically created by Cast AI after Watchdog is enabled.

Installation

🚧
Limited Availability Feature
This feature is currently available through feature flags. Contact us to enable access for your organization.

Prerequisites

Workload Identity must be configured for the castai-watchdog Service Account in the castai-agent namespace.
Required GCP IAM Permissions:
- compute.instanceGroupManagers.get
- compute.machineTypes.get
- container.clusters.get
- container.clusters.update
- container.operations.get

Enable fallback node pools

Contact Cast AI support to enable Watchdog for your cluster or organization. This will automatically create fallback node pools in your GKE environment for Watchdog to use.

API Key

Watchdog requires a Cast AI API key. You can reuse the key from the castai-cluster-controller Secret.

Alternatively, use a Cast AI Service Account key with the Viewer role as the Watchdog API key.

📘
Note
If you see a 403 error in the Watchdog logs with the current API key, you might need to generate a new API key. Watchdog requires a permission that was added in July 2025, so older keys may lack it.

Install using Helm

helm repo add castai-helm https://castai.github.io/helm-charts

helm upgrade --install castai-watchdog castai-helm/castai-watchdog \
  -n castai-agent \
  --set castai.apiKeySecretRef=castai-cluster-controller \
  --set castai.clusterID=<cluster-id> \
  --set castai.organizationID=<organization-id> \
  --set gcp.project=<gcp-project-id> \
  --set gcp.clusterName=<gke-cluster-name> \
  --set gcp.location=<gke-cluster-location>
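
After the install, you may want to confirm that the Watchdog pod becomes ready before relying on it. A minimal sketch, reusing the label selector from the Troubleshooting section; the function name and the 120-second timeout are arbitrary choices:

```shell
# Sketch: block until the Watchdog pod reports Ready after the Helm install.
# The timeout value is an arbitrary choice; adjust it to your environment.
wait_for_watchdog() {
  kubectl wait pod -n castai-agent \
    -l app.kubernetes.io/name=castai-watchdog \
    --for=condition=Ready --timeout=120s
}
```

Running `wait_for_watchdog` returns a non-zero exit code if the pod is not ready within the timeout.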

Simulating an outage

To verify that Watchdog is working properly, you can simulate an outage on your side by scaling down the castai-agent Deployment to zero pods. After a few minutes, Cast AI marks the cluster as unhealthy. Watchdog then sees a failing health check and starts provisioning emergency compute capacity for pending pods.

Scale down the castai-agent Deployment to zero pods:

kubectl scale -n castai-agent deployment castai-agent --replicas 0

Create a Deployment whose pod stays pending because no node matches its node selector. This verifies that Watchdog strips node constraints from the pod and provisions a node for it:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: watchdog-test
  name: watchdog-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: watchdog-test
  template:
    metadata:
      labels:
        app: watchdog-test
    spec:
      nodeSelector:
        watchdog: testing
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 1
EOF

Wait around 5–7 minutes until Watchdog detects the failing health check and starts working. You should see the following line in the Watchdog logs:
starting autoscaling
The pending pod should then be recreated without the node selector, and a new node should be created for it.

Webhook Notifications

Watchdog can send HTTP webhook notifications when it starts and stops autoscaling. You can provide a Go template for the payload to support different receivers (Slack, Discord, etc.).

Add the following values to the Watchdog Helm chart to set up a webhook notification:

webhooks:
  - url: https://hooks.slack.com/services/...
    method: POST
    headers:
      x-header-name: header-value
    payloadTemplate: |
      {{- $e := .Event -}}
      {
        "attachments": [
          {
          {{- if eq $e.Type "StartAutoscaling" }}
            "color": "#a63636",
            "title": "Watchdog started to autoscale",
            "text": "Watchdog detected Cast AI is not available and started working.",
          {{- else if eq $e.Type "StopAutoscaling" }}
            "color": "#36a64f",
            "title": "Watchdog stopped autoscaling",
            "text": "Watchdog detected that Cast AI is working again.",
          {{- else }}
            "color": "#cccccc",
            "title": "{{ $e.Type }}",
          {{- end }}
            "fields": [
              {
                "title": "Time",
                "value": "{{ $e.Timestamp.Format "2006-01-02 15:04:05 MST" }}",
                "short": true
              },
              {
                "title": "Reason",
                "value": "{{ $e.Reason }}",
                "short": false
              },
              {
                "title": "Cluster",
                "value": "<your-cluster>",
                "short": false
              }
            ]
          }
        ]
      }
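
If the webhook values are saved to a file, they could be applied to the existing release roughly as follows. This is a sketch: the function name and the file name watchdog-webhooks.yaml are placeholders, while the release, chart, and namespace match the install command above.

```shell
# Sketch: apply webhook settings from a values file to the existing release.
# --reuse-values keeps the settings from the original install and merges
# the file on top of them.
apply_watchdog_webhooks() {
  values_file="${1:-watchdog-webhooks.yaml}"   # placeholder file name
  helm upgrade castai-watchdog castai-helm/castai-watchdog \
    -n castai-agent \
    --reuse-values \
    -f "$values_file"
}
```

Call it as `apply_watchdog_webhooks` or `apply_watchdog_webhooks <path-to-values-file>`.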

Limitations

- Only supports GKE clusters (with fallback node pool creation).
- No GPU or Spot node provisioning.
- Only provisions On-Demand compute.
- Ignores all workload placement preferences (node selectors, affinity rules, taints/tolerations).
- Supports x86 nodes only (workloads and container images must be compatible).

Troubleshooting

Verify Watchdog deployment

kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-watchdog

Check logs

kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-watchdog

Common issues

- No fallback node pool present – check whether Cast AI support has enabled Watchdog.
- Permissions error – verify that the GCP IAM roles and the Workload Identity binding are set up properly.
- Pending pods not handled – ensure Watchdog is running and check the logs.
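
The checks above can be sketched as a few small helpers. The function names are illustrative, and the arguments are your own project, cluster, and location. The names of the fallback node pools are not documented here, so the first helper simply lists all node pools.

```shell
# Sketch: quick checks for the common issues listed above.

# Lists the cluster's node pools so you can confirm that fallback pools exist.
# Arguments: <gcp-project-id> <gke-cluster-name> <gke-cluster-location>
list_node_pools() {
  gcloud container node-pools list \
    --project "$1" --cluster "$2" --location "$3"
}

# Shows the Watchdog Service Account, including any Workload Identity annotation.
show_watchdog_service_account() {
  kubectl get serviceaccount castai-watchdog -n castai-agent -o yaml
}

# Lists pods that are still pending in any namespace.
list_pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}
```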

For additional help, contact Cast AI support or visit our community Slack channel.