Cluster controller

The cluster controller handles certain Kubernetes actions, such as draining and deleting nodes, adding labels, and approving certificate signing requests (CSRs). It is open source and can be found on GitHub.

Install cluster-controller

By default, the cluster controller is installed during cluster onboarding using the Helm chart at https://github.com/castai/helm-charts/tree/main/charts/castai-cluster-controller

If it has been uninstalled, you can install it manually.

Add the Cast AI Helm charts repository.

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update

You can list all available components and versions.

helm search repo castai-helm

Example output:

NAME                                    CHART VERSION   APP VERSION     DESCRIPTION
castai-helm/castai-agent                0.18.0          v0.23.0         CAST AI agent deployment chart.
castai-helm/castai-cluster-controller   0.17.0          v0.14.0         CAST AI cluster controller deployment chart.
castai-helm/castai-evictor              0.10.0          0.5.1           Cluster utilization defragmentation tool
castai-helm/castai-spot-handler         0.3.0           v0.3.0          CAST AI spot handler daemonset chart.

Install the cluster controller.

helm upgrade --install cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
  --set castai.apiKey=<your-api-token> \
  --set castai.clusterID=<your-cluster-id>

Upgrade cluster-controller

The cluster controller supports auto-update out of the box, and the feature is enabled by default. However, changes in RBAC sometimes prevent an automatic update, in which case a manual upgrade is required.

Upgrade to the latest version.

# Requires Helm 3.14.0 or later
helm repo update
helm upgrade cluster-controller castai-helm/castai-cluster-controller --reset-then-reuse-values -n castai-agent
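
Because the --reset-then-reuse-values flag depends on the Helm client version, a pre-flight check can avoid a failed upgrade. The following is a minimal sketch rather than an official procedure: the helper names version_ge and helm_supports_reset_then_reuse are hypothetical, and the comparison assumes a sort implementation that supports -V (version sort).

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: succeeds when the local Helm client is 3.14.0 or later.
helm_supports_reset_then_reuse() {
  # `helm version --template` prints a value such as "v3.14.0". Strip the
  # leading "v" and any pre-release or build suffix before comparing.
  current=$(helm version --template '{{.Version}}') || return 1
  current=${current#v}
  current=${current%%[-+]*}
  version_ge "$current" 3.14.0
}
```

Calling helm_supports_reset_then_reuse before the upgrade command lets a script stop early with a clear message when the client is too old.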

Troubleshooting

Check cluster-controller logs

kubectl logs -l app.kubernetes.io/name=castai-cluster-controller -n castai-agent
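
When looking for a specific symptom, such as throttling, it can help to count matching lines instead of reading the full log. The sketch below assumes the cluster controller surfaces the usual client-go wording ("client-side throttling"); that match string is an assumption, and count_throttled is a hypothetical helper name.

```shell
# Hypothetical helper: reads log lines on stdin and prints how many of them
# mention client-side throttling. The match string is client-go's usual
# wording and is an assumption about what the cluster controller logs.
count_throttled() {
  # grep -c prints 0 but exits non-zero when nothing matches; normalize that.
  grep -c 'client-side throttling' || true
}
```

To use it, pipe the log command above into count_throttled. Note that kubectl logs returns only the last 10 lines per pod when a label selector is used, so adding a flag such as --tail=1000 gives the count a larger sample.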

Throttling due to rate limiting

The cluster controller implements a client-side rate limiter to regulate requests to the Kubernetes API server, preventing excessive load on the control plane. This rate limiter uses the token bucket algorithm with default settings designed for typical cluster environments.

While the default configuration works well for most deployments, large or highly dynamic clusters may experience performance issues due to conservative rate limits. You can adjust the rate-limiting parameters if you observe throttling-related delays in a cluster with an appropriately scaled control plane.
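
To get a feel for how a token bucket turns a burst of requests into delay, the rough model below estimates how long the last request in a batch waits. It is a simplified sketch, not the controller's implementation: it assumes the bucket starts full, all requests arrive at once, and nothing else consumes tokens. The helper name throttle_delay_ms is hypothetical.

```shell
# Hypothetical helper: estimated wait, in milliseconds, for the last request
# when `requests` calls arrive at once and the bucket starts full.
# Usage: throttle_delay_ms <qps> <burst> <requests>
throttle_delay_ms() {
  qps=$1
  burst=$2
  requests=$3
  if [ "$requests" -le "$burst" ]; then
    # The whole batch fits in the bucket, so nothing waits.
    echo 0
  else
    # Requests beyond the burst wait for tokens refilled at `qps` per second.
    echo $(( (requests - burst) * 1000 / qps ))
  fi
}
```

For example, throttle_delay_ms 50 200 1000 prints 16000: with a refill rate of 50 per second and a bucket of 200, the last of 1000 simultaneous requests waits about 16 seconds under these assumptions.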

Adjusting rate limit settings

Modify the following environment variables in the cluster controller deployment:

# Requests-per-second rate limit (tokens replenished per second). Set this higher than the observed sustained load.
KUBECLIENT_QPS=<number>

# Maximum allowed burst of requests. Adjust this to cover extreme spikes or expected bursts of operations.
KUBECLIENT_BURST=<number>

When to adjust settings

Consider increasing these values when:

  • The cluster has a scaled-up control plane capable of handling higher throughput
  • Logs show frequent throttling messages
  • Operations that modify many resources (like large-scale rebalancing) are taking longer than expected

How to apply changes

When using Helm, you can set these values through the additionalEnv or envFrom parameters:

helm upgrade cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
  --reuse-values \
  --set additionalEnv.KUBECLIENT_QPS=50 \
  --set additionalEnv.KUBECLIENT_BURST=200
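
If you prefer to keep overrides in a file rather than on the command line, the same settings can be written to a Helm values file. This is a sketch under the assumption that the chart accepts additionalEnv as a map of variable names to values, which is what the --set syntax above implies. The helper name write_rate_limit_values and the file name are hypothetical.

```shell
# Hypothetical helper: writes a Helm values file that sets both rate-limit
# variables under additionalEnv. Values are quoted so they stay strings.
# Usage: write_rate_limit_values <qps> <burst> <output-file>
write_rate_limit_values() {
  cat > "$3" <<EOF
additionalEnv:
  KUBECLIENT_QPS: "$1"
  KUBECLIENT_BURST: "$2"
EOF
}
```

After running write_rate_limit_values 50 200 rate-limit-values.yaml, pass the file to the upgrade command with -f rate-limit-values.yaml in place of the two --set flags. Keeping the file in version control makes later changes easier to review.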

For the complete default configuration, refer to the cluster-controller repository.

Auto updates

By default, the cluster-controller component can update itself by receiving an update action (scheduled by Cast AI). It can also update other components, such as castai-evictor, castai-spot-handler, or castai-agent, with one caveat: the cluster-controller cannot change permissions for other components or for itself.

However, new features sometimes require permission changes. To make this possible, you can explicitly bind a role such as cluster-admin to the cluster-controller service account. This allows the cluster-controller to manage other Cast AI components automatically, including their permissions.

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: castai-cluster-controller-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: castai-cluster-controller
    namespace: castai-agent
EOF
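
After applying the binding, you may want to confirm that it took effect. One way is to ask the API server what the service account is allowed to do, as sketched below. The helper names sa_subject and check_controller_admin are hypothetical, and running the check assumes your own user is permitted to impersonate service accounts.

```shell
# Hypothetical helper: builds the user name Kubernetes assigns to a
# service account. Usage: sa_subject <namespace> <service-account>
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Hypothetical helper: asks the API server whether the cluster-controller
# service account may perform any verb on any resource. Prints "yes" or "no".
check_controller_admin() {
  kubectl auth can-i '*' '*' \
    --as="$(sa_subject castai-agent castai-cluster-controller)"
}
```

With the cluster-admin binding in place, check_controller_admin is expected to print yes.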