Cluster controller
The cluster controller handles certain Kubernetes actions, such as draining and deleting nodes, adding labels, and approving certificate signing requests (CSRs). It is open source and available on GitHub.
Install cluster-controller
By default, the cluster controller is installed during cluster onboarding using its Helm chart: https://github.com/castai/helm-charts/tree/main/charts/castai-cluster-controller
If it has been uninstalled, you can install it manually.
Add the Cast AI helm charts repository.
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
You can list all available components and versions.
helm search repo castai-helm
Example output:
NAME                                    CHART VERSION   APP VERSION   DESCRIPTION
castai-helm/castai-agent                0.18.0          v0.23.0       CAST AI agent deployment chart.
castai-helm/castai-cluster-controller   0.17.0          v0.14.0       CAST AI cluster controller deployment chart.
castai-helm/castai-evictor              0.10.0          0.5.1         Cluster utilization defragmentation tool
castai-helm/castai-spot-handler         0.3.0           v0.3.0        CAST AI spot handler daemonset chart.
Now let's install it.
helm upgrade --install cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
--set castai.apiKey=<your-api-token> \
--set castai.clusterID=<your-cluster-id>
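After the install command returns, it can be useful to confirm that the controller pod actually becomes ready. The following is a minimal sketch that reuses the namespace and pod label shown elsewhere on this page; the function name wait_for_controller and the default timeout of 120 seconds are illustrative assumptions, not part of the chart.

```shell
# Block until the cluster-controller pod reports Ready, or fail on timeout.
# wait_for_controller is a hypothetical helper name used for illustration.
wait_for_controller() {
  local timeout="${1:-120s}"   # assumed default, adjust as needed
  kubectl wait pod \
    --for=condition=Ready \
    -l app.kubernetes.io/name=castai-cluster-controller \
    -n castai-agent \
    --timeout="$timeout"
}

# Example:
#   wait_for_controller 300s
```

A non-zero exit status means no matching pod became ready within the timeout, in which case the logs are the next place to look.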
Upgrade cluster-controller
The cluster controller supports auto-update out of the box, and auto-update is enabled by default. However, changes in RBAC sometimes prevent an automatic update, in which case a manual upgrade is required.
Upgrade to the latest version.
# Requires Helm 3.14.0 or later (for --reset-then-reuse-values)
helm repo update
helm upgrade cluster-controller castai-helm/castai-cluster-controller --reset-then-reuse-values -n castai-agent
Troubleshooting
Check cluster-controller logs
kubectl logs -l app.kubernetes.io/name=castai-cluster-controller -n castai-agent
Throttling due to rate limiting
The cluster controller implements a client-side rate limiter to regulate requests to the Kubernetes API server, preventing excessive load on the control plane. This rate limiter uses the token bucket algorithm with default settings designed for typical cluster environments.
While the default configuration works well for most deployments, large or highly dynamic clusters may experience performance issues due to conservative rate limits. You can adjust the rate-limiting parameters if you observe throttling-related delays in a cluster with an appropriately scaled control plane.
Adjusting rate limit settings
Modify the following environment variables in the cluster controller deployment:
# Requests per second rate limit (tokens replenished per second). This should be higher than the observed continuous load
KUBECLIENT_QPS=<number>
# Maximum allowed burst of requests. This can be adjusted to match extreme spikes or expected bursts of operations
KUBECLIENT_BURST=<number>
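To see how the two settings interact, the sketch below simulates a token bucket in one-second steps. It is a simplified model for illustration only, not the cluster controller's actual limiter: the function name simulate_token_bucket and the per-second stepping are assumptions, and delayed requests simply wait for the next step.

```shell
# Simplified token bucket, stepped once per second.
# Usage: simulate_token_bucket QPS BURST NEW_REQUESTS_PER_SECOND...
simulate_token_bucket() {
  local qps=$1 burst=$2
  shift 2
  local tokens=$burst   # the bucket starts full
  local waiting=0       # requests currently delayed by the limiter
  local second=0 requested sent
  for requested in "$@"; do
    second=$((second + 1))
    waiting=$((waiting + requested))
    # Send as many waiting requests as there are tokens available.
    if [ "$waiting" -le "$tokens" ]; then sent=$waiting; else sent=$tokens; fi
    tokens=$((tokens - sent))
    waiting=$((waiting - sent))
    echo "second=$second sent=$sent waiting=$waiting"
    # Refill QPS tokens per second, capped at BURST.
    tokens=$((tokens + qps))
    if [ "$tokens" -gt "$burst" ]; then tokens=$burst; fi
  done
}

# QPS=5, BURST=10, with 12, 8, and 3 new requests in successive seconds.
simulate_token_bucket 5 10 12 8 3
# second=1 sent=10 waiting=2
# second=2 sent=5 waiting=5
# second=3 sent=5 waiting=3
```

In this run the first burst drains the full bucket, after which throughput settles at the QPS value, and the backlog only shrinks once new arrivals drop below that rate.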
When to adjust settings
Consider increasing these values when:
- The cluster has a scaled-up control plane capable of handling higher throughput
- Logs show frequent throttling messages
- Operations that modify many resources (like large-scale rebalancing) are taking longer than expected
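One rough way to judge how frequent throttling is would be to count matching lines in the controller logs. The sketch below assumes that throttling messages contain the substring "throttl" (for example "throttling" or "throttled"); the exact wording of the controller's log messages is not specified here, so adjust the pattern to what you actually see. The function name count_throttling is hypothetical.

```shell
# Count log lines on stdin that mention throttling (case-insensitive).
# The "throttl" pattern is an assumption about the log wording.
count_throttling() {
  # grep -c exits non-zero when nothing matches; keep the "0" and succeed.
  grep -ci 'throttl' || true
}

# Example:
#   kubectl logs -l app.kubernetes.io/name=castai-cluster-controller \
#     -n castai-agent --tail=1000 | count_throttling
```

Note that when a label selector is used, kubectl logs returns only the last 10 lines per pod unless --tail is set, so an explicit --tail value matters for a meaningful count.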
How to apply changes
When using Helm, you can set these values through the additionalEnv or envFrom parameters:
helm upgrade cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
--reuse-values \
--set additionalEnv.KUBECLIENT_QPS=50 \
--set additionalEnv.KUBECLIENT_BURST=200
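If you prefer to keep overrides in a file rather than in --set flags, the same settings can be written as a values fragment. This sketch assumes additionalEnv is a map of variable names to values, as the --set syntax above implies; the function name rate_limit_values and the file name rate-limit-values.yaml are illustrative.

```shell
# Print a Helm values fragment containing the two rate-limit variables.
# Values are quoted so that YAML treats them as strings, not integers.
rate_limit_values() {
  local qps=$1 burst=$2
  printf 'additionalEnv:\n'
  printf '  KUBECLIENT_QPS: "%s"\n' "$qps"
  printf '  KUBECLIENT_BURST: "%s"\n' "$burst"
}

# Example:
#   rate_limit_values 50 200 > rate-limit-values.yaml
#   helm upgrade cluster-controller castai-helm/castai-cluster-controller \
#     -n castai-agent --reuse-values -f rate-limit-values.yaml
```

Keeping the fragment in version control makes it easier to see which limits a cluster is running with.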
For the complete default configuration, refer to the cluster-controller repository.
Auto updates
By default, the cluster-controller component can update itself by receiving an update action (scheduled by Cast AI). It can also update other components, such as castai-evictor, castai-spot-handler, or castai-agent, with one caveat: the cluster-controller can't change permissions for other components or for itself.
However, permission changes are sometimes required for new features. To make this possible, you can explicitly bind a role such as cluster-admin to the cluster-controller service account. This allows the cluster-controller to manage other Cast AI components automatically, including their permissions.
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: castai-cluster-controller-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: castai-cluster-controller
    namespace: castai-agent
EOF
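To confirm that the binding took effect, you can ask the API server what the service account is allowed to do. This is a sketch under two assumptions: your own user is permitted to impersonate service accounts, and the service account name and namespace match the manifest above. The function name check_controller_admin is hypothetical.

```shell
# Ask whether the cluster-controller service account may perform any verb
# on any resource, which is what the cluster-admin role grants.
check_controller_admin() {
  local namespace="${1:-castai-agent}"
  local account="${2:-castai-cluster-controller}"
  kubectl auth can-i '*' '*' \
    --as="system:serviceaccount:${namespace}:${account}"
}

# Example:
#   check_controller_admin
```

The command prints yes or no; a no answer after applying the binding usually points to a mismatched service account name or namespace.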