Pod Pinner
Pod Pinner addresses the misalignment between the actions of the CAST AI Autoscaler and the Kubernetes cluster scheduler.
While the CAST AI Autoscaler efficiently bin-packs pods and creates nodes in the cluster in a cost-optimized manner, the Kubernetes cluster scheduler determines the actual placement of pods on nodes. This can lead to suboptimal pod placement, fragmentation, and unnecessary resource waste, as pods may end up on different nodes than those the CAST AI Autoscaler anticipated.
Pod Pinner enables the integration of the CAST AI Autoscaler's decisions into your cluster, allowing it to override the decisions of the Kubernetes cluster scheduler. Installing Pod Pinner can directly enhance savings in the cluster. Pod Pinner is a CAST AI in-cluster component, similar to the CAST AI agent, cluster controller, and others.
Limitations
Notice
Pod Pinner may conflict with Spot-webhook; we do not recommend using them together at this time.
Using them together may result in some failed pods during scheduling. This is because Pod Pinner is unaware of changes applied by other webhooks when binding pods to nodes. While these failed pods are typically recreated by Kubernetes without negative impact, we are working on improving compatibility between Pod Pinner and Spot-webhook to fully address this issue.
Installation and version upgrade
For newly onboarded clusters, the latest version of the Pod Pinner castware component `castai-pod-pinner` is installed automatically. To confirm this:
- Check whether the `castai-pod-pinner` deployment is available in the `castai-agent` namespace:
$ kubectl get deployments.apps -n castai-agent
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
castai-agent                1/1     1            1           15m
castai-agent-cpvpa          1/1     1            1           15m
castai-cluster-controller   2/2     2            2           15m
castai-evictor              0/0     0            0           15m
castai-kvisor               1/1     1            1           15m
castai-pod-pinner           2/2     2            2           15m
Helm
Option 1: CAST AI-Managed (default)
By default, CAST AI manages Pod Pinner, including automatic upgrades.
- Check the currently installed Pod Pinner chart version. If it's >= `1.0.0`, an upgrade is not needed. You can check the version with the following command:
$ helm list -n castai-agent --filter castai-pod-pinner
NAME                NAMESPACE      REVISION   UPDATED                                   STATUS     CHART                     APP VERSION
castai-pod-pinner   castai-agent   11         2024-09-26 11:40:00.245427517 +0000 UTC   deployed   castai-pod-pinner-1.0.2   v1.0.0
- If the version is < `1.0.0`, run the following commands to install or upgrade Pod Pinner to the latest version:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-pod-pinner castai-helm/castai-pod-pinner -n castai-agent
After installation or upgrade to version >= `1.0.0`, Pod Pinner is automatically scaled to 2 replicas and managed by CAST AI, as indicated by the `charts.cast.ai/managed=true` label applied to the pods of the `castai-pod-pinner` deployment. All subsequent Pod Pinner versions are applied automatically.
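You can verify this by listing the Pod Pinner pods that carry the managed label (assuming the default `castai-agent` namespace):
$ kubectl get pods -n castai-agent -l charts.cast.ai/managed=true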
Option 2: Self-Managed
To control the Pod Pinner version yourself:
helm upgrade -i castai-pod-pinner castai-helm/castai-pod-pinner -n castai-agent --set managedByCASTAI=false
This prevents CAST AI from automatically managing and upgrading Pod Pinner.
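If you also want to pin an exact chart version rather than always installing the latest, you can additionally pass Helm's standard --version flag; the version below is only an example:
helm upgrade -i castai-pod-pinner castai-helm/castai-pod-pinner -n castai-agent --set managedByCASTAI=false --version 1.0.2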
Re-running Onboarding Script
You can also install Pod Pinner by re-running the phase 2 onboarding script. For more information, see the cluster onboarding documentation.
Terraform Users
If you use Terraform, you can manage Pod Pinner's installation and configuration through your Terraform code. This allows for version control and infrastructure-as-code management of Pod Pinner settings.
Enabling/Disabling Pod Pinner
Pod Pinner is enabled by default but can be disabled in the CAST AI console.
If you disable Pod Pinner this way, the deployment will be scaled down to 0 replicas and will no longer be auto-upgraded by CAST AI.
Autoscaler Settings UI
To enable/disable Pod Pinner:
- Navigate to Autoscaler settings in the CAST AI console.
- You'll find the Pod Pinner option under the "Unscheduled pods policy" section.
- Check/uncheck the "Enable pod pinning" box to activate/deactivate Pod Pinner. Disabling it scales the deployment down to 0 replicas and turns off auto-upgrades.
Note
When Pod Pinner is disabled through the console and the `charts.cast.ai/managed=true` label is present, CAST AI scales the deployment down to 0 replicas regardless of any manual changes. To manually control Pod Pinner while keeping it active, use the self-managed installation option described above.
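After disabling Pod Pinner, you can confirm the result by checking the deployment's replica count, which should read 0/0 in the READY column (assuming the default `castai-agent` namespace):
$ kubectl get deployment castai-pod-pinner -n castai-agent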
Ensuring stability
It is suggested that you keep the Pod Pinner pod as stable as possible, especially during rebalancing. You can do so by applying the same approach you use for `castai-agent`. For instance, you can add the `autoscaling.cast.ai/removal-disabled: "true"` label/annotation to the pod. If the Pod Pinner pod restarts during rebalancing, pods won't get pinned to nodes as the Rebalancer expects, which may result in suboptimal placement as the Kubernetes cluster scheduler schedules them instead.
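As a sketch, one way to apply that label is to patch the deployment's pod template so that new pods carry it; adjust this to however you manage the castai-agent components:
kubectl -n castai-agent patch deployment castai-pod-pinner --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"autoscaling.cast.ai/removal-disabled":"true"}}}}}'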
Note
You can scale down the `castai-pod-pinner` deployment at any time. The cluster will continue to behave normally; the only effect is that the Kubernetes scheduler takes over pod scheduling.
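For example:
kubectl scale deployment castai-pod-pinner -n castai-agent --replicas=0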
Logs
You can access the Pod Pinner pod's logs to see what decisions are being made.
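One quick way to follow them, assuming the default `castai-agent` namespace:
$ kubectl logs -n castai-agent deployment/castai-pod-pinner -f
Here is a list of the most important log entries: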
Example | Meaning |
---|---|
node placeholder created | A node placeholder has been created. The real node will use this placeholder when it joins the cluster. |
pod pinned | A pod has been successfully bound to a node. Such logs always appear after the node placeholder is created. |
node placeholder not found | This log appears when Pod Pinner tries to bind a pod to a non-existing node. This may occur if Pod Pinner fails to create the node placeholder. |
pinning pod | This log occurs when the Pod Pinner's webhook intercepts a pod creation and binds it to a node. This happens during rebalancing. |
node placeholder deleted | A node placeholder has been deleted. This happens when a node fails to be created in the cloud, and Pod Pinner must clean up the placeholder that was created. |
failed streaming pod pinning actions, restarting... | The connection between the Pod Pinner pod and CAST AI has been reset. This is expected to happen occasionally and will not negatively impact your cluster. |
http: TLS handshake error from 10.0.1.135:48024: EOF | This log appears as part of the certificate rotation performed by the webhook. This is a non-issue log and will not negatively impact the cluster. |
Troubleshooting
Failed pod status reason: OutOf{resource}
`OutOfcpu`, `OutOfmemory`, and other `OutOf{resource}` pod statuses happen when the scheduler places a pod on a node, but the kubelet rejects it due to a lack of that resource. These are `Failed` pods that CAST AI and the Kubernetes control plane know how to ignore.
This happens when many pods are upscaled at the same time. The scheduler has various optimizations to deal with large bursts of pods, so it makes scheduling decisions in parallel. Sometimes those decisions conflict, resulting in pods scheduled on nodes where they don't fit; this is especially common on GKE. If you see this status, there is no cause for concern: the control plane will eventually clean those pods up after a few days.
Pods might get this status when the Kubernetes scheduler takes over scheduling decisions due to a blip in Pod Pinner's availability. However, this does not negatively impact the cluster as Kubernetes recreates the pods.
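If you prefer to find or clean up these pods yourself, standard kubectl field selectors can locate them; scope the delete to a namespace you are comfortable with:
$ kubectl get pods -A --field-selector=status.phase=Failed
$ kubectl delete pods -n <namespace> --field-selector=status.phase=Failed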
Failed pod status reason: Node affinity
If you use Spot-webhook, your cluster may encounter this issue, which puts pods in a `Failed` status. This occurs because Pod Pinner is unaware of changes other webhooks apply to pods when binding them to nodes, so Pod Pinner may have a pod with different node selectors in mind than what actually exists.
As with the `OutOf{resource}` pod status, this is simply a visual inconvenience, as the pod will be recreated by Kubernetes.