Typically, we advise a rebalance after the initial onboarding process to immediately achieve cost savings. Then periodic rebalancing is recommended, as it helps to reduce some of the fragmentation that occurs naturally due to autoscaling.
The minimum count of worker nodes to be part of the rebalancing plan.
What are problematic workloads?
Rebalancing is an operation performed at the node level, meaning it can only be carried out on a node if all the workloads on that node can be scheduled on the new node without any issues. However, if a workload has an affinity to a custom label, it will be considered problematic for rebalancing since the CAST AI autoscaler isn't aware of this custom node selector.
If you still want to proceed with rebalancing, you can add the `autoscaling.cast.ai/disposable: "true"` label to your workloads.
This label will mark them as disposable, indicating that they can be safely relocated during the rebalancing process.
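Applied as a manifest fragment, the disposable label could look like this (sketch only; the workload's pod template metadata is where the label belongs):

```yaml
metadata:
  labels:
    autoscaling.cast.ai/disposable: "true"
```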
If you're using a custom label, you have to let CAST AI know about it via a node template. Once you create a node template, CAST AI will also support your custom labels.
Learn more about it on this page.
Let's say I have two pods running on one spot node in the cluster. The CPU utilization is 80% and memory 80% (the node isn't underutilized). The scheduled rebalance mechanism will check to see if this spot instance type in this zone is the cheapest one - and if not, will it redeploy it on a new spot instance?
If savings match or exceed the target savings value, CAST AI will rebalance if it can find cheaper options.
You can also tell it to run only if it finds a combination that achieves a specified savings percentage. For instance, if you have an 80% utilized node, switching to a different spot instance may reduce costs by 12%. If your cost savings target is 10%, the rebalance will run. If the new node would only be 3% cheaper against that same 10% target, it won't run.
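The savings-target check described above can be sketched as follows (illustrative logic only, not CAST AI's actual implementation):

```python
def should_rebalance(current_cost: float, new_cost: float, target_pct: float) -> bool:
    """Return True when the projected savings meet or exceed the configured target."""
    savings_pct = (current_cost - new_cost) / current_cost * 100
    return savings_pct >= target_pct

# A node that is 12% cheaper against a 10% target: rebalance runs.
print(should_rebalance(100.0, 88.0, 10.0))  # True
# A node that is only 3% cheaper against a 10% target: rebalance is skipped.
print(should_rebalance(100.0, 97.0, 10.0))  # False
```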
The customer ran the rebalance operation; everything was working fine until the rebalance got stuck on the last node.
The team checked and saw that there was a pod with a PDB and 0 allowed disruptions. The rebalance was pending for a few minutes, and then the node was deleted, removing the pod that had a PDB.
CAST AI honors PDBs for 20 minutes in the draining phase. After this time passes, it assumes these are invalid PDBs and force drains the node.
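A PDB with 0 allowed disruptions typically looks like the sketch below: with a single replica and `minAvailable: 1`, no pod can ever be evicted (names and values are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1   # with only one replica running, this leaves 0 allowed disruptions
  selector:
    matchLabels:
      app: my-app
```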
If you don't want that to happen, consider adding the removal-disabled annotation to the workload. That node will then be skipped completely.
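As a manifest fragment, the annotation could be added like this (assuming the full key is `autoscaling.cast.ai/removal-disabled`, matching the `RemovalDisabled` kind below; verify against the CAST AI documentation):

```yaml
metadata:
  annotations:
    autoscaling.cast.ai/removal-disabled: "true"
```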
The customer gets the following error:
- kind: RemovalDisabled
description: annotated with removal disabled
If the node can be safely rebalanced, they can temporarily add the `autoscaling.cast.ai/disposable="true"` label (removing it afterwards), or mark the node to be ignored.
Does CAST AI always drain the node first, or does it sometimes remove nodes without a graceful shutdown? I ask this because I see logs stating "Node was drained. Initiated by: Rebalancer," but many times I also see "Node was deleted. Initiated by: Autoscaler," which doesn't mention draining the node.
If the node was empty already, it would just be deleted. If the node has pods on it and is selected for eviction, it will be gracefully drained first. Any scenario where there are pods on the node will initiate a drain if the node is to be deleted.
The autoscaler only deletes empty nodes. A rebalance first creates the new nodes, and then drains and deletes the old nodes.
You can specify minNode in the rebalancing configuration section of the update/create node template API calls.
After that, the rebalance should respect the minNode count for that node template.
Currently, this is available through API/Terraform only.
This appears to be an issue on our side related to the migration of this field to the node template. What is happening now is that the value passed in the UI/API is not considered when creating a green node setup; what is considered instead is the RebalanceMinNodes field of the node template.
This field is not exposed in the UI but can be changed through the API. First, create a rebalancing plan with the default node template, which has RebalanceMinNodes=0 (the default value). Next, update the node template through the API and set RebalanceMinNodes=3. Finally, try rebalancing again; it should generate the expected number of green nodes.
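As a sketch, the node template update payload might carry the field like this (the exact endpoint and field casing are assumptions here; `rebalancingConfig.minNodes` is inferred from the field names above and should be confirmed against the CAST AI API reference):

```json
{
  "rebalancingConfig": {
    "minNodes": 3
  }
}
```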
- kind: "TopologySpread"
description: "Unsupported topology spread key: kubernetes.io/hostname"
This is pretty much impossible to satisfy if you use a max skew of 1 and have 5 nodes, 3 of which are 100% full. If a deployment scales to 10 replicas, you would need to create 8 new nodes with 1 pod each to satisfy the skew.
We recommend switching to podAntiAffinity. Playing around with replica count might help with rebalancing as well. Another approach is using soft affinity.
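A soft (preferred) pod anti-affinity, as suggested above, could look like the following fragment in the pod spec (labels and weight are illustrative):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: kubernetes.io/hostname
```

Because this is a preference rather than a hard requirement, the scheduler can still co-locate pods when spreading them would require extra nodes, which keeps rebalancing feasible.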
- kind: PodNodeRequirements
description: PersistentVolume "pvc-ad3c1269-8772-479f-b3a0-bf380309e67a" NodeAffinity topology label "topology.gke.io/zone" declares unsupported az "asia-south1-b"
CAST AI takes zones (the locations field in the API, "default node zones") from the cluster object, not the node pool. To add new zones to zonal clusters, enable them on the cluster object and trigger a cluster reconcile; you should then see nodes in the specified zones.
- kind: PreGroupScalability
description: Autoscaling for Node Template "default-by-castai" is disabled.
This is a new problematic pod kind. If a node has pods that use the default node template (DNT) and the DNT is turned off, then the node will be considered problematic. It's mandatory to have the node template turned on to be able to create nodes for pods that use that template. This has always been the case with the rebalance feature, but it's now explicitly exposed via problematic pods.
In this case, you won't need to enable the default node template (DNT). As long as there aren't any DNT workloads, they won't run into this problematic pod kind.
Unfortunately, CAST AI doesn't support pod anti-affinity for rebalancing yet.
The Savings report doesn't take node groups or templates into account. In this case, there are 3 nodes in default and 2 in system-reserved; the rebalance feature cannot combine the two, and there may be pod anti-affinity forcing the additional nodes.
Yes, it is possible. Each node has some DaemonSet pods running, so having fewer nodes means having fewer pods. Also, the workload could have changed during rebalancing.
Yes, CAST AI sets the minimum and desired sizes for the available ASGs to 0.