After enabling CAST AI's automation components, turn on and test the following capabilities to verify that the platform's autoscaling engine operates as expected.
Further fine-tuning might be necessary for specific use cases.
- Upscale the cluster during peak hours.
- Binpack or downscale the cluster when excess capacity is no longer required.
- Use spot instances to reduce your infrastructure costs, but have the safety of on-demand instances when needed.
The following section describes the Autoscaling policy setup needed to achieve these goals.
To upscale a cluster, CAST AI needs to react to unschedulable pods. You can achieve this by turning on the Unscheduled Pods policy and configuring the Default Node template.
What is an unschedulable pod?
This term refers to a pod stuck in a pending state, meaning that it cannot be scheduled onto a node. Generally, this is because insufficient resources of one type or another prevent scheduling.
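You can spot unschedulable pods yourself by listing pods stuck in Pending and inspecting the scheduler's events. The pod and namespace names below are placeholders for your own workloads:

```shell
# List pods that are still Pending across all namespaces
kubectl get pods -A --field-selector=status.phase=Pending

# Inspect a pending pod's events; look for scheduler messages such as
# "0/3 nodes are available: 3 Insufficient cpu."
# (<pod-name> and <namespace> are placeholders)
kubectl describe pod <pod-name> -n <namespace>
```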
Automatically add required capacity to the cluster.
Enable Spot instances to allow CAST AI to handle spot instances & their interruptions.
Enable Spot Fallbacks to automatically fall back to on-demand capacity when spot instances are unavailable in the cloud environment, and move workloads back to spot once capacity returns.
CAST AI can constantly binpack the cluster and remove any excess capacity. To achieve this goal, we recommend the following initial setup:
Make sure that empty nodes do not remain in the cluster longer than the configured time.
Enable Evictor for higher node resource utilization & less waste. Evictor continuously simulates scenarios in which it tries to eliminate underutilized nodes by checking whether their pods could be scheduled on the remaining capacity. The simulation respects Pod Disruption Budgets (PDBs) and all other Kubernetes restrictions that your applications may have.
What is Evictor's aggressive mode?
When Evictor runs in aggressive mode, it also considers workloads with a single replica as potential targets for binpacking. This might cause some disruption to single-replica workloads.
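If you want to limit how far Evictor can disrupt a given workload, a Pod Disruption Budget is the standard Kubernetes mechanism that Evictor's simulation respects. A minimal sketch, assuming a hypothetical workload labeled `app: my-app`:

```yaml
# Hypothetical example: keep at least 2 replicas of "my-app" available
# during voluntary disruptions such as Evictor-driven node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```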
After completing the basic setup, we recommend performing a simple upscale/downscale test to verify that the autoscaler functions correctly and that the cluster can scale up and down as needed.
First, deploy the following spot workload to check if the autoscaler reacts to the need for spot capacity:
kubectl apply -f https://raw.githubusercontent.com/castai/examples/main/evictor-demo-pods/test_pod_spot.yaml
# https://github.com/castai/examples/blob/main/evictor-demo-pods/test_pod_spot.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: castai-test-spot
  namespace: castai-agent
  labels:
    app: castai-test-spot
spec:
  replicas: 10
  selector:
    matchLabels:
      app: castai-test-spot
  template:
    metadata:
      labels:
        app: castai-test-spot
    spec:
      tolerations:
        - key: scheduling.cast.ai/spot
          operator: Exists
      nodeSelector:
        scheduling.cast.ai/spot: "true"
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 4
This step ensures that CAST AI has all the relevant access to upscale your cluster automatically.
Once the capacity is added, check that the desired pod count of each DaemonSet matches the node count.
# Get DaemonSets in all namespaces
kubectl get ds -A

# Get node count
kubectl get nodes | grep -v NAME | wc -l
Verify that the deployed pods are Running, then scale the test workload down:

kubectl scale deployment/castai-test-spot -n castai-agent --replicas=0
Verify that CAST AI eliminates empty nodes in the configured time interval.
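One way to observe the downscale is to watch the node list shrink after the test deployment is scaled to zero:

```shell
# Watch the node list; CAST AI-created nodes should disappear once
# they have been empty for longer than the configured interval
kubectl get nodes -w
```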
- In some situations, a newly connected cluster may not be immediately fully managed by CAST AI. This means that some workloads still run on existing legacy node pools / Auto Scaling groups.
- We recommend adding the autoscaling.cast.ai/removal-disabled="true" label on such node pools / Auto Scaling groups so that CAST AI excludes those nodes from the Evictor & Rebalancing features.
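While node pools and Auto Scaling groups are typically labeled through your cloud provider's tooling, the same label can also be applied to an individual running node with kubectl. The node name below is a placeholder:

```shell
# <node-name> is a placeholder for an existing node in your cluster
kubectl label node <node-name> autoscaling.cast.ai/removal-disabled="true"
```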