Autoscaler preparation checklist

After enabling CAST AI's automation features, turning on and testing the following aspects will help you ensure that the platform's autoscaling engine operates as expected.

Further fine-tuning might be necessary for specific use cases.

Goals

  • Upscale the cluster during peak hours.
  • Binpack or downscale the cluster when excess capacity is no longer required.
  • Use spot instances to reduce your infrastructure costs, but have the safety of on-demand instances when needed.

Recommended setup

The following section describes the Autoscaling policy setup needed to achieve these goals.

Unscheduled pods policy

To upscale a cluster, CAST AI needs to react to unschedulable pods. You can achieve this by turning on the Unscheduled Pods policy and configuring the Default Node template.

📘

What is an unschedulable pod?

This term refers to a pod stuck in a pending state, meaning that it cannot be scheduled onto a node. Generally, this is because insufficient resources of one type or another prevent scheduling.
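
To see whether your cluster currently has such pods, a standard kubectl query like the one below can help; it lists every Pending pod, which includes unschedulable ones:

# List pods stuck in the Pending phase across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending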

Unscheduled pods policy with the Default Node template (console screenshots)

Why?

:white-check-mark: Automatically add required capacity to the cluster.

:white-check-mark: Enable Spot instances so that CAST AI can provision spot capacity and handle spot interruptions for you.

:white-check-mark: Enable Spot Fallbacks so that CAST AI automatically falls back to on-demand capacity when spot instances are unavailable in the cloud environment, and moves back to spot once capacity returns.
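
To confirm which nodes were provisioned as spot capacity, you can inspect the scheduling.cast.ai/spot node label (the same label the test workload below selects on); this is a plain kubectl check:

# Show nodes together with their spot label value
kubectl get nodes -L scheduling.cast.ai/spot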

Node deletion policy / Evictor

CAST AI can constantly binpack the cluster and remove any excess capacity. To achieve this goal, we recommend the following initial setup:

Node deletion policy recommended settings (console screenshot)

Why?

:white-check-mark: Ensure that empty nodes do not stay in the cluster longer than the configured time.

:white-check-mark: Enable Evictor for higher node resource utilization and less waste. Evictor continuously simulates scenarios in which it tries to remove underutilized nodes by checking whether their pods could be scheduled on the remaining capacity. The simulation respects PDBs (PodDisruptionBudgets) and all other Kubernetes restrictions your applications may have.

🎯

What is Evictor's aggressive mode?

When Evictor runs in aggressive mode, it also considers single-replica workloads as potential binpacking targets, which may cause brief disruption to those workloads.
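
Because Evictor respects PDBs, a standard Kubernetes PodDisruptionBudget is one way to shield a critical single-replica workload from voluntary eviction even in aggressive mode. The name, namespace, and label below are illustrative; adjust them to match your workload:

# pdb.yaml - blocks voluntary eviction of the last remaining replica
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: critical-app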

Testing

After completing the basic setup, we recommend performing a simple upscale/downscale test to verify that the autoscaler functions correctly and that the cluster can scale up and down as needed.

:arrow-right: First, deploy the following spot workload to check if the autoscaler reacts to the need for spot capacity:

kubectl apply -f https://raw.githubusercontent.com/castai/examples/main/evictor-demo-pods/test_pod_spot.yaml
# https://github.com/castai/examples/blob/main/evictor-demo-pods/test_pod_spot.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: castai-test-spot
  namespace: castai-agent
  labels:
    app: castai-test-spot
spec:
  replicas: 10
  selector:
    matchLabels:
      app: castai-test-spot
  template:
    metadata:
      labels:
        app: castai-test-spot
    spec:
      tolerations:
        - key: scheduling.cast.ai/spot
          operator: Exists
      nodeSelector:
        scheduling.cast.ai/spot: "true"
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 4

:white-check-mark: This step verifies that CAST AI has the access required to upscale your cluster automatically.

:arrow-right: Once the capacity is added, check that your desired DaemonSet pod count matches the node count.

# Get DaemonSets in all namespaces
kubectl get ds -A

# Get Node count
kubectl get nodes | grep -v NAME | wc -l

:arrow-right: Verify that the deployed pods are Running, then scale the test workload down with kubectl scale deployment/castai-test-spot --replicas=0 -n castai-agent.

:white-check-mark: Verify that CAST AI removes the now-empty nodes within the configured time interval.
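
To observe the downscale as it happens, you can watch the node list shrink back to its previous size:

# Watch nodes being removed as empty capacity is reclaimed
kubectl get nodes --watch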

Troubleshooting

Partially managed by CAST AI

Console screenshots: "Not fully managed by CAST AI" / "One node not managed by CAST AI"

  • In some situations, a cluster you connect may not be immediately fully managed by CAST AI, meaning that some workloads still run on existing legacy node pools / Auto Scaling groups.
  • We recommend adding the autoscaling.cast.ai/removal-disabled="true" label to such node pools / Auto Scaling groups so that CAST AI excludes their nodes from the Evictor and Rebalancing features.
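
If labeling the existing nodes directly is more convenient than tagging the node pool / Auto Scaling group, a plain kubectl command can apply the same label (replace the node name placeholder with your own nodes):

# Exclude a specific node from Evictor & Rebalancing
kubectl label node <node-name> autoscaling.cast.ai/removal-disabled="true"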