Minimize Impact

This article outlines how you can minimize the operational impact on your applications running in the Kubernetes cluster if CAST AI can't provide services.

Fallback scaling

Manual

AKS

In AKS, the cluster autoscaler is integrated at the node pool level.

Adding the necessary capacity to the Kubernetes cluster to make sure all critical applications are running (not pending) is a top priority. This emergency capacity can be added manually or triggered automatically after a time delay.
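For example, a minimal sketch of adding emergency capacity manually with the Azure CLI; the resource group, cluster, node pool name, and node count below are placeholders you should adjust to your environment:

az aks nodepool scale --resource-group MyResourceGroup --cluster-name MyCluster --name default --node-count 5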

During the CAST AI onboarding, the cluster autoscaler is usually disabled by the user on the native node pool. In an emergency, you can manually re-enable it in the Azure portal.

To do this, set the AKS node pool "Scale" configuration to "Autoscale".

To enable the cluster autoscaler on the default node pool using the Azure CLI, use the following command:

az aks nodepool update --enable-cluster-autoscaler -g MyResourceGroup -n default --cluster-name MyCluster

GKE

Adding the necessary capacity to the Kubernetes cluster to make sure all critical applications are running (not pending) is a top priority. This emergency capacity can be added manually or triggered automatically after a time delay.
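For example, a minimal sketch of adding emergency capacity manually with gcloud; the cluster name, node pool, zone, and node count below are placeholders:

gcloud container clusters resize MyCluster --node-pool default-pool --num-nodes 5 --zone us-central1-a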

During the CAST AI onboarding, the cluster autoscaler is usually disabled by the user on the native node pool. In an emergency, you can manually re-enable it in the Google Cloud console.

To do this, select "Enable cluster autoscaler" in the default node pool configuration.
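Alternatively, a sketch of re-enabling the cluster autoscaler on the node pool with gcloud, using the same placeholder names and illustrative node limits:

gcloud container clusters update MyCluster --enable-autoscaling --node-pool default-pool --min-nodes 1 --max-nodes 5 --zone us-central1-a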


EKS

In EKS, the cluster autoscaler usually runs inside the cluster, and with CAST AI it is scaled down to 0 replicas. To manually fall back to it, scale the cluster autoscaler deployment back to 1 replica.

Enabling the cluster autoscaler again will reinflate the Auto Scaling Group/node pool based on pending Pods.
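For example, assuming the cluster autoscaler runs as a Deployment named cluster-autoscaler in the kube-system namespace (adjust to your installation):

kubectl scale deployment cluster-autoscaler --namespace kube-system --replicas=1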

Fallback when Node Templates are in use

Some workloads will have placement logic that uses NodeAffinity or NodeSelector to require nodes with specific labels, such as "mycompany=spark-app" or "pool=gpu."

To satisfy these placement requirements, users set up Node Templates. However, if you need to fall back to node pools or Auto Scaling groups, you need to configure labels similar to those used on the Node Templates. As in the above scenario, the cluster autoscaler should be enabled for these node pools or ASGs.
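On EKS, for example, one way to make the cluster autoscaler aware of a label such as "mycompany=spark-app" on a fallback ASG is the k8s.io/cluster-autoscaler/node-template/label/... tag convention; the ASG name below is a placeholder, and the label itself must still be applied to the nodes by the node group (for example via the kubelet --node-labels setting):

aws autoscaling create-or-update-tags --tags ResourceId=my-fallback-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/mycompany,Value=spark-app,PropagateAtLaunch=true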


Automatic

Automatic fallback logic assumes that the cluster autoscaler is always on but operates with a delay. This setup prioritizes the Nodes CAST AI creates; if the platform becomes unable to add capacity, the cluster autoscaler resumes creating Nodes.

Each Node Template should have a matching fallback ASG whose labels align with the corresponding Node Template selectors, as described in the "Fallback when Node Templates are in use" section above.

Use cluster-autoscaler as fallback autoscaler

Add extra flags to the container command section:

--new-pod-scale-up-delay=600s – configures a global delay so the cluster autoscaler only reacts to Pods that have been pending longer than this period, giving CAST AI time to act first.

If setting up a secondary cluster autoscaler:

--leader-elect-resource-name=cluster-autoscaler-fallback – sets a dedicated lease name so the fallback instance does not clash with the legacy autoscaler's leader-election lease.

Make sure the cluster role granting the cluster autoscaler's RBAC permissions allows access to the new lease:

- apiGroups:
  - coordination.k8s.io
  resourceNames:
  - cluster-autoscaler
  - cluster-autoscaler-fallback
  resources:
  - leases
  verbs:
  - get
  - update
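After the fallback instance starts, you can check that it acquired its own lease, assuming it runs in the kube-system namespace:

kubectl get lease cluster-autoscaler-fallback --namespace kube-system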

Example of a fallback autoscaler configuration:

  spec:
    containers:
    - command:
      - ./cluster-autoscaler
      - --v=4
      - --leader-elect-resource-name=cluster-autoscaler-fallback
      - --new-pod-scale-up-delay=600s
      - --stderrthreshold=info
      - --cloud-provider=aws
      - --skip-nodes-with-local-storage=false
      - --expander=least-waste
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/v1
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.2

Use Karpenter as fallback autoscaler

If Karpenter is already configured for the Kubernetes cluster, you can use Karpenter as the fallback autoscaler instead of cluster-autoscaler.

Edit the Karpenter deployment and change the environment variables BATCH_MAX_DURATION and BATCH_IDLE_DURATION to 600s:

  spec:
    containers:
    - env:
      - name: BATCH_MAX_DURATION
        value: 600s
      - name: BATCH_IDLE_DURATION
        value: 600s
      ...
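As a shortcut, a sketch of making the same change with kubectl set env, assuming Karpenter runs as a Deployment named karpenter in the karpenter namespace:

kubectl set env deployment/karpenter --namespace karpenter BATCH_MAX_DURATION=600s BATCH_IDLE_DURATION=600s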

Spot Drought Testing

You can use our spot drought tool and runbook if you would like to understand how your workloads will behave during a mass spot interruption event, or how the autoscaler will respond to a lack of spot inventory.

Spot Drought Tool

The spot drought tool required to run this test is available in our GitHub repository at the following link:

CAST AI Spot Drought Tool

Spot Drought Runbook

Objective:

To assess the performance of applications on a test cluster during a simulated drought of spot instances.

Prerequisites

Test Cluster: A cluster of significant scale with real workloads running on spot instances. This will be used for the simulation.

Python Environment: Ensure you have Python installed along with the requests library.

API Key: Make sure you have a full access CAST AI API key ready. This can be created in the CAST AI console.

Organization ID: Your CAST AI Organization ID.

Cluster ID: The cluster ID for the target cluster.

Script: The CAST AI Spot Drought Tool provided above.

Steps

Preparing for the Simulation:

  1. Backup: Ensure you have taken necessary backups of your cluster configurations and any important data. This is crucial in case you need to roll back to a previous state.
  2. Set Up the Script:
    1. Open the provided Python script in your terminal or preferred IDE.
    2. Create a CASTAI_API_KEY environment variable in your terminal, or update the castai_api_key variable with your CAST AI API key (see the setup example after this list).
    3. Update the organization_id variable with your CAST AI organization ID.
    4. Update the cluster_id variable with the ID of the target cluster.
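A minimal setup sketch for the steps above, assuming pip is available; the API key value is a placeholder:

# Install the script's dependency
pip install requests

# Provide the API key via an environment variable (placeholder value)
export CASTAI_API_KEY="<your-full-access-api-key>"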

Minimize Spot Instance Availability:

  1. Run the Python script in interactive mode:
    1. python castai_spot_drought.py
  2. You should see a list of options to choose from.
    1. Choose 1: Add to blacklist to restrict the spot instances available to your test cluster.
    2. Choose 2: Get blacklist to retrieve the blacklist and confirm that all instance families were added to it.

Run the Interrupt Simulation:

  1. Using the Node List in the CAST AI console, identify a set of spot nodes (25% or more of the nodes in the cluster) and simulate an interruption using the three-dot menu on each node’s row.
  2. Ensure that the interruption takes place and monitor the immediate effects on the cluster.
  3. The response from the CAST AI autoscaler will depend on what policies are enabled, and what features are selected in your node templates.
    1. We recommend performing this test with the Spot Fallback feature enabled in your node templates.
  4. Measure and Test Outcomes:
    1. Service Downtime: Monitor the services running on your cluster. Check if any service goes down or becomes unresponsive during the simulation.
    2. Error Responses: Track error rates and types. Note down any spikes in errors or any new types of errors that appear during the simulation.
    3. Latency: Measure the response time of your applications. Compare the latency during the simulation with the usual latency figures to see any significant differences.
    4. Other Metrics: Based on your application and infrastructure, you might want to monitor other metrics like CPU usage, memory usage, network bandwidth, etc.

Rollback:

Once you have captured all the necessary data and metrics:

  1. Return to your terminal, or run the script again in interactive mode:
    1. python castai_spot_drought.py
  2. Choose 3: Remove from blacklist to lift the restrictions and bring the cluster back to its normal state.
  3. Monitor the cluster to ensure it's back to normal operation and all the spot instances are available again.

Conclusion

After you've completed the simulation, review the captured data and metrics. This will give you insights into how your applications and infrastructure behave under spot instance drought conditions. Use this information to make any necessary adjustments to your setup for better fault-tolerance and performance.