Workload Autoscaler automatically scales your workload requests up or down to ensure optimal performance and cost-effectiveness.

Getting started

To start using workload optimization, you need to install the Workload Autoscaler component and the custom resource definitions for the recommendation objects. You can get the install script from our API or from the workload optimization page in the console.

📘

Note

Your cluster must be running in automated optimization mode, as workload optimization relies on the cluster controller to create the recommendation objects in the cluster.

Agent version requirements

Different features require specific minimum versions of both the workload-autoscaler component and the Cast AI agent.

Cast AI Workload-autoscaler requirements

Feature | Minimum version
DaemonSet support | v0.5.0
Argo Rollouts support | v0.6.0
Deferred apply mode for recommendations | v0.5.0
Immediate apply mode for rollouts | v0.21.0
HPA support for rollouts | v0.21.0
Horizontal pod autoscaling | v0.9.1
StatefulSet support | v0.13.0

Cast AI agent requirements

Feature | Minimum version
Horizontal pod autoscaling | v0.60.0
Argo Rollouts support | v0.62.2

The Cast AI agent is open source. To understand available features and improvements, you can view the complete changelog and release history in our GitHub repository.

Cast AI cluster-controller requirements

Feature | Minimum version
Workload Autoscaler support | v0.54.0

Check your current version

To check your current agent version:

kubectl describe pod castai-agent -n castai-agent | grep Image: | grep agent | head -n 1
kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-agent -o jsonpath='{.items[0].spec.containers[?(@.name=="agent")].image}'

To check your current workload-autoscaler version:

kubectl describe pod -n castai-agent -l app.kubernetes.io/name=castai-workload-autoscaler | grep Image:

To check your current cluster-controller version:

kubectl describe pod -n castai-agent -l app.kubernetes.io/name=castai-cluster-controller | grep Image:
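
Once you have an image reference from the commands above, comparing its tag against the tables can be scripted. The sketch below is illustrative only: the MIN_VERSIONS mapping copies the workload-autoscaler table from this page, the function names are hypothetical, and it assumes the image ends in a plain vMAJOR.MINOR.PATCH tag (no digest or suffix).

```python
# Minimum workload-autoscaler versions, copied from the table on this page.
MIN_VERSIONS = {
    "DaemonSet support": "v0.5.0",
    "Argo Rollouts support": "v0.6.0",
    "Deferred apply mode for recommendations": "v0.5.0",
    "Immediate apply mode for rollouts": "v0.21.0",
    "HPA support for rollouts": "v0.21.0",
    "Horizontal pod autoscaling": "v0.9.1",
    "StatefulSet support": "v0.13.0",
}


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn 'v0.21.0' (or '0.21.0') into (0, 21, 0) so versions compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def image_tag(image: str) -> str:
    """Extract the tag from an image reference such as 'repo/name:v0.21.0'."""
    return image.rsplit(":", 1)[1]


def supported_features(image: str) -> list[str]:
    """List the features whose minimum version is met by the given image."""
    current = parse_version(image_tag(image))
    return [name for name, minimum in MIN_VERSIONS.items()
            if current >= parse_version(minimum)]
```

Parsing the version into integers matters here: compared as plain strings, v0.9.1 would sort after v0.21.0.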

DaemonSet support

DaemonSet workloads are supported with the following requirements and limitations:

  • Requires workload-autoscaler version v0.5.0 or later
  • A feature flag must be enabled for your organization by Cast AI
  • DaemonSets are only scaled down (never above original requests) to prevent interference with node autoscaling

This constraint on upward scaling ensures proper functioning of the node autoscaler while still allowing for resource optimization through downscaling when possible.

To enable DaemonSet optimization:

  1. Contact Cast AI support to enable the feature flag for your organization
  2. Once enabled, DaemonSets will automatically be included in workload optimization with the above limitations

Metrics collection and recommendation generation

Cast AI needs to process metrics to generate recommendations, so you must install a metrics server.

Recommendations are regenerated every 30 minutes. The default configuration is maximum usage over 24 hours with 10% overhead for memory and 80th percentile usage over 24 hours for CPU.

Note: All generated recommendations will consider the current requests/limits.
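
To make the default configuration concrete, here is a minimal sketch of how a baseline and overhead could combine into a recommendation. It is not the platform's implementation: the nearest-rank percentile method, the function names, and the sample values are all assumptions made for illustration, and it looks at a single usage series rather than multiple pods.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend(samples: list[float], pct: float, overhead: float) -> float:
    """Baseline (a percentile of usage) scaled by the overhead; pct=100 gives the maximum."""
    return percentile(samples, pct) * (1 + overhead)


# Hypothetical usage samples collected over 24 hours.
cpu_millicores = [100, 120, 150, 180, 200, 220, 250, 300, 400, 900]
memory_mib = [400, 420, 450, 480, 512]

# Defaults described above: 80th percentile for CPU, maximum plus 10% for memory.
cpu_request = recommend(cpu_millicores, pct=80, overhead=0.0)
memory_request = recommend(memory_mib, pct=100, overhead=0.10)
```

With these samples, the single 900m CPU spike does not drive the CPU recommendation, while the memory recommendation sits above the highest observed value.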

Applying recommendations automatically

Once the recommendation lands in the cluster, the Workload Autoscaler component is notified that a recommendation has been created or updated.

Next, Workload Autoscaler:

  • Works as an admission webhook for pods - when pods are created matching the recommendation target, it modifies the pod to have its requests/limits set to what is defined in the recommendation
  • Finds the controller and triggers an update to cause the pods controlled by the controller to be re-created (for example, for a deployment object, it adds an annotation to the pod template)

Workload Autoscaler currently supports deployments, statefulsets, and rollouts. By default, deployments are updated immediately, which may result in pod restarts.

For rollouts, you can choose between two modes:

  • Deferred mode (default): Workload Autoscaler waits for pods to restart naturally before applying new recommendations
  • Immediate mode (requires v0.21.0+): Similar to deployments, recommendations are applied right away, which may trigger pod restarts

How to enable Workload Autoscaler

Scaling policies

Scaling policies allow you to manage all your workloads centrally by applying the same settings to multiple workloads simultaneously. You can also create custom policies with different settings and assign them to groups of workloads.

When you start using our Workload Autoscaler component, all of your workloads will automatically have a default scaling policy applied to them at first, using our default settings. When a new workload appears in the cluster, it will automatically be assigned to the default policy.

If the default scaling policy is suitable for your workloads, you can enable scaling in two ways:

  • Globally via the scaling policy by enabling Automatically Optimize Workloads – this will enable scaling only for the workloads we have enough data about. Workloads that aren’t ready will be checked later and enabled once the platform has enough data. When this setting is enabled on the default scaling policy, every new workload created in the cluster will be scaled automatically once the platform has enough data.
  • Directly from the workload – once enabled, autoscaling will start immediately (depending on the autoscaler mode chosen at the policy level).

How to configure recommendations

You can configure recommendations via the API to add additional overhead for a particular resource or change the function used to select the baseline for the recommendation.

For example, you can configure the MAX function to be used for CPU and set the overhead to 20%. This means that the CPU recommendation would be the maximum observed CPU usage over 24 hours plus 20% overhead.

You can find the optimization settings in the scaling policies. You can carry out the following configuration tasks:

  • Scale recommendations by adding overhead.
  • Fine-tune the percentile values for CPU and memory recommendations.
  • Specify the optimization threshold.

You can fine-tune the following settings in the scaling policies:

  • Automatically Optimize Workloads – the policy allows you to specify whether our recommendations should be automatically applied to all workloads associated with the scaling policy. This feature enables automation only when there is enough data available to make informed recommendations.
  • Recommendation percentile – determines which usage percentile, measured over the last day, CAST AI bases the recommendation on. The recommendation is the average of the target percentile across all pods over the recommendation period. Setting the percentile to 100% uses the maximum observed value over the period instead of the average across pods.
  • Overhead – how much extra resource is added on top of the recommendation. By default, it's set to 10% for memory and 0% for CPU.
  • Autoscaler mode – this can be set to immediate or deferred.
  • Optimization threshold – when automation is enabled and Workload Autoscaler works in immediate mode, this value determines when to apply scaling recommendations.
    For upscaling, the threshold is calculated relative to current resource requests, while for downscaling, it's calculated relative to the new recommended value.
    For example, with a threshold of 10%, an upscale from 100m to 120m CPU would be applied immediately (20% increase relative to the current 100m), while a downscale from 120m to 110m would not be applied immediately (about 9% decrease relative to the new 110m).
    The default value for both memory and CPU is 10%.
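
The threshold rule can be sketched as a small check. This is an illustration under stated assumptions, not the component's actual code: the function name is hypothetical, values are assumed to be positive, and treating a change exactly equal to the threshold as large enough is a guess.

```python
def should_apply(current: float, recommended: float, threshold: float = 0.10) -> bool:
    """Return True when the change is large enough to apply in immediate mode."""
    if recommended > current:
        # Upscale: the change is measured relative to the current request.
        change = (recommended - current) / current
    else:
        # Downscale: the change is measured relative to the new recommended value.
        change = (current - recommended) / recommended
    return change >= threshold
```

For instance, `should_apply(100, 120)` is true because the 20m increase is 20% of the current 100m, while `should_apply(100, 105)` is false because 5% is under the default 10% threshold.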

Immediate vs. deferred scaling mode

If the autoscaler mode is set to immediate, Workload Autoscaler checks whether a new recommendation meets the user-set optimization thresholds. If it does, Workload Autoscaler automatically modifies pod requests per the recommendation; if it doesn’t, the recommendation is not applied.

Moreover, Workload Autoscaler will also apply new recommendations upon natural pod restarts, such as a new service release or when a pod dies due to a business or technical error. This helps to avoid unnecessary pod restarts.

If the scaling mode is set to deferred, Workload Autoscaler will not initiate a forced restart of the pod. Instead, it will apply the recommendation whenever external factors initiate pod restarts.

System overrides for scaling mode

🚧

Notice

In certain scenarios, the Workload Autoscaler may override the chosen scaling mode to ensure optimal performance and prevent potential issues.

Here are some cases where the system may default to deferred mode, even if immediate mode is selected:

  1. Hard node requirements: Workloads with certain specific node constraints are set to deferred mode. This includes:
    • Specific pod anti-affinity: If a workload has pod anti-affinity rules that use the kubernetes.io/hostname as the topologyKey within a requiredDuringSchedulingIgnoredDuringExecution block.
    • Host network usage: Pods that require the use of the host's network.

These constraints are considered hard node requirements. Using deferred mode in these cases prevents potential scheduling conflicts and resource issues that could arise from immediate pod restarts.

  2. Rollouts: Workloads of the Rollout kind are always set to deferred mode, as they have their own update mechanisms that could conflict with immediate scaling.

These overrides help maintain cluster stability and prevent scenarios where immediate scaling could lead to increased costs or resource conflicts. Check your workload configurations for these conditions if you notice unexpected deferred scaling behavior.
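
If you want to check your own manifests for these conditions, the cases listed above can be sketched as a function over a pod spec. The function names and the "immediate"/"deferred" strings are hypothetical; only the Kubernetes field names (hostNetwork, affinity.podAntiAffinity, requiredDuringSchedulingIgnoredDuringExecution, topologyKey) are real. The actual component may consider conditions not listed on this page.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"


def has_required_hostname_anti_affinity(pod_spec: dict) -> bool:
    """True if a required pod anti-affinity term uses the hostname topology key."""
    anti_affinity = pod_spec.get("affinity", {}).get("podAntiAffinity", {})
    terms = anti_affinity.get("requiredDuringSchedulingIgnoredDuringExecution", [])
    return any(term.get("topologyKey") == HOSTNAME_KEY for term in terms)


def effective_mode(kind: str, pod_spec: dict, selected_mode: str) -> str:
    """Apply the override cases from this section to the mode chosen in the policy."""
    if selected_mode != "immediate":
        return selected_mode
    if kind == "Rollout":
        return "deferred"
    if pod_spec.get("hostNetwork", False):
        return "deferred"
    if has_required_hostname_anti_affinity(pod_spec):
        return "deferred"
    return "immediate"
```

Here `pod_spec` is the pod template's spec parsed into a dictionary, for example from the output of kubectl get with -o json.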

Mark of recommendation confidence

The "Recommendations Confidence" column can include a mark indicating low confidence in the recommended values.

If an orange mark appears, we don't have sufficient data on workload resource usage to generate trusted recommendations. You can start using Workload Autoscaler if you enable it from a workload level, but we advise waiting at least one week before enrolling your workloads in workload autoscaling.

This mark can appear next to workloads that have not run long enough for CAST AI to gather enough data and generate accurate recommendations. Workloads that have this mark and belong to a scaling policy with the "Auto enable" option turned on won't be optimized until enough data has been collected.

How to create a new scaling policy?

Scaling policies are a great tool for managing multiple workloads at once. Some workloads may require a higher overhead, while for others it would be unnecessary. To create a policy, navigate to Scaling policies and click Create a scaling policy.

Set your desired settings and choose workloads from the list. After everything is set, save the configuration.

Once you have all the required scaling policies, you can switch the policies for your workloads. You can do that in batches or for individual workloads:

  • To change a policy for batch workloads, select your workloads in the table, click Assign the policy, choose the policy you want to use, and save your changes.
  • To change a policy at the workload level, open the workload drawer, choose a new policy in the drop-down list, and save the changes.

When a policy is changed, the new configuration settings affect the next recommendation. The newest data on the workload recommendation graphs will reflect the new values.

Enabling Workload Autoscaler for a single workload

To enable optimization for a single workload:

  • Select the workload you want to optimize.
  • In the drawer that opens, you can change the settings, review the past 7 days' historical usage and recommendations, and request data.
  • Once you’ve completed the review, click the Turn Optimization On button and save the changes.

OOM event handling

Despite careful monitoring and historical data analysis, out-of-memory (OOM) events can occur due to sudden workload spikes or application-level issues.

CAST AI Workload Autoscaler handles OOM events in Kubernetes clusters by dynamically adjusting memory allocations based on historical data and recent events, which reduces the chance of repeated OOM kills.

Detection

The system detects OOM events by monitoring pod container statuses.

  1. The CAST AI Agent collects pod data through regular snapshots.
  2. Pod container termination states are extracted from these snapshots:
    {
      "name": "data-analysis-service",
      "lastState": {
        "terminated": {
          "exitCode": 137,
          "reason": "OOMKilled",
          "startedAt": "2023-08-16T13:51:08Z",
          "finishedAt": "2023-08-16T13:58:43Z",
          "containerID": "containerd://8f7e9bc23a1d5f6g987h654i321j987k654l321m987n654o321p987q654r"
        }
      }
    }
    
  3. When an OOMKilled state is detected, an OOM event is emitted.
  4. The OOM event is stored in the database to ensure each event is handled only once.

When a container is terminated with the reason OOMKilled, it triggers the OOM handling process.
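
The detection steps can be sketched as follows. This is a simplified illustration, not the agent's code: the function name is hypothetical, an in-memory set stands in for the database mentioned in step 4, and using the container name plus finish time as the deduplication key is an assumption.

```python
import json


def detect_oom_events(container_statuses: list[dict], seen: set[tuple[str, str]]) -> list[dict]:
    """Return new OOM events, skipping terminations already recorded in `seen`."""
    events = []
    for status in container_statuses:
        terminated = status.get("lastState", {}).get("terminated")
        if not terminated or terminated.get("reason") != "OOMKilled":
            continue
        # Container name plus finish time identifies one termination (assumed key).
        key = (status["name"], terminated["finishedAt"])
        if key in seen:
            continue
        seen.add(key)
        events.append({"container": status["name"], "finishedAt": terminated["finishedAt"]})
    return events


# A snapshot shaped like the container status shown above.
snapshot = json.loads("""
[{"name": "data-analysis-service",
  "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled",
                               "startedAt": "2023-08-16T13:51:08Z",
                               "finishedAt": "2023-08-16T13:58:43Z"}}}]
""")
seen: set[tuple[str, str]] = set()
first = detect_oom_events(snapshot, seen)   # the termination is reported once
second = detect_oom_events(snapshot, seen)  # the same snapshot yields nothing new
```

Recording each handled termination is what keeps the same OOMKilled state, which stays visible in later snapshots, from triggering repeated adjustments.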

Handling

When an OOM event is detected, the system takes the following actions:

  1. The OOM event handler watches for new OOM events.
  2. Upon receiving an event, it adjusts the configuration based on the previous workload state.
  3. A new recommendation is generated with increased memory overhead.

Memory Overhead Adjustment

The system uses an incremental approach to adjust memory overhead:

  1. Initial OOM event: memory overhead is increased by 10% (x1.1).
  2. If no further OOM events occur, the overhead slowly decreases back to the original allocation (x1) over time.
  3. If another OOM event occurs during this decrease:
    • The system starts from the current overhead increase.
    • An additional 10% is added on top of the current value.

For example:

  • If the current overhead is 5% (x1.05) when an OOM occurs again, it will be increased to 15% (x1.15).
  • This process can continue up to a maximum of 2.5x the original allocation.
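
The arithmetic of the incremental adjustment can be sketched as below. The constant and function names are hypothetical, the 2.5x cap is interpreted here as a cap on the overhead multiplier, and the gradual decrease between OOM events is not modeled because its rate is not specified on this page.

```python
MAX_MULTIPLIER = 2.5   # cap from the text: 2.5x the original allocation
OOM_STEP = 0.10        # each OOM event adds 10 percentage points of overhead


def overhead_after_oom(current_multiplier: float) -> float:
    """Multiplier after another OOM event, starting from the current (possibly decayed) value."""
    return min(round(current_multiplier + OOM_STEP, 2), MAX_MULTIPLIER)
```

Starting from no extra overhead, `overhead_after_oom(1.0)` gives x1.1; if the overhead has decayed to x1.05 when the next OOM occurs, `overhead_after_oom(1.05)` gives x1.15, matching the example above.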

Recommendation regeneration

After adjusting the configuration, the OOM handler invokes the recommendation generator with the latest settings. This process does the following:

  1. Generates a new recommendation based on the adjusted configuration.
  2. Emits a recommendation generated event.
  3. Applies the new recommendation asynchronously.

Handling memory pressure events

The Workload Autoscaler has built-in logic to handle memory pressure events and prevent pod evictions due to out-of-memory (OOM) issues. This feature is particularly important in tightly packed clusters where increased pod memory usage might trigger eviction events before the OOM kill, and pods might enter eviction loops due to unaddressed memory pressure.

Understanding memory pressure evictions vs OOM kills

It's important to distinguish between memory pressure evictions and OOM kills:

  • Memory pressure evictions: These occur when a node is under memory pressure but before the pod reaches its memory limit. Evictions happen in the range between a pod's memory request and its limit. The kubelet on the node may evict pods to free up resources and maintain node stability.
  • OOM kills: These happen when a container exceeds its memory limit. The kernel's OOM killer terminates the container immediately.

How it works

When a memory pressure eviction event is detected, the autoscaler follows this process:

  1. Check if the pod experiencing the eviction is managed by CAST AI.
  2. Determine if this pod is the one causing the memory pressure.
  3. Verify if the pod's memory usage data is available in the event.
  4. If all conditions are met, create a system override for minimum memory adjustment.

📘

Minimum memory adjustment calculation

  • Pod's memory usage at the time of the eviction event
  • Plus the configured memory overhead (as defined in the vertical scaling policy)

This adjustment is applied when generating the next recommendation for the pod and remains in effect for 8 hours to ensure stability.
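
As a worked illustration of the calculation, the sketch below treats the overhead as a fraction added on top of the observed usage and models the 8-hour lifetime of the override. The names are hypothetical, and whether the override expires exactly at the 8-hour mark is an assumption.

```python
from datetime import datetime, timedelta, timezone

OVERRIDE_TTL = timedelta(hours=8)  # how long the adjustment stays in effect


def minimum_memory_mib(usage_at_eviction_mib: float, overhead: float) -> float:
    """Memory floor: usage seen in the eviction event plus the policy's memory overhead."""
    return usage_at_eviction_mib * (1 + overhead)


def override_active(created_at: datetime, now: datetime) -> bool:
    """True while the override is younger than its 8-hour lifetime."""
    return now - created_at < OVERRIDE_TTL


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
floor = minimum_memory_mib(800, overhead=0.10)  # pod used 800 MiB when evicted
still_active = override_active(created, created + timedelta(hours=7))
```

In this example, a pod evicted while using 800 MiB under a policy with 10% memory overhead would get a floor of roughly 880 MiB for recommendations generated in the following 8 hours.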

Criteria for adjustment

It's important to note that this mechanism only applies to pods that meet the following criteria:

Criterion | Requirement
Management | CAST AI must manage the pod
Cause | The pod must be directly causing memory pressure
Data Availability | Memory usage data must be available in the eviction event

🚧

Notice

No automatic adjustment is made for pods that don't meet all these criteria.

This targeted approach ensures that the autoscaler efficiently addresses the root cause of memory pressure events.

Troubleshooting

Failed Helm test hooks

When installing the Workload Autoscaler, a Helm test hook is executed to verify proper functionality. The Workload Autoscaler is an in-cluster component that applies recommendations to workloads using two main mechanisms.

The test validates these two key mechanisms:

  • Reconciliation loop on Recommendation objects
  • Admission webhook for Kubernetes Pod objects

A failed test indicates one or both mechanisms aren't working correctly, which prevents workload optimization. Follow these steps to identify and resolve the issue:

Verify Recommendation CRD installation

  1. Check the CRD status:
    kubectl describe crd recommendations.autoscaling.cast.ai

    The output should describe the object without any failure status.

  2. Verify API server metrics:
    kubectl get --raw /metrics | grep apiserver_request_total | grep recommendations

    The output should show metrics for Recommendation objects being queried from the Kubernetes API server.

  3. Review Workload Autoscaler logs for reconciliation messages containing "reconciling recommendation". This shows that the reconciliation loop is being invoked.

Check admission webhook invocation

  1. Verify webhook metrics:
    kubectl get --raw /metrics | grep apiserver_admission_webhook | grep workload-autoscaler.cast.ai

    The output should show metrics from the Kubernetes API server invoking the admission webhook.

  2. Check Workload Autoscaler logs for pod mutation messages containing "mutating pod resources". This shows that the admission webhook is being invoked, and the Kubernetes pod is being mutated.

Validate port configuration

Check that ports are aligned across components, i.e., the admission webhook is pointing to the port exposed by the service, and the service is pointing to the port exposed by the Workload Autoscaler:

  1. Check admission webhook port:
    kubectl describe mutatingwebhookconfigurations castai-workload-autoscaler
  2. Check service port:
    kubectl describe service castai-workload-autoscaler
  3. Check deployment port:
    kubectl describe deployment castai-workload-autoscaler

    Look for the WEBHOOK_PORT environment variable value.

Review other admission webhooks

  1. List all webhooks potentially affecting pod creation:
    kubectl describe mutatingwebhookconfigurations,validatingwebhookconfigurations

    Look for webhooks invoked for v1/CREATE/pods.

  2. Verify that other webhooks aren't blocking or modifying Recommendations.

  3. If webhook conflicts occur, try setting webhook.reinvocationPolicy=IfNeeded to retrigger the Workload Autoscaler admission webhook after pod modifications.

Additional checks

  • Verify that Security Group Rules or Firewall Rules allow access to the Workload Autoscaler Pod/Service from the Kubernetes API server on port 443/9443 (or custom configured port).
  • Check for automation tools like ArgoCD or other Kubernetes controllers or operators that might revert or modify applied recommendations.

The workload-autoscaler is OOMKilled

If your Workload Autoscaler pods are being terminated due to OOMKilled errors, follow the steps outlined below in order.

Resolution steps

  1. Ensure you are using the latest Workload Autoscaler version.

    • To check the installed version, run the following:
      helm get metadata castai-workload-autoscaler -n castai-agent
      
    • To upgrade to the latest version, run:
      helm repo update castai-helm
      helm upgrade castai-workload-autoscaler castai-helm/castai-workload-autoscaler -n castai-agent --reuse-values
      
  2. Upgrade the Kubernetes cluster to version 1.32 or higher: Kubernetes version 1.32 introduced improved memory handling features.

  3. Enable pagination for the initial list request: If upgrading the Kubernetes cluster is not feasible, you can enable pagination to split the initial list requests into smaller chunks, reducing memory usage.

    📘

    Note

    This may increase startup time and place additional load on the API server.

    To enable pagination, run:

    helm upgrade castai-workload-autoscaler castai-helm/castai-workload-autoscaler -n castai-agent --reuse-values --set additionalEnv.LIST_WATCHER_FORCE_PAGINATION=true
    
  4. If none of the steps above help, consider increasing the default Workload Autoscaler memory limit.

    For example, to change the memory limit to 1Gi, run:

    helm upgrade castai-workload-autoscaler castai-helm/castai-workload-autoscaler -n castai-agent --reuse-values --set resources.limits.memory=1Gi