Workload Autoscaler Configuration

Workload Autoscaling can be configured in different ways to suit your specific needs: via the Cast AI API (or the corresponding fields in the UI), or at the workload level using workload annotations.

Upgrading

Currently, the workload autoscaler is installed as an in-cluster component via Helm and can be upgraded by running the following command:

helm upgrade -i castai-workload-autoscaler -n castai-agent castai-helm/castai-workload-autoscaler --reuse-values

Dynamically Injected containers

By default, containers that are injected at runtime (e.g., istio-proxy) won't be managed by the workload autoscaler, and recommendations won't be applied to them. To enable management of injected containers, configure the in-cluster component with the following command:

helm upgrade castai-workload-autoscaler castai-helm/castai-workload-autoscaler -n castai-agent --reuse-values --set webhook.reinvocationPolicy=IfNeeded

Available Workload Settings

The following settings are currently available to configure Cast AI Workload Autoscaling:

  • Automation - on/off; determines whether Cast AI applies recommendations or only generates them.
  • Scaling policy - selects the scaling policy by name. It must be one of the policies available for the cluster.
  • Recommendation Percentile - the usage percentile Cast AI bases its recommendation on, looking at the last day of usage. The recommendation is the average of the target percentile across all pods over the recommendation period. Setting the percentile to 100% uses the maximum observed value over the period instead of the average across pods.
  • Overhead - how many extra resources are added on top of the recommendation. By default, it's set to 10% for memory and 0% for CPU.
  • Optimization Threshold - when automation is enabled, the minimum difference between the current pod requests and the new recommendation required for the recommendation to be applied immediately. Defaults to 10% for both memory and CPU.
  • Workload autoscaler constraints - sets the minimum and maximum resource values; the workload autoscaler will not scale CPU/memory above the maximum or below the minimum. The limits apply to all containers.
  • Ignore startup metrics - allows excluding a specified duration of startup metrics from recommendation calculations for workloads with high initial resource usage (e.g., Java applications).
  • Look-back period - defines a custom timeframe (between 24 hours and 7 days) the Workload Autoscaler uses to observe CPU and memory usage when calculating scaling recommendations. It can be set separately for CPU and memory.
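
For illustration (hypothetical numbers): if the 80th percentile of CPU usage over the look-back period is 500m and the CPU overhead is set to 10%, the recommendation would be roughly 550m (500m plus the 10% buffer). With an optimization threshold of 10%, that recommendation is applied immediately only if it differs from the current CPU request by more than 10%.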

📘

Note

It is recommended to wait a week before enabling Workload Autoscaling for all workloads, so that the system can learn how resource consumption varies between weekdays and weekends.

Ignore startup metrics

Some workloads, notably Java and .NET applications, may have increased resource usage during startup that can negatively impact autoscaling recommendations. To address this, Cast AI allows you to ignore startup metrics for a specified duration when calculating workload autoscaling recommendations.

You can configure this setting in the Cast AI console under Advanced Settings of a vertical scaling policy:

Startup metrics at the policy level

  1. Enable the feature by checking the "Ignore workload startup metrics" box.
  2. Set the duration to exclude from recommendation generation after a workload starts (between 2 and 60 minutes).

This feature helps prevent inflated recommendations and unnecessary restarts caused by temporary resource spikes during application initialization.

You can also configure this setting via the API or Terraform.
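
The same setting can also be expressed per workload using the configuration annotation described later on this page. A minimal sketch (the 5-minute period is illustrative):

workloads.cast.ai/configuration: |
  vertical:
    optimization: on
    startup:
      period: 5m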

Look-back period

The look-back period defines the timeframe the Workload Autoscaler uses to observe CPU and memory usage when calculating scaling recommendations. This feature allows you to customize the historical data window used for generating recommendations, which can be particularly useful for workloads with atypical resource usage patterns.

You can configure the look-back period in the Cast AI console under Advanced Settings of a vertical scaling policy:

Look-back period in Advanced Settings

  1. Set the look-back period for CPU and memory separately.
  2. Specify the duration in days (d) and hours (h). The minimum allowed period is 24 hours, and the maximum is 7 days.

This feature allows you to:

  • Adjust the recommendation window based on your workload's specific resource usage patterns.
  • Account for longer-term trends or cyclical resource usage in your applications.

You can configure this setting at different levels:

  • Policy level: Apply the setting to all workloads assigned to a specific scaling policy.
  • Individual workload level: Configure the setting for a specific workload using annotations or the UI, overriding policy-level settings.

The look-back period can also be configured via Annotations, the API, or Terraform.
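
As a reference, a minimal annotation sketch that sets the look-back period per workload (the durations are illustrative and must stay within the 24-hour to 7-day range):

workloads.cast.ai/configuration: |
  vertical:
    optimization: on
    cpu:
      lookBackPeriod: 48h
    memory:
      lookBackPeriod: 168h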

Choosing the right look-back period

The optimal look-back period largely depends on your workload's resource usage patterns. Most applications benefit from a shorter look-back period of 1-2 days. This approach works particularly well for standard web applications, capturing daily usage patterns while maintaining high responsiveness to changes. Shorter periods enable more aggressive optimization and often lead to higher savings.

Some workloads, however, require longer observation periods of 3-7 days. Applications with significant differences between weekday and weekend usage patterns benefit from a 7-day period to capture these weekly variations. Batch processing jobs that run every few days need a look-back period that covers at least one full job cycle to prevent potential out-of-memory (OOM) situations.

Common use cases and recommended periods:

  • Standard web applications: 1-2 days captures daily patterns while maintaining responsiveness to changes
  • Batch processing jobs: Set to cover at least one full job cycle to account for periodic resource spikes
  • Weekend-sensitive workloads: 7 days to capture both weekday and weekend patterns
  • Variable workloads: Start with 1-2 days and adjust based on observed scaling behavior

💡

Tip

For workloads with variable or uncertain patterns, start with a shorter period and adjust based on observed behavior. The key is to match the look-back period to your application's actual resource usage patterns – whether that's daily consistency, weekly cycles, or periodic processing jobs.

Custom workload support

The workload autoscaler supports the scaling of custom workloads through label-based selection. This allows autoscaling for:

  • Bare pods (pods without controllers)
  • Pods created programmatically (such as Spark executors or Airflow workers)
  • Jobs without parent controllers
  • Workloads with custom controllers not natively supported by Cast AI
  • Groups of related workloads that should be scaled together

Label-based workload selection

To enable autoscaling for custom workloads, add the workloads.cast.ai/custom-workload label to the Pod template specification. This is crucial - the label must be present in the Pod template, not just on the controller or running Pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    workloads.cast.ai/custom-workload: "my-custom-workload"
spec:
  containers:
    - name: app

Workloads with the same label value will be treated as a single workload for autoscaling purposes. The value acts as a unique identifier for the workload group.

Workloads are uniquely identified by their:

  • Namespace
  • Label value
  • Controller kind

Configuring autoscaling behavior

Both labels and annotations used to configure autoscaling behavior must be specified in the Pod template specification, not on the controller or running Pod.

Key points about label-based workload configuration:

  • Workloads are grouped per controller kind (Deployments and StatefulSets with the same label will be treated as separate workloads)
  • For grouped workloads, the newest/latest matching controller's pod template configuration is used as the workload specification
  • Only workloads with the workloads.cast.ai/custom-workload label will be discovered for custom workload autoscaling
  • The label value must be unique for each distinct workload or group of workloads you want to scale together
  • All configuration labels and annotations must be specified in the Pod template specification

Examples

Scale a bare pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    workloads.cast.ai/custom-workload: "standalone-pod"
spec:
  containers:
    - name: app
      # Container spec...

Group related jobs:

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      labels:
        workloads.cast.ai/custom-workload: "batch-processors"
    spec:
      containers:
        - name: processor
          # Container spec...

Schedule recurring workloads:

apiVersion: batch/v1
kind: CronJob
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            workloads.cast.ai/custom-workload: "scheduled-processor"
        spec:
          containers:
            - name: processor
              # Container spec...

Scale workloads with custom controllers:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-managed-app
  ownerReferences: # Custom controller resource
    - apiVersion: customcontroller.example.com/v1alpha1
      kind: CustomResourceType
      name: custom-resource
      uid: abc123
      controller: true
spec:
  template:
    metadata:
      labels:
        workloads.cast.ai/custom-workload: "custom-controlled-app"
    spec:
      containers:
        - name: app
          # Container spec...
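
Combine the label with annotation-based configuration (a sketch; the group name and settings are illustrative and use the annotation format described later on this page):

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      labels:
        workloads.cast.ai/custom-workload: "batch-processors"
      annotations:
        workloads.cast.ai/configuration: |
          vertical:
            optimization: on
            applyType: deferred
    spec:
      containers:
        - name: processor
          # Container spec...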

The workload autoscaler will track and scale these workloads based on resource usage patterns, applying the same autoscaling policies and recommendations as standard workloads, except:

  • These workloads are only scaled vertically using Vertical Pod Autoscaling (VPA)
  • Only the deferred recommendation mode is supported

📘

Note

Custom workload autoscaling uses deferred mode, meaning recommendations are only applied when pods are naturally restarted. This helps ensure safe scaling behavior for workloads without native scaling support.

Configuration via API/UI

The settings described above can be configured via the Cast AI API or directly in the UI.

Configuration via Annotations

All settings can also be configured by adding annotations to the workload controller. When the workloads.cast.ai/configuration annotation is detected on a workload, it is considered to be configured by annotations. This allows for flexible configuration, combining annotations and scaling policies.

Changes to the settings via the API/UI are no longer permitted for workloads with annotations. The default or scaling policy value is used when a workload does not have an annotation for a specific setting.

Annotation values take precedence over what is defined in a scaling policy. If a scaling policy is referenced in the workload's configuration annotation, any configuration options defined under the annotation override the corresponding policy values. Options not defined under the annotation fall back to the scaling policy values or system defaults.

Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    workloads.cast.ai/configuration: |
      scalingPolicyName: custom
      vertical:
        optimization: on
        applyType: immediate
        antiAffinity:
          considerAntiAffinity: false
        startup:
          period: 5m
        confidence:
          threshold: 0.5
        cpu:
          target: p81
          lookBackPeriod: 25h
          min: 1000m
          max: 2500m
          applyThreshold: 0.2
          overhead: 0.15
          limit:
            type: multiplier
            multiplier: 2.0
        memory:
          target: max
          lookBackPeriod: 30h
          min: 2Gi
          max: 10Gi
          applyThreshold: 0.25
          overhead: 0.35
          limit:
            type: noLimit
        downscaling:
          applyType: immediate
        memoryEvent:
          applyType: immediate
      horizontal:
        optimization: on
        minReplicas: 5
        maxReplicas: 10
        scaleDown:
          stabilizationWindow: 5m
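
If you prefer not to edit the manifest directly, the same annotation can be applied to an existing controller with kubectl (a sketch; the Deployment name and settings are illustrative):

kubectl annotate deployment my-app --overwrite \
  workloads.cast.ai/configuration='vertical:
  optimization: on
  applyType: deferred'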

Configuration Structure

Below is a configuration structure reference for setting up a workload to be controlled by annotations.

📘

Note

workloads.cast.ai/configuration must be a valid YAML string. If the annotation contains invalid YAML, the entire configuration is ignored.

scalingPolicyName

If not set, the system will use the default scaling policy.

Field: scalingPolicyName | Type: string | Required: No | Default: "default"
Description: Specifies the scaling policy name to use. When set, this annotation allows the workload to be managed by both annotations and the specified scaling policy. The scaling policy can control global settings like enabling/disabling vertical autoscaling.
scalingPolicyName: custom-policy

vertical

Field: vertical | Type: object | Required: No | Default: -
Description: Vertical scaling configuration.
vertical:
  optimization: on
  applyType: immediate
  antiAffinity:
    considerAntiAffinity: false
  startup:
    period: 5m
  confidence:
    threshold: 0.5

vertical.optimization

Field: optimization | Type: string | Required: Yes* | Default: -
Description: Enable vertical scaling ("on"/"off").

*If using the vertical configuration option, this field becomes required.
vertical:
  optimization: on

vertical.applyType

Field: applyType | Type: string | Required: No | Default: "immediate"
Description: Allows configuring the autoscaler operating mode to apply the recommendations.
Use immediate to apply recommendations as soon as the thresholds are passed.
Note: immediate mode can cause pod restarts.
Use deferred to apply recommendations only on natural pod restarts.
vertical:
  applyType: immediate

vertical.antiAffinity

Field: antiAffinity | Type: object | Required: No | Default: -
Description: Configuration for handling pod anti-affinity scheduling constraints.
vertical:
  antiAffinity:
    considerAntiAffinity: false

vertical.antiAffinity.considerAntiAffinity

Field: considerAntiAffinity | Type: boolean | Required: Yes* | Default: false
Description: When true, workload autoscaler will respect pod anti-affinity rules when making scaling decisions.

*If using the vertical.antiAffinity configuration option, this field becomes required.
vertical:
  antiAffinity:
    considerAntiAffinity: false

vertical.startup

Field: startup | Type: object | Required: No | Default: -
Description: Configuration for handling workload startup behavior. See Ignore startup metrics.
vertical:
  startup:
    period: 5m

vertical.startup.period

Field: period | Type: duration | Required: Yes* | Default: "0m"
Description: Duration to ignore resource usage metrics after workload startup. Useful for applications with high initial resource usage spikes.

*If using the vertical.startup configuration option, this field becomes required.
vertical:
  startup:
    period: 5m

vertical.confidence

Field: confidence | Type: object | Required: No | Default: -
Description: Configuration for recommendation confidence thresholds.
vertical:
  confidence:
    threshold: 0.5

vertical.confidence.threshold

Field: threshold | Type: float | Required: Yes* | Default: 0.9
Description: Minimum confidence score required to apply recommendations (0.0-1.0). Higher values require more data points for recommendations.

*If using the vertical.confidence configuration option, this field becomes required.
vertical:
  confidence:
    threshold: 0.5

vertical.cpu

Field: cpu | Type: object | Required: No | Default: -
Description: CPU-specific scaling configuration.
vertical:
  cpu:
    target: p80
    lookBackPeriod: 24h
    min: 100m
    max: 1000m
    applyThreshold: 0.1
    overhead: 0.0

vertical.cpu.target

Field: target | Type: string | Required: No | Default: "p80"
Description: Resource usage target:
- max - Use maximum observed usage
- p{0-99} - Use percentile (e.g., p80 for 80th percentile).
vertical:
  cpu:
    target: p80

vertical.cpu.lookBackPeriod

Field: lookBackPeriod | Type: duration | Required: No | Default: "24h"
Description: Historical resource usage data window to consider for recommendations (24h-168h). See Look-back period.
vertical:
  cpu:
    lookBackPeriod: 24h

vertical.cpu.min

Field: min | Type: string | Required: No | Default: "10m"
Description: The lower limit for the recommendation. Uses standard Kubernetes CPU notation (e.g., "1000m" or "1"). Min cannot be greater than max.
vertical:
  cpu:
    min: 100m

vertical.cpu.max

Field: max | Type: string | Required: No | Default: -
Description: The upper limit for the recommendation. Uses standard Kubernetes CPU notation (e.g., "1000m" or "1"). Recommendations won't exceed this value.
vertical:
  cpu:
    max: 1000m

vertical.cpu.applyThreshold

Field: applyThreshold | Type: float | Required: No | Default: 0.1
Description: The amount the recommendation should differ from the requests so that it can be applied. For example, a 10% difference would be expressed as 0.1. Value range: 0.01-2.5.
vertical:
  cpu:
    applyThreshold: 0.1

vertical.cpu.overhead

Field: overhead | Type: float | Required: No | Default: 0.0
Description: Additional resource buffer when applying recommendations (0.0-2.5, e.g., 0.1 = 10%). If a 10% buffer is configured, the issued recommendation will have +10% added to it, so that the workload can handle further increased resource demand.
vertical:
  cpu:
    overhead: 0.0

vertical.cpu.limit

Field: limit | Type: object | Required: No | Default: -
Description: Configuration for container CPU limit scaling.
vertical:
  cpu:
    limit:
      type: multiplier
      multiplier: 2.0

vertical.cpu.limit.type

Field: type | Type: string | Required: Yes* | Default: -
Description: Type of limit scaling to apply:
- noLimit - Don't modify limits
- multiplier - Set limit as a multiplier of requests

*If using the vertical.cpu.limit configuration option, this field becomes required.
vertical:
  cpu:
    limit:
      type: multiplier

vertical.cpu.limit.multiplier

Field: multiplier | Type: float | Required: Yes* | Default: -
Description: Value to multiply the requests by to set the limit (e.g., 2.0 means limit = 2 * requests).

*Required when type is set to multiplier.
vertical:
  cpu:
    limit:
      type: multiplier
      multiplier: 2.0

vertical.memory

Field: memory | Type: object | Required: No | Default: -
Description: Memory-specific scaling configuration.
vertical:
  memory:
    target: max
    lookBackPeriod: 24h
    min: 128Mi
    max: 2Gi
    applyThreshold: 0.1
    overhead: 0.1

vertical.memory.target

Field: target | Type: string | Required: No | Default: "max"
Description: Resource usage target:
- max - Use maximum observed usage
- p{0-99} - Use percentile (e.g., p80 for 80th percentile).
vertical:
  memory:
    target: max

vertical.memory.lookBackPeriod

Field: lookBackPeriod | Type: duration | Required: No | Default: "24h"
Description: Historical resource usage data window to consider for recommendations (24h-168h). See Look-back period.
vertical:
  memory:
    lookBackPeriod: 24h

vertical.memory.min

Field: min | Type: string | Required: No | Default: "10Mi"
Description: The lower limit for the recommendation. Uses standard Kubernetes memory notation (e.g., "2Gi", "1000Mi").
vertical:
  memory:
    min: 128Mi

vertical.memory.max

Field: max | Type: string | Required: No | Default: -
Description: The upper limit for the recommendation. Uses standard Kubernetes memory notation (e.g., "2Gi", "1000Mi").
vertical:
  memory:
    max: 2Gi

vertical.memory.applyThreshold

Field: applyThreshold | Type: float | Required: No | Default: 0.1
Description: The amount the recommendation should differ from the requests so that it can be applied. For example, a 10% difference would be expressed as 0.1. Value range: 0.01-2.5.
vertical:
  memory:
    applyThreshold: 0.1

vertical.memory.overhead

Field: overhead | Type: float | Required: No | Default: 0.1
Description: Additional resource buffer when applying recommendations (0.0-2.5, e.g., 0.1 = 10%). If a 10% buffer is configured, the issued recommendation will have +10% added to it, so that the workload can handle further increased resource demand.
vertical:
  memory:
    overhead: 0.1

vertical.memory.limit

Field: limit | Type: object | Required: No | Default: -
Description: Configuration for container memory limit scaling.
vertical:
  memory:
    limit:
      type: multiplier  
      multiplier: 1.5

vertical.memory.limit.type

Field: type | Type: string | Required: Yes* | Default: -
Description: Type of limit scaling to apply:
- noLimit - Don't modify limits
- multiplier - Set limit as a multiplier of requests

*If using the vertical.memory.limit configuration option, this field becomes required.
vertical:
  memory:
    limit:
      type: multiplier

vertical.memory.limit.multiplier

Field: multiplier | Type: float | Required: Yes* | Default: -
Description: Value to multiply the requests by to set the limit (e.g., 1.5 means limit = 1.5 * requests).

*Required when type is set to multiplier.
vertical:
  memory:
    limit:
      type: multiplier  
      multiplier: 1.5

vertical.downscaling

Field: downscaling | Type: object | Required: No | Default: -
Description: Downscaling behavior override.
vertical:
  downscaling:
    applyType: immediate

vertical.downscaling.applyType

Field: applyType | Type: string | Required: No | Default: Taken from the vertical scaling policy controlling the workload
Description: Override application mode:
- immediate - Apply changes immediately
- deferred - Apply during natural restarts
vertical:
  downscaling:
    applyType: immediate

vertical.memoryEvent

Field: memoryEvent | Type: object | Required: No | Default: -
Description: Memory event behavior override.
vertical:
  memoryEvent:
    applyType: immediate

vertical.memoryEvent.applyType

Field: applyType | Type: string | Required: Yes* | Default: Taken from the vertical scaling policy controlling the workload
Description: Override application mode for memory-related events (OOM kills, pressure):
- immediate - Apply changes immediately
- deferred - Apply during natural restarts

*If using the vertical.memoryEvent configuration option, this field becomes required.

This configuration option is fully compatible with the other applyType options and is meant to be used in combination with them, allowing fine-grained control over both upscaling and downscaling. Here's how it interacts with vertical.downscaling.applyType:

  1. If both configuration options are set to the same value (both immediate or both deferred), the behavior remains unchanged.
  2. If vertical.downscaling.applyType is set to immediate and vertical.memoryEvent.applyType is set to deferred:
    • Downscaling operations will be applied immediately.
    • Memory-event-driven upscaling operations will be deferred to natural pod restarts.
  3. If vertical.downscaling.applyType is set to deferred and vertical.memoryEvent.applyType is set to immediate:
    • Downscaling operations will be deferred to natural pod restarts.
    • Memory-event-driven upscaling operations will be applied immediately.
vertical:
  memoryEvent:
    applyType: immediate

horizontal

Field: horizontal | Type: object | Required: No | Default: -
Description: Horizontal scaling configuration.
horizontal:
  optimization: on
  minReplicas: 1
  maxReplicas: 10
  scaleDown:
    stabilizationWindow: 5m
  shortAverage: 3m

horizontal.optimization

Field: optimization | Type: string | Required: Yes* | Default: -
Description: Enable horizontal scaling ("on"/"off").

*If using the horizontal configuration option, this field becomes required.
horizontal:
  optimization: on

horizontal.minReplicas

Field: minReplicas | Type: integer | Required: Yes* | Default: -
Description: Minimum number of replicas.
horizontal:
  minReplicas: 1

horizontal.maxReplicas

Field: maxReplicas | Type: integer | Required: Yes* | Default: -
Description: Maximum number of replicas.
horizontal:
  maxReplicas: 10

horizontal.scaleDown

Field: scaleDown | Type: object | Required: No | Default: -
Description: Houses scale-down configuration options.

horizontal.scaleDown.stabilizationWindow

Field: stabilizationWindow | Type: duration | Required: Yes* | Default: "5m"
Description: Cooldown period between scale-downs.

*If using the horizontal.scaleDown configuration option, this field becomes required.
horizontal:
  scaleDown:
    stabilizationWindow: 5m

*Required if the parent object is present in the configuration.

🚧

Legacy Annotation Support

For documentation on the legacy annotation format, which is now deprecated, see the Legacy Annotations Reference page.

Migration Guide

📘

Note

The v2 annotation structure cannot be combined with the deprecated v1 annotations. When the workloads.cast.ai/configuration annotation is detected, the workload is considered to be configured by that annotation, and all other annotations starting with workloads.cast.ai are ignored.

To migrate from v1 to v2 annotations:

  1. Remove all individual legacy workloads.cast.ai/* annotations
  2. Add the new workloads.cast.ai/configuration annotation
  3. Move all settings into the YAML structure under the new annotation

For example, these v1 annotations:

workloads.cast.ai/vertical-autoscaling: "on"
workloads.cast.ai/cpu-target: "p80"
workloads.cast.ai/memory-max: "2Gi"

Would become:

workloads.cast.ai/configuration: |
  vertical:
    optimization: on
    cpu:
      target: p80
    memory:
      max: 2Gi