Workload Autoscaler Configuration

Workload Autoscaling can be configured in different ways to suit your specific needs: via the Cast AI API (or the corresponding fields in the UI), or at the workload level using workload annotations.

Upgrading

Currently, the workload autoscaler is installed as an in-cluster component via Helm and can be upgraded by running the following command:

helm upgrade -i castai-workload-autoscaler -n castai-agent castai-helm/castai-workload-autoscaler --reuse-values

Dynamically Injected containers

By default, containers that are injected at runtime (e.g., istio-proxy) won't be managed by the workload autoscaler, and recommendations won't be applied to them. To enable management of injected containers, configure the in-cluster component with the following command:

helm upgrade castai-workload-autoscaler castai-helm/castai-workload-autoscaler -n castai-agent --reuse-values --set webhook.reinvocationPolicy=IfNeeded

Available Workload Settings

The following settings are currently available to configure Cast AI Workload Autoscaling:

  • Automation - on/off; determines whether Cast AI applies recommendations or only generates them.
  • Scaling policy - selects the scaling policy by name. It must be one of the policies available for the cluster.
  • Recommendation Percentile - the usage percentile Cast AI bases its recommendation on, looking at the last day of usage. The recommendation is the average of the target percentile across all pods over the recommendation period. Setting the percentile to 100% uses the maximum observed value over the period instead of the average across pods.
  • Overhead - how many extra resources are added on top of the recommendation. By default, it's set to 10% for memory and 0% for CPU.
  • Optimization Threshold - when automation is enabled, the minimum difference between the current pod requests and the new recommendation required for the recommendation to be applied immediately. Defaults to 10% for both memory and CPU.
  • Workload autoscaler constraints - sets the minimum and maximum resource values; the workload autoscaler will not scale CPU/memory above the maximum or below the minimum. The limits apply to all containers.
  • Ignore startup metrics - allows excluding a specified duration of startup metrics from recommendation calculations for workloads with high initial resource usage (e.g., Java applications).
  • Look-back period - defines a custom timeframe (between 24 hours and 7 days) the Workload Autoscaler uses to observe CPU and memory usage when calculating scaling recommendations. It can be set separately for CPU and memory.
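
For illustration (hypothetical numbers): if the 80th percentile of CPU usage over the look-back period is 500m and the CPU overhead is set to 10%, the recommendation would be roughly 550m (500m plus the 10% buffer). With an optimization threshold of 10%, that recommendation is applied immediately only if it differs from the current CPU request by more than 10%.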

📘

Note

It is recommended to wait a week before enabling Workload Autoscaling for all workloads, so that the system can learn how resource consumption varies between weekdays and weekends.

Ignore startup metrics

Some workloads, notably Java and .NET applications, may have increased resource usage during startup that can negatively impact autoscaling recommendations. To address this, Cast AI allows you to ignore startup metrics for a specified duration when calculating workload autoscaling recommendations.

You can configure this setting in the Cast AI console under Advanced Settings of a vertical scaling policy:

Startup metrics at the policy level

  1. Enable the feature by checking the "Ignore workload startup metrics" box.
  2. Set the duration to exclude from recommendation generation after a workload starts (between 2 and 60 minutes).

This feature helps prevent inflated recommendations and unnecessary restarts caused by temporary resource spikes during application initialization.

You can also configure this setting via the API or Terraform.
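
The same setting can also be expressed per workload using the configuration annotation described later on this page. A minimal sketch (the 5-minute period is illustrative):

workloads.cast.ai/configuration: |
  vertical:
    optimization: on
    startup:
      period: 5m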

Look-back period

The look-back period defines the timeframe the Workload Autoscaler uses to observe CPU and memory usage when calculating scaling recommendations. This feature allows you to customize the historical data window used for generating recommendations, which can be particularly useful for workloads with atypical resource usage patterns.

You can configure the look-back period in the Cast AI console under Advanced Settings of a vertical scaling policy:

Look-back period in Advanced Settings

  1. Set the look-back period for CPU and memory separately.
  2. Specify the duration in days (d) and hours (h). The minimum allowed period is 24 hours, and the maximum is 7 days.

This feature allows you to:

  • Adjust the recommendation window based on your workload's specific resource usage patterns.
  • Account for longer-term trends or cyclical resource usage in your applications.

You can configure this setting at different levels:

  • Policy level: Apply the setting to all workloads assigned to a specific scaling policy.
  • Individual workload level: Configure the setting for a specific workload using annotations or the UI, overriding policy-level settings.

The look-back period can also be configured via Annotations, the API, or Terraform.
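
As a reference, a minimal annotation sketch that sets the look-back period per workload (the durations are illustrative and must stay within the 24-hour to 7-day range):

workloads.cast.ai/configuration: |
  vertical:
    optimization: on
    cpu:
      lookBackPeriod: 48h
    memory:
      lookBackPeriod: 168h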

Choosing the right look-back period

The optimal look-back period largely depends on your workload's resource usage patterns. Most applications benefit from a shorter look-back period of 1-2 days. This approach works particularly well for standard web applications, capturing daily usage patterns while maintaining high responsiveness to changes. Shorter periods enable more aggressive optimization and often lead to higher savings.

Some workloads, however, require longer observation periods of 3-7 days. Applications with significant differences between weekday and weekend usage patterns benefit from a 7-day period to capture these weekly variations. Batch processing jobs that run every few days need a look-back period that covers at least one full job cycle to prevent potential out-of-memory (OOM) situations.

Common use cases and recommended periods:

  • Standard web applications: 1-2 days captures daily patterns while maintaining responsiveness to changes
  • Batch processing jobs: Set to cover at least one full job cycle to account for periodic resource spikes
  • Weekend-sensitive workloads: 7 days to capture both weekday and weekend patterns
  • Variable workloads: Start with 1-2 days and adjust based on observed scaling behavior

💡

Tip

For workloads with variable or uncertain patterns, start with a shorter period and adjust based on observed behavior. The key is to match the look-back period to your application's actual resource usage patterns – whether that's daily consistency, weekly cycles, or periodic processing jobs.

Custom workload support

The workload autoscaler supports the scaling of custom workloads through label-based selection. This allows autoscaling for:

  • Bare pods (pods without controllers)
  • Pods created programmatically (such as Spark executors or Airflow workers)
  • Jobs without parent controllers
  • Workloads with custom controllers not natively supported by Cast AI
  • Groups of related workloads that should be scaled together

Label-based workload selection

To enable autoscaling for custom workloads, add the workloads.cast.ai/custom-workload label to the Pod template specification. This is crucial - the label must be present in the Pod template, not just on the controller or running Pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    workloads.cast.ai/custom-workload: "my-custom-workload"
spec:
  containers:
    - name: app

Workloads with the same label value will be treated as a single workload for autoscaling purposes. The value acts as a unique identifier for the workload group.

Workloads are uniquely identified by their:

  • Namespace
  • Label value
  • Controller kind

Configuring autoscaling behavior

Both labels and annotations used to configure autoscaling behavior must be specified in the Pod template specification, not on the controller or running Pod.

Key points about label-based workload configuration:

  • Workloads are grouped per controller kind (Deployments and StatefulSets with the same label will be treated as separate workloads)
  • For grouped workloads, the newest/latest matching controller's pod template configuration is used as the workload specification
  • Only workloads with the workloads.cast.ai/custom-workload label will be discovered for custom workload autoscaling
  • The label value must be unique for each distinct workload or group of workloads you want to scale together
  • All configuration labels and annotations must be specified in the Pod template specification

Examples

Scale a bare pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    workloads.cast.ai/custom-workload: "standalone-pod"
spec:
  containers:
    - name: app
      # Container spec...

Group related jobs:

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      labels:
        workloads.cast.ai/custom-workload: "batch-processors"
    spec:
      containers:
        - name: processor
          # Container spec...

Schedule recurring workloads:

apiVersion: batch/v1
kind: CronJob
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            workloads.cast.ai/custom-workload: "scheduled-processor"
        spec:
          containers:
            - name: processor
              # Container spec...

Scale workloads with custom controllers:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-managed-app
  ownerReferences: # Custom controller resource
    - apiVersion: customcontroller.example.com/v1alpha1
      kind: CustomResourceType
      name: custom-resource
      uid: abc123
      controller: true
spec:
  template:
    metadata:
      labels:
        workloads.cast.ai/custom-workload: "custom-controlled-app"
    spec:
      containers:
        - name: app
          # Container spec...
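
Combine the label with annotation-based configuration (a sketch; the group name and settings are illustrative and use the annotation format described later on this page):

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      labels:
        workloads.cast.ai/custom-workload: "batch-processors"
      annotations:
        workloads.cast.ai/configuration: |
          vertical:
            optimization: on
            applyType: deferred
    spec:
      containers:
        - name: processor
          # Container spec...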

The workload autoscaler will track and scale these workloads based on resource usage patterns, applying the same autoscaling policies and recommendations as standard workloads, except:

  • These workloads are only scaled vertically using Vertical Pod Autoscaling (VPA)
  • Only the deferred recommendation mode is supported

📘

Note

Custom workload autoscaling uses deferred mode, meaning recommendations are only applied when pods are naturally restarted. This helps ensure safe scaling behavior for workloads without native scaling support.

Configuration via API/UI

The settings described above can be configured via the Cast AI API or directly in the UI.

Configuration via Annotations

All settings can also be configured by adding annotations to the workload controller. When the workloads.cast.ai/configuration annotation is detected on a workload, it is considered to be configured by annotations. This allows for flexible configuration, combining annotations and scaling policies.

Changes to the settings via the API/UI are no longer permitted for workloads with annotations. The default or scaling policy value is used when a workload does not have an annotation for a specific setting.

Annotation values take precedence over what is defined in a scaling policy. If a scaling policy is referenced in the workload's configuration annotation, any configuration options defined under the annotation override the corresponding policy values. Options not defined under the annotation fall back to the scaling policy values or system defaults.

Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    workloads.cast.ai/configuration: |
      scalingPolicyName: custom
      vertical:
        optimization: on
        applyType: immediate
        antiAffinity:
          considerAntiAffinity: false
        startup:
          period: 5m
        confidence:
          threshold: 0.5
        cpu:
          target: p81
          lookBackPeriod: 25h
          min: 1000m
          max: 2500m
          applyThreshold: 0.2
          overhead: 0.15
          limit:
            type: multiplier
            multiplier: 2.0
        memory:
          target: max
          lookBackPeriod: 30h
          min: 2Gi
          max: 10Gi
          applyThreshold: 0.25
          overhead: 0.35
          limit:
            type: noLimit
        downscaling:
          applyType: immediate
        memoryEvent:
          applyType: immediate
      horizontal:
        optimization: on
        minReplicas: 5
        maxReplicas: 10
        scaleDown:
          stabilizationWindow: 5m
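
If you prefer not to edit the manifest directly, the same annotation can be applied to an existing controller with kubectl (a sketch; the Deployment name and settings are illustrative):

kubectl annotate deployment my-app --overwrite \
  workloads.cast.ai/configuration='vertical:
  optimization: on
  applyType: deferred'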

Configuration Structure

Below is a configuration structure reference for setting up a workload to be controlled by annotations.

📘

Note

workloads.cast.ai/configuration must be a valid YAML string. If the annotation contains invalid YAML, the entire configuration is ignored.

scalingPolicyName

If not set, the system will use the default scaling policy.

Field: scalingPolicyName | Type: string | Required: No | Default: "default"
Description: Specifies the scaling policy name to use. When set, this annotation allows the workload to be managed by both annotations and the specified scaling policy. The scaling policy can control global settings like enabling/disabling vertical autoscaling.
scalingPolicyName: custom-policy

vertical

Field: vertical | Type: object | Required: No | Default: -
Description: Vertical scaling configuration.
vertical:
  optimization: on
  applyType: immediate
  antiAffinity:
    considerAntiAffinity: false
  startup:
    period: 5m
  confidence:
    threshold: 0.5

vertical.optimization

Field: optimization | Type: string | Required: Yes* | Default: -
Description: Enable vertical scaling ("on"/"off").

*If using the vertical configuration option, this field becomes required.
vertical:
  optimization: on

vertical.applyType

Field: applyType | Type: string | Required: No | Default: "immediate"
Description: Allows configuring the autoscaler operating mode to apply the recommendations.
Use immediate to apply recommendations as soon as the thresholds are passed.
Note: immediate mode can cause pod restarts.
Use deferred to apply recommendations only on natural pod restarts.
vertical:
  applyType: immediate

vertical.antiAffinity

Field: antiAffinity | Type: object | Required: No | Default: -
Description: Configuration for handling pod anti-affinity scheduling constraints.
vertical:
  antiAffinity:
    considerAntiAffinity: false

vertical.antiAffinity.considerAntiAffinity

Field: considerAntiAffinity | Type: boolean | Required: Yes* | Default: false
Description: When true, workload autoscaler will respect pod anti-affinity rules when making scaling decisions.

*If using the vertical.antiAffinity configuration option, this field becomes required.
vertical:
  antiAffinity:
    considerAntiAffinity: false

vertical.startup

Field: startup | Type: object | Required: No | Default: -
Description: Configuration for handling workload startup behavior. See Ignore startup metrics.
vertical:
  startup:
    period: 5m

vertical.startup.period

Field: period | Type: duration | Required: Yes* | Default: "0m"
Description: Duration to ignore resource usage metrics after workload startup. Useful for applications with high initial resource usage spikes.

*If using the vertical.startup configuration option, this field becomes required.
vertical:
  startup:
    period: 5m

vertical.confidence

Field: confidence | Type: object | Required: No | Default: -
Description: Configuration for recommendation confidence thresholds.
vertical:
  confidence:
    threshold: 0.5

vertical.confidence.threshold

Field: threshold | Type: float | Required: Yes* | Default: 0.9
Description: Minimum confidence score required to apply recommendations (0.0-1.0). Higher values require more data points for recommendations.

*If using the vertical.confidence configuration option, this field becomes required.
vertical:
  confidence:
    threshold: 0.5

vertical.cpu

Field: cpu | Type: object | Required: No | Default: -
Description: CPU-specific scaling configuration.
vertical:
  cpu:
    target: p80
    lookBackPeriod: 24h
    min: 100m
    max: 1000m
    applyThreshold: 0.1
    overhead: 0.0

vertical.cpu.target

Field: target | Type: string | Required: No | Default: "p80"
Description: Resource usage target:
- max - Use maximum observed usage
- p{0-99} - Use percentile (e.g., p80 for 80th percentile).
vertical:
  cpu:
    target: p80

vertical.cpu.lookBackPeriod

Field: lookBackPeriod | Type: duration | Required: No | Default: "24h"
Description: Historical resource usage data window to consider for recommendations (24h-168h). See Look-back period.
vertical:
  cpu:
    lookBackPeriod: 24h

vertical.cpu.min

Field: min | Type: string | Required: No | Default: "10m"
Description: The lower limit for the recommendation. Uses standard Kubernetes CPU notation (e.g., "1000m" or "1"). Min cannot be greater than max.
vertical:
  cpu:
    min: 100m

vertical.cpu.max

Field: max | Type: string | Required: No | Default: -
Description: The upper limit for the recommendation. Uses standard Kubernetes CPU notation (e.g., "1000m" or "1"). Recommendations won't exceed this value.
vertical:
  cpu:
    max: 1000m

vertical.cpu.applyThreshold

Field: applyThreshold | Type: float | Required: No | Default: 0.1
Description: The amount the recommendation should differ from the requests so that it can be applied. For example, a 10% difference would be expressed as 0.1. Value range: 0.01-2.5.
vertical:
  cpu:
    applyThreshold: 0.1

vertical.cpu.overhead

Field: overhead | Type: float | Required: No | Default: 0.0
Description: Additional resource buffer when applying recommendations (0.0-2.5, e.g., 0.1 = 10%). If a 10% buffer is configured, the issued recommendation will have +10% added to it, so that the workload can handle further increased resource demand.
vertical:
  cpu:
    overhead: 0.0

vertical.cpu.limit

Field: limit | Type: object | Required: No | Default: -
Description: Configuration for container CPU limit scaling.
vertical:
  cpu:
    limit:
      type: multiplier
      multiplier: 2.0

vertical.cpu.limit.type

Field: type | Type: string | Required: Yes* | Default: -
Description: Type of limit scaling to apply:
- noLimit - Don't modify limits
- multiplier - Set limit as a multiplier of requests

*If using the vertical.cpu.limit configuration option, this field becomes required.
vertical:
  cpu:
    limit:
      type: multiplier

vertical.cpu.limit.multiplier

Field: multiplier | Type: float | Required: Yes* | Default: -
Description: Value to multiply the requests by to set the limit (e.g., 2.0 means limit = 2 * requests).

*Required when type is set to multiplier.
vertical:
  cpu:
    limit:
      type: multiplier
      multiplier: 2.0

vertical.memory

Field: memory | Type: object | Required: No | Default: -
Description: Memory-specific scaling configuration.
vertical:
  memory:
    target: max
    lookBackPeriod: 24h
    min: 128Mi
    max: 2Gi
    applyThreshold: 0.1
    overhead: 0.1

vertical.memory.target

Field: target | Type: string | Required: No | Default: "max"
Description: Resource usage target:
- max - Use maximum observed usage
- p{0-99} - Use percentile (e.g., p80 for 80th percentile).
vertical:
  memory:
    target: max

vertical.memory.lookBackPeriod

Field: lookBackPeriod | Type: duration | Required: No | Default: "24h"
Description: Historical resource usage data window to consider for recommendations (24h-168h). See Look-back period.
vertical:
  memory:
    lookBackPeriod: 24h

vertical.memory.min

Field: min | Type: string | Required: No | Default: "10Mi"
Description: The lower limit for the recommendation. Uses standard Kubernetes memory notation (e.g., "2Gi", "1000Mi").
vertical:
  memory:
    min: 128Mi

vertical.memory.max

Field: max | Type: string | Required: No | Default: -
Description: The upper limit for the recommendation. Uses standard Kubernetes memory notation (e.g., "2Gi", "1000Mi").
vertical:
  memory:
    max: 2Gi

vertical.memory.applyThreshold

Field: applyThreshold | Type: float | Required: No | Default: 0.1
Description: The amount the recommendation should differ from the requests so that it can be applied. For example, a 10% difference would be expressed as 0.1. Value range: 0.01-2.5.
vertical:
  memory:
    applyThreshold: 0.1

vertical.memory.overhead

Field: overhead | Type: float | Required: No | Default: 0.1
Description: Additional resource buffer when applying recommendations (0.0-2.5, e.g., 0.1 = 10%). If a 10% buffer is configured, the issued recommendation will have +10% added to it, so that the workload can handle further increased resource demand.
vertical:
  memory:
    overhead: 0.1

vertical.memory.limit

Field: limit | Type: object | Required: No | Default: -
Description: Configuration for container memory limit scaling.
vertical:
  memory:
    limit:
      type: multiplier  
      multiplier: 1.5

vertical.memory.limit.type

Field: type | Type: string | Required: Yes* | Default: -
Description: Type of limit scaling to apply:
- noLimit - Don't modify limits
- multiplier - Set limit as a multiplier of requests

*If using the vertical.memory.limit configuration option, this field becomes required.
vertical:
  memory:
    limit:
      type: multiplier

vertical.memory.limit.multiplier

Field: multiplier | Type: float | Required: Yes* | Default: -
Description: Value to multiply the requests by to set the limit (e.g., 1.5 means limit = 1.5 * requests).

*Required when type is set to multiplier.
vertical:
  memory:
    limit:
      type: multiplier  
      multiplier: 1.5

vertical.downscaling

Field: downscaling | Type: object | Required: No | Default: -
Description: Downscaling behavior override.
vertical:
  downscaling:
    applyType: immediate

vertical.downscaling.applyType

Field: applyType | Type: string | Required: No | Default: Taken from the vertical scaling policy controlling the workload
Description: Override application mode:
- immediate - Apply changes immediately
- deferred - Apply during natural restarts
vertical:
  downscaling:
    applyType: immediate

vertical.memoryEvent

Field: memoryEvent | Type: object | Required: No | Default: -
Description: Memory event behavior override.
vertical:
  memoryEvent:
    applyType: immediate

vertical.memoryEvent.applyType

Field: applyType | Type: string | Required: Yes* | Default: Taken from the vertical scaling policy controlling the workload
Description: Override application mode for memory-related events (OOM kills, pressure):
- immediate - Apply changes immediately
- deferred - Apply during natural restarts

*If using the vertical.memoryEvent configuration option, this field becomes required.

This configuration option is fully compatible with the other applyType options and is meant to be used in combination with them, allowing fine-grained control over both upscaling and downscaling. Here's how it interacts with vertical.downscaling.applyType:

  1. If both configuration options are set to the same value (both immediate or both deferred), the behavior remains unchanged.
  2. If vertical.downscaling.applyType is set to immediate and vertical.memoryEvent.applyType is set to deferred:
    • Downscaling operations will be applied immediately.
    • Memory-event-driven upscaling operations will be deferred to natural pod restarts.
  3. If vertical.downscaling.applyType is set to deferred and vertical.memoryEvent.applyType is set to immediate:
    • Downscaling operations will be deferred to natural pod restarts.
    • Memory-event-driven upscaling operations will be applied immediately.
vertical:
  memoryEvent:
    applyType: immediate

horizontal

Field: horizontal | Type: object | Required: No | Default: -
Description: Horizontal scaling configuration.
horizontal:
  optimization: on
  minReplicas: 1
  maxReplicas: 10
  scaleDown:
    stabilizationWindow: 5m
  shortAverage: 3m

horizontal.optimization

Field: optimization | Type: string | Required: Yes* | Default: -
Description: Enable horizontal scaling ("on"/"off").

*If using the horizontal configuration option, this field becomes required.
horizontal:
  optimization: on

horizontal.minReplicas

Field: minReplicas | Type: integer | Required: Yes* | Default: -
Description: Minimum number of replicas.
horizontal:
  minReplicas: 1

horizontal.maxReplicas

Field: maxReplicas | Type: integer | Required: Yes* | Default: -
Description: Maximum number of replicas.
horizontal:
  maxReplicas: 10

horizontal.scaleDown

Field: scaleDown | Type: object | Required: No | Default: -
Description: Houses scale-down configuration options.

horizontal.scaleDown.stabilizationWindow

Field: stabilizationWindow | Type: duration | Required: Yes* | Default: "5m"
Description: Cooldown period between scale-downs.

*If using the horizontal.scaleDown configuration option, this field becomes required.
horizontal:
  scaleDown:
    stabilizationWindow: 5m

*Required if the parent object is present in the configuration.

🚧

Legacy Annotation Support

For documentation on the legacy annotation format, which is now deprecated, see the Legacy Annotations Reference page.

Migration Guide

📘

Note

The v2 annotation structure cannot be combined with the deprecated v1 annotations. When the workloads.cast.ai/configuration annotation is detected, the workload is considered to be configured by that annotation, and all other annotations starting with workloads.cast.ai are ignored.

To migrate from v1 to v2 annotations:

  1. Remove all individual legacy workloads.cast.ai/* annotations
  2. Add the new workloads.cast.ai/configuration annotation
  3. Move all settings into the YAML structure under the new annotation

For example, these v1 annotations:

workloads.cast.ai/vertical-autoscaling: "on"
workloads.cast.ai/cpu-target: "p80"
workloads.cast.ai/memory-max: "2Gi"

Would become:

workloads.cast.ai/configuration: |
  vertical:
    optimization: on
    cpu:
      target: p80
    memory:
      max: 2Gi