Horizontal Pod Autoscaling

Horizontal Pod Autoscaling (HPA) automatically adjusts the number of pods in a deployment based on observed CPU utilization. CAST AI's implementation of HPA goes beyond traditional autoscaling by working in harmony with our Vertical Pod Autoscaler (VPA) to provide a comprehensive workload optimization solution.

Our HPA is designed to handle rapid fluctuations in workload demand, ensuring your applications remain responsive and cost-effective. By automatically scaling the number of pods up or down, CAST AI HPA helps maintain optimal performance during traffic spikes while preventing over-provisioning during quieter periods.

System Requirements for HPA

To leverage CAST AI's HPA capabilities, your cluster needs to meet certain system requirements. You need the following component versions installed:

  • castai-workload-autoscaler v0.11.2 or later
  • castai-agent v0.60.0 or later

Ensuring you have these minimum versions installed grants access to the HPA features and optimizations.

Workload Requirements for HPA

Not all workloads are suitable for horizontal scaling. To be eligible for CAST AI HPA, your workload must meet the following criteria:

  • It must be a Deployment
  • At least one container must have CPU requests defined. HPA cannot accurately determine when to scale the workload without defined CPU requests.
  • The workload must not have a native Kubernetes HPA enabled on the CPU metric. CAST AI HPA is designed to replace, not complement, native Kubernetes HPA.
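As a sketch, a Deployment meeting these eligibility criteria might look like the following (all names and values are illustrative); note the CPU request on the container:

```yaml
# Hypothetical Deployment eligible for CAST AI HPA:
# it is a Deployment, defines CPU requests, and has no native HPA targeting it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest # illustrative image
          resources:
            requests:
              cpu: "250m"      # CPU request required for HPA eligibility
```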


Interaction with native HPA

  • Workloads with native HPA enabled will not be scaled horizontally by CAST AI.
  • Configuring CAST AI HPA via annotations is not allowed when native HPA is present.
  • Users are expected to disable their native HPA before working with CAST AI HPA.
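For reference, a native Kubernetes HPA object targeting CPU looks like the sketch below (names and thresholds are illustrative). An object like this that targets your Deployment must be removed or disabled before CAST AI HPA can manage the workload:

```yaml
# Native Kubernetes HPA (autoscaling/v2) scaling a Deployment on CPU.
# CAST AI HPA will not manage a workload while such an object targets it.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa             # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app               # illustrative target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```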


Our HPA makes scaling decisions based on the workload's short-term CPU usage, typically over approximately the last 3 minutes. This timeframe allows for responsive scaling while avoiding overreactions to momentary spikes.

The HPA algorithm considers these factors:

  • Recent CPU usage: This provides insight into the current demand on the workload.
  • Current pod requests: The resource requests set for each pod influence scaling decisions.
  • Current pod count: The existing number of pods is considered when determining whether to scale up or down.

HPA requires a minimum number of metrics data points before it begins making decisions. As a result, scaling actions typically occur within minutes of a workload being onboarded, if needed.

Multi-container workloads

When a workload contains multiple containers, the Workload autoscaler makes HPA autoscaling decisions based on the metrics of the largest container in the pod.
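As an illustration (assuming "largest" here refers to the container with the largest CPU request, which the description does not spell out), a two-container pod spec might look like this:

```yaml
# Hypothetical two-container pod spec: the Workload autoscaler bases its
# HPA decisions on the metrics of the largest container ("app" here).
spec:
  containers:
    - name: app
      image: my-app:latest      # illustrative
      resources:
        requests:
          cpu: "500m"           # largest container; its metrics drive HPA decisions
    - name: sidecar
      image: my-sidecar:latest  # illustrative
      resources:
        requests:
          cpu: "50m"
```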

HPA Behavior with Existing Pod Counts

Understanding how HPA interacts with your current deployment state is crucial for effective management. Here's what happens when you set constraints that differ from the current pod count:

If you set constraints (min/max) that differ from the current pod count, HPA will adjust the workload to meet these constraints after the cooldown period. For example:

  • Scenario: The current pod count is 6, but you set the HPA max to 5
  • Outcome: After the cooldown period, HPA will reduce the workload to a maximum of 5 replicas

It's important to note that while HPA considers current requests and CPU usage in its decision-making process, it will never exceed the constraints you've set. This gives you ultimate control over the scaling behavior of your workloads.

Disabling HPA

When you turn off CAST AI HPA for a deployment:

  • Scaling Stops: The system will stop actively scaling the deployment based on resource utilization.
  • Current State Maintained: The deployment will retain its current number of replicas at the time of disabling HPA.
  • No Automatic Adjustments: After disabling, the system will not make any further automatic adjustments to the number of replicas.
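Based on the annotation described later in this document, turning HPA off for a workload can be sketched as:

```yaml
# Disabling CAST AI HPA via annotation. The Deployment keeps its current
# replica count and receives no further automatic adjustments.
metadata:
  annotations:
    workloads.cast.ai/horizontal-autoscaling: "off"
```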

Configuring HPA

CAST AI provides flexible configuration options for HPA, allowing you to fine-tune its behavior to suit your specific needs. Configuration can be done via UI, API, or Kubernetes annotations.

API Configuration Options

Through the CAST AI WorkloadOptimizationAPI, you can set the following parameters:

  • Min/Max replicas allowed to be set by HPA (required settings)
    These settings define the boundaries within which HPA can scale your workload. For example, you might set a minimum of 2 replicas to ensure high availability and a maximum of 10 to control costs.
  • HPA enabled/disabled
    This toggle lets you easily enable or disable HPA for specific workloads without removing the configuration.

Annotation Configuration Options

For more granular control, you can use Kubernetes annotations to configure HPA:

  • workloads.cast.ai/horizontal-autoscaling ("on" or "off", required): Enables or disables HPA for the workload. Set it to "on" to enable HPA, or "off" to disable it.
  • workloads.cast.ai/min-replicas (string, required): Specifies the minimum number of replicas that HPA should maintain.
  • workloads.cast.ai/max-replicas (string, required): Specifies the maximum number of replicas that HPA can scale up to.
  • workloads.cast.ai/horizontal-downscale-stabilization-window (string, optional): Determines the downscale stabilization period. For example, if set to "3m" (3 minutes), HPA waits 3 minutes before considering another scale-down action, even if metrics indicate one is necessary. This helps prevent rapid fluctuations in scaling.

The duration must be a parsable duration string, for example "3m".

The downscale stabilization period is particularly useful for workloads with variable traffic patterns, as it helps maintain a balance between responsiveness and stability.

Configuration example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    workloads.cast.ai/horizontal-autoscaling: "on" # Enable horizontal autoscaling for this deployment
    workloads.cast.ai/min-replicas: "1" # Set the minimum number of replicas to 1
    workloads.cast.ai/max-replicas: "10" # Set the maximum number of replicas to 10
    workloads.cast.ai/horizontal-downscale-stabilization-window: "3m" # Set a 3-minute stabilization window before scaling down

Combining HPA and VPA

If you want to use Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA) together, you need to use a specific annotation for VPA.

  • workloads.cast.ai/vertical-autoscaling ("on" or "off"): Enables or disables VPA for the workload. Set it to "on" to enable VPA, or "off" to disable it.

The old VPA annotation workloads.cast.ai/autoscaling: "vertical" is not supported with HPA annotations.

Configuration example:

workloads.cast.ai/horizontal-autoscaling: "on"  
workloads.cast.ai/vertical-autoscaling: "on"

This combination allows CAST AI to optimize your workloads horizontally (adjusting the number of pods) and vertically (adjusting the resources allocated to each pod).
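Putting the pieces together, a workload using both autoscalers with replica bounds and a stabilization window could be annotated as follows (values are illustrative):

```yaml
# Sketch of Deployment metadata combining horizontal and vertical autoscaling.
metadata:
  name: my-app   # illustrative name
  annotations:
    workloads.cast.ai/horizontal-autoscaling: "on"
    workloads.cast.ai/vertical-autoscaling: "on"
    workloads.cast.ai/min-replicas: "2"
    workloads.cast.ai/max-replicas: "10"
    workloads.cast.ai/horizontal-downscale-stabilization-window: "3m"
```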

HPA Interaction with CI/CD Systems

Continuous Deployment (CD) systems can sometimes interfere with HPA operations. CAST AI has implemented solutions to ensure smooth integration:

  • ArgoCD: CAST AI will change the replica field owner value to prevent ArgoCD from detecting an out-of-sync state.

Automated deployments and HPA

In environments where automated deployments frequently update workloads, there can be potential conflicts between HPA's scaling decisions and the specifications defined in your deployment manifests. Here's what you need to know:

  1. Conflicting Replica Counts: If your CI/CD pipeline repeatedly deploys a workload with a specific replica count that differs from HPA's decisions, CAST AI's HPA is designed to detect and manage these conflicts.
  2. Cooldown Period: CAST AI implements a cooldown mechanism to prevent rapid fluctuations and potential resource thrashing.
  3. Resuming Normal Operation: After the cooldown period, which lasts 30 minutes, HPA will resume normal operation for the workload.