Horizontal workload scaling

Horizontal workload scaling automatically adjusts the number of pods in a deployment based on observed CPU utilization. Cast AI's implementation of horizontal scaling goes beyond traditional Kubernetes native HPA by working in harmony with our advanced vertical workload scaling to provide a comprehensive workload optimization solution.

Our horizontal workload scaling is designed to handle rapid fluctuations in workload demand, ensuring your applications remain responsive and cost-effective. By automatically scaling the number of pods up or down, horizontal workload scaling helps maintain optimal performance during traffic spikes while preventing over-provisioning during quieter periods.

System requirements for horizontal workload scaling

To use Cast AI's horizontal workload scaling, your cluster must have the following component versions installed:

| Component | Version |
| --- | --- |
| castai-workload-autoscaler | v0.11.2 or later (v0.21.0+ required for Rollouts) |
| castai-agent | v0.60.0 or later |

Ensuring these minimum versions are installed grants access to the horizontal workload scaling features and optimizations.

Workload requirements

Not all workloads are suitable for horizontal scaling. To be eligible for Cast AI's horizontal workload scaling, your workload must meet the following criteria:

  • It must be a Deployment or Rollout
  • At least one container must have CPU requests defined. Without defined CPU requests, Cast AI cannot accurately determine when to scale the workload horizontally
  • The workload must not have native Kubernetes HPA on the CPU enabled. Cast AI's horizontal workload scaling is designed to replace, not complement, native Kubernetes HPA
📘 Interaction with native HPA

  • Workloads with native HPA enabled will not be scaled horizontally by Cast AI
  • Configuring Cast AI's horizontal workload scaling via annotations is not allowed when native HPA is present
  • Users are expected to disable native HPA on their workloads before working with Cast AI's horizontal workload scaling
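Putting the criteria above together, an eligible workload is a plain Deployment (or Rollout) with CPU requests defined and no native HPA targeting it. A minimal illustrative sketch (the name, image, and values are placeholders, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eligible-app        # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eligible-app
  template:
    metadata:
      labels:
        app: eligible-app
    spec:
      containers:
        - name: app
          image: nginx:1.27 # illustrative image
          resources:
            requests:
              cpu: 250m     # CPU requests must be defined for horizontal scaling
```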

Rollout-specific requirements

When using Rollouts with Cast AI's horizontal workload scaling, be aware of these additional limitations:

  • Rollouts using workloadRef: Rollouts that reference a Deployment using the workloadRef field cannot be managed by Cast AI for horizontal optimization. This configuration creates an indirect relationship between the Rollout and its pods, preventing horizontal scaling.
# Example of an unsupported Rollout configuration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  # Other Rollout configuration...
  • Rollouts converted from template to workloadRef: If you initially configure a Rollout using a template and later modify it to use workloadRef, any existing Cast horizontal workload scaling configuration will become invalid. You'll need to disable it before making this change.
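By contrast, a Rollout that defines its pod spec inline via template (rather than referencing a Deployment through workloadRef) has a direct relationship with its pods and can be managed for horizontal optimization. A minimal illustrative sketch (names, image, and strategy are placeholders):

```yaml
# Example of a supported Rollout configuration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-rollout
  template:                 # inline pod template instead of workloadRef
    metadata:
      labels:
        app: example-rollout
    spec:
      containers:
        - name: app
          image: nginx:1.27 # illustrative image
          resources:
            requests:
              cpu: 250m
  strategy:
    canary: {}
```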

How Cast AI horizontal workload scaling works

Cast AI's horizontal workload scaling makes scaling decisions based on the workload's short-term CPU usage, typically over approximately the last 3 minutes. This timeframe allows for responsive scaling while avoiding reactionary decisions based on momentary spikes.

The horizontal scaling algorithm considers these factors:

  • Recent CPU usage: Recalculated every 15 seconds, based on usage over the last 3 minutes
  • Current pod requests: The resource requests set for each pod influence scaling decisions
  • Current pod count: The existing number of pods is considered when determining whether to scale up or down

Horizontal workload scaling needs a few metric data points before it can begin making decisions. As a result, scaling actions, if needed, typically occur within minutes of a workload being onboarded.

Multi-container workloads

When a workload contains multiple containers, the Workload Autoscaler makes horizontal workload autoscaling decisions based on the metrics of the largest container in the pod.

Horizontal scaling behavior with existing pod counts

Understanding how horizontal workload scaling interacts with your current deployment state is crucial for effective management. Here's what happens when you set constraints that differ from the current pod count:

If you set constraints (min/max) that differ from the current pod count, Workload Autoscaler will adjust the workload to meet these constraints after the cooldown period. For example:

  • Scenario: The current pod count is 6, but you set it to a max of 5
  • Outcome: After the cooldown period, Workload Autoscaler will reduce the workload to a maximum of 5 replicas
📘 Note

While Workload Autoscaler considers current requests and CPU usage in its decision-making process, it will never exceed the constraints you have set. This gives you ultimate control over the scaling behavior of your workloads.

Interaction between vertical and horizontal scaling

Cast AI's horizontal workload scaling works in tandem with our vertical scaling functionality to provide comprehensive workload optimization. While horizontal scaling reacts immediately to CPU spikes by adjusting pod counts, vertical scaling gradually adjusts resource requests based on longer-term usage patterns.

The key things to understand about this interaction:

  • Horizontal scaling provides immediate response to traffic spikes
  • Vertical scaling makes slower, more deliberate adjustments to resource allocation
  • Both can operate simultaneously to optimize your workloads

Memory vs. CPU scaling

Cast AI's horizontal scaling supports CPU-based scaling. For workloads that need to scale based on memory usage, you can use native Kubernetes HPA alongside Cast's vertical scaling. The Workload Autoscaler automatically coordinates between vertical memory adjustments and memory-based HPA to prevent any scaling conflicts.

Cast AI still recommends using CPU-based scaling with HPA, as it provides more predictable and stable scaling behavior while still allowing for efficient resource optimization.
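For a workload that genuinely needs memory-based scaling, a native HPA on memory can run alongside Cast's vertical scaling. A minimal sketch using the standard Kubernetes autoscaling/v2 API (the names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-memory-hpa   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # illustrative target workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory        # scale on memory, not CPU
        target:
          type: Utilization
          averageUtilization: 80
```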

Startup metrics behavior

Horizontal workload autoscaling actively responds to resource usage during application startup. This means that applications with high initial resource usage (like JVM applications) may trigger scaling events during startup.

Disabling horizontal workload scaling

When you turn off Cast AI's horizontal scaling for a deployment:

  • Scaling Stops: The system will stop actively scaling the deployment based on resource utilization.
  • Current State Maintained: The deployment will retain its current number of replicas at the time of disabling horizontal scaling.
  • No Automatic Adjustments: After disabling, the system will not make any further automatic adjustments to the number of replicas.
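With the workloads.cast.ai/configuration annotation, turning horizontal scaling off is a matter of flipping the optimization toggle while leaving the rest of the configuration in place. A minimal sketch (replica bounds are illustrative):

```yaml
workloads.cast.ai/configuration: |
  horizontal:
    optimization: off   # scaling stops; the current replica count is retained
    minReplicas: 1
    maxReplicas: 10
```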

Configuring horizontal workload scaling

Cast AI provides flexible configuration options for horizontal scaling, allowing you to fine-tune its behavior to suit your specific needs. Configuration can be done via UI, API, or Kubernetes annotations.

API configuration options

Through the Cast AI WorkloadOptimizationAPI, you can set the following parameters:

  • Min/Max replicas allowed to be set by horizontal autoscaling (required settings)
    These settings define the boundaries within which horizontal scaling can operate on your workload. For example, you might set a minimum of 2 replicas to ensure high availability and a maximum of 10 to control costs.
  • Horizontal scaling enabled/disabled
    This toggle lets you easily enable or disable horizontal scaling for specific workloads without removing the configuration.

Annotation configuration options

For more granular control, you can configure horizontal scaling behavior using the workloads.cast.ai/configuration Kubernetes annotation. This allows you to define horizontal workload scaling settings alongside other configuration options.

Here's a basic example of such a configuration using annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    workloads.cast.ai/configuration: |
      horizontal:
        optimization: on
        minReplicas: 1
        maxReplicas: 10
        scaleDown:
          stabilizationWindow: 3m

The annotations configuration supports the following horizontal scaling configuration options:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| optimization | string | Yes | Enables or disables horizontal scaling for the workload. Set to "on" to enable, or "off" to disable. |
| minReplicas | integer | Yes | The minimum number of replicas the autoscaler should maintain. |
| maxReplicas | integer | Yes | The maximum number of replicas the autoscaler can scale up to. |
| scaleDown.stabilizationWindow | duration | No | Cooldown period between scale-downs (default: "5m"). For example, a value of "3m" means the autoscaler waits 3 minutes before considering another scale-down, even if metrics indicate one is needed. This helps prevent rapid fluctuations in scaling. The value must be a parsable duration (for example, "3m" or "90s"). |

Combining horizontal and vertical autoscaling

To use both horizontal and vertical workload autoscaling together, you can include both configurations in the same annotation:

workloads.cast.ai/configuration: |
  horizontal:
    optimization: on
    minReplicas: 1
    maxReplicas: 10
    scaleDown:
      stabilizationWindow: 3m
  vertical:
    optimization: on

This combination allows Cast AI to optimize your workloads both horizontally (adjusting the number of pods) and vertically (adjusting the resources allocated to each pod).

For more information on configuring the Workload Autoscaler using annotations, see Workload Autoscaler Configuration.

Horizontal workload autoscaling interaction with CI/CD systems

Continuous Deployment (CD) systems can sometimes interfere with horizontal workload scaling operations. Cast AI has implemented solutions to ensure smooth integration:

  • ArgoCD: Cast AI will change the replica field owner value to prevent ArgoCD from detecting an out-of-sync state.

Automated deployments and horizontal workload scaling

In environments where automated deployments frequently update workloads, there can be potential conflicts between horizontal scaling decisions and the specifications defined in your deployment manifests. Here's what you need to know:

  1. Conflicting Replica Counts: If your CI/CD pipeline repeatedly deploys a workload with a specific replica count that differs from HPA's decisions, Cast AI's horizontal scaling is designed to detect and manage these conflicts.
  2. Cooldown Period: Cast AI implements a cooldown mechanism to prevent rapid fluctuations and potential resource thrashing.
  3. Resuming Normal Operation: After the cooldown period, which lasts 30 minutes, horizontal autoscaling will resume normal operation for the workload.