Horizontal workload scaling

Horizontal workload scaling automatically adjusts the number of pods in a deployment based on observed CPU utilization. Cast AI's implementation of horizontal scaling goes beyond traditional Kubernetes native HPA by working in harmony with our advanced vertical workload scaling to provide a comprehensive workload optimization solution.

Our horizontal workload scaling is designed to handle rapid fluctuations in workload demand, ensuring your applications remain responsive and cost-effective. By automatically scaling the number of pods up or down, horizontal workload scaling helps maintain optimal performance during traffic spikes while preventing over-provisioning during quieter periods.

System requirements for horizontal workload scaling

To use Cast AI's horizontal workload scaling, your cluster must have the following component versions installed:

| Component | Version |
| --- | --- |
| castai-workload-autoscaler | v0.11.2 or later (v0.21.0+ required for Rollouts) |
| castai-agent | v0.60.0 or later |

Ensuring these minimum versions are installed grants access to the horizontal workload scaling features and optimizations.

Workload requirements

Not all workloads are suitable for horizontal scaling. To be eligible for Cast AI's horizontal workload scaling, your workload must meet the following criteria:

  • It must be a Deployment or Rollout
  • At least one container must have CPU requests defined. Without defined CPU requests, Cast AI cannot accurately determine when to scale the workload horizontally
  • The workload must not have native Kubernetes HPA on the CPU enabled. Cast AI's horizontal workload scaling is designed to replace, not complement, native Kubernetes HPA
📘 Interaction with native HPA

  • Workloads with native HPA enabled will not be scaled horizontally by Cast AI
  • Configuring Cast AI's horizontal workload scaling via annotations is not allowed when native HPA is present
  • Users are expected to disable native HPA on their workloads before working with Cast AI's horizontal workload scaling
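Putting the criteria above together, an eligible workload is a plain Deployment (or Rollout) with CPU requests defined and no native HPA targeting it. A minimal illustrative sketch (the name, image, and values are placeholders, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eligible-app        # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eligible-app
  template:
    metadata:
      labels:
        app: eligible-app
    spec:
      containers:
        - name: app
          image: nginx:1.27 # illustrative image
          resources:
            requests:
              cpu: 250m     # CPU requests must be defined for horizontal scaling
```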

Rollout-specific requirements

When using Rollouts with Cast AI's horizontal workload scaling, be aware of these additional limitations:

  • Rollouts using workloadRef: Rollouts that reference a Deployment using the workloadRef field cannot be managed by Cast AI for horizontal optimization. This configuration creates an indirect relationship between the Rollout and its pods, preventing horizontal scaling.
# Example of an unsupported Rollout configuration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  # Other Rollout configuration...
  • Rollouts converted from template to workloadRef: If you initially configure a Rollout using a template and later modify it to use workloadRef, any existing Cast horizontal workload scaling configuration will become invalid. You'll need to disable it before making this change.
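By contrast, a Rollout that defines its pod spec inline via template (rather than referencing a Deployment through workloadRef) has a direct relationship with its pods and can be managed for horizontal optimization. A minimal illustrative sketch (names, image, and strategy are placeholders):

```yaml
# Example of a supported Rollout configuration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-rollout
  template:                 # inline pod template instead of workloadRef
    metadata:
      labels:
        app: example-rollout
    spec:
      containers:
        - name: app
          image: nginx:1.27 # illustrative image
          resources:
            requests:
              cpu: 250m
  strategy:
    canary: {}
```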

How Cast AI horizontal workload scaling works

Cast AI's horizontal workload scaling makes scaling decisions based on the workload's short-term CPU usage, typically over approximately the last 3 minutes. This timeframe allows for responsive scaling while avoiding reactionary decisions based on momentary spikes.

The horizontal scaling algorithm considers these factors:

  • Recent CPU usage: Recalculated every 15 seconds, based on usage over the last 3 minutes
  • Current pod requests: The resource requests set for each pod influence scaling decisions
  • Current pod count: The existing number of pods is considered when determining whether to scale up or down

Horizontal workload scaling needs a few metric data points before it can begin making decisions. As a result, scaling actions, if needed, typically occur within minutes of a workload being onboarded.

Multi-container workloads

When a workload contains multiple containers, the Workload Autoscaler makes horizontal workload autoscaling decisions based on the metrics of the largest container in the pod.

Horizontal scaling behavior with existing pod counts

Understanding how horizontal workload scaling interacts with your current deployment state is crucial for effective management. Here's what happens when you set constraints that differ from the current pod count:

If you set constraints (min/max) that differ from the current pod count, Workload Autoscaler will adjust the workload to meet these constraints after the cooldown period. For example:

  • Scenario: The current pod count is 6, but you set it to a max of 5
  • Outcome: After the cooldown period, Workload Autoscaler will reduce the workload to a maximum of 5 replicas
📘 Note

While Workload Autoscaler considers current requests and CPU usage in its decision-making process, it will never exceed the constraints you have set. This gives you ultimate control over the scaling behavior of your workloads.

Interaction between vertical and horizontal scaling

Cast AI's horizontal workload scaling works in tandem with our vertical scaling functionality to provide comprehensive workload optimization. While horizontal scaling reacts immediately to CPU spikes by adjusting pod counts, vertical scaling gradually adjusts resource requests based on longer-term usage patterns.

The key things to understand about this interaction:

  • Horizontal scaling provides immediate response to traffic spikes
  • Vertical scaling makes slower, more deliberate adjustments to resource allocation
  • Both can operate simultaneously to optimize your workloads

Memory vs. CPU scaling

Cast AI's horizontal scaling supports CPU-based scaling. For workloads that need to scale based on memory usage, you can use native Kubernetes HPA alongside Cast's vertical scaling. The Workload Autoscaler automatically coordinates between vertical memory adjustments and memory-based HPA to prevent any scaling conflicts.

Cast AI still recommends using CPU-based scaling with HPA, as it provides more predictable and stable scaling behavior while still allowing for efficient resource optimization.
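For a workload that genuinely needs memory-based scaling, a native HPA on memory can run alongside Cast's vertical scaling. A minimal sketch using the standard Kubernetes autoscaling/v2 API (the names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-memory-hpa   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # illustrative target workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory        # scale on memory, not CPU
        target:
          type: Utilization
          averageUtilization: 80
```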

Startup metrics behavior

Horizontal workload autoscaling actively responds to resource usage during application startup. This means that applications with high initial resource usage (like JVM applications) may trigger scaling events during startup.

Disabling horizontal workload scaling

When you turn off Cast AI's horizontal scaling for a deployment:

  • Scaling Stops: The system will stop actively scaling the deployment based on resource utilization.
  • Current State Maintained: The deployment will retain its current number of replicas at the time of disabling horizontal scaling.
  • No Automatic Adjustments: After disabling, the system will not make any further automatic adjustments to the number of replicas.
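With the workloads.cast.ai/configuration annotation, turning horizontal scaling off is a matter of flipping the optimization toggle while leaving the rest of the configuration in place. A minimal sketch (replica bounds are illustrative):

```yaml
workloads.cast.ai/configuration: |
  horizontal:
    optimization: off   # scaling stops; the current replica count is retained
    minReplicas: 1
    maxReplicas: 10
```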

Configuring horizontal workload scaling

Cast AI provides flexible configuration options for horizontal scaling, allowing you to fine-tune its behavior to suit your specific needs. Configuration can be done via UI, API, or Kubernetes annotations.

API configuration options

Through the Cast AI WorkloadOptimizationAPI, you can set the following parameters:

  • Min/Max replicas allowed to be set by horizontal autoscaling (required settings)
    These settings define the boundaries within which horizontal scaling can operate on your workload. For example, you might set a minimum of 2 replicas to ensure high availability and a maximum of 10 to control costs.
  • Horizontal scaling enabled/disabled
    This toggle lets you easily enable or disable horizontal scaling for specific workloads without removing the configuration.

Annotation configuration options

For more granular control, you can configure horizontal scaling behavior using the workloads.cast.ai/configuration Kubernetes annotation. This allows you to define horizontal workload scaling settings alongside other configuration options.

Here's a basic example of such a configuration using annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    workloads.cast.ai/configuration: |
      horizontal:
        optimization: on
        minReplicas: 1
        maxReplicas: 10
        scaleDown:
          stabilizationWindow: 3m

The annotations configuration supports the following horizontal scaling configuration options:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| optimization | string | Yes | Enables or disables horizontal scaling for the workload. Set to "on" to enable, or "off" to disable. |
| minReplicas | integer | Yes | The minimum number of replicas the autoscaler should maintain. |
| maxReplicas | integer | Yes | The maximum number of replicas the autoscaler can scale up to. |
| scaleDown.stabilizationWindow | duration | No | Cooldown period between scale-downs (default: "5m"). For example, a value of "3m" means the autoscaler waits 3 minutes before considering another scale-down, even if metrics indicate one is needed. This helps prevent rapid fluctuations in scaling. The value must be a parsable duration (for example, "3m" or "90s"). |

Combining horizontal and vertical autoscaling

To use both horizontal and vertical workload autoscaling together, you can include both configurations in the same annotation:

workloads.cast.ai/configuration: |
  horizontal:
    optimization: on
    minReplicas: 1
    maxReplicas: 10
    scaleDown:
      stabilizationWindow: 3m
  vertical:
    optimization: on

This combination allows Cast AI to optimize your workloads both horizontally (adjusting the number of pods) and vertically (adjusting the resources allocated to each pod).

For more information on configuring the Workload Autoscaler using annotations, see Workload Autoscaler Configuration.

Horizontal workload autoscaling interaction with CI/CD systems

Continuous Deployment (CD) systems can sometimes interfere with horizontal workload scaling operations. Cast AI has implemented solutions to ensure smooth integration:

  • ArgoCD: Cast AI will change the replica field owner value to prevent ArgoCD from detecting an out-of-sync state.

Automated deployments and horizontal workload scaling

In environments where automated deployments frequently update workloads, there can be potential conflicts between horizontal scaling decisions and the specifications defined in your deployment manifests. Here's what you need to know:

  1. Conflicting Replica Counts: If your CI/CD pipeline repeatedly deploys a workload with a specific replica count that differs from HPA's decisions, Cast AI's horizontal scaling is designed to detect and manage these conflicts.
  2. Cooldown Period: Cast AI implements a cooldown mechanism to prevent rapid fluctuations and potential resource thrashing.
  3. Resuming Normal Operation: After the cooldown period, which lasts 30 minutes, horizontal autoscaling will resume normal operation for the workload.