Horizontal Pod Autoscaling
Horizontal Pod Autoscaling (HPA) automatically adjusts the number of pods in a deployment based on observed CPU utilization. Cast AI's implementation of HPA goes beyond traditional autoscaling by working in harmony with our Vertical Pod Autoscaler (VPA) to provide a comprehensive workload optimization solution.
Our HPA is designed to handle rapid fluctuations in workload demand, ensuring your applications remain responsive and cost-effective. By automatically scaling the number of pods up or down, Cast AI HPA helps maintain optimal performance during traffic spikes while preventing over-provisioning during quieter periods.
System Requirements for HPA
To leverage Cast AI's HPA capabilities, your cluster needs to meet certain system requirements. You need the following component versions installed:
Component | Version |
---|---|
castai-workload-autoscaler | v0.11.2 or later (v0.21.0+ required for Rollouts) |
castai-agent | v0.60.0 or later |
Ensuring these minimum versions are installed grants access to the HPA features and optimizations.
Workload Requirements for HPA
Not all workloads are suitable for horizontal scaling. To be eligible for Cast AI HPA, your workload must meet the following criteria:
- It must be a Deployment or Rollout
- At least one container must have CPU requests defined. Without defined CPU requests, HPA cannot accurately determine when to scale the workload
- The workload must not have a native Kubernetes HPA that scales on CPU. Cast AI HPA is designed to replace, not complement, native Kubernetes HPA
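As an illustration of the criteria above, here is a minimal sketch of an eligible Deployment. The names, image, and request values are placeholders chosen for this example, not values prescribed by Cast AI; the relevant part is that at least one container declares a CPU request.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: web           # hypothetical container name
          image: registry.example.com/my-app:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m       # a CPU request must be defined on at least one container
              memory: 256Mi
```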
Interaction with native HPA
- Workloads with native HPA enabled will not be scaled horizontally by Cast AI.
- Configuring Cast AI HPA via annotations is not allowed when native HPA is present.
- Users are expected to disable their native HPA before working with Cast AI HPA.
How Cast AI HPA Works
Cast AI HPA makes scaling decisions based on the workload's short-term CPU usage, typically over approximately the last 3 minutes. This timeframe allows for responsive scaling while avoiding overreaction to momentary spikes.
The HPA algorithm considers these factors:
- Recent CPU usage: Recalculated every 15 seconds, based on usage over the last 3 minutes
- Current pod requests: The resource requests set for each pod influence scaling decisions
- Current pod count: The existing number of pods is considered when determining whether to scale up or down
HPA needs at least a few metric data points before it can act. As a result, any required scaling action typically occurs within minutes of a workload being onboarded.
Multi-container workloads
When a workload contains multiple containers, the Workload Autoscaler makes HPA autoscaling decisions based on the metrics of the largest container in the pod.
HPA Behavior with Existing Pod Counts
Understanding how HPA interacts with your current deployment state is crucial for effective management. If you set constraints (min/max) that the current pod count does not satisfy, HPA adjusts the workload to meet them after the cooldown period. For example:
- Scenario: The current pod count is 6, but you set the HPA max to 5
- Outcome: After the cooldown period, HPA will reduce the workload to a maximum of 5 replicas
While HPA considers current requests and CPU usage in its decision-making process, it never scales the workload outside the constraints you've set. This gives you ultimate control over the scaling behavior of your workloads.
HPA and VPA Interaction
Cast AI's HPA works in harmony with our Vertical Pod Autoscaler (VPA) to provide comprehensive workload optimization. While HPA reacts immediately to CPU spikes by adjusting pod counts, VPA gradually adjusts resource requests based on longer-term usage patterns.
The key things to understand about this interaction:
- HPA provides immediate response to traffic spikes
- VPA makes slower, more deliberate adjustments to resource allocation
- Both can operate simultaneously to optimize your workloads
Memory vs. CPU Scaling
Currently, Cast AI HPA only supports CPU-based scaling. For workloads that need to scale based on memory usage, we recommend using native Kubernetes HPA. This design choice reflects our focus on optimizing for CPU utilization patterns, which are typically more indicative of immediate scaling needs.
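For reference, a memory-based native Kubernetes HPA might look like the sketch below. It uses the standard `autoscaling/v2` API; the object name, target Deployment, replica bounds, and the 75% utilization target are illustrative assumptions rather than recommended values. A utilization target also requires memory requests on the target pods, and, as noted under Interaction with native HPA, a workload with native HPA enabled is not scaled horizontally by Cast AI.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-memory         # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical target workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory          # scale on memory instead of CPU
        target:
          type: Utilization
          averageUtilization: 75   # illustrative target, percent of requested memory
```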
Startup Metrics Behavior
HPA actively responds to resource usage during application startup. This means that applications with high initial resource usage (like JVM applications) may trigger scaling events during startup.
Disabling HPA
When you turn off Cast AI HPA for a deployment:
- Scaling Stops: The system will stop actively scaling the deployment based on resource utilization.
- Current State Maintained: The deployment will retain its current number of replicas at the time of disabling HPA.
- No Automatic Adjustments: After disabling, the system will not make any further automatic adjustments to the number of replicas.
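If you manage HPA through the `workloads.cast.ai/configuration` annotation (see Annotation Configuration Options), disabling it could look like the fragment below. This is a sketch based on the documented `optimization` field; the replica bounds are example values, kept here because the fields are listed as required.

```yaml
metadata:
  annotations:
    workloads.cast.ai/configuration: |
      horizontal:
        optimization: off     # "off" disables Cast AI HPA for this workload
        minReplicas: 1        # example bounds, retained because the fields are required
        maxReplicas: 10
```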
Configuring HPA
Cast AI provides flexible configuration options for HPA, allowing you to fine-tune its behavior to suit your specific needs. Configuration can be done via UI, API, or Kubernetes annotations.
API Configuration Options
Through the Cast AI WorkloadOptimizationAPI, you can set the following parameters:
- Min/Max replicas allowed to be set by HPA (required settings)
These settings define the boundaries within which HPA can scale your workload. For example, you might set a minimum of 2 replicas to ensure high availability and a maximum of 10 to control costs.
- HPA enabled/disabled
This toggle lets you easily enable or disable HPA for specific workloads without removing the configuration.
Annotation Configuration Options
For more granular control, you can configure HPA behavior using the workloads.cast.ai/configuration Kubernetes annotation. This allows you to define HPA settings alongside other configuration options.
Here's a basic example of HPA configuration using the annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    workloads.cast.ai/configuration: |
      horizontal:
        optimization: on
        minReplicas: 1
        maxReplicas: 10
        scaleDown:
          stabilizationWindow: 3m
The annotation supports the following horizontal scaling configuration options:
Field | Type | Required | Description |
---|---|---|---|
optimization | string | Yes | This enables or disables HPA for the workload. Set it to "on" to enable HPA, or "off" to disable it. |
minReplicas | integer | Yes | Specifies the minimum number of replicas that HPA should maintain. |
maxReplicas | integer | Yes | Specifies the maximum number of replicas that HPA can scale up to. |
scaleDown.stabilizationWindow | duration | No | Cooldown period between scale-downs (default: "5m"). This optional setting determines the downscale stabilization period. For example, if it is set to "3m" (3 minutes), HPA waits 3 minutes before considering another scale-down action, even if metrics indicate one is needed. This helps prevent rapid fluctuations in scaling. The value must be a parsable duration string. |
Combining HPA and VPA
To use both Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA) together, you can include both configurations in the same annotation:
workloads.cast.ai/configuration: |
  horizontal:
    optimization: on
    minReplicas: 1
    maxReplicas: 10
    scaleDown:
      stabilizationWindow: 3m
  vertical:
    optimization: on
This combination allows Cast AI to optimize your workloads both horizontally (adjusting the number of pods) and vertically (adjusting the resources allocated to each pod).
For more information on configuring the Workload autoscaler using annotations, see Workload Autoscaler Configuration.
HPA Interaction with CI/CD Systems
Continuous Deployment (CD) systems can sometimes interfere with HPA operations. Cast AI has implemented solutions to ensure smooth integration:
- ArgoCD: Cast AI will change the replica field owner value to prevent ArgoCD from detecting an out-of-sync state.
Automated deployments and HPA
In environments where automated deployments frequently update workloads, there can be potential conflicts between HPA's scaling decisions and the specifications defined in your deployment manifests. Here's what you need to know:
- Conflicting Replica Counts: If your CI/CD pipeline repeatedly deploys a workload with a specific replica count that differs from HPA's decisions, Cast AI's HPA is designed to detect and manage these conflicts.
- Cooldown Period: Cast AI implements a cooldown mechanism to prevent rapid fluctuations and potential resource thrashing.
- Resuming Normal Operation: After the cooldown period, which lasts 30 minutes, HPA will resume normal operation for the workload.