Vertical & Horizontal Pod Autoscaling

In Kubernetes, running workloads with both Vertical Pod Autoscaling (VPA) and Horizontal Pod Autoscaling (HPA) turned on at the same time can be challenging. VPA adjusts the resources allocated to individual pods while HPA changes the number of pod replicas - when operating independently, these mechanisms can work against each other. For example, VPA might increase resources per pod while HPA tries to scale down the number of pods, or vice versa, leading to sub-optimal resource allocation and potential issues.

Cast AI has developed a unique approach to ensure these scaling mechanisms work together harmoniously for CPU scaling. Rather than letting them potentially conflict, our Workload Autoscaler automatically adjusts its behavior in the background to optimize resource allocation of workloads that have both VPA and HPA configured at the same time.

How it works

Workload Autoscaler recognizes two distinct workload patterns and applies a different optimization strategy to each. This works when combining Cast AI VPA with either the Kubernetes native HPA or Cast AI's proprietary HPA.

When Workload Autoscaler detects that a workload's resource allocation is significantly different from optimal targets, it makes adjustments gradually over time. This process can take a few hours as the system carefully steps closer and closer toward the target allocation.

Stable workload optimization

For workloads with consistent CPU usage patterns, Workload Autoscaler dynamically balances vertical and horizontal scaling to maintain optimal performance. The system calculates an ideal target replica count from your configured HPA range, aiming to run at roughly minReplicas plus 10% of the replica budget (maxReplicas - minReplicas) during normal operation. This target serves as a baseline for making vertical scaling decisions.

To maintain this balance, Workload Autoscaler continuously monitors pod counts averaged over recent hours. When it observes the workload running with more replicas than the target, it increases vertical CPU recommendations to consolidate the workload onto fewer, better-resourced pods. When the workload runs below the target, it reduces vertical CPU recommendations to encourage horizontal scaling. This approach keeps resource allocation efficient while maintaining enough capacity to handle sudden load increases.

Example of a stable workload

Example

Consider an HPA configuration allowing 1-11 replicas. This gives Workload Autoscaler a replica budget of 10 pods to work with. The system targets running about 2 replicas during normal operation (~10% of the range + minReplicas of 1 in this example). If the average pod count over recent hours rises above 2, vertical resources are increased to encourage consolidation. If it falls below 2, vertical resources are decreased to promote running more pods.
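The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Cast AI's implementation: the function names, the `budget_fraction` parameter, the "hold" outcome when the average equals the target, and the absence of any rounding are all assumptions made for the sketch.

```python
def target_replicas(min_replicas: int, max_replicas: int,
                    budget_fraction: float = 0.10) -> float:
    """Return minReplicas plus ~10% of the replica budget.

    The replica budget is maxReplicas - minReplicas. How the real system
    rounds this value is not documented, so no rounding is applied here.
    """
    if min_replicas < 1 or max_replicas < min_replicas:
        raise ValueError("expected 1 <= min_replicas <= max_replicas")
    budget = max_replicas - min_replicas
    return min_replicas + budget_fraction * budget


def vertical_adjustment(avg_replicas: float, target: float) -> str:
    """Direction in which the per-pod CPU recommendation moves."""
    if avg_replicas > target:
        return "increase"  # consolidate onto fewer, larger pods
    if avg_replicas < target:
        return "decrease"  # encourage HPA to run more pods
    return "hold"          # assumption: no change when exactly on target


target = target_replicas(min_replicas=1, max_replicas=11)
print(target)                            # target for the 1-11 example
print(vertical_adjustment(3.4, target))  # average above target
print(vertical_adjustment(1.2, target))  # average below target
```

For the 1-11 range above, the target works out to 2 replicas, so a recent average of 3.4 pods leads to larger per-pod CPU recommendations and an average of 1.2 pods leads to smaller ones.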

Cyclical and variable workload optimization

For workloads with predictable patterns or significant load variations, Workload Autoscaler employs a three-tier optimization algorithm.

  1. Minimum load distribution
    Workload Autoscaler calculates the minimum CPU requirements by dividing the lowest observed total load by minimum replicas: minRequests = minLoad / minReplicas
    It then verifies whether these minimum requests could handle peak load when scaled to maximum replicas: minRequests * maxReplicas >= maxLoad
    If this validation passes, it uses these values for issuing vertical recommendations.

  2. Current recommendation validation
    If minimum load distribution isn't suitable, Workload Autoscaler checks whether the current per-pod vertical recommendations can handle maximum load when scaled to maximum replicas. If they prove sufficient, it keeps the current recommendations.

  3. Maximum load distribution
    If neither of the above strategies works, Workload Autoscaler spreads the maximum observed load across maximum replicas. This allows HPA to scale freely between minimum and maximum replicas based on current demand.
    This last approach works as a reliable catch-all solution for highly variable workloads.
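The three tiers above can be read as a simple fallback chain. The following Python sketch illustrates that reading under stated assumptions: `recommend_cpu`, `CpuRecommendation`, and all parameter names are hypothetical, loads and requests are expressed in CPU cores, and the real system may apply additional smoothing or constraints that are not described here.

```python
from dataclasses import dataclass


@dataclass
class CpuRecommendation:
    cpu_per_pod: float  # CPU cores requested per pod
    tier: str           # which of the three tiers produced the value


def recommend_cpu(min_load: float, max_load: float,
                  min_replicas: int, max_replicas: int,
                  current_request: float) -> CpuRecommendation:
    """Pick a per-pod CPU request using the three-tier fallback."""
    # Tier 1: spread the lowest observed load across minimum replicas,
    # then check that this request covers peak load at maximum replicas.
    min_requests = min_load / min_replicas
    if min_requests * max_replicas >= max_load:
        return CpuRecommendation(min_requests, "minimum load distribution")

    # Tier 2: keep the current recommendation if it covers peak load
    # when scaled out to maximum replicas.
    if current_request * max_replicas >= max_load:
        return CpuRecommendation(current_request, "current recommendation")

    # Tier 3: catch-all, spread the peak load across maximum replicas.
    return CpuRecommendation(max_load / max_replicas,
                             "maximum load distribution")


# Hypothetical numbers: 0.5-10 cores of total load, HPA range 2-20.
print(recommend_cpu(0.5, 10.0, 2, 20, current_request=0.3))
```

With these hypothetical inputs, tier 1 yields 0.25 cores per pod, which covers only 5 cores at 20 replicas, and the current request of 0.3 cores covers only 6 cores, so the sketch falls through to tier 3 and returns 0.5 cores per pod.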

Example of a workload with a cyclical usage pattern

Example

Consider a user-facing service that experiences consistent daily traffic patterns - low usage during nighttime hours and high usage during business hours. During off-peak hours, the service might need only 0.5 CPU cores total across all pods. During peak hours, usage might spike to 10 CPU cores. Workload Autoscaler recognizes this pattern and ensures the workload can scale efficiently across its full operating range while maintaining optimal CPU allocation at both low and high usage periods.

The capacity checks the system performs when calculating CPU distribution - such as validating that minRequests * maxReplicas >= maxLoad - ensure the workload always has sufficient capacity to handle its expected load patterns, whether operating at minimum or maximum scale.
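One way to read the check minRequests * maxReplicas >= maxLoad is that the ratio of maxReplicas to minReplicas must be at least the ratio of peak load to lowest load. The short sketch below applies that to the 0.5 and 10 core figures from this example. The helper name and the minReplicas values are assumptions chosen for illustration, since the example does not state an HPA range.

```python
import math


def max_replicas_needed(min_load: float, max_load: float,
                        min_replicas: int) -> int:
    """Smallest maxReplicas for which minRequests * maxReplicas >= maxLoad.

    Loads are total CPU cores across all pods. Floating-point division
    can make the ceiling off by one for awkward inputs; this sketch does
    not guard against that.
    """
    if min_load <= 0 or max_load < min_load or min_replicas < 1:
        raise ValueError("expected 0 < min_load <= max_load, min_replicas >= 1")
    min_requests = min_load / min_replicas  # cores per pod at lowest load
    return math.ceil(max_load / min_requests)


# 0.5 cores off-peak and 10 cores at peak is a 20x load swing.
print(max_replicas_needed(0.5, 10.0, min_replicas=1))
print(max_replicas_needed(0.5, 10.0, min_replicas=2))
```

With an assumed minReplicas of 1, the check passes only if the HPA allows at least 20 replicas, and with minReplicas of 2 it needs at least 40. A narrower HPA range would fail the check and rely on the later fallback strategies instead.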

Limitations

When using VPA and HPA together on Cast AI, the algorithmic optimizations described above happen automatically in the background. No manual configuration is needed or available, and there is no way to toggle this behavior on or off. The following limitations therefore apply.

CPU-based scaling only

Workload Autoscaler's VPA and HPA integration currently only works with CPU utilization-based horizontal scaling. When HPA is configured based on memory metrics or custom metrics, the automatic optimization between vertical and horizontal scaling will not occur. This is because CPU usage patterns are typically more predictable and evenly distributable across pods compared to memory usage.

Custom metric HPAs

If you have configured HPA to scale based on custom metrics, the background VPA-HPA optimization will not be applied. The system will continue to respect your custom metric-based scaling decisions without attempting to optimize the interaction between vertical and horizontal scaling.

VPA setting availability

Several VPA settings available for the Workload Autoscaler become irrelevant as the system handles CPU allocation dynamically and balances between the two optimization strategies. Even if their values are set in the vertical scaling policy, they will be ignored.

Recommendation percentile:
When both VPA and HPA are enabled for a workload, the system does not use the CPU values set manually by the user. It relies on the algorithms described above to issue CPU recommendations instead. For memory, the recommendation percentile can still be configured.

Resource overhead:
When both VPA and HPA are enabled for a workload, adding CPU overhead is no longer possible because the workload can be scaled in two dimensions to meet any increased CPU demand. Memory overhead can still be configured, as memory falls outside the HPA-VPA algorithms.