Container Live Migration on GCP and Azure, Index Advisor for MySQL, and the Continuous Rebalancer for Cast AI for Karpenter

In April, Container Live Migration reached general availability on GCP and Azure, complementing the existing AWS support.

Database Optimizer's Index Advisor was extended to MySQL, with both DROP and CREATE recommendations validated against industry-standard benchmark workloads; Performance Advisor reached MVP.

On Cast AI for Karpenter clusters, the Continuous Rebalancer now replaces Evictor as the consolidation mechanism.

The month also brought AKS GPU driver selection per pool, Accelerated Networking for all node types, OMNI Terraform modules for AWS, GCP, and OCI edge locations, JVM auto-instrumentation in Workload Autoscaler, ProxySQL as an alternative connection pooler for DBO, and per-node-template provisioned cost reporting.

Major Features and Improvements

Container Live Migration on GCP and Azure (GA)

Container Live Migration (CLM) is now generally available on Google Kubernetes Engine and Azure Kubernetes Service, extending the existing AWS support.

Database Optimizer: Index Advisor for MySQL

Index Advisor now supports MySQL alongside PostgreSQL. It recommends both indexes worth adding and unused indexes worth dropping, based on the actual query patterns running against the database.

Performance Advisor MVP

Performance Advisor reached MVP this month. It analyzes queries running against your databases and recommends specific changes—not just at the index level, but also at the query level. Every recommendation is validated against the live workload before being surfaced, and fixes are prioritized based on how much database time they will reclaim.

Cloud Provider Integrations

AKS

GPU Driver Selection (CUDA vs GRID) Per Pool

When Cast AI provisions GPU nodes on AKS, the NVIDIA driver type (CUDA or GRID) is now selected based on the target VM SKU family rather than being inherited from the customer's source pool. This resolves a class of GPU initialization failures where the source pool used NV-series VMs (GRID), but Cast AI provisioned NC-series VMs (which require CUDA).
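The selection rule can be pictured as a lookup on the SKU family of the VM being provisioned. The sketch below is illustrative only: the function names, the lookup table, and the SKU-name parsing are assumptions, not Cast AI's implementation. Only the NV-to-GRID and NC-to-CUDA pairing comes from the note above.

```python
from typing import Optional

# Only the two families named in this release note are mapped (assumption:
# other families would be handled elsewhere).
DRIVER_BY_FAMILY = {"NV": "GRID", "NC": "CUDA"}


def driver_for_sku(vm_sku: str) -> Optional[str]:
    """Return the driver type for an Azure VM SKU such as 'Standard_NC6s_v3'."""
    parts = vm_sku.split("_")
    # Azure SKU names usually look like '<tier>_<size>_<version>'.
    size = parts[1] if len(parts) > 1 else parts[0]
    return DRIVER_BY_FAMILY.get(size[:2].upper())


def driver_for_new_node(source_pool_sku: str, target_sku: str) -> Optional[str]:
    """Pick the driver from the target SKU; the source pool is deliberately ignored."""
    return driver_for_sku(target_sku)
```

With this rule, a node provisioned as NC-series gets CUDA even when the source pool runs NV-series VMs.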

Accelerated Networking Beyond GPU Nodes

Accelerated Networking support on AKS now extends to all node types, not just GPU nodes. The setting can be left at the default (allow when supported) or explicitly disabled per node configuration.

Agent Baker Image Replication

The list of Azure regions for AKS agent baker image replication was updated. This fixes node provisioning in regions such as Australia Southeast, where the Microsoft community images lack the NVMe tag, preventing the creation of v6-family instances.

Ubuntu 22.04 Forced on Kubernetes 1.35

With Kubernetes 1.35, AKS started defaulting to Ubuntu 24.04. Until 24.04 is fully validated for Cast AI-managed nodes, Cast AI now forces Cast pools to provision Ubuntu 22.04 so that instance templates and provisioned nodes remain consistent.

EKS

Better Handling of Partially-Configured SecurityGroupsForPods

Cast AI now handles EKS clusters where the SecurityGroupsForPods feature is only partially configured — for example, where the prerequisites are in place but no nodes are advertising the pod-ENI resource yet. Previously, these conditions could cause healthy nodes to be deleted.

EBS Limit Calculation for Nitro Instances

The autoscaler's EBS-volume scaling logic now accounts for the shared EBS+ENI pools used by AWS Nitro instances (e.g., the m5 family).
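As a rough illustration of why the shared pool matters: on instance types where EBS volumes and network interfaces draw from one attachment limit, every attached ENI reduces the number of volumes that can still be attached. The sketch below is not the autoscaler's code. The function name and parameters are hypothetical, and the limit of 28 used in the example is an assumption to verify against AWS documentation for the specific instance type.

```python
def attachable_ebs_volumes(
    shared_limit: int,
    attached_enis: int,
    instance_store_volumes: int = 0,
    attached_ebs_volumes: int = 0,
) -> int:
    """Return how many more EBS volumes fit under a shared attachment limit."""
    used = attached_enis + instance_store_volumes + attached_ebs_volumes
    # Never report a negative number of free slots.
    return max(shared_limit - used, 0)


# Example: an assumed shared limit of 28, three ENIs, and one root volume.
remaining = attachable_ebs_volumes(28, attached_enis=3, attached_ebs_volumes=1)
```

A calculation that ignored ENIs would overestimate the free slots in this example by three.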

Commitments

Capacity-Reservation Coverage for External Nodes

Capacity-reservation coverage is now visible in cost reports for nodes provisioned outside Cast AI. Customers using Cloud Connect can now see which of their externally provisioned nodes are running under existing capacity reservations, alongside the rest of their cost data.

Generic Commitments Upload

Customers can now bring their commitment data into Cast AI even when Cloud Connect isn't an option — for example, when commitments are managed through a third-party vendor. Once uploaded, those commitments show up alongside the rest of the cost and savings picture in Cast AI for the first time.

Enterprise Commitments in the Console

Enterprise commitments — commitments that span multiple organizations — are now manageable directly in the console, on top of the cross-organization usage tracking added previously. Customers with enterprise-scale agreements can now see and manage these commitments without having to go through support.

AWS Cloud Connect via Terraform

AWS Cloud Connect can now be set up through the Cast AI Terraform provider, letting customers manage Cloud Connect alongside the rest of their infrastructure as code.

Workload Optimization

JVM Auto-Instrumentation

Workload Autoscaler can now automatically instrument JVM workloads end-to-end. Previously, getting JVM memory recommendations meant configuring Prometheus and the JMX exporter manually before Cast AI had anything to work with. Now, when a JVM workload is detected and the scaling policy opts in, Cast AI installs Prometheus in the cluster (off by default) and injects the JMX exporter into the workload automatically, so JVM optimization works out of the box. Auto-instrumentation is opt-in per scaling policy, and JVM recommendations surface only once enough data has been collected to make them reliable.

DaemonSet Dynamic Resource Sizing

Workload Autoscaler now recalculates DaemonSet resource requirements during binpacking as instance types change, rather than using static pre-spawned sizes. This fixes provisioning failures for customers running DaemonSets whose pod requests vary based on the selected node. Previously, the autoscaler sized nodes from the static DaemonSet spec, while the admission webhook later mutated pod requests, leading to insufficient capacity. DaemonSet support also gains min/max recommendation constraints alongside the existing node-percentage feature.
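A minimal sketch of the idea, assuming a DaemonSet whose CPU request is a percentage of node CPU clamped to a min/max range. All names and numbers here are hypothetical and CPU-only for brevity; the point is only that the DaemonSet overhead is recomputed for each candidate instance type instead of being taken from a static spec.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_millis: int


def daemonset_cpu_request(node_cpu_millis: int, percent: int,
                          min_millis: int, max_millis: int) -> int:
    """Size the DaemonSet pod as a share of the node, clamped to [min, max]."""
    raw = node_cpu_millis * percent // 100
    return max(min_millis, min(raw, max_millis))


def first_fitting_type(candidates: List[InstanceType], pending_millis: int,
                       percent: int, min_millis: int,
                       max_millis: int) -> Optional[InstanceType]:
    """Return the first instance type that fits the pending pods plus the
    DaemonSet overhead computed for that specific instance type."""
    for it in candidates:
        overhead = daemonset_cpu_request(it.cpu_millis, percent,
                                         min_millis, max_millis)
        if it.cpu_millis - overhead >= pending_millis:
            return it
    return None


candidates = [InstanceType("small", 2000), InstanceType("large", 8000)]
chosen = first_fitting_type(candidates, pending_millis=1900,
                            percent=10, min_millis=100, max_millis=500)
```

Here the small type is rejected because its recomputed DaemonSet overhead (200m) leaves only 1800m for the pending pods.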

Job-like Workload Indicator

The Workload Autoscaler API and UI now expose whether a workload is treated as "job-like." Job-like workloads have different optimization behavior — only three runs are required for confidence-building, and their recommendation CRDs are retained longer. Customers can now verify from the UI whether their custom workload labels are taking effect (for example, for Spark workloads).

Anomaly Detection and Custom Data Sources in Terraform

The anomaly-detection option and custom data sources used by JVM-style workloads can now be configured through the Cast AI Terraform provider.

Workload Detail: Reliability Section

The Workload Autoscaler workload detail page now includes a Reliability section with request throughput, error rate, and latency time-series data, mirroring the cost-reporting Reliability tab directly in the workload view.

System Policies: Lower Default Minimum Replicas

The default minimum-replica count on system HPA policies was lowered from 3, since three replicas could cause notable overhead on workloads with anti-affinity, pod-topology constraints, or high per-replica memory usage.

Cast AI for Karpenter

Continuous Rebalancer Replaces Evictor

On Cast AI for Karpenter clusters, the Continuous Rebalancer now replaces Evictor as the consolidation mechanism. The console was updated accordingly.

The Continuous Rebalancer now also consumes a dedicated commitments budget that tells the autoscaler how much commitment-covered capacity a cluster may use (initially AWS Reserved Instances only).

Node Autoscaling

Evictor: Improved PDB Handling

The Evictor is now more resilient when working with pods governed by PodDisruptionBudgets. It identifies pods with tight disruption budgets and evicts those first, retrying when a budget temporarily blocks an eviction. This reduces cases where the Evictor would clear most of a node and then mark it as failed because of a single remaining pod.
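The ordering and retry behavior described above could be sketched roughly as follows. This is an assumption-laden illustration, not the Evictor's code: "tight" is taken to mean a low value of the PodDisruptionBudget's disruptionsAllowed status, and the function names, retry count, and delay are hypothetical.

```python
import time
from typing import Callable, List, Tuple


def eviction_order(pods: List[Tuple[str, int]]) -> List[str]:
    """Order pods so those with the fewest allowed disruptions go first.

    Each entry is (pod_name, disruptions_allowed).
    """
    return [name for name, _ in sorted(pods, key=lambda p: p[1])]


def evict_with_retry(try_evict: Callable[[], bool],
                     attempts: int = 3, delay_s: float = 0.0) -> bool:
    """Call try_evict until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if try_evict():
            return True
        # Wait before retrying in case the budget frees up.
        if attempt < attempts - 1 and delay_s > 0:
            time.sleep(delay_s)
    return False


order = eviction_order([("api", 3), ("db", 0), ("cache", 1)])
```

Handling the most constrained pods first surfaces a blocking budget early, before the rest of the node has been cleared.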

Rebalancer: Live Migration Compatibility Pre-Check

When live migration is set to "preferred," the rebalancer now checks workload migration compatibility before kicking off a migration. This avoids wasted migration attempts on workloads that aren't currently migratable.

Rebalancer Resilience on Node Interruption

The rebalancer no longer fails an entire rebalance operation when a single node is interrupted while being prepared. The operation continues with the remaining nodes.

Max-CPU Limit Alerts

New alerts trigger when a production cluster approaches 90–95% of its configured maximum cluster CPU limit, giving customers time to adjust the limit manually before the autoscaler is forced to stop provisioning new capacity.
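The check behind such an alert is a simple ratio. The note gives a 90–95% range without saying how it is applied, so the sketch below assumes a single configurable threshold defaulting to 0.90; the function names are hypothetical.

```python
def cpu_limit_usage(provisioned_cpus: float, max_cluster_cpus: float) -> float:
    """Return provisioned CPU as a fraction of the configured maximum."""
    if max_cluster_cpus <= 0:
        raise ValueError("max_cluster_cpus must be positive")
    return provisioned_cpus / max_cluster_cpus


def should_alert(provisioned_cpus: float, max_cluster_cpus: float,
                 threshold: float = 0.90) -> bool:
    """True once usage reaches the (assumed) alert threshold."""
    return cpu_limit_usage(provisioned_cpus, max_cluster_cpus) >= threshold


def remaining_headroom(provisioned_cpus: float, max_cluster_cpus: float) -> float:
    """CPUs that can still be provisioned before the limit is hit."""
    return max(max_cluster_cpus - provisioned_cpus, 0.0)
```

For a cluster with a 400-CPU limit, this threshold would fire at 360 provisioned CPUs, leaving 40 CPUs of headroom.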

Custom Namespace for Autoscaler Components

Cast AI now supports installing autoscaler components in a namespace other than castai-agent, addressing customers whose internal policies require components to live in a specific namespace.

Earlier Validation Before Hibernating a Cluster

Cluster resume readiness is now validated when a cluster is being hibernated, not just when resuming. Misconfigurations — for example, EKS clusters missing required permissions — are now flagged up front, so customers find out before hibernation rather than discovering the issue when a resume fails.

Hibernated Clusters No Longer Show False "Reconciling" State

Updating a node configuration on a hibernated cluster no longer triggers a reconciliation attempt or leaves the cluster appearing stuck in a "reconciling" state in the console. Hibernated clusters now stay in a clean, accurate state until they're resumed.

Clearer EKS Resume Errors When Capacity Is Unavailable

When EKS fails to resume a node group from hibernation because AWS has no spare capacity in the requested instance type or AZ, the audit log now surfaces the underlying AWS error directly, instead of a generic timeout. Customers can see why a resume failed at a glance.

Higher Availability for Pricing and Instance-Type Lookups

Cast AI's pricing and instance-type lookups now have a fallback path so they keep working even when the underlying systems are degraded. If primary data sources are temporarily unavailable, customers continue to see instance types and pricing — drawn from recent backups — instead of failures. If a discount calculation can't complete in time, the underlying instance and pricing information is still returned, just without discounts applied, rather than failing the entire request.
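The two degradation paths described above follow a common pattern, sketched here under stated assumptions. The function names, the use of exceptions to signal an unavailable source, and the result shape are all hypothetical; the sketch shows only the general fallback technique, not Cast AI's service code.

```python
from typing import Callable, Dict, List, Union


def lookup_with_fallback(primary: Callable[[], List[str]],
                         backup: Callable[[], List[str]]) -> List[str]:
    """Use the primary source; fall back to a recent backup if it fails."""
    try:
        return primary()
    except Exception:
        return backup()


def price_with_optional_discount(
    list_price: float, get_discount_rate: Callable[[], float]
) -> Dict[str, Union[float, bool]]:
    """Return a discounted price, or the list price if the discount times out."""
    try:
        rate = get_discount_rate()
    except TimeoutError:
        # Degrade gracefully: return pricing without the discount applied.
        return {"price": list_price, "discount_applied": False}
    return {"price": list_price * (1 - rate), "discount_applied": True}
```

The caller always receives a usable answer and can tell from `discount_applied` whether the figure is list or discounted pricing.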

Database Optimization

ProxySQL Support for Connection Pooling

Database Optimizer now supports ProxySQL as a connection pooler alongside the existing PgDog support. Customers running ProxySQL can now use Database Optimizer with their existing setup, with pooling metrics visible in the API and console alongside everything else.

MySQL Connection Tracking

Database Optimizer now tracks active and idle connection counts for MySQL databases, in addition to the existing Postgres support. MySQL customers get the same visibility into their connection usage patterns.

Cleaner Database List in Performance Advisor

The Performance Advisor databases list now only shows databases with db-agent actively reporting. Databases without an active agent are no longer listed, removing a source of confusion when an old configuration lingered without an agent attached.

Run Database Optimizer on AWS ECS

A reference ECS task definition is now available for running Database Optimizer as an ECS service on AWS. Customers running their database tooling on ECS rather than Kubernetes can now deploy Database Optimizer through their standard ECS workflows.

Filter Cache Metrics by User

The cache configuration view in Database Optimizer now supports filtering by user, alongside the existing endpoint and connection-attribute filters. Customers can isolate cache behavior for a specific application user when investigating performance.

Auto-Refreshing Database Optimizer Views

The caching, Index Advisor, and Performance Advisor views now refresh automatically, so the numbers customers see stay current without manually reloading the page.

Graceful Shutdown on DBO Upgrades

DBO pods now shut down gracefully during Helm-chart upgrades, eliminating a class of transient connection errors observed during rolling upgrades.

Cost Management

Provisioned Resource Cost by Node Template

The Cost Monitoring organization view now includes a table breaking down provisioned CPU and RAM cost per node template, with a time-range selector covering historical months and CSV export. Customers can now see per-node-template spend directly from the console without having to extract the data through other means.

List or Discount Pricing in Savings Reports

Savings reports at both the organization and cluster levels now include a price-type dropdown for switching between list and discount pricing, bringing them in line with the other cost reports in the console.

OpsPilot (formerly "Ask Cast AI")

The "Ask Cast AI" assistant has been rebranded to OpsPilot across the console, alongside multiple usability improvements.

OMNI Edge Provisioning

Terraform Support for AWS, GCP, and OCI Edge Locations

The OMNI Terraform module now supports edge locations on AWS, GCP, and OCI, completing multi-cloud Terraform coverage.

Shared Control Plane as the Default for New Edge Locations

New OMNI edge locations are now created with a shared control plane by default — the recommended setup for running multiple edge locations together. Customers creating edge locations from the console, the API, or Terraform get this configuration automatically.

Multi-Edge Deployments Supported on OCI Identity Domains

Customers running multiple OMNI edge locations within the same OCI identity domain are now fully supported. An earlier authentication change had restricted these deployments due to OCI's limit of one identity-trust configuration per provider; the underlying authentication has been adjusted so this limitation no longer applies.

Simpler GPU Metrics on Edge Clusters

GPU metrics on edge clusters are now collected through Kvisor, which most customers already have installed. The separate gpu-exporter chart is no longer needed, removing one component from the edge stack.

Keyless GCP Access for OMNI

OMNI now uses GCP Workload Identity to access GCP from the customer's environment, eliminating the need for static service-account keys in the OMNI deployment. Customers no longer have long-lived credentials sitting in their cluster for OMNI to function.

Live Migration

Better Visibility Into Migration Outcomes

Container Live Migration now records and classifies a broader set of migration outcomes, including failures during restoration, failures that occur after restoration completes, and cancellation cases where the source or restored pod disappears mid-migration. Failures during the snapshotting phase are now identified by which stage they occurred in, making the root cause easier to pinpoint. Workload status now also reflects migration results directly — a healthy signal when a migration succeeds and a degraded signal when restoration fails — with a clear, descriptive message on every status change.
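One way to picture the classification is as an enumeration of outcomes mapped to a workload status signal. The enum members and function below are hypothetical names chosen for illustration. Only two mappings are stated in the note (success is healthy, a restoration failure is degraded), so the sketch leaves the rest unspecified.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    SNAPSHOT_FAILED = "snapshot_failed"          # failure during snapshotting
    RESTORE_FAILED = "restore_failed"            # failure during restoration
    POST_RESTORE_FAILED = "post_restore_failed"  # failure after restoration
    CANCELLED_POD_GONE = "cancelled_pod_gone"    # source or restored pod vanished


def workload_signal(outcome: Outcome) -> Optional[str]:
    """Map an outcome to the workload status signal described in the note."""
    if outcome is Outcome.SUCCEEDED:
        return "healthy"
    if outcome is Outcome.RESTORE_FAILED:
        return "degraded"
    # The note does not say which signal the other outcomes produce.
    return None
```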

Pre-Migration Checks Catch Unsupported Pods Early

Container Live Migration now checks whether a pod is actually migratable before starting a migration. Pods using configurations that aren't supported — for example, HostPath volumes, Istio sidecars, or GPU resources — are now rejected up front with a clear explanation of why.
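A pre-check of this kind can be approximated by inspecting the pod manifest. The sketch below is a simplified assumption of how such a check might look, not the product's logic: it operates on a pod expressed as a plain dictionary, detects Istio by the conventional `istio-proxy` container name, and treats `nvidia.com/gpu` as the GPU resource name. The function name is hypothetical.

```python
from typing import List


def migration_blockers(pod: dict) -> List[str]:
    """Return human-readable reasons why a pod would be rejected."""
    spec = pod.get("spec", {})
    reasons: List[str] = []

    # A volume entry with a 'hostPath' key is a HostPath volume.
    if any("hostPath" in v for v in spec.get("volumes", [])):
        reasons.append("uses a hostPath volume")

    containers = spec.get("containers", []) + spec.get("initContainers", [])
    if any(c.get("name") == "istio-proxy" for c in containers):
        reasons.append("has an Istio sidecar")

    for c in containers:
        res = c.get("resources", {})
        names = set(res.get("limits", {})) | set(res.get("requests", {}))
        if "nvidia.com/gpu" in names:
            reasons.append("requests GPU resources")
            break

    return reasons


pod = {
    "spec": {
        "volumes": [{"name": "logs", "hostPath": {"path": "/var/log"}}],
        "containers": [
            {"name": "app", "resources": {"limits": {"nvidia.com/gpu": "1"}}},
        ],
    }
}
blockers = migration_blockers(pod)
```

An empty list means none of the three example conditions were found; it does not by itself prove a pod is migratable.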

Broader TCP Connection Preservation

Container Live Migration now preserves TCP connections across more cluster topology changes — when migrating from a single-node to a multi-node cluster and vice versa. For pods that previously relied on connection translation to keep TCP sessions alive across nodes, that translation is no longer needed: traffic now flows directly to the migrated pod, removing the prior overhead.

Live Migration Dashboard Updates

The Live Migration dashboard now shows error messages, cold and warm startup times, and a migration timeline, and was moved from a side drawer to a dedicated page for easier navigation.

Organization Management

Cluster ID Shown in the Cluster Dropdown

The cluster dropdown now displays each cluster's ID next to its name, and the search matches against either field. Customers managing multiple clusters with similar or identical display names can now pick the right cluster confidently.

Consistent UI for User Groups and Enterprise Groups

The User group and Enterprise group screens now share a consistent layout and interaction pattern, reducing confusion for customers who use both.

User Interface Improvements

Spot Interruption Data on Node Templates

The Node templates list now shows the spot interruption ratio for each template containing spot nodes, along with the spot-interruption feature status. Customers can click straight through to the Spot Interruption analysis page for any template that looks like an outlier.

Workload Page Improvements

The Workload page now defaults to showing the largest container in a pod rather than the first one alphabetically, so customers see the most relevant resource usage at a glance. Long workload names are now neatly truncated rather than breaking the table layout. The Workloads list now also includes Deployments, StatefulSets, CronJobs, and DaemonSets by default, alongside the workloads Workload Autoscaler manages.

Pricing Adjustments Drawer Loads All Instance Types

The pricing adjustments drawer now loads all available instance types, even for customers with very large catalogs (more than 2,500 instance types) that previously hit page limits.

Security and Compliance

Service Account Permission Right-Sizing

Following the quarterly security audit, broad administrative roles on multiple service accounts were replaced with granular permissions scoped to each service's actual functional needs.

Terraform and Agent Updates

We've released an updated version of our Terraform provider. As always, the latest changes are detailed in the changelog on GitHub. The updated provider and modules are now ready for use in your infrastructure-as-code projects in Terraform's registry.

We have released a new version of the Cast AI agent. The complete list of changes is here. To update the agent in your cluster, please follow these steps or use the Component Control dashboard in the Cast AI console.