Using Container Live Migration with Evictor
Container live migration integrates seamlessly with Cast AI's Evictor to enable zero-downtime workload optimization. This guide walks you through enabling live migration in your cluster and understanding how it works with Evictor to optimize your workloads automatically.
Before you begin
Ensure your cluster meets the requirements for container live migration. Review the complete requirements and limitations before proceeding.
Key prerequisite:
- Cast AI-managed AWS EKS cluster running Kubernetes 1.30 or later
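If you want to double-check the Kubernetes version before proceeding, a standard kubectl query against the cluster is enough:
# The Server Version must report v1.30 or later
kubectl version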
Enable container live migration
Container live migration can be enabled through two approaches:
- Cast AI Console (UI): Configure live migration through the Cast AI console using node templates and Autoscaler settings
- Terraform: Set up live migration infrastructure as code using Cast AI's Terraform provider
Using Terraform
If you prefer infrastructure as code, you can enable container live migration using Terraform. See our EKS Live Migration Terraform example for complete configuration templates and setup instructions.
Using Cast AI Console
Container live migration setup varies depending on whether you're setting up a new cluster or enabling the feature on an existing cluster.
For new clusters
When you connect a new cluster to Cast AI and enable automation, live migration components are automatically installed as part of the phase 2 onboarding script. You'll still need to configure node templates and node configuration as described below to ensure nodes support live migration.
For existing clusters
You'll need to manually install the live migration components first, then configure node templates. Follow these steps in order:
Step 1: Install live migration components (existing clusters only)
For existing clusters, manually install the live migration controller using Helm:
# Add Cast AI Helm repository
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update castai-helm
# Install live migration controller
helm install castai-live castai-helm/castai-live \
--namespace castai-agent \
--set castai.apiKey=<CONSOLE_API_KEY> \
--set castai.apiURL=<API_URL> \
--set castai.clusterID=<CLUSTER_ID> \
--set daemon.install.enabled=true \
--set castai-aws-vpc-cni.enabled=true

Replace the placeholders with your actual values:
- <CONSOLE_API_KEY>: Your Cast AI API key
- <API_URL>: Cast AI API URL (https://api.cast.ai or https://api.eu.cast.ai)
- <CLUSTER_ID>: Your Cast AI cluster ID
This installation:
- Installs the live migration controller in your cluster
- Sets up the specialized VPC CNI for TCP preservation
- Configures the necessary daemon components
All components are installed in the castai-agent namespace.
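To confirm the installation, list the pods in that namespace and check the Helm release status (pod names vary by chart version, so treat the output as illustrative):
# All live migration pods should reach the Running state
kubectl get pods -n castai-agent
# Check the Helm release status
helm status castai-live -n castai-agent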
Step 2: Configure node templates
Navigate to your cluster's Autoscaler → Node templates section and either create a new template or edit an existing one.
In the node template configuration, locate the Container live migration section and enable it:
Enabling this option ensures that nodes provisioned using this template will support live migration.
Configure compatible instance families
After enabling Container Live Migration, you must configure compatible instance families to ensure successful migrations. Click on the Compatible instance helper section in the node template:
The instance selector helps you choose compatible instance families from the same CPU generation set. This ensures that workload migrations between nodes will succeed.
Select base instance family
Start by selecting your base instance family. This determines the CPU architecture and generation that will serve as the foundation for compatible selections:
Choose compatible families
After selecting the base family, the interface will show you all compatible instance families. These are automatically filtered to include only families that share the same CPU generation characteristics:
For example, if you select m5 as your base family, compatible options include other fifth-generation families such as c5 and r5. Fourth-generation families like r4, c4, or m4 are not compatible and are therefore excluded.
Apply instance constraints
Once you've selected your compatible families, the node template's instance constraints will be automatically updated to include only these compatible options:
This configuration overrides any previous instance constraints in the template, ensuring that only live migration-compatible instance types can be provisioned.
Why compatible instances matter

Container Live Migration requires nodes with compatible CPU architectures to successfully transfer running workloads. Attempting to migrate between incompatible CPU generations (e.g., from c3 to c5 families) will result in migration failures. The compatible instance selector automates this compatibility checking, preventing configuration errors that could cause migration issues.
Step 3: Configure node infrastructure
Navigate to your cluster's Autoscaler → Node configuration section and either create a new configuration or edit an existing one. The configuration must be linked to the node template you configured in the previous step.
Ensure your Node configuration is set up correctly for live migration:
- Image family: Set to Amazon Linux 2023 in the Image configuration section.
- Container runtime: Set Containerd as the container runtime.
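Once nodes provisioned from this configuration join the cluster, you can verify both settings from the node status (these are standard Kubernetes node fields, not Cast AI-specific):
# osImage should show Amazon Linux 2023 and the runtime should be containerd
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage,RUNTIME:.status.nodeInfo.containerRuntimeVersion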
Step 4: Enable Evictor with live migration
Navigate to Autoscaler → Settings and locate the Evictor section within the Node deletion policy. Enable Evictor to leverage container live migration in the cluster.
For existing clusters with Evictor already installed: You'll need to update the Evictor configuration to enable live migration support:
helm -n castai-agent upgrade castai-evictor castai-helm/castai-evictor \
--set 'liveMigration.enabled=true' \
--reuse-values

For newly connected clusters: Evictor will automatically include live migration support when installed as part of the cluster onboarding process.
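In either case, you can confirm the setting is present in the deployed Evictor values:
# The output should include liveMigration with enabled: true
helm get values castai-evictor -n castai-agent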
Step 5: Replace existing nodes (recommended)
Perform a rebalancing to replace current nodes with live-migration-enabled nodes. This ensures Evictor has compatible nodes available for migration operations. Otherwise, you would have to wait for nodes to be added naturally as workloads request additional capacity.
Navigate to Rebalancer and create a rebalancing plan to replace all nodes. For instructions on cluster rebalancing, see Rebalancing.
This results in a cluster where all nodes support live migration, maximizing the optimization opportunities for Evictor immediately.
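After the rebalancing completes, you can confirm that the new nodes advertise live migration support by filtering on the node label used elsewhere in this guide:
# Lists only nodes provisioned with live migration support
kubectl get nodes -l live.cast.ai/migration-enabled=true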
How Evictor uses live migration
Once enabled, Evictor automatically leverages live migration to optimize your cluster without manual intervention.
Automatic workload identification
Cast AI's live controller continuously scans your cluster and identifies workloads eligible for live migration by:
- Analyzing workload characteristics: Evaluating configurations, storage requirements, and other eligibility parameters
- Applying labels: Adding live.cast.ai/migration-enabled=true labels to compatible workloads
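You can see which workloads the controller has already marked as eligible by filtering on that label:
# Pods the live controller has labeled as eligible for live migration
kubectl get pods -A -l live.cast.ai/migration-enabled=true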
Migration decision logic
When Evictor identifies bin-packing opportunities, it follows this decision process:
- Check live migration eligibility: Evictor first checks if workloads have the live migration label
- Attempt live migration: For eligible workloads, Evictor initiates live migration to the destination node
- Fallback to eviction: If live migration fails for any reason, Evictor falls back to traditional pod eviction and still completes its bin-packing
- Preserve critical workloads: Workloads with autoscaling.cast.ai/removal-disabled labels are recovered on the original node if migration fails
All of this behavior can be controlled via labels that Evictor respects.
Workload label matrix
Evictor respects multiple labels that control migration and eviction behavior:
| live.cast.ai/migration-enabled | autoscaling.cast.ai/removal-disabled | autoscaling.cast.ai/live-migration-disabled | Evictor action |
|---|---|---|---|
true | true | false (or missing) | Live migrate the workload, but do not fall back to traditional eviction if it fails; restore the pod on the source node instead |
true | true | true | Do nothing (workload protected) |
true | false (or missing) | true | Evict the pod using traditional eviction |
true | false (or missing) | false (or missing) | Live migrate the workload; if it fails, fall back to traditional eviction |
false | true | (ignored) | Do nothing (workload protected) |
false | false | (ignored) | Evict the pod using traditional eviction |
Label descriptions:
- live.cast.ai/migration-enabled=true: Automatically applied by the live controller to eligible workloads
- autoscaling.cast.ai/removal-disabled=true: Prevents eviction and ensures recovery on the original node if migration fails
- autoscaling.cast.ai/live-migration-disabled=true: Forces traditional eviction instead of attempting live migration
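The removal-disabled and live-migration-disabled labels are ones you apply yourself. As a minimal sketch, assuming a hypothetical Deployment named my-app and that the label is honored on the pod template, you could opt a workload out of live migration like this:
# Example only: add the opt-out label to a Deployment's pod template
kubectl patch deployment my-app -n default --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"autoscaling.cast.ai/live-migration-disabled":"true"}}}}}'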
Monitoring live migrations
You can monitor progress through custom resources:
# List ongoing migrations
kubectl get migrations -A
# Get detailed migration status
kubectl describe migrations <migration-name> -n <namespace>

Inspect the events for each migration to build an understanding of the steps that were executed and where a migration might have failed.
Migration events
Migration progress is tracked through Kubernetes events in the migration resource:
| Event | Description |
|---|---|
PreDumpSkipped | Memory pre-dump iteration skipped |
PreDumpFinished | Memory pre-dump completed successfully |
PreDumpFailed | Memory pre-dump operation failed |
MigrationReconfigured | Migration configuration was updated |
PodCreateFinished | Pod successfully recreated on the new node |
PodCreateFailed | Failed to create pod on destination node |
MigrationFinished | Pod successfully migrated to the new node |
MigrationFailed | Migration failed; check logs for details |
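Because these events are emitted on the migration resource, you can also pull them directly with a field selector instead of reading the full kubectl describe output:
# Show only the events recorded for a specific migration
kubectl get events -n <namespace> --field-selector involvedObject.name=<migration-name>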
Configuration options
Evictor modes with live migration
Default mode: Evictor migrates multi-replica workloads and evicts single-replica applications using traditional methods.
Aggressive mode: When enabled, Evictor attempts to live-migrate all eligible workloads, including single-replica deployments and bare pods. This mode maximizes cost optimization but requires careful testing.
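Aggressive mode is a property of the Evictor configuration. As a sketch, assuming the Evictor chart exposes an aggressiveMode value (verify the exact key against the chart's values for your version), it could be toggled with Helm:
# Assumed value key; confirm against the castai-evictor chart before applying
helm -n castai-agent upgrade castai-evictor castai-helm/castai-evictor \
  --set aggressiveMode=true \
  --reuse-values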
Troubleshooting
Live migration components not installed
If you're enabling container live migration on an existing cluster and components aren't working:
- Verify component installation: Check that the live controller is running:
kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-live
- Install missing components: For existing clusters, manually install using the Helm command in Step 1
- Check installation logs: Review installation logs for any errors:
kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-live
Workloads not being labeled
If workloads aren't receiving live migration labels:
- Verify requirements: Ensure workloads meet all technical requirements
- Check controller status: Confirm the live controller is running in the castai-agent namespace
- Review logs: Examine live controller logs for issues
Migration failures
When migrations fail:
- Check events: Review migration events for specific error messages
- Verify node compatibility: Ensure source and destination nodes are in the same instance family generation
Evictor not using live migration
If Evictor continues using traditional eviction:
- Confirm feature enablement: Verify live migration is enabled in node templates
- Check node labels: Ensure nodes have live.cast.ai/migration-enabled=true labels:
kubectl get nodes -l live.cast.ai/migration-enabled=true
- Review workload labels: Confirm workloads have appropriate migration labels applied:
kubectl get pods -A -l live.cast.ai/migration-enabled=true
- Verify Evictor configuration: Check that Evictor has live migration support enabled:
helm get values castai-evictor -n castai-agent | grep liveMigration