Using Container Live Migration with Evictor

Container live migration integrates seamlessly with Cast AI's Evictor to enable zero-downtime workload optimization. This guide walks you through enabling live migration in your cluster and understanding how it works with Evictor to optimize your workloads automatically.

Before you begin

Ensure your cluster meets the requirements for container live migration. Review the complete requirements and limitations before proceeding.

Key prerequisite:

  • Cast AI-managed AWS EKS cluster running Kubernetes 1.30 or later
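
A quick way to confirm the version requirement is to query the cluster directly; the AWS CLI variant is a sketch that assumes your credentials are configured and that <CLUSTER_NAME> is your EKS cluster name:

# Print client and server Kubernetes versions reported by kubectl
kubectl version

# Or ask EKS for the control-plane version directly
aws eks describe-cluster --name <CLUSTER_NAME> --query "cluster.version" --output text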

Enable container live migration

Container live migration can be enabled through two approaches:

  1. Cast AI Console (UI): Configure live migration through the Cast AI console using node templates and Autoscaler settings
  2. Terraform: Set up live migration infrastructure as code using Cast AI's Terraform provider

Using Terraform

If you prefer infrastructure as code, you can enable container live migration using Terraform. See our EKS Live Migration Terraform example for complete configuration templates and setup instructions.

Using Cast AI Console

Container live migration setup varies depending on whether you're setting up a new cluster or enabling the feature on an existing cluster.

For new clusters

When you connect a new cluster to Cast AI and enable automation, live migration components are automatically installed as part of the phase 2 onboarding script. You'll still need to configure node templates and node configuration as described below to ensure nodes support live migration.

For existing clusters

You'll need to manually install the live migration components first, then configure node templates. Follow these steps in order:

Step 1: Install live migration components (existing clusters only)

For existing clusters, manually install the live migration controller using Helm:

# Add Cast AI Helm repository
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update castai-helm

# Install live migration controller
helm install castai-live castai-helm/castai-live \
  --namespace castai-agent \
  --set castai.apiKey=<CONSOLE_API_KEY> \
  --set castai.apiURL=<API_URL> \
  --set castai.clusterID=<CLUSTER_ID> \
  --set daemon.install.enabled=true \
  --set castai-aws-vpc-cni.enabled=true

Replace the placeholders with your actual values:

  • <CONSOLE_API_KEY>: Your Cast AI API key
  • <API_URL>: Cast AI API URL (https://api.cast.ai or https://api.eu.cast.ai)
  • <CLUSTER_ID>: Your Cast AI cluster ID

This installation:

  • Installs the live migration controller in your cluster
  • Sets up the specialized VPC CNI for TCP preservation
  • Configures the necessary daemon components

All components are installed in the castai-agent namespace.
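
To confirm the installation succeeded, you can check the Helm release and the controller pods; the label selector below is the same one used in the troubleshooting section of this guide:

# Verify the Helm release deployed without errors
helm status castai-live -n castai-agent

# Check that the live migration controller pods are running
kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-live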

Step 2: Configure node templates

Navigate to your cluster's Autoscaler → Node templates section and either create a new template or edit an existing one.

In the node template configuration, locate the Container live migration section and enable it:

Enabling this option ensures that nodes provisioned using this template will support live migration.

Configure compatible instance families

After enabling Container Live Migration, you must configure compatible instance families to ensure successful migrations. Click on the Compatible instance helper section in the node template:

The instance selector helps you choose compatible instance families from the same CPU generation set. This ensures that workload migrations between nodes will succeed.

Select base instance family

Start by selecting your base instance family. This determines the CPU architecture and generation that will serve as the foundation for compatible selections:

Choose compatible families

After selecting the base family, the interface will show you all compatible instance families. These are automatically filtered to include only families that share the same CPU generation characteristics:

For example, if you select m5 as your base family, compatible options include other fifth-generation families such as c5, r5, and i5, while fourth-generation families like r4, c4, or m4 are not compatible and are excluded.

Apply instance constraints

Once you've selected your compatible families, the node template's instance constraints will be automatically updated to include only these compatible options:

This configuration overrides any previous instance constraints in the template, ensuring that only live migration-compatible instance types can be provisioned.

📘 Why compatible instances matter

Container Live Migration requires nodes with compatible CPU architectures to successfully transfer running workloads. Attempting to migrate between incompatible CPU generations (e.g., from c3 to c5 families) will result in migration failures. The compatible instance selector automates this compatibility checking, preventing configuration errors that could cause migration issues.
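
Before picking a base family, it can help to see which instance types your nodes currently use. The sketch below relies on the well-known node label that EKS sets on each node:

# Show each node's instance type as an extra column
kubectl get nodes -L node.kubernetes.io/instance-type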

Step 3: Configure node infrastructure

Navigate to your cluster's Autoscaler → Node configuration section and either create a new configuration or edit an existing one. The configuration must be linked to the node template configured in the previous step.

Ensure your Node configuration is set up correctly for live migration:

  • Image family: Set to Amazon Linux 2023 in the Image configuration section.
  • Container runtime: Set Containerd as the container runtime.
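
Once nodes have been provisioned with this configuration, a quick way to spot-check the image family and runtime is the wide node listing, which includes OS-IMAGE and CONTAINER-RUNTIME columns:

# Confirm nodes report an Amazon Linux 2023 image and a containerd runtime
kubectl get nodes -o wide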

Step 4: Enable Evictor with live migration

Navigate to Autoscaler → Settings and locate the Evictor section within the Node deletion policy. Enable Evictor to leverage container live migration in the cluster.

For existing clusters with Evictor already installed: You'll need to update the Evictor configuration to enable live migration support:

helm -n castai-agent upgrade castai-evictor castai-helm/castai-evictor \
  --set 'liveMigration.enabled=true' \
  --reuse-values

For newly connected clusters: Evictor will automatically include live migration support when installed as part of the cluster onboarding process.

Step 5: Replace existing nodes (recommended)

Perform a rebalancing to replace current nodes with live-migration-enabled nodes. This ensures Evictor immediately has compatible nodes available for migration operations; otherwise, you would have to wait for nodes to be added naturally as workloads request additional capacity.

Navigate to Rebalancer and create a rebalancing plan to replace all nodes. For instructions on cluster rebalancing, see Rebalancing.

This results in a cluster where all nodes support live migration, maximizing the optimization opportunities for Evictor immediately.
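
After the rebalancing completes, you can verify that the replacement nodes carry the live migration label referenced later in this guide:

# Nodes provisioned from a live-migration-enabled template should appear here
kubectl get nodes -l live.cast.ai/migration-enabled=true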

How Evictor uses live migration

Once enabled, Evictor automatically leverages live migration to optimize your cluster without manual intervention.

Automatic workload identification

Cast AI's live controller continuously scans your cluster and identifies workloads eligible for live migration by:

  1. Analyzing workload characteristics: Evaluating configurations, storage requirements, and other eligibility parameters
  2. Applying labels: Adding live.cast.ai/migration-enabled=true labels to compatible workloads
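
You can list the workloads the controller has marked as eligible at any time:

# Pods labeled by the live controller as eligible for live migration
kubectl get pods -A -l live.cast.ai/migration-enabled=true --show-labels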

Migration decision logic

When Evictor identifies bin-packing opportunities, it follows this decision process:

  1. Check live migration eligibility: Evictor first checks if workloads have the live migration label
  2. Attempt live migration: For eligible workloads, Evictor initiates live migration to the destination node
  3. Fallback to eviction: If live migration fails for any reason, Evictor falls back to traditional pod eviction and still completes its bin-packing
  4. Preserve critical workloads: Workloads with autoscaling.cast.ai/removal-disabled labels are recovered on the original node if migration fails

All of this behavior can be controlled via labels that Evictor respects.

Workload label matrix

Evictor respects multiple labels that control migration and eviction behavior:

| live.cast.ai/migration-enabled | autoscaling.cast.ai/removal-disabled | autoscaling.cast.ai/live-migration-disabled | Evictor action |
| --- | --- | --- | --- |
| true | true | false (or missing) | Live migrate the workload, but do not fall back to traditional eviction if it fails; restore the pod on the source node instead |
| true | true | true | Do nothing (workload protected) |
| true | false (or missing) | true | Evict the pod using traditional eviction |
| true | false (or missing) | false (or missing) | Live migrate the workload; if it fails, fall back to traditional eviction |
| false | true | (ignored) | Do nothing (workload protected) |
| false | false | (ignored) | Evict the pod using traditional eviction |

Label descriptions:

  • live.cast.ai/migration-enabled=true: Automatically applied by the live controller to eligible workloads
  • autoscaling.cast.ai/removal-disabled=true: Prevents eviction and ensures recovery on the original node if migration fails
  • autoscaling.cast.ai/live-migration-disabled=true: Forces traditional eviction instead of attempting live migration
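
As an illustration only, this is how you might set the opt-out label on the pod template of a hypothetical Deployment named my-app so that Evictor always uses traditional eviction for it; adapt it to your own workloads and labeling workflow:

# Add the opt-out label to the pod template of a hypothetical Deployment
kubectl patch deployment my-app --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"autoscaling.cast.ai/live-migration-disabled":"true"}}}}}'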

Monitoring live migrations

You can monitor progress through custom resources:

# List ongoing migrations
kubectl get migrations -A

# Get detailed migration status
kubectl describe migrations <migration-name> -n <namespace>

Inspect the events for each migration to understand which steps were executed and where a migration might have failed.

Migration events

Migration progress is tracked through Kubernetes events in the migration resource:

| Event | Description |
| --- | --- |
| PreDumpSkipped | Memory pre-dump iteration skipped |
| PreDumpFinished | Memory pre-dump completed successfully |
| PreDumpFailed | Memory pre-dump operation failed |
| MigrationReconfigured | Migration configuration was updated |
| PodCreateFinished | Pod successfully recreated on the new node |
| PodCreateFailed | Failed to create pod on destination node |
| MigrationFinished | Pod successfully migrated to the new node |
| MigrationFailed | Migration failed, check logs for details |
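
Besides kubectl describe, you can pull these events directly from the cluster; this sketch assumes the custom resource kind is Migration, so adjust the field selector if your CRD reports a different kind:

# Recent events emitted for migration resources, sorted by timestamp
kubectl get events -A --field-selector involvedObject.kind=Migration --sort-by=.lastTimestamp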

Configuration options

Evictor modes with live migration

Default mode: Evictor migrates multi-replica workloads and evicts single-replica applications using traditional methods.

Aggressive mode: When enabled, Evictor attempts to live-migrate all eligible workloads, including single-replica deployments and bare pods. This mode maximizes cost optimization but requires careful testing.
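
If you want to experiment with aggressive mode on an existing installation, it is typically toggled through the Evictor Helm values; the aggressiveMode value name shown here is an assumption, so confirm it against the chart's values before applying:

# Check how the installed chart names the aggressive-mode value
helm show values castai-helm/castai-evictor | grep -i aggressive

# Then enable it while keeping the rest of your values intact
helm -n castai-agent upgrade castai-evictor castai-helm/castai-evictor \
  --set aggressiveMode=true \
  --reuse-values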

Troubleshooting

Live migration components not installed

If you're enabling container live migration on an existing cluster and the components aren't working:

  1. Verify component installation: Check that the live controller is running:
    kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-live
  2. Install missing components: For existing clusters, manually install using the Helm command in Step 1
  3. Check installation logs: Review installation logs for any errors:
    kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-live

Workloads not being labeled

If workloads aren't receiving live migration labels:

  1. Verify requirements: Ensure workloads meet all technical requirements
  2. Check controller status: Confirm the live controller is running in the castai-agent namespace
  3. Review logs: Examine live controller logs for issues

Migration failures

When migrations fail:

  1. Check events: Review migration events for specific error messages
  2. Verify node compatibility: Ensure source and destination nodes are in the same instance family generation

Evictor not using live migration

If Evictor continues using traditional eviction:

  1. Confirm feature enablement: Verify live migration is enabled in node templates
  2. Check node labels: Ensure nodes have live.cast.ai/migration-enabled=true labels:
    kubectl get nodes -l live.cast.ai/migration-enabled=true
  3. Review workload labels: Confirm workloads have appropriate migration labels applied:
    kubectl get pods -A -l live.cast.ai/migration-enabled=true
  4. Verify Evictor configuration: Check that Evictor has live migration support enabled:
    helm get values castai-evictor -n castai-agent | grep liveMigration