Requirements and limitations

Before enabling container live migration in your cluster, ensure your infrastructure meets the specific requirements for successful pod migration between nodes. This capability requires particular node configurations, compatible instance types, and specific Kubernetes versions to function reliably.

System requirements

Cloud platform support

Platform       | Support status | Details
Amazon EKS     | Full support   | Includes TCP session preservation and all migration features.
Google GKE     | Not supported  | Planned for future releases.
Microsoft AKS  | Not supported  |

Kubernetes version

Container live migration requires Kubernetes 1.30 or later. This minimum version ensures compatibility with the container runtime enhancements and custom resource definitions that enable live migration functionality.

Earlier Kubernetes versions lack the necessary API stability and container runtime features required for reliable checkpoint and restore operations. Checkpoint/restore support graduated to beta in Kubernetes 1.30 (released in 2024).
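
A quick way to confirm that your cluster meets this minimum is with standard kubectl output:

# The server (control plane) version must report v1.30 or later
kubectl version

# Kubelet versions per node appear in the VERSION column
kubectl get nodes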

Node infrastructure requirements

Cast AI management: Both source and destination nodes must be managed by Cast AI. Live migration cannot occur between Cast AI-managed nodes and nodes managed by other provisioners or cloud provider native node groups.

Node image requirements: Nodes require the Amazon Linux 2023 image family (FAMILY_AL2023) configured in the cluster's node configuration.

Container runtime: Nodes must use containerd v2+ as the container runtime engine. Other runtimes are not supported due to specific integration requirements for checkpoint and restore functionality.
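
To verify which runtime and version each node reports, you can read the runtime string from the node status (a simple check using standard node fields):

# List each node with the container runtime version reported by its kubelet
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion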

Network topology: Source and destination nodes must meet specific networking requirements within the same AWS region (the relevant node labels can be inspected with the example after this list):

Standard configuration (ENABLE_PREFIX_DELEGATION=false):

  • Single subnet per Node and Pod
  • No cross-subnet migrations (validated by checking the node's topology.cast.ai/subnet-id label)

Dedicated subnet for Pods:

  • ENIConfig CRD must be configured in the cluster
  • ENIConfig is per availability zone only
  • No cross-AZ migration (validated by topology.kubernetes.io/zone)
  • Source node must have live.cast.ai/custom-network label set

Subnet discovery must not be configured:

  • kubernetes.io/role/cni tag must not be configured on any subnets
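
These validations rely on node labels. One quick way to review them across the cluster, assuming the nodes already carry the labels described above:

# Show the subnet, zone, and custom-network labels used by the migration validations
kubectl get nodes -L topology.cast.ai/subnet-id,topology.kubernetes.io/zone,live.cast.ai/custom-network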

Instance type and CPU compatibility

CPU architecture consistency: Consistent CPU architecture is critical for successful live migration. Node templates used for live migration must be configured for a single processor architecture, either AMD64 or ARM64, not both ("Any"). Attempting to migrate workloads between nodes with different architectures (e.g., from AMD64 to ARM64) will fail. Cloud providers may also provision different CPU models within the same instance type, which can create compatibility issues with migration.

Node template configuration requirements: Instance families in your node templates must belong to the same generation set for migration compatibility:

  • Compatible sets: m3 with c3, or c5 with r5, m5, etc.
  • Incompatible: c3 with c5, or mixing different generation families

Configure your node templates to include only instance types from the same generation family via instance constraints. Migration between nodes with different CPU architectures or generation sets will fail.

Architecture support: Container live migration supports both AMD64 and ARM64 architectures, but node templates must be configured for a single architecture. You cannot mix architectures within a live migration-enabled node template.

When configuring your node template:

  • Set processor architecture to either AMD64 or ARM64
  • Do not select "Any"

All nodes provisioned from the template will use the selected architecture.
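
One way to confirm that provisioned nodes share a single architecture and come from the expected instance families is to list them with the standard well-known labels:

# List nodes with their CPU architecture and instance type
kubectl get nodes -L kubernetes.io/arch,node.kubernetes.io/instance-type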

There are also functional differences between architectures:

  • AMD64: Full incremental memory transfer
  • ARM64: Non-iterative memory transfer

This difference in how memory is dumped during the checkpoint process is inherent to ARM64 and cannot be worked around.

Container Network Interface (CNI)

Live migration requires specialized CNI support to preserve network connections during pod movement. Cast AI uses a forked version of the AWS VPC CNI that enables:

  • IP address preservation across nodes
  • TCP session continuity during migration

The CNI fork is automatically installed on Cast AI-managed, live migration-enabled nodes and requires no additional configuration. It maintains compatibility with standard AWS VPC networking while adding the features needed for live migration.

Supported workload types

Container live migration supports the following Kubernetes workload types:

Workload type      | Support status  | Notes
StatefulSets       | Supported       |
Deployments        | Supported       |
Bare pods          | Supported       |
Jobs               | Supported       |
CronJobs           | Supported       |
Custom controllers | Limited support | Compatibility assessment required. Contact Cast AI.
DaemonSets         | Not supported   | Cannot be migrated by design.

Multi-container support: Pods with multiple containers are fully supported, and all containers within a pod are migrated together. The exception is init containers, which are skipped during migration by default and will not be rerun on the restored pod.

Container requirements

TTY restrictions: Containers with TTY enabled (tty: true) cannot be migrated.
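
To find pods that would be blocked by this restriction, you can scan pod specs for the tty field. A minimal sketch, assuming jq is installed:

# List pods that have at least one container with tty: true
kubectl get pods -A -o json \
  | jq -r '.items[] | select([.spec.containers[].tty] | any) | "\(.metadata.namespace)/\(.metadata.name)"'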

Storage compatibility

Supported storage types

Storage type                    | Support scope
Persistent Volume Claims (PVCs) | Same availability zone only
Network File System (NFS)       | Cross-node
Amazon EBS volumes              | Same availability zone only
EmptyDir volumes                | Supported
ConfigMap volumes               | Supported
Secret volumes                  | Supported
Host path volumes               | Experimental support only. Contact Cast AI for specific use cases.

Host path volumes: These volumes present unique challenges because they access node-local storage, so only limited support exists for some use cases.
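
Because PVC and EBS support is limited to a single availability zone, it can help to confirm which zone a volume is pinned to before relying on migration. A hedged example using standard kubectl output (<pv-name> is a placeholder):

# Show the node affinity (including the availability zone) of a PersistentVolume
kubectl describe pv <pv-name> | grep -A 6 "Node Affinity"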

Current limitations

Hardware and performance constraints

GPU workloads: Not currently supported due to the complexity of checkpoint and restore for GPU memory and compute.

To maximize the efficiency of GPU workloads on a given node, Cast AI offers alternative approaches in the form of MIG and GPU time-sharing.

Memory transfer performance: Migration time increases with workload memory usage, so workloads with large memory footprints naturally take longer to migrate.

Automatic workload assessment

Cast AI automatically evaluates workloads for migration eligibility:

Automatic labeling: The live controller scans your cluster and applies migration-eligible labels (live.cast.ai/migration-enabled=true) to workloads that meet all requirements.

Continuous assessment: As workloads change (storage additions, security context modifications, etc.), the controller updates eligibility labels accordingly.

Checking workload eligibility

The simplest way to check if workloads are eligible for live migration is to look for the migration label:

# Check all pods with live migration enabled
kubectl get pods -A -l live.cast.ai/migration-enabled=true

Note that this only works once container live migration is enabled and the live controller is running in the cluster.
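
Conversely, you can list pods that have not (or not yet) been labeled as eligible, which is useful when investigating why a workload is excluded:

# List pods that are not currently labeled as migration-eligible
kubectl get pods -A -l 'live.cast.ai/migration-enabled!=true'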