Requirements and limitations
Before enabling container live migration in your cluster, ensure your infrastructure meets the specific requirements for successful pod migration between nodes. This capability requires particular node configurations, compatible instance types, and specific Kubernetes versions to function reliably.
System requirements
Cloud platform support
| Platform | Support status | Details |
|---|---|---|
| Amazon EKS | Full support | Includes TCP session preservation and all migration features. |
| Google GKE | Not supported | Planned for future releases. |
| Microsoft AKS | Not supported | |
Kubernetes version
Container live migration requires Kubernetes 1.30 or later. This minimum version ensures compatibility with the container runtime enhancements and custom resource definitions that enable live migration functionality.
Earlier Kubernetes versions lack the necessary API stability and container runtime features required for reliable checkpoint and restore operations. In Kubernetes 1.30 (2024), the checkpoint/restore support graduated to the Beta phase.
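As a quick sanity check before enabling the feature, you can confirm the control plane version with kubectl:

```bash
# Print the client and server (control plane) versions; the server should report v1.30 or later
kubectl version
```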
Node infrastructure requirements
Cast AI management: Both source and destination nodes must be managed by Cast AI. Live migration cannot occur between Cast AI-managed nodes and nodes managed by other provisioners or cloud provider native node groups.
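As a rough way to see which nodes are managed by Cast AI, you can list nodes with a management label. This sketch assumes the label key provisioner.cast.ai/managed-by; verify the exact key used in your cluster:

```bash
# List nodes together with their Cast AI management label (label key assumed; verify in your cluster)
kubectl get nodes -L provisioner.cast.ai/managed-by
```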
Node image requirements: Nodes require the Amazon Linux 2023 image family (FAMILY_AL2023) configured in the cluster's node configuration.
Container runtime: Nodes must use containerd v2+ as the container runtime engine. Other runtimes are not supported due to specific integration requirements for checkpoint and restore functionality.
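To verify the runtime on existing nodes, you can inspect the container runtime version each node reports in its status:

```bash
# Show the container runtime and version reported by each node (expect containerd 2.x)
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion
```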
Network topology: Source and destination nodes must meet specific networking requirements within the same AWS region:
Standard configuration (ENABLE_PREFIX_DELEGATION=false):
- Single subnet per Node and Pod
- No cross-subnet migrations (validated by checking the node's topology.cast.ai/subnet-id label)
Dedicated subnet for Pods:
- ENIConfig CRD must be configured in the cluster
- ENIConfig is per availability zone only
- No cross-AZ migration (validated by the topology.kubernetes.io/zone label)
- Source node must have the live.cast.ai/custom-network label set
Subnet discovery must not be configured:
- The kubernetes.io/role/cni tag must not be set on any subnets
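To see how your nodes line up against these constraints, you can print the relevant node labels and check your subnets for the CNI role tag. A rough sketch; the AWS CLI step assumes credentials and the correct region are configured:

```bash
# Show the subnet, zone, and custom-network labels used to validate migration topology
kubectl get nodes -L topology.cast.ai/subnet-id -L topology.kubernetes.io/zone -L live.cast.ai/custom-network

# List any subnets tagged for subnet discovery; this should return no subnets
aws ec2 describe-subnets --filters "Name=tag-key,Values=kubernetes.io/role/cni" \
  --query "Subnets[].SubnetId"
```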
Instance type and CPU compatibility
CPU architecture consistency: Consistent CPU architecture is critical for successful live migration. Node templates used for live migration must be configured for a single processor architecture, either AMD64 or ARM64, but not both ("Any"). Attempting to migrate workloads between nodes with different architectures (e.g., from AMD64 to ARM64) will fail. Cloud providers may also provision different CPU models within the same instance type, which can create compatibility issues during migration.
Node template configuration requirements: Instance families in your node templates must belong to the same generation set for migration compatibility:
- Compatible sets: m3 with c3; or c5 with r5, m5, etc.
- Incompatible: c3 with c5, or any mix of instance families from different generations
Configure your node templates to include only instance types from the same generation family via instance constraints. Migration between nodes with different CPU architectures or generation sets will fail.
Architecture support: Container live migration supports both AMD64 and ARM64 architectures, but node templates must be configured for a single architecture. You cannot mix architectures within a live migration-enabled node template.
When configuring your node template:
- Set processor architecture to either AMD64 or ARM64
- Do not select "Any"
All nodes provisioned from the template will use the selected architecture.
There are also functional differences between architectures:
- AMD64: Full incremental memory transfer
- ARM64: Non-iterative memory transfer
On ARM64, the architectural difference means memory must be dumped in a single, non-iterative pass during the checkpoint process. This limitation cannot be overcome.
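To review what a template has actually provisioned, the standard node labels for architecture and instance type are a quick reference:

```bash
# List nodes with their CPU architecture and instance type to spot mixed architectures or generations
kubectl get nodes -L kubernetes.io/arch -L node.kubernetes.io/instance-type
```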
Container Network Interface (CNI)
Live migration requires specialized CNI support to preserve network connections during pod movement. Cast AI uses a forked version of the AWS VPC CNI that enables:
- IP address preservation across nodes
- TCP session continuity during migration
This CNI fork is automatically installed on Cast AI-managed, live migration-enabled nodes and requires no additional configuration. It maintains compatibility with standard AWS VPC networking while adding the features necessary for live migration.
Supported workload types
Container live migration supports the following Kubernetes workload types:
| Workload type | Support status | Notes |
|---|---|---|
| StatefulSets | Supported | |
| Deployments | Supported | |
| Bare pods | Supported | |
| Jobs | Supported | |
| CronJobs | Supported | |
| Custom controllers | Limited support | Compatibility assessment required. Contact Cast AI. |
| DaemonSets | Not supported | Cannot be migrated by design. |
Multi-container support: Pods with multiple containers are fully supported, and all containers within a pod are migrated together. The exception is init containers: they are skipped during migration by default and will not be rerun on the restored pod.
Container requirements
TTY restrictions: Containers with TTY enabled (tty: true) cannot be migrated.
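To find pods that would be blocked by this restriction, you can filter on the tty field of the container specs. A minimal sketch using jq:

```bash
# List pods that have at least one container with tty: true (requires jq)
kubectl get pods -A -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .tty == true)) | "\(.metadata.namespace)/\(.metadata.name)"'
```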
Storage compatibility
Supported storage types
| Storage type | Support scope |
|---|---|
| Persistent Volume Claims (PVCs) | Same availability zone only |
| Network File System (NFS) | Cross-node |
| Amazon EBS volumes | Same availability zone only |
| EmptyDir volumes | Supported |
| ConfigMap volumes | Supported |
| Secret volumes | Supported |
| Host path volumes | Experimental support only. Contact Cast AI for specific use cases. |
Host path volumes: Host path volumes present unique challenges because they access node-local storage; support is limited to specific use cases.
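If you want to spot workloads that rely on host path volumes before planning migrations, a similar jq filter works:

```bash
# List pods that mount hostPath volumes (requires jq)
kubectl get pods -A -o json | \
  jq -r '.items[] | select(any(.spec.volumes[]?; has("hostPath"))) | "\(.metadata.namespace)/\(.metadata.name)"'
```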
Current limitations
Hardware and performance constraints
GPU workloads: Not currently supported due to the complexity of checkpoint and restore for GPU memory and compute.
To maximize GPU utilization on a given node, Cast AI offers alternative approaches such as MIG (Multi-Instance GPU) partitioning and GPU time-sharing.
Memory transfer performance: Migration time increases with workload memory usage; workloads with large memory footprints naturally take longer to migrate.
Automatic workload assessment
Cast AI automatically evaluates workloads for migration eligibility:
Automatic labeling: The live controller scans your cluster and applies migration-eligible labels (live.cast.ai/migration-enabled=true) to workloads that meet all requirements.
Continuous assessment: As workloads change (storage additions, security context modifications, etc.), the controller updates eligibility labels accordingly.
Checking workload eligibility
The simplest way to check if workloads are eligible for live migration is to look for the migration label:
```bash
# Check all pods with live migration enabled
kubectl get pods -A -l live.cast.ai/migration-enabled=true
```

Note that this will only work once container live migration is enabled and the live controller is operating in the cluster.