Overview
> **Early Access Feature**
> This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
What is OMNI?
OMNI extends your Kubernetes cluster to additional regions and cloud providers, enabling you to provision nodes beyond your cluster's primary region. With OMNI, Cast AI's Autoscaler can automatically select the most cost-effective location—whether in your main cluster region or configured edge locations—based on real-time pricing and instance availability.
Key benefits
- Expanded GPU access: Find GPU capacity across multiple regions and clouds when your main cluster region has limited availability
- Cost optimization across regions: Autoscaler compares Spot and On-Demand prices across your main cluster region and edge locations, provisioning nodes where costs are lowest
- Cross-region: Extend EKS clusters to additional AWS regions or GKE clusters to additional GCP regions
- Cross-cloud: Extend EKS clusters to GCP regions or GKE clusters to AWS regions
- Unified management: Edge nodes appear as standard nodes in your cluster and are managed through familiar Cast AI workflows
Supported configurations
Cluster requirements:
- Phase 2 cluster (Automation enabled)
- EKS, GKE, or AKS
Edge locations:
- Any AWS region (can be added to EKS, GKE, or AKS clusters)
- Any GCP region (can be added to EKS, GKE, or AKS clusters)
- Any OCI region (can be added to EKS, GKE, or AKS clusters)
Instance types:
- Spot instances
- On-Demand instances
GPU support:
- Automatic driver installation for GPU instances
- Support for NVIDIA GPUs across AWS, GCP, and OCI (see the example request below)
- Future support planned for additional cloud providers, extending the pool of available GPUs
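Requesting a GPU on an edge node works the same way as on any Kubernetes node. Below is a minimal sketch, assuming the standard NVIDIA device-plugin resource name `nvidia.com/gpu`; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test           # hypothetical example name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
      command: ["nvidia-smi"]    # prints GPU info if the driver is present
      resources:
        limits:
          nvidia.com/gpu: 1      # standard NVIDIA device-plugin resource name
```

Note that scheduling onto edge nodes additionally requires the toleration described under Limitations below.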
How OMNI works
OMNI introduces two core concepts:
Edge location: A cloud region where edge nodes can be provisioned. Each edge location contains the networking configuration and credentials needed to create nodes in that region. Edge locations are cluster-specific.
Edge node: A Kubernetes node running outside your cluster's control plane region. Edge nodes appear as virtual nodes in your cluster and can run workloads just like standard nodes.
This is a high-level overview of the entire OMNI flow from onboarding to provisioning:
1. You onboard your Phase 2 cluster to OMNI by deploying the OMNI components
2. You create and configure edge locations in the regions where you want to provision nodes
3. You specify the edge locations in your node templates
4. Autoscaler provisions nodes in the most cost-effective location based on your node template constraints, comparing prices across the main cluster region and the selected edge locations
5. Workloads can run on edge nodes in any configured edge location
Architecture overview
The diagram below shows how edge nodes in different cloud regions appear as virtual nodes in your main cluster:
```mermaid
graph LR
    subgraph MainCluster["Main Cluster"]
        pn1[Physical Node]
        pn2[Physical Node]
        pn3[Physical Node]
        vn1[Virtual Node]
        vn2[Virtual Node]
        vn3[Virtual Node]
    end
    subgraph GCPEdge["GCP Edge Location"]
        gcp_node1[Edge Node]
        gcp_node2[Edge Node]
        gcp_node3[Edge Node]
    end
    subgraph AWSEdge["AWS Edge Location"]
        aws_node1[Edge Node]
    end
    vn1 --> gcp_node1
    vn1 --> gcp_node2
    vn2 --> gcp_node3
    vn3 --> aws_node1
```
Technical implementation
OMNI is powered by Liqo, an open-source project that enables Kubernetes multi-cluster topologies. Liqo establishes peering connections between your main cluster and edge nodes, allowing them to appear as virtual nodes. You'll see Liqo components deployed in the `castai-omni` namespace and Liqo-related labels on edge nodes (such as `liqo.io/type=virtual-node`).
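For example, you can inspect both with standard kubectl commands, using the namespace and label mentioned above:

```shell
# Liqo components deployed by OMNI
kubectl get pods -n castai-omni

# Edge nodes appear with the Liqo virtual-node label
kubectl get nodes -l liqo.io/type=virtual-node
```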
Connecting to workloads on edge nodes
To make a workload accessible from other workloads, you can expose it using a Kubernetes Service, as you would in any setup.
Using Services, you can establish connectivity between workloads that communicate:
- From the Main Cluster to an Edge Cluster
- From an Edge Cluster to the Main Cluster
- From an Edge Cluster to another Edge Cluster
In short, every workload can connect to the others just as it would in a cluster made up only of traditional (non-edge) nodes.
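As an illustration, a plain ClusterIP Service is enough; the names below are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app              # hypothetical Service name
spec:
  selector:
    app: my-app             # matches your workload's pods, on main or edge nodes
  ports:
    - port: 80              # port exposed to other workloads
      targetPort: 8080      # port your container listens on
```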
Connection limitations
Currently, direct connectivity between edge locations is not supported. If a workload on one edge location needs to communicate with a workload on a different edge location, the traffic is always forwarded through the main cluster. In multi-cloud scenarios where edge locations are on a different cloud provider than the main cluster, this can lead to increased cloud networking costs.
Prerequisites
Before using OMNI, ensure you have:
- A Phase 2 EKS, GKE, or AKS cluster
- Required tools installed: `kubectl`, a cloud CLI (`aws` or `gcloud`), `curl`, and `jq` (a quick check is shown below)
- Appropriate cloud provider permissions for the regions where you want to create edge locations
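A quick way to confirm the required tools are installed:

```shell
kubectl version --client
aws --version       # or: gcloud version, depending on your cloud provider
curl --version
jq --version
```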
Required cloud permissions
OMNI needs permissions to create and manage resources in each edge location. The onboarding script automates most of this setup, but your cloud account must have permissions to:
For AWS edge locations:
- Create VPCs, subnets, internet gateways, and route tables
- Create and manage security groups
- Create IAM users and policies
- Launch EC2 instances
AWS EC2 permissions
The following EC2 permissions are required (a sample policy sketch follows the permission lists below):
* `ec2:DescribeAccountAttributes`
* `ec2:DescribeAddresses`
* `ec2:DescribeAvailabilityZones`
* `ec2:DescribeImages`
* `ec2:DescribeInstanceStatus`
* `ec2:DescribeInstances`
* `ec2:DescribeKeyPairs`
* `ec2:DescribeNatGateways`
* `ec2:DescribeNetworkAcls`
* `ec2:DescribeNetworkInterfaces`
* `ec2:DescribeRegions`
* `ec2:DescribeRouteTables`
* `ec2:DescribeSecurityGroups`
* `ec2:DescribeSubnets`
* `ec2:DescribeTags`
* `ec2:DescribeVolumes`
* `ec2:DescribeVpcAttribute`
* `ec2:DescribeVpcClassicLink`
* `ec2:DescribeVpcEndpoints`
* `ec2:DescribeVpcPeeringConnections`
* `ec2:DescribeVpcs`
* `ec2:CreateVpc`
* `ec2:DeleteVpc`
* `ec2:ModifyVpcAttribute`
* `ec2:CreateSubnet`
* `ec2:DeleteSubnet`
* `ec2:ModifySubnetAttribute`
* `ec2:CreateRouteTable`
* `ec2:DeleteRouteTable`
* `ec2:AssociateRouteTable`
* `ec2:DisassociateRouteTable`
* `ec2:CreateRoute`
* `ec2:DeleteRoute`
* `ec2:ReplaceRoute`
* `ec2:CreateInternetGateway`
* `ec2:DeleteInternetGateway`
* `ec2:AttachInternetGateway`
* `ec2:DetachInternetGateway`
* `ec2:CreateNatGateway`
* `ec2:DeleteNatGateway`
* `ec2:CreateEgressOnlyInternetGateway`
* `ec2:DeleteEgressOnlyInternetGateway`
* `ec2:AllocateAddress`
* `ec2:ReleaseAddress`
* `ec2:AssociateAddress`
* `ec2:DisassociateAddress`

For GCP edge locations:
- Create networks, subnets, and firewall rules
- Create service accounts, keys, and role bindings
- Launch compute instances
GCP IAM Roles
The following IAM roles are required:
* `roles/compute.instanceAdmin.v1`
* `roles/iam.serviceAccountUser`

The edge location onboarding script will create the necessary cloud resources and configure appropriate permissions. You only need to ensure your cloud account has sufficient privileges to do so.
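For reference, the AWS EC2 actions listed above could be granted through a customer-managed IAM policy along these lines. This is a sketch, not the exact policy the onboarding script configures; `ec2:Describe*` stands in for the individual Describe actions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OmniEdgeNetworking",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ec2:CreateVpc", "ec2:DeleteVpc", "ec2:ModifyVpcAttribute",
        "ec2:CreateSubnet", "ec2:DeleteSubnet", "ec2:ModifySubnetAttribute",
        "ec2:CreateRouteTable", "ec2:DeleteRouteTable",
        "ec2:AssociateRouteTable", "ec2:DisassociateRouteTable",
        "ec2:CreateRoute", "ec2:DeleteRoute", "ec2:ReplaceRoute",
        "ec2:CreateInternetGateway", "ec2:DeleteInternetGateway",
        "ec2:AttachInternetGateway", "ec2:DetachInternetGateway",
        "ec2:CreateNatGateway", "ec2:DeleteNatGateway",
        "ec2:CreateEgressOnlyInternetGateway", "ec2:DeleteEgressOnlyInternetGateway",
        "ec2:AllocateAddress", "ec2:ReleaseAddress",
        "ec2:AssociateAddress", "ec2:DisassociateAddress"
      ],
      "Resource": "*"
    }
  ]
}
```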
Limitations
- Architecture constraint: Only the `x86_64` architecture is supported for edge nodes. If edge locations are selected in a node template, the architecture must be set to `x86_64`.
- No hibernation support: Clusters with OMNI enabled cannot use Cast AI's hibernation feature.
- No Rebalancer support: OMNI edge nodes cannot be rebalanced at this time; support is planned for a future release.
- Evictor support: The Evictor can evict workloads from edge nodes to the main cluster when capacity is available, and can pack workloads across multiple edge nodes. However, it will not place workloads on edge nodes unless they have the required `virtual-node.omni.cast.ai/not-allowed=true:NoExecute` toleration (see the example after this list).
- Node Configuration is not supported: Node Configurations for OMNI (edge) nodes are created automatically for each edge location and cannot be edited. They are not visible in the Cast AI console and cannot be interacted with in any way.
- Persistent volume limitations: Workloads that depend on persistent volumes (PVs) cannot be offloaded to edge nodes.
- Limited observability: Kvisor is not deployed on edge nodes, so security and netflow monitoring are not available for edge workloads.
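To make a workload schedulable onto edge nodes, add the toleration mentioned above to its pod spec. A minimal fragment:

```yaml
# Pod spec fragment: tolerates the taint placed on OMNI virtual (edge) nodes
tolerations:
  - key: virtual-node.omni.cast.ai/not-allowed
    operator: Equal
    value: "true"
    effect: NoExecute
```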
Terminology
| Term | Definition |
|---|---|
| Main cluster | Your Phase 2 EKS, GKE, or AKS cluster |
| Edge cluster | A lightweight K3s cluster running in an edge location that hosts edge nodes |
| Cluster region | The cloud region where your cluster control plane is deployed |
| Edge node | A Kubernetes node provisioned outside the main cluster's control plane region |
| Edge location | A cloud region configured for edge node provisioning, containing networking setup and credentials |
| Edge configuration | An Edge Node Configuration specific to an edge location (auto-created, not visible in the Console) |
What's next
Ready to get started? See Getting started with OMNI for step-by-step setup instructions.
