Getting started

📣

Early Access Feature

This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.

This guide walks you through setting up OMNI for your cluster, from initial onboarding through creating your first edge location, configuring node templates, and provisioning compute capacity at the edge.

Before you begin

Ensure you meet the prerequisites listed in the OMNI Overview.

You'll need:

  • A Phase 2 EKS, GKE, or AKS cluster
  • kubectl configured for your cluster
  • Cloud CLI authenticated (aws or gcloud)
  • curl and jq installed
  • Sufficient cloud permissions to create networking resources and compute instances
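
You can quickly sanity-check the command-line prerequisites from your terminal; for example:

# Confirm kubectl can reach your cluster
kubectl cluster-info

# Confirm your cloud CLI is authenticated (use the one matching your provider)
aws sts get-caller-identity   # AWS
gcloud auth list              # GCP

# Confirm curl and jq are installed
curl --version && jq --version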

Step 1: Onboard your cluster to OMNI

Onboarding deploys the OMNI components to your cluster. You can onboard in two ways.

Option A: During Phase 2 onboarding (recommended for new clusters)

If you're onboarding a cluster from Phase 1 (Read-only) to Phase 2 (Automation), you can enable OMNI at the same time.

  1. In the Cast AI console, navigate to your cluster

  2. Follow the standard Phase 2 onboarding flow

  3. Select Extend cluster to other regions and cloud providers under Advanced settings

The script will be updated to include INSTALL_OMNI=true.

  4. Copy and run the script in your terminal

  5. Wait for the script to complete (typically 1-2 minutes)

Option B: Enable OMNI on an existing Phase 2 cluster

If you already have a Phase 2 cluster with cluster optimization enabled:

  1. Navigate to your cluster in the Cluster list
  2. Click on the ellipsis and choose Cast AI features
  3. Under Other features, check the box for Extend cluster to other regions and cloud providers

  4. Copy the updated script and run it in your terminal

  5. Wait for the script to complete (typically 1-2 minutes)

Verify onboarding (optional)

After the script completes, verify OMNI is enabled:

kubectl get pods -n castai-omni

You should see OMNI components running, including:

  • liqo-* pods (controller manager, CRD replicator, fabric, IPAM, proxy, webhook)
  • omni-agent pod

Example output:

NAME                                       READY   STATUS    RESTARTS   AGE
liqo-controller-manager-7cf59bcc64-xxxxx   1/1     Running   0          2m
liqo-crd-replicator-687bdc6f66-xxxxx       1/1     Running   0          2m
liqo-fabric-xxxxx                          1/1     Running   0          2m
liqo-ipam-8667dbccbb-xxxxx                 1/1     Running   0          2m
liqo-metric-agent-55cd8748c5-xxxxx         1/1     Running   0          2m
liqo-proxy-77c66dfb88-xxxxx                1/1     Running   0          2m
liqo-webhook-6f648484cc-xxxxx              1/1     Running   0          2m
omni-agent-595c4b97d9-xxxxx                1/1     Running   0          2m

All pods should be in Running status.
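
If some pods are still starting, you can wait for them to become ready before proceeding:

# Wait up to 5 minutes for all OMNI pods to become ready
kubectl wait --for=condition=Ready pods --all -n castai-omni --timeout=300s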

Step 2: Create and onboard an edge location

Edge locations define the regions where edge nodes can be provisioned. Each edge location is cluster-specific and requires its own setup.

  1. In the Cast AI console, navigate to Autoscaler → Edge locations
  2. Click Create edge location to open the creation and configuration drawer
  3. Configure the edge location:
    • Name: A descriptive name (e.g., aws-us-west-2 or gcp-europe-west4)
    • Cloud provider: Select AWS, GCP, or OCI
    • Region: Select the target region
    📘

    GCP

    For GCP, providing the Project ID is also required.

  4. Click Next
  5. Copy and run the provided script in your terminal to establish the connection with the edge location
📘

Note

Before running the script, ensure your cloud CLI is authenticated and configured for the correct account and region.

AWS

Set your AWS profile to ensure the script creates resources in the correct AWS account:

export AWS_PROFILE=<your-aws-profile>
# Then run the provided onboarding script

If you're already using your default AWS credentials, you can skip setting the profile.
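
You can confirm which AWS account and principal are active before running the script:

# Verify the identity the onboarding script will use
aws sts get-caller-identity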

GCP

Set your active GCP project to ensure the script creates resources in the correct project:

gcloud config set project <your-project-id>
# Then run the provided onboarding script

This should match the Project ID you provided when creating the edge location.
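
You can confirm the active project before running the script:

# Verify the active GCP project matches the edge location's Project ID
gcloud config get-value project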

The script will:

  • Create a VPC/network and subnet (if needed)
  • Configure firewall rules and security groups
  • Create service accounts or IAM users with appropriate permissions
  • Register the edge location with Cast AI

Wait for the script to complete (typically 2-3 minutes).

After successful completion, the edge location appears in the Edge locations list with an Incomplete setup status, and a notification confirms creation.

A newly created edge location showing Incomplete setup status

📘

Why "Incomplete setup"?

An edge location shows Incomplete setup until it's added to at least one node template. This is expected behavior.

📘

Skipping the script

If you skip running the script, the edge location is saved in a Pending state. You can return to complete this step later by accessing the edge location from the list.

Create additional edge locations (optional)

You can create multiple edge locations for the same cluster. Repeat the process above for each region where you want to provision edge nodes.

Step 3: Configure node templates for edge locations

Node templates control where the Autoscaler can provision nodes. To enable edge node provisioning, add edge locations to your node templates.

  1. Navigate to Autoscaler → Node templates
  2. Select an existing node template or create a new one
  3. In the node template editor, find the Edge locations section and check the box to Enable provisioning in edge locations
  4. Select one or more edge locations from the dropdown
  5. Click Save

When edge locations are selected:

  • The Instance constraints section is updated to account for inventory from all selected edge locations
  • The Available instances list includes instances from the main cluster region and all selected edge locations
  • Autoscaler can now provision nodes in any of these locations based on cost and availability

Instance availability comparison: the Available instances list before and after adding edge locations

After saving, the edge location status changes from Incomplete setup to In use.

Your cluster is now configured for edge node provisioning. The Autoscaler will automatically provision edge nodes as needed.
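
As a quick end-to-end check, you can deploy a small stateless workload to a namespace enabled for edge offloading (namespace labeling is covered in detail below). The namespace and deployment names here are illustrative:

# Create and label a test namespace (do not use the default namespace)
kubectl create namespace edge-demo
kubectl label namespace edge-demo omni.cast.ai/enable-scheduling=true

# Deploy a simple stateless workload; a webhook injects the edge toleration
kubectl create deployment edge-demo --image=nginx --replicas=3 -n edge-demo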

Edge node provisioning

Once configured, edge nodes are provisioned automatically by the Autoscaler based on:

  • Cost optimization: Autoscaler compares Spot and On-Demand prices across the main cluster region and all edge locations configured in the node template
  • Instance availability: Considers the instances available in each region, including edge locations
  • Node template constraints: Respects the CPU, memory, architecture, and other constraints defined in the node template

How edge nodes appear in your cluster

Cast AI Console

In the Cast AI Console, edge nodes are identified in the Nodes list by an additional External region label:

Using kubectl

Edge nodes appear as virtual nodes in your cluster:

kubectl get nodes

Example output:

NAME                                              STATUS   ROLES    AGE     VERSION
ip-192-168-56-192.eu-central-1.compute.internal   Ready    <none>   6h2m    v1.30.14-eks-113cf36
cast-7f6821f2-b9fd-47e0-ab38-1f80c9c32dc0         Ready    agent    6m20s   v1.30.14-eks-b707fbb
# The second node is an edge node, identified by the agent role
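
To also see each node's region at a glance, you can add the region label as an output column:

# -L adds the value of the given label as a column
kubectl get nodes -L topology.kubernetes.io/region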

Edge nodes can be identified by several characteristics:

Node labels:

  • liqo.io/type=virtual-node: Identifies the node as a Liqo virtual node
  • kubernetes.io/role=agent: Role designation for edge nodes
  • omni.cast.ai/edge-location-name: Name of the edge location
  • omni.cast.ai/edge-id: Unique edge identifier
  • omni.cast.ai/csp: Cloud provider of the edge (e.g., gcp, aws)
  • topology.kubernetes.io/region: Region where the edge node is located

Node taints:

  • virtual-node.omni.cast.ai/not-allowed=true:NoExecute: Applied to all edge nodes by default

ProviderID: Edge nodes have a special provider ID format:

castai-omni://<identifier-string>

You can inspect an edge node to see all these identifiers:

kubectl describe node <node-name>
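
You can also use these identifiers to filter or format the node list; for example:

# List only edge (virtual) nodes
kubectl get nodes -l liqo.io/type=virtual-node

# Show each node's provider ID; edge nodes use the castai-omni:// prefix
kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDER-ID:.spec.providerID'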

Scheduling workloads on edge nodes

To enable workloads to run on edge nodes, label the namespace to allow offloading:

kubectl label ns <namespace-name> omni.cast.ai/enable-scheduling=true

When you deploy workloads to a labeled namespace, a mutating webhook automatically adds the required toleration to your pods, allowing them to be scheduled on edge nodes.

This label enables Liqo's offloading mechanism for the namespace.

⚠️

Warning

Do not offload the default namespace. The default namespace exists in both the main cluster and edge clusters, and offloading it can cause unexpected behavior.
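
To confirm the webhook injected the toleration, you can inspect a pod deployed to the labeled namespace:

# Print the tolerations of a pod in the labeled namespace
kubectl get pod <pod-name> -n <namespace-name> -o jsonpath='{.spec.tolerations}'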

Manual toleration (optional)

While the toleration is added automatically for pods in labeled namespaces, you can also add it manually to your pod specs if needed:

tolerations:
- key: "virtual-node.omni.cast.ai/not-allowed"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
📘

Custom taints

If your node template has additional custom taints beyond the default edge taint, you must manually add the corresponding tolerations to your pod specs. Only the default virtual-node.omni.cast.ai/not-allowed toleration is added automatically.
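
For example, a pod spec that combines the default edge toleration with a toleration for a custom taint might look like the sketch below. The example.com/gpu-pool taint is hypothetical; substitute the taints defined in your node template:

apiVersion: v1
kind: Pod
metadata:
  name: edge-workload   # illustrative name
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  # Default edge taint (added automatically in labeled namespaces)
  - key: "virtual-node.omni.cast.ai/not-allowed"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
  # Hypothetical custom taint from the node template
  - key: "example.com/gpu-pool"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"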

Workload compatibility

Not all workloads are suitable for running on edge nodes. Consider the following when deciding which workloads to offload:

Requirements (hard constraints):

  • Linux x86_64 architecture only (ARM-based workloads are not supported)
  • Stateless workloads or workloads that don't depend on persistent volumes (PVs cannot be offloaded to edge nodes)

Recommendations:

  • Workloads that can tolerate some additional network latency (cross-region or cross-cloud communication adds latency)
  • Workloads with minimal to no dependencies on other in-cluster services

Given the above, prime candidates for offloading include ML training jobs, workloads that benefit more from GPU availability than from low latency, and workloads where cost savings from cheaper GPU or compute instances justify the operational trade-offs.

Evictor behavior with edge nodes

The Evictor works with edge nodes but respects the edge node toleration requirement:

What Evictor can do:

  • Evict workloads from edge nodes back to nodes in the main cluster when capacity is available
  • Pack workloads across multiple edge nodes to optimize resource utilization
  • Consider edge nodes in its bin-packing decisions

What Evictor cannot do:

  • Place workloads on edge nodes unless they explicitly tolerate virtual-node.omni.cast.ai/not-allowed=true:NoExecute

This means:

  • Workloads without the edge toleration will never be moved to edge nodes by Evictor
  • Workloads with the edge toleration can be evicted from the main cluster to edges (and vice versa)
  • You maintain control over which workloads can run on edge nodes through tolerations
📘

Note

Only add the edge node toleration to workloads that are compatible with running in different regions or clouds. Consider all requirements when deciding which workloads to allow on edge nodes.

Edge node provisioning time

Edge nodes typically take the same amount of time to become ready as nodes in your main cluster region would.

For GPU instances, provisioning may take slightly longer due to driver installation.