# Autoscaler Node Labels and Taints

This document lists all labels and taints that Cast AI automatically applies to nodes it manages.

> 🚧 **Important: Reserved Labels and Taints**
>
> Cast AI reserves the labels and taints listed in this document. If you configure any of them in a Node Template, Cast AI will overwrite your values with the ones it determines from the node's instance type, cloud provider, and feature configuration.
>
> Do not configure these reserved labels or taints in Node Templates. Doing so will not produce the expected result and may prevent pods from being scheduled correctly.

## Labels

### Node Identity & Management

| Label | Value | When Applied |
| --- | --- | --- |
| `provisioner.cast.ai/node-id` | UUID | Set at node creation. Unique identifier for the node assigned by Cast AI. |
| `provisioner.cast.ai/managed-by` | `cast.ai` | Set at node creation. Marks the node as managed by Cast AI. Used by controllers to filter Cast AI-owned nodes. |
| `provisioner.cast.ai/node-configuration-name` | config name | Set at node creation. References the node configuration name used to provision the node. |
| `provisioner.cast.ai/node-configuration-id` | config UUID | Set at node creation. References the node configuration ID used to provision the node. |
| `provisioner.cast.ai/hyper-threading-disabled` | `true` | Set when the node configuration disables hyper-threading. Used to distinguish nodes where HT has been turned off at the OS level. |
| `charts.cast.ai/managed` | varies | Set on nodes managed via Helm chart deployments. |

### Lifecycle / Instance Type

| Label | Value | When Applied |
| --- | --- | --- |
| `scheduling.cast.ai/spot` | `true` | Set when the node is a spot/preemptible instance. Used for spot-aware scheduling. |
| `scheduling.cast.ai/on-demand` | `true` | Set when the node is an on-demand instance. Mutually exclusive with `scheduling.cast.ai/spot`. |
| `scheduling.cast.ai/spot-fallback` | `true` | Set when the node is a spot-fallback instance (on-demand used as fallback for a spot configuration). |
| `scheduling.cast.ai/spot-reliability` | score (0–100) | Reliability score of the spot instance type, used by the scheduler to prefer more stable spot instances. |
| `scheduling.cast.ai/interrupted` | `true` | Set when a spot node receives an interruption notice. Signals the node is being drained due to spot preemption. |

For details on configuring workloads for spot instances, including fallback and diversity settings, see Spot Instances.
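As an illustration, a workload can target spot capacity with a node selector on the `scheduling.cast.ai/spot` label and, where the lifecycle taint is enabled, a matching toleration. A minimal sketch (the Deployment name and image are placeholders):

```yaml
# Sketch: schedule a workload onto Cast AI spot nodes.
# The name and image are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-tolerant-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spot-tolerant-app
  template:
    metadata:
      labels:
        app: spot-tolerant-app
    spec:
      nodeSelector:
        scheduling.cast.ai/spot: "true"   # request a spot node
      tolerations:
        - key: scheduling.cast.ai/spot    # needed only when the lifecycle taint is enabled
          operator: Exists
          effect: NoSchedule
      containers:
        - name: app
          image: nginx:1.27               # placeholder image
```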

### Node Templates

| Label | Value | When Applied |
| --- | --- | --- |
| `scheduling.cast.ai/node-template` | template name | Set when the node is provisioned using a specific node template. |
| `scheduling.cast.ai/node-template-version` | version string | Set when a node template is applied or updated. Tracks the version of the node template configuration. |

For information on creating and configuring node templates, including custom labels and workload targeting, see Node Templates.
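For example, a pod can be pinned to nodes created from a particular template by combining the label and the corresponding taint's toleration. A sketch, where `my-template` is a placeholder template name:

```yaml
# Sketch: pin a pod to nodes provisioned from a specific node template.
# "my-template" and the image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: template-pinned-pod
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: my-template
  tolerations:
    - key: scheduling.cast.ai/node-template
      operator: Equal
      value: my-template                  # the taint value is the template name
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:1.27                   # placeholder image
```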

### Compute / Storage Optimization

| Label | Value | When Applied |
| --- | --- | --- |
| `scheduling.cast.ai/compute-optimized` | `true` | Set when the node is compute-optimized (high CPU-to-memory ratio). Used to schedule compute-intensive workloads. |
| `scheduling.cast.ai/storage-optimized` | `true` | Set when the node is storage-optimized (high local storage). |
| `scheduling.cast.ai/cpu-manufacturer` | e.g. `Intel`, `AMD` | Set based on the instance type CPU vendor. Enables CPU-manufacturer-aware scheduling. |
| `scheduling.cast.ai/premium-storage` | `true` | Set on AKS nodes that support premium storage (`Premium_LRS`, `PremiumV2_LRS`, `Premium_ZRS`). |

### GPU / Accelerators

| Label | Value | When Applied |
| --- | --- | --- |
| `nvidia.com/gpu` | `true` | Set by the autoscaler during GPU node provisioning. Applied when the instance type has an NVIDIA GPU device. |
| `nvidia.com/gpu.present` | `true` | Set by the autoscaler alongside `nvidia.com/gpu` during GPU node provisioning. Acts as an alternative indicator of GPU presence. |
| `nvidia.com/gpu.name` | GPU model | GPU model name (e.g. `A100`). Set by the autoscaler based on the instance type's GPU device metadata. |
| `nvidia.com/gpu.count` | integer | Number of physical GPUs on the node. Set by the autoscaler based on instance type GPU count. |
| `nvidia.com/gpu.memory` | MiB | Memory per single GPU in MiB. Set by the autoscaler based on instance type GPU memory. |
| `nvidia.com/gpu.total-memory` | MiB | Total GPU memory across all GPUs on the node. |
| `nvidia.com/gpu.mig` | `true` | Set when MIG (Multi-Instance GPU) partitioning is enabled. |
| `nvidia.com/gpu.mig-partition-{size}` | `true` | Set per MIG partition size (e.g. `nvidia.com/gpu.mig-partition-1g.10gb`). Indicates available MIG slice profiles. |
| `nvidia.com/gpu.dra` | `true` | Set when the node uses the NVIDIA DRA (Dynamic Resource Allocation) driver instead of the NVIDIA device plugin. |
| `scheduling.cast.ai/gpu.count` | integer | Cast AI internal GPU count used during scheduling simulation. Set alongside NVIDIA labels. |
| `scheduling.cast.ai/gpu-shared` | integer | Set when GPU time-sharing or MPS is configured. Value is the maximum number of shared clients per GPU. |
| `scheduling.cast.ai/gpu-sharing-strategy` | `time-sharing` or `mps` | Set alongside `scheduling.cast.ai/gpu-shared` to indicate the GPU sharing strategy configured for the node. |
| `scheduling.cast.ai/nvidia-device-plugin-static-pod` | `true` | Set on AWS AL2023 nodes with GPU sharing enabled. Signals that the NVIDIA device plugin is deployed as a static pod rather than a DaemonSet. |
| `scheduling.cast.ai/gpu-partition-size` | partition size (e.g. `1g.5gb`) | Set on nodes where MIG is configured in single mode. Indicates the active MIG partition size. |
| `scheduling.cast.ai/bottlerocket-gpu-partition-size` | partition size (e.g. `1g.5gb`) | Set on AWS Bottlerocket nodes with MIG configured in single mode. |
| `scheduling.cast.ai/preinstalled-nvidia-driver` | `true` | Set on nodes that come with a pre-installed NVIDIA driver (e.g. Bottlerocket). |
| `aws.amazon.com/neuron` | `true` | Set on AWS instances with Inferentia/Trainium (Neuron) accelerators. Used together with the `aws.amazon.com/neuron` taint. |
| `cloud.google.com/gke-accelerator` | GPU type | Set on GKE GPU nodes by the autoscaler. Indicates the accelerator type (e.g. `nvidia-tesla-t4`). |
| `cloud.google.com/gke-gpu-driver-version` | version | Set on GKE nodes when using the default NVIDIA device plugin. Specifies the GPU driver version to install. |
| `cloud.google.com/gke-gpu-sharing-strategy` | `time-sharing` or `mps` | Set on GKE nodes when GPU sharing is configured. GKE-specific counterpart to `scheduling.cast.ai/gpu-sharing-strategy`. |
| `cloud.google.com/gke-max-shared-clients-per-gpu` | integer | Set on GKE nodes when GPU sharing is configured. GKE-specific counterpart to `scheduling.cast.ai/gpu-shared`. |
| `gke-no-default-nvidia-gpu-device-plugin` | `true` | Set on GKE nodes when Cast AI manages GPU plugin deployment instead of the default GKE NVIDIA DaemonSet. |
| `cloud.google.com/gke-gpu-partition-size` | partition size | Set on GKE nodes with MIG enabled. Indicates the MIG partition profile for the node. |

For GPU provisioning setup, workload configuration examples, and sharing strategies, see GPU Instances. For MIG partitioning, see GPU sharing with MIG. For time-slicing, see GPU sharing with time-slicing.
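To show how these labels and the GPU taint come together, here is a sketch of a pod that requests one NVIDIA GPU and tolerates the `nvidia.com/gpu` taint (the pod name and image are placeholders):

```yaml
# Sketch: a pod that requests one NVIDIA GPU and tolerates the GPU taint.
# The name and image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1               # GPU request drives GPU node provisioning
```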

### Rebalancing

| Label | Value | When Applied |
| --- | --- | --- |
| `rebalancing.cast.ai/plan-id` | UUID | Set on green nodes created as part of a rebalancing plan. Used to associate the node with its rebalancing operation. |
| `rebalancing.cast.ai/operation-id` | UUID | Set on green nodes to identify the specific operation within a rebalancing plan. |
| `scheduling.cast.ai/delete-reason` | plugin name | Set when the autoscaler marks a node for deletion. |
| `autoscaling.cast.ai/draining` | reason string | Set on nodes being drained by the rebalancer or evictor. Possible values: `rebalancing`, `aws-rebalance-recommendation`, `spot-prediction`, `spot-fallback`, `spot-interruption`, `evictor`. Applied alongside the `autoscaling.cast.ai/draining` taint. |

For information on rebalancing operations and how to prepare workloads, see Rebalancing and Workload preparation. To understand how Evictor and Rebalancer work together, see Evictor vs. Rebalancer.

### Removal Control

| Label | Value | When Applied |
| --- | --- | --- |
| `autoscaling.cast.ai/removal-disabled` | `true` | Set on a node or pod to prevent the rebalancer from evicting pods from that node. |
| `autoscaling.cast.ai/removal-disabled-until` | Unix timestamp (seconds) | Set to temporarily prevent removal until the given timestamp. Applied during green node initialization to protect it from premature rebalancing. |
| `autoscaling.cast.ai/live-migration-disabled` | `true` | Pod-level label/annotation. Disables live migration for pods that cannot tolerate it. |
**kubectl: Query removal-protected nodes and pods**

```shell
# List nodes protected from removal
kubectl get nodes -l autoscaling.cast.ai/removal-disabled=true

# List pods protected from removal
kubectl get pods -A -l autoscaling.cast.ai/removal-disabled=true

# List pods opted out of live migration
kubectl get pods -A -l autoscaling.cast.ai/live-migration-disabled=true
```

For details on Evictor override rules and advanced configuration, see Evictor. For Container Live Migration opt-out behavior, see CLM Labels, Annotations, and Events.
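Since `autoscaling.cast.ai/removal-disabled` can be set on a pod as well as a node, a workload can opt itself out of removal directly in its manifest. A sketch (the pod name and image are placeholders):

```yaml
# Sketch: opt a pod out of eviction/rebalancing with the removal-disabled label.
# The name and image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: critical-batch-job
  labels:
    autoscaling.cast.ai/removal-disabled: "true"
spec:
  containers:
    - name: job
      image: busybox:1.36                 # placeholder image
      command: ["sh", "-c", "sleep 3600"]
```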

### Live Migration (CLM / LIVE)

| Label | Value | When Applied |
| --- | --- | --- |
| `live.cast.ai/install` | varies | Set by Cast AI to indicate CLM (Container Live Migration) component installation status on the node. |
| `live.cast.ai/migration-enabled` | `true` | Set by the LIVE DaemonSet once all LIVE components are installed and operational on the node. Indicates the node is eligible as both migration source and destination. |
**kubectl: Check CLM status**

```shell
# List nodes with Container Live Migration enabled
kubectl get nodes -l live.cast.ai/migration-enabled=true

# List pods eligible for live migration
kubectl get pods -A -l live.cast.ai/migration-enabled=true

# Monitor active migrations
kubectl get migrations -A -w
```

For an overview of Container Live Migration, see Container Live Migration. For setup instructions, see Getting started with CLM. For the full CLM labels and annotations reference, see CLM Labels, Annotations, and Events.

### Predictions / ML

| Label | Value | When Applied |
| --- | --- | --- |
| `predictions.cast.ai/ttl-minutes` | integer | Set on nodes that were rebalanced due to an ML-predicted spot interruption. Specifies how many minutes the replacement node should be kept alive after the original was evacuated. |

### Volume Support

| Label | Value | When Applied |
| --- | --- | --- |
| `volume.scheduling.cast.ai/{volume-name}` | `true` | Set per storage volume/class to indicate the node supports that volume type. |

### Topology

| Label | Value | When Applied |
| --- | --- | --- |
| `topology.cast.ai/csp` | `aws`, `gcp`, `azure` | Set at node creation. Identifies the cloud service provider. |
| `topology.cast.ai/subnet-id` | subnet ID | Set at node creation. Identifies the subnet the node was provisioned in. |
| `topology.cast.ai/pod-subnet-id` | subnet ID | Set on AKS nodes. Identifies the subnet used for pod IP allocation. |
| `topology.cast.ai/resource-group` | resource group | Set on AKS nodes. Azure resource group containing the node. |
| `topology.cast.ai/virtual-network` | vnet name | Set on AKS nodes. Azure virtual network name. |
| `topology.cast.ai/subscription-id` | subscription ID | Set on AKS nodes. Azure subscription ID. |
| `topology.disk.csi.azure.com/zone` | AZ name | Set on Azure nodes. Availability zone for Azure CSI disk topology. |
| `topology.ebs.csi.aws.com/zone` | AZ name | Set on AWS nodes. Availability zone for EBS CSI disk topology. |
| `network-tag.gcp.cast.ai/{tag-name}` | `true` | Set on GCP nodes for each network tag associated with the node. Prefix-based, one label per tag. |
| `topology.kubernetes.io/region` | region | Standard Kubernetes label. Set at node creation with the cloud region. |
| `topology.kubernetes.io/zone` | AZ name | Standard Kubernetes label. Set at node creation with the availability zone. |
| `topology.gke.io/zone` | AZ name | GKE-specific zone label. |

For information on configuring pod placement by topology, see Pod placement. For subnet configuration, see Subnets.
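The standard zone label above is what topology spread constraints typically key on. A sketch of spreading replicas evenly across availability zones (the Deployment name and image are placeholders):

```yaml
# Sketch: spread replicas across availability zones using the standard zone label.
# The name and image are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zonal-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zonal-app
  template:
    metadata:
      labels:
        app: zonal-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                      # at most one extra replica per zone
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: zonal-app
      containers:
        - name: app
          image: nginx:1.27               # placeholder image
```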

### Cloud Provider Specific

#### AKS (Azure)

| Label | Value | When Applied |
| --- | --- | --- |
| `kubernetes.azure.com/agentpool` | pool name | Set to `castai` for Cast AI-provisioned nodes. Required by AKS for node pool membership. |
| `kubernetes.azure.com/cluster` | cluster name | Set to identify the AKS cluster. |
| `kubernetes.azure.com/mode` | `system` or `user` | Set to `system` on system node pools. |
| `kubernetes.azure.com/role` | `agent` | Set on all AKS agent nodes. |
| `agentpool` (deprecated) | pool name | Deprecated AKS agent pool label (pre-Kubernetes 1.24). Replaced by `kubernetes.azure.com/agentpool`. |
| `kubernetes.io/role` (deprecated) | role | Deprecated AKS role label (pre-Kubernetes 1.24). |

#### EKS (AWS)

| Label | Value | When Applied |
| --- | --- | --- |
| `eks.amazonaws.com/compute-type` | `fargate` | Set on Fargate nodes. Used to distinguish Fargate from EC2 node types. |

### Standard Kubernetes Labels (set by autoscaler)

| Label | Value | When Applied |
| --- | --- | --- |
| `kubernetes.io/arch` | `amd64`, `arm64` | Set at node creation based on instance architecture. |
| `kubernetes.io/os` | `linux`, `windows` | Set at node creation based on instance OS. |
| `beta.kubernetes.io/arch` | `amd64`, `arm64` | Legacy beta arch label, set alongside `kubernetes.io/arch` for backwards compatibility. |
| `beta.kubernetes.io/os` | `linux`, `windows` | Legacy beta OS label. |
| `node.kubernetes.io/instance-type` | instance type name | Standard label set to the cloud provider instance type (e.g. `m5.xlarge`). |
| `kubernetes.io/hostname` | hostname | Standard hostname label. |

### Resource Offering

| Label | Value | When Applied |
| --- | --- | --- |
| `autoscaling.cast.ai/provisioned-resource-offering` | offering type | Set at node creation to record the resource offering type used for provisioning. |

### OMNI Edge

| Label | Value | When Applied |
| --- | --- | --- |
| `virtual-node.omni.cast.ai/not-allowed` | `true` | Set on nodes that are not allowed to run workloads in the OMNI edge context. |

## Taints

### Rebalancing Taints

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `rebalancing.cast.ai/preparing` | (none) | `NoSchedule` | Applied to green (replacement) nodes during rebalancing plan execution. Prevents new pods from being scheduled until the node is fully prepared. Removed once the node is ready to receive workloads. |
| `scheduling.cast.ai/pod-pinning-preparing` | (none) | `NoSchedule` | Applied during rebalancing when pod pinning is enabled. Prevents scheduling until pod pinning setup is complete. |
| `autoscaling.cast.ai/draining` | `true` | `NoSchedule` | Applied to blue (old) nodes when they are being drained during rebalancing or spot interruption handling. Applied alongside the `autoscaling.cast.ai/draining` label. |
| `provisioner.cast.ai/uninitialized` | (none) | `NoSchedule` | Applied to newly provisioned nodes before initialization is complete. Prevents pod scheduling until the node is fully set up. |

For details on how rebalancing operates and the blue/green node replacement process, see Rebalancing. For paused drain configuration, see Paused drain configuration.

### Lifecycle / Spot Taint

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `scheduling.cast.ai/spot` | (none) | `NoSchedule` | Applied to spot nodes when the lifecycle taint feature is enabled in node template configuration. Requires pods to explicitly tolerate spot instances. Only applied when both spot and on-demand nodes exist in the cluster. |

For workload configuration patterns using spot tolerations and node selectors, see Spot Instances.

### Node Template Taint

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `scheduling.cast.ai/node-template` | template name (e.g. `default-by-castai`) | `NoSchedule` | Applied to all nodes provisioned via a node template. |

### Scoped Autoscaler Taint

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `scheduling.cast.ai/scoped-autoscaler` | (none) | `NoSchedule` | Applied when the scoped autoscaler feature is enabled for a node template. Restricts scheduling to workloads explicitly intended for the scoped autoscaler. |

### Storage Optimization Taint

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `scheduling.cast.ai/storage-optimized` | (none) | `NoSchedule` | Applied to storage-optimized nodes. Requires workloads to explicitly tolerate storage-optimized instances. |

### GPU / Accelerators Taints

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `nvidia.com/gpu` | `true` | `NoSchedule` | Applied by the autoscaler during GPU node provisioning. Restricts scheduling to GPU-tolerant workloads, preventing non-GPU pods from consuming GPU nodes. |
| `nvidia.com/gpu.mig` | `true` | `NoSchedule` | Applied by the autoscaler when provisioning nodes with MIG (Multi-Instance GPU) partitioning enabled. Restricts to MIG-compatible workloads. |
| `aws.amazon.com/neuron` | `true` | `NoSchedule` | Applied to AWS Inferentia/Trainium nodes. Restricts scheduling to workloads that explicitly require Neuron accelerators. |

For GPU workload configuration examples and toleration patterns, see GPU Instances.

### Architecture Taint

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `kubernetes.io/arch` | `arm64` | `NoSchedule` | Applied to ARM64 nodes. Requires pods to tolerate ARM64 architecture, preventing x86-only images from being scheduled on ARM nodes. |
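A workload whose image is built for ARM64 can opt in by tolerating this taint and selecting the architecture label. A sketch (the pod name and image are placeholders; the image must be multi-arch or ARM64-built):

```yaml
# Sketch: allow a multi-arch workload onto ARM64 nodes.
# The name and image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: arm64-capable-pod
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:1.27                   # multi-arch placeholder image
```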

### Eviction Taints (applied by Evictor)

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `evictor.cast.ai/evicting` | (none) | varies | Applied by the evictor when it starts draining a node. Signals that the node is being evacuated. |
| `evictor.cast.ai/evicted` | (none) | varies | Applied by the evictor after node eviction is complete. Signals the node has been fully drained. |

For Evictor operating modes, override rules, and advanced configuration, see Evictor.

### OMNI Edge Taint

| Key | Value | Effect | When Applied |
| --- | --- | --- | --- |
| `virtual-node.omni.cast.ai/not-allowed` | `true` | `NoExecute` | Applied to OMNI virtual nodes that are not allowed to run workloads. Evicts any existing pods in addition to blocking new scheduling. |

## Annotations (related, node-level)

These annotations are not labels or taints but are closely related and set on nodes:

| Annotation | Value | Purpose |
| --- | --- | --- |
| `autoscaling.cast.ai/removal-delay-seconds` | integer | Sets a removal delay in seconds for a node before it can be deleted. |
| `autoscaling.cast.ai/paused-draining-until` | timestamp | Pauses the draining process on a node until the given timestamp. |
| `rebalancing.cast.ai/status` | `drain-failed` | Set on a node when a drain operation during rebalancing has failed. |
| `evictor.cast.ai/eviction-status` | varies | Tracks the current eviction status of a node, set by the evictor. |
| `predictions.cast.ai/remove-after` | timestamp | Set on nodes rebalanced due to ML spot interruption predictions. Indicates when the node should be removed. |
**kubectl: Query nodes by annotation**

```shell
# Find nodes with a removal delay configured
kubectl get nodes -o json | jq -r '.items[] | select(.metadata.annotations["autoscaling.cast.ai/removal-delay-seconds"] != null) | .metadata.name'

# Find nodes with paused draining
kubectl get nodes -o json | jq -r '.items[] | select(.metadata.annotations["autoscaling.cast.ai/paused-draining-until"] != null) | "\(.metadata.name) paused until \(.metadata.annotations["autoscaling.cast.ai/paused-draining-until"])"'

# Find nodes with failed drain operations
kubectl get nodes -o json | jq -r '.items[] | select(.metadata.annotations["rebalancing.cast.ai/status"] == "drain-failed") | .metadata.name'
```

For configuring paused drain behavior during rebalancing, see Paused drain configuration.