Pod mutations

What are pod mutations?

Pod mutations is a Cast AI feature that simplifies Kubernetes workload configuration and helps optimize cluster resource usage. It allows you to define templates that automatically modify pod specifications when they are created, reducing manual configuration overhead and ensuring consistent pod scheduling across your cluster.

Why use pod mutations?

Managing Kubernetes workloads at scale presents several challenges:

  • Complex Configuration Requirements: As clusters grow, manually configuring pod specifications becomes increasingly time-consuming and error-prone. Each workload may need specific labels, tolerations, and node selectors to ensure proper scheduling and resource allocation.

  • Legacy System Integration: When onboarding existing clusters to Cast AI, workloads sometimes need to be reconfigured to take full advantage of cost optimization features. Traditionally, this means updating deployment manifests by hand; pod mutations can automate these changes instead.

  • Resource Fragmentation: Without standardized pod configurations, clusters can become fragmented with too many node groups, leading to inefficient resource utilization and increased costs.

Pod mutations address all of these challenges.

How it works

Pod mutations allow you to define templates that automatically modify pod specifications when they are created. These templates can:

  • Apply labels and tolerations
  • Configure node selectors and affinities
  • Link pods to specific Node Templates
  • Consolidate multiple Node Templates
  • Set Spot Instance preferences

The pod mutations controller, called the pod mutator, runs in your cluster and monitors pod creation events. When a new pod matches a mutation's configured filters, the controller automatically applies that mutation. Note that only one mutation can be applied to any given pod - if multiple mutations match a pod's filters, the most specific filter match will be used.
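For example, suppose two mutations are enabled (an illustrative pair of filters, using the objectFilter fields shown in the API examples later in this document):

{
  "name": "broad-production",
  "objectFilter": {
    "namespaces": ["production"]
  }
}

{
  "name": "frontend-only",
  "objectFilter": {
    "namespaces": ["production"],
    "names": ["frontend"]
  }
}

A pod named frontend in the production namespace matches both filters, but only frontend-only is applied, since a filter on pod name is more specific than a filter on namespace alone.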

Installation

Install using the console

  1. Upon selecting a cluster from the cluster list, head over to Autoscaler → Pod mutations in the sidebar.
  2. If you have not installed the pod-mutator controller, you will be prompted with a script you need to run in your cluster's cloud shell or terminal.

Install using Helm

  1. Add the Cast AI Helm repository:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
  2. Install the pod mutations controller:
helm upgrade -i --create-namespace -n castai-agent pod-mutator \
  castai-helm/castai-pod-mutator \
  --set castai.apiUrl="https://api.cast.ai" \
  --set castai.apiKey="${API_KEY}" \
  --set castai.clusterID="${CLUSTER_ID}"
📘

Note

Prior to Pod Mutator version v0.0.26, an additional parameter --set castai.organizationID="${ORGANIZATION_ID}" was required. If you're using a fixed Pod Mutator version older than v0.0.26, you'll still need to include this parameter.
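
Once the release is installed, you can verify that the controller is running. The release name matches the command above, and the label selector is the same one used in the Troubleshooting section below:

helm status pod-mutator -n castai-agent
kubectl get pods -n castai-agent -l app=pod-mutator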

Advanced installation options

The pod mutator supports configuring its webhook reinvocation policy. This setting controls whether the pod mutator is reinvoked if other admission plugins modify the pod after the initial mutation.

helm upgrade -i --create-namespace -n castai-agent pod-mutator \
  castai-helm/castai-pod-mutator \
  --set castai.apiUrl="https://api.cast.ai" \
  --set castai.apiKey="${API_KEY}" \
  --set castai.clusterID="${CLUSTER_ID}" \
  --set webhook.reinvocationPolicy="IfNeeded" # Set to "Never" by default

The reinvocationPolicy can be set to:

  • Never (default): The pod mutator will only be called once during pod admission
  • IfNeeded: The pod mutator may be called again if other admission plugins modify the pod after the initial mutation

Setting reinvocationPolicy to IfNeeded is useful when you have multiple admission webhooks that may interact with each other. For example:

  1. Pod mutator adds its mutations
  2. Another webhook modifies the pod
  3. Pod mutator is invoked again to ensure its mutations are properly applied

However, if you want changes made by other webhooks to persist, setting reinvocationPolicy to IfNeeded may be counterproductive since the pod mutator will override any modifications that fall under its control when it's reinvoked. Consider your specific use case and the interaction between different webhooks in your cluster before changing this setting from its default value.
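
To check which policy is currently in effect, you can inspect the webhook configuration deployed in your cluster. Grepping the output avoids assuming the exact name of the webhook configuration object:

kubectl get mutatingwebhookconfigurations -o yaml | grep -i reinvocationPolicy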

Creating pod mutations

Pod mutations can be defined through multiple methods:

  • Using the Cast AI console interface
  • Via the PodMutations API
  • As Kubernetes Custom Resources using Terraform or other Kubernetes management tools

Each mutation consists of:

  • A unique name
  • Object filters to select targeted pods
  • Mutation rules defining what changes to apply
  • Node Template configurations (optional)
  • Spot Instance preferences (optional)

Object filters and targeting

Label vs annotation targeting

Pod mutations only work with labels, not annotations. When configuring object filters, ensure you use labels to target pods.
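
For example, given the pod below, an object filter can match the team label, while the identical key-value pair under annotations is invisible to pod mutations (an illustrative snippet; the key and value are arbitrary):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    team: payments        # Can be targeted by object filters
  annotations:
    team: payments        # Ignored by pod mutations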

Workload type targeting

Pod mutations target workloads based on their Kubernetes kind. Use the following kinds for different workload types:

Workload Type   Kind to Use   Description
Jobs            Job           For batch processing workloads
CronJobs        CronJob       For scheduled recurring jobs
Bare Pods       Pod           For standalone pods without a controller
Deployments     Deployment    For applications with multiple replicas
StatefulSets    StatefulSet   For stateful applications

Label placement for Deployments

When targeting Deployments with labels, place the label at the pod template level (spec.template.metadata.labels), not at the Deployment level (metadata.labels).

Correct placement:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-app
spec:
  replicas: 1
  template:
    metadata:
      labels:
        single-replica: "true"  # Place label here (pod template)
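
Incorrect placement (shown for contrast; a label here is not visible to the pod mutator):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-app
  labels:
    single-replica: "true"  # Deployment-level label: pods do not inherit this
spec:
  replicas: 1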

Multiple mutations for complex scenarios

Pod mutation selection works as an AND operation between namespaces and labels, not OR. For scenarios where you need to target workloads based on either namespace OR labels, create separate mutations.

Example: Target both namespace-based AND label-based workloads

Create two separate mutations:

Mutation 1 - Namespace-based:

{
  "name": "system-namespaces",
  "objectFilter": {
    "namespaces": ["kube-system", "argocd"]
  },
  "nodeSelector": {
    "scheduling.cast.ai/node-template": "system-workloads"
  }
}

Mutation 2 - Label-based:

{
  "name": "specific-labels",
  "objectFilter": {
    "labels": {
      "app.kubernetes.io/name": "castai-agent"
    }
  },
  "nodeSelector": {
    "scheduling.cast.ai/node-template": "system-workloads"
  }
}

Console example

After installing the pod-mutator controller in your cluster, you'll have access to pod mutations in your console.

To create a new mutation template, click Add template in the top right. This opens a drawer in which you can define the configuration of your mutation.

  1. Begin by giving your mutation template a name.
  2. Define the filters by which the controller should discover candidate pods for the mutation.
  3. Configure your desired mutation settings.
  4. Finally, choose the Spot settings most appropriate for this template before hitting Create.

The console UI offers a helping hand when creating mutations by means of tooltips and a live preview of what your configuration will look like.

API example

Here's an example pod mutation API request that applies labels and tolerations to specific workloads:

{
  "objectFilter": {
    "names": [
      "app1",
      "app2"
    ],
    "namespaces": [
      "production"
    ]
  },
  "labels": {
    "environment": "production"
  },
  "spotType": "UNSPECIFIED_SPOT_TYPE",
  "name": "production-mutation",
  "organizationId": "fhytif73-f95f-44de-ad4b-f7898ce5ee42",
  "clusterId": "11111111-1111-1111-1111-111111111111",
  "enabled": true,
  "tolerations": [
    {
      "key": "scheduling.cast.ai/node-template",
      "operator": "Equal",
      "value": "production-template",
      "effect": "NoSchedule"
    }
  ]
}

Use the CreatePodMutation endpoint to experiment with your own pod mutations via API.
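
As a sketch, such a request can be sent with curl. The endpoint path below is an assumption based on the cluster-scoped shape of the request; confirm the exact route in the CreatePodMutation API reference before use:

# Hypothetical route: check the API reference for the exact path
curl -X POST "https://api.cast.ai/v1/kubernetes/clusters/${CLUSTER_ID}/pod-mutations" \
  -H "X-API-Key: ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d @mutation.json   # The JSON payload shown above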

Terraform example

The castai-pod-mutator Helm chart installs a Kubernetes Custom Resource Definition (CRD) for the PodMutation kind. Pod mutation rules can then be added to a cluster as Kubernetes objects using Terraform or other Kubernetes management tools.

For the latest Terraform examples, see our GitHub repository.

📘

Note on UI sync

Pod mutations created as Custom Resources in Kubernetes using Terraform or other tools will sync to the Cast AI console with a slight delay (approximately 3 minutes). They will appear with a (Custom Resource in cluster) suffix in the name to indicate they are managed outside the console.

Important considerations:

  • These synced mutations cannot be edited through the Cast AI console – any attempt to edit will fail with an error
  • All modifications must be made through Terraform
  • Changes to the Custom Resource will be reflected in the UI after the sync delay

Here's an example Terraform configuration that creates a pod mutation:

# The castai-pod-mutator helm chart installs Kubernetes Custom Resource Definition (CRD) for the kind 'PodMutation'.
# Pod mutation rules can then be added to a cluster as plain Kubernetes object of this kind.
#
# Use a name that is *not* shared by a mutation created via the Cast AI console.
# If the names collide, the custom resource mutation will be shadowed by the console-created mutation.
resource "kubernetes_manifest" "test_pod_mutation" {
  manifest = {
    apiVersion = "pod-mutations.cast.ai/v1"
    kind       = "PodMutation"
    metadata = {
      name = "test-pod-mutation"
    }
    spec = {
      filter = {
        # Filter values can be plain strings or regexes.
        workload = {
          namespaces = ["production", "staging"]
          names      = ["^frontend-.*$", "^backend-.*$"]
          kinds      = ["Pod", "Deployment", "ReplicaSet"]
        }
        pod = {
          # labelsOperator can be "and" or "or"
          labelsOperator = "and"
          labelsFilter = [
            {
              key   = "app.kubernetes.io/part-of"
              value = "platform"
            },
            {
              key   = "tier"
              value = "frontend"
            }
          ]
        }
      }
      restartPolicy = "deferred"
      patches = [
        {
          op    = "add"
          path  = "/metadata/annotations/mutated-by-pod-mutator"
          value = "true"
        }
      ]
      spotConfig = {
        # mode can be "preferred-spot", "optional-spot", or "only-spot"
        mode                   = "preferred-spot"
        distributionPercentage = 50
      }
    }
  }
}

Important considerations when using Terraform:

  1. Naming conflicts: Use unique names that don't conflict with mutations created via the Cast AI console
  2. Console visibility: Terraform-created mutations will appear in the Cast AI UI with a (Custom Resource in cluster) suffix after a 3-minute sync delay
  3. Read-only in UI: Console-synced Custom Resource mutations cannot be edited through the UI - all changes must be made via Terraform
  4. Management: Choose either Terraform or console management for each mutation to avoid conflicts

You can also manage pod mutations using other Kubernetes management tools like kubectl, Helm charts, or GitOps workflows by applying the same Custom Resource format.
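
For instance, the same mutation as the Terraform example above can be written as a plain manifest and applied with kubectl apply -f pod-mutation.yaml:

apiVersion: pod-mutations.cast.ai/v1
kind: PodMutation
metadata:
  name: test-pod-mutation
spec:
  filter:
    workload:
      namespaces: ["production", "staging"]
      names: ["^frontend-.*$", "^backend-.*$"]
      kinds: ["Pod", "Deployment", "ReplicaSet"]
    pod:
      labelsOperator: "and"
      labelsFilter:
        - key: app.kubernetes.io/part-of
          value: platform
        - key: tier
          value: frontend
  restartPolicy: deferred
  patches:
    - op: add
      path: /metadata/annotations/mutated-by-pod-mutator
      value: "true"
  spotConfig:
    mode: preferred-spot
    distributionPercentage: 50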

Node Template consolidation

One powerful feature of pod mutations is the ability to consolidate multiple Node Templates. This helps reduce cluster fragmentation by allowing pods to schedule across multiple Node Template configurations.

When consolidating Node Templates:

  1. Specify the Node Templates to consolidate
  2. The controller converts individual node selectors and tolerations into node affinity rules
  3. Pods can then be scheduled on any node created by the specified templates

Example consolidation configuration:

{
  "objectFilter": {
    "namespaces": [
      "production"
    ]
  },
  "name": "production-mutation",
  "nodeTemplatesToConsolidate": [
    "template-1",
    "template-2"
  ]
}
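
As a rough illustration of step 2, the consolidated constraint behaves like a node affinity rule that accepts nodes created from either template. The exact rules the controller generates may differ:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: scheduling.cast.ai/node-template
              operator: In
              values:
                - template-1
                - template-2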

Spot Instance configuration

🚧

Notice

If you are using the deprecated Spot-webhook, make sure to remove it before using pod mutations for Spot Instance configuration management. The two solutions are incompatible.

Pod mutations support three Spot Instance modes, and each mutation can also specify a percentage-based distribution between Spot and On-Demand instances.

Interaction with other admission controllers

Pod mutations can interact with other admission controllers in your cluster. If you're using Cast AI's pod-node-lifecycle feature alongside pod mutations, they may conflict when both try to manage Spot Instance scheduling. Symptoms include unwanted spot tolerations being added to pods or node selectors for Spot Instances appearing when not configured.

To resolve conflicts, either migrate to pod mutations for Spot Instance management and disable pod-node-lifecycle, or configure pod-node-lifecycle to ignore pods managed by mutations using label selectors.

Example pod-node-lifecycle exclusion:

ignorePods:
  - labelSelector:
      matchExpressions:
        - key: app.kubernetes.io/name
          operator: In
          values:
            - emissary-ext
            - castai-agent
            - castai-cluster-controller

Spot Distribution Percentage

When configuring Spot settings for your pod mutations, you can specify what percentage of pods should receive Spot-related configuration versus remaining on On-Demand instances.

  • A setting of 50% means that approximately half of your pods will receive the selected Spot configuration (optional, preferred, or use-only), while the other half will be scheduled on On-Demand instances
  • The higher the percentage, the more pods will receive Spot-related configurations
  • The lower the percentage, the more pods will remain on On-Demand instances

For example, with a 75% Spot distribution setting:

  • 75% of pods will be scheduled according to your chosen Spot behavior (optional, preferred, or use-only)
  • 25% of pods will be scheduled on On-Demand instances

The mutation controller makes this determination when pods are created, applying Spot-related mutations to the configured percentage of pods while leaving the remainder configured for On-Demand instances.

📘

Note on rapid scaling

When a deployment scales up instantaneously, for example, from 0 to 10 replicas at once, the pod mutator may not achieve the exact configured Spot/On-Demand distribution (e.g., 60/40) immediately. This happens because the controller must make placement decisions for each pod independently, without knowing the outcome of other pods being created at the same time. While the initial distribution might be skewed, the system is designed to self-correct and converge toward the configured ratio over time as pods are deleted, recreated, or scaled more gradually.

Spot Distribution Options

Combined with the distribution percentage, you can select one of three Spot behavior options:

Mode                           Description
Spot Instances are optional    Allows scheduling on either Spot or On-Demand instances to fulfill the selected Spot percentage. If both are available, there is no preference between instance types.
Use only Spot Instances        Strictly maintains the selected Spot/On-Demand ratio. If Spot Instances are unavailable, deployment will fail for the Spot portion.
Spot Instances are preferred   Targets the selected Spot percentage with Spot Instances but automatically falls back to On-Demand instances if Spot becomes unavailable. Will attempt to rebalance back to Spot when available.
📘

Note

The actual distribution may vary slightly from the configured percentage, especially with small pod counts or simultaneous pod creation. For very low replica counts, the system prioritizes maintaining the minimum On-Demand percentage. For example, with a single pod and any Spot distribution below 100%, the pod will be scheduled on On-Demand to ensure the minimum On-Demand percentage is maintained. The distribution may also drift over time as pods are deleted and recreated through the normal application lifecycle, but the system is designed to be self-healing, meaning it will attempt to restore the desired distribution whenever new pods are created.

Example Configuration

{
  "name": "production-spot-mutation",
  "organizationId": "org-12345",
  "clusterId": "cluster-67890",
  "enabled": true,
  "objectFilter": {
    "namespaces": [
      "production"
    ]
  },
  "spotType": "PREFERRED_SPOT",
  "spotDistributionPercentage": 75
}

This configuration creates a mutation that:

  1. Applies to all pods in the "production" namespace
  2. Sets 75% of pods to use Spot Instances with fallback to On-Demand if unavailable
  3. Keeps 25% of pods on On-Demand instances at all times

Combining percentage-based distribution with different Spot behavior options allows you to create deployment strategies that balance cost savings with application reliability requirements.

Advanced Configuration with JSON Patch

The Pod Mutations feature supports advanced configuration using JSON Patch, allowing for precise control over pod specifications beyond what's available through the standard UI options.

What is JSON Patch?

JSON Patch is a format for describing changes to a JSON document, defined in RFC 6902. In Kubernetes, it allows for complex modifications to pod specifications through a series of operations such as add, remove, replace, move, copy, and test.

For more information about JSON Patch operations, refer to Kubernetes documentation.

When to Use JSON Patch

Consider using JSON Patch when:

  • You need to modify parts of a pod specification not covered by the standard UI options
  • You want to perform multiple transformations in a specific order
  • You're implementing complex mutation logic that combines adding, removing, and modifying fields
  • You need to remove specific elements from arrays or nested structures

Configuring JSON Patch

To configure a JSON Patch:

  1. In the pod mutation configuration, expand the "JSON Patch (advanced)" section

  2. Enter your JSON Patch operations in the drawer editor

  3. Review the patch for errors before applying

🚧

Warning

JSON Patch operations take precedence over UI-defined settings. If there's a conflict between your patch operations and UI configurations, the patch operations will be applied.

JSON Patch Structure

A JSON Patch consists of an array of operations, where each operation is an object with the following properties:

  • op: The operation to perform (add, remove, replace, move, copy, or test)
  • path: A JSON pointer to the location in the document where the operation is performed
  • value: The value to use for the operation (for add and replace)
  • from: A JSON pointer for the move and copy operations

Common Examples

Add Node Selector

When you need to ensure pods are scheduled on specific nodes with certain characteristics, adding a node selector is the way to go:

[
  {
    "op": "add",
    "path": "/spec/nodeSelector",
    "value": {
      "scheduling.cast.ai/node-template": "high-performance"
    }
  }
]

This patch adds a node selector that directs pods to nodes using the "high-performance" node template, which might be optimized for CPU-intensive workloads.

🚧

Warning

This patch will replace any existing nodeSelector entirely. If you want to preserve existing nodeSelectors, use the method below.

Add a Single Node Selector Key-Value

To add a single nodeSelector key-value pair while preserving existing ones:

[
  {
    "op": "add",
    "path": "/spec/nodeSelector/scheduling.cast.ai~1node-template",
    "value": "high-performance"
  }
]

Note the special syntax with the tilde character (~1), which is used to escape the forward slash in the key name. However, this patch will fail if the nodeSelector object doesn't already exist in the pod specification.

Replace Toleration Effect

If you need to modify how pods tolerate node taints, you can replace specific fields within existing tolerations:

[
  {
    "op": "replace",
    "path": "/spec/tolerations/0/effect",
    "value": "NoSchedule"
  }
]

This patch changes the effect of the first toleration to "NoSchedule", ensuring pods won't be scheduled on nodes with matching taints rather than potentially being evicted later.

Remove a Specific Array Element

Sometimes, you need to remove specific configuration elements that are no longer needed or might conflict with your intended setup:

[
  {
    "op": "remove",
    "path": "/spec/tolerations/2"
  }
]

This patch removes the third toleration in the array (index 2, since JSON Patch array indices start at 0), which might be necessary when transitioning workloads between different node types or environments.

Remove by Key

When you need to remove a specific key from a map or object:

[
  {
    "op": "remove",
    "path": "/spec/nodeSelector/environment"
  }
]

This patch removes the "environment" key from nodeSelector while preserving other nodeSelector entries.

Remove a Specific Value from an Array

For more complex scenarios where you need to target array elements based on their content rather than position:

[
  {
    "op": "test",
    "path": "/spec/tolerations/0/key",
    "value": "node-role.kubernetes.io/control-plane"
  },
  {
    "op": "remove",
    "path": "/spec/tolerations/0"
  }
]

This two-step patch first verifies that the first toleration matches a specific control plane role, then removes it if the test passes.

Complex Example: Replace Node Affinity and Add Toleration

For comprehensive pod scheduling adjustments that require multiple coordinated changes:

[
  {
    "op": "remove",
    "path": "/spec/affinity"
  },
  {
    "op": "add",
    "path": "/spec/nodeSelector",
    "value": {
      "scheduling.cast.ai/node-template": "custom-template"
    }
  },
  {
    "op": "add",
    "path": "/spec/tolerations/-",
    "value": {
      "key": "scheduling.cast.ai/node-template",
      "operator": "Equal",
      "value": "custom-template",
      "effect": "NoSchedule"
    }
  }
]

This multi-operation patch completely reconfigures pod scheduling by removing any existing node affinity rules, setting a node selector for a custom template, and adding a matching toleration.

Example replacing values for the native Azure key: agentpool

The patch below uses the move and replace operations to rewrite pod scheduling keys from the Azure system node pool to a Cast AI Node Template. In this example, the key is changed from agentpool to dedicated, the label the new Node Template requires to schedule pods correctly.

[
  {
    "op": "move",
    "from": "/metadata/labels/agentpool",
    "path": "/metadata/labels/dedicated"
  },
  {
    "op": "move",
    "from": "/spec/nodeSelector/agentpool",
    "path": "/spec/nodeSelector/dedicated"
  },
  {
    "op": "replace",
    "path": "/spec/tolerations/[key=agentpool]/key",
    "value": "dedicated"
  },
  {
    "op": "replace",
    "path": "/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/*/matchExpressions/[key=agentpool]/key",
    "value": "dedicated"
  },
  {
    "op": "replace",
    "path": "/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution/*/preference/matchExpressions/[key=agentpool]/key",
    "value": "dedicated"
  }
]

This multi-operation patch prevents pods from being scheduled onto the Azure system node pool.
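
To make the effect concrete, a pod that previously selected the Azure system node pool would have its key rewritten roughly as follows (illustrative values):

# Before the mutation
nodeSelector:
  agentpool: system

# After the mutation
nodeSelector:
  dedicated: system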

JSON Patch Limitations

  • JSON Patch operations apply to the pod template, not directly to the running pods
  • Some Kubernetes fields are immutable and cannot be changed after creation
  • Patches that would result in invalid pod specifications will be rejected

Limitations

  • Mutations only apply to newly created pods; likewise, changes to a mutation don't affect existing pods until those pods are recreated.
  • Only one mutation can be applied to a pod. When multiple mutations have matching filters for a pod, Cast AI selects the mutation with the most specific filter (for example, a filter on pod name is more specific than a filter on namespace). We recommend using mutually exclusive filters rather than relying on this specificity ranking.
  • Some pod configurations cannot be modified. Refer to the information above on what can be modified; anything not mentioned is currently beyond the scope of pod mutations.
  • Scaling policies are evaluated every 30 seconds. As a result, changes to resource requests or limits may not be applied immediately.

Troubleshooting

Verify controller status

Check if the pod-mutator controller is running:

kubectl get pods -n castai-agent -l app=pod-mutator

Check controller logs

View logs for mutation activity:

kubectl logs -n castai-agent -l app=pod-mutator
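
To narrow the output to a specific workload, you can filter the logs by pod or mutation name (replace my-app with your own name):

kubectl logs -n castai-agent -l app=pod-mutator | grep my-app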

Common issues

  1. Mutations not applying: Verify the object filters match your pods, and the controller is running

  2. Configuration conflicts: Check for conflicting mutations targeting the same pods

  3. Invalid mutations: Ensure mutation specifications follow the correct format

  4. Mutations not applying correctly with multiple webhooks: If you have multiple admission webhooks in your cluster that modify pods, you may need to set webhook.reinvocationPolicy="IfNeeded" during installation to ensure the pod mutator can properly apply its mutations after other webhooks make changes. Check the pod mutator logs for any signs of mutation conflicts or ordering issues.

  5. Mutations not applying to Deployments: Verify labels are placed at spec.template.metadata.labels, not at the Deployment level

  6. Unexpected Spot Instance scheduling: Check for conflicts with pod-node-lifecycle or other admission controllers

  7. Labels not working for targeting: Confirm you're using labels, not annotations, for pod selection

For additional help, contact Cast AI support or visit our community Slack channel.