Pod mutations

What are pod mutations?

Pod mutations is a Cast AI feature that simplifies Kubernetes workload configuration and helps optimize cluster resource usage. It allows you to define templates that automatically modify pod specifications when they are created, reducing manual configuration overhead and ensuring consistent pod scheduling across your cluster.

Why use pod mutations?

Managing Kubernetes workloads at scale presents several challenges:

  • Complex Configuration Requirements: As clusters grow, manually configuring pod specifications becomes increasingly time-consuming and error-prone. Each workload may need specific labels, tolerations, and node selectors to ensure proper scheduling and resource allocation.

  • Legacy System Integration: When onboarding existing clusters to Cast AI, workloads sometimes need to be reconfigured to take full advantage of cost optimization features. This traditionally requires updating deployment manifests, which can be automated using pod mutations.

  • Resource Fragmentation: Without standardized pod configurations, clusters can become fragmented with too many node groups, leading to inefficient resource utilization and increased costs.

Pod mutations address all of these challenges.

How it works

Pod mutations allow you to define templates that automatically modify pod specifications when they are created. These templates can:

  • Apply labels and tolerations
  • Configure node selectors and affinities
  • Link pods to specific Node Templates
  • Consolidate multiple Node Templates
  • Set Spot Instance preferences

The pod mutations controller, called the pod mutator, runs in your cluster and monitors pod creation events. When a new pod matches a mutation's configured filters, the controller automatically applies that mutation. Note that only one mutation can be applied to any given pod - if multiple mutations match a pod's filters, the most specific filter match will be used.
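The "most specific filter wins" rule can be sketched as follows. This is an illustrative model, not the controller's actual code: it assumes a simple scoring scheme in which a filter on pod names outranks a filter on namespaces, matching the specificity example given in the Limitations section below.

```python
# Illustrative sketch (not Cast AI's implementation): when several
# mutations match a pod, the one with the most specific filter wins.
# Specificity here is scored by how narrow each filter field is:
# a pod-name match counts more than a namespace match.

def specificity(object_filter: dict) -> int:
    """Score a filter: name matches outrank namespace matches."""
    score = 0
    if object_filter.get("names"):
        score += 2  # pod name is the most specific criterion
    if object_filter.get("namespaces"):
        score += 1
    return score

def select_mutation(matching_mutations: list[dict]) -> dict:
    """Pick the single mutation to apply from all filter matches."""
    return max(matching_mutations, key=lambda m: specificity(m["objectFilter"]))

mutations = [
    {"name": "broad", "objectFilter": {"namespaces": ["production"]}},
    {"name": "narrow", "objectFilter": {"names": ["app1"], "namespaces": ["production"]}},
]
print(select_mutation(mutations)["name"])  # narrow
```

In practice, mutually exclusive filters (see Best practices below) avoid relying on this ranking at all.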

Installation

Install using the console

  1. Select a cluster from the cluster list, then go to Autoscaler --> Pod mutations in the sidebar.
  2. If the pod-mutator controller is not yet installed, you will be prompted with a script to run in your cluster's cloud shell or terminal.

Install using Helm

  1. Add the Cast AI Helm repository:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
  2. Install the pod mutations controller:
helm upgrade -i --create-namespace -n castai-agent pod-mutator \
  castai-helm/castai-pod-mutator \
  --set castai.apiUrl="https://api.cast.ai" \
  --set castai.apiKey="${API_KEY}" \
  --set castai.clusterID="${CLUSTER_ID}"

Advanced installation options

The pod mutator supports configuring its webhook reinvocation policy. This policy controls whether the pod mutator is reinvoked if other admission plugins modify the pod after the initial mutation.

helm upgrade -i --create-namespace -n castai-agent pod-mutator \
  castai-helm/castai-pod-mutator \
  --set castai.apiUrl="https://api.cast.ai" \
  --set castai.apiKey="${API_KEY}" \
  --set castai.clusterID="${CLUSTER_ID}" \
  --set webhook.reinvocationPolicy="IfNeeded"  # Defaults to "Never"

The reinvocationPolicy can be set to:

  • Never (default): The pod mutator will only be called once during pod admission
  • IfNeeded: The pod mutator may be called again if other admission plugins modify the pod after the initial mutation

Setting reinvocationPolicy to IfNeeded is useful when you have multiple admission webhooks that may interact with each other. For example:

  1. Pod mutator adds its mutations
  2. Another webhook modifies the pod
  3. Pod mutator is invoked again to ensure its mutations are properly applied

⚠️ However, if you want changes made by other webhooks to persist, setting reinvocationPolicy to IfNeeded may be counterproductive since the pod mutator will override any modifications that fall under its control when it's reinvoked. Consider your specific use case and the interaction between different webhooks in your cluster before changing this setting from its default value.
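Under the hood, this Helm value maps to the standard Kubernetes `reinvocationPolicy` field on the mutating webhook. As a rough sketch (the actual resource name and webhook fields are managed by the Helm chart and may differ), the rendered configuration corresponds to something like:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castai-pod-mutator   # illustrative name; the chart controls the real one
webhooks:
  - name: pod-mutator.cast.ai   # illustrative
    reinvocationPolicy: IfNeeded   # or "Never" (the default)
```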

Creating pod mutations

Pod mutations are either defined through the PodMutations API or created using the Cast AI console. Each mutation consists of:

  • A unique name
  • Object filters to select targeted pods
  • Mutation rules defining what changes to apply
  • Node Template configurations (optional)
  • Spot instance preferences (optional)

Console example

After installing the pod-mutator controller in your cluster, you'll have access to the Pod mutations page in your console.

To create a new mutation template, click Add template in the top-right corner. This opens a drawer in which you can define your mutation's configuration.

  1. Begin by giving your mutation template a name.
  2. Define the filters by which the controller should discover candidate pods for the mutation.
  3. Configure the desired mutation settings.
  4. Finally, choose the Spot Instance settings most appropriate for this template and click Create.

The console UI assists you while creating mutations with tooltips and a live preview of what your configuration will look like.

API example

Here's an example pod mutation API request that applies labels and tolerations to specific workloads:

{
  "objectFilter": {
    "names": [
      "app1",
      "app2"
    ],
    "namespaces": [
      "production"
    ]
  },
  "labels": {
    "environment": "production"
  },
  "spotType": "UNSPECIFIED_SPOT_TYPE",
  "name": "production-mutation",
  "organizationId": "fhytif73-f95f-44de-ad4b-f7898ce5ee42",
  "clusterId": "11111111-1111-1111-1111-111111111111",
  "enabled": true,
  "tolerations": [
    {
      "key": "scheduling.cast.ai/node-template",
      "operator": "Equal",
      "value": "production-template",
      "effect": "NoSchedule"
    }
  ]
}

Use the CreatePodMutation endpoint to experiment with your own pod mutations via API.

Node Template consolidation

One powerful feature of pod mutations is the ability to consolidate multiple Node Templates. This helps reduce cluster fragmentation by allowing pods to schedule across multiple Node Template configurations.

When consolidating Node Templates:

  1. Specify the Node Templates to consolidate
  2. The controller converts individual node selectors and tolerations into node affinity rules
  3. Pods can then schedule on any node created by the specified templates

Example consolidation configuration:

{
  "objectFilter": {
    "namespaces": [
      "production"
    ]
  },
  "name": "production-mutation",
  "nodeTemplatesToConsolidate": [
    "template-1",
    "template-2"
  ]
}
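Conceptually, consolidation rewrites the per-template node selectors into a single node affinity rule that accepts nodes from any of the listed templates. The exact output is managed by the Cast AI controller and may differ, but the resulting pod spec would look roughly like this sketch:

```yaml
# Illustrative sketch of the kind of affinity the controller could produce
# for the consolidation example above (actual output may differ):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: scheduling.cast.ai/node-template
              operator: In
              values:
                - template-1
                - template-2
```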

Spot Instance configuration

Pod mutations support three Spot Instance modes, each of which can be combined with a percentage-based distribution between Spot and On-demand instances.

Spot Distribution Percentage

When configuring spot settings for your pod mutations, you can specify what percentage of pods should receive spot-related configuration versus remaining on on-demand instances.

  • A setting of 50% means that approximately half of your pods will receive the selected spot configuration (optional, preferred, or use-only), while the other half will be scheduled on on-demand instances
  • The higher the percentage, the more pods will receive spot-related configurations
  • The lower the percentage, the more pods will remain on on-demand instances

For example, with a 75% spot distribution setting:

  • 75% of pods will be scheduled according to your chosen spot behavior (optional, preferred, or use-only)
  • 25% of pods will be scheduled on on-demand instances

The mutation controller makes this determination when pods are created, applying spot-related mutations to the configured percentage of pods while leaving the remainder configured for on-demand instances.
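One simple way such a percentage split can be maintained at pod-creation time is with a running counter that tracks the target ratio. This is an illustrative sketch, not Cast AI's actual implementation:

```python
# Illustrative sketch (not Cast AI's implementation): keep roughly
# `percentage`% of created pods on the spot configuration by comparing
# the running spot ratio against the target on each new pod.

class SpotDistributor:
    def __init__(self, percentage: int):
        self.percentage = percentage
        self.total = 0  # pods seen so far
        self.spot = 0   # pods given the spot configuration

    def assign_spot(self) -> bool:
        """Return True if the next pod should get the spot configuration."""
        self.total += 1
        if self.spot / self.total < self.percentage / 100:
            self.spot += 1
            return True
        return False

dist = SpotDistributor(75)
decisions = [dist.assign_spot() for _ in range(100)]
print(sum(decisions))  # 75 of 100 pods receive the spot configuration
```

As the note below explains, a real controller sees pods created and deleted concurrently, so the observed distribution can drift slightly from the configured percentage.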

Spot Distribution Options

Combined with the distribution percentage, you can select one of three spot behavior options:

  • Spot Instances are optional: Allows scheduling on either Spot or On-Demand instances to fulfill the selected Spot percentage. There is no preference between instance types if both are available.
  • Use only Spot Instances: Strictly maintains the selected Spot/On-Demand ratio. If Spot Instances are unavailable, deployment will fail for the Spot portion.
  • Spot Instances are preferred: Targets the selected Spot percentage with Spot Instances but automatically falls back to On-Demand instances if Spot becomes unavailable. Will attempt to rebalance back to Spot when available.

📘

Note

The actual distribution may vary slightly from the configured percentage, especially with small pod counts or simultaneous pod creation. The distribution may also drift over time as pods are deleted and recreated through the normal application lifecycle, but the difference should be minimal.

Example Configuration

{
  "name": "production-spot-mutation",
  "organizationId": "org-12345",
  "clusterId": "cluster-67890",
  "enabled": true,
  "objectFilter": {
    "namespaces": [
      "production"
    ]
  },
  "spotType": "PREFERRED_SPOT",
  "spotDistributionPercentage": 75
}

This configuration creates a mutation that:

  1. Applies to all pods in the "production" namespace
  2. Sets 75% of pods to use Spot Instances with fallback to On-demand if unavailable
  3. Keeps 25% of pods on On-demand instances at all times

Combining percentage-based distribution with different spot behavior options allows you to create deployment strategies that balance cost savings with application reliability requirements.

Advanced Configuration with JSON Patch

The Pod Mutations feature supports advanced configuration using JSON Patch, allowing for precise control over pod specifications beyond what's available through the standard UI options.

What is JSON Patch?

JSON Patch is a format for describing changes to a JSON document, defined in RFC 6902. In Kubernetes, it allows for complex modifications to pod specifications through a series of operations such as add, remove, replace, move, copy, and test.

For more information about JSON Patch operations, refer to Kubernetes documentation.

When to Use JSON Patch

Consider using JSON Patch when:

  • You need to modify parts of a pod specification not covered by the standard UI options
  • You want to perform multiple transformations in a specific order
  • You're implementing complex mutation logic that combines adding, removing, and modifying fields
  • You need to remove specific elements from arrays or nested structures

Configuring JSON Patch

To configure a JSON Patch:

  1. In the pod mutation configuration, expand the "JSON Patch (advanced)" section:

  2. Enter your JSON Patch operations in the drawer editor

  3. Review the patch for errors before applying

🚧

Warning

JSON Patch operations take precedence over UI-defined settings. If there's a conflict between your patch operations and UI configurations, the patch operations will be applied.

JSON Patch Structure

A JSON Patch consists of an array of operations, where each operation is an object with the following properties:

  • op: The operation to perform (add, remove, replace, move, copy, or test)
  • path: A JSON pointer to the location in the document where the operation is performed
  • value: The value to use for the operation (for add and replace)
  • from: A JSON pointer for the move and copy operations
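To make these operations concrete, here is a deliberately simplified applier for the operations used in the examples below (add, replace, remove, test). Real tooling should rely on a full RFC 6902 implementation (for example, the `jsonpatch` Python package); this sketch only illustrates the semantics on a pod-like dictionary:

```python
# Simplified JSON Patch applier for illustration only; it supports
# add/replace/remove/test on dicts and lists, addressed by JSON pointers.

def resolve(doc, path):
    """Walk a JSON pointer, returning (parent container, final key)."""
    parts = path.lstrip("/").split("/")
    for part in parts[:-1]:
        doc = doc[int(part)] if isinstance(doc, list) else doc[part]
    return doc, parts[-1]

def apply_patch(doc, patch):
    for op in patch:
        parent, key = resolve(doc, op["path"])
        if isinstance(parent, list) and key != "-":
            key = int(key)  # list positions are numeric, zero-based
        if op["op"] == "add":
            if isinstance(parent, list):
                if key == "-":
                    parent.append(op["value"])  # "-" appends to the array
                else:
                    parent.insert(key, op["value"])
            else:
                parent[key] = op["value"]
        elif op["op"] == "replace":
            parent[key] = op["value"]
        elif op["op"] == "remove":
            del parent[key]
        elif op["op"] == "test":
            assert parent[key] == op["value"], f"test failed at {op['path']}"
    return doc

pod = {"spec": {"tolerations": [{"key": "a", "effect": "NoExecute"}]}}
apply_patch(pod, [
    {"op": "add", "path": "/spec/nodeSelector",
     "value": {"scheduling.cast.ai/node-template": "high-performance"}},
    {"op": "replace", "path": "/spec/tolerations/0/effect", "value": "NoSchedule"},
])
print(pod["spec"]["tolerations"][0]["effect"])  # NoSchedule
```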

Common Examples

Add Node Selector

When you need to ensure pods are scheduled on specific nodes with certain characteristics, adding a node selector is the way to go:

[
  {
    "op": "add",
    "path": "/spec/nodeSelector",
    "value": {
      "scheduling.cast.ai/node-template": "high-performance"
    }
  }
]

This patch adds a node selector that directs pods to nodes using the "high-performance" node template, which might be optimized for CPU-intensive workloads.

Replace Toleration Effect

If you need to modify how pods tolerate node taints, you can replace specific fields within existing tolerations:

[
  {
    "op": "replace",
    "path": "/spec/tolerations/0/effect",
    "value": "NoSchedule"
  }
]

This patch changes the effect of the first toleration to "NoSchedule", ensuring pods won't be scheduled on nodes with matching taints rather than potentially being evicted later.

Remove a Specific Array Element

Sometimes, you need to remove specific configuration elements that are no longer needed or might conflict with your intended setup:

[
  {
    "op": "remove",
    "path": "/spec/tolerations/2"
  }
]

This patch removes the third toleration in the tolerations array (index 2, since JSON Patch array paths are zero-based), which might be necessary when transitioning workloads between different node types or environments.

Remove by Key-Value Match

When you need to clean up specific labels or selectors that are no longer relevant:

[
  {
    "op": "remove",
    "path": "/spec/nodeSelector/environment"
  }
]

This patch removes only the "environment" selector while preserving other nodeSelector entries.

Remove a Specific Value from an Array

For more complex scenarios where you need to target array elements based on their content rather than position:

[
  {
    "op": "test",
    "path": "/spec/tolerations/0/key",
    "value": "node-role.kubernetes.io/control-plane"
  },
  {
    "op": "remove",
    "path": "/spec/tolerations/0"
  }
]

This two-step patch first verifies that the first toleration matches a specific control plane role, then removes it if the test passes.

Complex Example: Replace Node Affinity and Add Toleration

For comprehensive pod scheduling adjustments that require multiple coordinated changes:

[
  {
    "op": "remove",
    "path": "/spec/affinity"
  },
  {
    "op": "add",
    "path": "/spec/nodeSelector",
    "value": {
      "scheduling.cast.ai/node-template": "custom-template"
    }
  },
  {
    "op": "add",
    "path": "/spec/tolerations/-",
    "value": {
      "key": "scheduling.cast.ai/node-template",
      "operator": "Equal",
      "value": "custom-template",
      "effect": "NoSchedule"
    }
  }
]

This multi-operation patch completely reconfigures pod scheduling by removing any existing node affinity rules, setting a node selector for a custom template, and adding a matching toleration.

JSON Patch Limitations

  • JSON Patch operations apply to the pod template, not directly to the running pods
  • Some Kubernetes fields are immutable and cannot be changed after creation
  • Patches that would result in invalid pod specifications will be rejected

Best practices

  1. Use meaningful names: Give mutations descriptive names that indicate their purpose, so you can tell what a mutation does without inspecting its configuration.

  2. Design mutually exclusive filters: Since only one mutation can be applied to a pod, design your filters to clearly separate which pods should receive which mutations. Avoid overlapping filters that could match the same pod.

  3. Test in non-production: If possible, validate mutation behavior in a test environment.

  4. Monitor changes: Review the effects of mutations through the Cast AI console to ensure desired outcomes.

Limitations

  • Mutations only apply to newly created pods. Similarly, mutation changes don't affect existing pods until they are recreated.
  • Only one mutation can be applied to a pod. When multiple mutations have matching filters for a pod, Cast AI selects the mutation with the most specific filter (for example, a filter on pod name is more specific than a filter on namespace). We recommend using mutually exclusive filters rather than relying on this specificity ranking.
  • Some pod configurations cannot be modified. Refer to the information above on what can be changed; anything not mentioned is currently beyond the scope of pod mutations.
  • Scaling policies are evaluated every 30 seconds. As a result, changes to resource requests or limits may not be applied immediately.

Troubleshooting

Verify controller status

Check if the pod-mutator controller is running:

kubectl get pods -n castai-agent -l app=pod-mutator

Check controller logs

View logs for mutation activity:

kubectl logs -n castai-agent -l app=pod-mutator

Common issues

  1. Mutations not applying: Verify the object filters match your pods, and the controller is running

  2. Configuration conflicts: Check for conflicting mutations targeting the same pods

  3. Invalid mutations: Ensure mutation specifications follow the correct format

  4. Mutations not applying correctly with multiple webhooks: If you have multiple admission webhooks in your cluster that modify pods, you may need to set webhook.reinvocationPolicy="IfNeeded" during installation to ensure the pod mutator can properly apply its mutations after other webhooks make changes. Check the pod mutator logs for any signs of mutation conflicts or ordering issues.

For additional help, contact Cast AI support or visit our community Slack channel.