Custom metrics

Custom metrics allow the Workload Autoscaler to scale applications based on signals that Kubernetes does not natively expose. The primary use case is connecting a Prometheus data source so the Workload Autoscaler can collect JVM heap metrics and generate heap-aware memory recommendations for Java workloads.

📘

Note

Other features that rely on custom metrics infrastructure — such as PSI (Pressure Stall Information) scaling and startup failure detection — work automatically once the Workload Autoscaler Exporter is installed. They use system-defined data sources and do not require you to connect a Prometheus data source. See System-defined data sources for details.

Prerequisites

The following components must be installed in your cluster:

Component | Minimum version | Description
castai-workload-autoscaler-exporter | 0.85.0 | Collects custom metrics from your cluster and forwards them to the Workload Autoscaler. Installed by default during cluster onboarding.
castai-agent | 0.113.0 | Cluster agent that communicates with the Cast AI control plane.

For upgrade instructions, see Workload Autoscaler configuration.
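
To compare an installed version against the minimums listed above, a small helper along these lines may be useful. It is only a sketch: the function name version_ge is hypothetical, and it assumes a sort that supports version ordering (-V), as GNU coreutils does.

```shell
# version_ge INSTALLED MINIMUM
# Succeeds (exit 0) when INSTALLED is greater than or equal to MINIMUM.
# A leading "v" is ignored. Assumes `sort -V` (version sort) is available.
version_ge() {
  local installed="${1#v}"
  local minimum="${2#v}"
  # With version sort, the smaller version comes first; the check passes
  # when the smaller of the two is the required minimum.
  [ "$(printf '%s\n%s\n' "$minimum" "$installed" | sort -V | head -n 1)" = "$minimum" ]
}
```

For example, version_ge 0.90.1 0.85.0 && echo "exporter version OK". The installed version itself can be read from helm list -n castai-agent. Whether the minimums above refer to the chart version or the app version shown there is not stated on this page, so treat that mapping as an assumption.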

Install the exporter manually

If the exporter was not installed during cluster onboarding, install it manually with Helm:

helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm install castai-workload-autoscaler-exporter castai-helm/castai-workload-autoscaler-exporter -n castai-agent

Configure a data source

A Prometheus data source connects the Workload Autoscaler to a Prometheus server running in your cluster, enabling it to collect application and runtime metrics for scaling decisions.

Default Exporter

When the exporter is installed, the Workload Autoscaler automatically creates a Default Exporter data source. This data source points to the exporter's own Prometheus instance (http://castai-workload-autoscaler-exporter-prometheus.castai-agent.svc:9090) and provides the system-defined metrics that power PSI scaling and startup failure detection.

JVM heap scaling needs JVM metrics from your Java workloads, which can reach the Workload Autoscaler in one of two ways. If your existing Prometheus server already scrapes JVM metrics, connect it as an additional data source. If no Prometheus server scrapes JVM metrics, enable auto-instrumentation in your scaling policy instead: the Workload Autoscaler then injects a JMX agent and routes the metrics through the Default Exporter automatically.

Connect a Prometheus data source

  1. Navigate to Workload Autoscaler > Custom metrics
  2. Click Add data source
  3. Enter a Data source name (for example, prometheus-prod-eu)
  4. Enter the Prometheus server URL — the in-cluster service URL of your Prometheus server (for example, http://prometheus.production.svc.cluster.local:9090)
  5. Click Connect
  6. Wait a few minutes for the data source to appear in the cluster
📘

Note

You can also reach this page from within a scaling policy. In the policy settings, click the Custom metrics section, then click the Configure custom metrics link.

Metrics explorer

The Custom metrics page includes a Metrics explorer where you can select a workload and time range to inspect the metrics being collected from your data sources. Use this to verify that JVM metrics are flowing correctly after connecting a data source or enabling auto-instrumentation.
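
A similar check can be approximated from the command line by querying the Prometheus HTTP API directly. The sketch below makes several assumptions: the URL is the example URL from the steps above, the metric name jvm_memory_used_bytes is a common JVM metric name that may differ in your setup, and the function names are hypothetical. Because the URL is an in-cluster service address, run it from a pod inside the cluster.

```shell
# Assumed values; replace with your own Prometheus URL and JVM metric name.
PROM_URL="${PROM_URL:-http://prometheus.production.svc.cluster.local:9090}"
JVM_METRIC="${JVM_METRIC:-jvm_memory_used_bytes}"

# Build the instant-query endpoint of the Prometheus HTTP API,
# tolerating a trailing slash on the base URL.
prom_query_endpoint() {
  printf '%s/api/v1/query\n' "${1%/}"
}

# Ask Prometheus for the current samples of the JVM metric.
# -G sends the URL-encoded query as a GET parameter; -f fails on HTTP errors.
check_jvm_metrics() {
  curl -fsS -G --max-time 10 \
    --data-urlencode "query=${JVM_METRIC}" \
    "$(prom_query_endpoint "$PROM_URL")"
}
```

Calling check_jvm_metrics prints the JSON response. A "status" of "success" with a non-empty data.result array suggests that this Prometheus server is scraping the metric; an empty array suggests it is not.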

Verify data source status

After connecting a data source, verify that it is healthy. The Custom metrics page shows the connection status for each data source. You can also check the custom resource status in the cluster:

kubectl get custommetricsexporterconfigs

Expected output when data sources are healthy:

NAME                      HEALTHY   READY   AGE
castai-data-source-name   True      1/1     21d
exporter-config           True      3/3     8m51s
📘

Note

It may take a few minutes for the custom resource to be created in the cluster after connecting a data source. If it does not appear immediately, wait and check again.
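
For scripted checks, the table output can be filtered for entries that are not healthy. This is a minimal sketch that assumes the column layout shown in the expected output above (name first, HEALTHY second); the function name unhealthy_configs is hypothetical.

```shell
# Read `kubectl get custommetricsexporterconfigs` output on stdin and print
# the name of every resource whose HEALTHY column is not "True".
# Assumes the header row and column order shown in the expected output.
unhealthy_configs() {
  awk 'NR > 1 && $2 != "True" { print $1 }'
}
```

Usage: kubectl get custommetricsexporterconfigs | unhealthy_configs. Empty output means every listed data source reports as healthy.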

System-defined data sources

The Workload Autoscaler provides built-in data sources that are installed alongside castai-workload-autoscaler-exporter by default. These data sources require no manual configuration and work automatically once the exporter is installed:

  • node-probes: Provides startup-related metrics used by startup failure detection.
  • node-psi: Provides Pressure Stall Information metrics used by stall detection. Requires Kubernetes 1.34+.
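
Since node-psi depends on the cluster version, it can help to check that version up front. The sketch below parses a Kubernetes version string and tests it against the 1.34 minimum stated above. The function name supports_node_psi is hypothetical, and the parsing assumes the usual major.minor.patch form, optionally with a leading "v" and a distribution suffix.

```shell
# supports_node_psi VERSION
# Succeeds when VERSION (for example "v1.34.0" or "1.33.6+k3s1") is 1.34 or newer.
supports_node_psi() {
  local version="${1#v}"
  local major="${version%%.*}"
  local rest="${version#*.}"
  local minor="${rest%%.*}"
  # Drop any non-numeric suffix on the minor part, such as the "+" in "34+".
  minor="${minor%%[!0-9]*}"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 34 ]; }
}
```

One way to feed it the server version, assuming jq is installed: supports_node_psi "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" && echo "node-psi supported".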

Troubleshooting

Data source not appearing

If the custom resource does not appear after connecting a data source:

  1. Verify that castai-workload-autoscaler-exporter is running:
    kubectl get pods -n castai-agent -l app.kubernetes.io/name=castai-workload-autoscaler-exporter
  2. Check exporter logs for errors:
    kubectl logs -n castai-agent -l app.kubernetes.io/name=castai-workload-autoscaler-exporter --tail=50
  3. Wait 2-3 minutes and check again — resource creation is asynchronous

Data source shows unhealthy

If a data source reports an unhealthy status, inspect the custom resource for details:

kubectl describe custommetricsexporterconfigs <config-name>

The status includes a reason for the unhealthy state. For example, the following status reports a cluster version incompatibility:

Status:
  Datasource Statuses:
    Type:     NodeWorkload
    Message:  Datasource disabled: node workload data source
              not supported on cluster version 1.33.6+k3s1
    Name:     node-psi
    State:    DatasourceDisabled

Address the underlying issue indicated in the status message, or contact Cast AI support if the error is unclear.
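
When a resource carries several data sources, the describe output can be condensed to one name and state per line. This is a rough sketch that assumes each data source entry lists its Name before its State, as in the example status above; the function name datasource_states is hypothetical.

```shell
# Read `kubectl describe custommetricsexporterconfigs <config-name>` output on
# stdin and print "<data source name> <state>" for every State line found.
# Assumes each entry prints "Name:" before "State:", as in the example status.
datasource_states() {
  awk '
    $1 == "Name:"  { name = $2 }
    $1 == "State:" { print name, $2 }
  '
}
```

Usage: kubectl describe custommetricsexporterconfigs <config-name> | datasource_states. For the example status above, this prints node-psi DatasourceDisabled.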

Next steps