Cluster metrics integration

CAST AI delivers detailed metrics on your cluster utilization so that you can better understand your cloud infrastructure and ultimately reduce its cost. All metrics are exposed in a scrapable format, so you can collect them with Prometheus and visualize them in Grafana or another tool of your choice.

As a result, you can draw on the cluster utilization and cost stats and include them effortlessly in your team’s wider cloud monitoring and reporting efforts.

This guide outlines the metrics available in CAST AI and describes the process of exporting them to Prometheus and Grafana step by step.

How to visualize CAST AI metrics in Prometheus and Grafana

Why use CAST AI with Prometheus and Grafana

The combination of Prometheus and Grafana has become a common choice for DevOps and CloudOps teams, and for good reason.

The former gathers rich metrics and provides a powerful query language, while the latter transforms these into meaningful visualizations. Both Prometheus and Grafana are compatible with most data source types.

How to connect CAST AI with Prometheus and Grafana

1. Create your CAST AI API key

Enter your cluster in the CAST AI platform, click the API tab in the top menu, and generate a one-time token.

You’ll need to specify a key name and choose between read-only and full access. Then copy the token; you will paste it into the scrape configuration in step 2.

You can also use this key to access the CAST AI API in tools like Swagger UI.

2. Call the CAST AI API

Open your Prometheus scrape configuration in your favorite editor and add scrape jobs for CAST AI metrics. Our API provides separate endpoints for scraping cluster, node, workload, and allocation group level metrics. Here is an example of how to configure scraping for all of them:

scrape_configs:
 - job_name: 'castai_cluster_metrics'
   scrape_interval: 15s
   scheme: https
   static_configs:
     - targets: ['api.cast.ai']
   metrics_path: '/v1/metrics/prom'
   authorization:
     type: 'Token'
     credentials: '{apiKey}'
 - job_name: 'castai_node_template_metrics'
   scrape_interval: 1m
   scheme: https
   static_configs:
     - targets: ['api.cast.ai']
   metrics_path: '/v1/metrics/nodes'
   authorization:
     type: 'Token'
     credentials: '{apiKey}'
 - job_name: 'castai_workload_metrics'
   scrape_interval: 15s
   scheme: https
   static_configs:
     - targets: ['api.cast.ai']
   metrics_path: '/v1/metrics/workloads'
   authorization:
     type: 'Token'
     credentials: '{apiKey}'
 - job_name: 'castai_allocation_group_metrics'
   scrape_interval: 15s
   scheme: https
   static_configs:
     - targets: ['api.cast.ai']
   metrics_path: '/v1/metrics/allocation-groups'
   authorization:
     type: 'Token'
     credentials: '{apiKey}'

Please replace {apiKey} with the token created in step 1.
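Before wiring the key into Prometheus, it can help to check by hand that it is accepted. The sketch below is a minimal illustration, not an official client: it assumes the endpoint accepts the same `Authorization: Token <key>` header that Prometheus sends under the configuration above, and the names build_metrics_request and fetch_metrics are hypothetical helpers introduced here.

```python
import urllib.request

# Host taken from the `targets` entry in the scrape config above.
CASTAI_HOST = "api.cast.ai"


def build_metrics_request(api_key: str, metrics_path: str = "/v1/metrics/prom") -> urllib.request.Request:
    """Build the GET request that mirrors one scrape job (hypothetical helper)."""
    url = f"https://{CASTAI_HOST}{metrics_path}"
    # Prometheus renders `authorization: {type: Token, credentials: X}`
    # as the HTTP header `Authorization: Token X`.
    return urllib.request.Request(url, headers={"Authorization": f"Token {api_key}"})


def fetch_metrics(api_key: str, metrics_path: str = "/v1/metrics/prom", timeout: float = 10.0) -> str:
    """Perform the request and return the raw response body as text."""
    request = build_metrics_request(api_key, metrics_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (needs network access and a real token):
#   print(fetch_metrics("<token from step 1>")[:500])
```

An HTTP 401 or 403 error from fetch_metrics would suggest that the token is wrong or lacks access, which is easier to diagnose here than in Prometheus target status pages.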

To limit the amount of returned data, the node, workload, and allocation group metric endpoints support filtering by clusterID. See the API reference for details.

3. Specify your data source in Grafana

Open Grafana, head to the Configuration tab, and click on Data Sources.

When you select the Add data source option, you’ll see a list of all supported data sources. From here, choose Prometheus and insert all required details, including HTTP, Auth, and more.

After you specify your data source, you can go to Explore, select your data source by name, and start typing a metric name to get autocomplete suggestions.

4. Create a dashboard in Grafana

Click on the Dashboards tab in Grafana’s main menu and select the Browse option. That’s where you’ll see the button to start a new dashboard. Give it a meaningful name and set the main options.

For more information, refer to Grafana’s documentation. You can also check this list of best practices for creating dashboards.

Alternatively, you can start by importing the example dashboard JSON provided below.

5. Add and format your metrics

Now it’s time to start populating your dashboard with data.

Add a new panel and scroll down to its bottom to ensure that the data source is set to Prometheus. Then, start typing the name of the required metric in the metric browser box, and it will appear on the screen.

Common choices of metrics include the requested vs. provisioned CPUs and memory, and the monthly cost of your cluster. You can also expand the metrics presented in your dashboard by importing data in JSON files.

Use the panel on the right to specify your stat title, legend, visualization styles, and other values to help you ensure the report makes the most sense to your team.

You can then expand your dashboard with additional features, including annotations and alerts.

Example Grafana dashboard

Here’s an example dashboard displaying CAST AI data.

You can get the code here.

CAST AI metrics

Note: The cast_node_type label is deprecated. Use castai_node_lifecycle instead.

| Name | Type | Description | Action |
| --- | --- | --- | --- |
| castai_autoscaler_agent_snapshots_received_total | Counter | Total number of CAST AI Autoscaler agent snapshots received. | Check if the Agent is running in the cluster. |
| castai_autoscaler_agent_snapshots_processed_total | Counter | Total number of CAST AI Autoscaler agent snapshots processed. | Contact CAST AI support. |
| castai_cluster_total_cost_hourly | Gauge | Cluster total hourly cost. | |
| castai_cluster_compute_cost_hourly | Gauge | Cluster compute cost. Has a lifecycle dimension whose values can be summed up to a total cost: [on_demand, spot_fallback, spot]. | |
| castai_cluster_total_cost_per_cpu_hourly | Gauge | Normalized cost per CPU. | |
| castai_cluster_compute_cost_per_cpu_hourly | Gauge | Normalized cost per CPU. Has a lifecycle dimension, similar to castai_cluster_compute_cost_hourly. | |
| castai_cluster_allocatable_cpu_cores | Gauge | Cluster allocatable CPU cores. | |
| castai_cluster_allocatable_memory_bytes | Gauge | Cluster allocatable memory. | |
| castai_cluster_provisioned_cpu_cores | Gauge | Cluster provisioned CPU cores. | |
| castai_cluster_provisioned_memory_bytes | Gauge | Cluster provisioned memory. | |
| castai_cluster_requests_cpu_cores | Gauge | Cluster requested CPU cores. | |
| castai_cluster_used_cpu_cores | Gauge | Cluster used CPU cores. | |
| castai_cluster_used_memory_bytes | Gauge | Cluster used memory. | |
| castai_cluster_requests_memory_bytes | Gauge | Cluster requested memory. | |
| castai_cluster_node_count | Gauge | Cluster node count. | |
| castai_cluster_pods_count | Gauge | Cluster pod count. | |
| castai_cluster_unschedulable_pods_count | Gauge | Cluster unschedulable pod count. | |
| castai_evictor_node_target_count | Gauge | CAST AI Evictor targeted node count. | |
| castai_evictor_pod_target_count | Gauge | CAST AI Evictor targeted pod count. | |
| castai_cluster_provisioned_storage_bytes | Gauge | Cluster provisioned storage. Currently available only for GCP. | |
| castai_cluster_requests_storage_bytes | Gauge | Cluster requested storage. Currently available only for GCP. | |
| castai_cluster_storage_cost_hourly | Gauge | Cluster storage hourly cost. Currently available only for GCP. | |
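To inspect these metrics outside Prometheus, a scraped response body can be parsed directly. The following is a rough sketch that assumes the endpoints return the standard Prometheus text exposition format (which is what Prometheus scrapes); parse_exposition is a hypothetical helper, and it deliberately handles only simple sample lines.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
# One label pair inside the braces; escape sequences in values are kept as-is.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> list:
    """Return (metric_name, labels_dict, value) tuples for each sample line."""
    samples = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything that does not look like a sample
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For anything beyond quick inspection, a maintained parser such as the one in the official Prometheus Python client is likely a better choice than this regular-expression approach.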

Query examples

Cost per cluster:

sum(castai_cluster_total_cost_hourly{}) by (castai_cluster)

Compute cost of spot instances of a specific cluster:

castai_cluster_compute_cost_hourly{castai_cluster="$cluster", lifecycle="spot"}

Received snapshots count:

sum(increase(castai_autoscaler_agent_snapshots_received_total{castai_cluster="$cluster"}[5m]))

Alert on missing snapshots:

absent_over_time(castai_autoscaler_agent_snapshots_received_total{castai_cluster="$cluster"}[5m])

Count running nodes in a cluster by castai_node_lifecycle (on_demand, spot, spot_fallback):

sum(castai_cluster_node_count{castai_cluster="$cluster"}) by (castai_node_lifecycle)

Get CPU cores provisioned for spot_fallback nodes:

castai_cluster_provisioned_cpu_cores{castai_node_lifecycle="spot_fallback"}

Note: Replace $cluster with an existing castai_cluster label value.
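The same queries can be run programmatically once Prometheus has scraped the data. The sketch below uses the Prometheus HTTP API instant query endpoint (/api/v1/query) to evaluate the cost-per-cluster query. The server address http://localhost:9090 is an assumption, and the function names are hypothetical helpers for illustration.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    base = prometheus_url.rstrip("/")
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base}/api/v1/query?{query_string}"


def values_by_label(response_body: str, label: str) -> dict:
    """Map the given label's value to the sample value for an instant-vector result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample's value is a [timestamp, "value-as-string"] pair.
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


def cost_per_cluster(prometheus_url: str = "http://localhost:9090") -> dict:
    """Hourly cost keyed by castai_cluster (address is an assumed default)."""
    promql = "sum(castai_cluster_total_cost_hourly{}) by (castai_cluster)"
    with urllib.request.urlopen(instant_query_url(prometheus_url, promql), timeout=10) as resp:
        return values_by_label(resp.read().decode("utf-8"), "castai_cluster")
```

Calling cost_per_cluster() against a running Prometheus server would return a dictionary of cluster names to hourly cost, which could then feed a report or a budget check.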

Node metrics

Node metrics share these common labels: cluster_id, node_id, node_name, node_template, managed_by.

| Name | Type | Description | Note |
| --- | --- | --- | --- |
| castai_node_spot_interruption | Gauge | Number of spot interruptions over the last 60 seconds per node template | Has the managed_by="castai" label only when the node is managed by CAST AI; otherwise the label is absent |
| castai_node_requested_cpu_cores | Gauge | Requested CPU cores by node | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_node_requested_ram_bytes | Gauge | Requested RAM bytes by node | Has a lifecycle dimension |
| castai_node_provisioned_cpu_cores | Gauge | Provisioned CPU cores by node | Has a lifecycle dimension |
| castai_node_provisioned_ram_bytes | Gauge | Provisioned RAM bytes by node | Has a lifecycle dimension |
| castai_node_used_cpu_cores | Gauge | Utilized CPU cores by node | Has a lifecycle dimension |
| castai_node_used_ram_bytes | Gauge | Utilized RAM bytes by node | Has a lifecycle dimension |
| castai_node_overprovisioned_cpu_percent | Gauge | Overprovisioned percent of CPU by node | Has a lifecycle dimension |
| castai_node_overprovisioned_ram_percent | Gauge | Overprovisioned percent of RAM by node | Has a lifecycle dimension |

Possible managed_by values are: castai, karpenter, eks, gke, aks, kops, openshift

Query examples

Get average on_demand CPU overprovisioned % per node provider:

avg(castai_node_overprovisioned_cpu_percent{lifecycle="on_demand"}) by (managed_by)

Get CPU cores provisioned for default CAST AI node template and aggregated by lifecycle:

sum(castai_node_provisioned_cpu_cores{node_template="default-by-castai"}) by (lifecycle)

Get the count of node spot interruptions aggregated by node template:

sum(castai_node_spot_interruption{}) by (node_template)

Workload metrics

Workload metrics share these common labels: cluster_id, namespace, workload_name, workload_type.

| Name | Type | Description | Note |
| --- | --- | --- | --- |
| castai_workload_cost_hourly | Gauge | Workload cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_workload_pod_count | Gauge | Number of pods | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_workload_requested_cpu_cores | Gauge | Requested CPU cores | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_workload_requested_memory_bytes | Gauge | Requested memory bytes | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_workload_persistent_volume_cost_hourly | Gauge | Workload persistent volume cost per hour | |
| castai_workload_requested_persistent_volume_bytes | Gauge | Workload requested persistent volume bytes | |

Query examples

Get the cost of workloads running on spot instances for a specific cluster:

castai_workload_cost_hourly{cluster_id="$clusterId", lifecycle="spot"}

Get the average cost over 1h of workloads running on on_demand instances, aggregated by namespace, for a specific cluster:

sum(avg_over_time(castai_workload_cost_hourly{cluster_id="$clusterId", lifecycle="on_demand"}[1h])) by (namespace)
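The order of operations in the last query matters: avg_over_time first averages each workload series over the window, and only then does sum ... by (namespace) add those averages together. The sketch below reproduces that logic on in-memory samples purely for illustration; the function name and the input shape are hypothetical, not part of any CAST AI or Prometheus API.

```python
from collections import defaultdict
from statistics import fmean


def sum_of_averages_by_namespace(series: dict) -> dict:
    """Average each series over its window, then sum the averages per namespace.

    `series` maps (namespace, workload_name) to the list of hourly-cost
    samples that fall inside the window (an assumed input shape).
    """
    totals = defaultdict(float)
    for (namespace, _workload_name), samples in series.items():
        if not samples:
            # A series with no samples in the window contributes nothing,
            # matching avg_over_time returning no result for it.
            continue
        totals[namespace] += fmean(samples)
    return dict(totals)
```

Summing first and averaging afterwards would give the same number only when every series has a sample at every scrape in the window, which is why the query averages per series before aggregating.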

Allocation group metrics

| Name | Type | Description | Note |
| --- | --- | --- | --- |
| castai_allocation_group_compute_cost_hourly | Gauge | Compute cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_allocation_group_cpu_cost_hourly | Gauge | CPU cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_allocation_group_memory_cost_hourly | Gauge | Memory cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_allocation_group_requested_cpu_cores | Gauge | Requested CPU cores | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
| castai_allocation_group_requested_memory_bytes | Gauge | Requested memory bytes | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |