Metrics
Learn how to scrape CAST AI metrics and visualize them in Grafana
Cluster metrics integration
CAST AI delivers detailed metrics on your cluster utilization so that you can better understand your cloud infrastructure and ultimately reduce its cost. All metrics are scrapable, so you can scrape the data using Prometheus API and visualize it in Grafana or another tool of your choice.
As a result, you can draw on the cluster utilization and cost stats and include them effortlessly in your team’s wider cloud monitoring and reporting efforts.
This guide outlines the metrics available in CAST AI and describes the process of exporting them to Prometheus and Grafana step by step.
How to visualize CAST AI metrics in Prometheus and Grafana
Why use CAST AI with Prometheus and Grafana
The combination of Prometheus and Grafana has become a common choice for DevOps and CloudOps teams, and this is for a reason.
The first provides a powerful querying language and gathers rich metrics, while the latter transforms these into meaningful visualizations. Both Prometheus and Grafana are compatible with most data source types.
How to connect CAST AI with Prometheus and Grafana
1. Create your CAST AI API key
Enter your cluster in the CAST AI platform, click the API tab in the top menu, and generate a one-time token.
You must specify your key name and choose between a read-only or full access. Then, copy and paste it into the respective place in the above code and execute.
You can also use this key to access CAST AI API in tools like Swagger UI.
2. Call the CAST AI API
Open your Prometheus scraper config in your favorite tool and add scraping for CAST AI metrics. Our API provides separate end-points for scraping the cluster, node, workload, and allocation group-level metrics. Here is an example of how to configure scraping all of them:
scrape_configs:
- job_name: 'castai_cluster_metrics'
scrape_interval: 15s
scheme: https
static_configs:
- targets: ['api.cast.ai']
metrics_path: '/v1/metrics/prom'
authorization:
type: 'Token'
credentials: '{apiKey}'
- job_name: 'castai_node_template_metrics'
scrape_interval: 1m
scheme: https
static_configs:
- targets: ['api.cast.ai']
metrics_path: '/v1/metrics/nodes'
authorization:
type: 'Token'
credentials: '{apiKey}'
- job_name: 'castai_workload_metrics'
scrape_interval: 15s
scheme: https
static_configs:
- targets: ['api.cast.ai']
metrics_path: 'v1/metrics/workloads'
authorization:
type: 'Token'
credentials: '{apiKey}'
- job_name: 'castai_allocation_group_metrics'
scrape_interval: 15s
scheme: https
static_configs:
- targets: ['api.cast.ai']
metrics_path: 'v1/metrics/allocation-groups'
authorization:
type: 'Token'
credentials: '{apiKey}'
- job_name: 'castai_workload_network_metrics'
scrape_interval: 15s
scheme: https
static_configs:
- targets: ['api.cast.ai']
metrics_path: 'v1/metrics/workload-network'
authorization:
type: 'Token'
credentials: '{apiKey}'
Please replace {apiKey} with the token created in step 1.
To limit the amount of returned data, the Node, Workload, Allocation group and Workload network metric endpoints support filtering by clusterID(see API reference).
3. Specify your data source in Grafana
Open Grafana, head to the Configuration tab, and click on Data Sources.
When you select the Add data source option, you’ll see a list of all supported data sources. From here, choose Prometheus and insert all required details, including HTTP, Auth, and more.
After you specify your data source, you can go to Explore, select your data source by name, and start typing the metric name for auto-completing.
4. Create a dashboard in Grafana
Click on the Dashboards tab in Grafana’s main menu and select the Browse option. That’s where you’ll see the button to start a new dashboard. Give it a meaningful name and set the main options.
For more information, refer to Grafana’s documentation or check this list of best practices for creating dashboards.
You can also start by importing the Example dashboard JSON provided below.
5. Add and format your metrics
Now, it’s time to start populating your dashboard with data.
Add a new panel and scroll down to its bottom to ensure that the data source is set to Prometheus. Then, start typing the name of the required metric in the metric browser box, and it will appear on the screen.
Common metrics include the requested vs. provisioned CPUs and memory and the monthly cost of your cluster. You can also expand the metrics presented in your dashboard by importing data in JSON files.
Use the panel on the right to specify your stat title, legend, visualization styles, and other values to help you ensure the report makes the most sense to your team.
You can then expand your dashboard with additional features, including annotations and alerts.
Example Grafana dashboard
Here’s an example dashboard displaying CAST AI data.
You can get the code here.
CAST AI metrics
Note
Label
cast_node_type
is deprecated, so instead of it, please usecastai_node_lifecycle
.
Name | Type | Description | Action |
---|---|---|---|
castai_autoscaler_agent_snapshots_received_total | Counter | The CAST AI Autoscaler agent snapshots received total. | Check if the Agent is running in the cluster. |
castai_autoscaler_agent_snapshots_processed_total | Counter | The CAST AI Autoscaler agent snapshots processed total. | Contact CAST AI support. |
castai_cluster_total_cost_hourly | Gauge | Cluster total hourly cost. | |
castai_cluster_compute_cost_hourly | Gauge | Cluster compute cost. Has a lifecycle dimension which can be summed up to a total cost: [on_demand, spot_fallback, spot] . | |
castai_cluster_total_cost_per_cpu_hourly | Gauge | Normalized cost per CPU. | |
castai_cluster_compute_cost_per_cpu_hourly | Gauge | Normalized cost per CPU. Has a lifecycle dimension, similar to castai_cluster_compute_cost_hourly . | |
castai_cluster_allocatable_cpu_cores | Gauge | Cluster allocatable CPU cores. | |
castai_cluster_allocatable_memory_bytes | Gauge | Cluster allocatable memory. | |
castai_cluster_provisioned_cpu_cores | Gauge | Cluster provisioned CPU cores. | |
castai_cluster_provisioned_memory_bytes | Gauge | Cluster provisioner memory. | |
castai_cluster_requests_cpu_cores | Gauge | Cluster requested CPU cores. | |
castai_cluster_used_cpu_cores | Gauge | Cluster used CPU cores. | |
castai_cluster_used_memory_bytes | Gauge | Cluster used memory. | |
castai_cluster_requests_memory_bytes | Gauge | Cluster requested memory. | |
castai_cluster_node_count | Gauge | Cluster nodes count. | |
castai_cluster_pods_count | Gauge | Cluster pods count. | |
castai_cluster_unschedulable_pods_count | Gauge | Cluster unschedulable pods count. | |
castai_evictor_node_target_count | Gauge | CAST AI Evictor targeted nodes count. | |
castai_evictor_pod_target_count | Gauge | CAST AI Evictor targeted pods count. | |
castai_cluster_provisioned_storage_bytes | Gauge | Cluster provisioned storage. Currently available only for GCP. | |
castai_cluster_requests_storage_bytes | Gauge | Cluster requested storage. Currently available only for GCP. | |
castai_cluster_storage_cost_hourly | Gauge | Cluster storage hourly cost. Currently available only for GCP. |
Query examples
Cost per cluster:
sum(castai_cluster_total_cost_hourly{}) by (castai_cluster)
Compute cost of spot instances of a specific cluster:
castai_cluster_compute_cost_hourly{castai_cluster="$cluster", lifecycle="spot"}
Received snapshots count:
sum(increase(castai_autoscaler_agent_snapshots_received_total{castai_cluster="$cluster"}[5m]))
Alert on missing snapshots:
absent_over_time(castai_autoscaler_agent_snapshots_received_total{castai_cluster="$cluster"}[5m])
Get castai_node_lifecycle
(on_demand
, spot
, spot_fallback
) of running nodes in the cluster:
sum(castai_cluster_node_count{castai_cluster="$cluster"}) by (castai_node_lifecycle)
Get CPU cores provisioned for spot_fallback
nodes:
castai_cluster_provisioned_cpu_cores{castai_node_lifecycle="spot_fallback"}
Note
Replace
$cluster
with existingcastai_cluster
label value.
Node metrics
Node metrics have common labels: cluster_id, node_id, node_name, node_template, managed_by
Name | Type | Description | Note |
---|---|---|---|
castai_node_spot_interruption | Gauge | Number of spot interruptions over last 60 seconds per node template | will have managed_by="castai" label only when CAST AI manages node, otherwise, this label will be absent |
castai_node_requested_cpu_cores | Gauge | Requested CPU cores by node | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_node_requested_ram_bytes | Gauge | Requested RAM bytes by node | Has a lifecycle dimension |
castai_node_provisioned_cpu_cores | Gauge | Provisioned CPU cores by node | Has a lifecycle dimension |
castai_node_provisioned_ram_bytes | Gauge | Provisioned RAM bytes by node | Has a lifecycle dimension |
castai_node_used_cpu_cores | Gauge | Utilized CPU cores by node | Has a lifecycle dimension |
castai_node_used_ram_bytes | Gauge | Utilized RAM bytes by node | Has a lifecycle dimension |
castai_node_cost_cpu_hourly | Gauge | Hourly cost of a single CPU core by node | Has a lifecycle dimension |
castai_node_cost_ram_gib_hourly | Gauge | Hourly cost of one GiB RAM by node | Has a lifecycle dimension |
castai_node_overprovisioned_cpu_percent | Gauge | Overprovisioned percent of CPU by node | Has a lifecycle dimension |
castai_node_overprovisioned_ram_percent | Gauge | Overprovisioned percent of RAM by node | Has a lifecycle dimension |
Possible managed_by
values are: castai
, karpenter
, eks
, gke
, aks
, kops
, openshift
Query example
Get average on_demand
CPU overprovisioned % per node provider:
avg(castai_node_overprovisioned_cpu_percent{lifecycle="on_demand"}) by (managed_by)
Get CPU cores provisioned for default CAST AI node template and aggregated by lifecycle:
sum(castai_node_provisioned_cpu_cores{node_template="default-by-castai"}) by (lifecycle)
Get a count of node spot interruption aggregated by node templates
sum(castai_node_spot_interruption{}) by (node_template)
Workload metrics
By default, when using the v1/metrics/workloads
endpoint, it returns common labels such as cluster_id, namespace, workload_name, workload_type
. However, you can also specify additional labels to be returned by setting workload labels in the additionalLabels
query parameter.
Note
There is a special
workload_optimization_enabled
label, which can be specified in theadditionalLabels
query parameter. This will return a label indicating whether optimization is enabled for a particular workload.
Name | Type | Description | Note |
---|---|---|---|
castai_workload_cost_hourly | Gauge | Workload cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_workload_pod_count | Gauge | Number of pods | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_workload_requested_cpu_cores | Gauge | Requested CPU cores | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_workload_requested_memory_bytes | Gauge | Requested memory bytes | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_workload_persistent_volume_cost_hourly | Gauge | Workload persistent volume cost per hour | |
castai_workload_requested_persistent_volume_bytes | Gauge | Workload requested persistent volume bytes |
Query example
Get workload costs running on spot
instances for a specific cluster
castai_workload_cost_hourly{cluster_id="$clusterId", lifecycle="spot"}
Get average workload costs running on on_demand
instances over 1h aggregated by namespace for specific cluster:
sum(avg_over_time(castai_workload_cost_hourly{cluster_id="$clusterId", lifecycle="on_demand"}[1h])) by (namespace)
Allocation group metrics
Name | Type | Description | Note |
---|---|---|---|
castai_allocation_group_compute_cost_hourly | Gauge | Compute cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_allocation_group_cpu_cost_hourly | Gauge | CPU cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_allocation_group_memory_cost_hourly | Gauge | Memory cost per hour | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_allocation_group_requested_cpu_cores | Gauge | Requested CPU cores | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
castai_allocation_group_requested_memory_bytes | Gauge | Requested memory bytes | Has a lifecycle dimension: [on_demand, spot_fallback, spot] |
Workload network metrics
Name | Type | Description | Note |
---|---|---|---|
castai_workload_network_bytes | Gauge | Workload network bytes | |
castai_workload_to_workload_traffic_bytes | Gauge | Workload to workload traffic bytes |
Updated 18 days ago