GPU metrics exporter
GPU Metrics Exporter is a tool designed to collect GPU metrics from DCGM Exporter instances and forward them to Cast AI. This allows for efficient monitoring and optimization of GPU resources in your Kubernetes cluster.
The tool is open-source and can be found on our GitHub.
How it works
GPU Metrics Exporter is installed as a DaemonSet in your Kubernetes cluster. It collects GPU utilization metrics using NVIDIA's DCGM Exporter. The DaemonSet runs only on nodes that have a GPU attached.
The installation process automatically chooses one of three modes based on your cluster's configuration:

If `dcgm-exporter` is already present on the node:
- GPU Metrics Exporter (`gpu-metrics-exporter`) uses the existing `dcgm-exporter` installation.
- We deploy it as a DaemonSet with a single container.

If `dcgm-exporter` is not present, but `nv-hostengine` is:
- We deploy a DaemonSet with two containers, `dcgm-exporter` and `gpu-metrics-exporter`, and configure `dcgm-exporter` to use the existing `nv-hostengine`.

If neither `dcgm-exporter` nor `nv-hostengine` is present:
- We deploy a DaemonSet with two containers: `dcgm-exporter` (with `nv-hostengine` running embedded) and `gpu-metrics-exporter`.
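The two-container modes roughly correspond to a pod spec along these lines (a sketch only; image references are examples, and the actual manifest is rendered by the Helm chart):

```yaml
# Illustrative sketch -- the real DaemonSet is rendered by the Helm chart
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-metrics-exporter
spec:
  selector:
    matchLabels:
      app: gpu-metrics-exporter
  template:
    metadata:
      labels:
        app: gpu-metrics-exporter
    spec:
      containers:
        - name: dcgm-exporter          # in the third mode, runs nv-hostengine embedded
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # example image reference
        - name: gpu-metrics-exporter   # scrapes dcgm-exporter and forwards metrics to Cast AI
          image: castai/gpu-metrics-exporter:latest        # example image reference
```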
Once operational, GPU Metrics Exporter continuously collects GPU usage data and sends it to Cast AI. This data forms the basis for our platform's GPU resource optimization and cost management features, helping you maximize the efficiency of your GPU-equipped Kubernetes clusters.
Tip: We recommend caching the DCGM exporter image in your cloud provider's container registry and configuring `gpu-metrics-exporter` to pull the image from there. This can help avoid unnecessary data transfer costs.
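For example, after mirroring the image to your own registry, an override along these lines could point the chart at it (the `dcgmExporter.image` key is an assumption; check the chart's default `values.yaml` for the exact structure):

```yaml
# Hypothetical override -- verify key names against the chart's default values.yaml
dcgmExporter:
  image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/dcgm-exporter:latest"   # example mirrored image
```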
Installation
To install GPU Metrics Exporter, use Helm as described below.
Using Helm
You can install GPU Metrics Exporter using Helm. There are two ways to get the chart:
Option 1: Add the Cast AI repository
- Add the Cast AI Helm repository:

```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
```

- Pull and unpack the chart:

```shell
helm pull castai-helm/gpu-metrics-exporter --untar
cd gpu-metrics-exporter
```

- Install the chart:

```shell
helm install --generate-name castai-helm/gpu-metrics-exporter -n castai-agent -f values.yaml
```

Option 2: Clone the repository
- Navigate to the chart directory and install the chart:

```shell
cd charts/gpu-metrics-exporter
helm upgrade -i gpu-metrics-exporter -n castai-agent -f values.yaml .
```
Note: Make sure to set the correct values for `castai.clusterId`, `castai.apiUrl`, and `castai.apiKey` (or `castai.apiKeySecretRef` if you prefer to provide the API key via your own secret) in `values.yaml`.
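A minimal `values.yaml` sketch with placeholder values (the `apiUrl` shown is an assumption; use the endpoint provided for your account):

```yaml
castai:
  clusterId: "<your-cluster-id>"
  apiUrl: "https://api.cast.ai"    # assumption: verify the endpoint for your account
  apiKey: "<your-api-key>"
  # Alternatively, omit apiKey and reference your own secret:
  # apiKeySecretRef: "<name-of-your-secret>"
```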
Via Cast AI console
Alternatively, once you have onboarded your cluster, follow the instructions provided in the Cast AI console. Copy the generated script and execute it in your terminal or cloud shell.
Configuration
By default, GPU Metrics Exporter is deployed as a sidecar to `dcgm-exporter`. You can customize the deployment by modifying the `values.yaml` file.
Deploy as a standalone service
- Set `dcgmExporter.enabled` to `false`.
- Configure `DCGM_HOST` and `DCGM_LABELS` in `gpuMetricsExporter.config` of the `values.yaml` file.
  - `DCGM_HOST` is the address of the DCGM exporter instance.
  - `DCGM_LABELS` is a comma-separated list of labels that the DCGM instances have.
Note: If you need to customize the configuration and install `gpu-metrics-exporter` as a standalone service, please contact Cast AI support for guidance, as you will need to configure the API endpoint, a token to send the metrics, and a `dcgm-exporter` port for scraping metrics.
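Put together, a standalone configuration in `values.yaml` might look like this (the host address and label list are illustrative examples, not defaults):

```yaml
dcgmExporter:
  enabled: false    # do not deploy dcgm-exporter as a sidecar
gpuMetricsExporter:
  config:
    DCGM_HOST: "dcgm-exporter.monitoring.svc.cluster.local"   # example address of your DCGM exporter
    DCGM_LABELS: "app=dcgm-exporter,release=stable"           # example labels on your DCGM instances
```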
Use an existing nv-hostengine
If you want to deploy `dcgm-exporter` but have it configured to read metrics from an existing `nv-hostengine`, do the following:
- Set `dcgmExporter.useExternalHostEngine` to `true`.
- Make sure `dcgm-exporter` can connect to port `5555` on the node, as it will attempt to do so.
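In `values.yaml`, this is a single setting:

```yaml
dcgmExporter:
  useExternalHostEngine: true   # dcgm-exporter connects to the node's nv-hostengine on port 5555
```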
Scraped metrics
The GPU Metrics Exporter collects the following metrics from DCGM:
| Exposed metric | Description |
|---|---|
| DCGM_FI_PROF_SM_ACTIVE | SM (Streaming Multiprocessor) activity |
| DCGM_FI_PROF_SM_OCCUPANCY | SM occupancy |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor core activity |
| DCGM_FI_PROF_DRAM_ACTIVE | Device memory (DRAM) activity |
| DCGM_FI_PROF_PCIE_TX_BYTES | PCIe transmitted bytes |
| DCGM_FI_PROF_PCIE_RX_BYTES | PCIe received bytes |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Graphics engine activity |
| DCGM_FI_DEV_FB_TOTAL | Total frame buffer memory |
| DCGM_FI_DEV_FB_FREE | Free frame buffer memory |
| DCGM_FI_DEV_FB_USED | Used frame buffer memory |
| DCGM_FI_DEV_PCIE_LINK_GEN | PCIe link generation |
| DCGM_FI_DEV_PCIE_LINK_WIDTH | PCIe link width |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature |
| DCGM_FI_DEV_MEMORY_TEMP | Memory temperature |
| DCGM_FI_DEV_POWER_USAGE | Power usage |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | Ratio of cycles the fp64 pipe is active (in %) |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | Ratio of cycles the fp32 pipe is active (in %) |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | Ratio of cycles the fp16 pipe is active (in %); does not include HMMA |
| DCGM_FI_PROF_PIPE_INT_ACTIVE | Ratio of cycles the integer pipe is active |
When exported, each metric described above is enriched with additional information by Cast AI to generate insightful reports and provide various optimization features.
