GPU metrics exporter

GPU Metrics Exporter is a tool designed to collect GPU metrics from DCGM Exporter instances and forward them to CAST AI. This allows for efficient monitoring and optimization of GPU resources in your Kubernetes cluster.

The tool is open-source and can be found on our GitHub.

How it works

GPU Metrics Exporter is installed as a DaemonSet in your Kubernetes cluster. It collects GPU utilization metrics using NVIDIA's DCGM Exporter. The DaemonSet runs only on nodes that have a GPU attached.

The installation process is automated and selects one of three modes based on your cluster's configuration:

If dcgm-exporter is already present on the node:

  • GPU Metrics Exporter (gpu-metrics-exporter) will use the existing dcgm-exporter installation.
  • We deploy it as a DaemonSet with a single container.

If dcgm-exporter is not present, but nv-hostengine is:

  • We deploy a DaemonSet with two containers, dcgm-exporter and gpu-metrics-exporter, and configure dcgm-exporter to use the existing nv-hostengine.

If neither dcgm-exporter nor nv-hostengine is present:

  • We deploy a DaemonSet with two containers: dcgm-exporter (with nv-hostengine running embedded) and gpu-metrics-exporter.
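
The three cases above amount to a simple decision order, which can be sketched as a small shell function. This is only an illustration of the selection described in this section, not the installer's actual logic; the function name choose_mode and the mode strings it prints are hypothetical.

```shell
# Illustrative only: mirrors the three installation cases described above.
# choose_mode and the mode strings are hypothetical names, not part of the chart.
# $1: "true" if dcgm-exporter is already present on the node
# $2: "true" if nv-hostengine is already present on the node
choose_mode() {
  if [ "$1" = "true" ]; then
    # Single container that uses the existing dcgm-exporter.
    echo "existing-dcgm-exporter"
  elif [ "$2" = "true" ]; then
    # Two containers; dcgm-exporter connects to the existing nv-hostengine.
    echo "existing-hostengine"
  else
    # Two containers; dcgm-exporter runs nv-hostengine embedded.
    echo "embedded-hostengine"
  fi
}
```

The check for an existing dcgm-exporter comes first, so a node that already has both components falls into the single-container case.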

Once operational, GPU Metrics Exporter continuously collects GPU usage data and sends it to CAST AI. This data forms the basis for our platform's GPU resource optimization and cost management features, helping you maximize the efficiency of your GPU-equipped Kubernetes clusters.

Installation

To install GPU Metrics Exporter, use Helm as described below or follow the instructions in the CAST AI console.

Using Helm

You can install GPU Metrics Exporter using Helm. There are two ways to get the chart:

Option 1: Add the CAST AI repository

  1. Add the CAST AI Helm repository:
helm repo add castai https://castai.github.io/charts
helm repo update
  2. Pull and unpack the chart:
helm pull castai/gpu-metrics-exporter --untar
cd gpu-metrics-exporter
  3. Install the chart:
helm install --generate-name castai/gpu-metrics-exporter -f values.yaml -f values-<k8s-provider>.yaml

Option 2: Clone the repository

  1. Clone the repository, then navigate to the chart directory and install the chart:
cd charts/gpu-metrics-exporter
helm install <deployment-name> -f values.yaml -f values-<k8s-provider>.yaml .

Replace <deployment-name> with a name of your choice and <k8s-provider> with your Kubernetes provider (e.g., EKS, GKE, AKS). The provider-specific values file sets the node affinity so that the DaemonSet runs only on nodes with GPUs.

Via CAST AI console

Alternatively, once you have onboarded your cluster, follow the instructions provided in the CAST AI console. Copy the generated script and execute it in your terminal or cloud shell.

Configuration

By default, GPU Metrics Exporter is deployed as a sidecar to the dcgm-exporter. You can customize the deployment by modifying the values.yaml file.

Deploy as a standalone service

  1. Set dcgmExporter.enabled to false.
  2. Configure DCGM_HOST and DCGM_LABELS under gpuMetricsExporter.config in the values.yaml file:
  • DCGM_HOST is the address of the DCGM exporter instance.
  • DCGM_LABELS is a comma-separated list of labels that the DCGM instances have.

Note

To install gpu-metrics-exporter as a standalone service with a customized configuration, contact CAST AI support for guidance. You will need to configure the API endpoint, a token for sending the metrics, and the dcgm-exporter port to scrape metrics from.
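
As a rough illustration of the standalone settings, the sketch below prints a values override containing only the keys named in this section. It assumes that the dotted paths (dcgmExporter.enabled, gpuMetricsExporter.config) map to nested YAML keys in values.yaml; the function name standalone_values is hypothetical, and the API endpoint, token, and port settings mentioned in the note are not covered.

```shell
# Print a values override for standalone mode (a sketch, not a complete config).
# standalone_values is a hypothetical helper name.
# Assumption: the dotted keys in the text map to nested YAML keys as shown.
# $1: address of the DCGM exporter instance (DCGM_HOST)
# $2: comma-separated list of labels the DCGM instances have (DCGM_LABELS)
standalone_values() {
  cat <<EOF
dcgmExporter:
  enabled: false
gpuMetricsExporter:
  config:
    DCGM_HOST: "$1"
    DCGM_LABELS: "$2"
EOF
}
```

You could redirect the output to a file, for example standalone_values "<dcgm-exporter-address>" "<label-1>,<label-2>" > values-standalone.yaml, and pass that file to helm install with an additional -f flag.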

Use an existing nv-hostengine

To deploy dcgm-exporter and have it read metrics from an existing nv-hostengine, do the following:

  1. Set dcgmExporter.useExternalHostEngine to true.
  2. Make sure dcgm-exporter can reach port 5555 on the node, because it will try to connect to nv-hostengine there.
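
One way to check the connectivity requirement in step 2 is a quick TCP probe. The sketch below assumes bash and the coreutils timeout command are available; the function name hostengine_reachable is hypothetical. Run it from a location with the same network view as the dcgm-exporter container, otherwise the result may not reflect what the exporter sees.

```shell
# Return success if a TCP connection to host:port opens within 2 seconds.
# hostengine_reachable is a hypothetical helper name.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command.
# $1: node address
# $2: port, defaulting to 5555 as stated in step 2
hostengine_reachable() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-5555}" 2>/dev/null
}

# Example call (replace the placeholder with your node's address):
#   hostengine_reachable "<node-ip>" && echo "reachable" || echo "unreachable"
```

A successful probe only shows that the port accepts connections; it does not confirm that the listener is nv-hostengine.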

Scraped metrics

The GPU Metrics Exporter collects the following metrics from DCGM:

Exposed metric                   Description
-------------------------------  --------------------------------------
DCGM_FI_PROF_SM_ACTIVE           SM (Streaming Multiprocessor) activity
DCGM_FI_PROF_SM_OCCUPANCY        SM occupancy
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE  Tensor core activity
DCGM_FI_PROF_DRAM_ACTIVE         Device memory (DRAM) activity
DCGM_FI_PROF_PCIE_TX_BYTES       PCIe transmitted bytes
DCGM_FI_PROF_PCIE_RX_BYTES       PCIe received bytes
DCGM_FI_PROF_GR_ENGINE_ACTIVE    Graphics engine activity
DCGM_FI_DEV_FB_TOTAL             Total frame buffer memory
DCGM_FI_DEV_FB_FREE              Free frame buffer memory
DCGM_FI_DEV_FB_USED              Used frame buffer memory
DCGM_FI_DEV_PCIE_LINK_GEN        PCIe link generation
DCGM_FI_DEV_PCIE_LINK_WIDTH      PCIe link width
DCGM_FI_DEV_GPU_TEMP             GPU temperature
DCGM_FI_DEV_MEMORY_TEMP          Memory temperature
DCGM_FI_DEV_POWER_USAGE          Power usage
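
If you want to confirm that a dcgm-exporter instance actually exposes these metrics, a small filter over its output can list the ones that are absent. The sketch below assumes the exporter's output is in Prometheus text format (sample lines beginning with the metric name, followed by "{" or a space); the variable EXPECTED_METRICS and the function missing_metrics are hypothetical names.

```shell
# Metric names copied from the list above.
EXPECTED_METRICS="DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_SM_OCCUPANCY
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_PCIE_LINK_GEN
DCGM_FI_DEV_PCIE_LINK_WIDTH
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_POWER_USAGE"

# Read Prometheus text-format output on stdin and print every expected
# metric that has no sample line. Comment lines ("# HELP", "# TYPE") are
# ignored because they do not start with the metric name.
missing_metrics() {
  input="$(cat)"
  for metric in $EXPECTED_METRICS; do
    if ! printf '%s\n' "$input" | grep -Eq "^${metric}(\{| )"; then
      echo "$metric"
    fi
  done
}
```

For example, you could pipe the exporter's metrics output into it, such as curl -s http://<dcgm-exporter-address>/metrics | missing_metrics, where the address and the /metrics path are assumptions about your setup. No output means every metric in the list was found.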

When exported, each metric described above is enriched with additional information by CAST AI to generate insightful reports and provide various optimization features.