GPU metrics exporter
GPU Metrics Exporter is a tool designed to collect GPU metrics from DCGM Exporter instances and forward them to Cast AI. This allows for efficient monitoring and optimization of GPU resources in your Kubernetes cluster.
The tool is open-source and can be found on our GitHub.
How it works
GPU Metrics Exporter is installed as a DaemonSet in your Kubernetes cluster. It collects GPU utilization metrics using NVIDIA's DCGM Exporter. The DaemonSet runs only on nodes that have a GPU attached.
The installation process automatically chooses one of three modes based on your cluster's configuration:

If `dcgm-exporter` is already present on the node:
- GPU Metrics Exporter (`gpu-metrics-exporter`) uses the existing `dcgm-exporter` installation.
- We deploy it as a DaemonSet with a single container.

If `dcgm-exporter` is not present, but `nv-hostengine` is:
- We deploy a DaemonSet with two containers, `dcgm-exporter` and `gpu-metrics-exporter`, and configure `dcgm-exporter` to use the existing `nv-hostengine`.

If neither `dcgm-exporter` nor `nv-hostengine` is present:
- We deploy a DaemonSet with two containers: `dcgm-exporter` (with `nv-hostengine` running embedded) and `gpu-metrics-exporter`.
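The two-container modes roughly correspond to a pod spec along these lines (a sketch only; image references are examples, and the actual manifest is rendered by the Helm chart):

```yaml
# Illustrative sketch -- the real DaemonSet is rendered by the Helm chart
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-metrics-exporter
spec:
  selector:
    matchLabels:
      app: gpu-metrics-exporter
  template:
    metadata:
      labels:
        app: gpu-metrics-exporter
    spec:
      containers:
        - name: dcgm-exporter          # in the third mode, runs nv-hostengine embedded
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # example image reference
        - name: gpu-metrics-exporter   # scrapes dcgm-exporter and forwards metrics to Cast AI
          image: castai/gpu-metrics-exporter:latest        # example image reference
```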
Once operational, GPU Metrics Exporter continuously collects GPU usage data and sends it to Cast AI. This data forms the basis for our platform's GPU resource optimization and cost management features, helping you maximize the efficiency of your GPU-equipped Kubernetes clusters.
Tip: We recommend caching the DCGM exporter image in your cloud provider's container registry and configuring `gpu-metrics-exporter` to pull the image from there. This can help avoid unnecessary data transfer costs.
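For example, after mirroring the image to your own registry, an override along these lines could point the chart at it (the `dcgmExporter.image` key is an assumption; check the chart's default `values.yaml` for the exact structure):

```yaml
# Hypothetical override -- verify key names against the chart's default values.yaml
dcgmExporter:
  image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/dcgm-exporter:latest"   # example mirrored image
```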
Installation
To install GPU Metrics Exporter, use Helm as described below.
Using Helm
You can install GPU Metrics Exporter using Helm. There are two ways to get the chart:
Option 1: Add the Cast AI repository
- Add the Cast AI Helm repository:

```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
```

- Pull and unpack the chart:

```shell
helm pull castai-helm/gpu-metrics-exporter --untar
cd gpu-metrics-exporter
```

- Install the chart:

```shell
helm install --generate-name castai-helm/gpu-metrics-exporter -n castai-agent -f values.yaml
```

Option 2: Clone the repository
- Navigate to the chart directory and install the chart:

```shell
cd charts/gpu-metrics-exporter
helm upgrade -i gpu-metrics-exporter -n castai-agent -f values.yaml .
```
Note: Make sure to set the correct values for `castai.clusterId`, `castai.apiUrl`, and `castai.apiKey` (or `castai.apiKeySecretRef` if you prefer to provide the API key via your own secret) in `values.yaml`.
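A minimal `values.yaml` sketch with placeholder values (the `apiUrl` shown is an assumption; use the endpoint provided for your account):

```yaml
castai:
  clusterId: "<your-cluster-id>"
  apiUrl: "https://api.cast.ai"    # assumption: verify the endpoint for your account
  apiKey: "<your-api-key>"
  # Alternatively, omit apiKey and reference your own secret:
  # apiKeySecretRef: "<name-of-your-secret>"
```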
Via Cast AI console
Alternatively, once you have onboarded your cluster, follow the instructions provided in the Cast AI console. Copy the generated script and execute it in your terminal or cloud shell.
Configuration
By default, GPU Metrics Exporter is deployed as a sidecar to `dcgm-exporter`. You can customize the deployment by modifying the `values.yaml` file.
Deploy as a standalone service
- Set `dcgmExporter.enabled` to `false`.
- Configure `DCGM_HOST` and `DCGM_LABELS` in `gpuMetricsExporter.config` of the `values.yaml` file.
  - `DCGM_HOST` is the address of the DCGM exporter instance.
  - `DCGM_LABELS` is a comma-separated list of labels that the DCGM instances have.
Note: If you need to customize the configuration and install `gpu-metrics-exporter` as a standalone service, please contact Cast AI support for guidance, as you will need to configure the API endpoint, a token to send the metrics, and a `dcgm-exporter` port for scraping metrics.
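Put together, a standalone configuration in `values.yaml` might look like this (the host address and label list are illustrative examples, not defaults):

```yaml
dcgmExporter:
  enabled: false    # do not deploy dcgm-exporter as a sidecar
gpuMetricsExporter:
  config:
    DCGM_HOST: "dcgm-exporter.monitoring.svc.cluster.local"   # example address of your DCGM exporter
    DCGM_LABELS: "app=dcgm-exporter,release=stable"           # example labels on your DCGM instances
```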
Use an existing nv-hostengine
If you want to deploy `dcgm-exporter` but have it configured to read metrics from an existing `nv-hostengine`, do the following:
- Set `dcgmExporter.useExternalHostEngine` to `true`.
- Make sure `dcgm-exporter` can connect to port `5555` on the node, as it will attempt to do so.
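In `values.yaml`, this is a single setting:

```yaml
dcgmExporter:
  useExternalHostEngine: true   # dcgm-exporter connects to the node's nv-hostengine on port 5555
```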
Scraped metrics
The GPU Metrics Exporter collects the following metrics from DCGM:
| Exposed metric | Description |
|---|---|
| DCGM_FI_PROF_SM_ACTIVE | SM (Streaming Multiprocessor) activity |
| DCGM_FI_PROF_SM_OCCUPANCY | SM occupancy |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor core activity |
| DCGM_FI_PROF_DRAM_ACTIVE | Device memory (DRAM) activity |
| DCGM_FI_PROF_PCIE_TX_BYTES | PCIe transmitted bytes |
| DCGM_FI_PROF_PCIE_RX_BYTES | PCIe received bytes |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Graphics engine activity |
| DCGM_FI_DEV_FB_TOTAL | Total frame buffer memory |
| DCGM_FI_DEV_FB_FREE | Free frame buffer memory |
| DCGM_FI_DEV_FB_USED | Used frame buffer memory |
| DCGM_FI_DEV_PCIE_LINK_GEN | PCIe link generation |
| DCGM_FI_DEV_PCIE_LINK_WIDTH | PCIe link width |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature |
| DCGM_FI_DEV_MEMORY_TEMP | Memory temperature |
| DCGM_FI_DEV_POWER_USAGE | Power usage |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | Ratio of cycles the fp64 pipe is active (in %) |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | Ratio of cycles the fp32 pipe is active (in %) |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | Ratio of cycles the fp16 pipe is active (in %); does not include HMMA |
| DCGM_FI_PROF_PIPE_INT_ACTIVE | Ratio of cycles the integer pipe is active |
When exported, each metric described above is enriched with additional information by Cast AI to generate insightful reports and provide various optimization features.
