GPU metrics exporter
GPU Metrics Exporter is a tool designed to collect GPU metrics from DCGM Exporter instances and forward them to CAST AI. This allows for efficient monitoring and optimization of GPU resources in your Kubernetes cluster.
The tool is open-source and can be found on our GitHub.
How it works
GPU Metrics Exporter is installed as a DaemonSet in your Kubernetes cluster. It collects GPU utilization metrics using NVIDIA's DCGM Exporter. The DaemonSet runs only on nodes that have a GPU attached.
The installation process is automated and chooses the appropriate mode based on your cluster's configuration. There are three possible setups:
If `dcgm-exporter` is already present on the node:
- GPU Metrics Exporter (`gpu-metrics-exporter`) will use the existing `dcgm-exporter` installation.
- We deploy it as a DaemonSet with a single container.
If `dcgm-exporter` is not present, but `nv-hostengine` is:
- We deploy a DaemonSet with two containers, `dcgm-exporter` and `gpu-metrics-exporter`, and configure `dcgm-exporter` to use the existing `nv-hostengine`.
If neither `dcgm-exporter` nor `nv-hostengine` is present:
- We deploy a DaemonSet with two containers: `dcgm-exporter` (with `nv-hostengine` running embedded) and `gpu-metrics-exporter`.
Once operational, GPU Metrics Exporter continuously collects GPU usage data and sends it to CAST AI. This data forms the basis for our platform's GPU resource optimization and cost management features, helping you maximize the efficiency of your GPU-equipped Kubernetes clusters.
Tip
We recommend caching the DCGM exporter image in your cloud provider's container registry and configuring `gpu-metrics-exporter` to pull the image from there. This can help avoid unnecessary data transfer costs.
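As an example, if the chart exposes the usual Helm image override keys, the override could look like the sketch below. The key names here are assumptions; check the chart's values.yaml for the authoritative structure.

```yaml
# Hypothetical values.yaml override — verify key names against the chart's values.yaml
dcgmExporter:
  image:
    # A copy of the DCGM exporter image cached in your own registry
    repository: <your-registry>/dcgm-exporter
    tag: <cached-image-tag>
```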
Installation
You can install the GPU Metrics Exporter with Helm, as described below, or via the CAST AI console.
Using Helm
You can install GPU Metrics Exporter using Helm. There are two ways to get the chart:
Option 1: Add the CAST AI repository
- Add the CAST AI Helm repository:
```sh
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
```
- Pull and unpack the chart:
```sh
helm pull castai-helm/gpu-metrics-exporter --untar
cd gpu-metrics-exporter
```
- Install the chart:
```sh
helm install --generate-name castai-helm/gpu-metrics-exporter -f values.yaml -f values-<k8s-provider>.yaml
```
Option 2: Clone the repository
- Clone the gpu-metrics-exporter repository from the CAST AI GitHub, then navigate to the chart directory and install the chart:
```sh
cd charts/gpu-metrics-exporter
helm install <deployment-name> . -f values.yaml -f values-<k8s-provider>.yaml
```
Replace `<deployment-name>` with a name of your choice and `<k8s-provider>` with your Kubernetes provider (e.g., EKS, GKE, AKS). The latter selects the values file that sets the proper node affinity, so the DaemonSet only runs on nodes with GPUs.
Via CAST AI console
Alternatively, once you have onboarded your cluster, follow the instructions provided in the CAST AI console. Copy the generated script and execute it in your terminal or cloud shell.
Configuration
By default, GPU Metrics Exporter is deployed as a sidecar to the `dcgm-exporter`. You can customize the deployment by modifying the `values.yaml` file.
Deploy as a standalone service
- Set `dcgmExporter.enabled` to `false`.
- Configure `DCGM_HOST` and `DCGM_LABELS` in `gpuMetricsExporter.config` of the `values.yaml` file, as shown in the sketch below.
`DCGM_HOST` is the address of the DCGM exporter instance. `DCGM_LABELS` is a comma-separated list of labels that the DCGM instances have.
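Putting the two steps together, a standalone-mode values.yaml could look like the sketch below. The nesting mirrors the key paths named above, and the `DCGM_HOST` value is a hypothetical example; confirm both against the chart's values.yaml.

```yaml
# Standalone mode: disable the bundled dcgm-exporter sidecar
dcgmExporter:
  enabled: false

gpuMetricsExporter:
  config:
    # Address of the existing DCGM exporter instance (hypothetical example;
    # 9400 is the DCGM exporter's default port)
    DCGM_HOST: "dcgm-exporter.monitoring.svc.cluster.local:9400"
    # Comma-separated list of labels that the DCGM instances have
    DCGM_LABELS: "app=dcgm-exporter"
```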
Note
If you need to customize the configuration and install the `gpu-metrics-exporter` as a standalone service, please contact CAST AI support for guidance, as you will need to configure the API endpoint, a token for sending the metrics, and a `dcgm-exporter` port for scraping metrics.
Use an existing nv-hostengine
If you want to deploy the `dcgm-exporter` but have it configured to read the metrics from an existing `nv-hostengine`, do the following:
- Set `dcgmExporter.useExternalHostEngine` to `true`.
- Make sure `dcgm-exporter` can reach port `5555` on the node, as it will attempt to connect to it.
Scraped metrics
The GPU Metrics Exporter collects the following metrics from DCGM:
| Exposed metric | Description |
| --- | --- |
| DCGM_FI_PROF_SM_ACTIVE | SM (Streaming Multiprocessor) activity |
| DCGM_FI_PROF_SM_OCCUPANCY | SM occupancy |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor core activity |
| DCGM_FI_PROF_DRAM_ACTIVE | Device memory (DRAM) activity |
| DCGM_FI_PROF_PCIE_TX_BYTES | PCIe transmitted bytes |
| DCGM_FI_PROF_PCIE_RX_BYTES | PCIe received bytes |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Graphics engine activity |
| DCGM_FI_DEV_FB_TOTAL | Total frame buffer memory |
| DCGM_FI_DEV_FB_FREE | Free frame buffer memory |
| DCGM_FI_DEV_FB_USED | Used frame buffer memory |
| DCGM_FI_DEV_PCIE_LINK_GEN | PCIe link generation |
| DCGM_FI_DEV_PCIE_LINK_WIDTH | PCIe link width |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature |
| DCGM_FI_DEV_MEMORY_TEMP | Memory temperature |
| DCGM_FI_DEV_POWER_USAGE | Power usage |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | Ratio of cycles the fp64 pipe is active (in %) |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | Ratio of cycles the fp32 pipe is active (in %) |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | Ratio of cycles the fp16 pipe is active (in %); this does not include HMMA |
| DCGM_FI_PROF_PIPE_INT_ACTIVE | Ratio of cycles the integer pipe is active |
When exported, each metric described above is enriched with additional information by CAST AI to generate insightful reports and provide various optimization features.