How to Monitor GPU Utilization in Kubernetes with DCGM, Prometheus, Grafana, and Chamber
You can't optimize what you don't measure. Most GPU clusters run blind: teams know their GPUs are expensive, but they don't know which ones are idle right now, which jobs are memory-bound, or which nodes are approaching thermal limits.
The default Kubernetes tooling does not help. kubectl top shows CPU and memory. It says nothing about GPU utilization, tensor core activity, or XID errors. Getting real GPU telemetry requires a dedicated monitoring stack, and the Kubernetes GPU scheduling documentation confirms that GPU metrics are outside the scope of default resource reporting.
We have managed GPU infrastructure at Amazon, built monitoring systems for large-scale GPU clusters, and learned from hundreds of AI/ML teams across industries and use cases. Most start with an open-source stack: DCGM Exporter, Prometheus, and Grafana. This guide walks you through deploying that stack from zero to a working dashboard in under an hour. However, these solutions often require extensive engineering hours to maintain, customize, and extend in the long run, prompting many teams to search for alternatives. In this guide you'll also learn how Chamber provides GPU monitoring for Kubernetes out of the box, with zero setup required.
What You Need Before Starting
Before deploying GPU monitoring, verify your cluster meets these prerequisites.
| Prerequisite | Minimum Version | Notes |
|---|---|---|
| Kubernetes cluster | 1.21+ | Must have NVIDIA GPU nodes (Kubernetes GPU Scheduling Docs) |
| NVIDIA drivers | 450.80+ | Installed on all GPU nodes (NVIDIA GPU Operator Installation Guide) |
| Helm | 3.x | For chart-based installation (NVIDIA dcgm-exporter GitHub) |
| kubectl access | cluster-admin | Required for DaemonSet and ServiceMonitor creation |
For new clusters, the NVIDIA GPU Operator is the simplest path. It installs drivers, the device plugin, and DCGM Exporter together in a single Helm release (NVIDIA GPU Operator Installation Guide). For existing clusters with drivers already configured, you can install DCGM Exporter standalone.
GPU Monitoring Architecture in Kubernetes
The monitoring stack has four components, each handling one responsibility in the data pipeline.
DCGM Exporter runs as a DaemonSet on every GPU node. It connects to the NVIDIA Data Center GPU Manager (DCGM) library to collect GPU telemetry: utilization, memory, temperature, power, errors. It exposes these metrics on a /metrics endpoint in Prometheus format. DCGM Exporter also connects to the kubelet pod-resources socket to map GPU metrics to pod and namespace labels (NVIDIA dcgm-exporter GitHub). Without this mapping, you get node-level metrics but cannot attribute GPU usage to specific workloads.
Prometheus scrapes the /metrics endpoint from every DCGM Exporter pod on a configurable interval (default: 15 seconds). It stores the time-series data and provides the query engine.
Grafana connects to Prometheus as a data source and visualizes GPU metrics in dashboards. Panels show utilization trends, memory pressure, thermal status, and error counts.
Alertmanager (optional but recommended) evaluates Prometheus alerting rules and routes notifications to Slack, PagerDuty, or email when GPU health degrades.
Why DCGM over nvidia-smi? nvidia-smi is a point-in-time CLI snapshot. DCGM provides continuous streaming telemetry with profiling-grade metrics that nvidia-smi cannot access, including Tensor Core utilization (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE) and NVLink traffic (NVIDIA DCGM Documentation). For production Kubernetes monitoring, there is no substitute.
How to Install DCGM Exporter with Helm
Two installation paths, depending on your cluster state.
Option A: NVIDIA GPU Operator (Recommended for New Clusters)
The GPU Operator bundles drivers, device plugin, container runtime, and DCGM Exporter into a single managed deployment (NVIDIA GPU Operator Installation Guide).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set dcgmExporter.enabled=true \
--set dcgmExporter.serviceMonitor.enabled=true
The serviceMonitor.enabled=true flag creates a Prometheus ServiceMonitor automatically, which saves a manual configuration step.
Option B: Standalone DCGM Exporter
For clusters that already have NVIDIA drivers and the device plugin installed (NVIDIA dcgm-exporter GitHub).
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--create-namespace \
--set serviceMonitor.enabled=true
To customize which metrics DCGM Exporter collects, create a ConfigMap with your metrics file. The default configuration exposes 20+ metrics. For most clusters, the defaults are sufficient. If you need profiling metrics (Tensor Core utilization, NVLink bandwidth), enable them in the metrics ConfigMap (NVIDIA DCGM Documentation).
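As a sketch of what that ConfigMap might look like (the ConfigMap name and filename here are illustrative, though the DCGM field names are standard), each line of the metrics file lists a DCGM field, the Prometheus metric type, and a help string:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-custom-metrics    # illustrative name
  namespace: monitoring
data:
  custom-collectors.csv: |
    # Format: DCGM field, Prometheus metric type, help text
    DCGM_FI_DEV_GPU_UTIL,            gauge, GPU utilization (%).
    DCGM_FI_DEV_FB_USED,             gauge, Framebuffer memory used (MiB).
    DCGM_FI_DEV_GPU_TEMP,            gauge, GPU temperature (C).
    DCGM_FI_DEV_XID_ERRORS,          gauge, Most recent XID error code.
    # Profiling metrics (require supported data center GPUs):
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor Core pipe activity ratio.
```

Mount the ConfigMap into the DCGM Exporter pod and point the exporter at the file via its -f flag (or the equivalent Helm value for your chart version).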
Verify the Installation
Confirm DCGM Exporter pods are running on every GPU node:
kubectl get pods -n monitoring -l app.kubernetes.io/name=dcgm-exporter
# Expected: one pod per GPU node, all Running
Test that metrics are flowing by port-forwarding to any DCGM Exporter pod:
kubectl port-forward -n monitoring <dcgm-exporter-pod> 9400:9400
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
You should see lines like DCGM_FI_DEV_GPU_UTIL{gpu="0",...} 45.0. If no metrics appear, check that NVIDIA drivers are loaded on the node and that the DCGM Exporter has permission to access the GPU device files.
How to Connect Prometheus to DCGM Exporter
If you enabled serviceMonitor.enabled=true during installation and your cluster runs the kube-prometheus-stack (or any Prometheus Operator), scraping is automatic. The ServiceMonitor resource tells Prometheus where to find DCGM Exporter endpoints.
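If your Prometheus Operator does not pick up the exporter automatically, you can create the ServiceMonitor by hand. A minimal sketch, where the selector labels and the release label are assumptions that must match your actual dcgm-exporter Service and your Prometheus's serviceMonitorSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match your Prometheus's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics    # port name must match the dcgm-exporter Service
      interval: 15s
```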
For clusters using standalone Prometheus without the Operator, add a scrape config manually:
scrape_configs:
- job_name: 'dcgm-exporter'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- monitoring
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
regex: dcgm-exporter
action: keep
Add this block to your prometheus.yml and reload Prometheus.
Verify that Prometheus is scraping by navigating to the Prometheus UI targets page (/targets). The DCGM Exporter endpoints should show as UP. You can also run a test query in the Prometheus expression browser:
DCGM_FI_DEV_GPU_UTIL
This should return utilization values for every GPU in the cluster.
How to Build a GPU Monitoring Dashboard in Grafana
The fastest path to a working dashboard is importing the official NVIDIA DCGM Exporter dashboard.
In Grafana, navigate to Dashboards > Import and enter dashboard ID 12239 (Grafana DCGM Exporter Dashboard). Select your Prometheus data source and click Import. You immediately get panels for GPU utilization, memory usage, temperature, and power draw per GPU.
The default dashboard gives you node-level visibility. For production clusters, add these custom panels:
Per-namespace GPU utilization. Group by the namespace label to see which teams consume the most GPU capacity. This query shows average utilization by namespace:
avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
Fleet-wide utilization summary. A single-stat panel showing the average utilization across all GPUs in the cluster. This is the number your infrastructure leadership cares about:
avg(DCGM_FI_DEV_GPU_UTIL)
XID error timeline. A time-series panel tracking DCGM_FI_DEV_XID_ERRORS to identify nodes with recurring hardware faults. Any non-zero value warrants investigation.
A healthy dashboard shows GPU utilization between 70-95% for training workloads, temperatures below 85°C, and zero XID errors. If your fleet-wide average sits below 50%, you are leaving significant capacity on the table. For strategies to improve utilization after you have visibility, see our GPU utilization optimization guide.
Key GPU Metrics and What They Mean
DCGM exposes dozens of metrics. These six are the ones that matter for day-to-day operations (NVIDIA DCGM Documentation).
| Metric | DCGM Field | What It Measures | Healthy Range | Action If Outside Range |
|---|---|---|---|---|
| GPU utilization | DCGM_FI_DEV_GPU_UTIL | % of time SMs are active | 70-95% for training | Investigate scheduling and workload placement |
| Tensor Core utilization | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | % of time Tensor Cores are active | 60-85% for mixed precision | Verify mixed precision is enabled in training code |
| Memory utilization | DCGM_FI_DEV_FB_USED | GPU memory (VRAM) in use | Varies by model size | Right-size GPU allocation or enable MIG partitioning |
| Temperature | DCGM_FI_DEV_GPU_TEMP | GPU core temperature in °C | Below 85°C | Check cooling, airflow, and rack density |
| Power draw | DCGM_FI_DEV_POWER_USAGE | Watts consumed | Below TDP rating | Monitor for thermal throttling |
| XID errors | DCGM_FI_DEV_XID_ERRORS | Most recent NVIDIA error code | 0 (no errors) | Drain node, investigate fault (NVIDIA XID Errors Documentation) |
One distinction worth understanding: SM utilization and actual compute throughput are not the same thing. SM utilization (DCGM_FI_DEV_GPU_UTIL) tells you the GPU cores are active, but not what they are doing. A GPU running unoptimized CUDA kernels can show 90% SM utilization while its Tensor Cores sit idle. Track both SM utilization and Tensor Core utilization together to understand whether the GPU is busy and productive.
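To put this distinction on a dashboard, one approach is to plot both series on the same panel. A PromQL sketch, assuming profiling metrics are enabled (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE is reported as a 0-1 ratio, so it is scaled to percent here for a shared axis):

```promql
# Panel query A: SM utilization per GPU (already 0-100)
avg by (gpu, instance) (DCGM_FI_DEV_GPU_UTIL)

# Panel query B: Tensor Core activity per GPU, scaled from 0-1 to 0-100
avg by (gpu, instance) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE) * 100
```

A wide gap between the two lines on a training workload is the signature of "busy but not productive" kernels.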
If GPU utilization is consistently below 35%, the workload is likely CPU-bound and could run on a less expensive GPU type. If GPU memory utilization stays below 50%, the workload is overprovisioned and should be tested on a smaller GPU. These are low-hanging-fruit cost savings that monitoring reveals immediately.
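Those thresholds can be codified into a simple triage check that runs over your monitoring averages. A minimal Python sketch using the heuristics above; the function name and the 20% Tensor Core threshold are illustrative choices, not part of DCGM:

```python
def triage_gpu_workload(avg_sm_util: float, avg_mem_util: float,
                        avg_tensor_active: float) -> list[str]:
    """Flag right-sizing opportunities from monitoring averages.

    Args:
        avg_sm_util: average DCGM_FI_DEV_GPU_UTIL over the window (0-100).
        avg_mem_util: average memory utilization (0-100), e.g.
            FB_USED / (FB_USED + FB_FREE) * 100.
        avg_tensor_active: average DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
            scaled to 0-100.
    """
    findings = []
    if avg_sm_util < 35:
        findings.append("likely CPU-bound: test on a less expensive GPU type")
    if avg_mem_util < 50:
        findings.append("memory overprovisioned: test on a smaller GPU")
    # 20% is an illustrative cutoff for "Tensor Cores mostly idle".
    if avg_sm_util >= 70 and avg_tensor_active < 20:
        findings.append("busy but not productive: verify mixed precision is enabled")
    return findings or ["no obvious right-sizing opportunity"]
```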
How to Set Up GPU Alerts
Monitoring dashboards are useful when someone is watching. Alerts catch problems when nobody is.
Define Prometheus alerting rules for these critical conditions:
groups:
- name: gpu-health
rules:
- alert: GPUUtilizationLow
expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
for: 30m
labels:
severity: warning
annotations:
summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} below 20% utilization for 30+ minutes"
description: "Wasted capacity. Investigate whether the workload has stalled or the GPU should be released."
- alert: GPUTemperatureHigh
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 5m
labels:
severity: critical
annotations:
summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} above 85°C"
description: "Thermal throttling risk. Check cooling and consider draining the node."
- alert: GPUXIDError
expr: DCGM_FI_DEV_XID_ERRORS > 0
for: 1m
labels:
severity: critical
annotations:
summary: "XID error {{ $value }} on GPU {{ $labels.gpu }}, node {{ $labels.instance }}"
description: "Hardware fault detected. Drain node and investigate. See NVIDIA XID error documentation."
- alert: GPUMemoryNearFull
expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 95
for: 5m
labels:
severity: warning
annotations:
summary: "GPU {{ $labels.gpu }} memory above 95% on {{ $labels.instance }}"
description: "OOM risk. Consider reducing batch size or enabling gradient checkpointing."
Route these alerts to your team's Slack channel, PagerDuty rotation, or email via Alertmanager. GPU temperature and XID errors should page immediately. Low utilization alerts are better sent to a monitoring channel for triage during business hours.
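A sketch of the corresponding Alertmanager routing, assuming the severity labels from the rules above; the receiver names, Slack channel, and placeholder keys are illustrative:

```yaml
route:
  receiver: gpu-monitoring-channel       # default: non-urgent triage channel
  routes:
    - matchers:
        - severity = "critical"          # temperature and XID alerts page
      receiver: gpu-oncall-pagerduty
receivers:
  - name: gpu-oncall-pagerduty
    pagerduty_configs:
      - routing_key: <your-pagerduty-integration-key>
  - name: gpu-monitoring-channel
    slack_configs:
      - api_url: <your-slack-webhook-url>
        channel: "#gpu-monitoring"
```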
For deeper coverage of XID error handling and automated node remediation, see our guide to GPU fault detection in Kubernetes.
GPU Monitoring Tools Compared
DCGM + Prometheus + Grafana is the production standard for Kubernetes GPU monitoring, but it is not the only option. Here is how the main tools compare.
| Tool | Type | GPU Metrics | K8s Labels | Historical Data | Alerting | Setup Complexity |
|---|---|---|---|---|---|---|
| nvidia-smi | CLI | Basic (utilization, memory, temp) | No | No | No | Trivial |
| gpustat | CLI | Basic + per-process | No | No | No | Trivial |
| nvtop / nvitop | TUI | Detailed, interactive | No | No | No | Easy |
| DCGM + Prometheus + Grafana | Stack | Comprehensive + profiling metrics | Full (pod, namespace) | Yes | Yes | Moderate |
| Datadog GPU Monitoring | SaaS | Comprehensive | Full (pod, namespace) | Yes | Yes | Easy (paid) |
For production Kubernetes clusters, DCGM + Prometheus + Grafana is the standard (NVIDIA dcgm-exporter GitHub). It gives you full Kubernetes-aware metrics, historical data for capacity planning, and alerting for operational health.
For development machines and quick debugging, gpustat or nvitop is sufficient. These tools install in seconds and give you real-time GPU status without any infrastructure setup.
For teams that want the depth of DCGM metrics without maintaining a Prometheus stack, a managed GPU monitoring platform eliminates the operational overhead. For a broader comparison of monitoring approaches, see our GPU monitoring tools comparison.
Frequently Asked Questions
What is the difference between nvidia-smi and DCGM?
nvidia-smi is a command-line utility that provides point-in-time GPU snapshots. DCGM (Data Center GPU Manager) provides continuous monitoring with profiling-grade metrics and native Kubernetes integration. DCGM exposes metrics like Tensor Core utilization and NVLink traffic that nvidia-smi cannot access (NVIDIA DCGM Documentation).
Does DCGM Exporter work with AMD GPUs?
No. DCGM is an NVIDIA-only tool built on top of NVIDIA's management libraries (NVIDIA DCGM Documentation). AMD GPUs require ROCm-based monitoring solutions.
How much overhead does GPU monitoring add?
DCGM Exporter adds less than 1% GPU utilization impact (NVIDIA dcgm-exporter GitHub). The exporter reads telemetry from DCGM's shared memory interface without interfering with GPU workloads.
Can I monitor GPU utilization per pod in Kubernetes?
Yes. DCGM Exporter maps GPU metrics to pod and namespace labels by connecting to the kubelet pod-resources API (NVIDIA dcgm-exporter GitHub). This gives you per-workload attribution for utilization, memory, temperature, and error metrics.
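For example, a per-pod utilization query might look like the following; the label names assume the exporter's Kubernetes mapping is active, and some deployments surface them as exported_pod/exported_namespace depending on Prometheus relabeling:

```promql
# Average SM utilization per pod over the past hour
avg by (namespace, pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))
```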
What are XID errors and should I worry about them?
XID errors are NVIDIA driver-reported error codes indicating hardware or software faults. Some are informational (XID 13: graphics engine exception, often software-related). Others are critical: XID 79 (GPU has fallen off the bus) means the GPU is no longer communicating with the system and requires immediate action (NVIDIA XID Errors Documentation).
What actions can I take based on GPU monitoring metrics?
Monitoring reveals cost-saving opportunities directly. If GPU utilization is consistently below 35%, the workload is likely CPU-bound and should be tested on a less expensive GPU type. If GPU memory utilization stays below 50%, the workload is overprovisioned and should run on a smaller GPU. If utilization is high but Tensor Core activity is low, enabling mixed precision training can improve throughput without additional hardware.
How Chamber simplifies GPU monitoring
Now that we understand the open-source way to monitor GPUs, let's explore how Chamber simplifies this for AI/ML teams. Configuring, scaling, and maintaining open-source metrics and dashboards can eat up valuable engineering resources. That's why many teams opt for managed services like Chamber, which provide the full GPU observability stack out of the box without any manual setup.
At Chamber, our mission is to make it seamless to monitor GPU usage and attribute cost and wasted resources by cluster, GPU type, team, user, and workload. Chamber not only provides the monitoring to understand how to get the most ROI out of your GPUs, but also an intelligent scheduling and GPU orchestration layer for all of your Kubernetes clusters, across all clouds and on-prem.
When you deploy the Chamber agent with the NVIDIA DCGM Exporter enabled, Chamber automatically discovers your GPU resources and workloads and produces key utilization metrics. With no additional effort or customization, you immediately gain visibility across all of your clusters from a single view, with drill-down capabilities into cluster, team, user, and workloads.
Teams tell us that they lack a single unified view of their job history that they can slice and dice across any dimension to quickly find their historical and active workloads. Moreover, they tell us that creating custom dashboards for each team, plus executive-level usage dashboards, is a heavy, time-consuming process. With Chamber you don't need to spend time setting up custom dashboards; you get insights instantly without any manual effort.
See how Chamber's monitoring and debugging platform works.
Interested in learning more about how to get started with Chamber? Book a time to talk. For additional information on how the Chamber Agent works and which metrics are collected, see Chamber's agent documentation.
Key Takeaways
- The standard GPU monitoring stack for Kubernetes is DCGM Exporter + Prometheus + Grafana. You can deploy it in under an hour (NVIDIA dcgm-exporter GitHub).
- Default Kubernetes tooling does not expose GPU metrics.
kubectl top shows CPU and memory only (Kubernetes GPU Scheduling Docs).
- Track both SM utilization and Tensor Core utilization to distinguish "busy" from "productive" (NVIDIA DCGM Documentation).
- Set alerts for low utilization (wasted capacity), high temperature (throttling risk), XID errors (hardware faults), and near-full GPU memory (OOM risk).
- DCGM Exporter maps GPU metrics to Kubernetes pod and namespace labels, enabling per-workload cost attribution (NVIDIA dcgm-exporter GitHub).
- nvtop and gpustat are better choices for dev machines; DCGM + Prometheus + Grafana is the standard for production clusters (Lambda.ai GPU Guide).
- Monitoring is step one. The data tells you where to optimize scheduling, right-size allocations, and reduce GPU spend. That full lifecycle, from visibility to optimization, is what tools like Chamber aim to solve.
The Bottom Line
GPU monitoring is the prerequisite for every other optimization. Without per-workload utilization data, scheduling improvements are guesswork, capacity planning is speculation, and cost attribution is impossible.
The DCGM + Prometheus + Grafana stack covered in this guide gives you production-grade GPU observability with full Kubernetes awareness. The investment is an hour of setup time and moderate operational overhead to maintain Prometheus and Grafana.
Chamber deploys in minutes via Helm and gives you immediate cross-cluster, cross-cloud visibility into GPU utilization, health, and cost without maintaining a Prometheus stack. Start with monitoring to understand your GPU landscape, usage, and cost before optimizing scheduling.
For what to do with the monitoring data once you have it, see our GPU utilization optimization guide.
Sources
- NVIDIA DCGM Documentation. "DCGM Feature Overview." 2024.
- NVIDIA GPU Operator Installation Guide. "Getting Started with the GPU Operator." 2024.
- NVIDIA XID Errors Documentation. "XID Errors." 2024.
- Grafana DCGM Exporter Dashboard. "NVIDIA DCGM Exporter Dashboard." Dashboard ID 12239.
- NVIDIA dcgm-exporter GitHub. Helm chart and Prometheus integration reference.
- Kubernetes GPU Scheduling Documentation. "Schedule GPUs." 2024.
- Lambda Labs. "Keeping an Eye on Your GPUs." GPU monitoring tool comparison.