Enable pod-based GPU metrics in Amazon CloudWatch



In February 2022, Amazon Web Services added support for NVIDIA GPU metrics in Amazon CloudWatch, making it possible to push metrics from the Amazon CloudWatch Agent to Amazon CloudWatch and monitor your code for optimal GPU utilization. Since then, this feature has been integrated into many of our managed Amazon Machine Images (AMIs), such as the Deep Learning AMI and the AWS ParallelCluster AMI. To obtain instance-level metrics of GPU utilization, you can use Packer or the Amazon ImageBuilder to bootstrap your own custom AMI and use it in various managed service offerings like AWS Batch, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS). However, for many container-based service offerings and workloads, it is ideal to capture utilization metrics at the container, pod, or namespace level.

This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.

Solution overview

To demonstrate container-based GPU metrics, we create an EKS cluster with g5.2xlarge instances; however, this will work with any supported NVIDIA accelerated instance family.

We deploy the NVIDIA GPU operator to enable use of GPU resources and the NVIDIA DCGM Exporter to enable GPU metrics collection. Then we explore two architectures. The first one connects the metrics from the NVIDIA DCGM Exporter to CloudWatch via a CloudWatch agent, as shown in the following diagram.

GPU Monitoring Architecture with CloudWatch

The second architecture (see the following diagram) connects the metrics from the DCGM Exporter to Prometheus, then we use a Grafana dashboard to visualize these metrics.

GPU Monitoring Architecture with Grafana


To simplify reproducing the entire stack from this post, we use a container that has all of the required tooling (aws cli, eksctl, helm, etc.) already installed. In order to clone the container project from GitHub, you need git. To build and run the container, you need Docker. To deploy the architecture, you need AWS credentials. To enable access to Kubernetes services using port-forwarding, you will also need kubectl.

These prerequisites can be installed on your local machine, an EC2 instance with NICE DCV, or AWS Cloud9. In this post, we will use a c5.2xlarge Cloud9 instance with a 40 GB local storage volume. When using Cloud9, disable the AWS managed temporary credentials by visiting Cloud9 -> Preferences -> AWS Settings as shown in the screenshot below.

Build and run the aws-do-eks container

Open a terminal shell in your preferred environment and run the following commands:

git clone
cd aws-do-eks

The result is as follows:

You now have a shell in a container environment that has all of the tools needed to complete the tasks below. We will refer to it as the "aws-do-eks shell". You will be running the commands in the following sections in this shell, unless specifically instructed otherwise.

Create an EKS cluster with a node group

This group includes a GPU instance family of your choice; in this example, we use the g5.2xlarge instance type.

The aws-do-eks project comes with a collection of cluster configurations. You can set your desired cluster configuration with a single configuration change.

In the container shell, run ./ and then set CONF=conf/eksctl/yaml/eks-gpu-g5.yaml
To verify the cluster configuration, run ./

You should see the following cluster manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-g5
  version: "1.25"
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
managedNodeGroups:
  - name: sys
    instanceType: m5.xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
  - name: g5
    instanceType: g5.2xlarge
    instancePrefix: g5-2xl
    privateNetworking: true
    efaEnabled: false
    minSize: 0
    desiredCapacity: 1
    maxSize: 10
    volumeSize: 80
    iam:
      withAddonPolicies:
        cloudWatch: true
iam:
  withOIDC: true

To create the cluster, run the following command in the container:

The output is as follows:

root@e5ecb162812f:/eks# ./
/eks/impl/eksctl/yaml /eks


Mon May 22 20:50:59 UTC 2023
Creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml ...

eksctl create cluster -f /eks/conf/eksctl/yaml/eks-gpu-g5.yaml

2023-05-22 20:50:59 (ℹ) eksctl version 0.133.0
2023-05-22 20:50:59 (ℹ) using region us-east-1
2023-05-22 20:50:59 (ℹ) subnets for us-east-1a - public: private:
2023-05-22 20:50:59 (ℹ) subnets for us-east-1b - public: private:
2023-05-22 20:50:59 (ℹ) subnets for us-east-1c - public: private:
2023-05-22 20:50:59 (ℹ) subnets for us-east-1d - public: private:
2023-05-22 20:50:59 (ℹ) nodegroup "sys" will use "" (AmazonLinux2/1.25)
2023-05-22 20:50:59 (ℹ) nodegroup "g5" will use "" (AmazonLinux2/1.25)
2023-05-22 20:50:59 (ℹ) using Kubernetes version 1.25
2023-05-22 20:50:59 (ℹ) creating EKS cluster "do-eks-yaml-g5" in "us-east-1" region with managed nodes
2023-05-22 20:50:59 (ℹ) 2 nodegroups (g5, sys) were included (based on the include/exclude rules)
2023-05-22 20:50:59 (ℹ) will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2023-05-22 20:50:59 (ℹ) will create a CloudFormation stack for cluster itself and 2 managed nodegroup stack(s)
2023-05-22 20:50:59 (ℹ) if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=do-eks-yaml-g5'
2023-05-22 20:50:59 (ℹ) Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "do-eks-yaml-g5" in "us-east-1"
2023-05-22 20:50:59 (ℹ) CloudWatch logging will not be enabled for cluster "do-eks-yaml-g5" in "us-east-1"
2023-05-22 20:50:59 (ℹ) you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-east-1 --cluster=do-eks-yaml-g5'
2023-05-22 20:50:59 (ℹ)
2 sequential tasks: { create cluster control plane "do-eks-yaml-g5",
2 sequential sub-tasks: {
4 sequential sub-tasks: {
wait for control plane to become ready,
associate IAM OIDC provider,
2 sequential sub-tasks: {
create IAM role for serviceaccount "kube-system/aws-node",
create serviceaccount "kube-system/aws-node",
restart daemonset "kube-system/aws-node",
2 parallel sub-tasks: {
create managed nodegroup "sys",
create managed nodegroup "g5",
2023-05-22 20:50:59 (ℹ) building cluster stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:51:00 (ℹ) deploying stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:51:30 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:52:00 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:53:01 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:54:01 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:55:01 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:56:02 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:57:02 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:58:02 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:59:02 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:00:03 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:01:03 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:02:03 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:03:04 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:05:07 (ℹ) building iamserviceaccount stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:10 (ℹ) deploying stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:10 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:40 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:40 (ℹ) serviceaccount "kube-system/aws-node" already exists
2023-05-22 21:05:41 (ℹ) updated serviceaccount "kube-system/aws-node"
2023-05-22 21:05:41 (ℹ) daemonset "kube-system/aws-node" restarted
2023-05-22 21:05:41 (ℹ) building managed nodegroup stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:41 (ℹ) building managed nodegroup stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:05:42 (ℹ) deploying stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:42 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:42 (ℹ) deploying stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:05:42 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:06:12 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:06:12 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:06:55 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:07:11 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:08:29 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:08:45 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:09:52 (ℹ) waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:09:53 (ℹ) waiting for the control plane to become ready
2023-05-22 21:09:53 (✔) saved kubeconfig as "/root/.kube/config"
2023-05-22 21:09:53 (ℹ) 1 task: { install Nvidia device plugin }
W0522 21:09:54.155837 1668 warnings.go:70] spec.template.metadata.annotations[...]: non-functional in v1.16+; use the "priorityClassName" field instead
2023-05-22 21:09:54 (ℹ) created "kube-system:DaemonSet.apps/nvidia-device-plugin-daemonset"
2023-05-22 21:09:54 (ℹ) as you are using the EKS-Optimized Accelerated AMI with a GPU-enabled instance type, the Nvidia Kubernetes device plugin was automatically installed.
to skip installing it, use --install-nvidia-plugin=false.
2023-05-22 21:09:54 (✔) all EKS cluster resources for "do-eks-yaml-g5" have been created
2023-05-22 21:09:54 (ℹ) nodegroup "sys" has 1 node(s)
2023-05-22 21:09:54 (ℹ) node "ip-192-168-18-137.ec2.internal" is ready
2023-05-22 21:09:54 (ℹ) waiting for at least 1 node(s) to become ready in "sys"
2023-05-22 21:09:54 (ℹ) nodegroup "sys" has 1 node(s)
2023-05-22 21:09:54 (ℹ) node "ip-192-168-18-137.ec2.internal" is ready
2023-05-22 21:09:55 (ℹ) kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2023-05-22 21:09:55 (✔) EKS cluster "do-eks-yaml-g5" in "us-east-1" region is ready

Mon May 22 21:09:55 UTC 2023
Done creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml


To verify that your cluster was created successfully, run the following command:

kubectl get nodes -L

The output is similar to the following:

ip-192-168-18-137.ec2.internal Ready <none> 47m v1.25.9-eks-0a21954 m5.xlarge
ip-192-168-214-241.ec2.internal Ready <none> 46m v1.25.9-eks-0a21954 g5.2xlarge

In this example, we have one m5.xlarge and one g5.2xlarge instance in our cluster; therefore, we see two nodes listed in the preceding output.

During the cluster creation process, the NVIDIA device plugin gets installed. We need to remove it after cluster creation because we will use the NVIDIA GPU Operator instead.

Delete the plugin with the following command:

kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

We get the following output:

daemonset.apps “nvidia-device-plugin-daemonset” deleted

Install the NVIDIA Helm repo

Install the NVIDIA Helm repo with the following command:

helm repo add nvidia && helm repo update

Deploy the DCGM exporter with the NVIDIA GPU Operator

To deploy the DCGM exporter, complete the following steps:

Prepare the DCGM exporter GPU metrics configuration:

curl etc/dcp-metrics-included.csv > dcgm-metrics.csv

You have the option to edit the dcgm-metrics.csv file. You can add or remove any metrics as needed.
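Each line in this file follows the dcgm-exporter counter-list format: a DCGM field name, a Prometheus metric type, and a help string. As a small illustrative fragment (field names taken from the metrics output shown later in this post):

```
# Format: DCGM field name, Prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_FB_USED,       gauge, Framebuffer memory used (in MiB).
```

Removing a line stops that field from being exported; any field you add must be supported by your GPU and driver.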

Create the gpu-operator namespace and DCGM exporter ConfigMap:

kubectl create namespace gpu-operator && \
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

The output is as follows:

namespace/gpu-operator created
configmap/metrics-config created

Apply the GPU operator to the EKS cluster:

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator \
--set dcgmExporter.config.name=metrics-config \
--set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
--set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv \
--set toolkit.enabled=false

The output is as follows:

NAME: gpu-operator-1684795140
LAST DEPLOYED: Day Month Date HH:mm:ss YYYY
NAMESPACE: gpu-operator
STATUS: deployed

Confirm that the DCGM exporter pod is running:

kubectl -n gpu-operator get pods | grep dcgm

The output is as follows:

nvidia-dcgm-exporter-lkmfr       1/1     Running    0   1m

If you check the logs, you should see the "Starting webserver" message:

kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1)

The output is as follows:

Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
time="2023-05-22T22:40:08Z" level=info msg="Starting dcgm-exporter"
time="2023-05-22T22:40:08Z" level=info msg="DCGM successfully initialized!"
time="2023-05-22T22:40:08Z" level=info msg="Collecting DCP Metrics"
time="2023-05-22T22:40:08Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-05-22T22:40:08Z" level=info msg="Initializing system entities of type: GPU"
time="2023-05-22T22:40:09Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2023-05-22T22:40:09Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2023-05-22T22:40:09Z" level=info msg="Initializing system entities of type: NvLink"
time="2023-05-22T22:40:09Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2023-05-22T22:40:09Z" level=info msg="Kubernetes metrics collection enabled!"
time="2023-05-22T22:40:09Z" level=info msg="Pipeline starting"
time="2023-05-22T22:40:09Z" level=info msg="Starting webserver"

NVIDIA DCGM Exporter exposes a Prometheus metrics endpoint, which can be ingested by the CloudWatch agent. To see the endpoint, use the following command:

kubectl -n gpu-operator get services | grep dcgm

We get the following output:

nvidia-dcgm-exporter    ClusterIP   <none>   9400/TCP   10m

To generate some GPU utilization, we deploy a pod that runs the gpu-burn binary:

kubectl apply -f

The output is as follows:

deployment.apps/gpu-burn created

This deployment uses a single GPU to produce a continuous pattern of 100% utilization for 20 seconds followed by 0% utilization for 20 seconds.
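For orientation, a deployment of this shape can be sketched as follows; the image name and command below are hypothetical placeholders rather than the actual manifest from the aws-do-eks project, but the nvidia.com/gpu resource limit is what pins the workload to a single GPU:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-burn
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-burn
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      containers:
      - name: main
        image: example.registry/gpu-burn:latest   # hypothetical image
        command: ["/bin/sh", "-c"]
        args: ["while true; do ./gpu_burn 20; sleep 20; done"]  # 20s on, 20s off
        resources:
          limits:
            nvidia.com/gpu: 1                     # claim exactly one GPU
```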

To confirm the endpoint works, you can run a temporary container that uses curl to read the content of http://nvidia-dcgm-exporter:9400/metrics:

kubectl -n gpu-operator run -it --rm curl --restart="Never" --image=curlimages/curl --command -- curl http://nvidia-dcgm-exporter:9400/metrics

We get the following output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 1455
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 6250
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 65
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 299.437000
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 15782796862
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 100
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 38
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 2230
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 20501
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.808369
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.315787
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 3985328
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 21715174
pod "curl" deleted
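Notice that each sample line carries the consuming pod as a label; that label is exactly what enables pod-level attribution downstream. As a side illustration (not part of the pipeline above), a few lines of Python can pull the pod name and value out of such exposition-format output:

```python
import re

# Match Prometheus exposition lines like:
# DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="gpu-burn-c68d8c774-ltg9s",...} 100
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def gpu_util_by_pod(metrics_text):
    """Return {pod_name: value} for DCGM_FI_DEV_GPU_UTIL samples."""
    result = {}
    for line in metrics_text.splitlines():
        if line.startswith('#'):
            continue  # skip HELP/TYPE comment lines
        m = LINE.match(line.strip())
        if not m or m.group(1) != 'DCGM_FI_DEV_GPU_UTIL':
            continue
        labels = dict(LABEL.findall(m.group(2)))
        result[labels.get('pod', '')] = float(m.group(3))
    return result

sample = '''# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",device="nvidia0",pod="gpu-burn-c68d8c774-ltg9s"} 100'''
print(gpu_util_by_pod(sample))  # {'gpu-burn-c68d8c774-ltg9s': 100.0}
```

The same approach generalizes to any of the DCGM_* series shown above.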

Configure and deploy the CloudWatch agent

To configure and deploy the CloudWatch agent, complete the following steps:

Download the YAML file and edit it:

curl -O

The file contains a cwagent configmap and a prometheus configmap. For this post, we edit both.

Edit the prometheus-eks.yaml file

Open the prometheus-eks.yaml file in your favorite editor and replace the cwagentconfig.json section with the following content:

apiVersion: v1
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["Service"],
                  "label_matcher": ".*dcgm.*",
                  "dimensions": [["Service","Namespace","ClusterName","job","pod"]],
                  "metric_selectors": [
                    "^DCGM.*"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }

In the prometheus config section, append the following job definition for the DCGM exporter:

- job_name: 'kubernetes-pod-dcgm-exporter'
  sample_limit: 10000
  metrics_path: /api/v1/metrics/prometheus
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: keep
    regex: '^DCGM.*$'
  - source_labels: [__address__]
    action: replace
    regex: ([^:]+)(?::\d+)?
    replacement: ${1}:9400
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: Namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    target_label: container_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_controller_name
    target_label: pod_controller_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_controller_kind
    target_label: pod_controller_kind
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_phase
    target_label: pod_phase
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: NodeName
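The __address__ relabeling in this job rewrites each discovered pod's scrape address to the DCGM exporter port 9400 using the regex ([^:]+)(?::\d+)? and the replacement ${1}:9400. Prometheus anchors relabel regexes to the whole label value; a rough Python approximation of that substitution (Python's re is close to, but not identical to, Prometheus's RE2 engine):

```python
import re

# Equivalent of the relabel rule: regex ([^:]+)(?::\d+)?  replacement ${1}:9400
def rewrite_address(address: str, port: int = 9400) -> str:
    """Replace (or append) the port of a host:port scrape address."""
    m = re.fullmatch(r'([^:]+)(?::\d+)?', address)  # fullmatch mimics anchoring
    return f'{m.group(1)}:{port}' if m else address

print(rewrite_address('192.168.214.241:8080'))  # 192.168.214.241:9400
print(rewrite_address('192.168.214.241'))       # 192.168.214.241:9400
```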

Save the file and apply the cwagent-dcgm configuration to your cluster:

kubectl apply -f ./prometheus-eks.yaml

We get the following output:

namespace/amazon-cloudwatch created
configmap/prometheus-cwagentconfig created
configmap/prometheus-config created
serviceaccount/cwagent-prometheus created
deployment.apps/cwagent-prometheus created

Confirm that the CloudWatch agent pod is running:

kubectl -n amazon-cloudwatch get pods

We get the following output:

cwagent-prometheus-7dfd69cc46-s4cx7 1/1 Running 0 15m

Visualize metrics on the CloudWatch console

To visualize the metrics in CloudWatch, complete the following steps:

On the CloudWatch console, under Metrics in the navigation pane, choose All metrics
In the Custom namespaces section, choose the new entry for ContainerInsights/Prometheus

For more information about the ContainerInsights/Prometheus namespace, refer to Scraping additional Prometheus sources and importing those metrics.

CloudWatch - ContainerInsights/Prometheus

Drill down to the metric names and choose DCGM_FI_DEV_GPU_UTIL
On the Graphed metrics tab, set Period to 5 seconds

CloudWatch - Period Setting

Set the refresh interval to 10 seconds

You will see the metrics collected from the DCGM exporter that visualize the gpu-burn pattern switching on and off every 20 seconds.

CloudWatch - gpuburn pattern

On the Browse tab, you can see the data, including the pod name for each metric.

CloudWatch - pod name for metric

The EKS API metadata has been combined with the DCGM metrics data, resulting in the provided pod-based GPU metrics.
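Because these metrics carry pod as a CloudWatch dimension, they can also be fetched programmatically. The sketch below only builds a GetMetricData query for DCGM_FI_DEV_GPU_UTIL; the default dimension values are placeholders you must replace with the labels your agent actually emits, and the dimension set must match the emf_processor dimensions configured earlier (Service, Namespace, ClusterName, job, pod). The boto3 call itself is left commented out:

```python
def build_gpu_util_query(cluster, pod,
                         service="nvidia-dcgm-exporter",      # placeholder values:
                         namespace="gpu-operator",            # substitute the labels
                         job="kubernetes-pod-dcgm-exporter",  # your agent emits
                         period=5):
    """Build one CloudWatch GetMetricData query for a pod's GPU utilization."""
    return {
        "Id": "gpuUtil",
        "MetricStat": {
            "Metric": {
                "Namespace": "ContainerInsights/Prometheus",
                "MetricName": "DCGM_FI_DEV_GPU_UTIL",
                "Dimensions": [
                    {"Name": "Service", "Value": service},
                    {"Name": "Namespace", "Value": namespace},
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "job", "Value": job},
                    {"Name": "pod", "Value": pod},
                ],
            },
            "Period": period,
            "Stat": "Average",
        },
    }

query = build_gpu_util_query("do-eks-yaml-g5", "gpu-burn-c68d8c774-ltg9s")
# import boto3
# from datetime import datetime, timedelta, timezone
# cw = boto3.client("cloudwatch", region_name="us-east-1")
# end = datetime.now(timezone.utc)
# resp = cw.get_metric_data(MetricDataQueries=[query],
#                           StartTime=end - timedelta(minutes=10), EndTime=end)
print(query["MetricStat"]["Metric"]["MetricName"])  # DCGM_FI_DEV_GPU_UTIL
```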

This concludes the first approach of exporting DCGM metrics to CloudWatch via the CloudWatch agent.

In the next section, we configure the second architecture, which exports the DCGM metrics to Prometheus, and we visualize them with Grafana.

Use Prometheus and Grafana to visualize GPU metrics from DCGM

Complete the following steps:

Add the Prometheus community helm chart:

helm repo add prometheus-community

This chart deploys both Prometheus and Grafana. We need to make some edits to the chart before running the install command.

Save the chart configuration values to a file in /tmp:

helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values

Edit the chart configuration file

Edit the saved file (/tmp/kube-prometheus-stack.values) and set the following option by searching for the setting name and setting its value:


Add the following job to the additionalScrapeConfigs section:

- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

Deploy the Prometheus stack with the updated values:

helm install --generate-name prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--values /tmp/kube-prometheus-stack.values

We get the following output:

NAME: kube-prometheus-stack-1684965548
LAST DEPLOYED: Wed May 24 21:59:14 2023
NAMESPACE: prometheus
STATUS: deployed
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1684965548"

Visit
for instructions on how to create & configure Alertmanager
and Prometheus instances using the Operator.

Confirm that the Prometheus pods are running:

kubectl get pods -n prometheus

We get the following output:

alertmanager-kube-prometheus-stack-1684-alertmanager-0 2/2 Running 0 6m55s
kube-prometheus-stack-1684-operator-6c87649878-j7v55 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-grafana-dcd7b4c96-bzm8p 3/3 Running 0 6m58s
kube-prometheus-stack-1684965548-kube-state-metrics-7d856dptlj5 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-2fbl5 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-m7zmv 1/1 Running 0 6m58s
prometheus-kube-prometheus-stack-1684-prometheus-0 2/2 Running 0 6m55s

The Prometheus and Grafana pods are in the Running state.

Next, we validate that DCGM metrics are flowing into Prometheus.

Port-forward the Prometheus UI

There are different ways to expose the Prometheus UI running in EKS to requests originating outside of the cluster. We will use kubectl port-forwarding. So far, we have been executing commands inside the aws-do-eks container. To access the Prometheus service running in the cluster, we will create a tunnel from the host where the aws-do-eks container is running by executing the following command outside of the container, in a new terminal shell on the host. We will refer to this as the "host shell".

kubectl -n prometheus port-forward svc/$(kubectl -n prometheus get svc | grep prometheus | grep -v alertmanager | grep -v operator | grep -v grafana | grep -v metrics | grep -v exporter | grep -v operated | cut -d ' ' -f 1) 8080:9090 &

While the port-forwarding process is running, we are able to access the Prometheus UI from the host as described below.

Open the Prometheus UI

If you are using Cloud9, navigate to Preview -> Preview Running Application to open the Prometheus UI in a tab inside the Cloud9 IDE, then click the icon in the upper-right corner of the tab to pop it out into a new window.
If you are on your local host or connected to an EC2 instance via remote desktop, open a browser and go to the URL http://localhost:8080.

Prometheus - DCGM metrics

Enter DCGM to see the DCGM metrics that are flowing into Prometheus
Select DCGM_FI_DEV_GPU_UTIL, choose Execute, and then navigate to the Graph tab to see the expected GPU utilization pattern
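Beyond plotting a single series, the expression bar accepts arbitrary PromQL, so the same data can be aggregated by label. Two query sketches (metric and label names as shown in the exporter output earlier):

```
# average GPU utilization per pod
avg by (pod) (DCGM_FI_DEV_GPU_UTIL)

# peak per-node utilization over the last 5 minutes
max by (Hostname) (max_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```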

Prometheus - gpuburn pattern

Stop the Prometheus port-forwarding process

Run the following command line in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')

Now we can visualize the DCGM metrics via the Grafana dashboard.

Retrieve the password to log in to the Grafana UI

kubectl -n prometheus get secret $(kubectl -n prometheus get secrets | grep grafana | cut -d ' ' -f 1) -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
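Kubernetes stores secret values base64-encoded, which is why the command pipes the jsonpath output through `base64 --decode`. As a standalone illustration (the encoded string below is the well-known kube-prometheus-stack chart default, not a real credential from your cluster):

```shell
# Decode a base64-encoded secret value the same way the command above does.
# "cHJvbS1vcGVyYXRvcg==" decodes to "prom-operator", the default Grafana
# admin password in many kube-prometheus-stack installs.
echo 'cHJvbS1vcGVyYXRvcg==' | base64 --decode ; echo
# -> prom-operator
```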

Port-forward the Grafana service

Run the following command in your host shell:

kubectl port-forward -n prometheus svc/$(kubectl -n prometheus get svc | grep grafana | cut -d ' ' -f 1) 8080:80 &

Log in to the Grafana UI

Access the Grafana UI login screen the same way you accessed the Prometheus UI earlier. If using Cloud9, choose Preview->Preview Running Application, then pop the tab out into a new window. If using your local host or an EC2 instance with remote desktop, visit the URL http://localhost:8080. Log in with the user name admin and the password you retrieved earlier.

Grafana - login

In the navigation pane, choose Dashboards

Grafana - dashboards

Choose New, and then Import

Grafana - load dashboard by ID
We are going to import the default DCGM Grafana dashboard described in NVIDIA DCGM Exporter Dashboard.

In the Import via grafana.com field, enter 12239 and choose Load
Choose Prometheus as the data source
Choose Import

Grafana - import dashboard

You will see a dashboard similar to the one in the following screenshot.

Grafana - dashboard

To show that these metrics are pod-based, we are going to modify the GPU Utilization panel on this dashboard.

Choose the panel, then the options menu (three dots)
Expand the Options section and edit the Legend field
Replace the value there with Pod {{pod}}, then choose Save

Grafana - pod-based metric
The legend now shows the gpu-burn pod name associated with the displayed GPU utilization.

Stop port-forwarding the Grafana UI service

Run the following in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')

In this post, we demonstrated using open-source Prometheus and Grafana deployed to the EKS cluster. If desired, this deployment can be substituted with Amazon Managed Service for Prometheus and Amazon Managed Grafana.
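If you opt for Amazon Managed Service for Prometheus, the in-cluster Prometheus can remote-write metrics to the managed workspace using SigV4 authentication. A minimal sketch of the relevant kube-prometheus-stack Helm values follows; the workspace ID (`ws-EXAMPLE`) and Region are placeholders you would replace with your own:

```yaml
# Illustrative kube-prometheus-stack values fragment for remote-writing
# to Amazon Managed Service for Prometheus; ws-EXAMPLE and the Region
# are placeholders
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
        sigv4:
          region: us-west-2
```

This assumes the Prometheus service account has IAM permissions to write to the workspace (for example, via IAM Roles for Service Accounts).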

Clean up

To clean up the resources you created, run the following script from the aws-do-eks container shell:


Conclusion

In this post, we used the NVIDIA DCGM Exporter to collect GPU metrics and visualized them with either CloudWatch or Prometheus and Grafana. We invite you to use the architectures demonstrated here to enable GPU utilization monitoring with NVIDIA DCGM in your own AWS environment.

Additional resources

About the authors

Amr Ragab is a former Principal Solutions Architect, EC2 Accelerated Computing at AWS. He is devoted to helping customers run computational workloads at scale. In his spare time, he likes traveling and finding new ways to integrate technology into daily life.

Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He is a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open-source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges. During the past 10 years, Alex has worked on democratizing AI and ML, fighting climate change, and making travel safer, healthcare better, and energy smarter.

Keita Watanabe is a Senior Solutions Architect of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry's best cloud-based self-managed machine learning solutions. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the e-commerce industry. Keita holds a Ph.D. in Science from the University of Tokyo.

