Kubernetes Nvidia GPU Monitor & Grafana Dashboard
More info, please see https://github.com/NVIDIA/gpu-monitoring-tools

4、Create Service

    kind: Service
    apiVersion: v1
    metadata:
      labels:
        k8s-app: prometheus-gpu
      name: prometheus-gpu-service
      namespace: kube-system
    spec:
      ports:
      - port: 9100
        targetPort: 9100
      selector:
        k8s-app: prometheus-gpu

5、Test Metrics

    curl prometheus-gpu-service.kube-system:9100/metrics

Then you will see some metrics like this:

    # HELP dcgm_board_limit_violation Throttling duration due to board limit constraints (in us).
    # TYPE dcgm_board_limit_violation counter
    dcgm_board_limit_violation{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
    dcgm_board_limit_violation{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
    dcgm_board_limit_violation{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
    dcgm_board_limit_violation{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
    # HELP dcgm_dec_utilization Decoder utilization (in %).
    # TYPE dcgm_dec_utilization gauge
    dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
    dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
    dcgm_dec_utilization{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
    dcgm_dec_utilization{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
    .....

▶ Using Prometheus Collect Metrics

1、Create ConfigMap

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      namespace: kube-system
    data:
      prometheus.yml: |
        scrape_configs:
        - job_name: 'gpu'
          honor_labels: true
          static_configs:
          - targets: ['prometheus-gpu-service.kube-system:9100']

2、Create Deployment

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
      namespace: kube-system
    spec:
      replicas: 1
      revisionHistoryLimit: 3
      selector:
        matchLabels:
          k8s-app: prometheus
      template:
        metadata:
          labels:
            k8s-app: prometheus
        spec:
          volumes:
          - name: prometheus
            configMap:
              name: prometheus-config
          serviceAccountName: admin-user
          containers:
          - name: prometheus
            image: "prom/prometheus:latest"
            volumeMounts:
            - name: prometheus
              mountPath: /etc/prometheus/
            imagePullPolicy: Always
            ports:
            - containerPort: 9090
              protocol: TCP

3、Create Service

    kind: Service
    apiVersion: v1
    metadata:
      labels:
        k8s-app: prometheus
      name: prometheus-service
      namespace: kube-system
    spec:
      ports:
      - port: 9090
        targetPort: 9090
      selector:
        k8s-app: prometheus

▶ Grafana Dashboard

1、Deploy grafana in your kubernetes cluster

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: grafana
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: grafana
      template:
        metadata:
          labels:
            k8s-app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:6.2.5
            env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: <your-password>
            - name: GF_SECURITY_ADMIN_USER
              value: <your-username>
            ports:
            - containerPort: 3000
              protocol: TCP

2、Create Service Expose Your Grafana Service

    kind: Service
    apiVersion: v1
    metadata:
      labels:
        k8s-app: grafana
      name: grafana-service
      namespace: kube-system
    spec:
      ports:
      - port: 3000
        targetPort: 3000
        nodePort: 31111
      selector:
        k8s-app: grafana
      type: NodePort

3、Access Grafana

The Grafana address should be http://<kubernetes-node-ip>:31111/ , and the username and password are the ones you configured in step 1.

4、Add New DataSource

Click Settings -> Data Sources -> Add data source -> Prometheus. Config example:

    Name: Prometheus
    Default: Yes
    URL: http://prometheus-service:9090
    Access: Server
    HTTP Method: GET

Then click Save & Test. OK, you can access Prometheus data now.

5、Custom GPU Monitoring Dashboard

Get each GPU's temperature with this query:

    sum(dcgm_gpu_temp{gpu=~".*"}) by (gpu)

Extra queries:

number of GPUs reporting (one series per GPU):

    count(dcgm_board_limit_violation)

total memory usage rate:

    sum(dcgm_fb_used) / (sum(dcgm_fb_free) + sum(dcgm_fb_used))

power draw:

    sum(dcgm_power_usage{gpu=~".*"}) by (gpu)

memory temperature:

    sum(dcgm_memory_temp{gpu=~".*"}) by (gpu)
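
The total memory usage rate query divides the summed used framebuffer memory by used plus free. To sanity-check a dashboard panel outside Grafana, the same arithmetic can be sketched in Python over the text returned by the /metrics endpoint. This is only a minimal sketch and not part of the original setup: the helper names `parse_metrics` and `memory_usage_rate`, the sample values, and the shortened UUIDs are all made up for illustration.

```python
import re
from collections import defaultdict

# Matches sample lines such as: dcgm_fb_used{gpu="0",uuid="GPU-..."} 1024
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return {metric_name: {gpu_label: value}} from Prometheus text output."""
    metrics = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        gpu = labels.get("gpu", "")
        metrics[match.group("name")][gpu] = float(match.group("value"))
    return dict(metrics)


def memory_usage_rate(metrics):
    """Same arithmetic as sum(used) / (sum(free) + sum(used)) in PromQL."""
    used = sum(metrics["dcgm_fb_used"].values())
    free = sum(metrics["dcgm_fb_free"].values())
    return used / (used + free)


# Hypothetical sample in the same shape as the curl output; values are invented.
SAMPLE = """\
# TYPE dcgm_fb_used gauge
dcgm_fb_used{gpu="0",uuid="GPU-aaaa"} 1000
dcgm_fb_used{gpu="1",uuid="GPU-bbbb"} 3000
# TYPE dcgm_fb_free gauge
dcgm_fb_free{gpu="0",uuid="GPU-aaaa"} 7000
dcgm_fb_free{gpu="1",uuid="GPU-bbbb"} 5000
"""

if __name__ == "__main__":
    print(memory_usage_rate(parse_metrics(SAMPLE)))  # 4000 / 16000 = 0.25
```

To use it against a live cluster, the output of the curl command from the Test Metrics step could be passed to `parse_metrics` in place of `SAMPLE`.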