
Kubernetes Nvidia GPU Monitor & Grafana Dashboard

For more information, see https://github.com/NVIDIA/gpu-monitoring-tools

4. Create Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus-gpu
  name: prometheus-gpu-service
  namespace: kube-system
spec:
  ports:
    - port: 9100
      targetPort: 9100
  selector:
    k8s-app: prometheus-gpu

5. Test Metrics

curl prometheus-gpu-service.kube-system:9100/metrics

You should then see metrics like these:

# HELP dcgm_board_limit_violation Throttling duration due to board limit constraints (in us).
# TYPE dcgm_board_limit_violation counter
dcgm_board_limit_violation{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_board_limit_violation{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_board_limit_violation{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_board_limit_violation{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
.....
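
The output above is the Prometheus text exposition format: comment lines starting with #, then one sample per line as name{labels} value. As a rough illustration of how that output can be read programmatically, here is a minimal Python sketch. The function name parse_metrics and the regular expressions are my own for illustration and are not part of gpu-monitoring-tools; the sketch handles only simple lines like the ones shown, not every corner of the format.

```python
import re

# Matches sample lines such as:
#   dcgm_dec_utilization{gpu="0",uuid="GPU-..."} 0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Two sample lines copied from the output shown above.
SAMPLE_TEXT = """\
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels["gpu"], value)
```

In practice Prometheus does this parsing for you; the sketch is only meant to make the line structure explicit.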

▶ Using Prometheus to Collect Metrics

1. Create ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'gpu'
      honor_labels: true
      static_configs:
        - targets: ['prometheus-gpu-service.kube-system:9100']

2. Create Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      volumes:
        - name: prometheus
          configMap:
            name: prometheus-config
      serviceAccountName: admin-user
      containers:
        - name: prometheus
          image: "prom/prometheus:latest"
          volumeMounts:
            - name: prometheus
              mountPath: /etc/prometheus/
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP

3. Create Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus-service
  namespace: kube-system
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    k8s-app: prometheus
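
Once the Prometheus Service exists, one way to check that the gpu job is being scraped is to call the Prometheus HTTP API (/api/v1/query) from inside the cluster. The following is a minimal sketch using only the Python standard library. The address prometheus-service.kube-system:9090 is an assumption derived from the Service name and namespace above, and the helper names build_query_url and instant_query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed in-cluster address, derived from the Service name and namespace above.
PROMETHEUS_URL = "http://prometheus-service.kube-system:9090"


def build_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_query(base_url, promql, timeout=5):
    """Run an instant query and return the list of result series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


if __name__ == "__main__":
    # Only prints the URL; call instant_query() from a pod that can reach Prometheus.
    print(build_query_url(PROMETHEUS_URL, 'up{job="gpu"}'))
```

Calling instant_query(PROMETHEUS_URL, 'up{job="gpu"}') from a pod in the cluster should return one series per scrape target, with a value of 1 when the target is reachable.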

▶ Grafana Dashboard

1. Deploy Grafana in your Kubernetes cluster

kind: Deployment
apiVersion: apps/v1
metadata:
  name: grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:6.2.5
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: <your-password>
            - name: GF_SECURITY_ADMIN_USER
              value: <your-username>
          ports:
            - containerPort: 3000
              protocol: TCP

2. Create a Service to Expose Grafana

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: grafana
  name: grafana-service
  namespace: kube-system
spec:
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 31111
  selector:
    k8s-app: grafana
  type: NodePort

3. Access Grafana

Grafana should be reachable at http://<kubernetes-node-ip>:31111/. Log in with the username and password you configured in step 1.

4. Add a New Data Source

Click the settings (gear) icon -> Data Sources -> Add data source -> Prometheus. Example configuration:

Name: Prometheus
Default: Yes
URL: http://prometheus-service:9090
Access: Server
HTTP Method: GET

Then click Save & Test. Grafana can now query the Prometheus data.

5. Custom GPU Monitoring Dashboard

To get the temperature of each GPU, use the query sum(dcgm_gpu_temp{gpu=~".*"}) by (gpu)

Extra queries:

GPU count: count(dcgm_board_limit_violation)
Total memory usage rate: sum(dcgm_fb_used) / sum(sum(dcgm_fb_free) + sum(dcgm_fb_used))
Power draw: sum(dcgm_power_usage{gpu=~".*"}) by (gpu)
Memory temperature: sum(dcgm_memory_temp{gpu=~".*"}) by (gpu)
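
To make the arithmetic behind these queries concrete, here is a small Python sketch that mirrors count(...), sum(...) by (gpu), and the memory usage ratio on a handful of samples. The sample values are hypothetical (not measured), the MiB unit is an assumption, and the helper names sum_by and total are mine for illustration.

```python
from collections import defaultdict

# Hypothetical samples as (labels, value) pairs; values are illustrative only.
# Framebuffer readings are assumed to be in MiB.
FB_USED = [({"gpu": "0"}, 2048.0), ({"gpu": "1"}, 4096.0),
           ({"gpu": "2"}, 0.0), ({"gpu": "3"}, 1024.0)]
FB_FREE = [({"gpu": "0"}, 14336.0), ({"gpu": "1"}, 12288.0),
           ({"gpu": "2"}, 16384.0), ({"gpu": "3"}, 15360.0)]


def sum_by(samples, label):
    """Mirror of PromQL `sum(...) by (label)`: one total per label value."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


def total(samples):
    """Mirror of PromQL `sum(...)` with no grouping."""
    return sum(value for _, value in samples)


# count(metric): one series per GPU, so the series count is the GPU count.
gpu_count = len(FB_USED)

# sum(dcgm_fb_used) / (sum(dcgm_fb_free) + sum(dcgm_fb_used))
usage_rate = total(FB_USED) / (total(FB_FREE) + total(FB_USED))

print("gpu count:", gpu_count)
print("used per gpu:", sum_by(FB_USED, "gpu"))
print("total memory usage rate:", usage_rate)
```

With these made-up numbers, 7168 MiB used out of 65536 MiB total gives a usage rate of 0.109375. In a Grafana panel you would typically multiply by 100 or set the unit to percent.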
