Scenario
K3s + Docker
Prerequisites
nvidia-container-toolkit is installed; make sure the nvidia runtime is available
NVIDIA DCGM
Prometheus + Grafana are already deployed on K3s
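The nvidia runtime prerequisite can be checked before going further. The following is a minimal sketch, assuming Docker is the container engine as in this scenario. `has_nvidia_runtime` is a hypothetical helper name, not part of any NVIDIA or Docker tooling; it only inspects the JSON that `docker info` prints.

```shell
#!/usr/bin/env bash
# Sketch: check whether Docker lists a runtime named "nvidia".
# has_nvidia_runtime is a hypothetical helper. It reads, on stdin, the JSON
# printed by:
#   docker info --format '{{json .Runtimes}}'
# and succeeds if a key named "nvidia" is present.
has_nvidia_runtime() {
  grep -q '"nvidia"[[:space:]]*:'
}

# Usage on the GPU node (left as a comment so nothing runs on sourcing):
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#     && echo "nvidia runtime found" || echo "nvidia runtime missing"
```

If the check reports the runtime as missing, the nvidia-container-toolkit setup is the first thing to revisit.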
Data flow
NVIDIA GPU -> NVIDIA Driver -> DCGM -> DCGM-Exporter -> Prometheus -> Grafana
Steps
Add a RuntimeClass to K3s
Note that the handler must be docker. Setting it to nvidia is wrong and produces the error:
RuntimeHandler "nvidia" not supported
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: docker

Manually download dcgm-exporter
git clone https://github.com/NVIDIA/dcgm-exporter.git

Modify deployment/values.yaml
Expose the exporter to Prometheus
Allocate more resources to fix the OOMKilled problem
podAnnotations:
  # These annotations are required for Prometheus scraping
  prometheus.io/scrape: "true"
  prometheus.io/port: "9400"
resources:
  limits:
    cpu: 512m
    memory: 512Mi
  requests:
    cpu: 512m
    memory: 512Mi

Install (run from the deployment/ chart directory)
helm install dcgm-exporter -n dcgm-exporter --create-namespace ./

Import the dashboard
In Grafana: Dashboards -> New -> Import -> enter dashboard ID 12239 -> Load -> select the default Prometheus data source
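If the imported dashboard stays empty, it can help to confirm that the exporter is actually serving GPU metrics before looking at Prometheus or Grafana. The sketch below counts the sample lines for one metric in the exporter's text output. `count_metric_samples` is a hypothetical helper, and the Service name `svc/dcgm-exporter` in the usage comment is an assumption based on the release name used above; check it with `kubectl -n dcgm-exporter get svc`.

```shell
#!/usr/bin/env bash
# Sketch: count the samples of one metric in Prometheus text format.
# count_metric_samples NAME reads exporter output on stdin and prints how
# many sample lines belong to metric NAME (comment lines starting with '#'
# are not counted).
count_metric_samples() {
  local name="$1"
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c "^${name}[{ ]" || true
}

# Usage (left as comments; the Service name is an assumption):
#   kubectl -n dcgm-exporter port-forward svc/dcgm-exporter 9400:9400 &
#   curl -s http://localhost:9400/metrics | count_metric_samples DCGM_FI_DEV_GPU_UTIL
```

A count of 0 points at the exporter or the driver side of the chain, while a non-zero count suggests the problem is in scraping or in the data source selection.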
Result

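Pitfall
If you install the NVIDIA GPU Operator the way the official documentation recommends, the helm install automatically triggers a hook that modifies Docker's daemon.json and reloads it.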
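Because that hook rewrites daemon.json without asking, one cautious approach is to keep a copy of the file beforehand so the changes can be diffed or rolled back. This is only a sketch: `backup_file` is a hypothetical helper, and `/etc/docker/daemon.json` in the usage comment is Docker's default config location, which may differ on your node.

```shell
#!/usr/bin/env bash
# Sketch: keep a copy of a config file before something rewrites it.
# backup_file SRC DST copies SRC to DST (preserving mode and timestamps)
# and prints the backup path on success.
backup_file() {
  local src="$1"
  local dst="$2"
  cp -p -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Usage on the node (left as comments; paths are assumptions):
#   backup_file /etc/docker/daemon.json "$HOME/daemon.json.before-gpu-operator"
#   # ...after the GPU Operator install...
#   diff "$HOME/daemon.json.before-gpu-operator" /etc/docker/daemon.json
#   docker info --format '{{.DefaultRuntime}}'
```

The last command prints Docker's current default runtime, which is a quick way to see whether the rewrite changed it.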