Configuring a private Docker registry
Create the registry
docker run -d -p 5123:5000 -v $(pwd):/var/lib/registry --name dlrover_registry registry:3
Push an image
docker pull <image>
docker tag <image> <registry-ip>:5123/<image>
docker push <registry-ip>:5123/<image>
DLRover configuration
Configure the ElasticJob
Configure the master node image
docker/release/master.dockerfile requires VERSION=0.5.0.dev to be specified.
docker build -t easydl/dlrover-master:test -f docker/release/master.dockerfile .
Configure the worker node image for the elastic job
docker build -t ${IMAGE_NAME} -f examples/pytorch/nanogpt/nanogpt.dockerfile .
Launch the elastic job
Training on two nodes with 16 GPUs
kubectl -n dlrover apply -f examples/pytorch/nanogpt/elastic_job.yaml
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: torch-nanogpt
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs-pvc
          containers:
            - name: main
              # yamllint disable-line rule:line-length
              # image: registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:pytorch-example
              image: dlrover-local:nanogpt
              imagePullPolicy: Never
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
                - name: nfs-volume
                  mountPath: /mnt
              command:
                - /bin/bash
                - -c
                - "dlrover-run --network-check --nnodes=$NODE_NUM \
                  --nproc_per_node=2 --max_restarts=10 \
                  ./examples/pytorch/nanogpt/train.py \
                  --data_dir /mnt/data/nanogpt/ \
                  --save_dir /mnt/checkpoint/"
              resources:
                limits:
                  cpu: "32"
                  memory: 64Gi
                  nvidia.com/gpu: 8  # optional
                requests:
                  cpu: "32"
                  memory: 64Gi
                  nvidia.com/gpu: 8  # optional
    dlrover-master:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              imagePullPolicy: Never
              image: easydl/dlrover-master:test
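Once the job has been applied, its state can be inspected with ordinary kubectl queries. The following is a minimal sketch, assuming the namespace and job name from the manifest above and assuming the ElasticJob CRD is registered under the resource name `elasticjob`. The `run` wrapper, the `check_job` function, and the `DRY_RUN` variable are hypothetical helpers added here for illustration, not part of DLRover.

```shell
# Sketch: inspect the ElasticJob after `kubectl apply`.
# NAMESPACE and JOB_NAME default to the values used in the manifest.
NAMESPACE="${NAMESPACE:-dlrover}"
JOB_NAME="${JOB_NAME:-torch-nanogpt}"

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show the ElasticJob resource (assumes the CRD is installed) and its pods.
check_job() {
  run kubectl -n "$NAMESPACE" get elasticjob "$JOB_NAME"
  run kubectl -n "$NAMESPACE" get pods -o wide
}
```

Calling `check_job` runs both queries; setting `DRY_RUN=1` first only prints the commands, which is handy for checking the namespace and job name before touching the cluster.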
Pitfalls
GPU compatibility
H20 GPUs require torch 2.1.0 with CUDA 11.8 (cu118) or later.
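One way to confirm that an image meets this requirement before scheduling it on H20 nodes is to compare the installed versions against the minimums. This is a sketch that assumes GNU `sort -V` and a `python` with torch on the PATH inside the image. `version_ge` and `check_torch_cuda` are hypothetical helper names; `torch.__version__` and `torch.version.cuda` are the standard PyTorch attributes.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeed only when torch >= 2.1.0 and its CUDA build >= 11.8.
check_torch_cuda() {
  torch_ver="$(python -c 'import torch; print(torch.__version__)')" || return 1
  cuda_ver="$(python -c 'import torch; print(torch.version.cuda)')" || return 1
  # CPU-only builds report None for the CUDA version.
  [ "$cuda_ver" = "None" ] && return 1
  # Drop a local build suffix such as +cu118 before comparing.
  torch_ver="${torch_ver%%+*}"
  version_ge "$torch_ver" 2.1.0 && version_ge "$cuda_ver" 11.8
}
```

Running `check_torch_cuda` inside the worker container returns a non-zero status when either version is too old or the build has no CUDA support.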
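Official examples
The official cloud-hosted images ship outdated versions, so the images have to be built locally.
Mirrors in mainland China
Every dockerfile needs a mainland-China pip mirror configured; otherwise the images cannot be built.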
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
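Because the line has to be present in every dockerfile, it can help to list the ones that still lack it. The sketch below assumes the dockerfiles use the `.dockerfile` extension, as the two files built earlier do. `dockerfiles_missing_mirror` is a hypothetical helper name.

```shell
# List dockerfiles under directory $1 (default: current directory)
# that do not mention the Aliyun pip mirror.
dockerfiles_missing_mirror() {
  # grep -L prints only files with no match; -F treats the pattern literally.
  find "${1:-.}" -type f -name '*.dockerfile' \
    -exec grep -LF 'mirrors.aliyun.com/pypi/simple' {} +
}
```

Running `dockerfiles_missing_mirror .` from the repository root prints the paths that still need the `RUN pip config set` line; empty output means every matching file already has it.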