DLRover Deployment

蜉蝣
2025-07-03

Configure a private Docker registry

  • Create the registry

docker run -d -p 5123:5000 -v $(pwd):/var/lib/registry --name dlrover_registry registry:3

  • Push an image

docker pull <image>
docker tag <image> <registry-ip>:5123/<image>
docker push <registry-ip>:5123/<image>
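
The pull/tag/push sequence above can be wrapped in a small helper so the registry prefix is built the same way every time. This is a minimal sketch: the function names, the address 192.0.2.10, and the image name myimage:latest are hypothetical placeholders, and the helper only prints the commands rather than running them.

```shell
# Port published by the registry container (from the docker run command above).
REGISTRY_PORT=5123

# registry_ref HOST IMAGE: print the registry-qualified name for IMAGE.
registry_ref() {
  printf '%s:%s/%s\n' "$1" "$REGISTRY_PORT" "$2"
}

# push_plan HOST IMAGE: print (do not run) the pull/tag/push sequence.
push_plan() {
  local ref
  ref="$(registry_ref "$1" "$2")"
  printf 'docker pull %s\n' "$2"
  printf 'docker tag %s %s\n' "$2" "$ref"
  printf 'docker push %s\n' "$ref"
}

# Hypothetical registry address and image name; replace with real values.
push_plan 192.0.2.10 myimage:latest
```

Piping the output to `sh` would execute the commands. Since this registry is served over plain HTTP, Docker clients on other hosts will likely need `<registry-ip>:5123` listed under `insecure-registries` in `/etc/docker/daemon.json` before the push succeeds.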

DLRover configuration

Configure ElasticJob

Build the master node image

  • docker/release/master.dockerfile needs VERSION=0.5.0.dev to be specified

docker build -t easydl/dlrover-master:test -f docker/release/master.dockerfile .

Build the worker node image for the elastic job

docker build -t ${IMAGE_NAME} -f examples/pytorch/nanogpt/nanogpt.dockerfile .
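
The two build steps can be kept consistent with the job manifest by defining the image names once. This is a sketch under two assumptions: that master.dockerfile declares `ARG VERSION`, so the value can be passed with `--build-arg` (if it does not, edit the file instead), and that `IMAGE_NAME` should equal the worker image referenced in the manifest. The helper only prints the commands.

```shell
VERSION="0.5.0.dev"
MASTER_IMAGE="easydl/dlrover-master:test"   # dlrover-master image used in the manifest
IMAGE_NAME="dlrover-local:nanogpt"          # worker image used in the manifest

# build_cmds: print (do not run) both docker build commands.
# Assumption: master.dockerfile accepts VERSION as a build argument.
build_cmds() {
  printf 'docker build --build-arg VERSION=%s -t %s -f docker/release/master.dockerfile .\n' \
    "$VERSION" "$MASTER_IMAGE"
  printf 'docker build -t %s -f examples/pytorch/nanogpt/nanogpt.dockerfile .\n' \
    "$IMAGE_NAME"
}

build_cmds
```

Because the manifest sets `imagePullPolicy: Never`, both images have to be present locally on every node that may schedule the pods.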

Launch the elastic job

  • Two-node training on 16 GPUs

kubectl -n dlrover apply -f examples/pytorch/nanogpt/elastic_job.yaml

Contents of examples/pytorch/nanogpt/elastic_job.yaml:

---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: torch-nanogpt
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs-pvc
          containers:
            - name: main
              # yamllint disable-line rule:line-length
              # image: registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:pytorch-example
              image: dlrover-local:nanogpt
              imagePullPolicy: Never
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
                - name: nfs-volume
                  mountPath: /mnt
              command:
                - /bin/bash
                - -c
                - "dlrover-run --network-check --nnodes=$NODE_NUM \
                  --nproc_per_node=8 --max_restarts=10  \
                  ./examples/pytorch/nanogpt/train.py  \
                  --data_dir /mnt/data/nanogpt/ \
                  --save_dir /mnt/checkpoint/"
              resources:
                limits:
                  cpu: "32"
                  memory: 64Gi
                  nvidia.com/gpu: 8 # optional
                requests:
                  cpu: "32"
                  memory: 64Gi
                  nvidia.com/gpu: 8 # optional
    dlrover-master:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              imagePullPolicy: Never
              image: easydl/dlrover-master:test
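
The number of training processes is the node count multiplied by `--nproc_per_node`, so it is worth checking that figure against the GPUs requested per pod before applying the manifest. The sketch below uses hypothetical helper names; the node and GPU counts are taken from the manifest above, and the `NPROC_PER_NODE` value is an assumed example.

```shell
# world_size NODES NPROC_PER_NODE: total number of training processes.
world_size() {
  echo $(( $1 * $2 ))
}

# fits_gpus NPROC_PER_NODE GPUS_PER_NODE: succeeds when every process
# on a node can be given its own GPU.
fits_gpus() {
  [ "$1" -le "$2" ]
}

NODES=2            # replicas in the manifest
GPUS_PER_NODE=8    # nvidia.com/gpu limit in the manifest
NPROC_PER_NODE=8   # assumed value passed to --nproc_per_node

if fits_gpus "$NPROC_PER_NODE" "$GPUS_PER_NODE"; then
  echo "world size: $(world_size "$NODES" "$NPROC_PER_NODE")"
else
  echo "nproc_per_node exceeds the GPUs requested per node" >&2
fi
```

With these values the script reports a world size of 16, one process per allocated GPU. A smaller `--nproc_per_node` would still run but would leave some of the requested GPUs idle.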

GPU compatibility

  • H20 GPUs require torch 2.1.0 with CUDA 11.8 (cu118) or later
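
One way to check an image against this minimum is to compare the version string PyTorch reports inside the container. The sketch below is an illustration, not an official check: the function names are hypothetical, it assumes the version string carries a `+cuXYZ` suffix (as in `2.1.0+cu118`), and it relies on GNU `sort -V`.

```shell
# version_ge A B: succeeds when dotted version A >= B (needs GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# meets_h20_minimum TORCH_VERSION
# TORCH_VERSION is the string printed inside the image by:
#   python -c "import torch; print(torch.__version__)"
meets_h20_minimum() {
  local torch_ver="${1%%+*}"   # part before "+", e.g. 2.1.0
  local tag="${1#*+}"          # part after "+", e.g. cu118
  case "$tag" in
    cu[0-9]*) ;;               # CUDA build
    *) return 1 ;;             # CPU build or no suffix: CUDA cannot be confirmed
  esac
  version_ge "$torch_ver" "2.1.0" && [ "${tag#cu}" -ge 118 ]
}

if meets_h20_minimum "2.1.0+cu118"; then
  echo "meets the H20 minimum"
else
  echo "below the H20 minimum"
fi
```

A build without the `+cuXYZ` suffix is reported as failing, since the CUDA version cannot be read from the string in that case.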

Official examples

  • The official hosted image is an outdated version, so the image has to be built locally

Mirrors in China

  • Every dockerfile needs a pip mirror hosted in China added to it; otherwise the image build fails

    RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
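
Adding the line by hand to every dockerfile is easy to get wrong, so the edit can be scripted. The sketch below inserts the mirror line after each `FROM` instruction and skips files that already contain it. The function name is hypothetical, and it assumes pip is already available in each base image; a stage whose base image has no pip would need the line placed later.

```shell
MIRROR_LINE='RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/'

# add_pip_mirror FILE: insert MIRROR_LINE after every FROM instruction.
# Files that already contain the line are left untouched.
add_pip_mirror() {
  local file="$1" tmp
  if grep -qF -- "$MIRROR_LINE" "$file"; then
    return 0
  fi
  tmp="$(mktemp)"
  awk -v line="$MIRROR_LINE" '
    { print }                              # copy every original line
    toupper($1) == "FROM" { print line }   # then add the mirror after FROM
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (hypothetical): patch every *.dockerfile below the current directory.
# find . -name "*.dockerfile" | while read -r f; do add_pip_mirror "$f"; done
```

Writing back with `cat` rather than `mv` keeps the original file's permissions intact.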
