Ascend Job

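Ascend Job, abbreviated acjob, is a custom job type defined by MindCluster. It currently supports two ways of launching training or inference jobs: configuring resource information through environment variables, or configuring it through a file.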

Supported AI Frameworks

  • MindSpore
  • PyTorch

Example

The following is an example of pytorch_multinodes_acjob_910b.yaml.

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-pytorch
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-910b
    tor-affinity: "null" # Whether the job uses switch affinity scheduling. "null" or omitting the label disables it; large-model-schema marks a large-model job, normal-schema a normal job.
  annotations:
    huawei.com/schedule_policy: "chip8-node8"
spec:
  schedulerName: volcano   # takes effect when enableGangScheduling is true
  runPolicy:
    schedulingPolicy:      # takes effect when enableGangScheduling is true
      minAvailable: 2
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-pytorch
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            example-key: example-value    # example value; configure nodeSelector according to your scheduling needs
          containers:
          - name: ascend # do not modify
            image: pytorch-test:latest         # training framework image, which can be modified
            imagePullPolicy: IfNotPresent
            env:
              - name: XDL_IP                                       # IP address of the physical node, which is used to identify the node where the pod is running
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              # ASCEND_VISIBLE_DEVICES is used by ascend-docker-runtime in the whole-card scheduling scenario with the volcano scheduler.
              # Delete it in these scenarios: static vNPU scheduling, dynamic vNPU scheduling, volcano without Ascend-volcano-plugin, or no volcano.
              - name: ASCEND_VISIBLE_DEVICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be the same as resources.requests
            command:                           # training command, which can be modified
              - /bin/bash
              - -c
            args: [ "cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=4096" ]
            ports:                          # default value containerPort: 2222 name: ascendjob-port if not set
              - containerPort: 2222         # determined by user
                name: ascendjob-port        # do not modify
            resources:
              limits:
                huawei.com/Ascend910: 2
              requests:
                huawei.com/Ascend910: 2
            volumeMounts:
            - name: code
              mountPath: /job/code
            - name: data
              mountPath: /job/data
            - name: output
              mountPath: /job/output
            - name: ascend-driver
              mountPath: /usr/local/Ascend/driver
            - name: ascend-add-ons
              mountPath: /usr/local/Ascend/add-ons
            - name: dshm
              mountPath: /dev/shm
            - name: localtime
              mountPath: /etc/localtime
          volumes:
          - name: code
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
          - name: data
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/dataset/"
          - name: output
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/output/"
          - name: ascend-driver
            hostPath:
              path: /usr/local/Ascend/driver
          - name: ascend-add-ons
            hostPath:
              path: /usr/local/Ascend/add-ons
          - name: dshm
            emptyDir:
              medium: Memory
          - name: localtime
            hostPath:
              path: /etc/localtime
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-pytorch
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            example-key: example-value    # example value; configure nodeSelector according to your scheduling needs
          containers:
          - name: ascend # do not modify
            image: pytorch-test:latest                # training framework image, which can be modified
            imagePullPolicy: IfNotPresent
            env:
              - name: XDL_IP                                       # IP address of the physical node, which is used to identify the node where the pod is running
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              # ASCEND_VISIBLE_DEVICES is used by ascend-docker-runtime in the whole-card scheduling scenario with the volcano scheduler.
              # Delete it in these scenarios: static vNPU scheduling, dynamic vNPU scheduling, volcano without Ascend-volcano-plugin, or no volcano.
              - name: ASCEND_VISIBLE_DEVICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be the same as resources.requests
            command:                                  # training command, which can be modified
              - /bin/bash
              - -c
            args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=4096"]
            ports:                          # default value containerPort: 2222 name: ascendjob-port if not set
              - containerPort: 2222         # determined by user
                name: ascendjob-port        # do not modify
            resources:
              limits:
                huawei.com/Ascend910: 2
              requests:
                huawei.com/Ascend910: 2
            volumeMounts:
            - name: code
              mountPath: /job/code
            - name: data
              mountPath: /job/data
            - name: output
              mountPath: /job/output
            - name: ascend-driver
              mountPath: /usr/local/Ascend/driver
            - name: ascend-add-ons
              mountPath: /usr/local/Ascend/add-ons
            - name: dshm
              mountPath: /dev/shm
            - name: localtime
              mountPath: /etc/localtime
          volumes:
          - name: code
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
          - name: data
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/dataset/"
          - name: output
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/output/"
          - name: ascend-driver
            hostPath:
              path: /usr/local/Ascend/driver
          - name: ascend-add-ons
            hostPath:
              path: /usr/local/Ascend/add-ons
          - name: dshm
            emptyDir:
              medium: Memory
          - name: localtime
            hostPath:
              path: /etc/localtime
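
The inline comments in the sample imply a consistency rule: the annotation named in the ASCEND_VISIBLE_DEVICES fieldPath must match the resource name under resources.requests. Kubernetes additionally requires requests and limits to be equal for extended resources such as huawei.com/Ascend910. The following Python sketch checks both on one container given as a plain dict. The function name check_container and the dict layout are illustrative assumptions, not part of MindCluster.

```python
import re


def check_container(container: dict) -> list:
    """Return a list of human-readable problems found in one container spec."""
    problems = []
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # Extended resources cannot be overcommitted: requests must equal limits.
    for name, amount in requests.items():
        if name in limits and limits[name] != amount:
            problems.append(f"{name}: requests={amount} differs from limits={limits[name]}")

    # The annotation referenced by ASCEND_VISIBLE_DEVICES must name a requested resource.
    for env in container.get("env", []):
        if env.get("name") != "ASCEND_VISIBLE_DEVICES":
            continue
        field_path = env.get("valueFrom", {}).get("fieldRef", {}).get("fieldPath", "")
        match = re.fullmatch(r"metadata\.annotations\['(.+)'\]", field_path)
        if match is None:
            problems.append(f"unexpected fieldPath: {field_path!r}")
        elif match.group(1) not in requests:
            problems.append(f"{match.group(1)} is not in resources.requests")
    return problems


# Container fragment mirroring the sample above (hypothetical dict form).
sample = {
    "name": "ascend",
    "env": [{
        "name": "ASCEND_VISIBLE_DEVICES",
        "valueFrom": {"fieldRef": {"fieldPath": "metadata.annotations['huawei.com/Ascend910']"}},
    }],
    "resources": {
        "limits": {"huawei.com/Ascend910": 2},
        "requests": {"huawei.com/Ascend910": 2},
    },
}
print(check_container(sample))
```

An empty list means the container passed both checks. The sketch deliberately ignores every other field of the job.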

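Job Status Description

After a training job is launched, run kubectl get acjob to check the running status of the acjob. The possible statuses are listed below.

Table 2 acjob running statuses

Status      Description
Created     The job has been created, but one or more of its sub-resources (Pod/Service) are not yet ready.
Running     All sub-resources (Pod/Service) of the job have been scheduled and started.
Restarting  One or more sub-resources (Pod/Service) of the job failed and are being restarted according to the restart policy.
Succeeded   All sub-resources (Pod/Service) of the job have terminated successfully.
Failed      One or more sub-resources (Pod/Service) of the job failed.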
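
A script that polls kubectl get acjob usually only needs to know whether a status is final. The Python sketch below encodes the five statuses listed above. Treating Succeeded and Failed as the only final statuses is an assumption drawn from their descriptions, and the names AcjobStatus and is_terminal are illustrative.

```python
from enum import Enum


class AcjobStatus(Enum):
    CREATED = "Created"
    RUNNING = "Running"
    RESTARTING = "Restarting"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Assumption: only these two statuses are final; the others may still change.
TERMINAL = {AcjobStatus.SUCCEEDED, AcjobStatus.FAILED}


def is_terminal(status: str) -> bool:
    """Return True if the status is final; raise ValueError for an unknown status."""
    return AcjobStatus(status) in TERMINAL


for name in ("Created", "Restarting", "Failed"):
    print(name, is_terminal(name))
```

Raising on an unknown status string, rather than returning False, keeps a polling loop from waiting forever on a value the sketch does not recognize.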

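Job Exception Conditions

When a job encounters an exception, the status.conditions field of the AscendJob records the details. Each condition contains the following fields:

Field               Type    Description
type                string  Condition type, such as Failed, Restarting, Running, Succeeded, or Created
status              string  Condition status: True, False, or Unknown
lastTransitionTime  string  Time at which the condition last changed status (RFC 3339 format)
lastUpdateTime      string  Time at which the condition was last updated (RFC 3339 format)
message             string  Detailed description of the condition
reason              string  Reason code for the condition transition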
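
To work with these fields in a script, a condition can be loaded into a small typed record. The following Python sketch assumes the condition arrives as a dict with the six keys listed above. The class name Condition and the helper parse_rfc3339 are illustrative, and the parser only handles the common UTC form with a trailing Z or an explicit offset.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


def parse_rfc3339(text: str) -> datetime:
    """Parse a timestamp such as 2024-01-01T10:00:00Z into an aware datetime."""
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalize it.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


@dataclass
class Condition:
    type: str
    status: str
    last_transition_time: datetime
    last_update_time: datetime
    message: str
    reason: str

    @classmethod
    def from_dict(cls, raw: dict) -> "Condition":
        return cls(
            type=raw["type"],
            status=raw["status"],
            last_transition_time=parse_rfc3339(raw["lastTransitionTime"]),
            last_update_time=parse_rfc3339(raw["lastUpdateTime"]),
            message=raw.get("message", ""),
            reason=raw.get("reason", ""),
        )

    def is_active(self) -> bool:
        """A condition currently holds when its status is the string "True"."""
        return self.status == "True"


cond = Condition.from_dict({
    "type": "Failed",
    "status": "True",
    "lastTransitionTime": "2024-01-01T10:00:00Z",
    "lastUpdateTime": "2024-01-01T10:00:00Z",
    "message": "Job default/test-job has failed because has pod failed.",
    "reason": "JobFailed",
})
print(cond.type, cond.is_active(), cond.last_update_time.astimezone(timezone.utc).isoformat())
```

Note that status is a string, not a boolean, so it is compared against "True" rather than tested for truthiness.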

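Common Exception Reasons (reason)

Reason code             Description
JobFailed               The job failed, usually because a Pod failed
jobRestarting           The job is restarting; failed Pods are restarted according to the restart policy
SyncPodGroupFailed      Synchronizing the PodGroup failed
PodGroupNotInitialized  The PodGroup is not initialized, usually because volcano-scheduler is not running
PodGroupPending         The PodGroup is pending, usually because cluster resources are insufficient
SyncServiceFailed       Synchronizing the Service failed
PodCreateFailed         Creating a Pod failed
JobValidFailed          Job validation failed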
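
When triaging many jobs, it can help to turn a reason code into a one-line explanation. The Python sketch below restates the reason codes listed above as a lookup table. The names REASON_HINTS and describe are illustrative, and codes not listed in this section fall back to a generic message.

```python
# Reason codes and descriptions restated from this section.
REASON_HINTS = {
    "JobFailed": "The job failed, usually because a Pod failed",
    "jobRestarting": "The job is restarting; failed Pods are restarted according to the restart policy",
    "SyncPodGroupFailed": "Synchronizing the PodGroup failed",
    "PodGroupNotInitialized": "The PodGroup is not initialized, usually because volcano-scheduler is not running",
    "PodGroupPending": "The PodGroup is pending, usually because cluster resources are insufficient",
    "SyncServiceFailed": "Synchronizing the Service failed",
    "PodCreateFailed": "Creating a Pod failed",
    "JobValidFailed": "Job validation failed",
}


def describe(condition: dict) -> str:
    """Format one condition dict as 'type [reason]: explanation'."""
    reason = condition.get("reason", "")
    hint = REASON_HINTS.get(reason, "no description available for this reason code")
    return f"{condition.get('type', '?')} [{reason}]: {hint}"


print(describe({"type": "Failed", "status": "True", "reason": "JobFailed"}))
```

The lookup is case-sensitive on purpose, because jobRestarting begins with a lowercase letter while the other codes do not.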

Exception Condition Examples

status:
  conditions:
  - type: Failed
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:00Z"
    lastUpdateTime: "2024-01-01T10:00:00Z"
    message: "Job default/test-job has failed because has pod failed."
    reason: "JobFailed"
  - type: Restarting
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:00Z"
    lastUpdateTime: "2024-01-01T10:00:00Z"
    message: "Job default/test-job is unconditional retry job and remain retry times is <3>."
    reason: "jobRestarting"
  - type: Failed
    status: "True"
    lastTransitionTime: "2024-01-01T10:00:00Z"
    lastUpdateTime: "2024-01-01T10:00:00Z"
    message: "Job test-job has failed because it has reached the specified backoff limit"
    reason: "JobFailed"

Viewing Job Exception Information

Use the following commands to view a job's detailed status and exception information:

# View the full status of the AscendJob
kubectl get acjob -n <namespace> <job-name> -o yaml

# View a summary of the AscendJob conditions
kubectl get acjob -n <namespace> <job-name> -o jsonpath='{.status.conditions}'

# View the most recent AscendJob condition
kubectl get acjob -n <namespace> <job-name> -o jsonpath='{.status.conditions[-1]}'
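
The same lookup can be scripted. The following Python sketch runs kubectl with -o json and reads the last entry of status.conditions locally, mirroring the [-1] index used above. The function names are illustrative, and fetch_latest_condition assumes kubectl is on the PATH and has access to the cluster.

```python
import json
import subprocess


def build_command(namespace: str, job_name: str) -> list:
    """Build the kubectl invocation; -o json returns the whole object for local parsing."""
    return ["kubectl", "get", "acjob", "-n", namespace, job_name, "-o", "json"]


def latest_condition(job: dict):
    """Return the last entry of status.conditions, or None if there is none yet."""
    conditions = job.get("status", {}).get("conditions", [])
    return conditions[-1] if conditions else None


def fetch_latest_condition(namespace: str, job_name: str):
    """Run kubectl and return the most recent condition (requires cluster access)."""
    result = subprocess.run(
        build_command(namespace, job_name),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl exits with a non-zero code
    )
    return latest_condition(json.loads(result.stdout))
```

Passing the arguments as a list avoids shell quoting problems with the braces and brackets that appear in the jsonpath expressions.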