Ascend Job
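Ascend Job, abbreviated as acjob, is a custom job type defined by MindCluster. It currently supports two ways of launching a training or inference job: configuring resource information through environment variables, or configuring resource information through a file.
Supported AI frameworks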
- MindSpore
- TensorFlow
- PyTorch
Sample
The following shows the pytorch_multinodes_acjob_910b.yaml sample.
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-pytorch
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-910b
    tor-affinity: "null" # Controls whether the job uses switch affinity scheduling. Set to null or omit the label to disable it. large-model-schema marks a large-model job; normal-schema marks an ordinary job.
spec:
  schedulerName: volcano # work when enableGangScheduling is true
  runPolicy:
    schedulingPolicy: # work when enableGangScheduling is true
      minAvailable: 2
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-pytorch
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: card-910b-2 # depends on your device model: 910bx8 is module-910b-8, 910bx16 is module-910b-16
          containers:
            - name: ascend # do not modify
              image: pytorch-test:latest # training framework image, which can be modified
              imagePullPolicy: IfNotPresent
              env:
                - name: XDL_IP # IP address of the physical node, which is used to identify the node where the pod is running
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                # ASCEND_VISIBLE_DEVICES env variable is used by ascend-docker-runtime when in the whole card scheduling scene with volcano scheduler.
                # Please delete it when in the static vNPU scheduling, dynamic vNPU scheduling, volcano without Ascend-volcano-plugin, without volcano scenes.
                - name: ASCEND_VISIBLE_DEVICES
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910'] # The value must be the same as resources.requests
              command: # training command, which can be modified
                - /bin/bash
                - -c
              args: [ "cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=4096" ]
              ports: # default value containerPort: 2222 name: ascendjob-port if not set
                - containerPort: 2222 # determined by user
                  name: ascendjob-port # do not modify
              resources:
                limits:
                  huawei.com/Ascend910: 2
                requests:
                  huawei.com/Ascend910: 2
              volumeMounts:
                - name: code
                  mountPath: /job/code
                - name: data
                  mountPath: /job/data
                - name: output
                  mountPath: /job/output
                - name: ascend-driver
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons
                  mountPath: /usr/local/Ascend/add-ons
                - name: dshm
                  mountPath: /dev/shm
                - name: localtime
                  mountPath: /etc/localtime
          volumes:
            - name: code
              nfs:
                server: 127.0.0.1
                path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
            - name: data
              nfs:
                server: 127.0.0.1
                path: "/data/atlas_dls/public/dataset/"
            - name: output
              nfs:
                server: 127.0.0.1
                path: "/data/atlas_dls/output/"
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: dshm
              emptyDir:
                medium: Memory
            - name: localtime
              hostPath:
                path: /etc/localtime
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-pytorch
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: card-910b-2 # depends on your device model: 910bx8 is module-910b-8, 910bx16 is module-910b-16
          containers:
            - name: ascend # do not modify
              image: pytorch-test:latest # training framework image, which can be modified
              imagePullPolicy: IfNotPresent
              env:
                - name: XDL_IP # IP address of the physical node, which is used to identify the node where the pod is running
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                # ASCEND_VISIBLE_DEVICES env variable is used by ascend-docker-runtime when in the whole card scheduling scene with volcano scheduler.
                # Please delete it when in the static vNPU scheduling, dynamic vNPU scheduling, volcano without Ascend-volcano-plugin, without volcano scenes.
                - name: ASCEND_VISIBLE_DEVICES
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910'] # The value must be the same as resources.requests
              command: # training command, which can be modified
                - /bin/bash
                - -c
              args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=4096"]
              ports: # default value containerPort: 2222 name: ascendjob-port if not set
                - containerPort: 2222 # determined by user
                  name: ascendjob-port # do not modify
              resources:
                limits:
                  huawei.com/Ascend910: 2
                requests:
                  huawei.com/Ascend910: 2
              volumeMounts:
                - name: code
                  mountPath: /job/code
                - name: data
                  mountPath: /job/data
                - name: output
                  mountPath: /job/output
                - name: ascend-driver
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons
                  mountPath: /usr/local/Ascend/add-ons
                - name: dshm
                  mountPath: /dev/shm
                - name: localtime
                  mountPath: /etc/localtime
          volumes:
            - name: code
              nfs:
                server: 127.0.0.1
                path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
            - name: data
              nfs:
                server: 127.0.0.1
                path: "/data/atlas_dls/public/dataset/"
            - name: output
              nfs:
                server: 127.0.0.1
                path: "/data/atlas_dls/output/"
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: dshm
              emptyDir:
                medium: Memory
            - name: localtime
              hostPath:
                path: /etc/localtime
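A quick consistency check can catch mistakes in such a manifest before it is submitted. The Python sketch below assumes the manifest has already been loaded into a dictionary (for example with a YAML parser) and checks two things that can be read off the sample: minAvailable matches the total replica count (1 Master + 1 Worker = 2), and each container requests the same number of huawei.com/Ascend910 devices as its limit. Kubernetes itself requires requests and limits to be equal for extended resources when both are set. The names check_acjob and replica are hypothetical, and the rule that minAvailable should equal the total number of replicas is an assumption inferred from the sample, not a documented MindCluster requirement.

```python
def check_acjob(manifest, resource="huawei.com/Ascend910"):
    """Return a list of human-readable problems found in an AscendJob dict."""
    problems = []
    spec = manifest.get("spec", {})
    replica_specs = spec.get("replicaSpecs", {})

    # replicas is treated as 1 when omitted.
    total = sum(rs.get("replicas", 1) for rs in replica_specs.values())
    min_available = (
        spec.get("runPolicy", {}).get("schedulingPolicy", {}).get("minAvailable")
    )
    # Assumption for illustration: minAvailable should equal the total replicas.
    if min_available is not None and min_available != total:
        problems.append(f"minAvailable={min_available} but total replicas={total}")

    for rtype, rs in replica_specs.items():
        containers = rs.get("template", {}).get("spec", {}).get("containers", [])
        for c in containers:
            res = c.get("resources", {})
            req = res.get("requests", {}).get(resource)
            lim = res.get("limits", {}).get(resource)
            if req != lim:
                problems.append(f"{rtype}/{c.get('name')}: requests={req}, limits={lim}")
    return problems


def replica(npus):
    """Build a trimmed-down replica spec with one container (hypothetical helper)."""
    return {"replicas": 1, "template": {"spec": {"containers": [{
        "name": "ascend",
        "resources": {"limits": {"huawei.com/Ascend910": npus},
                      "requests": {"huawei.com/Ascend910": npus}},
    }]}}}


# Trimmed-down version of the sample, expressed as a Python dict.
sample = {
    "spec": {
        "runPolicy": {"schedulingPolicy": {"minAvailable": 2, "queue": "default"}},
        "replicaSpecs": {"Master": replica(2), "Worker": replica(2)},
    }
}
print(check_acjob(sample))
```

An empty list means neither check found a problem; any other result lists the mismatches as plain strings.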
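Key fields
Table 1 Key fields of acjob
| Field path | Type | Format | Description |
|---|---|---|---|
| apiVersion | string | - | Defines the versioned schema of this representation of the object. The server converts recognized schemas to the latest internal value and rejects unrecognized versions. For more information, see Types. |
| kind | string | - | The REST resource type that this object represents. The value is inferred from the endpoint, cannot be updated, and uses CamelCase. For more information, see Resources. |
| metadata | object | - | Kubernetes metadata (such as namespace and labels). For more information, see Metadata. |
| spec | object | - | Specification of the desired state of the AscendJob. Required field: replicaSpecs. |
| spec.replicaSpecs | object | - | Map from ReplicaType to ReplicaSpec that specifies the MS cluster configuration. Example: { "Scheduler": ReplicaSpec, "Worker": ReplicaSpec }. |
| spec.replicaSpecs.[ReplicaType] | object | - | Description of a replica. |
| spec.replicaSpecs.[ReplicaType].replicas | integer | int32 | Number of replicas required for the given template. Default: 1. |
| spec.replicaSpecs.[ReplicaType].restartPolicy | string | - | Restart policy: Always, OnFailure, Never, or ExitCode. Default: Never. |
| spec.replicaSpecs.[ReplicaType].template | object | - | Kubernetes Pod template. For more information, see Kubernetes Pod template. |
| spec.runPolicy | object | - | Encapsulates the runtime policies of the distributed training job (such as resource cleanup and active time). |
| spec.runPolicy.backoffLimit | integer | int32 | Number of retries allowed before the job is marked as failed (optional). |
| spec.runPolicy.activeDeadlineSeconds | integer | int64 | Maximum time, in seconds, that the job may stay active. The value must be a positive integer. Currently has no effect and will be removed in a later version. |
| spec.runPolicy.cleanPodPolicy | string | - | Policy for cleaning up Pods after the job finishes. Default: Running. Currently has no effect and will be removed in a later version. |
| spec.runPolicy.ttlSecondsAfterFinished | integer | int32 | TTL (time to live) after the job finishes. Defaults to infinite; actual deletion may be delayed. Currently has no effect and will be removed in a later version. |
| spec.runPolicy.schedulingPolicy | object | - | Scheduling policy (such as gang scheduling). |
| spec.runPolicy.schedulingPolicy.minAvailable | integer | int32 | Minimum number of available resources. |
| spec.runPolicy.schedulingPolicy.minResources | object | - | Minimum set of resources, keyed by resource name (values may be integers or strings). |
| spec.runPolicy.schedulingPolicy.priorityClass | string | - | Name of the priority class. |
| spec.runPolicy.schedulingPolicy.queue | string | - | Name of the scheduling queue. |
| spec.schedulerName | string | - | Scheduler to use when gang scheduling is enabled. Only Volcano is currently supported. |
| spec.successPolicy | string | - | Criterion for marking the AscendJob as succeeded. Currently has no effect: the job is judged successful only when all Pods succeed. Will be removed in a later version. |
| status | object | - | Most recently observed status of the AscendJob (read-only). Required fields: conditions, replicaStatuses. |
| status.completionTime | string | date-time | Time the job completed (RFC 3339 format, UTC). |
| status.conditions | array | - | Array of the current job conditions. |
| status.conditions[type] | string | - | Type of the job condition (for example, "Complete"). |
| status.conditions[status] | string | - | Status of the condition: True, False, or Unknown. |
| status.conditions[lastTransitionTime] | string | date-time | Time the condition last transitioned from one status to another. |
| status.conditions[lastUpdateTime] | string | date-time | Time the condition was last updated. |
| status.conditions[message] | string | - | Detailed description of the condition. |
| status.conditions[reason] | string | - | Reason for the condition's transition. |
| status.lastReconcileTime | string | date-time | Time the job was last reconciled (RFC 3339 format, UTC). |
| status.replicaStatuses | object | - | Map from replica type to replica status. |
| status.replicaStatuses.[ReplicaType].active | integer | int32 | Number of running Pods. |
| status.replicaStatuses.[ReplicaType].failed | integer | int32 | Number of failed Pods. |
| status.replicaStatuses.[ReplicaType].succeeded | integer | int32 | Number of succeeded Pods. |
| status.replicaStatuses.[ReplicaType].labelSelector | object | - | Pod label selector (defines how Pods are selected). |
| status.replicaStatuses.[ReplicaType].labelSelector.matchExpressions | array | - | Label matching rules (operators such as In, NotIn, Exists, and DoesNotExist are supported). |
| status.replicaStatuses.[ReplicaType].labelSelector.matchLabels | object | - | Key-value pairs of labels to match (each pair is equivalent to a matchExpressions entry that uses the In operator with a single value). |
| status.startTime | string | date-time | Time the job started (RFC 3339 format, UTC). |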
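To make the defaults in Table 1 concrete, the following Python sketch fills in replicas and restartPolicy for each replica type when they are omitted and rejects a manifest without spec.replicaSpecs. It only illustrates what the table documents; with_defaults and the constants are hypothetical names, and this is not the code the AscendJob controller actually runs.

```python
import copy
import json

DEFAULT_REPLICAS = 1               # documented default for replicas
DEFAULT_RESTART_POLICY = "Never"   # documented default for restartPolicy
RESTART_POLICIES = {"Always", "OnFailure", "Never", "ExitCode"}


def with_defaults(manifest):
    """Return a copy of an AscendJob dict with the documented defaults filled in."""
    if "replicaSpecs" not in manifest.get("spec", {}):
        raise ValueError("spec.replicaSpecs is required")
    result = copy.deepcopy(manifest)  # leave the caller's dict untouched
    for rtype, rs in result["spec"]["replicaSpecs"].items():
        rs.setdefault("replicas", DEFAULT_REPLICAS)
        rs.setdefault("restartPolicy", DEFAULT_RESTART_POLICY)
        if rs["restartPolicy"] not in RESTART_POLICIES:
            raise ValueError(f"{rtype}: unknown restartPolicy {rs['restartPolicy']!r}")
    return result


job = {
    "apiVersion": "mindxdl.gitee.com/v1",
    "kind": "AscendJob",
    "metadata": {"name": "default-test-pytorch"},
    "spec": {
        "replicaSpecs": {
            "Master": {"template": {}},                 # replicas and restartPolicy omitted
            "Worker": {"replicas": 2, "template": {}},  # restartPolicy omitted
        }
    },
}
print(json.dumps(with_defaults(job)["spec"]["replicaSpecs"], indent=2))
```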
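Job states
After a training job is launched, you can run the kubectl get acjob command to check the running state of the acjob. The possible states are as follows.
Table 2 acjob running states
| State | Description |
|---|---|
| Created | The Job has been created, but one or more of its child resources (Pod/Service) are not yet ready. |
| Running | All child resources (Pod/Service) of the Job have been scheduled and started. |
| Restarting | One or more child resources (Pod/Service) of the Job failed but are being restarted according to the restart policy. |
| Succeeded | All child resources (Pod/Service) of the Job have terminated successfully. |
| Failed | One or more child resources (Pod/Service) of the Job failed. |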
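Job exception conditions
When a job encounters an exception, the status.conditions field of the AscendJob records the details. Each condition contains the following fields:
| Field | Type | Description |
|---|---|---|
| type | string | Condition type, such as Failed, Restarting, Running, Succeeded, or Created |
| status | string | Condition status: True, False, or Unknown |
| lastTransitionTime | string | Time the condition last transitioned from one status to another (RFC 3339 format) |
| lastUpdateTime | string | Time the condition was last updated (RFC 3339 format) |
| message | string | Detailed description of the condition |
| reason | string | Reason code for the condition's transition |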
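Common exception reasons (reason)
| Reason code | Description |
|---|---|
| JobFailed | The job failed, usually because a Pod failed |
| jobRestarting | The job is restarting; failed Pods are being restarted according to the restart policy |
| SyncPodGroupFailed | Failed to synchronize the PodGroup |
| PodGroupNotInitialized | The PodGroup is not initialized, usually because volcano-scheduler is not running |
| PodGroupPending | The PodGroup is pending, usually because cluster resources are insufficient |
| SyncServiceFailed | Failed to synchronize the Service |
| PodCreateFailed | Failed to create a Pod |
| JobValidFailed | Job validation failed |
Exception condition examples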
status:
  conditions:
    - type: Failed
      status: "True"
      lastTransitionTime: "2024-01-01T10:00:00Z"
      lastUpdateTime: "2024-01-01T10:00:00Z"
      message: "Job default/test-job has failed because has pod failed."
      reason: "JobFailed"
    - type: Restarting
      status: "True"
      lastTransitionTime: "2024-01-01T10:00:00Z"
      lastUpdateTime: "2024-01-01T10:00:00Z"
      message: "Job default/test-job is unconditional retry job and remain retry times is <3>."
      reason: "jobRestarting"
    - type: Failed
      status: "True"
      lastTransitionTime: "2024-01-01T10:00:00Z"
      lastUpdateTime: "2024-01-01T10:00:00Z"
      message: "Job test-job has failed because it has reached the specified backoff limit"
      reason: "JobFailed"
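Conditions like these are easy to process programmatically once the AscendJob has been exported as JSON (kubectl supports -o json). The Python sketch below picks the most recent condition and collects the reason and message of every Failed condition whose status is "True". It assumes that conditions are appended in chronological order, so the last entry is the newest; the function names latest_condition and failure_reasons are hypothetical, and the embedded JSON is a shortened copy of the example messages rather than real cluster output.

```python
import json


def latest_condition(acjob):
    """Return the last entry of status.conditions, or None if there is none."""
    conditions = acjob.get("status", {}).get("conditions") or []
    return conditions[-1] if conditions else None


def failure_reasons(acjob):
    """Collect (reason, message) for every Failed condition with status "True"."""
    conditions = acjob.get("status", {}).get("conditions") or []
    return [
        (c.get("reason"), c.get("message"))
        for c in conditions
        if c.get("type") == "Failed" and c.get("status") == "True"
    ]


# Shortened stand-in for the output of: kubectl get acjob ... -o json
raw = """
{
  "status": {
    "conditions": [
      {"type": "Restarting", "status": "True", "reason": "jobRestarting",
       "message": "Job default/test-job is unconditional retry job and remain retry times is <3>."},
      {"type": "Failed", "status": "True", "reason": "JobFailed",
       "message": "Job test-job has failed because it has reached the specified backoff limit"}
    ]
  }
}
"""
acjob = json.loads(raw)

last = latest_condition(acjob)
print(last["type"], last["reason"])
for reason, message in failure_reasons(acjob):
    print(f"{reason}: {message}")
```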
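Viewing job exception information
Use the following commands to view the detailed status and exception information of a job:
# View the full AscendJob object, including its status
kubectl get acjob -n <namespace> <job-name> -o yaml
# View all status conditions of the AscendJob
kubectl get acjob -n <namespace> <job-name> -o jsonpath='{.status.conditions}'
# View the most recent status condition of the AscendJob
kubectl get acjob -n <namespace> <job-name> -o jsonpath='{.status.conditions[-1]}'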
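When these commands are driven from a script, building the argument list explicitly avoids shell quoting problems with the braces and brackets in JSONPath expressions. The Python sketch below is one way to do that. The helper names kubectl_get_acjob_cmd and get_conditions are hypothetical, get_conditions assumes kubectl is on the PATH and configured for the target cluster, and the example only prints the command rather than running it.

```python
import json
import subprocess


def kubectl_get_acjob_cmd(namespace, name, jsonpath=None):
    """Build the argument list for 'kubectl get acjob' (JSON output by default)."""
    cmd = ["kubectl", "get", "acjob", "-n", namespace, name, "-o"]
    cmd.append(f"jsonpath={jsonpath}" if jsonpath else "json")
    return cmd


def get_conditions(namespace, name):
    """Run kubectl and return status.conditions as a list (requires cluster access)."""
    completed = subprocess.run(
        kubectl_get_acjob_cmd(namespace, name),
        check=True,            # raise CalledProcessError if kubectl fails
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout).get("status", {}).get("conditions", [])


# Only builds and prints the command; nothing is executed here.
print(kubectl_get_acjob_cmd("default", "default-test-pytorch",
                            "{.status.conditions[-1]}"))
```

Because the arguments are passed as a list and no shell is involved, the JSONPath expression needs no extra quoting.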