Quick Start

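This document offers two quick-start scenarios to help you get started with Ascend NPU cluster scheduling:

  • 10-minute quick start: deploys only Ascend Device Plugin and Ascend Docker Runtime and uses the native Kubernetes scheduler to schedule an ordinary Pod. It quickly verifies NPU resource scheduling and suits first-time users.
  • Training workload quick start: deploys the full set of cluster scheduling components (NodeD, Ascend Device Plugin, Ascend Docker Runtime, Volcano, ClusterD, Ascend Operator) and uses a PyTorch training job to walk through the end-to-end training flow.

Choose the path that matches your needs.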

Environment Preparation

Before starting either example, make sure the cluster environment is already set up.

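10-Minute Quick Start

This tutorial walks you through setting up a minimal Ascend NPU cluster scheduling environment in about 10 minutes, using only:

  • Ascend Device Plugin - NPU device discovery and resource reporting
  • Ascend Docker Runtime - mounting of NPU devices and related resources into containers
  • Native Kubernetes scheduler - no additional scheduling component required
  • An ordinary Pod - quick verification of NPU scheduling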

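Install the Components

  1. Environment requirements

    Requirement | Description
    Compute node | Atlas 800T A2 arm64 training server (used as the example)
    Driver version | The Ascend driver matching the server is installed
  2. Pre-check

    Before you begin, make sure the NPU driver is installed correctly:

    # Check the NPU status; the output should show the chip information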
    npu-smi info
    
  3. Label the NPU node

    # Add the required label to the NPU node (replace worker01 with your node name)
    kubectl label nodes worker01 workerselector=dls-worker-node
    
  4. Deploy Ascend Docker Runtime and Ascend Device Plugin

    1. Deploy Ascend Docker Runtime

      VERSION=26.1.0
      mkdir -p /tmp/Ascend-docker-runtime
      cd /tmp/Ascend-docker-runtime
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v${VERSION}/Ascend-docker-runtime_${VERSION}_linux-aarch64.run
      chmod +x Ascend-docker-runtime_${VERSION}_linux-aarch64.run
      ./Ascend-docker-runtime_${VERSION}_linux-aarch64.run --install
      systemctl daemon-reload && systemctl restart docker
      

      The following output indicates a successful installation:

      Uncompressing ascend-docker-runtime  100%
      [INFO]: installing ascend-docker-runtime
      ...
      [INFO] ascend-docker-runtime install success
      
    2. Pull the Device Plugin image

      # Pull the Device Plugin image from the Huawei Cloud image repository
      docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v${VERSION}
      
      # Add a local tag to the image
      docker tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v${VERSION} ascend-k8sdeviceplugin:v${VERSION}
      
    3. Deploy Ascend Device Plugin

      # Download the configuration files
      mkdir -p /tmp/devicePlugin
      cd /tmp/devicePlugin
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v${VERSION}/Ascend-mindxdl-device-plugin_${VERSION}_linux-aarch64.zip
      unzip Ascend-mindxdl-device-plugin_${VERSION}_linux-aarch64.zip
      
      # Deploy Device Plugin. For a VERSION earlier than 26.1.0, the YAML file is device-plugin-910-v${VERSION}.yaml
      kubectl apply -f device-plugin-v${VERSION}.yaml
      

      Check the Device Plugin Pod status:

      kubectl get pod -n kube-system
      
      # Expected output
      NAME                                  READY   STATUS    RESTARTS   AGE
      ...
      ascend-device-plugin-daemonset-d5ctz  1/1     Running   0          11s
      ...
      
    4. Verify that NPU resources are reported

      # Check the NPU resources on the nodes
      kubectl describe node -A | grep "huawei.com/Ascend910"
      
      # Expected output (the number of available NPUs)
      huawei.com/Ascend910:     8
      huawei.com/Ascend910:     8
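
The resource check above can also be scripted. The following is a minimal sketch, not part of the official tooling: `nodes_without_npu` is a hypothetical helper that reads tab-separated `node<TAB>count` lines and prints every node that reports no `huawei.com/Ascend910` resource. The jsonpath query in the usage comment is an assumption about how to produce that input; it relies on kubectl's backslash escaping for dots inside key names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "node<TAB>count" lines on stdin and print the
# nodes whose NPU count is empty or zero. Returns 1 if any such node exists.
nodes_without_npu() {
  awk -F'\t' '
    $2 + 0 == 0 { print $1; missing = 1 }   # an empty count evaluates to 0
    END { exit (missing ? 1 : 0) }
  '
}

# Usage against a live cluster (assumed query, requires kubectl access):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.huawei\.com/Ascend910}{"\n"}{end}' \
#   | nodes_without_npu
```

An empty result with exit status 0 means every listed node reports at least one NPU.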
      

Schedule an NPU Pod

  1. Create the test Pod manifest

    Create npu-test-pod.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-test
    spec:
      containers:
      - name: npu-container
        image: ubuntu:22.04          # Test Pod image; can be customized
        command: ["/bin/bash", "-c", "sleep 3600"]
        resources:
          limits:
            huawei.com/Ascend910: 1  # Request 1 NPU
          requests:
            huawei.com/Ascend910: 1
    
  2. Deploy the test Pod

    kubectl apply -f npu-test-pod.yaml
    
  3. Verify Pod scheduling

    # Check the Pod status
    kubectl get pods npu-test -o wide
    
    # Expected output (STATUS Running means the Pod was scheduled successfully)
    NAME      READY   STATUS    RESTARTS   AGE   IP           NODE      NOMINATED NODE
    npu-test  1/1     Running   0          10s   10.244.1.2   worker01  <none>
    
  4. Verify NPU access

    # Enter the container to verify that the NPU is usable
    kubectl exec -it npu-test -- /bin/bash
    
    # Inside the container, npu-smi info should print the NPU information
    export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:${LD_LIBRARY_PATH}
    npu-smi info
    
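  5. Clean up the test resources

    # Delete the test Pod
    kubectl delete pod npu-test
    
    # Delete Device Plugin. For a VERSION earlier than 26.1.0, the YAML file is device-plugin-910-v${VERSION}.yaml
    kubectl delete -f device-plugin-v${VERSION}.yaml
    
  6. Common issues

    Problem | Cause | Solution
    Pod stays Pending | Insufficient NPU resources or mismatched node labels | Check the output of kubectl describe pod and the node labels
    Device Plugin fails to start | Incorrect driver path | Check whether /usr/local/Ascend/driver exists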
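
For the driver-path issue in the table above, a quick directory check on the node can narrow things down. The sketch below is illustrative only: `check_driver_dirs` is a hypothetical helper, and the `lib64/common` and `lib64/driver` subdirectories are taken from the `LD_LIBRARY_PATH` setting used earlier in this guide. Whether Device Plugin requires exactly these directories is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the driver directories referenced in
# this guide are missing under the given root (default /usr/local/Ascend/driver).
check_driver_dirs() {
  root="${1:-/usr/local/Ascend/driver}"
  missing=0
  for dir in "$root" "$root/lib64/common" "$root/lib64/driver"; do
    if [ ! -d "$dir" ]; then
      echo "missing: $dir"
      missing=1
    fi
  done
  return "$missing"   # 0 when every directory exists, 1 otherwise
}

# Usage on an NPU node:
# check_driver_dirs || echo "driver installation looks incomplete"
```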

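Training Workload Quick Start

This section again uses a single Atlas 800T A2 AArch64 training server as the example. It guides developers through installing the NodeD, Ascend Device Plugin, Ascend Docker Runtime, Volcano, ClusterD, and Ascend Operator components and then submitting a training job with the whole-card scheduling feature.

Procedure Overview

Table 1 Key steps

Step | Description | Further reference
Install the components | Uses an Atlas 800T A2 training server to walk you through installing the cluster scheduling components on an Ascend device. | For more parameter descriptions and installation steps, see the Installation and Deployment chapter.
Submit a training job | Uses a simple PyTorch training job to show the job submission workflow. | For more parameter descriptions and submission steps, see the Basic Scheduling chapter.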

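Install the Components

All commands in the following steps use a single Atlas 800T A2 training server as the example. For the detailed installation steps and parameter descriptions of every component, see Installation.

  1. Create node labels.

    1. Run the following command to label the compute node (the node name is "worker01" in this example).

      kubectl label nodes worker01 node-role.kubernetes.io/worker=worker workerselector=dls-worker-node masterselector=dls-master-node --overwrite
      
  2. Install the components. AArch64 is used as the example; download the packages that match your architecture.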

    Note

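    This quick start uses the helm-based deployment, which requires MindCluster 26.1.0 or later. See the helm deployment part of the Installation and Deployment chapter.

    1. Install Ascend Docker Runtime.

      VERSION=26.1.0
      mkdir -p /tmp/Ascend-docker-runtime
      cd /tmp/Ascend-docker-runtime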
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v${VERSION}/Ascend-docker-runtime_${VERSION}_linux-aarch64.run
      chmod +x Ascend-docker-runtime_${VERSION}_linux-aarch64.run
      ./Ascend-docker-runtime_${VERSION}_linux-aarch64.run --install
      systemctl daemon-reload && systemctl restart docker
      
    2. Install the NodeD, Ascend Device Plugin, Volcano, ClusterD, and Ascend Operator components with helm.

      mkdir -p /tmp/helm
      cd /tmp/helm
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v${VERSION}/Ascend-helm-deploy-tool_${VERSION}_linux.zip
      unzip Ascend-helm-deploy-tool_${VERSION}_linux.zip
      helm install mindcluster-crds mindcluster-crds-deploy-tool-*.tgz
      helm install mindcluster mindcluster-deploy-tool-*.tgz
      
    3. Verify the installation.

      1. The following helm output indicates a successful installation.

        Release "mindcluster-crds" does not exist. Installing it now.
        NAME: mindcluster-crds
        LAST DEPLOYED: ...
        NAMESPACE: mindx-dl
        STATUS: deployed
        REVISION: 1
        TEST SUITE: None
        
        Release "mindcluster" does not exist. Installing it now.
        NAME: mindcluster
        LAST DEPLOYED: ...
        NAMESPACE: mindx-dl
        STATUS: deployed
        REVISION: 1
        TEST SUITE: None
        
      2. Verify that the components are running properly, using NodeD as the example.

        # Check the NodeD Pod status
        kubectl get pod -n mindx-dl
        
        # Expected output
        NAME                                  READY   STATUS    RESTARTS   AGE
        ...
        noded-694474f599-54w6b                1/1     Running   0          11s
        ...
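
Checking every component at once, rather than NodeD alone, can be sketched as a small filter over the table output. This is an illustrative example: `pods_not_ready` is a hypothetical helper, and it assumes the default `kubectl get pod -n <namespace>` column layout (NAME, READY, STATUS, ...) without the extra NAMESPACE column that `-A` adds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pod -n <namespace>` table output on
# stdin and print the names of Pods that are not Running with all containers
# ready. Completed Pods are treated as healthy.
pods_not_ready() {
  awk '
    NR == 1 || $3 == "Completed" { next }      # skip the header and finished Pods
    {
      split($2, ready, "/")                    # READY column, for example "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Usage against a live cluster (requires kubectl access):
# kubectl get pod -n mindx-dl | pods_not_ready
```

No output means every Pod in the namespace is either Running and fully ready, or Completed.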
        

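Submit a Training Job

  1. Prepare the image.

    Download the 24.0.X ascend-pytorch training image from the Ascend image repository. The image does not contain training scripts or code; these files are usually mounted into the container at training time.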

    docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.0-A2-2.1.0-ubuntu20.04
    docker tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.0-A2-2.1.0-ubuntu20.04 ascend-pytorch:24.0.0-A2-2.1.0-ubuntu20.04
    
  2. Prepare the training job.

    1. Run the following commands to download "ResNet50_ID4149_for_PyTorch" from the master branch of the PyTorch code repository (ModelZoo-PyTorch) as the training code and extract it to "/data/atlas_dls/public/code/".

      mkdir -p /data/atlas_dls/public/code/
      cd /data/atlas_dls/public/code/
      wget "https://raw.gitcode.com/Ascend/ModelZoo-PyTorch/archive/refs/heads/master.zip?path=PyTorch/built-in/cv/classification/ResNet50_ID4149_for_PyTorch" -O ResNet50_ID4149_for_PyTorch.zip
      unzip ResNet50_ID4149_for_PyTorch.zip
      mv ModelZoo-PyTorch-master-PyTorch-built-in-cv-classification-ResNet50_ID4149_for_PyTorch ResNet50_ID4149_for_PyTorch
      
    2. Prepare the dataset for ResNet-50 yourself, comply with its usage terms, and upload it to "/data/atlas_dls/public/dataset/resnet50/imagenet".

      mkdir -p /data/atlas_dls/public/dataset/resnet50/imagenet
      cd /data/atlas_dls/public/dataset/resnet50/imagenet
      
    3. Run the following commands to fetch train_start.sh from the "samples/train/basic-training/without-ranktable/pytorch" directory of the mindcluster-deploy repository and place it in "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts".

      mkdir -p /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts
      cd /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts
      wget https://raw.gitcode.com/Ascend/mindcluster-deploy/raw/master/samples/train/basic-training/without-ranktable/pytorch/train_start.sh
      
    4. Run the following commands to fetch "pytorch_standalone_acjob_quickstart.yaml" from the "samples/train/basic-training/without-ranktable/pytorch" directory of the mindcluster-deploy repository. The sample defaults to a single-node, single-card job.

      cd /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts
      wget https://raw.gitcode.com/Ascend/mindcluster-deploy/raw/master/samples/train/basic-training/without-ranktable/pytorch/pytorch_standalone_acjob_quickstart.yaml
      
  3. Run the following command to submit the single-node, single-card job.

    kubectl apply -f /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts/pytorch_standalone_acjob_quickstart.yaml
    
  4. Run the following command to check the Pod status.

    kubectl get pod -A -o wide
    

    In the following sample output, Running indicates that the job is running properly.

    NAMESPACE        NAME                                       READY   STATUS    RESTARTS   AGE     IP                NODE      NOMINATED NODE   READINESS GATES
    default          default-test-pytorch-master-0              1/1     Running   0          6s      192.168.244.xxx   worker01   <none>           <none>
    

    Note

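    If the job stays in the Pending state after submission, see the sections "Training job is in the Pending state, cause: nodes are unavailable" or "Job is in the Pending state when resources are insufficient".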

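  5. View the training result.

    1. Run the following command on any node to view the training result.

      kubectl logs -n default default-test-pytorch-master-0
      
    2. Check the training log. Output like the following indicates that training succeeded.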

      [20251218-20:31:57] [MindXDL Service Log]server id is: 0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=7 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=6 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=5 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=4 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=3 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=2 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=1 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
        warn(
      [2025-12-18 20:32:02] [WARNING] [470] profiler.py: Invalid parameter export_type: None, reset it to text.
      /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:201: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
      warnings.warn('You have chosen to seed training. '
      /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:208: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
      warnings.warn('You have chosen a specific GPU. This will completely '
      Use GPU: 0 for training
      => creating model 'resnet50'