Cluster Scheduling Component Ascend Device Plugin
Quick Reference
- Ascend Device Plugin is maintained in the MindCluster code repository
- Where to get help
Ascend Device Plugin
Ascend Device Plugin is one of the core components of the MindCluster cluster scheduling suite, deployed on compute nodes to provide resource discovery and reporting strategies tailored for Ascend devices.
Use Cases
Kubernetes must know which resources each node offers before it can schedule workloads. Beyond built-in CPU and memory information, the Kubernetes device plugin mechanism lets users define custom resource types and control how those resources are discovered and reported. Ascend Device Plugin uses this mechanism: it runs on compute nodes and provides resource discovery and reporting strategies suited to Ascend devices.
Features
- Device Discovery: Obtains chip type and model information from the driver and reports it to kubelet and the upper-level ClusterD service. Supports discovering the number of devices from the Ascend device driver and reporting the count to the Kubernetes system. Supports discovering virtual devices split from physical devices and reporting them to the Kubernetes system.
- Health Check: Subscribes to chip fault information from the driver, reports chip status to kubelet, and reports chip status along with specific fault details to the upper-level scheduling service. Supports detecting the health status of Ascend devices. When a device is unhealthy, it is reported to the Kubernetes system, which automatically removes it from the list of available devices. The health status of a virtual device is determined by the physical device from which it is split.
- Device Allocation: Supports allocating Ascend devices in the Kubernetes system. Supports NPU device rescheduling: when a device fails, a new container is automatically started, a healthy device is mounted, and the training task is rebuilt. During the resource mounting phase, it retrieves the chip information selected by the cluster scheduler and passes it to Ascend Docker Runtime via environment variables for mounting.
- Fault Handling: Fault handling levels are configurable and can be escalated when faults recur or persist for extended periods. If a faulty chip is idle and can recover after a restart, a hot reset is performed on the chip.
- Network Fault Monitoring: Subscribes to Lingqu network fault information from the Lingqu driver, reports network status to kubelet, and reports Lingqu network status along with specific fault details to the upper-level scheduling service.
Upstream and Downstream Dependencies
- Obtains chip type, quantity, and health status information from DCMI, and issues chip reset commands through DCMI.
- Reports chip type, quantity, and status to kubelet.
- Reports chip type, quantity, and specific fault information to ClusterD.
- Passes the chip information selected by the scheduler to Ascend Docker Runtime via environment variables.
Supported Tags and Dockerfile Links
Tag Convention
Tags follow this format:
<version>-<os>
| Field | Example | Description |
|---|---|---|
| version | v26.1.0 | Ascend Device Plugin component version |
| os | ubuntu22.04 | Ascend Device Plugin image operating system |
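The tag convention can be illustrated with a short shell sketch that splits a tag into its two fields. The helper name parse_tag is hypothetical and not part of Ascend Device Plugin; the sketch assumes the version field itself contains no hyphen, which holds for the tags listed in this document.

```shell
# parse_tag: hypothetical helper that splits <version>-<os> at the first hyphen.
# Assumption: the version field (e.g. v26.1.0) never contains a hyphen.
parse_tag() {
  tag="$1"
  version="${tag%%-*}"   # everything before the first hyphen
  os="${tag#*-}"         # everything after the first hyphen
  printf '%s %s\n' "$version" "$os"
}

parse_tag v26.1.0-ubuntu22.04   # prints: v26.1.0 ubuntu22.04
```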
Ascend Device Plugin 26.1.0
| Tag | Dockerfile | Image Content |
|---|---|---|
| v26.1.0-ubuntu22.04 | Dockerfile.ubuntu | Ascend Device Plugin v26.1.0 image for Ubuntu 22.04 |
| v26.1.0-openeuler24.03 | Dockerfile.openeuler | Ascend Device Plugin v26.1.0 image for openEuler 24.03 |
Quick Start
Prerequisites
Software Dependencies
| Software | Supported Versions | Installation Location | Description |
|---|---|---|---|
| Kubernetes | 1.17.x~1.34.x (1.19.x or later recommended) | All nodes | See Kubernetes Documentation |
| Docker | 18.09.x~28.5.1 | All nodes | Available from Docker |
| Containerd | 1.4.x~2.1.4 (1.6.x recommended) | All nodes | Available from Containerd |
| Ascend AI Processor Driver and Firmware | See version compatibility table | Compute nodes | See "Installing NPU Driver and Firmware" in the CANN Software Installation Guide |
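As a rough pre-installation check, the Kubernetes range in the table above (1.17.x to 1.34.x) can be tested in plain shell. The function name k8s_version_supported is hypothetical, and the sketch assumes a purely numeric major.minor.patch version string with an optional leading v; vendor suffixes on the minor number are not handled.

```shell
# k8s_version_supported: hypothetical helper, succeeds when the given
# Kubernetes version falls inside the 1.17.x to 1.34.x range from the table.
# Assumption: input looks like 1.28.3 or v1.28.3 (numeric fields only).
k8s_version_supported() {
  v="${1#v}"             # drop an optional leading "v"
  major="${v%%.*}"       # field before the first dot
  rest="${v#*.}"         # everything after the first dot
  minor="${rest%%.*}"    # second field
  [ "$major" -eq 1 ] && [ "$minor" -ge 17 ] && [ "$minor" -le 34 ]
}

if k8s_version_supported v1.28.3; then
  echo "supported"
else
  echo "outside the supported range"
fi
```

On a live cluster, the version string would come from the server side of kubectl version; how it is extracted depends on the kubectl release in use.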
Hardware Requirements
| Resource | Requirement |
|---|---|
| CPU | 0.5 cores |
| Memory | 0.5 GB |
Install Driver
The host machine must have the driver and firmware installed. For details, see "Installing NPU Driver and Firmware" in the CANN Software Installation Guide (Commercial Edition) or CANN Software Installation Guide (Community Edition).
How to Build Locally
docker build --no-cache -t ascend-k8sdeviceplugin:{tag} ./ -f Dockerfile.{os}
Note:
- TARGETPLATFORM is a global built-in parameter provided by Docker BuildKit, used to obtain the target platform of the current build, such as linux/amd64 and linux/arm64.
- This variable is automatically injected only when BuildKit is enabled. It will not be available in older Docker versions or environments with BuildKit disabled by default. Run export DOCKER_BUILDKIT=1 to enable it temporarily before executing build commands.
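The build command and the BuildKit note can be combined into one sketch. The helper build_cmd is hypothetical and only prints the command rather than running it. It assumes that the Dockerfile suffix is the OS name with the version digits removed (ubuntu22.04 maps to Dockerfile.ubuntu), which matches the two Dockerfiles listed in the tag table but is not stated as a rule by the project.

```shell
# build_cmd: hypothetical helper that prints (does not execute) the build
# command for a tag of the form <version>-<os>.
# Assumption: Dockerfile.<name> uses the OS name without its version digits.
build_cmd() {
  tag="$1"                   # e.g. v26.1.0-ubuntu22.04
  os="${tag#*-}"             # e.g. ubuntu22.04
  family="${os%%[0-9]*}"     # strip from the first digit onward, e.g. ubuntu
  echo "DOCKER_BUILDKIT=1 docker build --no-cache -t ascend-k8sdeviceplugin:${tag} ./ -f Dockerfile.${family}"
}

build_cmd v26.1.0-ubuntu22.04
```

Printing the command first makes it easy to review before piping the output to sh. Setting DOCKER_BUILDKIT=1 inline enables BuildKit for that single invocation only.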
Deploy Ascend Device Plugin
- Pull the image
docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:{tag}
- Retag the image
docker tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:{tag} ascend-k8sdeviceplugin:{tag}
- Label the nodes
Label nodes according to the Ascend processor model:
# Example: Label an Ascend 910 node
kubectl label nodes <node-name> accelerator=huawei-Ascend910
kubectl label nodes <node-name> host-arch=huawei-arm
- Start Ascend Device Plugin
Select the appropriate YAML file based on the Ascend processor model:
Replace {tag} in the YAML file with the actual image tag.
# Products other than the Atlas 200I SoC A1 core board, deployed without Volcano
kubectl apply -f device-plugin-{version}.yaml
# Products other than the Atlas 200I SoC A1 core board, deployed with Volcano
kubectl apply -f device-plugin-volcano-{version}.yaml
- Verify deployment
kubectl get pods -A | grep device-plugin
- Check node resources
kubectl describe node <npu-node-name> | grep "huawei.com/Ascend"
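The final check can be wrapped in a small function so that a script can act on the result. This is a minimal sketch: has_ascend_resource is a hypothetical name, and it reuses the same huawei.com/Ascend match as the grep above, reading the kubectl describe node output from standard input.

```shell
# has_ascend_resource: hypothetical helper that reads `kubectl describe node`
# output on stdin and succeeds if any line mentions a huawei.com/Ascend resource.
has_ascend_resource() {
  grep -qF 'huawei.com/Ascend'   # -F: match the string literally, -q: no output
}

# Usage against a live cluster (replace <npu-node-name> with a real node name):
#   kubectl describe node <npu-node-name> | has_ascend_resource \
#     && echo "Ascend resources reported" \
#     || echo "no Ascend resources found yet"
```

If the function fails shortly after deployment, the plugin pod may still be starting; rerunning the check after the pod from the previous step reaches the Running state is a reasonable next step.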
Supported Hardware
For descriptions of supported Ascend hardware models, please refer to the official documentation: Supported Product Formats and OS List
License
View the license information for the Mind series software contained in these images.
As with all container images, pre-installed software packages (Python, system libraries, etc.) may be subject to their respective license agreements.