
Ascend DRA Driver for NPUs

This driver is designed for Ascend NPU devices. It builds on the Kubernetes Dynamic Resource Allocation (DRA) mechanism to manage NPU resources across their full lifecycle. Its core functions are as follows:

Automatically discovers Ascend NPU devices, reports them to the cluster, and synchronizes device status in real time.
Supports fine-grained filtering of NPU devices through DeviceClass combined with Common Expression Language (CEL).
Binds workload Pods to the devices published in ResourceSlices through ResourceClaim/ResourceClaimTemplate, so that resource requests are fulfilled efficiently.

Applicable environment: Kubernetes 1.34 or later, with Dynamic Resource Allocation (DRA) enabled in advance. For details on enabling DRA, refer to the official Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation

1. Overview

Dynamic Resource Allocation (DRA) is the Kubernetes mechanism for flexibly managing dedicated devices such as NPUs, GPUs, and TPUs. Its main advantage is that it opens resource configuration and scheduling to third-party device vendors, which removes the limitations of traditional device plugins and allows fine-grained, flexible resource allocation. For a deeper understanding of DRA concepts and principles, refer to the following official documentation:

Kubernetes official DRA documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
GKE DRA application documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/about-dynamic-resource-allocation

The Ascend DRA driver serves as the bridge between Ascend NPUs and the Kubernetes cluster. Through the DRA kubelet plugin and a validating admission webhook, it manages NPU resources end to end, covering device discovery, resource allocation, and CDI (Container Device Interface) injection, and so provides a one-stop resource management solution for Ascend NPUs in a Kubernetes cluster.

2. Component Description

The Ascend DRA driver contains a single core component, ascend-npu-dra-kubeletplugin, which keeps deployment and maintenance simple. It is deployed as a DaemonSet on every cluster node equipped with Ascend NPUs. Its core responsibilities are:

Device discovery: Automatically scans the Ascend NPU devices on the node (through DCMI or sysfs) and reports device details to the kubelet.
Resource scheduling: Responds to ResourceClaim requests and handles allocation and release of NPU resources so that they are used efficiently.
CDI injection: Injects CDI configuration into containers that run NPU tasks so that the containers can access the NPU devices.
Health check: Monitors the running status of NPU devices in real time and reports device exceptions promptly to keep workloads stable.

3. Feature Capabilities

3.1 NPU Resource Management

Automatic device discovery: Identifies the Ascend NPU devices on the node through DCMI (Device Management Interface) or the sysfs file system, with no manual configuration of device information.
Flexible resource allocation: Uses the DRA mechanism to allocate NPU resources according to workload needs, so that workloads with different compute requirements can be served.
CDI specification compatibility: Injects device access configuration into containers through CDI (Container Device Interface), which keeps containers compatible with NPU devices and simplifies container deployment.
Bus-Id device recognition: Automatically obtains the Bus-Id of each NPU device and uses it as the unique device identifier, which avoids device confusion and improves allocation accuracy.

4. Usage Guide

The following steps cover the complete deployment and use of the Ascend DRA driver. Execute them in order so that the driver runs correctly.

4.1 Install Dependencies

Before using the Ascend DRA driver, make sure the cluster meets the following dependencies. Using the versions recommended by the openFuyao community improves compatibility:

Kubernetes: 1.34 or later, with DRA enabled (recommended version: v1.34.3).
Container runtime: containerd v1.7.0 or later (recommended version: v2.1.1).
Hardware and driver: Ascend NPU hardware is installed on the cluster nodes, and the Ascend NPU driver version that matches the hardware model has been deployed.
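
Before continuing, it can help to confirm that the cluster actually serves the DRA API. The following is a minimal sketch, assuming kubectl is already configured for the target cluster; check_dra_api is a hypothetical helper name and is not part of the driver.

```shell
# Sketch: confirm the cluster serves the DRA API group before installing the driver.
# check_dra_api is a hypothetical helper name, not something shipped with the driver.
check_dra_api() {
  # Lists the kinds served under resource.k8s.io
  # (deviceclasses, resourceclaims, resourceclaimtemplates, resourceslices).
  kubectl api-resources --api-group=resource.k8s.io
  # Prints the kubelet version reported by each node.
  kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
}
```

Run check_dra_api after defining it. If the first command lists no resources, DRA is most likely not enabled on the API server.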

4.2 Build Image

4.2.1 Clone Code Repository

Clone the openFuyao community NPU DRA plugin repository and check out the release-1.0.x branch:

git clone -b release-1.0.x https://gitcode.com/openFuyao/npu-dra-plugin.git
cd npu-dra-plugin

4.2.2 Build Configuration Parameter Description

The driver accepts the following parameters so that it can be adapted to different cluster environments. The default values suit most scenarios:

--cdi-root: Directory where CDI specification files are stored (default: /etc/cdi).
--device-profile: Device profile type (allowed values: gpu, npu; default: npu).
--driver-name: DRA driver name (must match the DeviceClass configuration used later).
--kubelet-registrar-directory-path: Kubelet registration directory (used by the plugin to communicate with the kubelet).
--kubelet-plugins-directory-path: Kubelet plugin directory (the path where the plugin is deployed).
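
As an illustration of how these flags might be set together, the sketch below writes a container args fragment to a scratch file. The driver name npu.huawei.com is taken from the ResourceSlice example later in this document. The two /var/lib/kubelet paths are the usual kubelet locations and are an assumption, since the driver's own defaults are not stated here. The file name is hypothetical.

```shell
# Sketch only: a container "args" fragment that sets every documented flag.
# The /var/lib/kubelet paths are the conventional kubelet directories (assumed,
# not confirmed as this driver's defaults). The output file name is hypothetical.
cat > kubeletplugin-args-example.yaml <<'EOF'
args:
- --cdi-root=/etc/cdi
- --device-profile=npu
- --driver-name=npu.huawei.com
- --kubelet-registrar-directory-path=/var/lib/kubelet/plugins_registry
- --kubelet-plugins-directory-path=/var/lib/kubelet/plugins
EOF
```

Compare the fragment with the args already present in the DaemonSet template before changing anything, because the template in the repository is the authoritative source.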

4.2.3 Build Container Image

Enter the driver source directory and run the following command to build the container image. Change the image name and tag as needed:

# Enter driver source code directory
cd Ascend-npu-dra-plugin
# Build image (image name can be customized, such as ascend-npu-dra-plugin:v1.0.0)
docker build -f build/Dockerfile -t ascend-npu-dra-plugin:latest .

4.3 Deploy Driver Component

The driver component is deployed through Kubernetes resource manifests. First change the image address in the manifests so that it matches the image built in section 4.2, then run the deployment commands:

# Deploy DRA Kubelet plugin (DaemonSet)
kubectl apply -f manifests/ascend-npu-dra-kubeletplugin-daemonset-template.yaml
# Deploy NPU DeviceClass (used for device filtering)
kubectl apply -f manifests/npu-deviceclass-template.yaml

Note: Before deployment, edit the manifest files above and set each image field to the actual path of the image built in section 4.2 (for example, private registry address + image name + tag). Otherwise the deployment will fail.

4.4 Run Workload

After the driver is deployed, submit a workload with the example manifest to verify that the driver works correctly:

kubectl apply -f manifests/test-resourceclaimtemplate-pod-template.yaml

After submission, verify that resource allocation succeeded by using the commands in the "Related Operations" section below.
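
For orientation, a workload that requests one NPU through a ResourceClaimTemplate might look like the sketch below. It is not the content of the repository's test manifest. The DeviceClass name npu.huawei.com, the object names, the file name, and the busybox image are all assumptions chosen for illustration; the field layout follows the resource.k8s.io/v1 API.

```shell
# Sketch: a ResourceClaimTemplate plus a Pod that consumes it.
# All names, the DeviceClass name, and the image are illustrative assumptions.
cat > npu-workload-example.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-npu
spec:
  spec:
    devices:
      requests:
      - name: npu
        exactly:
          deviceClassName: npu.huawei.com   # assumed DeviceClass name
---
apiVersion: v1
kind: Pod
metadata:
  name: npu-test
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: busybox:1.36                     # placeholder image
    command: ["sleep", "3600"]
    resources:
      claims:
      - name: npu                           # refers to the pod-level claim below
  resourceClaims:
  - name: npu
    resourceClaimTemplateName: single-npu
EOF
```

The file can then be applied with kubectl apply -f npu-workload-example.yaml once the DeviceClass name has been replaced with the one actually deployed in the cluster.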

5. Related Operations

The following commands query DRA API resources and help troubleshoot the driver's running status and resource allocation.

5.1 Query ResourceSlice

A ResourceSlice holds the details of the NPU devices on a node. Query it with the following commands:

# Query all ResourceSlices in the cluster
kubectl get resourceslices

# Query detailed information of a specified ResourceSlice (including device attributes, capacity, etc.)
kubectl get resourceslices <resourceslice_name> -o yaml

The output lists every discovered device. The device attributes and capacity shown there can be referenced in CEL expressions for device filtering. An example ResourceSlice follows.

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  creationTimestamp: "2026-02-27T01:55:37Z"
  generateName: master-npu.huawei.com-
  generation: 1
  name: master-npu.huawei.com-9gv8l
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: master
    uid: 6ef76e72-da36-44e3-b9c3-93f44684a859
  resourceVersion: "2225369"
  uid: 0c1b399e-4fa8-4279-93f8-b92a1faeff6f
spec:
  devices:
  - attributes:
      chipName:
        string: 910B4
      numaNode:
        int: 6
      physicalId:
        int: 0
      topologyGroup:
        string: ring-0
    capacity:
      memCapacity:
        value: 32Gi
    name: npu-0
  driver: npu.huawei.com
  nodeName: master
  pool:
    generation: 1
    name: master
    resourceSliceCount: 1
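
The attributes and capacity in the ResourceSlice above can be matched from a claim with a CEL selector. The sketch below writes a ResourceClaim that asks for a 910B4 device with at least 32Gi of memory. It assumes a DeviceClass named npu.huawei.com exists; the claim name and file name are hypothetical. Attribute and capacity names that carry no domain prefix are qualified by the driver name, which is why the expression indexes by "npu.huawei.com".

```shell
# Sketch: a ResourceClaim whose CEL selector matches the chipName attribute and
# memCapacity capacity from the ResourceSlice example.
# The DeviceClass name and claim name are illustrative assumptions.
cat > npu-cel-claim-example.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: npu-910b4-claim
spec:
  devices:
    requests:
    - name: npu
      exactly:
        deviceClassName: npu.huawei.com   # assumed DeviceClass name
        selectors:
        - cel:
            expression: |-
              device.attributes["npu.huawei.com"].chipName == "910B4" &&
              device.capacity["npu.huawei.com"].memCapacity.compareTo(quantity("32Gi")) >= 0
EOF
```

Other attributes from the slice, such as numaNode or topologyGroup, can be added to the expression in the same way.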

5.2 Query DeviceClass

A DeviceClass defines a category of NPU devices. Query it with the following command:

kubectl get deviceclasses
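
To help read the query output, the sketch below shows what a minimal DeviceClass that selects every device published by this driver could look like. It is not the content of npu-deviceclass-template.yaml. The class name and file name are assumptions; the driver name npu.huawei.com comes from the ResourceSlice example above.

```shell
# Sketch: a minimal DeviceClass selecting all devices from the npu.huawei.com driver.
# The metadata.name value and the output file name are illustrative assumptions.
cat > npu-deviceclass-example.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: npu.huawei.com
spec:
  selectors:
  - cel:
      expression: device.driver == "npu.huawei.com"
EOF
```

Claims then refer to the class through deviceClassName, so the name reported by kubectl get deviceclasses is the value to use there.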

5.3 Query ResourceClaim

A ResourceClaim records a resource request and its allocation result. Specify the namespace to query:

kubectl get resourceclaims -n <namespace>
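
To see which devices each claim received, the allocation result can be read from the claim status. The following is a sketch that assumes the resource.k8s.io/v1 status layout (status.allocation.devices.results); show_npu_allocations is a hypothetical helper name.

```shell
# Sketch: print each ResourceClaim in a namespace with the device names allocated to it.
# show_npu_allocations is a hypothetical helper; the namespace defaults to "default".
show_npu_allocations() {
  claim_ns="${1:-default}"
  kubectl get resourceclaims -n "$claim_ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocation.devices.results[*].device}{"\n"}{end}'
}
```

For example, show_npu_allocations my-namespace prints one line per claim. A claim with an empty second column has not been allocated yet.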

6. Contribution and Support

6.1 Contribution Guide

Developers are welcome to help optimize and evolve the driver. To contribute, use the following channels:

Visit the GitCode repository: https://gitcode.com/openFuyao/npu-dra-plugin.git.
Submit a pull request with a detailed description of the feature or the fix.
Submit an issue to report a bug found during use or to request a feature.

6.2 Support Channels

If you encounter problems while deploying or using the driver, open an issue on the GitCode project page. We will respond promptly and provide technical support. Thank you for using the Ascend DRA driver; your feedback helps us keep improving it.