Cluster Scheduling Component NodeD
Quick Reference
- NodeD is maintained in the MindCluster code repository
- Where to get help
NodeD
NodeD is a MindCluster cluster scheduling component deployed on compute nodes. It detects abnormal node states, retrieves CPU, memory, and disk fault information from IPMI, and reports that information to ClusterD.
Use Cases
When a node's CPU, memory, or disk experiences certain faults, training tasks will fail. To allow training tasks to exit quickly when a node fault occurs and prevent new tasks from being scheduled to faulty nodes, MindCluster provides the NodeD component for detecting node abnormalities.
Features
- Retrieves node abnormalities from IPMI and reports them to the upper-level scheduling service.
- Periodically sends node fault information to the upper-level scheduling service.
Upstream and Downstream Dependencies
- Retrieves CPU, memory, and disk fault information from IPMI on compute nodes.
- Reports CPU, memory, and disk fault information of compute nodes to ClusterD.
Supported Tags and Dockerfile Links
Tag Convention
Tags follow this format:
<version>-<os>
| Field | Example | Description |
|---|---|---|
| version | v26.1.0 | NodeD component version |
| os | ubuntu22.04 | NodeD image operating system |
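The tag convention can be illustrated with a short shell sketch that splits a tag into its two fields. It assumes the version field itself contains no hyphen, which holds for the example tags in this document but is not stated as a rule; the variable names are illustrative only.

```shell
#!/bin/sh
# Split a NodeD image tag of the form <version>-<os> into its two fields.
# Assumption: the version part contains no hyphen (true for v26.1.0).
tag="v26.1.0-ubuntu22.04"   # example tag taken from this document

version="${tag%%-*}"   # text before the first hyphen
os="${tag#*-}"         # text after the first hyphen

echo "version=$version os=$os"
```

With the example tag this prints `version=v26.1.0 os=ubuntu22.04`.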
NodeD 26.1.0
| Tag | Dockerfile | Image Content |
|---|---|---|
| v26.1.0-ubuntu22.04 | Dockerfile.ubuntu | NodeD v26.1.0 image for Ubuntu 22.04 |
| v26.1.0-openeuler24.03 | Dockerfile.openeuler | NodeD v26.1.0 image for openEuler 24.03 |
Quick Start
Prerequisites
Software Dependencies
| Software | Supported Versions | Installation Location | Description |
|---|---|---|---|
| Kubernetes | 1.17.x to 1.34.x (1.19.x or later recommended) | All nodes | See the Kubernetes documentation |
| ClusterD | Same version as NodeD | Management nodes | Fault information reported by NodeD is aggregated by ClusterD |
Hardware Requirements
| Resource | Requirement |
|---|---|
| CPU | 0.5 cores |
| Memory | 0.3 GB |
How to Build Locally
docker build --no-cache -t noded:{tag} ./ -f Dockerfile.{os}
Note:
- TARGETPLATFORM is a global built-in parameter provided by Docker BuildKit, used to obtain the target platform of the current build, such as linux/amd64 and linux/arm64.
- This variable is automatically injected only when BuildKit is enabled. It will not be available in older Docker versions or environments with BuildKit disabled by default. Run export DOCKER_BUILDKIT=1 to enable it temporarily before executing build commands.
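As a rough sketch of how the placeholders in the build command might be filled in, the helper below prints a complete command for a given tag with BuildKit enabled for that single invocation. The function name `noded_build_cmd` is hypothetical, and the mapping from the OS field to the Dockerfile suffix (dropping the version digits) is an assumption inferred from the two Dockerfile names listed in the tag table. It prints the command rather than running it, so the result can be checked before use.

```shell
#!/bin/sh
# Print (do not run) a docker build command for a NodeD tag.
# Assumption: the Dockerfile suffix is the OS field without its version
# digits, e.g. ubuntu22.04 -> Dockerfile.ubuntu.
noded_build_cmd() {
    build_tag="$1"
    build_os="${build_tag#*-}"          # e.g. openeuler24.03
    build_distro="${build_os%%[0-9]*}"  # e.g. openeuler
    # DOCKER_BUILDKIT=1 enables BuildKit for this one command only.
    echo "DOCKER_BUILDKIT=1 docker build --no-cache -t noded:${build_tag} ./ -f Dockerfile.${build_distro}"
}

noded_build_cmd v26.1.0-openeuler24.03
```

For the tag shown, this prints `DOCKER_BUILDKIT=1 docker build --no-cache -t noded:v26.1.0-openeuler24.03 ./ -f Dockerfile.openeuler`.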
Deploy NodeD
- Pull the image
docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:{tag}
- Retag the image
docker tag swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:{tag} noded:{version}
- Start NodeD
Replace {tag} in the noded-{version}.yaml file with the actual image tag.
kubectl apply -f noded-{version}.yaml
- Verify deployment
kubectl get pods -A | grep noded
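The placeholder replacement in the "Start NodeD" step can be scripted instead of edited by hand. The sketch below is one possible approach, not the project's own tooling: `render_noded_manifest` is a hypothetical helper that substitutes every literal `{tag}` in text read from standard input. The sample manifest line is invented for the demonstration, and the real YAML file may look different.

```shell
#!/bin/sh
# Replace every literal {tag} placeholder in the text read from stdin.
# Assumption: the tag contains no '/', '&', or '\' characters, which
# would be special in the sed replacement.
render_noded_manifest() {
    sed "s/{tag}/$1/g"
}

# Demonstration on a hypothetical manifest line:
printf 'image: noded:{tag}\n' | render_noded_manifest v26.1.0-ubuntu22.04
```

The demonstration prints `image: noded:v26.1.0-ubuntu22.04`. To apply a rendered manifest directly, redirect the YAML file into the helper and pipe its output to `kubectl apply -f -`, which reads the manifest from standard input.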
Supported Hardware
For the supported Ascend hardware models, see the official documentation: Supported Product Formats and OS List
License
View the license information for the Mind series software contained in these images.
As with all container images, pre-installed software packages (Python, system libraries, etc.) may be subject to their respective license agreements.