Cluster Scheduling Component ClusterD
Quick Reference
- ClusterD is maintained in the MindCluster code repository
- Where to get help
ClusterD
ClusterD is a MindCluster cluster scheduling component deployed on management nodes. It collects and aggregates cluster task, resource, and fault information along with their impact scope, performs statistical analysis from task, chip, and fault dimensions, and uniformly determines fault handling levels and strategies.
Use Cases
A single node may experience several faults at once. If each node handled its faults independently, one task could be subjected to conflicting recovery strategies at the same time. To coordinate fault handling, MindCluster provides the ClusterD service, deployed on management nodes. It collects cluster-wide task, resource, and fault information together with the impact scope of each fault, analyzes it along the task, chip, and fault dimensions, and determines a single, consistent fault handling level and strategy.
Features
- Obtains chip, node, and network information from the Ascend Device Plugin and NodeD components, and retrieves public fault information from a ConfigMap or over gRPC.
- Aggregates this fault information and makes it available for upper-level cluster scheduling services to query.
- Establishes connections with training containers to control training processes for recomputation.
- Interacts with out-of-band services to transmit task information.
Upstream and Downstream Dependencies
- Obtains chip information from Ascend Device Plugin on each compute node.
- Obtains CPU, memory, and disk health status information, DPC shared storage fault information, and Lingqu network fault information from NodeD on each compute node.
- Retrieves public fault information from ConfigMap or gRPC.
- Aggregates resource information across the entire cluster and reports it to Ascend-volcano-plugin.
- Monitors cluster task information and reports task status, resource usage, and other information to CCAE.
- Interacts with in-container processes to control training processes for recomputation.
Supported Tags and Dockerfile Links
Tag Convention
Tags follow this format:
<version>-<os>
| Field | Example | Description |
|---|---|---|
| version | v26.1.0 | ClusterD component version |
| os | ubuntu22.04 | ClusterD image operating system |
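As a small illustration of the convention, a tag can be split into its two fields with plain shell parameter expansion. This is only a sketch: the variable names are illustrative, and it assumes the version field itself contains no hyphen, as in the example tags listed here.

```shell
# Split a ClusterD image tag of the form <version>-<os> into its two fields.
tag="v26.1.0-ubuntu22.04"

version="${tag%%-*}"  # drop everything from the first hyphen onward
os="${tag#*-}"        # drop everything up to and including the first hyphen

echo "version=${version} os=${os}"
```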
ClusterD 26.1.0
| Tag | Dockerfile | Image Content |
|---|---|---|
| v26.1.0-ubuntu22.04 | Dockerfile.ubuntu | ClusterD v26.1.0 image for Ubuntu 22.04 |
| v26.1.0-openeuler24.03 | Dockerfile.openeuler | ClusterD v26.1.0 image for openEuler 24.03 |
Quick Start
Prerequisites
Software Dependencies
| Software | Supported Versions | Installation Location | Description |
|---|---|---|---|
| Kubernetes | 1.17.x to 1.34.x (1.19.x or later recommended) | All nodes | See Kubernetes Documentation |
| Ascend Device Plugin | Same version as ClusterD | Compute nodes | ClusterD depends on Ascend Device Plugin to report chip information |
| NodeD | Same version as ClusterD | Compute nodes | ClusterD depends on NodeD to report node fault information |
Hardware Requirements
| Resource | Up to 100 Nodes | 500 Nodes | 1000 Nodes |
|---|---|---|---|
| CPU | 1 core | 2 cores | 4 cores |
| Memory | 1 GB | 2 GB | 8 GB |
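The table can be read as a simple lookup. The helper below is a hypothetical sketch, not part of ClusterD: the function name is invented, and it assumes that a node count falling between two listed sizes should use the next larger column. The table gives no figures above 1000 nodes, so the helper reports an error there instead of guessing.

```shell
# Print the CPU and memory figures from the table for a given node count.
# Assumption: counts between the listed sizes use the next larger column.
clusterd_sizing() {
  nodes="$1"
  if [ "$nodes" -le 100 ]; then
    echo "cpu=1 memory=1GB"
  elif [ "$nodes" -le 500 ]; then
    echo "cpu=2 memory=2GB"
  elif [ "$nodes" -le 1000 ]; then
    echo "cpu=4 memory=8GB"
  else
    # The table lists nothing beyond 1000 nodes.
    echo "no figure listed for more than 1000 nodes" >&2
    return 1
  fi
}

clusterd_sizing 300
```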
How to Build Locally
docker build --no-cache -t ascend-k8sclusterd:{tag} ./ -f Dockerfile.{os}
Note:
- TARGETPLATFORM is a predefined global build argument provided by Docker BuildKit. It holds the target platform of the current build, for example linux/amd64 or linux/arm64.
- This variable is injected automatically only when BuildKit is enabled. It is not available in older Docker versions or in environments where BuildKit is disabled by default. To enable BuildKit for the current shell session, run export DOCKER_BUILDKIT=1 before executing the build command.
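To show how the placeholders in the build command might be filled in, the sketch below prints (but does not run) the command for a given tag. The function name is hypothetical, and it assumes the Dockerfile suffix is the OS name with its version digits removed, which matches the Dockerfile.ubuntu and Dockerfile.openeuler files listed for 26.1.0. Check the actual file names in your checkout before relying on it.

```shell
# Print the build command for a given tag without executing it.
# Assumption: Dockerfile suffix = OS field of the tag minus trailing version digits.
clusterd_build_cmd() {
  tag="$1"
  os="${tag#*-}"                                            # e.g. ubuntu22.04
  os_name="$(printf '%s\n' "$os" | sed 's/[0-9.]*$//')"     # e.g. ubuntu
  # DOCKER_BUILDKIT=1 enables BuildKit for this one command only.
  echo "DOCKER_BUILDKIT=1 docker build --no-cache -t ascend-k8sclusterd:${tag} ./ -f Dockerfile.${os_name}"
}

clusterd_build_cmd "v26.1.0-ubuntu22.04"
```

Printing the command first makes it easy to review the resolved tag and Dockerfile name before running the build.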
Deploy ClusterD
- Pull the image
docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:{tag}
- Retag the image
docker tag swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:{tag} ascend-k8sclusterd:{tag}
- Start ClusterD
Replace {tag} in the clusterd-{version}.yaml file with the actual image tag.
kubectl apply -f clusterd-{version}.yaml
- Verify deployment
kubectl get pods -A | grep clusterd
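The {tag} replacement in the manifest can be scripted rather than edited by hand. The helper below is a minimal sketch with a hypothetical function name. It assumes the manifest contains the literal text {tag} wherever the image tag belongs, and it writes the result to standard output so the original file stays unchanged.

```shell
# Print the manifest with every literal {tag} replaced by the given image tag.
# The input file is not modified.
render_manifest() {
  manifest="$1"
  tag="$2"
  sed "s/{tag}/${tag}/g" "$manifest"
}

# Hypothetical usage (file name and tag are examples only):
#   render_manifest clusterd-{version}.yaml v26.1.0-ubuntu22.04 | kubectl apply -f -
```

Piping into kubectl apply -f - applies the rendered manifest directly. Redirecting the output to a new file instead allows the result to be inspected first.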
Supported Hardware
For descriptions of supported Ascend hardware models, please refer to the official documentation: Supported Product Formats and OS List
License
View the license information for the Mind series software contained in these images.
As with all container images, pre-installed software packages (Python, system libraries, etc.) may be subject to their respective license agreements.