
Tests Usage

Install MindSpeed in editable mode:
```
pip install -e .
```
Copy the entire tests_extend directory to the root path of Megatron-LM:
```
cp -r tests_extend {PATH_TO_MEGATRON_LM}
```

Run a single test file with pytest from the Megatron-LM root path:
```
cd {PATH_TO_MEGATRON_LM}
pytest tests_extend/unit_tests/megatron/test_distrib_optimizer.py
```

Run the default CI unit tests:
```
cd {PATH_TO_MEGATRON_LM}
pytest tests_extend/unit_tests
```

Run the complete unit test suite, including slow tests:
```
cd {PATH_TO_MEGATRON_LM}
pytest tests_extend/unit_tests --run-all
```

Marking Slow Unit Tests

The default CI gate skips tests marked with `pytest.mark.slow`. The complete suite runs them only when `--run-all` is specified.
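The skip behavior described above can be sketched with standard pytest hooks. This is a minimal illustration, assuming the `--run-all` option is registered in a `conftest.py`; it is not a copy of the repository's actual `conftest.py`, and the help text and skip reason are placeholders.

```python
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag; without it, slow tests are skipped.
    parser.addoption(
        "--run-all",
        action="store_true",
        default=False,
        help="run every test, including those marked slow",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "slow: excluded from the default CI gate")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-all"):
        return  # full regression run: leave every collected test enabled
    skip_slow = pytest.mark.skip(reason="slow test; pass --run-all to include it")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With hooks like these, slow tests appear as skipped in the default report rather than disappearing from it, so it stays visible how many cases the default gate leaves out.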

Mark a test as slow when it is intended for full regression rather than the default CI gate. Typical examples include large tensor shapes, long sequences, repeated parameter combinations, and expensive distributed or integration scenarios. Keep at least one representative smoke case in the default CI gate when possible.

Mark a whole test file as slow:
```python
import pytest

pytestmark = pytest.mark.slow
```

Mark a single test as slow:
```python
@pytest.mark.slow
def test_large_shape():
    ...
```

Mark only selected parameter combinations as slow:
```python
@pytest.mark.parametrize(
    "seq_len",
    [
        1024,
        pytest.param(32768, marks=pytest.mark.slow),
    ],
)
def test_attention(seq_len):
    ...
```

All tests under `tests_extend/unit_tests/ops/triton` are marked as slow automatically, so new Triton tests do not need to add the marker explicitly.
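Directory-based marking of this kind is usually done at collection time. The sketch below shows one possible approach, assuming a `pytest_collection_modifyitems` hook in `conftest.py`; the helper `is_auto_slow` and the constant `AUTO_SLOW_PARTS` are hypothetical names introduced for illustration, and the repository's real implementation may differ.

```python
from pathlib import Path

import pytest

# Hypothetical constant: path components of the directory treated as slow.
AUTO_SLOW_PARTS = ("tests_extend", "unit_tests", "ops", "triton")


def is_auto_slow(path):
    """Return True if `path` lies under the auto-slow directory."""
    parts = Path(path).parts
    n = len(AUTO_SLOW_PARTS)
    return any(
        parts[i:i + n] == AUTO_SLOW_PARTS for i in range(len(parts) - n + 1)
    )


def pytest_collection_modifyitems(config, items):
    for item in items:
        # `item.path` exists in pytest 7 and later; `fspath` is the older attribute.
        path = getattr(item, "path", None) or item.fspath
        if is_auto_slow(str(path)):
            item.add_marker(pytest.mark.slow)
```

If the same hook also decides which slow tests to skip, the automatic marker has to be added before that check runs, otherwise the Triton tests would still execute in the default gate.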