
Training Schemes and Features


MindSpeed LLM provides training schemes including distributed pre-training and distributed fine-tuning.

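Distributed Pre-training

Measured pre-training performance with MindSpeed LLM:

| Model family | Model | Hardware | Cluster size | MFU |
| --- | --- | --- | --- | --- |
| LLAMA2 | LLAMA2-7B | Atlas 900 A2 PODc | 1x8 | 69.0% |
| LLAMA2 | LLAMA2-13B | Atlas 900 A2 PODc | 1x8 | 64.7% |
| LLAMA2 | LLAMA2-70B | Atlas 900 A2 PODc | 4x8 | 44.1% |
| Mixtral | Mixtral-8x7B | Atlas 900 A2 PODc | 8x8 | 31.7% |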
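
MFU (model FLOPs utilization) is the ratio of the FLOPs a training run actually achieves to the theoretical peak of the hardware. This README does not state how its MFU values were computed, so the following is only a minimal sketch using the common approximation of 6 FLOPs per parameter per token for dense transformers (attention FLOPs are ignored). The function name, the throughput, and the per-device peak are hypothetical placeholders, not values taken from the table.

```python
def estimate_mfu(num_params, tokens_per_second, num_devices, peak_flops_per_device):
    """Estimate MFU with the 6 * N FLOPs-per-token approximation (assumption)."""
    # Forward + backward pass of a dense transformer costs roughly 6 FLOPs
    # per parameter per token; attention FLOPs are ignored in this sketch.
    achieved_flops = 6.0 * num_params * tokens_per_second
    peak_flops = num_devices * peak_flops_per_device
    return achieved_flops / peak_flops


# Hypothetical inputs for illustration only, not measurements from this README.
mfu = estimate_mfu(
    num_params=7e9,               # 7B-parameter model
    tokens_per_second=20_000,     # cluster-wide training throughput
    num_devices=8,                # one 1x8 node
    peak_flops_per_device=300e12, # placeholder peak, not an Atlas specification
)
print(f"MFU = {mfu:.1%}")
```

For long sequences the attention term becomes significant, so a more careful estimate would add it to the achieved FLOPs.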

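Pre-training Schemes

| Scheme | Mcore | Released | Contributor |
| --- | --- | --- | --- |
| Multi-sample dataset pre-training | | | Ascend |
| Multi-sample pack-mode pre-training | | | |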
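
Pack mode generally means concatenating several tokenized samples into one fixed-length sequence so that less of each sequence is wasted on padding. The sketch below shows a simple greedy version of that idea. It is an illustration under stated assumptions, not the MindSpeed LLM implementation: `pack_samples`, `max_len`, and `pad_id` are hypothetical names.

```python
def pack_samples(samples, max_len, pad_id=0):
    """Greedily concatenate tokenized samples into sequences of exactly max_len tokens."""
    packed, current = [], []
    for tokens in samples:
        tokens = tokens[:max_len]  # truncate samples longer than one sequence
        # Close the current sequence if the next sample would not fit.
        if current and len(current) + len(tokens) > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current = current + tokens
    if current:
        # Pad and emit the final, partially filled sequence.
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_samples(samples, max_len=6)
print(packed)
```

A real pack-mode pipeline would typically also record the sample boundaries inside each packed sequence so that attention can be restricted to tokens of the same sample.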

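Acceleration Features

| Scenario | Feature | Mcore | Released | Contributor |
| --- | --- | --- | --- | --- |
| SPTD parallelism | Tensor parallelism | | | Ascend |
| SPTD parallelism | Pipeline parallelism | | | |
| SPTD parallelism | Virtual pipeline parallelism | | | |
| SPTD parallelism | Sequence parallelism | | | |
| SPTD parallelism | noop layers | | | |
| Long-sequence parallelism | Ascend Ring Attention long-sequence parallelism | | | |
| Long-sequence parallelism | Ulysses long-sequence parallelism | | | |
| Long-sequence parallelism | Hybrid long-sequence parallelism | | | |
| MoE | MoE expert parallelism | | | |
| MoE | MoE permutation communication optimization | | | |
| Memory optimization | Parameter copy reuse | | | |
| Memory optimization | Distributed optimizer | | | |
| Memory optimization | Swap Attention | | | |
| Memory optimization | Recomputation | | | |
| Memory optimization | Norm recomputation | | | |
| Memory optimization | O2 BF16 Optimizer | | | |
| Fused operators | Flash attention | | | |
| Fused operators | Flash attention variable length | | | |
| Fused operators | Fused rmsnorm | | | |
| Fused operators | Fused swiglu | | | |
| Fused operators | Fused rotary position embedding | | | |
| Fused operators | GMM | | | |
| Fused operators | Matmul Add | | | |
| Communication optimization | Gradient reduce communication-computation overlap | | | |
| Communication optimization | Recompute in advance | | | |
| Communication optimization | Weight all-gather communication-computation overlap | | | |
| Communication optimization | MC2 | | | |
| Communication optimization | CoC | | | |
| Communication optimization | Ascend Gloo checkpoint saving optimization | | | |
| Optimizer | Muon optimizer | | | |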

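Distributed Fine-tuning

Measured instruction fine-tuning performance with MindSpeed LLM:

| Model | Hardware | Cluster | Scheme | Sequence | Throughput | MFU |
| --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B | Atlas 900 A2 PODc | 1x8 | Full-parameter | dynamic | 15.87 samples/s | - |
| Llama2-7B | Atlas 900 A2 PODc | 1x8 | Full-parameter | 16K | 1.14 samples/s | 37.4% |
| Llama2-7B | Atlas 900 A2 PODc | 1x8 | Full-parameter | 32K | 0.51 samples/s | 48.4% |
| Llama2-13B | Atlas 900 A2 PODc | 1x8 | Full-parameter | dynamic | 50.4 samples/s | - |
| Llama2-70B | Atlas 900 A2 PODc | 1x8 | LoRA | dynamic | 15.2 samples/s | - |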

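Fine-tuning Schemes

| Scheme | Mcore | LoRA | QLoRA | Released | Contributor |
| --- | --- | --- | --- | --- | --- |
| Single-sample fine-tuning | | | | | Ascend |
| Multi-sample pack fine-tuning | | | | | NAIE |
| Multi-turn dialogue fine-tuning | | | | | Ascend |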
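
For readers unfamiliar with the LoRA column: LoRA keeps the base weight W frozen and trains a low-rank update, so the layer output becomes y = W x + (alpha / r) · B (A x). The sketch below shows that arithmetic in plain Python. It illustrates generic LoRA only, not MindSpeed LLM's LoRA, QLoRA, or CCLoRA code, and all names in it are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in base weight; A (r x d_in) and
    B (d_out x r) are the trainable low-rank factors.
    """
    r = len(A)                        # LoRA rank
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0, 1.0]]              # trainable factor, rank r = 1
B = [[0.5], [0.0]]            # trainable factor; nonzero here so the update is visible
x = [2.0, 3.0]
y = lora_forward(W, A, B, x, alpha=2.0)
print(y)
```

In standard LoRA, B is initialized to zero so that training starts from the unmodified base model; only A and B receive gradients.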

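Acceleration Features

| Scenario | Feature | Mcore | Released | Contributor |
| --- | --- | --- | --- | --- |
| LoRA fine-tuning | CCLoRA | | | Ascend |
| QLoRA fine-tuning | CCLoRA | | | NAIE |
| Long-sequence fine-tuning | Long-sequence CP | | | Ascend |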