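Training Schemes and Features

MindSpeed LLM provides training schemes including distributed pretraining and distributed fine-tuning.

Distributed Pretraining

Measured pretraining performance with MindSpeed LLM is as follows:

Model Family	Model	Hardware	Cluster Size	MFU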
LLAMA2	LLAMA2-7B	Atlas 900 A2 PODc	1x8	69.0%
	LLAMA2-13B	Atlas 900 A2 PODc	1x8	64.7%
	LLAMA2-70B	Atlas 900 A2 PODc	4x8	44.1%
Mixtral	Mixtral-8x7B	Atlas 900 A2 PODc	8x8	31.7%
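
The MFU column above is reported without a formula. As a rough illustration only, MFU (model FLOPs utilization) is commonly estimated as the model FLOPs achieved per second divided by the aggregate peak FLOPs of the devices. The sketch below uses the widely used approximation of 6 × N FLOPs per token for a dense transformer with N parameters; whether the figures in the table were computed this way is not stated here. The function name `estimate_mfu`, the throughput, and the peak-FLOPs value are hypothetical and are not measurements from this document.

```python
def estimate_mfu(
    num_params: float,
    tokens_per_second: float,
    num_devices: int,
    peak_flops_per_device: float,
) -> float:
    """Estimate MFU for a dense transformer.

    Assumption: training costs about 6 * num_params FLOPs per token
    (forward + backward), ignoring attention-score FLOPs.
    """
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    peak_flops_per_second = num_devices * peak_flops_per_device
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical inputs: a 7B-parameter model on a 1x8 cluster.
# The throughput and per-device peak are placeholders, not vendor or measured figures.
example = estimate_mfu(
    num_params=7e9,
    tokens_per_second=2.0e4,
    num_devices=8,
    peak_flops_per_device=3.0e14,
)
print(f"Estimated MFU: {example:.1%}")
```

For MoE models such as Mixtral, this approximation would need the number of parameters active per token rather than the total parameter count.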

Pretraining Schemes

Scheme	Mcore	Released	Contributor
Multi-sample dataset pretraining	✅	✅	【Ascend】
Multi-sample pack-mode pretraining	✅	❌	【Ascend】

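Acceleration Features

Scenario	Feature	Mcore	Released	Contributor
SPTD parallelism	Tensor parallelism	✅	✅	【Ascend】
	Pipeline parallelism	✅	✅
	Virtual pipeline parallelism	✅	✅
	Sequence parallelism	✅	✅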
	noop layers	✅	✅
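Long-sequence parallelism	Ascend Ring Attention long-sequence parallelism	✅	✅
	Ulysses long-sequence parallelism	✅	✅
	Hybrid long-sequence parallelism	✅	✅
MOE	MOE expert parallelism	✅	✅
MOE	MOE reordering communication optimization	✅	✅
Memory optimization	Parameter replica reuse	✅	✅
	Distributed optimizer	✅	✅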
	Swap Attention	✅	✅
	Recomputation	✅	✅
	Norm recomputation	✅	✅
	O2 BF16 Optimizer	✅	❌
Fused operators	Flash attention	✅	✅
	Flash attention variable length	✅	✅
	Fused rmsnorm	✅	✅
	Fused swiglu	✅	✅
	Fused rotary position embedding	✅	✅
	GMM	✅	✅
	Matmul Add	✅	✅
Communication optimization	Gradient reduce communication-computation overlap	✅	✅
	Recompute in advance	✅	✅
	Weight all-gather communication-computation overlap	✅	✅
	MC2	✅	❌
	CoC	✅	❌
	Ascend Gloo checkpoint saving optimization	✅	✅
Optimizer	Muon optimizer	✅	❌

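Distributed Fine-tuning

Measured instruction fine-tuning performance with MindSpeed LLM is as follows:

Model	Hardware	Cluster	Scheme	Sequence	Performance	MFU
Llama2-7B	Atlas 900 A2 PODc	1x8	Full-parameter	dynamic	15.87 samples/s	-
			Full-parameter	16K	1.14 samples/s	37.4%
			Full-parameter	32K	0.51 samples/s	48.4%
Llama2-13B	Atlas 900 A2 PODc	1x8	Full-parameter	dynamic	50.4 samples/s	-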
Llama2-70B	Atlas 900 A2 PODc	1x8	LoRA	dynamic	15.2 samples/s	-

Fine-tuning Schemes

Scheme	Mcore	LoRA	QLoRA	Released	Contributor
Single-sample fine-tuning	✅	✅	✅	✅	【Ascend】
Multi-sample pack fine-tuning	✅	✅	❌	❌	【NAIE】
Multi-turn dialogue fine-tuning	✅	✅	❌	❌	【Ascend】

Acceleration Features

Scenario	Feature	Mcore	Released	Contributor
LoRA fine-tuning	CCLoRA	✅	✅	【Ascend】
QLoRA fine-tuning	CCLoRA	❌	❌	【NAIE】
Long-sequence fine-tuning	Long-sequence CP	✅	❌	【Ascend】