File | Last commit | Last updated
feat(pytorch): add muon optimizer feature
Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>
# message auto-generated for no-merge-commit merge:
!4378 merge muon into master

feat(pytorch): add muon optimizer feature

Created-by: HANHU1CHEN
Commit-by: HanhuiChen
Merged-by: ascend-robot
Description:
## What this PR does / why we need it?
Adds TensorParallel Muon optimizer support. Muon replaces the standard SGD momentum update with a Newton-Schulz orthogonalized update, which the PR describes as converging faster than AdamW on matrix parameters.
## Does this PR introduce any user-facing change?
Yes. Users can enable Muon via --optimizer muon and configure it with --muon-momentum-beta, --muon-num-ns-steps, --muon-scale-mode, etc. Non-matrix parameters (embeddings, norms, biases) automatically fall back to AdamW via ChainedOptimizer.
Note: MoE expert weights must not use GMM fusion kernels when Muon is enabled, because fused parameters require per-expert NS orthogonalization, which is not yet supported.
## How was this patch tested?
Verified loss alignment between GPU (Megatron-Core 0.16.0) and NPU (MindSpeed-LLM master branch) on Qwen3 models.

See merge request: Ascend/MindSpeed-LLM!4378
Last updated: 1 month ago
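The update rule described in this commit message can be sketched in a few lines. This is a minimal illustration, not the code merged in !4378: it uses plain Python lists instead of tensors, and it uses the classic cubic Newton-Schulz iteration (X <- 1.5 X - 0.5 X X^T X) rather than whatever polynomial and coefficients the actual optimizer applies. The function names (`newton_schulz`, `muon_step`, `pick_optimizer`), the default values, the momentum form, and the name-based routing rule are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, num_steps=10, eps=1e-7):
    """Approximate the orthogonal factor of g (hypothetical helper)."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which is inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / (norm + eps) for v in row] for row in g]
    for _ in range(num_steps):
        # Classic cubic step: X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(xr, yr)] for xr, yr in zip(x, xxt_x)]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95, num_steps=10):
    """One simplified Muon-style update for a single matrix parameter."""
    # Assumed momentum form: buf <- beta * buf + grad.
    momentum = [[beta * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    # Orthogonalize the momentum buffer before applying it.
    update = newton_schulz(momentum, num_steps)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, momentum


def pick_optimizer(name, shape):
    """Hypothetical routing rule: 2-D weights use Muon, the rest use AdamW."""
    is_matrix = len(shape) == 2
    excluded = any(tag in name for tag in ("embedding", "norm", "bias"))
    return "muon" if is_matrix and not excluded else "adamw"
```

For a full-rank gradient, `newton_schulz` drives all singular values toward 1, so the applied update keeps the gradient's directions while discarding its scale. The routing helper mirrors the stated fallback of embeddings, norms, and biases to AdamW; the real selection logic lives in ChainedOptimizer and may differ.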
feat(pytorch): Add MindSpeed Muon feature
Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>
# message auto-generated for no-merge-commit merge:
!4549 merge master into master

feat(pytorch): Add MindSpeed Muon feature

Created-by: HANHU1CHEN
Commit-by: HanhuiChen
Merged-by: ascend-robot
Description:
## What this PR does / why we need it?
Replaces the self-maintained in-repo Muon optimizer with MindSpeed's native Muon implementation, removes the legacy code, and adapts the patch registration accordingly.
## Does this PR introduce any user-facing change?
No change to the Muon usage interface; existing Muon training scripts and arguments continue to work. Only the underlying implementation switches to MindSpeed's native version.
## How was this patch tested?
Precision was verified by training with the native Muon optimizer and comparing against the previous self-maintained implementation; loss and grad-norm behavior were consistent.

See merge request: Ascend/MindSpeed-LLM!4549
Last updated: 20 hours ago