Ascend Large-Model Acceleration Library

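Introduction

MindSpeed is a large-model acceleration library for Huawei Ascend devices.

Training a large model is a complex process with many technical challenges. One of the hardest is the large amount of device memory it requires, which places heavy demands on the accelerator cards. To let training span multiple cards when a single card does not have enough memory, the industry has produced third-party acceleration libraries such as Megatron and DeepSpeed. These libraries partition the model, the input data, and other state across cards, then combine the results through collective communication.

Ascend provides the MindSpeed acceleration library so that customers can migrate large-model workloads to Ascend devices quickly. It also supports Ascend-specific algorithms and is intended to work out of the box.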
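
The partition-then-gather idea mentioned in the introduction can be illustrated with a toy sketch. This is not MindSpeed or Megatron code: plain Python lists stand in for tensors, list slots stand in for cards, and the helper names (`split_columns`, `all_gather_columns`) are made up for this example. It splits a weight matrix column-wise, multiplies each shard separately, and concatenates the partial outputs.

```python
# Toy illustration only: lists stand in for tensors, list slots for cards.

def matmul(x, w):
    """Multiply matrix x (m x k) by matrix w (k x n); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split w column-wise into `parts` equally sized shards."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across shards"
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def all_gather_columns(partials):
    """Concatenate the per-shard outputs along the column axis."""
    return [sum((p[r] for p in partials), []) for r in range(len(partials[0]))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

shards = split_columns(w, parts=2)                  # one shard per simulated card
partials = [matmul(x, shard) for shard in shards]   # independent local compute
gathered = all_gather_columns(partials)             # stand-in for collective communication

print(gathered == matmul(x, w))
```

Each simulated card only holds half of the weight columns, which is the memory saving that motivates this kind of partitioning.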

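Installation

1. Install dependencies

Before installing MindSpeed, install the latest Ascend software stack as listed in the version compatibility table.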

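Dependency	Installation guide
Ascend NPU driver	"Driver and Firmware Installation Guide"
Ascend NPU firmware	"Driver and Firmware Installation Guide"
Toolkit (development kit)	"CANN Software Installation Guide"
Kernel (operator package)	"CANN Software Installation Guide"
NNAL (Ascend Transformer Boost acceleration library)	"CANN Software Installation Guide"
PyTorch	"Ascend Extension for PyTorch Configuration and Installation"
torch_npu plugin	"Ascend Extension for PyTorch Configuration and Installation"
apex	"Ascend Extension for PyTorch Configuration and Installation"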

2. Install MindSpeed

Download the source and install it:

```
git clone -b 1.0.0_core_r0.6.0 https://gitee.com/ascend/MindSpeed.git
pip install -e MindSpeed
```

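To use operators from the Ascend Transformer Boost (ATB) acceleration library, first install CANN-NNAL and source its environment setup script, for example:

```
# The default CANN-NNAL installation path is /usr/local/Ascend/nnal
# Run the environment setup script set_env.sh in the atb folder under that path
source /usr/local/Ascend/nnal/atb/set_env.sh
```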

3. Get Megatron-LM and check out the required branch

```
# This version is based on the core_r0.6.0 release
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
```

Quick start

Using the GPT model as an example:

In the Megatron-LM directory, edit pretrain_gpt.py and add the line import mindspeed.megatron_adaptor directly below import torch:

```
 import os
 import torch
+import mindspeed.megatron_adaptor
 from functools import partial
 from typing import Union
```

In the same pretrain_gpt.py file, delete the following assertion from the model_provider function: assert(args.context_parallel_size == 1), "Context parallelism is only supported with Megatron Core!"

```
     else:
-        assert(args.context_parallel_size == 1), "Context parallelism is only supported with Megatron Core!"

         model = megatron.model.GPTModel(
             config,
             num_tokentypes=0,
             parallel_output=True,
             pre_process=pre_process,
             post_process=post_process
         )
```

In the Megatron-LM directory, prepare the training data, fill in the corresponding paths in the example script, and then run it:
```
bash examples/pretrain_gpt_distributed.sh
```
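
The two manual edits to pretrain_gpt.py could also be scripted. The following is a minimal sketch, not a tool shipped with MindSpeed: the function name `patch_pretrain_source` is hypothetical, and it assumes the file contains a line that is exactly `import torch` and that the assertion sits on a single line, as in the snippets above. It works on the source text, so it can be tried on a string before touching a real file.

```python
ADAPTOR_IMPORT = "import mindspeed.megatron_adaptor"
ASSERT_MARKER = "assert(args.context_parallel_size == 1)"

def patch_pretrain_source(source: str) -> str:
    """Return `source` with the adaptor import added and the assertion removed."""
    patched = []
    has_import = ADAPTOR_IMPORT in source  # avoid adding the import twice
    for line in source.splitlines():
        if line.strip().startswith(ASSERT_MARKER):
            continue  # drop the context-parallel assertion
        patched.append(line)
        if line.strip() == "import torch" and not has_import:
            patched.append(ADAPTOR_IMPORT)
            has_import = True
    return "\n".join(patched) + "\n"

# A shortened stand-in for pretrain_gpt.py, used only to exercise the function.
sample = (
    "import os\n"
    "import torch\n"
    "from functools import partial\n"
    "    else:\n"
    "        assert(args.context_parallel_size == 1), "
    "\"Context parallelism is only supported with Megatron Core!\"\n"
    "        model = megatron.model.GPTModel(config)\n"
)
patched = patch_pretrain_source(sample)
print(patched)
```

To apply it to the real file, read the file contents with `pathlib.Path("pretrain_gpt.py").read_text()`, pass them through the function, and write the result back. Running it a second time leaves the file unchanged.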

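Custom optimization levels

MindSpeed provides a layered set of optimizations organized into three levels, and users can enable whichever level suits their needs. Each higher level includes the capabilities of the levels below it, which keeps the system stable and extensible. Set the --optimization-level {level} argument to choose which level is enabled. The argument accepts the following values:

0: Basic compatibility layer L0. Provides NPU support for the Megatron-LM framework so that it integrates seamlessly. This layer contains the basic feature patches, ensures reliability and stability, and lays the foundation for the higher-level optimizations.
1: Affinity enhancement layer L1 (includes L0). Integrates a library of high-performance fused operators and combines it with Ascend-affine compute optimizations to make full use of Ascend compute and significantly improve computational efficiency.
2 (default): In-house acceleration algorithm layer L2 (includes L1 and L0). Integrates several independently developed core techniques and provides comprehensive performance optimization.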
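
The cumulative behaviour of the levels can be sketched as follows. This is an illustration under stated assumptions, not MindSpeed's actual argument handling: the parser, the `LAYERS` labels, and the `enabled_layers` helper are all made up for this example, and only the flag name, the accepted values, and the default come from the text above.

```python
import argparse

# Labels paraphrase the level descriptions; they are not MindSpeed identifiers.
LAYERS = {
    0: "L0 basic compatibility",
    1: "L1 affinity enhancement",
    2: "L2 in-house acceleration algorithms",
}

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented flag and its default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--optimization-level", type=int,
                        choices=sorted(LAYERS), default=2)
    return parser

def enabled_layers(level: int) -> list:
    """A level enables its own layer plus every lower layer."""
    return [LAYERS[i] for i in sorted(LAYERS) if i <= level]

args = build_parser().parse_args(["--optimization-level", "1"])
print(enabled_layers(args.optimization_level))
```

Modelling the levels as an ordered set makes the "higher includes lower" rule a single comparison rather than a per-level list.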

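Features

MindSpeed features are organized into six modules: Megatron feature support, parallel strategy features, memory optimization features, affinity computing features, communication optimization features, and key scenario features. 【Prototype】 marks a prototype feature that has not yet been commercially released.

Large-model training currently relies mainly on the bf16 data type. Unless stated otherwise, the features below are in principle also compatible with fp16. If you run into problems in fp16 scenarios, contact the MindSpeed team or submit an issue, and we will respond promptly.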

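Megatron feature support

Feature	Description
Megatron data parallelism	link
Megatron tensor parallelism	link
Megatron pipeline parallelism	link
Megatron virtual pipeline parallelism	link
Megatron distributed optimizer	link
Megatron sequence parallelism	link
Megatron asynchronous DDP	link
Megatron weight-update communication overlap	link
Megatron recomputation	link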

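Parallel strategy features

Feature	Description
Ascend Ulysses long-sequence parallelism	link
Ascend Ring Attention long-sequence parallelism	link
Ascend hybrid long-sequence parallelism	link
Ascend custom no-op layers	link
【Prototype】Ascend PP automatic parallelism	link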

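Memory optimization features

Feature	Description
Ascend adaptive selective recomputation	link
Ascend activation function recomputation	link
Ascend independent pipeline scheduling for recomputation	link
Ascend mask unification	link
Ascend BF16 parameter copy reuse	link
Ascend swap_attention	link
Ascend Norm recomputation	link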

Affinity computing features

Feature	Description
Ascend rms_norm fused operator	link
Ascend swiglu fused operator	link
Ascend rotary_embedding fused operator	link
Ascend flash attention fused operator	link
Ascend MoE Token Permute and Unpermute fused operators	link
Ascend npu_matmul_add_fp32 gradient accumulation fused operator	link
【Prototype】Ascend computation-communication parallel optimization	link
【Prototype】Ascend MC2	link
【Prototype】Ascend fusion_attention_v2	link

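Communication optimization features

Feature	Description
Ascend nano-pipe pipeline parallelism	link
Ascend Gloo checkpoint saving optimization	link
【Prototype】Ascend high-dimensional tensor parallelism	link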

Mcore MoE features

Feature	Description
Ascend Megatron MoE GMM	link
Ascend Megatron MoE Allgather Dispatcher performance optimization	link
Ascend Megatron MoE Alltoall Dispatcher performance optimization	link
Ascend Megatron MoE TP-extended EP	link
Ascend shared experts	link
【Prototype】Ascend Megatron MoE load-aware memory balancing algorithm	link

DeepSpeed MoE features

Feature	Description
DeepSpeed MoE	link
Ascend DeepSpeed MoE token rearrangement performance optimization	link
Ascend DeepSpeed MoE dropless performance optimization	link
Ascend MLP communication overlap	link

Key scenario features

Feature	Description
Ascend EOD Reset training scenario	link
【Prototype】Ascend alibi	link

Other features

Feature	Description
Ascend TFLOPS calculation	link
【Prototype】Ascend deterministic computation	link
High-availability features	link

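Custom operators

Some custom operators are exposed as public interfaces. For how public interfaces are designated, see the public interface statement in the MindSpeed security statement. For details of each external interface, follow the manual link for the corresponding operator below.

Operator	Description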
npu_dropout_add_layer_norm	link
npu_rotary_position_embedding	link
fusion_attention	link
rms_norm	link
swiglu	link
npu_mm_all_reduce_add_rms_norm	link
npu_mm_all_reduce_add_rms_norm_	link
npu_gmm	link
npu_grouped_mat_mul_all_reduce	link
【Prototype】lcal_coc	link
【Prototype】ffn	link
【Prototype】npu_fused_moe_token_permute	link
【Prototype】npu_fused_moe_token_unpermute	link
【Prototype】npu_ring_attention_update	link
【Prototype】npu_matmul_add_fp32	link
【Prototype】npu_groupmatmul_add_fp32	link
【Prototype】npu_all_to_all_all_gather_bmm	link
【Prototype】npu_bmm_reduce_scatter_all_to_all	link

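Collecting profile data in MindSpeed

MindSpeed supports enabling profile data collection through command-line arguments, described below:

Argument	Meaning
--profile	Enables profiling
--profile-step-start	Step at which collection starts. Defaults to 10. Example: --profile-step-start 30
--profile-step-end	Step at which collection ends. Defaults to 12. Example: --profile-step-end 35
--profile-level	Collection level. Defaults to level0. Options: level0, level1, level2. Example: --profile-level level1
--profile-with-cpu	Enables CPU information collection
--profile-with-stack	Enables stack information collection
--profile-with-memory	Enables memory information collection. Requires --profile-with-cpu to be enabled as well
--profile-record-shapes	Enables shape information collection
--profile-save-path	Path where collected data is saved. Defaults to ./profile_dir. Example: --profile-save-path ./result_dir
--profile-ranks	Ranks to profile. Defaults to -1, which collects profiling data for all ranks. Example: --profile-ranks 0 1 2 3. Note: each value is the rank's global index within the single machine or cluster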
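
When launch scripts are generated programmatically, the profiling flags can be assembled by a small helper. The sketch below is an assumption-laden illustration rather than part of MindSpeed: `build_profile_args` is a hypothetical name, and the only facts taken from the table are the flag names, their defaults, the allowed levels, and the rule that --profile-with-memory needs --profile-with-cpu.

```python
ALLOWED_LEVELS = ("level0", "level1", "level2")

def build_profile_args(step_start=10, step_end=12, level="level0",
                       with_cpu=False, with_stack=False, with_memory=False,
                       record_shapes=False, save_path="./profile_dir",
                       ranks=(-1,)):
    """Return the profiling flags as a list suitable for a launch command."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {ALLOWED_LEVELS}")
    if with_memory and not with_cpu:
        raise ValueError("--profile-with-memory requires --profile-with-cpu")

    args = [
        "--profile",
        "--profile-step-start", str(step_start),
        "--profile-step-end", str(step_end),
        "--profile-level", level,
        "--profile-save-path", save_path,
        "--profile-ranks", *[str(r) for r in ranks],
    ]
    # Boolean switches are appended only when enabled.
    switches = (
        (with_cpu, "--profile-with-cpu"),
        (with_stack, "--profile-with-stack"),
        (with_memory, "--profile-with-memory"),
        (record_shapes, "--profile-record-shapes"),
    )
    args.extend(flag for enabled, flag in switches if enabled)
    return args

profile_args = build_profile_args(step_start=30, step_end=35, level="level1",
                                  with_cpu=True, with_memory=True, ranks=(0, 1))
print(" ".join(profile_args))
```

Checking the memory/CPU dependency before launch surfaces a misconfiguration immediately instead of partway through a training run.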

Version compatibility table

The PyTorch Extension version number follows the {PyTorch version}-{Ascend version} naming rule. The first part is the PyTorch version that the PyTorch Extension matches, and the second part is used to match the CANN version. The detailed mapping is as follows:

MindSpeed version	Megatron version	PyTorch version	torch_npu version	CANN version	Python version	Hardware
1.0.0_core_r0.7.0 (commercial)	Core 0.7.0	2.1.0	6.0.0	8.0.0	Python3.8.x, Python3.9.x, Python3.10.x	Atlas 200T A2 Box16, Atlas 800T A2, Atlas 900 A2 PODc
1.0.0_core_r0.6.0 (commercial)	Core 0.6.0	2.1.0	6.0.0	8.0.0	Python3.8.x, Python3.9.x, Python3.10.x	Atlas 200T A2 Box16, Atlas 800T A2, Atlas 900 A2 PODc
1.0.RC3_core_r0.7.0 (commercial)	Core 0.7.0	2.1.0	6.0.RC3	8.0.RC3	Python3.8.x, Python3.9.x, Python3.10.x	Atlas 200T A2 Box16, Atlas 800T A2, Atlas 900 A2 PODc
1.0.RC3_core_r0.6.0 (commercial)	Core 0.6.0	2.1.0	6.0.RC3	8.0.RC3	Python3.8.x, Python3.9.x, Python3.10.x	Atlas 200T A2 Box16, Atlas 800T A2, Atlas 900 A2 PODc
1.0.RC2 (commercial)	Core 0.6.0	2.1.0	6.0.RC2	8.0.RC2	Python3.8.x, Python3.9.x, Python3.10.x	Atlas 200T A2 Box16, Atlas 800T A2, Atlas 900 A2 PODc
1.0.RC1 (commercial)	commitid bcce6f	2.1.0	6.0.RC1	8.0.RC1	Python3.8.x, Python3.9.x, Python3.10.x	Atlas 200T A2 Box16, Atlas 800T A2, Atlas 900 A2 PODc

See the Ascend auxiliary software page for more version information on PyTorch and CANN.
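
The naming rule and the table can be combined into a small lookup. This is a sketch under two assumptions that the source does not state outright: that the Ascend part of an extension version string equals the torch_npu column, and that a string such as "2.1.0-6.0.RC3" is a valid instance of the rule (it is constructed here from table values). The function names are hypothetical.

```python
# Rows copied from the table: (MindSpeed, Megatron, PyTorch, torch_npu, CANN).
ROWS = [
    ("1.0.0_core_r0.7.0", "Core 0.7.0", "2.1.0", "6.0.0", "8.0.0"),
    ("1.0.0_core_r0.6.0", "Core 0.6.0", "2.1.0", "6.0.0", "8.0.0"),
    ("1.0.RC3_core_r0.7.0", "Core 0.7.0", "2.1.0", "6.0.RC3", "8.0.RC3"),
    ("1.0.RC3_core_r0.6.0", "Core 0.6.0", "2.1.0", "6.0.RC3", "8.0.RC3"),
    ("1.0.RC2", "Core 0.6.0", "2.1.0", "6.0.RC2", "8.0.RC2"),
    ("1.0.RC1", "commitid bcce6f", "2.1.0", "6.0.RC1", "8.0.RC1"),
]

def split_extension_version(version: str):
    """Split '{PyTorch version}-{Ascend version}' into its two parts."""
    pytorch_version, _, ascend_version = version.partition("-")
    if not pytorch_version or not ascend_version:
        raise ValueError(f"expected '<pytorch>-<ascend>', got {version!r}")
    return pytorch_version, ascend_version

def matching_mindspeed_versions(extension_version: str) -> list:
    """List the MindSpeed versions whose row matches both parts."""
    pytorch_version, ascend_version = split_extension_version(extension_version)
    return [row[0] for row in ROWS
            if row[2] == pytorch_version and row[3] == ascend_version]

print(matching_mindspeed_versions("2.1.0-6.0.RC3"))
```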

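Branch maintenance policy

MindSpeed release branches go through the following maintenance phases:

Status	Duration	Description
Planning	1-3 months	Features are planned
Development	3 months	Features are developed
Maintained	6-12 months	All resolved issues are merged and releases are published. The maintenance policy depends on the MindSpeed version: regular versions are maintained for 6 months and long-term support versions for 12 months
Unmaintained	0-3 months	All resolved issues are merged, but there are no dedicated maintainers and no releases
End of life (EOL)	N/A	The branch no longer accepts any changes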

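MindSpeed version maintenance policy

MindSpeed version	Maintenance policy	Current status	Release date	Next status
1.0.0_core_r0.7.0	Regular version	Development	2024/12/30	Expected to be unmaintained from 2025/6/30
1.0.0_core_r0.6.0	Regular version	Development	2024/12/30	Expected to be unmaintained from 2025/6/30
1.0.RC3_core_r0.7.0	Regular version	Maintained	2024/09/30	Expected to be unmaintained from 2025/3/30
1.0.RC3_core_r0.6.0	Regular version	Maintained	2024/09/30	Expected to be unmaintained from 2025/3/30
1.0.RC2	Regular version	Maintained	2024/06/30	Expected to be unmaintained from 2024/12/30
1.0.RC1	Regular version	Maintenance stopped	2024/03/30	Unmaintained since 2024/9/30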
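
The dates in this table are consistent with a regular version leaving maintenance six months after its release date. A minimal sketch of that arithmetic follows. The helper name `add_months` is hypothetical, and clamping to the last day of a shorter month is a choice made for this example, not a rule stated by the policy.

```python
import calendar
from datetime import date

def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    # Clamp the day for shorter months (an assumption of this sketch).
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

REGULAR_MAINTENANCE_MONTHS = 6  # regular versions, per the branch policy

for released in (date(2024, 12, 30), date(2024, 9, 30),
                 date(2024, 6, 30), date(2024, 3, 30)):
    print(released, "->", add_months(released, REGULAR_MAINTENANCE_MONTHS))
```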