MindSpeed/mindspeed/te/pytorch · Ascend/MindSpeed - AtomGit

ascend-robot fix(quant): only hif8 add dst_type_max args

File	Last commit	Last updated
attention	hamilton attention implementation in te Co-authored-by: Xiaoda Zhang<zhangxiaoda@huawei.com> # message auto-generated for no-merge-commit merge: !3430 merge add-HA-implement-on-new-master into master hamilton attention implementation in te Created-by: Xiaoda_zhang Commit-by: Xiaoda Zhang Merged-by: ascend-robot Description:

What this PR does / why we need it?

This PR implements Hamilton attention (HA) on top of the existing MindSpeed code (see https://github.com/infinigence/HamiltonAttention), including both the forward and backward passes.

Advantage of HA: ring attention uses only a single intra-node link, whereas HA uses the whole intra-node full-mesh network. This effectively relieves the communication bottleneck that ring attention can hit when communication is not hidden by computation. This PR implements CP for both the SBH and TND formats, and correctness has been verified with unit tests (UT). Performance gains were measured on the WAN and Qwen3-vl models:

WAN2.2, seq_len=18K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |

WAN2.2, seq_len=37K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |

Qwen3-vl, TND format, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP splitting

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |

Qwen3-vl, TND format, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP splitting

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |

Does this PR introduce any user-facing change?

To enable HA, the user must set the enable_ha parameter and pass in the in_mapping_list/out_mapping_list used by HA, which describe how the multiple rings send and receive data, as well as the permute_index needed to reassemble the individual sequences in the TND format.

How was this patch tested?

Correctness has been verified with unit tests.

See merge request: Ascend/MindSpeed!3430	24 days ago
fp8	fix(quant): only hif8 add dst_type_max args Co-authored-by: Muu<koimuu@163.com> # message auto-generated for no-merge-commit merge: !3514 merge fix-hif8-tensorwise into master fix(quant): only hif8 add dst_type_max args Created-by: Muuyo Commit-by: Muu Merged-by: ascend-robot Description: fix(quant): only hif8 add dst_type_max args See merge request: Ascend/MindSpeed!3514	3 days ago
module	feat: mxfp8-32x32 quant Co-authored-by: kyle_zhangchi<zhangchi158@huawei.com> # message auto-generated for no-merge-commit merge: !3471 merge feat_mxfp8-32x32 into master feat: mxfp8-32x32 quant Created-by: kyle_zhangchi Commit-by: kyle_zhangchi Merged-by: ascend-robot Description: ## What this PR does / why we need it? Adds an mxfp8-32x32 quantization operator under the Megatron framework to reduce the device memory used by weights. ## Does this PR introduce any user-facing change? --fp8-recipe gains a new mxfp8-32x32 option. https://gitcode.com/Ascend/MindSpeed/commit/e065cbca6873bfc02661d088b07d90224333e87d?ref=feat_mxfp8-32x32&prId=3471 ## How was this patch tested? Verification document: https://wiki.huawei.com/domains/170864/wiki/367830/WIKI2026051111046509 See merge request: Ascend/MindSpeed!3471	7 days ago
__init__.py	!2791 [feat!!!]te support v2 Merge pull request !2791 from yangjie/master	8 months ago
module_typing.py	feat: mxfp8-32x32 quant Co-authored-by: kyle_zhangchi<zhangchi158@huawei.com> # message auto-generated for no-merge-commit merge: !3471 merge feat_mxfp8-32x32 into master feat: mxfp8-32x32 quant Created-by: kyle_zhangchi Commit-by: kyle_zhangchi Merged-by: ascend-robot Description: ## What this PR does / why we need it? Adds an mxfp8-32x32 quantization operator under the Megatron framework to reduce the device memory used by weights. ## Does this PR introduce any user-facing change? --fp8-recipe gains a new mxfp8-32x32 option. https://gitcode.com/Ascend/MindSpeed/commit/e065cbca6873bfc02661d088b07d90224333e87d?ref=feat_mxfp8-32x32&prId=3471 ## How was this patch tested? Verification document: https://wiki.huawei.com/domains/170864/wiki/367830/WIKI2026051111046509 See merge request: Ascend/MindSpeed!3471	7 days ago
permutation.py	docs:fix docs/zh mistakes Co-authored-by: Keilo_W<wangkaiyu11@h-partners.com> # message auto-generated for no-merge-commit merge: !3318 merge master into master docs:fix docs/zh mistakes Created-by: Keilo_W Commit-by: Keilo_W Merged-by: ascend-robot Description: Fixed some comments and code that had been changed by mistake. See merge request: Ascend/MindSpeed!3318	2 months ago
utils.py	feature(fp8): te checkpoint Co-authored-by: Muu<koimuu@163.com> # message auto-generated for no-merge-commit merge: !3162 merge feature_checkpoint into master feature(fp8): te checkpoint Created-by: Muuyo Commit-by: Muu Merged-by: ascend-robot Description: 1. Introduce te checkpoint to eliminate redundant quantization operations during recomputation 2. refactor(blockwise): remove the 128128 blockwise strategy and keep the 1 128\|128 * 128 strategy as its replacement 3. perf(hif8): remove redundant casts 4. fix(delayed): fix the delayed algorithm 5. refactor(recipe 2x): refactor data storage and access for the blockwise and mxfp8 strategies to simplify later operator adaptation 6. Eliminate string literals and replace them with enums. Verification report: https://wiki.huawei.com/domains/76578/wiki/233229/WIKI202601139775970 See merge request: Ascend/MindSpeed!3162	4 months ago
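
The timings reported in the !3430 description for the attention directory are easier to compare as speedup ratios (ring attention time divided by HA time). The following is a minimal sketch of that arithmetic. Only the numbers come from the description; the dictionary layout and the helper names `speedup` and `speedup_table` are illustrative assumptions and are not part of MindSpeed.

```python
# Timings copied from the !3430 description as (ring attention, HA) pairs.
# The structure and the helper names below are hypothetical, for illustration only.
TIMINGS = {
    "WAN2.2, seq_len=18K": {
        "comm operator (ms)": (3.6, 1.2),
        "core attention forward (ms)": (33.5, 16.0),
        "core attention backward (ms)": (45.9, 28.9),
        "E2E iteration (s)": (8.3, 6.0),
    },
    "WAN2.2, seq_len=37K": {
        "comm operator (ms)": (8.6, 3.8),
        "core attention forward (ms)": (65.6, 45.3),
        "core attention backward (ms)": (106.7, 90.8),
        "E2E iteration (s)": (15.5, 12.9),
    },
    "Qwen3-vl TND, seq_len=7936 after CP": {
        "comm operator (ms)": (2.1, 1.0),
        "core attention forward (ms)": (104.0, 104.0),
        "core attention backward (ms)": (34.0, 14.0),
    },
    "Qwen3-vl TND, seq_len=31744 after CP": {
        "comm operator (ms)": (8.9, 2.6),
        "core attention forward (ms)": (112.0, 112.0),
        "core attention backward (ms)": (135.0, 86.0),
    },
}


def speedup(ring_time, ha_time):
    """Return how many times faster HA is than ring attention."""
    return ring_time / ha_time


def speedup_table(timings):
    """Map every configuration and metric to its speedup, rounded to 2 decimals."""
    return {
        config: {
            metric: round(speedup(ring, ha), 2)
            for metric, (ring, ha) in rows.items()
        }
        for config, rows in timings.items()
    }


if __name__ == "__main__":
    for config, rows in speedup_table(TIMINGS).items():
        print(config)
        for metric, ratio in rows.items():
            print(f"  {metric}: {ratio:.2f}x")
```

A ratio of 1.00 (as for the Qwen3-vl forward pass) means the reported times are identical, so any end-to-end gain in those runs comes from the backward pass and the communication operator.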