File | Last commit | Last updated
hamilton attention implementation in te
Co-authored-by: Xiaoda Zhang <zhangxiaoda@huawei.com>
# message auto-generated for no-merge-commit merge: !3430 merge add-HA-implement-on-new-master into master
hamilton attention implementation in te
Created-by: Xiaoda_zhang
Commit-by: Xiaoda Zhang
Merged-by: ascend-robot
Description:
What this PR does / why we need it?
This PR implements Hamilton attention (HA) on top of the existing MindSpeed code (see https://github.com/infinigence/HamiltonAttention), covering both the forward and backward passes.
Advantage of HA: ring attention uses only a single link inside a machine, whereas HA can use the machine's entire full-mesh network. This effectively relieves the communication bottleneck that can arise in ring attention when communication is not hidden behind computation.
This PR implements CP for both the SBH and TND formats, and correctness has been verified with UT. Performance gains were measured on the WAN and Qwen3-vl models.

WAN2.2, seq_len=18K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |

WAN2.2, seq_len=37K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |

Qwen3-vl, TND format, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP sharding

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |

Qwen3-vl, TND format, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP sharding

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |

Does this PR introduce any user-facing change?
To enable HA, the user must set the enable_ha parameter and pass the in_mapping_list/out_mapping_list used by HA, which describe how the multiple rings send and receive data, as well as the permute_index needed to reassemble the individual sequences in the TND format.
How was this patch tested?
Correctness has been verified with UT.
See merge request: Ascend/MindSpeed!3430
Last updated: 24 days ago
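The multi-ring idea in the Hamilton attention entry above can be illustrated with a small sketch. It is not the PR's construction: the source does not say how the rings are built or what format in_mapping_list/out_mapping_list take. The sketch assumes 8 devices in one full-mesh node and builds one directed ring per stride that is coprime with the device count; the function names `stride_rings`, `ring_peers`, and `ring_order` are hypothetical.

```python
from math import gcd


def stride_rings(world_size):
    """Strides that each define a directed ring visiting every rank once.

    A stride s works when gcd(s, world_size) == 1, because repeatedly
    adding s modulo world_size then cycles through all ranks.
    """
    return [s for s in range(1, world_size) if gcd(s, world_size) == 1]


def ring_peers(rank, stride, world_size):
    """Return (send_to, recv_from) for one rank on the ring with this stride."""
    return (rank + stride) % world_size, (rank - stride) % world_size


def ring_order(stride, world_size, start=0):
    """List the ranks in the order the ring visits them, beginning at start."""
    order, rank = [], start
    for _ in range(world_size):
        order.append(rank)
        rank = (rank + stride) % world_size
    return order


if __name__ == "__main__":
    world_size = 8  # assumption: 8 devices connected as a full mesh
    for stride in stride_rings(world_size):
        print(f"stride {stride}: {ring_order(stride, world_size)}")
```

With 8 ranks this yields four rings (strides 1, 3, 5, 7) whose directed links do not overlap, which is one way several rings can share a full mesh without competing for the same link direction.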
fix(quant): only hif8 add dst_type_max args
Co-authored-by: Muu <koimuu@163.com>
# message auto-generated for no-merge-commit merge: !3514 merge fix-hif8-tensorwise into master
fix(quant): only hif8 add dst_type_max args
Created-by: Muuyo
Commit-by: Muu
Merged-by: ascend-robot
Description: fix(quant): only hif8 add dst_type_max args
See merge request: Ascend/MindSpeed!3514
Last updated: 3 days ago
feat: mxfp8-32x32 quant
Co-authored-by: kyle_zhangchi <zhangchi158@huawei.com>
# message auto-generated for no-merge-commit merge: !3471 merge feat_mxfp8-32x32 into master
feat: mxfp8-32x32 quant
Created-by: kyle_zhangchi
Commit-by: kyle_zhangchi
Merged-by: ascend-robot
Description:
## What this PR does / why we need it?
Adds an mxfp8-32x32 quantization operator under the Megatron framework to reduce the device memory used by weights.
## Does this PR introduce *any* user-facing change?
--fp8-recipe gains a new mxfp8-32x32 option.
https://gitcode.com/Ascend/MindSpeed/commit/e065cbca6873bfc02661d088b07d90224333e87d?ref=feat_mxfp8-32x32&prId=3471
## How was this patch tested?
Verification document: https://wiki.huawei.com/domains/170864/wiki/367830/WIKI2026051111046509
See merge request: Ascend/MindSpeed!3471
Last updated: 7 days ago
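As a rough illustration of what a block-scaled FP8 recipe such as mxfp8-32x32 might involve, the sketch below picks one shared power-of-two scale per 32x32 tile, following the scale rule of the OCP Microscaling (MX) format for E4M3 elements. The source does not describe how the MindSpeed operator chooses its scales, so the tile interpretation, the function names `block_scale` and `block_scales`, and the list-of-lists input are all assumptions; rounding the scaled values onto the FP8 grid is omitted.

```python
import math

E4M3_EMAX = 8  # exponent of the largest E4M3 value: 448 = 1.75 * 2**8
BLOCK = 32     # assumed tile edge, taken from "32x32" in the recipe name


def block_scale(block_amax):
    """Power-of-two scale shared by every element of one block.

    MX rule: scale = 2 ** (floor(log2(amax)) - emax_elem). Scaled values
    can still exceed 448 (up to just under 512) and would then saturate.
    """
    if block_amax == 0.0:
        return 1.0  # an all-zero block needs no scaling
    return 2.0 ** (math.floor(math.log2(block_amax)) - E4M3_EMAX)


def block_scales(matrix, block=BLOCK):
    """Compute one scale per block x block tile of a list-of-lists matrix."""
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(block_scale(amax))
        scales.append(row_scales)
    return scales


if __name__ == "__main__":
    # Two stacked 32x32 tiles with different magnitudes.
    weights = [[1.0] * 32 for _ in range(32)] + [[448.0] * 32 for _ in range(32)]
    print(block_scales(weights))
```

Storing one scale per tile instead of per element is what keeps the overhead of the shared scales small relative to the 8-bit payload.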
!2791 [feat!!!]te support v2
Merge pull request !2791 from yangjie/master
Last updated: 8 months ago
docs:fix docs/zh mistakes
Co-authored-by: Keilo_W <wangkaiyu11@h-partners.com>
# message auto-generated for no-merge-commit merge: !3318 merge master into master
docs:fix docs/zh mistakes
Created-by: Keilo_W
Commit-by: Keilo_W
Merged-by: ascend-robot
Description: Fixed some comments and code that had been changed by mistake.
See merge request: Ascend/MindSpeed!3318
Last updated: 2 months ago
feature(fp8): te checkpoint
Co-authored-by: Muu <koimuu@163.com>
# message auto-generated for no-merge-commit merge: !3162 merge feature_checkpoint into master
feature(fp8): te checkpoint
Created-by: Muuyo
Commit-by: Muu
Merged-by: ascend-robot
Description:
1. Introduce te checkpoint to eliminate redundant quantization operations during recomputation
2. refactor(blockwise): remove the 128*128 blockwise strategy and keep the 1 * 128|128 * 128 strategy as its replacement
3. perf(hif8): remove redundant casts
4. fix(delayed): fix the delayed algorithm
5. refactor(recipe 2x): refactor how the blockwise and mxfp8 strategies store and access data, simplifying later operator adaptation
6. Replace string literals with enums
Verification report: https://wiki.huawei.com/domains/76578/wiki/233229/WIKI202601139775970
See merge request: Ascend/MindSpeed!3162
Last updated: 4 months ago