File | Last commit | Last updated
fix: mc2_validate_args
Co-authored-by: clc2025<chenlucong@huawei.com>
# message auto-generated for no-merge-commit merge: !3394 merge fix_mc2_validate_args into master
fix: mc2_validate_args
Created-by: clc2025
Commit-by: clc2025
Merged-by: ascend-robot
Description:
What this PR does / why we need it? Please describe the background and detailed changes of the PR. If it is a bugfix, please attach the related issue.
DTS2026040735953
Does this PR introduce any user-facing change? Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.
How was this patch tested? Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.
See merge request: Ascend/MindSpeed!3394
Last updated: 1 month ago
feature(fp8): te checkpoint
Co-authored-by: Muu<koimuu@163.com>
# message auto-generated for no-merge-commit merge: !3162 merge feature_checkpoint into master
feature(fp8): te checkpoint
Created-by: Muuyo
Commit-by: Muu
Merged-by: ascend-robot
Description:
1. Introduce te checkpoint to eliminate redundant quantization operations during recomputation
2. refactor(blockwise): remove the 128*128 blockwise strategy and keep the 1 * 128|128 * 128 strategy as its replacement
3. perf(hif8): remove redundant casts
4. fix(delayed): fix the delayed algorithm
5. refactor(recipe 2x): refactor data access for the blockwise and mxfp8 strategies to simplify later operator adaptation
6. Replace string literals with enums
Verification report: https://wiki.huawei.com/domains/76578/wiki/233229/WIKI202601139775970
See merge request: Ascend/MindSpeed!3162
Last updated: 4 months ago
feat: fp8 reuse quant w with te_gmm_mode compatible
Co-authored-by: Jia_Austin<dengjia6@huawei.com>
# message auto-generated for no-merge-commit merge: !3371 merge fp8_reuse_perf_v2 into master
feat: fp8 reuse quant w with te_gmm_mode compatible
Created-by: Jia_Austin
Commit-by: Jia_Austin
Merged-by: ascend-robot
Description:
What this PR does / why we need it?
feat: fp8 reuse quant w with te_gmm_mode compatible; perf/fix: fp8 reuse quant w with te_gmm_mode perf
Does this PR introduce any user-facing change? Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.
How was this patch tested? Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.
See merge request: Ascend/MindSpeed!3371
Last updated: 1 month ago
hamilton attention implementation in te
Co-authored-by: Xiaoda Zhang<zhangxiaoda@huawei.com>
# message auto-generated for no-merge-commit merge: !3430 merge add-HA-implement-on-new-master into master
hamilton attention implementation in te
Created-by: Xiaoda_zhang
Commit-by: Xiaoda Zhang
Merged-by: ascend-robot
Description:
What this PR does / why we need it? Please describe the background and detailed changes of the PR. If it is a bugfix, please attach the related issue.

This PR implements Hamilton attention (HA) on top of the existing MindSpeed code (see https://github.com/infinigence/HamiltonAttention), covering both the forward and backward passes.

Advantage of HA: ring attention uses only a single intra-node link, whereas HA uses the entire intra-node full-mesh network. This relieves the communication bottleneck that can arise in ring attention when communication is not hidden by computation.

This PR implements CP for both the SBH and TND layouts, and correctness has been verified with unit tests (UT). Performance gains were measured on the WAN and Qwen3-vl models:

WAN2.2, seq_len=18K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |

WAN2.2, seq_len=37K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |

Qwen3-vl, TND layout, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP splitting

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |

Qwen3-vl, TND layout, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP splitting

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |

Does this PR introduce any user-facing change? Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.

To enable HA, the user must set the enable_ha parameter and pass in the in_mapping_list/out_mapping_list used by HA, which describe how the multiple rings send and receive data, as well as the permute_index needed to reassemble the individual sequences in the TND layout.

How was this patch tested? Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.

Correctness has been verified with unit tests.

See merge request: Ascend/MindSpeed!3430
Last updated: 24 days ago
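
The Hamilton attention commit message above says that in_mapping_list/out_mapping_list describe how several rings send and receive data over a full mesh. The sketch below illustrates one way such mappings could be built. It is not taken from the PR: the format MindSpeed actually expects is not given in the commit message, and the list-of-lists layout, the helper names `build_stride_rings` and `is_single_ring`, and the stride-based construction are all assumptions made for illustration.

```python
from math import gcd


def build_stride_rings(world_size):
    """Hypothetical helper: build one directed ring per stride coprime to world_size.

    out_mapping[r] is the rank that r sends to; in_mapping[r] is the rank
    that r receives from. This layout is an assumption, not MindSpeed's API.
    """
    out_mapping_list, in_mapping_list = [], []
    for stride in range(1, world_size):
        if gcd(stride, world_size) != 1:
            # A non-coprime stride breaks up into several short cycles
            # instead of one ring that visits every rank.
            continue
        out_mapping_list.append([(r + stride) % world_size for r in range(world_size)])
        in_mapping_list.append([(r - stride) % world_size for r in range(world_size)])
    return out_mapping_list, in_mapping_list


def is_single_ring(out_mapping):
    """Return True if following send targets from rank 0 visits every rank once."""
    seen, rank = set(), 0
    while rank not in seen:
        seen.add(rank)
        rank = out_mapping[rank]
    return len(seen) == len(out_mapping)


out_maps, in_maps = build_stride_rings(8)
for out_map, in_map in zip(out_maps, in_maps):
    print("send to:", out_map, "| receive from:", in_map)
```

With 8 ranks the coprime strides are 1, 3, 5 and 7, so this produces four rings, and each ring uses a different set of directed links in the mesh. Whether this matches the ring construction used in the PR is not stated in the commit message.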