File   Last commit   Last updated
hamilton attention implementation in te
Co-authored-by: Xiaoda Zhang<zhangxiaoda@huawei.com>
# message auto-generated for no-merge-commit merge:
!3430 merge add-HA-implement-on-new-master into master

hamilton attention implementation in te
Created-by: Xiaoda_zhang
Commit-by: Xiaoda Zhang
Merged-by: ascend-robot
Description:

What this PR does / why we need it? Please describe the background and detailed changes of the PR. If it is a bugfix, please attach the related issue.

This PR implements Hamilton attention (HA) on top of the existing MindSpeed code (see https://github.com/infinigence/HamiltonAttention), including both the forward and backward passes.

Advantage of HA: ring attention uses only a single intra-node link, whereas HA can use the entire intra-node full-mesh network. This relieves the communication bottleneck that can arise in ring attention when communication is not hidden behind computation.

This PR implements CP for both the SBH and TND formats, and correctness has been verified with UT. Performance gains were measured on the WAN and Qwen3-vl models:

WAN2.2, seq_len=18K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |

WAN2.2, seq_len=37K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |

Qwen3-vl, TND format, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP splitting

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |

Qwen3-vl, TND format, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP splitting

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |

Does this PR introduce any user-facing change? Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.

To enable HA, users must set the enable_ha parameter and pass in the in_mapping_list/out_mapping_list used by HA, which specify how the multiple rings send and receive data, as well as the permute_index needed to reassemble each seq in the TND format.

How was this patch tested? Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.

Correctness has been verified with UT.

See merge request: Ascend/MindSpeed!3430
24 days ago
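To make the role of in_mapping_list/out_mapping_list more concrete, here is a minimal sketch of one way several rings could be laid over a full mesh. It is an illustration only, not the construction used in this PR or in the HamiltonAttention repository. The helper names (`build_ring_mappings`, `is_single_ring`), the list-of-lists layout of the mappings, and the stride-based ring construction are all assumptions. A CP size of 8 is also assumed; it is consistent with the Qwen3-vl figures above (62 × 1024 tokens split to 7936 per rank), and with 8 ranks the coprime strides 1, 3, 5, 7 happen to yield 4 rings.

```python
from math import gcd


def build_ring_mappings(cp_size):
    """Hypothetical helper: build per-ring send/receive peer tables.

    Ring k uses stride s_k, where s_k is coprime to cp_size, so that
    following rank -> (rank + s_k) % cp_size visits every rank exactly once.
    out_mapping_list[k][r] is the rank that r sends to on ring k;
    in_mapping_list[k][r] is the rank that r receives from on ring k.
    This layout is an assumption, not the format MindSpeed expects.
    """
    strides = [s for s in range(1, cp_size) if gcd(s, cp_size) == 1]
    out_mapping_list = [
        [(r + s) % cp_size for r in range(cp_size)] for s in strides
    ]
    in_mapping_list = [
        [(r - s) % cp_size for r in range(cp_size)] for s in strides
    ]
    return out_mapping_list, in_mapping_list


def is_single_ring(out_map):
    """Check that following send targets from rank 0 covers all ranks."""
    seen, rank = set(), 0
    while rank not in seen:
        seen.add(rank)
        rank = out_map[rank]
    return len(seen) == len(out_map)


# Assumed CP size of 8, giving strides 1, 3, 5, 7 and therefore 4 rings.
out_maps, in_maps = build_ring_mappings(8)
for ring_id, out_map in enumerate(out_maps):
    print(f"ring {ring_id}: send targets {out_map}, "
          f"single ring: {is_single_ring(out_map)}")
```

Because each ring uses a different stride, no two rings share a directed link between the same pair of ranks, which is the property that lets the rings transfer data concurrently over a full mesh. Rings with strides s and cp_size - s do traverse the same physical links in opposite directions.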
add TEDotProductAttention for master
Co-authored-by: wuweiqiang24<wuweiqiang11@huawei.com>
# message auto-generated for no-merge-commit merge:
!3058 merge add_te_dpa_master into master

add TEDotProductAttention for master
Created-by: wuweiqiang24
Commit-by: wuweiqiang24
Merged-by: ascend-robot
Description:
1. add TEDotProductAttention
2. add flash attention backend

Accuracy has been aligned with the local DotProductAttention implementation.
https://wiki.huawei.com/domains/63703/wiki/220407/WIKI202511289199170

See merge request: Ascend/MindSpeed!3058
5 months ago