MindSpeed/mindspeed/te/pytorch/attention · Ascend/MindSpeed - AtomGit

ascend-robot: hamilton attention implementation in te

File	Last commit	Last updated
dot_product_attention	hamilton attention implementation in te

Co-authored-by: Xiaoda Zhang<zhangxiaoda@huawei.com>

# message auto-generated for no-merge-commit merge:
!3430 merge add-HA-implement-on-new-master into master

hamilton attention implementation in te

Created-by: Xiaoda_zhang
Commit-by: Xiaoda Zhang
Merged-by: ascend-robot

Description:

What this PR does / why we need it? Please describe the background and detailed changes of the PR. If it is a bugfix, please attach the related issue.

This PR implements Hamilton attention (HA) on top of the existing MindSpeed code (see https://github.com/infinigence/HamiltonAttention), including both the forward and backward passes.

Advantage of HA: ring attention uses only a single intra-node link, whereas HA uses the entire intra-node full-mesh network. This relieves the communication bottleneck that can arise in ring attention when communication is not hidden by computation. This PR implements CP (context parallelism) for both the SBH and TND layouts, and correctness has been verified with unit tests (UT). Performance gains were measured on the WAN and Qwen3-vl models:

WAN2.2, seq_len=18K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |

WAN2.2, seq_len=37K

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |

Qwen3-vl, TND layout, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP sharding

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |

Qwen3-vl, TND layout, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP sharding

| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |

Does this PR introduce any user-facing change? Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.

To enable HA, the user must set the enable_ha parameter and pass in the in_mapping_list/out_mapping_list used by HA, which describe how the multiple rings send and receive data, as well as the permute_index needed to reassemble the individual sequences in the TND layout.

How was this patch tested? Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.

Correctness has been verified with UT.

See merge request: Ascend/MindSpeed!3430	24 days ago
__init__.py	add TEDotProductAttention for master Co-authored-by: wuweiqiang24<wuweiqiang11@huawei.com> # message auto-generated for no-merge-commit merge: !3058 merge add_te_dpa_master into master add TEDotProductAttention for master Created-by: wuweiqiang24 Commit-by: wuweiqiang24 Merged-by: ascend-robot Description: 1. add TEDotProductAttention 2. add flash attention backend. Accuracy is aligned with the local DotProductAttention implementation. https://wiki.huawei.com/domains/63703/wiki/220407/WIKI202511289199170 See merge request: Ascend/MindSpeed!3058	5 months ago
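
The dot_product_attention commit message says HA runs several rings at once over the intra-node full mesh, with mapping lists describing how each ring sends and receives data. The listing does not show the actual format of in_mapping_list/out_mapping_list, so the following is only a rough sketch of the idea under stated assumptions, not MindSpeed's implementation. It assumes 8 ranks and builds each ring from a stride coprime with the rank count. The names `build_ring_mappings`, `send_to`, `recv_from`, and `is_hamiltonian_cycle` are hypothetical.

```python
from math import gcd


def build_ring_mappings(world_size: int, num_rings: int):
    """Return one (send_to, recv_from) pair of lists per ring.

    Hypothetical helper, not a MindSpeed API. Ring k uses a stride s that is
    coprime with world_size: rank r sends to (r + s) % world_size and
    receives from (r - s) % world_size, so each ring visits every rank once.
    """
    strides = [s for s in range(1, world_size) if gcd(s, world_size) == 1]
    if num_rings > len(strides):
        raise ValueError(
            f"only {len(strides)} stride-based rings exist for {world_size} ranks"
        )
    rings = []
    for s in strides[:num_rings]:
        send_to = [(r + s) % world_size for r in range(world_size)]
        recv_from = [(r - s) % world_size for r in range(world_size)]
        rings.append((send_to, recv_from))
    return rings


def is_hamiltonian_cycle(send_to):
    """Check that following send_to from rank 0 visits every rank exactly once."""
    seen, rank = set(), 0
    for _ in range(len(send_to)):
        if rank in seen:
            return False
        seen.add(rank)
        rank = send_to[rank]
    return rank == 0 and len(seen) == len(send_to)


# Assumed setup: 8 ranks in one node, 4 concurrent rings.
rings = build_ring_mappings(world_size=8, num_rings=4)
for send_to, recv_from in rings:
    assert is_hamiltonian_cycle(send_to)
```

Because the strides differ, no two rings in this sketch share a directed link, which is the property that lets several rings make progress concurrently on a full mesh. Whether MindSpeed constructs its rings this way is not stated in the commit message.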