| hamilton attention implementation in te
Co-authored-by: Xiaoda Zhang<zhangxiaoda@huawei.com>
# message auto-generated for no-merge-commit merge:
!3430 merge add-HA-implement-on-new-master into master
hamilton attention implementation in te
Created-by: Xiaoda_zhang
Commit-by: Xiaoda Zhang
Merged-by: ascend-robot
Description: What this PR does / why we need it?
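This PR implements Hamilton attention (HA) on top of the existing MindSpeed code base (see https://github.com/infinigence/HamiltonAttention), including both the forward and backward passes.
Advantage of HA: ring attention uses only a single intra-node link, whereas HA can use the entire intra-node full-mesh network. This effectively relieves the communication bottleneck that can arise in ring attention when communication is not hidden by computation.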
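To illustrate how a full mesh can carry several rings at the same time, the sketch below builds edge-disjoint directed rings by giving each ring a different stride that is coprime to the group size. The stride construction, the group size of 8, and the helper names `stride_rings` and `is_single_cycle` are assumptions made for illustration only; this description does not state how the PR actually constructs its rings.

```python
from math import gcd


def stride_rings(world_size):
    """Build one directed ring per stride coprime to world_size.

    Each ring is a list `send_to`, where send_to[rank] is the peer that
    `rank` sends to on that ring. Hypothetical helper, not a MindSpeed API.
    """
    rings = []
    for stride in range(1, world_size):
        if gcd(stride, world_size) == 1:
            rings.append([(rank + stride) % world_size for rank in range(world_size)])
    return rings


def is_single_cycle(send_to):
    """Check that following send_to from rank 0 visits every rank once."""
    seen, rank = set(), 0
    while rank not in seen:
        seen.add(rank)
        rank = send_to[rank]
    return len(seen) == len(send_to)


rings = stride_rings(8)  # assumed group size of 8 ranks

# Collect every directed link (src, dst) used by any ring.
links = [(src, dst) for ring in rings for src, dst in enumerate(ring)]

print(f"{len(rings)} rings, {len(set(links))} distinct directed links")
```

With 8 ranks this construction yields four rings (strides 1, 3, 5, 7), and no directed link is shared between rings, so the rings can in principle transfer data concurrently over different links of the mesh.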
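This PR implements context parallelism (CP) for both the SBH and TND layouts, and correctness has been verified with unit tests (UT). Performance gains were measured on the WAN and Qwen3-vl models: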
WAN2.2, seq_len=18K
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |
WAN2.2, seq_len=37K
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |
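The WAN2.2 figures can be turned into speedup ratios with a few lines of Python. The numbers below are copied from the two tables above; the dictionary layout and the `speedup` helper are illustrative names, not part of the PR.

```python
# (ring attention, HA) timings copied from the two WAN2.2 tables.
wan_timings = {
    "18K": {
        "comm_op_ms": (3.6, 1.2),
        "core_attn_fwd_ms": (33.5, 16.0),
        "core_attn_bwd_ms": (45.9, 28.9),
        "e2e_s": (8.3, 6.0),
    },
    "37K": {
        "comm_op_ms": (8.6, 3.8),
        "core_attn_fwd_ms": (65.6, 45.3),
        "core_attn_bwd_ms": (106.7, 90.8),
        "e2e_s": (15.5, 12.9),
    },
}


def speedup(ring_time, ha_time):
    """Ratio of ring attention time to HA time (values above 1 favor HA)."""
    return ring_time / ha_time


for seq_len, rows in wan_timings.items():
    for metric, (ring_time, ha_time) in rows.items():
        print(f"seq_len={seq_len} {metric}: {speedup(ring_time, ha_time):.2f}x")
```

Read this way, the end-to-end iteration speedup is roughly 1.38x at seq_len=18K and roughly 1.20x at seq_len=37K.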
Qwen3-vl, TND layout, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP splitting
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |
Qwen3-vl, TND layout, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP splitting
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |
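The sequence lengths quoted for the two Qwen3-vl cases can be checked with simple arithmetic. The sketch below assumes a CP size of 8, which is not stated in this description but is the value implied by the quoted totals and per-rank lengths; `cp_shard_len` is a hypothetical helper name.

```python
def cp_shard_len(seq_len_per_image, num_images, cp_size):
    """Return (total sequence length, per-rank length after an even CP split)."""
    total = seq_len_per_image * num_images
    if total % cp_size != 0:
        raise ValueError("total sequence length must be divisible by cp_size")
    return total, total // cp_size


ASSUMED_CP_SIZE = 8  # inferred from the quoted numbers, not stated in the PR

for per_image in (1024, 4096):
    total, shard = cp_shard_len(per_image, 62, ASSUMED_CP_SIZE)
    print(f"per-image={per_image}: total={total} ({total // 1024}K), per-rank={shard}")
```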
Does this PR introduce any user-facing change?
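To enable HA, the user must set the enable_ha parameter and pass in the in_mapping_list/out_mapping_list used by HA, which describe how the multiple rings send and receive data, as well as the permute_index needed to reassemble the individual sequences in the TND layout.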
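A rough sketch of what such inputs might look like is shown below. Only the names `enable_ha`, `in_mapping_list`, and `out_mapping_list` come from this description; their concrete format (one list of peer ranks per ring), the stride-based ring layout, the `ha_config` dictionary, and the helper functions are all assumptions for illustration and may not match what MindSpeed expects. `permute_index` is left out because its layout is not described here.

```python
def build_mappings(world_size, strides):
    """Hypothetical construction: one ring per stride.

    out_mapping_list[k][rank] is the peer that `rank` sends to on ring k;
    in_mapping_list[k][rank] is the peer that `rank` receives from on ring k.
    """
    out_mapping_list = [
        [(rank + s) % world_size for rank in range(world_size)] for s in strides
    ]
    in_mapping_list = [
        [(rank - s) % world_size for rank in range(world_size)] for s in strides
    ]
    return in_mapping_list, out_mapping_list


def mappings_consistent(in_mapping_list, out_mapping_list):
    """Every send must have a matching receive on the same ring."""
    for recv_from, send_to in zip(in_mapping_list, out_mapping_list):
        for rank, dst in enumerate(send_to):
            if recv_from[dst] != rank:
                return False
    return True


# Assumed setup: 8 ranks, 4 rings with strides 1, 3, 5, 7.
in_mapping_list, out_mapping_list = build_mappings(8, [1, 3, 5, 7])

# Illustrative container only; how the parameters are actually passed is not
# specified in this description.
ha_config = {
    "enable_ha": True,
    "in_mapping_list": in_mapping_list,
    "out_mapping_list": out_mapping_list,
}

print(mappings_consistent(in_mapping_list, out_mapping_list))
```

Checking that each send has a matching receive before launching is a cheap way to catch a mismatched pair of mapping lists, which would otherwise tend to show up as a hang in the collective.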
How was this patch tested?
Correctness has been verified with unit tests (UT).
See merge request: Ascend/MindSpeed!3430 | 24 days ago |