hamilton attention implementation in te
Co-authored-by: Xiaoda Zhang<zhangxiaoda@huawei.com>
# message auto-generated for no-merge-commit merge:
!3430 merge add-HA-implement-on-new-master into master
hamilton attention implementation in te
Created-by: Xiaoda_zhang
Commit-by: Xiaoda Zhang
Merged-by: ascend-robot
Description: What this PR does / why we need it?
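This PR implements Hamilton attention (HA) (see https://github.com/infinigence/HamiltonAttention) on top of the existing MindSpeed code, covering both the forward and backward passes.
Advantage of HA: ring attention uses only a single intra-node link, whereas HA can use the entire intra-node full-mesh network. This effectively mitigates the communication bottleneck that can occur in ring attention when communication is not hidden by computation.
This PR implements CP for both the SBH and TND layouts, and correctness has been verified with UTs. Performance gains were measured on the WAN and Qwen3-vl models: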
WAN2.2, seq_len=18K
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 3.6ms | 1.2ms |
| Total core attention time (forward) | 33.5ms | 16ms |
| Total core attention time (backward) | 45.9ms | 28.9ms |
| E2E time per iteration | 8.3s | 6.0s |
WAN2.2, seq_len=37K
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.6ms | 3.8ms |
| Total core attention time (forward) | 65.6ms | 45.3ms |
| Total core attention time (backward) | 106.7ms | 90.8ms |
| E2E time per iteration | 15.5s | 12.9s |
Qwen3-vl, TND layout, seq_len=1024 per image, 62 images, total seq_len=62K, seq_len=7936 after CP splitting
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 2.1ms | 1ms |
| Total core attention time (forward) | 104ms | 104ms |
| Total core attention time (backward) | 34ms | 14ms |
Qwen3-vl, TND layout, seq_len=4096 per image, 62 images, total seq_len=248K, seq_len=31744 after CP splitting
| | ring attn (send/recv) | HA (4 rings) (alltoall) |
|--|--|--|
| Time per communication operator | 8.9ms | 2.6ms |
| Total core attention time (forward) | 112ms | 112ms |
| Total core attention time (backward) | 135ms | 86ms |
Does this PR introduce any user-facing change?
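To enable HA, the user must set the enable_ha parameter and pass in the in_mapping_list/out_mapping_list used by HA, which describe how the multiple rings send and receive data, as well as the permute_index needed to reassemble each sequence in the TND layout.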
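As an illustration of what per-ring send/receive mappings could look like, here is a minimal Python sketch. It is not taken from this PR. The helper names `build_ring_mappings` and `is_hamiltonian_cycle` are hypothetical, and the list layout (one list per ring, indexed by rank) is an assumption, as is the stride-based construction of the rings. The layout MindSpeed actually expects for in_mapping_list/out_mapping_list may differ.

```python
from math import gcd


def build_ring_mappings(world_size):
    """Build one ring per stride that is coprime to world_size.

    Assumed (hypothetical) layout:
      out_mapping_list[k][r] = rank that rank r sends to on ring k
      in_mapping_list[k][r]  = rank that rank r receives from on ring k
    """
    # A stride coprime to world_size visits every rank exactly once
    # before returning to the start, so each ring is a Hamiltonian cycle.
    strides = [s for s in range(1, world_size) if gcd(s, world_size) == 1]
    out_mapping_list = [
        [(r + s) % world_size for r in range(world_size)] for s in strides
    ]
    in_mapping_list = [
        [(r - s) % world_size for r in range(world_size)] for s in strides
    ]
    return strides, out_mapping_list, in_mapping_list


def is_hamiltonian_cycle(next_rank):
    """Check that following next_rank from rank 0 visits every rank once."""
    seen = set()
    rank = 0
    for _ in range(len(next_rank)):
        if rank in seen:
            return False
        seen.add(rank)
        rank = next_rank[rank]
    # After len(next_rank) hops the walk must be back at the start.
    return rank == 0 and len(seen) == len(next_rank)


if __name__ == "__main__":
    strides, out_maps, in_maps = build_ring_mappings(8)
    for s, out_map, in_map in zip(strides, out_maps, in_maps):
        print(f"stride {s}: send to {out_map}, receive from {in_map}")
```

With 8 ranks this construction yields four rings (strides 1, 3, 5, 7), and no two rings share a directed link, so all four can transfer data at the same time. Whether this PR builds its 4 rings this way is not stated here.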
How was this patch tested?
Correctness has been verified with UTs.
See merge request: Ascend/MindSpeed!3430
fix docs error
Co-authored-by: Keilo_W<wangkaiyu11@h-partners.com>
# message auto-generated for no-merge-commit merge:
!3450 merge master into master
fix docs error
Created-by: Keilo_W
Commit-by: Keilo_W
Merged-by: ascend-robot
See merge request: Ascend/MindSpeed!3450