| BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels
Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge:
!2517 merge block-sparse-pfa-v1 into master
Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description: ## Description
We introduce **BlitzSparseAttention**, a modified PromptFlashAttentionV3 with added **block-sparsity support** that speeds up prefill when the user knows the attention is sparse. The kernel accepts **one new argument, "sabi"**, a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are discarded, which is where the **performance gain** comes from.
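
To make the block-index idea concrete, here is a minimal sketch of how a caller might count the blocks of an attention matrix and pick which ones to keep. It assumes the (128x512) block means 128 query rows by 512 key/value columns, and the per-row list of kept column indices is a hypothetical layout chosen for illustration; the real "sabi" tensor layout may differ. `block_grid` and `select_blocks` are illustrative names, not part of the kernel's interface.

```python
import math
import random

# Assumption: a block spans 128 query rows and 512 key/value columns.
BLOCK_Q, BLOCK_KV = 128, 512


def block_grid(s_q, s_kv):
    """Return (block rows, block columns) needed to cover an s_q x s_kv matrix."""
    return math.ceil(s_q / BLOCK_Q), math.ceil(s_kv / BLOCK_KV)


def select_blocks(s_q, s_kv, sparsity, seed=0):
    """Keep a random subset of block columns for every block row.

    Returns one sorted list of kept block-column indices per block row.
    This nested-list layout is hypothetical, not the kernel's actual format.
    """
    n_rows, n_cols = block_grid(s_q, s_kv)
    keep = max(1, round(n_cols * (1.0 - sparsity)))  # always keep at least one block
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_cols), keep)) for _ in range(n_rows)]


rows, cols = block_grid(118806, 118806)
indices = select_blocks(118806, 118806, sparsity=0.5)
print(rows, cols)  # 929 233
print(len(indices), len(indices[0]))
```

In practice the kept blocks would come from a sparsity pattern known to the model or pipeline rather than from random sampling; the random choice here only stands in for that pattern.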

### Advantages:
1. We extend the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), so this feature could later be merged into the master PFA kernel.
2. We provide a custom **PyTorch interface** so users can try the kernel in their Python pipelines immediately, without waiting for torch_npu adapter support.
3. **pytests** and **kernel speed benchmarks** are also included.
4. Our block-sparse prompt flash attention has already shown **great end-to-end speedups** in ongoing video generation pipelines, so we would like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is **1.84x at 50% sparsity** and **2.95x at 70% sparsity** compared to dense npu_fusion_attention.

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.
## Related Issue
[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)
## Testing
Run these 3 commands in the ops-transformer home directory to build our kernel and its PyTorch interface:
```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```
Testing and benchmarking:
```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py # correctness tests for sequence lengths 10k-30k and 1-4 attention heads
python benchmark.py # performance benchmarking; check the constant input shapes defined in the script
```
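For readers who want to see what a correctness check of this kind compares against, the following is a small pure-Python reference of block-sparse attention semantics: each query row attends only to the key/value columns that lie in its kept blocks. It is a sketch only; the actual contents of test.py are not shown here, and the function name, the nested-list `kept` format, and the toy block sizes are all assumptions made for illustration.

```python
import math


def block_sparse_attention(q, k, v, kept, block_q, block_kv):
    """Reference block-sparse attention on nested lists.

    q is [s_q][d]; k and v are [s_kv][d]. kept[i] lists the KV-block indices
    visible to query block i (hypothetical format). Every query block must
    keep at least one KV block.
    """
    d = len(q[0])
    out = []
    for i, q_row in enumerate(q):
        visible = set(kept[i // block_q])
        # Columns outside the kept blocks are skipped entirely.
        cols = [j for j in range(len(k)) if j // block_kv in visible]
        scores = [sum(a * b for a, b in zip(q_row, k[j])) / math.sqrt(d) for j in cols]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, cols)) / total
            for c in range(len(v[0]))
        ])
    return out


# Toy example with 2x2 blocks instead of 128x512.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
dense = block_sparse_attention(q, k, v, kept=[[0, 1], [0, 1]], block_q=2, block_kv=2)
sparse = block_sparse_attention(q, k, v, kept=[[0], [1]], block_q=2, block_kv=2)
print(dense[0], sparse[0])
```

With every block kept, the result matches ordinary dense attention; dropping a block removes its columns from the softmax, so the sparse output generally differs from the dense one.
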
Benchmarking at a sequence length of 118k tokens shows a **1.84x speedup at 50% sparsity** compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix:
```
==========================================================================================
DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```
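The speedup at each sparsity level can be recomputed from the table by dividing the dense reference latency by our latency. The short sketch below does this with the values copied from the table; the variable and function names are illustrative only.

```python
# Latencies in microseconds, copied from the benchmark table above.
REF_DENSE_US = 157663.17  # torch_npu.npu_fusion_attention with a dense matrix

OUR_US = {
    0.00: 169537.33,
    0.05: 155995.83,
    0.10: 148569.81,
    0.20: 132693.53,
    0.30: 116889.01,
    0.40: 101534.06,
    0.50: 84899.79,
    0.60: 69480.71,
    0.70: 53176.09,
    0.80: 38088.18,
    0.90: 21708.31,
}


def speedup(sparsity):
    """Dense reference latency divided by our latency at the given sparsity."""
    return REF_DENSE_US / OUR_US[sparsity]


for s in sorted(OUR_US):
    print(f"sparsity={s:.2f}  speedup={speedup(s):.2f}x")
```

In this table our kernel is slightly slower than the reference at 0% sparsity and already slightly faster at 5% sparsity. Ratios recomputed from a single table can differ by a few hundredths from headline figures that were taken from another run.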
## Documentation Updates
Readme files and docs are updated under the
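## Type Labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:
See merge request: cann/ops-transformer!2517 | 2 months ago |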