| [experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output
Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>
Co-authored-by: Konstantin Berestizshevsky<Konstantin.Berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge:
!5498 merge blitz-sparse-attention-128x128 into master
[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output
Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description: ## Description
1. blitz_sparse_attention now supports **16 block shapes**: (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024). Previously, (128,512) was the only supported shape. The shape is selected with the new block_shape argument of the Python API.
2. The blitz_sparse_attention kernel always emits **softmax LSE** as a second output. The tensor is only actually computed when True is passed to the new boolean argument of the Python API, softmax_lse_flag.
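As a rough illustration of the two items above, the sketch below enumerates the 16 block shapes as a Cartesian product and computes a plain-Python reference for the softmax LSE (the row-wise log-sum-exp of the attention scores). It does not call the torch_bsa package. The helper names `is_supported_block_shape` and `reference_lse` are hypothetical, and the score scaling and output layout used by the actual kernel are not specified here.

```python
import itertools
import math

# Block sizes listed in the description; a block shape is a pair drawn from this set.
BLOCK_SIZES = (128, 256, 512, 1024)
SUPPORTED_BLOCK_SHAPES = tuple(itertools.product(BLOCK_SIZES, repeat=2))


def is_supported_block_shape(block_shape):
    """Hypothetical helper: check a pair against the 16 listed block shapes."""
    return tuple(block_shape) in SUPPORTED_BLOCK_SHAPES


def reference_lse(scores):
    """Row-wise log-sum-exp of a 2D list of attention scores.

    Assumption: one LSE value per query row, computed over that row's scores.
    """
    out = []
    for row in scores:
        row_max = max(row)  # subtract the row max for numerical stability
        out.append(row_max + math.log(sum(math.exp(x - row_max) for x in row)))
    return out


if __name__ == "__main__":
    print(len(SUPPORTED_BLOCK_SHAPES))
    print(is_supported_block_shape((128, 512)))
    print(reference_lse([[0.0, 0.0], [1.0, 2.0, 3.0]]))
```

A reference like this can serve as a sanity check for small inputs; a real test would compare against a tensor-based implementation with an appropriate tolerance.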
## Related Issue
https://gitcode.com/cann/ops-transformer/issues/2509
## Testing
Make sure you have CANN 8.5.0, then follow the steps below to upgrade to Python 3.11 and install the required Python packages:
```shell
pushd /tmp
# install important pre-requisites
apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \
libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \
libbz2-dev ninja-build
# Python 3.11.10 - 4 minutes
wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz
tar -xf Python-3.11.10.tgz
pushd Python-3.11.10
./configure --prefix=/opt/python311 --enable-optimizations
make -j$(nproc)
make install
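# use the freshly installed interpreter; the new pip is not on PATH yet at this point
/opt/python311/bin/python3 -m pip install --upgrade pip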
ln -sf /opt/python311/bin/python3 /usr/bin/python
ln -sf /opt/python311/bin/python3 /usr/bin/python3
ln -sf /opt/python311/bin/pip3 /usr/bin/pip
ln -sf /opt/python311/bin/pip3 /usr/bin/pip3
echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
popd
# python packages
pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \
pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \
requests==2.32.5 absl-py==2.4.0 pytest==9.0.2
pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --upgrade pip setuptools wheel
popd
```
**Then run the container on your machine, clone the repo, and change into the repo directory.**
Once inside the container with the repo cloned:
**1st step** - build the kernel and the torch interface to blitz_sparse_attention (the torch_bsa Python package)
```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```
**2nd step - testing:**
```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest .
```
This runs tests for the attention output (test_attn), the LSE output on its own (test_lse), and the joint attention and LSE outputs (test_joint). All tests should pass.
**3rd step - benchmarking** (run from within experimental/attention/blitz_sparse_attention/benchmark):
```shell
python benchmark.py | tee >(python plot.py)
```
This prints a table of latencies for several block_shapes, all with BNSD=(1,3,118k,128), and creates an image, benchmark.png, that summarizes the table. The speedup compared to npu_fusion_attention is on par with previous blitz_sparse_attention versions.

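## Type Labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [x] 🧹 Code cleanup
- [ ] ❓ Other, please describe: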
See merge request: cann/ops-transformer!5498 | 5 days ago |