cann-robot [experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

08c90e7e, created 1 day ago

File	Last commit	Last updated
benchmark	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>
Co-authored-by: Konstantin Berestizshevsky<Konstantin.Berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge:
!5498 merge blitz-sparse-attention-128x128 into master

[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description:

## Description
1. The blitz_sparse_attention now supports 16 block shapes: (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024). Previously only (128,512) was supported. The shape is selected by the new `block_shape` argument of the Python API.
2. The blitz_sparse_attention kernel always emits the softmax LSE as a second output. The tensor is actually computed only when True is passed to the new boolean input of the Python API, `softmax_lse_flag`.

## Related Issue
https://gitcode.com/cann/ops-transformer/issues/2509

## Testing
Make sure to get CANN 8.5.0, then follow these steps to upgrade to Python 3.11 and install the necessary Python packages:

```
pushd /tmp
# install important pre-requisites
apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \
    libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \
    libbz2-dev ninja-build

# Python 3.11.10 - 4 minutes
wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz
tar -xf Python-3.11.10.tgz
pushd Python-3.11.10
./configure --prefix=/opt/python311 --enable-optimizations
make -j$(nproc)
make install
pip install --upgrade pip
ln -sf /opt/python311/bin/python3 /usr/bin/python
ln -sf /opt/python311/bin/python3 /usr/bin/python3
ln -sf /opt/python311/bin/pip3 /usr/bin/pip
ln -sf /opt/python311/bin/pip3 /usr/bin/pip3
echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
popd

# python packages
pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \
    pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \
    requests==2.32.5 absl-py==2.4.0 pytest==9.0.2
pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --upgrade pip setuptools wheel
popd
```

Then run the container on your machine, clone the repo, and change into it.

Once inside the container with the repo cloned:

1st step - build the kernel and the torch interface to our blitz_sparse_attention (torch_bsa python package)

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

2nd step - testing:

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest .
```

This should test the attention (test_attn), the LSE separately (test_lse) and the joint attention & LSE outputs (test_joint). All should be green.

3rd step - benchmarking (run from within `experimental/attention/blitz_sparse_attention/benchmark`):

```shell
python benchmark.py | tee >(python plot.py)
```

This prints a table of latencies for several block_shapes, all with BNSD=(1,3,118k,128). In addition, an image `benchmark.png` is created to summarize the table. The speedup compared to npu_fusion_attention is on par with the previous blitz_sparse_attention versions:

![benchmark.png](https://raw.gitcode.com/user-images/assets/7673863/45fb9cd4-0cd5-4e00-8122-d7f1d081c7f0/benchmark.png 'benchmark.png')

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [x] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!5498	1 day ago
docs	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	1 day ago
examples	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	1 day ago
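op_graph	move fallback files to op graph lib

Co-authored-by: liusixia<liusixia@h-partners.com>

# message auto-generated for no-merge-commit merge:
!4133 merge master into master

move fallback files to op graph lib

Created-by: liusixia_gitcode
Commit-by: liusixia
Merged-by: cann-robot
Description:

## Description
Dynamic-graph related: the fallback files for the in-repo aclnn callbacks are now compiled into opgraph.so instead of ophost.so for the built-in package (built-in pkg); for the custom package (custom pkg) nothing changes. The fallback files of the mc2 operators currently all include a tiling-dependent header (mc2_log.h); they are uniformly decoupled from tiling and now use mc2_common_log.h.

## Related Issue
https://gitcode.com/cann/ops-transformer/issues/1844

## Testing
<!-- Describe the tests performed to verify your change, including but not limited to level-2 smoke tests, operator generalization, etc. -->

## Documentation update
<!-- If this PR includes documentation updates, note them here, e.g. updated README.md. -->

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [x] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4133	1 month ago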
op_host	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	1 day ago
op_kernel	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	1 day ago
torch_interface	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	1 day ago
CMakeLists.txt	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge:
!2517 merge block-sparse-pfa-v1 into master

BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description:

## Description
We introduce BlitzSparseAttention, a modified PromptFlashAttentionV3 to which we added block-sparsity support to speed up the prefill when the user knows that the attention is sparse. The kernel accepts one new argument, "sabi", a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are discarded, which is where the speedup comes from.

![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:
1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), allowing a potential merge of this feature into the master pfa kernel.
2. We provide our custom PyTorch interface so that users can immediately test and try our kernel in their Python pipelines, without waiting for torch_npu adapter support.
3. pytests and kernel speed benchmarks are also included.
4. Our block-sparse prompt flash attention has already shown large end-to-end speedups in ongoing video generation pipelines, and we would therefore like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is 1.84x at 50% sparsity and 2.95x at 70% sparsity compared to dense npu_fusion_attention:

![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue
[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing
Run these 3 commands in the ops-transformer home directory to build our kernel and its PyTorch interface:

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

Testing and Benchmarking

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py      # correctness tests for sequence lengths 10k-30k, 1-4 attention heads
python benchmark.py # performance benchmarking - check the constant input shapes defined in the script
```

The benchmark at a sequence length of 118k tokens shows a 1.84x speedup at 50% sparsity (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).

```
==========================================================================================
DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```

## Documentation update
Readme files and docs are updated under the

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517	2 months ago
README.md	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	1 day ago

BlitzSparseAttention — Prompt Flash Attention with Block-Sparsity

This kernel is based on PromptFlashAttentionV3, extending it with a new argument sabi to enable block-sparse attention computation during prefill. We provide a torch interface to quickly try out our kernel in end-to-end Python pipelines that may benefit from sparse computation (e.g. Hunyuan-video). Documentation of the sabi argument is in docs/aclnnBlitzSparseAttention.md.

Known limitations / TODOs

TODO 1: 128×128 sabi granularity speedup. The 128×512 sabi granularity has shown great speedups (1.89× at 50% sparsity). However, the current 128×128 version still uses 128×512 matmul tiles internally without first compacting only the sabi-selected 128×128 sub-tiles into them. In the current kernel design the cube therefore performs redundant matmuls which are then masked out during softmax — only fully empty 512-long tiles (i.e. 4 consecutive non-selected blocks) are truly skipped. As a result the speedup ramps up from sparsity ≥ 10%, in proportion to the probability of 4 consecutive non-selected blocks. Multiple attempts to rewrite the matmul scheduling have shown that a proper fix requires a full kernel rewrite; the sibling attention/block_sparse_attention kernel handles this better with its bottom-up CATLASS-based design.

TODO 2: batch size > 1 is broken. Only batch_size=1 (B=1) is currently known to produce correct results. Runs with B>1 produce incorrect outputs. All tests and benchmarks must be run with B=1 until the multi-batch issue is diagnosed and fixed.

Quick test and benchmark in python:

Build the kernel as a custom experimental package, install it, then install our "torch_bsa" torch interface package:

bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)

Run the tests and benchmark the run times:

cd experimental/attention/blitz_sparse_attention/benchmark
pytest test_attn.py # attention_out correctness tests for sequence lengths 10k-30k and 1-4 attention heads; compares our block-sparse BSA against the npu_fusion_attention kernel and our own Python implementation
pytest test_lse.py # softmax_lse correctness tests for sequence lengths 10k-30k and 1-4 attention heads; compares our block-sparse BSA against the npu_fused_infer_attention_score kernel
pytest test_joint.py # checks the correctness of both kernel outputs simultaneously
python benchmark.py # performance benchmarking; check the constant input shapes defined in the script

The tests should all be green. benchmark.py sweeps every sparsity at every pair in BLOCK_SHAPES and labels each row with its active block_shape. The frame(L,R,T,B) column shows the per-shape frame as a compact 4-tuple, or - when no frame applies (sparsity 0 ⇒ every block kept ⇒ frame irrelevant). Trimmed sample on Ascend910B2:

========================================================================================================================
  DTYPE=torch.bfloat16  INPUT_LAYOUT='BNSD'  SABI_SORTED=True  TORCH_REFERENCE='npu_fusion_attention'
========================================================================================================================
  block_shape   H   B    s_q   s_kv    D  frame(L,R,T,B)  sparsity   Outputs_equal Ref_Latency_[usec] Our_Latency_[usec]
------------------------------------------------------------------------------------------------------------------------
      128x128   3   1 118806 118806  128               -      0.00             yes          160647.57          186419.60
      128x128   3   1 118806 118806  128     (29,1,29,1)      0.50             N/A                N/A          126072.34
      128x128   3   1 118806 118806  128     (29,1,29,1)      0.90             N/A                N/A           23824.46
      128x256   3   1 118806 118806  128               -      0.00             yes          162336.54          187044.41
      128x256   3   1 118806 118806  128     (15,1,29,1)      0.50             N/A                N/A           94882.15
      128x256   3   1 118806 118806  128     (15,1,29,1)      0.90             N/A                N/A           18682.85
      128x512   3   1 118806 118806  128               -      0.00             yes          163961.43          186603.97
      128x512   3   1 118806 118806  128      (8,1,29,1)      0.50             N/A                N/A           85956.16
      128x512   3   1 118806 118806  128      (8,1,29,1)      0.90             N/A                N/A           17384.59
========================================================================================================================

Rows are trimmed to a few sparsities per shape; the actual sweep emits one row per sparsity for each block_shape. Narrow BLOCK_SHAPES at the top of benchmark.py to benchmark a single granularity. At S=118806 D=128 BF16, 128×256 breaks even with the dense PFA reference at sparsity ≈ 0.05, 128×512 at ≈ 0.1; the historic 1.89× speedup at sparsity 0.5 still holds for 128×512 (and 128×256 is within ~10% of it while keeping a 2× finer sabi resolution). See benchmark/README.md for the full table and a per-sparsity PFA-speedup summary.
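As a quick check of the numbers quoted above, the sketch below recomputes the speedup at sparsity 0.5 from the trimmed sample rows. It assumes that the dense reference latency measured at sparsity 0 for the same block_shape is the right baseline for the sparse rows (the table shows N/A for the reference there); the dictionary and function names are illustrative only and do not come from benchmark.py.

```python
# Latencies in microseconds, copied from the trimmed sample table.
DENSE_REF_USEC = {"128x128": 160647.57, "128x256": 162336.54, "128x512": 163961.43}
OUR_USEC_AT_050 = {"128x128": 126072.34, "128x256": 94882.15, "128x512": 85956.16}


def speedup(ref_usec: float, our_usec: float) -> float:
    """Speedup of the block-sparse kernel over the dense reference."""
    return ref_usec / our_usec


# Assumption: the sparsity-0 reference latency is the baseline for each shape.
speedups = {
    shape: speedup(DENSE_REF_USEC[shape], OUR_USEC_AT_050[shape])
    for shape in DENSE_REF_USEC
}

for shape, value in speedups.items():
    print(f"{shape}: {value:.2f}x at sparsity 0.50")
```

On these sample rows the 128×512 ratio comes out at roughly 1.9×, and 128×256 lands about 10% below it, in line with the figures quoted above.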

To invoke our block-sparse prompt flash attention kernel from python, use our provided torch_bsa interface. The call is compatible with torch_npu conventions:

import torch
import torch_bsa

# Sabi granularity. Both values must be in {128, 256, 512, 1024}; smaller
# values give finer per-block control at the cost of a larger sabi tensor.
# Default (when block_shape is omitted) is [128, 128].
BLOCK_SIZE_Q, BLOCK_SIZE_KV = 128, 128

# sabi: torch.uint16, shape [B, N, ceil(S/BLOCK_SIZE_Q), ceil(S/BLOCK_SIZE_KV)].
# Each row lists the kept KV-block column indices for that Q-block, padded on
# the right with 0xFFFF (the uint16 "skip" sentinel).
sabi = ...  # build from your sparsity pattern

# Returns a tuple (attention_out, softmax_lse).
# softmax_lse is a [B, N, S] float32 tensor when softmax_lse_flag=True,
# or an empty tensor ({0}-shaped) when softmax_lse_flag=False (default).
attention_out, softmax_lse = torch_bsa.blitz_sparse_attention(
    q, k, v,
    sabi=sabi,
    actual_seq_lengths=actseqlen,
    actual_seq_lengths_kv=actseqlenkv,
    num_heads=h,
    num_key_value_heads=h,
    input_layout='BNSD',
    scale_value=scale,
    sparse_mode=0,
    softmax_lse_flag=False,                   # set True to also return the log-sum-exp output
    block_shape=[BLOCK_SIZE_Q, BLOCK_SIZE_KV],
)
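The snippet above leaves the construction of sabi open. A minimal sketch of one way to build it from a boolean block mask follows, based only on the layout described in the comments (kept KV-block indices per Q-block, right-padded with 0xFFFF). The helper names and the block-causal pattern are hypothetical, the sketch covers a single [ceil(S/BLOCK_SIZE_Q), ceil(S/BLOCK_SIZE_KV)] slice for one batch and head, and converting the nested lists into a torch.uint16 tensor on the device is left to the caller.

```python
SKIP = 0xFFFF  # uint16 "skip" sentinel used for right-padding


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def build_sabi_rows(keep_mask):
    """Turn a boolean block mask into sabi-style index rows.

    keep_mask[i][j] is True when Q-block i attends to KV-block j.
    Each returned row lists the kept KV-block indices in ascending order,
    right-padded with SKIP up to the number of KV-blocks.
    """
    rows = []
    for mask_row in keep_mask:
        kept = [j for j, keep in enumerate(mask_row) if keep]
        rows.append(kept + [SKIP] * (len(mask_row) - len(kept)))
    return rows


if __name__ == "__main__":
    S, BLOCK_SIZE_Q, BLOCK_SIZE_KV = 512, 128, 128
    n_q, n_kv = ceil_div(S, BLOCK_SIZE_Q), ceil_div(S, BLOCK_SIZE_KV)
    # Hypothetical pattern: block-causal (Q-block i keeps KV-blocks 0..i).
    mask = [[j <= i for j in range(n_kv)] for i in range(n_q)]
    for row in build_sabi_rows(mask):
        print(row)
```

Repeating the resulting slice for every batch and head gives the [B, N, ...] shape expected by the call above.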

softmax_lse output

Property	Value
Controlled by	`softmax_lse_flag` (bool attr, default `False`)
Output index	1 (always returned; empty when flag is `False`)
Shape when enabled	`[B, N, S]`
Dtype	`float32` (regardless of Q/K/V dtype)
Layout	Non-TND layouts only (`BNSD`, `BSH`, `BSND`); TND returns `{0}`
Semantics	Per-query log-sum-exp: `log(Σ exp(q·kᵀ / √d))` over all attended KV tokens

The LSE is computed during the same kernel pass as the attention output at no additional memory-bandwidth cost. It is useful for ring attention, speculative decoding rescaling, and any application that needs to merge partial attention results across segments.

When softmax_lse_flag=False the kernel skips the LSE write-out path and returns a zero-element placeholder tensor; the caller does not need to allocate memory for it.
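To illustrate why the LSE is enough to merge partial attention results, here is a minimal pure-Python sketch for a single query with scalar values. It is not the kernel's implementation: attend and merge are hypothetical helper names, and the two partial results are assumed to come from disjoint KV segments.

```python
import math


def attend(q, keys, values, scale):
    """Softmax attention for one query over scalar values; returns (out, lse)."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    out = sum(math.exp(s - lse) * v for s, v in zip(scores, values))
    return out, lse


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results computed over disjoint KV segments."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    # Each partial output is reweighted by its share of the total softmax mass.
    out = math.exp(lse_a - lse) * out_a + math.exp(lse_b - lse) * out_b
    return out, lse


if __name__ == "__main__":
    q = [1.0, 2.0]
    keys = [[0.5, 0.1], [1.0, -1.0], [0.0, 2.0], [-0.3, 0.7]]
    values = [1.0, 2.0, 3.0, 4.0]
    scale = 1.0 / math.sqrt(len(q))
    full = attend(q, keys, values, scale)
    merged = merge(*attend(q, keys[:2], values[:2], scale),
                   *attend(q, keys[2:], values[2:], scale))
    print(full, merged)
```

The merged pair should match the result of attending over all keys at once, up to floating-point rounding.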

Example run in C++:

A plain C++ example is provided in the examples subdirectory. Run it using:

bash build.sh --experimental --run_example blitz_sparse_attention eager cust --soc=ascend910b --vendor_name=custom

The output should be:

```shell
[2026-05-20 10:44:46] Warning: The current environment is configured for ascend910b, Please use Atlas A2 series hardware for optimal performance.
[2026-05-20 10:44:47]
[2026-05-20 10:44:47] Start to run example,name:blitz_sparse_attention mode:eager
[2026-05-20 10:44:47] Start compile and run example file: ../experimental/attention/blitz_sparse_attention/examples/test_aclnn_blitz_sparse_attention.cpp
[2026-05-20 10:44:47] pkg_mode:cust vendor_name:custom
[2026-05-20 10:44:51] Initializing ACL...
[2026-05-20 10:44:51] Initializing tensors...
[2026-05-20 10:44:51] Tensor shapes:
[2026-05-20 10:44:51] query: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] key: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] value: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] sabi: [1, 8, 4, 4] (B, N, Q_tiles, KV_tiles) uint16
[2026-05-20 10:44:51] out: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] lse: [1, 8, 512] (B, N, S) float32
[2026-05-20 10:44:51] Executing BlitzSparseAttention...
[2026-05-20 10:44:51] Synchronizing stream...
[2026-05-20 10:44:51] Processing results...
[2026-05-20 10:44:51] Output results (first 10 values as raw fp16 hex):
[2026-05-20 10:44:51] output[0] = 0x3C00
[2026-05-20 10:44:51] output[1] = 0x3C00
[2026-05-20 10:44:51] output[2] = 0x3C00
[2026-05-20 10:44:51] output[3] = 0x3C00
[2026-05-20 10:44:51] output[4] = 0x3C00
[2026-05-20 10:44:51] output[5] = 0x3C00
[2026-05-20 10:44:51] output[6] = 0x3C00
[2026-05-20 10:44:51] output[7] = 0x3C00
[2026-05-20 10:44:51] output[8] = 0x3C00
[2026-05-20 10:44:51] output[9] = 0x3C00
[2026-05-20 10:44:51] LSE results (first 10 values, expect ~17.5520 for all-ones input):
[2026-05-20 10:44:51] lse[0] = 17.550825
[2026-05-20 10:44:51] lse[1] = 17.550825
[2026-05-20 10:44:51] lse[2] = 17.550825
[2026-05-20 10:44:51] lse[3] = 17.550825
[2026-05-20 10:44:51] lse[4] = 17.550825
[2026-05-20 10:44:51] lse[5] = 17.550825
[2026-05-20 10:44:51] lse[6] = 17.550825
[2026-05-20 10:44:51] lse[7] = 17.550825
[2026-05-20 10:44:51] lse[8] = 17.550825
[2026-05-20 10:44:51] lse[9] = 17.550825
[2026-05-20 10:44:51] Cleaning up resources...
[2026-05-20 10:44:51] Test completed successfully!
[2026-05-20 10:44:51] run test_aclnn_blitz_sparse_attention, execute samples success
[2026-05-20 10:44:51] Example completed successfully
```
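The expected values in this log can be reproduced by hand. The sketch below assumes that query and key are all ones (as the log message suggests), that the scale is 1/√D, and that the example's sabi keeps every block so that all 512 KV tokens are attended; the last two points are assumptions, not taken from the example source.

```python
import math
import struct

S, D = 512, 128  # sequence length and head dimension from the tensor shapes in the log

# For all-ones vectors q·k = D, so every scaled score equals D / sqrt(D).
score = D / math.sqrt(D)

# log(sum over S identical terms exp(score)) = score + log(S)
expected_lse = score + math.log(S)
print(f"expected lse = {expected_lse:.4f}")

# 0x3C00 is the IEEE half-precision bit pattern printed for the outputs.
one_fp16 = struct.unpack("<e", (0x3C00).to_bytes(2, "little"))[0]
print(f"0x3C00 as fp16 = {one_fp16}")
```

The small gap between this value and the 17.550825 printed by the kernel is plausibly due to reduced-precision arithmetic on fp16 inputs.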

Kernel integration plan

If this block-sparse kernel is of interest, please consider merging it with the official attention/prompt_flash_attention. The source is based on attention/prompt_flash_attention at git commit a574b5d71faa7c360934a6c7d1b4aa85e1a49147.

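Product support

Product	Supported
Atlas A3 training series products / Atlas A3 inference series products	√
Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components	√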

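Function description

Operator function: a FlashAttention operator for the full-sequence (prefill) inference scenario. It supports sparse optimization, actualSeqLengthsKv optimization, INT8 quantization, and a choice between high-precision and high-performance modes.
Formula:

Self-attention builds an attention model from the relationships within the input sample itself. Suppose the input sequence $x$ has length $n$ and each element of $x$ is a $d$-dimensional vector, which can be viewed as a token embedding. Passing this sequence through 3 weight matrices yields 3 matrices of dimension $n \times d$.

The self-attention formula is generally defined as follows, where $Q$, $K$, and $V$ are key attributes of the input sample, obtained by spatial transformations of the input and mappable into a common feature space. In the formula and in the operator name, "Attention" is short for "self-attention".
$Attention(Q,K,V) = Score(Q,K)V$
In this operator the Score function is Softmax, so the self-attention formula is:
$Attention(Q,K,V)=Softmax(\frac{QK^T}{\sqrt{d}})V$
Here the product of $Q$ and $K^T$ represents the attention over the input $x$. To keep this value from growing too large, it is usually scaled by dividing by the square root of $d$, and softmax normalization is applied to each row. Multiplying by $V$ then yields an $n \times d$ matrix.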
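A minimal dense reference for the formula above, written in plain Python on nested lists, can be useful when checking small cases by hand. It is only a sketch of the textbook computation and says nothing about how the kernel tiles or schedules the work; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute Softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # One row of scaled scores: dot product of q with every key.
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
        # Weighted sum of the value rows, column by column.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


if __name__ == "__main__":
    Q = [[1.0, 2.0]]
    K = [[1.0, 1.0], [1.0, 1.0]]   # identical keys give uniform weights
    V = [[0.0, 2.0], [4.0, 6.0]]
    print(attention(Q, K, V))
```

With identical keys every weight is equal, so the output is simply the mean of the value rows, which makes the example easy to verify.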

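Parameter description

Parameter	Input/Output	Description	Data type	Data format
query	Input	Input Q in the formula.	FLOAT16, BFLOAT16, INT8	ND
key	Input	Input K in the formula.	FLOAT16, BFLOAT16, INT8	ND
value	Input	Input V in the formula.	FLOAT16, BFLOAT16, INT8	ND
attentionOut	Output	Output of the formula.	FLOAT16, BFLOAT16, INT8	ND
softmax_lse	Output	Log-sum-exp value for each query token: log(Σ exp(q·kᵀ/√d)). Used in scenarios such as ring attention that need to merge partial attention results. Returns an empty tensor (numel=0) when softmax_lse_flag is False.	FLOAT32	ND, shape [B, N, S]
softmax_lse_flag	Attribute (input)	Whether to output softmax_lse. Passing False (the default) is recommended when the LSE is not needed.	BOOL	-

Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components: supported data types are FLOAT16, BFLOAT16, and INT8.
Atlas inference series accelerator card products: only FLOAT16 is supported.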

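Constraints

When this interface is used together with PyTorch, the versions of the CANN packages and the PyTorch packages must match.
Handling of empty inputs: the operator checks whether query is empty and returns immediately if it is. If query is a non-empty tensor while key and value are empty tensors, attentionOut is filled with zeros. An empty attentionOut tensor is handled by the AscendCLNN framework. For any other input marked "nullptr allowed" in the parameter description above, a null pointer is left unprocessed.
The query, key, and value inputs are subject to the following limits:
- Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components:
  - The B axis must be at most 65536 (64k). If the D axis is not 32-aligned when the input types include INT8, or not 16-aligned when the input type is FLOAT16 or BFLOAT16, the B axis is supported only up to 128.
  - The N axis must be at most 256.
  - S must be at most 20971520 (20M). In some long-sequence scenarios an excessive amount of computation can cause the bsa operator to time out (reported as an aicore error with errorStr: timeout or trap error). In that case, splitting along S is recommended. Note that the amount of computation depends on B, S, N, D, and so on: the larger the values, the more computation. Typical long-sequence scenarios that time out (that is, where the product of B, S, N, and D is large) include but are not limited to:

    B	Q_N	Q_S	D	KV_N	KV_S
    1	20	2097152	256	1	2097152
    1	2	20971520	256	2	20971520
    20	1	2097152	256	1	2097152
    1	10	2097152	512	1	2097152
  - The D axis must be at most 512. When inputLayout is BSH or BSND, N*D must be less than 65535.
- Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components: combined limits on the query, key, and value inputs in the TND scenario:
  - T must be at most 65536.
  - N must be 8, 16, 32, 64, or 128, and Q_N, K_N, and V_N must be equal.
  - Q_D and K_D must be 192; V_D must be 128 or 192.
  - Only the BFLOAT16 data type is supported.
  - Sparse mode supports only sparse=0 with no mask, or sparse=3 with a mask.
  - When sparse=3, each batch's own actualSeqLengths must be less than its actualSeqLengthsKv.
When inputLayout is BNSD_BSND, the input query has shape BNSD and the output attentionOut has shape BSND. In all other cases the shape of attentionOut must match the shape of the input query.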
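The non-TND limits listed above can be checked before launching the kernel. The sketch below is a hypothetical pre-flight helper, not part of torch_bsa; it encodes only the B, N, S, and D limits stated in this section and simplifies the dtype rule to a single dtype string.

```python
def check_shape_limits(B, N, S, D, input_layout="BNSD", dtype="bfloat16"):
    """Return a list of violated limits; an empty list means all checks pass."""
    problems = []
    # D must be 32-aligned for INT8 and 16-aligned for FLOAT16/BFLOAT16,
    # otherwise B is supported only up to 128.
    align = 32 if dtype == "int8" else 16
    max_b = 65536 if D % align == 0 else 128
    if B > max_b:
        problems.append(f"B={B} exceeds {max_b}")
    if N > 256:
        problems.append(f"N={N} exceeds 256")
    if S > 20971520:
        problems.append(f"S={S} exceeds 20971520")
    if D > 512:
        problems.append(f"D={D} exceeds 512")
    if input_layout in ("BSH", "BSND") and N * D >= 65535:
        problems.append(f"N*D={N * D} must be less than 65535 for {input_layout}")
    return problems


if __name__ == "__main__":
    # Shape used in the benchmark section of this README.
    print(check_shape_limits(B=1, N=3, S=118806, D=128))
    # D=120 is not 16-aligned, so B is capped at 128.
    print(check_shape_limits(B=200, N=8, S=1024, D=120))
```

Passing these checks does not guarantee that a run avoids the timeout described above, since that depends on the overall amount of computation.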


Invocation

Invocation method	Sample code	Description
aclnn API	test_aclnn_BlitzSparseAttention	Calls the BlitzSparseAttention operator through aclnnBlitzSparseAttention