ops-transformer_8242/experimental/attention/blitz_sparse_attention/op_kernel · zhuzemao/ops-transformer_8242 - AtomGit

cann-robot: [experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

08c90e7e, created 5 days ago

File	Last commit	Last updated
blitz_sparse_attention.cpp	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output	5 days ago

Co-authored-by: Konstantin Berestizshevsky <konstantin.berestizshevsky@huawei.com>

\# message auto-generated for no-merge-commit merge: !5498 merge blitz-sparse-attention-128x128 into master

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

1. blitz_sparse_attention now supports 16 block shapes: (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024). Previously only (128,512) was supported. The shape is selected with the new `block_shape` argument of the Python API.
2. The blitz_sparse_attention kernel always emits the softmax LSE as a second output. The tensor is actually computed only when True is passed to the new boolean input of the Python API, `softmax_lse_flag`.

## Related Issue

https://gitcode.com/cann/ops-transformer/issues/2509

## Testing

Make sure you have CANN 8.5.0, then follow these steps to upgrade to Python 3.11 and install the necessary Python packages:

```
pushd /tmp
# install important pre-requisites
apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \
    libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \
    libbz2-dev ninja-build

# Python 3.11.10 - 4 minutes
wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz
tar -xf Python-3.11.10.tgz
pushd Python-3.11.10
./configure --prefix=/opt/python311 --enable-optimizations
make -j$(nproc)
make install
pip install --upgrade pip
ln -sf /opt/python311/bin/python3 /usr/bin/python
ln -sf /opt/python311/bin/python3 /usr/bin/python3
ln -sf /opt/python311/bin/pip3 /usr/bin/pip
ln -sf /opt/python311/bin/pip3 /usr/bin/pip3
echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
popd

# python packages
pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \
    pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \
    requests==2.32.5 absl-py==2.4.0 pytest==9.0.2
pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --upgrade pip setuptools wheel
popd
```

Then run the container on your machine, clone the repo, and change into the repo directory.

Once you are inside the container with the repo cloned:

1st step - build the kernel and the torch interface to our blitz_sparse_attention (the torch_bsa Python package):

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

2nd step - testing:

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest .
```

This should test the attention output (test_attn), the LSE separately (test_lse), and the joint attention and LSE outputs (test_joint). All tests should be green.

3rd step - benchmarking (run from within `experimental/attention/blitz_sparse_attention/benchmark`):

```shell
python benchmark.py | tee >(python plot.py)
```

This prints a table of latencies for several block_shapes, all with BNSD=(1,3,118k,128). In addition, an image `benchmark.png` is created to summarize the table. The speedup compared to npu_fusion_attention is on par with the previous blitz_sparse_attention versions:

![benchmark.png](https://raw.gitcode.com/user-images/assets/7673863/45fb9cd4-0cd5-4e00-8122-d7f1d081c7f0/benchmark.png 'benchmark.png')

## Type labels

<!-- [x] means selected -->

- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [x] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!5498
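As a reading aid for the two features described in merge request !5498, here is a minimal pure-Python sketch of what block-sparse attention with a softmax LSE (log-sum-exp) output computes for a single head. It is not the kernel or the torch_bsa API: the function name `block_sparse_attention_ref`, the boolean `block_mask` layout, and the `1/sqrt(d)` scaling are assumptions made for illustration, and the real kernel receives block indices in its own format.

```python
import itertools
import math

# The 16 block shapes named in the commit message:
# {128,256,512,1024} x {128,256,512,1024}.
SIZES = (128, 256, 512, 1024)
BLOCK_SHAPES = list(itertools.product(SIZES, SIZES))


def block_sparse_attention_ref(q, k, v, block_mask, block_shape):
    """Reference block-sparse attention for one head (hypothetical helper).

    q: list of s_q rows of length d; k, v: lists of s_kv rows.
    block_mask[i][j] is True when the (i, j) attention block is kept.
    This mask layout is an assumption of the sketch, not the kernel's input.
    Assumes every query row keeps at least one block.
    Returns (out, lse), where lse[r] is the log-sum-exp of the kept scores of row r.
    """
    bq, bk = block_shape
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed standard attention scaling
    out, lse = [], []
    for r, q_row in enumerate(q):
        # Key columns whose block is kept for this query row.
        kept = [c for c in range(len(k)) if block_mask[r // bq][c // bk]]
        scores = [scale * sum(a * b for a, b in zip(q_row, k[c])) for c in kept]
        row_max = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        lse.append(row_max + math.log(total))
        out.append([
            sum(w * v[c][col] for w, c in zip(weights, kept)) / total
            for col in range(len(v[0]))
        ])
    return out, lse
```

The sketch accepts any block shape, so it can be exercised on tiny inputs such as `block_shape=(2, 2)`. Masking a block removes its columns from both the softmax normalization and the LSE of the affected rows.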
blitz_sparse_attention_base.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels	2 months ago

Co-authored-by: Konstantin Berestizshevsky <konstantin.berestizshevsky@huawei.com>

\# message auto-generated for no-merge-commit merge: !2517 merge block-sparse-pfa-v1 into master

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

We introduce BlitzSparseAttention, a modified PromptFlashAttentionV3 to which we added block-sparsity support to speed up the prefill when the user knows that the attention is sparse. The kernel accepts one new argument, "sabi", a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are discarded, which is where the performance gain comes from.

![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:

1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), allowing this feature to be merged into the master pfa kernel later.
2. We provide our custom PyTorch interface so that users can immediately test and try our kernel in their Python pipelines, without waiting for torch_npu adapter support.
3. pytests and kernel speed benchmarks are also included.
4. Our block-sparse prompt flash attention has already shown great end-to-end speedups in ongoing video generation pipelines, and therefore we would like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is 1.84x at 50% sparsity and 2.95x at 70% sparsity compared to dense npu_fusion_attention:

![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue

[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing

Run these 3 commands in the ops-transformer home directory to build our kernel and its PyTorch interface:

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

Testing and Benchmarking

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py       # correctness tests for sequence lengths 10k-30k, 1-4 attention heads
python benchmark.py  # performance benchmarking - check the constant input shapes defined in the script
```

The benchmark at a sequence length of 118k tokens shows a 1.84x speedup at 50% sparsity (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).

```
==========================================================================================
DTYPE=torch.bfloat16   INPUT_LAYOUT='BNSD'   ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```

## Documentation update

Readme files and docs are updated under the

## Type labels

<!-- [x] means selected -->

- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517
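The speedups quoted in the !2517 commit message can be checked against its benchmark table. The short sketch below divides the dense reference latency by each block-sparse latency; the numbers are copied from the table, while the variable and function names are made up for illustration. The table yields roughly 1.86x at 50% sparsity and 2.96x at 70%, close to the 1.84x and 2.95x stated in the text; the small gap presumably comes from a different benchmark run, which is an assumption.

```python
# Dense baseline latency (torch_npu.npu_fusion_attention), in microseconds,
# taken from the sparsity=0.00 row of the benchmark table.
REF_DENSE_USEC = 157663.17

# Our_Latency_[usec] per sparsity level, copied from the same table.
OUR_LATENCY_USEC = {
    0.00: 169537.33,
    0.05: 155995.83,
    0.10: 148569.81,
    0.20: 132693.53,
    0.30: 116889.01,
    0.40: 101534.06,
    0.50: 84899.79,
    0.60: 69480.71,
    0.70: 53176.09,
    0.80: 38088.18,
    0.90: 21708.31,
}


def speedup(sparsity):
    """Speedup of the block-sparse kernel over the dense baseline."""
    return REF_DENSE_USEC / OUR_LATENCY_USEC[sparsity]


for s in sorted(OUR_LATENCY_USEC):
    print(f"sparsity={s:.2f}  speedup={speedup(s):.2f}x")
```

Note that at 0% sparsity the ratio is below 1: in this table the block-sparse kernel is slightly slower than the dense baseline when no blocks are skipped.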
blitz_sparse_attention_base_common.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels (merge request cann/ops-transformer!2517)	2 months ago
blitz_sparse_attention_s1s2_bns1_x910.h	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	5 days ago
blitz_sparse_attention_s1s2_bns1_x910_base.h	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	5 days ago
blitz_sparse_attention_template_tiling_key.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels (merge request cann/ops-transformer!2517)	2 months ago
blitz_sparse_attention_tiling_data.h	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output (merge request cann/ops-transformer!5498)	5 days ago
blitz_sparse_attention_tilingkey.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge: !2517 merge block-sparse-pfa-v1 into master

BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description:

## Description
We introduce BlitzSparseAttention - a modified PromptFlashAttentionV3, to which we added block-sparsity support to speed up the prefill when the user knows that the attention is sparse. The kernel accepts 1 new "sabi" argument, a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The rest of the attention blocks are discarded, which yields the speedup.

![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:
1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), which leaves room to merge this feature into the master PFA kernel.
2. We provide our custom pytorch interface so that users can immediately test and try our kernel in their python pipelines, without waiting for torch_npu adapter support.
3. pytests and kernel speed benchmarks are also included.
4. Our block sparse prompt flash attention has already shown great end-to-end speedups in ongoing video generation pipelines, and therefore we would like to expose this implementation in the official ops-transformer repo for all users to have!
5. At 118k tokens and 3 attention heads, the attention kernel speedup is 1.84x at 50% sparsity and 2.95x at 70% sparsity compared to dense npu_fusion_attention:

![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue
[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing
Run these 3 commands in the ops-transformer home directory to build our kernel and its pytorch interface:
```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

Testing and Benchmarking
```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py  # correctness tests for sequence lengths 10k-30k and 1-4 attention heads
python benchmark.py  # performance benchmarking - check the constant input shapes defined in the script
```
Benchmarking at a sequence length of 118k tokens shows a 1.84x speedup at 50% sparsity (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).
```
==========================================================================================
DTYPE=torch.bfloat16   INPUT_LAYOUT='BNSD'   ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```

## Documentation updates
Readme files and docs are updated under the

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517	2 months ago
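The speedup figures can be re-derived from the latency table in the !2517 description. The sketch below divides the single dense reference latency (measured only at sparsity 0.00) by each block-sparse latency. The variable names are illustrative and not taken from `benchmark.py`. With the numbers as listed, the ratio comes out at roughly 1.86x at 50% sparsity and 2.96x at 70%, close to the quoted 1.84x and 2.95x, which may come from a separate run. The ratio at sparsity 0.00 is below 1, meaning the block-sparse kernel is somewhat slower than the dense reference when no blocks are skipped.

```python
# Dense npu_fusion_attention latency in microseconds (only measured at sparsity 0.00).
REF_LATENCY_US = 157663.17

# Block-sparse kernel latency in microseconds, keyed by sparsity, copied from the table.
OUR_LATENCY_US = {
    0.00: 169537.33, 0.05: 155995.83, 0.10: 148569.81, 0.20: 132693.53,
    0.30: 116889.01, 0.40: 101534.06, 0.50: 84899.79, 0.60: 69480.71,
    0.70: 53176.09, 0.80: 38088.18, 0.90: 21708.31,
}


def speedups(ref_us, ours_us):
    """Map each sparsity level to ref_latency / our_latency."""
    return {sparsity: ref_us / latency for sparsity, latency in ours_us.items()}


SPEEDUP = speedups(REF_LATENCY_US, OUR_LATENCY_US)

for sparsity, ratio in SPEEDUP.items():
    print(f"sparsity={sparsity:.2f}  speedup={ratio:.2f}x")
```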
kernel_data_copy_transpose.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels (See merge request: cann/ops-transformer!2517)	2 months ago
kernel_operator_softmax_compute_nz.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels (See merge request: cann/ops-transformer!2517)	2 months ago