ops-transformer_8242/experimental/attention/blitz_sparse_attention/op_host · zhuzemao/ops-transformer_8242 - AtomGit

cann-robot: [experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

08c90e7e · created 5 days ago · Commit history

File	Last commit	Last updated
op_api	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output
Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>
Co-authored-by: Konstantin Berestizshevsky<Konstantin.Berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge: !5498 merge blitz-sparse-attention-128x128 into master
[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output
Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description:
## Description
1. blitz_sparse_attention now supports 16 block shapes: (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024). Previously only (128,512) was supported. The shape is selected with the new `block_shape` argument of the Python API.
2. The blitz_sparse_attention kernel always returns the softmax LSE as a second output. The tensor is only actually computed when True is passed to the new boolean argument of the Python API, `softmax_lse_flag`.
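The 16 shapes listed in item 1 are exactly the Cartesian product of {128, 256, 512, 1024} with itself. A minimal sketch that enumerates them and checks a candidate `block_shape` against that set follows; the helper names `supported_block_shapes` and `validate_block_shape` are hypothetical illustrations and are not part of the torch_bsa API.

```python
from itertools import product

# Block edge lengths listed in the merge request description.
BLOCK_SIZES = (128, 256, 512, 1024)


def supported_block_shapes():
    """Return every (rows, cols) pair in BLOCK_SIZES x BLOCK_SIZES."""
    return list(product(BLOCK_SIZES, repeat=2))


def validate_block_shape(block_shape):
    """Hypothetical check: raise ValueError for a shape outside the supported set."""
    shape = tuple(block_shape)
    if shape not in supported_block_shapes():
        raise ValueError(f"unsupported block_shape {shape}")
    return shape


if __name__ == "__main__":
    shapes = supported_block_shapes()
    print(len(shapes))  # 16
    print(validate_block_shape((128, 512)))  # the only shape supported before this change
```

Whether the real API validates its argument this way is not stated in the description; the sketch only makes the combinatorics of the supported set explicit.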
## Related Issue
https://gitcode.com/cann/ops-transformer/issues/2509

## Testing
Make sure you have CANN 8.5.0, then follow the steps below to upgrade to Python 3.11 and install the required Python packages:

```
pushd /tmp
# install important prerequisites
apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \
    libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \
    libbz2-dev ninja-build
# Python 3.11.10 - 4 minutes
wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz
tar -xf Python-3.11.10.tgz
pushd Python-3.11.10
./configure --prefix=/opt/python311 --enable-optimizations
make -j$(nproc)
make install
pip install --upgrade pip
ln -sf /opt/python311/bin/python3 /usr/bin/python
ln -sf /opt/python311/bin/python3 /usr/bin/python3
ln -sf /opt/python311/bin/pip3 /usr/bin/pip
ln -sf /opt/python311/bin/pip3 /usr/bin/pip3
echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
popd
# python packages
pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \
    pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \
    requests==2.32.5 absl-py==2.4.0 pytest==9.0.2
pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --upgrade pip setuptools wheel
popd
```

Then run the container on your machine, clone the repo, and change into the repo directory.
Once you are inside the container and have cloned the repo:

1st step - build the kernel and the torch interface to our blitz_sparse_attention (the torch_bsa Python package):

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

2nd step - testing:

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest .
```

This tests the attention output (test_attn), the LSE output separately (test_lse), and the joint attention and LSE outputs (test_joint). All tests should pass.

3rd step - benchmarking (run from within `experimental/attention/blitz_sparse_attention/benchmark`):

```shell
python benchmark.py | tee >(python plot.py)
```

This prints a table of latencies for several block_shapes, all with BNSD=(1,3,118k,128). In addition, an image `benchmark.png` is created to summarize the table. The speedup compared to npu_fusion_attention is on par with the previous blitz_sparse_attention versions:

![benchmark.png](https://raw.gitcode.com/user-images/assets/7673863/45fb9cd4-0cd5-4e00-8122-d7f1d081c7f0/benchmark.png 'benchmark.png')

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [x] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!5498	5 days ago
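For readers unfamiliar with the second output exercised by test_lse: "softmax LSE" conventionally denotes the per-query-row log-sum-exp of the attention scores, i.e. the logarithm of the softmax denominator. The plain-Python sketch below assumes that standard definition on a single dense score row. It is a reference illustration only, not the kernel's implementation, and the function names are hypothetical.

```python
import math


def softmax_lse_row(scores):
    """Numerically stable log(sum(exp(s))) for one row of attention scores."""
    m = max(scores)  # subtract the row maximum so exp() cannot overflow
    return m + math.log(sum(math.exp(s - m) for s in scores))


def softmax_row(scores):
    """Softmax probabilities recovered from the LSE: p_j = exp(s_j - lse)."""
    lse = softmax_lse_row(scores)
    return [math.exp(s - lse) for s in scores]


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0]
    print(softmax_lse_row(row))
    print(sum(softmax_row(row)))  # probabilities sum to 1 up to rounding
```

Keeping the LSE alongside the attention output is what lets a caller renormalize or combine partial attention results later without recomputing the scores.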
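CMakeLists.txt	move fallback files to op graph lib
Co-authored-by: liusixia<liusixia@h-partners.com>
# message auto-generated for no-merge-commit merge: !4133 merge master into master
move fallback files to op graph lib
Created-by: liusixia_gitcode
Commit-by: liusixia
Merged-by: cann-robot
Description:
## Description
Dynamic-graph related: in the built-in package (built-in pkg) build, the in-repo fallback files for aclnn callbacks are now compiled into opgraph.so instead of ophost.so; the custom package (custom pkg) build is unchanged. In addition, the fallback files of the mc2 operators all included a tiling-dependent header (mc2_log.h); they are now uniformly decoupled from tiling and use mc2_common_log.h instead.

## Related Issue
https://gitcode.com/cann/ops-transformer/issues/1844

## Testing
<!-- Describe the tests run to verify your change, including but not limited to second-level smoke tests and operator generalization tests. -->

## Documentation update
<!-- If this PR includes documentation updates, point them out here, e.g. an updated README.md. -->

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [x] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4133	1 month ago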
blitz_sparse_attention_def.cpp	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output. See merge request: cann/ops-transformer!5498	5 days ago
blitz_sparse_attention_infershape.cpp	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output. See merge request: cann/ops-transformer!5498	5 days ago
blitz_sparse_attention_tiling.cpp	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output. See merge request: cann/ops-transformer!5498	5 days ago
blitz_sparse_attention_tiling.h	common directory reorganization
Co-authored-by: hello_simida<wangyi206@huawei.com>
# message auto-generated for no-merge-commit merge: !4870 merge feature/common_dir_fix_v2 into master
common directory reorganization
Created-by: hello_simida
Commit-by: hello_simida
Merged-by: cann-robot
Description:
## Description
This change reorganizes the common/ directory in two phases:
- Phase 1: rename `common/include/kernel/` to `common/include/op_kernel/`
- Phase 2: merge `common/include/tiling_base/` and `common/src/tiling_base/` into `common/include/op_host/` and `common/src/op_host/`

The corresponding CMake configuration and all `#include` path references have been updated accordingly.

Scope of impact:
- 351 files modified (include path updates)
- 8 files renamed (tiling_base → op_host)
- 2 CMakeLists.txt files modified, plus CMakeLists.txt updates in several tests directories

## Related Issue
Closes #2246

## Testing
- Build verification passed: `bash build.sh --pkg --soc=ascend910b --ops=all_gather_matmul_v2 -j16`
- The build successfully produced the `.run` package

## Documentation update
None

## Type labels
- [x] ♻️ Refactoring
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4870	26 days ago
blitz_sparse_attention_tiling_compile_info.h	common directory reorganization. See merge request: cann/ops-transformer!4870	26 days ago
blitz_sparse_attention_tiling_const.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels
Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge: !2517 merge block-sparse-pfa-v1 into master
BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels
Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description:
## Description
We introduce BlitzSparseAttention, a modified PromptFlashAttentionV3 with added block-sparsity support that speeds up prefill when the user knows that the attention is sparse. The kernel accepts one new argument, "sabi", a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are skipped, which is where the speedup comes from.

![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:
1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), which allows this feature to potentially be merged into the master pfa kernel.
2. We provide a custom PyTorch interface so that users can immediately test and try the kernel in their Python pipelines, without waiting for torch_npu adapter support.
3. pytest tests and kernel speed benchmarks are also included.
4. Our block-sparse prompt flash attention has already shown substantial end-to-end speedups in ongoing video generation pipelines, so we would like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is 1.84x at 50% sparsity and 2.95x at 70% sparsity compared to dense npu_fusion_attention:

![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains traction, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue
[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing
Run these 3 commands in the ops-transformer home directory to build our kernel and its PyTorch interface:

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

Testing and benchmarking:

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py       # correctness tests for sequence lengths 10k-30k, 1-4 attention heads
python benchmark.py  # performance benchmarking - check the constant input shapes defined in the script
```

Benchmarking at a sequence length of 118k tokens shows a 1.84x speedup at 50% sparsity (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).
```
==========================================================================================
DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```

## Documentation update
Readme files and docs are updated under the

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517	2 months ago
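The speedups quoted in this merge request can be cross-checked against the latency table by dividing the dense reference latency by the sparse kernel latency at each sparsity level. The sketch below does that for a few rows copied from the table; the names `REF_LATENCY_US`, `OUR_LATENCY_US` and `speedup` are illustrative only. The resulting ratios (about 1.86x at 50% and 2.96x at 70% sparsity) are close to, but not identical with, the quoted 1.84x and 2.95x, which may stem from a different run.

```python
# Dense reference latency (torch_npu.npu_fusion_attention), taken from the
# sparsity 0.00 row of the table, in microseconds.
REF_LATENCY_US = 157663.17

# sparsity -> blitz_sparse_attention latency in microseconds (from the table).
OUR_LATENCY_US = {
    0.00: 169537.33,
    0.50: 84899.79,
    0.70: 53176.09,
    0.90: 21708.31,
}


def speedup(sparsity):
    """Speedup of the sparse kernel over the dense reference at one sparsity level."""
    return REF_LATENCY_US / OUR_LATENCY_US[sparsity]


if __name__ == "__main__":
    for s in sorted(OUR_LATENCY_US):
        print(f"sparsity={s:.2f}  speedup={speedup(s):.2f}x")
```

Note that at 0% sparsity the ratio is below 1, i.e. in this table the sparse kernel is slightly slower than the dense baseline when no blocks are skipped.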
blitz_sparse_attention_tiling_context.h	[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output. See merge request: cann/ops-transformer!5498	5 days ago
blitz_sparse_attention_tiling_register.cpp	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels. See merge request: cann/ops-transformer!2517	2 months ago
blitz_sparse_attention_tiling_struct.h	BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels
Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge: !2517 merge block-sparse-pfa-v1 into master
BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels
Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot
Description:

## Description
We introduce BlitzSparseAttention, a modified PromptFlashAttentionV3 to which we added block-sparsity support to speed up prefill when the user knows that the attention is sparse. The kernel accepts one new argument, "sabi", a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are discarded, which is where the speedup comes from.
![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:
1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), allowing a potential merge of this feature into the master PFA kernel.
2. We provide a custom PyTorch interface so that users can immediately try our kernel in their Python pipelines, without waiting for torch_npu adapter support.
3. Pytest tests and kernel speed benchmarks are also included.
4. Our block-sparse prompt flash attention has already shown substantial end-to-end speedups in ongoing video generation pipelines, so we would like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is 1.84x at 50% sparsity and 2.95x at 70% sparsity compared to dense npu_fusion_attention:
![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue
[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing
Run these 3 commands in the ops-transformer home directory to build our kernel and its PyTorch interface:
```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```
Testing and Benchmarking
```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py # correctness tests for sequence lengths 10k-30k, 1-4 attention heads
python benchmark.py # performance benchmarking - check the constant input shapes defined in the script
```
Benchmarking at a sequence length of 118k tokens shows a 1.84x speedup at 50% sparsity (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).
```
==========================================================================================
DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```

## Documentation Updates
Readme files and docs are updated under the

## Type Labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517	2 months ago
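The description above says the kernel processes only the attention blocks listed in the "sabi" tensor and skips the rest. The following pure-Python reference is a rough sketch of that idea for a single head, not the kernel's implementation. The `kept_kv_blocks` layout (one list of KV-block indices per query block) is a hypothetical stand-in for the real "sabi" format, which is not specified here. The 1/sqrt(d) scaling is the standard scaled dot-product convention and is also an assumption. The block sizes are toy values rather than 128x512.

```python
import math
import random


def block_sparse_attention(q, k, v, kept_kv_blocks, block_q, block_kv):
    """Reference block-sparse attention for a single head.

    q: s_q rows of length d; k, v: s_kv rows of length d (lists of floats).
    kept_kv_blocks[i] holds the KV-block indices that query block i attends to
    (hypothetical layout, standing in for the kernel's "sabi" tensor).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for row, q_row in enumerate(q):
        # Key/value rows covered by the blocks kept for this query block.
        cols = [c
                for b in kept_kv_blocks[row // block_q]
                for c in range(b * block_kv, min((b + 1) * block_kv, len(k)))]
        if not cols:  # no kept block: emit zeros (a choice made in this sketch)
            out.append([0.0] * d)
            continue
        scores = [scale * sum(x * y for x, y in zip(q_row, k[c])) for c in cols]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c][j] for w, c in zip(weights, cols)) / total
                    for j in range(d)])
    return out


def sparsity(kept_kv_blocks, num_kv_blocks):
    """Fraction of attention blocks that are skipped."""
    kept = sum(len(blocks) for blocks in kept_kv_blocks)
    return 1.0 - kept / (len(kept_kv_blocks) * num_kv_blocks)


def random_rows(n, d):
    """n rows of d standard-normal floats."""
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    random.seed(0)
    s_q, s_kv, d = 4, 8, 3
    block_q, block_kv = 2, 4  # toy sizes; the kernel described above uses 128x512
    q, k, v = random_rows(s_q, d), random_rows(s_kv, d), random_rows(s_kv, d)
    kept = [[0], [0, 1]]  # query block 0 skips KV block 1
    out = block_sparse_attention(q, k, v, kept, block_q, block_kv)
    print(len(out), len(out[0]), sparsity(kept, s_kv // block_kv))
```

The work done is proportional to the number of kept blocks, which matches the benchmark table: latency falls roughly linearly as sparsity rises.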