File | Last commit | Last updated
[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>

Message auto-generated for no-merge-commit merge: !5498 merge blitz-sparse-attention-128x128 into master

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

1. blitz_sparse_attention now supports **16 block shapes**: (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024). Previously only (128,512) was supported. The shape is selected with the new block_shape argument of the Python API.
2. The blitz_sparse_attention kernel always emits **softmax LSE** as a second output. The tensor is actually computed only when True is passed to the new boolean input of the Python API, softmax_lse_flag.

## Related Issue

https://gitcode.com/cann/ops-transformer/issues/2509

## Testing

Install CANN 8.5.0, then follow these steps to upgrade to Python 3.11 and install the required Python packages:

```
pushd /tmp

# install important pre-requisites
apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \
    libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \
    libbz2-dev ninja-build

# Python 3.11.10 - 4 minutes
wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz
tar -xf Python-3.11.10.tgz
pushd Python-3.11.10
./configure --prefix=/opt/python311 --enable-optimizations
make -j$(nproc)
make install
pip install --upgrade pip
ln -sf /opt/python311/bin/python3 /usr/bin/python
ln -sf /opt/python311/bin/python3 /usr/bin/python3
ln -sf /opt/python311/bin/pip3 /usr/bin/pip
ln -sf /opt/python311/bin/pip3 /usr/bin/pip3
echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
popd

# python packages
pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \
    pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \
    requests==2.32.5 absl-py==2.4.0 pytest==9.0.2
pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --upgrade pip setuptools wheel
popd
```

**Then start the container on your machine, clone the repo, and change into the repo directory.**

Once inside the container with the repo cloned:

**1st step** - build the kernel and the torch interface to our blitz_sparse_attention (torch_bsa python package)

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

**2nd step - testing:**

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest .
```

This tests the attention output (test_attn), the LSE output on its own (test_lse), and the joint attention and LSE outputs (test_joint). All tests should pass.

**3rd step - benchmarking** (run from within experimental/attention/blitz_sparse_attention/benchmark):

```shell
python benchmark.py | tee >(python plot.py)
```

This prints a table of latencies for several block_shapes, all with BNSD=(1,3,118k,128), and creates an image benchmark.png that summarizes the table. The speedup compared to npu_fusion_attention is on par with the previous blitz_sparse_attention versions:

![benchmark.png](https://raw.gitcode.com/user-images/assets/7673863/45fb9cd4-0cd5-4e00-8122-d7f1d081c7f0/benchmark.png 'benchmark.png')

## Type labels

<!-- [x] means selected -->

- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [x] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!5498

5 days ago
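The two features described in the !5498 entry above, a configurable block shape and a softmax LSE output, can be illustrated with a naive pure-Python reference. This is only a sketch under stated assumptions: the function name, the `active_blocks` set of (query block, key/value block) index pairs, and the handling of fully masked rows are hypothetical and are not the torch_bsa API or the kernel's actual index-tensor layout. LSE is taken here to be the natural-log log-sum-exp of the scaled scores over the processed keys.

```python
import math


def block_sparse_attention_reference(q, k, v, active_blocks, block_shape, scale):
    """Naive reference: attention output and softmax LSE over active blocks only.

    q: list of S_q rows; k and v: lists of S_kv rows; each row is a list of floats.
    active_blocks: set of (q_block_idx, kv_block_idx) pairs (hypothetical format).
    block_shape: (rows, cols) of one attention block, e.g. (128, 512).
    """
    block_q, block_kv = block_shape
    head_dim_v = len(v[0])
    out, lse = [], []
    for i, q_row in enumerate(q):
        # Keep only the keys whose block is marked active for this query block.
        cols = [j for j in range(len(k))
                if (i // block_q, j // block_kv) in active_blocks]
        if not cols:
            # Assumption: a row with no active block yields zeros and LSE = -inf.
            out.append([0.0] * head_dim_v)
            lse.append(float("-inf"))
            continue
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j])) for j in cols]
        row_max = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in scores]
        denom = sum(exps)
        lse.append(row_max + math.log(denom))
        out.append([sum(e * v[j][d] for e, j in zip(exps, cols)) / denom
                    for d in range(head_dim_v)])
    return out, lse


# Tiny example with a (2, 2) block shape so the block grid is 2 x 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
out, lse = block_sparse_attention_reference(
    q, k, v, active_blocks={(0, 0), (1, 1)}, block_shape=(2, 2), scale=1.0)
```

With only the diagonal blocks active, each query row attends to the two keys in its own block, so `lse[0]` is the log-sum-exp of just two scores. Real block shapes such as (128, 512) work the same way but on a much coarser grid.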
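move fallback files to op graph lib

Co-authored-by: liusixia<liusixia@h-partners.com>

Message auto-generated for no-merge-commit merge: !4133 merge master into master

Created-by: liusixia_gitcode
Commit-by: liusixia
Merged-by: cann-robot

Description:

## Description

Dynamic-graph related: in the built-in package (built-in pkg), the fallback files for the in-repo aclnn callbacks are now compiled into opgraph.so instead of ophost.so. The custom package (custom pkg) is unchanged.

In addition, the fallback files of the mc2 operators all currently include a tiling-dependent header (mc2_log.h). They are now decoupled from tiling and use mc2_common_log.h instead.

## Related Issue

https://gitcode.com/cann/ops-transformer/issues/1844

## Testing

<!-- Describe the tests you ran to verify your change, including but not limited to level-2 smoke tests and operator generalization tests. -->

## Documentation update

<!-- If this PR includes documentation updates, note them here, e.g. README.md was updated. -->

## Type labels

<!-- [x] means selected -->

- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [x] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4133

1 month ago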
common directory reorganization

Co-authored-by: hello_simida<wangyi206@huawei.com>

Message auto-generated for no-merge-commit merge: !4870 merge feature/common_dir_fix_v2 into master

Created-by: hello_simida
Commit-by: hello_simida
Merged-by: cann-robot

Description:

## Description

This change reorganizes the common/ directory in two phases:

- **Phase 1**: rename common/include/kernel/ to common/include/op_kernel/
- **Phase 2**: merge common/include/tiling_base/common/src/tiling_base/ into common/include/op_host/common/src/op_host/

The corresponding CMake configuration and all #include path references have been updated accordingly.

Scope of impact:

- 351 files modified (include path updates)
- 8 files renamed (tiling_base → op_host)
- 2 CMakeLists.txt files modified, plus CMakeLists.txt updates in several tests directories

## Related Issue

Closes #2246

## Testing

- Build verification passed: bash build.sh --pkg --soc=ascend910b --ops=all_gather_matmul_v2 -j16
- The build successfully produced the .run package

## Documentation update

None

## Type labels

- [x] ♻️ Refactoring
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4870

26 days ago
BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com>

Message auto-generated for no-merge-commit merge: !2517 merge block-sparse-pfa-v1 into master

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

We introduce **BlitzSparseAttention** - a modified PromptFlashAttentionV3 to which we added **block-sparsity support**, to speed up prefill when the user knows that the attention is sparse. The kernel accepts **1 new "sabi" argument**, a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are discarded, which is where the **speedup** comes from.

![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:

1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), which allows this feature to be merged into the master pfa kernel later.
2. We provide our custom **pytorch interface** so that users can immediately try the kernel in their Python pipelines, without waiting for torch_npu adapter support.
3. **pytests** and **kernel speed benchmarks** are also included.
4. Our block-sparse prompt flash attention has already shown **great end-to-end speedups** in ongoing video generation pipelines, so we would like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is **1.84x at 50% sparsity** and **2.95x at 70% sparsity** compared to dense npu_fusion_attention:

![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue

[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing

Run these 3 commands in the ops-transformer home directory to build our kernel and its pytorch interface:

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

Testing and Benchmarking

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py       # correctness tests for sequence lengths 10k-30k, 1-4 attention heads
python benchmark.py  # performance benchmarking - check the constant input shapes defined in the script
```

Benchmarking at a sequence length of 118k tokens shows a **1.84x speedup at 50% sparsity** (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).

```
==========================================================================================
DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                  84899.79
3  1  118806  118806  128  0.60      N/A            N/A                  69480.71
3  1  118806  118806  128  0.70      N/A            N/A                  53176.09
3  1  118806  118806  128  0.80      N/A            N/A                  38088.18
3  1  118806  118806  128  0.90      N/A            N/A                  21708.31
==========================================================================================
```

## Documentation update

Readme files and docs are updated under the

## Type labels

<!-- [x] means selected -->

- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517

2 months ago
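The speedup figures in the !2517 entry above can be recomputed from the benchmark table by dividing the dense reference latency by the block-sparse latency at each sparsity level. The sketch below uses only the numbers from that table; the variable and function names are illustrative, and the "ideal" column assumes that cost scales linearly with the fraction of blocks processed, which is an assumption rather than something the source states. Ratios computed this way land within roughly 1% of the quoted 1.84x and 2.95x.

```python
# Latencies in microseconds, copied from the benchmark table above
# (H=3, B=1, s_q=s_kv=118806, D=128, bfloat16, BNSD layout).
REF_LATENCY_US = 157663.17  # dense torch_npu.npu_fusion_attention baseline

BSA_LATENCY_US = {
    0.00: 169537.33,
    0.05: 155995.83,
    0.10: 148569.81,
    0.20: 132693.53,
    0.30: 116889.01,
    0.40: 101534.06,
    0.50: 84899.79,
    0.60: 69480.71,
    0.70: 53176.09,
    0.80: 38088.18,
    0.90: 21708.31,
}


def speedup_table(ref_us, ours_us):
    """Map sparsity -> (measured speedup, ideal speedup 1 / (1 - sparsity))."""
    # The ideal column assumes latency is proportional to the blocks processed.
    return {s: (ref_us / t, 1.0 / (1.0 - s)) for s, t in sorted(ours_us.items())}


speedups = speedup_table(REF_LATENCY_US, BSA_LATENCY_US)
for sparsity, (measured, ideal) in speedups.items():
    print(f"sparsity={sparsity:.2f}  measured={measured:.2f}x  ideal={ideal:.2f}x")
```

At 0% sparsity the measured ratio is below 1, meaning the block-sparse kernel is somewhat slower than the dense baseline when every block is processed. The crossover in this table falls between 0% and 5% sparsity.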
[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com> Co-authored-by: Konstantin Berestizshevsky<Konstantin.Berestizshevsky@huawei.com> # message auto-generated for no-merge-commit merge: !5498 merge blitz-sparse-attention-128x128 into master [experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output Created-by: kostyab Commit-by: Konstantin Berestizshevsky Merged-by: cann-robot Description: ## 描述 1. The blitz_sparse_attention supports **16 block shapes** (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024), in addition to the previously only shape (128,512). This is determined by the new block_shape argument to the python API. 2. The blitz_sparse_attention kernel always emits **softmax LSE** as a second output. The actual computation of this tensor is enabled by passing True to the new boolean input to the python api: softmax_lse_flag. 
## 关联的Issue https://gitcode.com/cann/ops-transformer/issues/2509 ## 测试 Make sure to get CANN 8.5.0, then follow the following steps to upgrade to Python 3.11 and install the necessary python packages: ``` pushd /tmp # install important pre-requisites apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \ libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \ libbz2-dev ninja-build # Python 3.11.10 - 4 minutes wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz tar -xf Python-3.11.10.tgz pushd Python-3.11.10 ./configure --prefix=/opt/python311 --enable-optimizations make -j$(nproc) make install pip install --upgrade pip ln -sf /opt/python311/bin/python3 /usr/bin/python ln -sf /opt/python311/bin/python3 /usr/bin/python3 ln -sf /opt/python311/bin/pip3 /usr/bin/pip ln -sf /opt/python311/bin/pip3 /usr/bin/pip3 echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc source ~/.bashrc popd # python packages pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \ pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \ requests==2.32.5 absl-py==2.4.0 pytest==9.0.2 pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu python3 -m pip install --upgrade pip setuptools wheel popd ``` **Then run the container on your machine and cloe the repo, go to the repo.** Once inside the container, and cloned the repo: **1st step** - build the kernel and the torch interface to our blitz_sparse_attention (torch_bsa python package) ```shell bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention ./build/cann-ops-transformer-custom_linux-"$(uname -i)".run (cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom) ``` **2nd step - testing:** ```shell cd 
experimental/attention/blitz_sparse_attention/benchmark pytest . ``` This should test the attention (test_attn), the lse separately (test_lse) and the joint attention & lse outputs (test_joint). All should be green. **3rd step -benchmarking** (run from within cd experimental/attention/blitz_sparse_attention/benchmark): ```shell python benchmark.py | tee >(python plot.py) ``` This will print the table of latencies of several block_shapes, all with BNSD=(1,3,118k,128), in addition an image benchmark.png will be created to summarize the table.The speedup compared to npu_fusion_attention is on par with the previous blitz_sparse_attention versions: ![benchmark.png](https://raw.gitcode.com/user-images/assets/7673863/45fb9cd4-0cd5-4e00-8122-d7f1d081c7f0/benchmark.png 'benchmark.png') ## 类型标签 <!-- [x] 表示选中 --> - [ ] 🐛 Bug 修复 - [x] ✨ 新特性 - [ ] ⚡ 性能优化 - [ ] ♻️ 重构 - [ ] 🧪 测试 - [ ] 📦 构建/CI - [ ] 🔧 配置变更 - [ ] 📝 文档更新 - [ ] ⬆️ 依赖升级 - [ ] 🔒 安全修复 - [x] 🧹 代码清理 - [ ] ❓ 其他,请描述: See merge request: cann/ops-transformer!54985 天前
BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels Co-authored-by: Konstantin Berestizshevsky<konstantin.berestizshevsky@huawei.com> # message auto-generated for no-merge-commit merge: !2517 merge block-sparse-pfa-v1 into master BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels Created-by: kostyab Commit-by: Konstantin Berestizshevsky Merged-by: cann-robot Description: ## 描述 We introduce **BlitzSparseAttention** - a modified PromptFlashAttentionV3, to which we added **block-sparsity support** to speed up the prefill when the user knows that the attention is sparse. We enable passing **1 new "sabi" argument** to the kernel, which is a tensor specifying the indices of (128x512)-shaped attention blocks that should be processed. The rest of the attention blocks are discraded and **performance is achieved**. ![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png') ### Advantages: 1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention) allowing potential merging of this feature to master pfa kernel. 2. We provide our custom **pytorch interface** for users to be able to immediately test and try our kernel in their python pipelines, without waiting for the torch_npu adapter support. 3. **pytests** and **kernel speed benchmarks** are also included. 4. Our block sparse prompt flash attention has already showed **great speedups end-to-end** in ongoing video generation pipelines and therefore we would like to expose this implementation in the official ops-transformer repo for all users to have! 5. 
at 118k tokens, 3 attention heads, the attention kernels speedup is **1.84x at 50% sparsity**; and **2.95x at 70% sparsity** compared to dense npu_fusion_attention: ![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png') If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4 ## 关联的Issue [Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953) ## 测试 run these 3 commands in the ops-transformer home directory to build our our kernel and its pytorch interface: ```shell bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention ./build/cann-ops-transformer-custom_linux-"$(uname -i)".run (cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom) ``` Testing and Benchmarking ```shell cd experimental/attention/blitz_sparse_attention/benchmark pytest test.py # correctness tests for sequence lengths 10k-30k 1-4 attention heads python benchmark.py # performance benchmarking - check the constant inputs shapes defined in the script ``` the benchmarking at 118k tokens sequence length shows amazing **1.84x speedup at 50% sparsity** (compared to the baseline torch_npu.npu_fusion_attention with dense attention matrix). 
``` ========================================================================================== DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched' ========================================================================================== H B s_q s_kv D sparsity Outputs_equal Ref_Latency_[usec] Our_Latency_[usec] ------------------------------------------------------------------------------------------ 3 1 118806 118806 128 0.00 yes 157663.17 169537.33 3 1 118806 118806 128 0.05 N/A N/A 155995.83 3 1 118806 118806 128 0.10 N/A N/A 148569.81 3 1 118806 118806 128 0.20 N/A N/A 132693.53 3 1 118806 118806 128 0.30 N/A N/A 116889.01 3 1 118806 118806 128 0.40 N/A N/A 101534.06 3 1 118806 118806 128 0.50 N/A N/A 84899.79 3 1 118806 118806 128 0.60 N/A N/A 69480.71 3 1 118806 118806 128 0.70 N/A N/A 53176.09 3 1 118806 118806 128 0.80 N/A N/A 38088.18 3 1 118806 118806 128 0.90 N/A N/A 21708.31 ========================================================================================== ``` ## 文档更新 Readme files and docs are updated under the ## 类型标签 <!-- [x] 表示选中 --> - [ ] 🐛 Bug 修复 - [x] ✨ 新特性 - [ ] ⚡ 性能优化 - [ ] ♻️ 重构 - [ ] 🧪 测试 - [ ] 📦 构建/CI - [ ] 🔧 配置变更 - [ ] 📝 文档更新 - [ ] ⬆️ 依赖升级 - [ ] 🔒 安全修复 - [ ] 🧹 代码清理 - [ ] ❓ 其他,请描述: See merge request: cann/ops-transformer!25172 个月前