File | Last commit | Last updated
[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

Co-authored-by: Konstantin Berestizshevsky <konstantin.berestizshevsky@huawei.com>
Co-authored-by: Konstantin Berestizshevsky <Konstantin.Berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge:
!5498 merge blitz-sparse-attention-128x128 into master

[experimental/attention/blitz_sparse_attention] - add block_shapes∈{128,256,512,1024}x{128,256,512,1024} and softmax LSE output

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

1. blitz_sparse_attention now supports **16 block shapes**: (128,128), (128,256), (128,512), (128,1024), (256,128), (256,256), (256,512), (256,1024), (512,128), (512,256), (512,512), (512,1024), (1024,128), (1024,256), (1024,512), (1024,1024). Previously only (128,512) was supported. The shape is selected with the new block_shape argument of the Python API.
2. The blitz_sparse_attention kernel always emits **softmax LSE** as a second output. The tensor is actually computed only when True is passed to the new boolean input of the Python API, softmax_lse_flag.

## Related Issue

https://gitcode.com/cann/ops-transformer/issues/2509

## Testing

Make sure to get CANN 8.5.0, then follow these steps to upgrade to Python 3.11 and install the necessary Python packages:

```
pushd /tmp
# install important pre-requisites
apt update && apt install -y wget build-essential libssl-dev zlib1g-dev \
    libncurses5-dev libncursesw5-dev libreadline-dev libsqlite3-dev libffi-dev \
    libbz2-dev ninja-build
# Python 3.11.10 - 4 minutes
wget https://www.python.org/ftp/python/3.11.10/Python-3.11.10.tgz
tar -xf Python-3.11.10.tgz
pushd Python-3.11.10
./configure --prefix=/opt/python311 --enable-optimizations
make -j$(nproc)
make install
pip install --upgrade pip
ln -sf /opt/python311/bin/python3 /usr/bin/python
ln -sf /opt/python311/bin/python3 /usr/bin/python3
ln -sf /opt/python311/bin/pip3 /usr/bin/pip
ln -sf /opt/python311/bin/pip3 /usr/bin/pip3
echo 'export PATH=/opt/python311/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
popd
# python packages
pip install --no-cache-dir attrs==25.4.0 numpy==1.26.4 decorator==5.2.1 sympy==1.14.0 cffi==2.0.0 \
    pyyaml pathlib2==2.3.7.post1 psutil==7.2.1 protobuf==6.33.2 scipy==1.15.3 \
    requests==2.32.5 absl-py==2.4.0 pytest==9.0.2
pip install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
python3 -m pip install --upgrade pip setuptools wheel
popd
```

**Then run the container on your machine, clone the repo, and change into the repo directory.** Once inside the container with the repo cloned:

**1st step** - build the kernel and the torch interface to our blitz_sparse_attention (the torch_bsa Python package)

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

**2nd step - testing:**

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest .
```

This tests the attention output (test_attn), the LSE output on its own (test_lse), and the joint attention and LSE outputs (test_joint). All tests should pass.

**3rd step - benchmarking** (run from within experimental/attention/blitz_sparse_attention/benchmark):

```shell
python benchmark.py | tee >(python plot.py)
```

This prints a table of latencies for several block_shapes, all with BNSD=(1,3,118k,128). In addition, an image benchmark.png is created to summarize the table. The speedup compared to npu_fusion_attention is on par with the previous blitz_sparse_attention versions:

![benchmark.png](https://raw.gitcode.com/user-images/assets/7673863/45fb9cd4-0cd5-4e00-8122-d7f1d081c7f0/benchmark.png 'benchmark.png')

## Type labels

<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [x] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!5498

Last updated: 1 day ago
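The two changes in this merge request can be illustrated with a small, dependency-free sketch. It enumerates the 16 block shapes as the Cartesian product of the four edge lengths and computes a reference log-sum-exp for one row of attention scores. The helper names `check_block_shape` and `reference_lse` are hypothetical and are not part of the torch_bsa package; whether the kernel applies the softmax scale before taking the LSE is not stated here, so the sketch simply takes already-scaled scores as input.

```python
import math
from itertools import product

# Edge lengths named in the merge request; their Cartesian product
# yields the 16 (query_block, kv_block) shapes.
BLOCK_EDGES = (128, 256, 512, 1024)
SUPPORTED_BLOCK_SHAPES = tuple(product(BLOCK_EDGES, repeat=2))


def check_block_shape(block_shape):
    """Hypothetical helper: reject shapes outside the 16 listed ones."""
    shape = tuple(block_shape)
    if shape not in SUPPORTED_BLOCK_SHAPES:
        raise ValueError(f"unsupported block_shape {shape}")
    return shape


def reference_lse(scores_row):
    """Numerically stable log-sum-exp over one row of (scaled) scores.

    Subtracting the row maximum before exponentiating avoids overflow.
    """
    m = max(scores_row)
    return m + math.log(sum(math.exp(s - m) for s in scores_row))


if __name__ == "__main__":
    print(len(SUPPORTED_BLOCK_SHAPES))
    print(check_block_shape((128, 512)))
    print(reference_lse([0.0, 0.0]))
```

A reference like this is useful for checking an LSE output row by row on small inputs; it is far too slow for the 118k-token shapes used in the benchmark.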
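move fallback files to op graph lib

Co-authored-by: liusixia <liusixia@h-partners.com>

# message auto-generated for no-merge-commit merge:
!4133 merge master into master

move fallback files to op graph lib

Created-by: liusixia_gitcode
Commit-by: liusixia
Merged-by: cann-robot

Description:

## Description

Dynamic-graph related: for the built-in package (built-in pkg), the fallback files of the in-repo aclnn callbacks are now compiled into opgraph.so instead of ophost.so. For the custom package (custom pkg), nothing changes.
The fallback files of the mc2 operators currently all include a tiling-dependent header (mc2_log.h). They are uniformly decoupled from tiling and now use mc2_common_log.h.

## Related Issue

https://gitcode.com/cann/ops-transformer/issues/1844

## Testing

<!-- Describe the tests you ran to verify your change, including but not limited to level-2 smoke tests and operator generalization tests. -->

## Documentation update

<!-- If this PR includes documentation updates, point them out here. For example: updated the README.md file. -->

## Type labels

<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [x] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4133

Last updated: 1 month ago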
BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Co-authored-by: Konstantin Berestizshevsky <konstantin.berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge:
!2517 merge block-sparse-pfa-v1 into master

BlitzSparseAttention - Add high performance Block-Sparse Prompt-Flash-Attention to experimental kernels

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

We introduce **BlitzSparseAttention**, a modified PromptFlashAttentionV3 to which we added **block-sparsity support** to speed up prefill when the user knows that the attention is sparse. The kernel accepts **1 new "sabi" argument**, a tensor specifying the indices of the (128x512)-shaped attention blocks that should be processed. The remaining attention blocks are discarded, which is where the **speedup** comes from.

![image.png](https://raw.gitcode.com/user-images/assets/7673863/8769c139-3cd0-49ee-9ce9-7e76afa26b0f/image.png 'image.png')

### Advantages:

1. We improve the existing code of [attention/prompt_flash_attention](https://gitcode.com/cann/ops-transformer/tree/master/attention/prompt_flash_attention), which allows this feature to be merged into the master pfa kernel later.
2. We provide our custom **pytorch interface** so that users can immediately test and try our kernel in their Python pipelines, without waiting for torch_npu adapter support.
3. **pytests** and **kernel speed benchmarks** are also included.
4. Our block-sparse prompt flash attention has already shown **great end-to-end speedups** in ongoing video generation pipelines, so we would like to expose this implementation in the official ops-transformer repo for all users.
5. At 118k tokens and 3 attention heads, the attention kernel speedup is **1.84x at 50% sparsity** and **2.95x at 70% sparsity** compared to dense npu_fusion_attention:

![image.png](https://raw.gitcode.com/user-images/assets/7673863/9e2a4c10-4a1b-4c0a-9544-e6140cbdff1f/image.png 'image.png')

If this feature gains attention, please consider merging it into attention/prompt_flash_attention as V4.

## Related Issue

[Requirement Issue number 953](https://gitcode.com/cann/ops-transformer/issues/953)

## Testing

Run these 3 commands in the ops-transformer home directory to build our kernel and its pytorch interface:

```shell
bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)
```

Testing and benchmarking:

```shell
cd experimental/attention/blitz_sparse_attention/benchmark
pytest test.py       # correctness tests for sequence lengths 10k-30k, 1-4 attention heads
python benchmark.py  # performance benchmarking - check the constant input shapes defined in the script
```

The benchmark at a sequence length of 118k tokens shows a **1.84x speedup at 50% sparsity** (compared to the baseline torch_npu.npu_fusion_attention with a dense attention matrix).

```
==========================================================================================
DTYPE=torch.bfloat16 INPUT_LAYOUT='BNSD' ATTENTION_MATRIX='blocks_optimized_batched'
==========================================================================================
H  B  s_q     s_kv    D    sparsity  Outputs_equal  Ref_Latency_[usec]  Our_Latency_[usec]
------------------------------------------------------------------------------------------
3  1  118806  118806  128  0.00      yes            157663.17           169537.33
3  1  118806  118806  128  0.05      N/A            N/A                 155995.83
3  1  118806  118806  128  0.10      N/A            N/A                 148569.81
3  1  118806  118806  128  0.20      N/A            N/A                 132693.53
3  1  118806  118806  128  0.30      N/A            N/A                 116889.01
3  1  118806  118806  128  0.40      N/A            N/A                 101534.06
3  1  118806  118806  128  0.50      N/A            N/A                 84899.79
3  1  118806  118806  128  0.60      N/A            N/A                 69480.71
3  1  118806  118806  128  0.70      N/A            N/A                 53176.09
3  1  118806  118806  128  0.80      N/A            N/A                 38088.18
3  1  118806  118806  128  0.90      N/A            N/A                 21708.31
==========================================================================================
```

## Documentation update

Readme files and docs are updated under the

## Type labels

<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!2517

Last updated: 2 months ago
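The speedups implied by the benchmark table can be recomputed directly from its latency columns. The sketch below copies the numbers from the table and divides the dense reference latency by the sparse-kernel latency at each sparsity level. The `ideal_speedup` function is an added assumption, not something the merge request claims: it models cost as proportional to the fraction of blocks that are still processed. The ratios obtained from this particular table are close to, though not identical to, the 1.84x and 2.95x figures quoted in the text.

```python
# Latencies in microseconds, copied from the benchmark table.
REF_DENSE_USEC = 157663.17  # torch_npu.npu_fusion_attention, dense attention
OUR_USEC = {
    0.00: 169537.33,
    0.05: 155995.83,
    0.10: 148569.81,
    0.20: 132693.53,
    0.30: 116889.01,
    0.40: 101534.06,
    0.50: 84899.79,
    0.60: 69480.71,
    0.70: 53176.09,
    0.80: 38088.18,
    0.90: 21708.31,
}


def speedup(sparsity):
    """Measured speedup over the dense reference at a given sparsity."""
    return REF_DENSE_USEC / OUR_USEC[sparsity]


def ideal_speedup(sparsity):
    """Assumed upper bound: cost proportional to the blocks still processed."""
    return 1.0 / (1.0 - sparsity)


if __name__ == "__main__":
    for s in sorted(OUR_USEC):
        print(f"sparsity={s:.2f}  measured={speedup(s):.2f}x  ideal={ideal_speedup(s):.2f}x")
```

At 0% sparsity the ratio is below 1, which matches the table: the sparse kernel is somewhat slower than the dense reference when nothing can be skipped.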
README.md

BlitzSparseAttention — Prompt Flash Attention with Block-Sparsity

This kernel is based on PromptFlashAttentionV3, extending it with a new argument sabi to enable block-sparse attention computation during prefill. We provide a torch interface to quickly try out our kernel in end-to-end Python pipelines that may benefit from sparse computation (e.g. Hunyuan-video). Documentation of the sabi argument is in docs/aclnnBlitzSparseAttention.md.

Known limitations / TODOs

TODO 1: 128×128 sabi granularity speedup. The 128×512 sabi granularity has shown great speedups (1.89× at 50% sparsity). However, the current 128×128 version still uses 128×512 matmul tiles internally without first compacting only the sabi-selected 128×128 sub-tiles into them. In the current kernel design the cube therefore performs redundant matmuls which are then masked out during softmax — only fully empty 512-long tiles (i.e. 4 consecutive non-selected blocks) are truly skipped. As a result the speedup ramps up from sparsity ≥ 10%, in proportion to the probability of 4 consecutive non-selected blocks. Multiple attempts to rewrite the matmul scheduling have shown that a proper fix requires a full kernel rewrite; the sibling attention/block_sparse_attention kernel handles this better with its bottom-up CATLASS-based design.
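The skipping rule in TODO 1 can be illustrated with a small counting sketch. It is not kernel code: `skippable_tiles` is a hypothetical helper, and it assumes that the 512-long tiles are aligned groups of 4 consecutive 128-wide KV blocks and that a tile is skipped only when none of its blocks is selected.

```python
def skippable_tiles(kept_row, group=4):
    """Count aligned groups of `group` consecutive blocks with no kept block.

    kept_row: list of bools, one per 128-wide KV block of a single Q-block row.
    Assumption: tiles are aligned to multiples of `group` blocks.
    """
    skipped = 0
    for start in range(0, len(kept_row), group):
        if not any(kept_row[start:start + group]):
            skipped += 1
    return skipped


row = [True, False, False, False,    # one kept block: the whole tile is computed
       False, False, False, False,   # fully empty: skipped
       False, True, False, True,     # computed
       False, False, False, False]   # fully empty: skipped
total_tiles = len(row) // 4
print(skippable_tiles(row), "of", total_tiles, "tiles skipped")
```

In this row 13 of 16 blocks are non-selected, yet only half of the tiles can be skipped, which is why the 128×128 speedup lags behind the nominal sparsity.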

TODO 2: batch size > 1 is broken. Only batch_size=1 (B=1) is currently known to produce correct results. Runs with B>1 produce incorrect outputs. All tests and benchmarks must be run with B=1 until the multi-batch issue is diagnosed and fixed.

Quick test and benchmark in Python:

Build the kernel as a custom experimental package, install it, then install our "torch_bsa" torch interface package:

bash build.sh --make_clean --experimental -j96 --pkg --soc=ascend910b --ops=blitz_sparse_attention
./build/cann-ops-transformer-custom_linux-"$(uname -i)".run
(cd experimental/attention/blitz_sparse_attention/torch_interface && bash build.sh custom)

Run the tests and the benchmark:

cd experimental/attention/blitz_sparse_attention/benchmark
pytest test_attn.py # attention_out correctness tests for sequence lengths 10k-30k 1-4 attention heads, compares our block-sparse BSA against npu_fusion_attention kernel and our own python implementation
pytest test_lse.py # softmax_lse correctness tests for sequence lengths 10k-30k 1-4 attention heads, compares our block-sparse BSA against npu_fused_infer_attention_score kernel
pytest test_joint.py # simultaneously check correctness of both kernel outputs
python benchmark.py # performance benchmarking - check the constant inputs shapes defined in the script

The tests should all be green. benchmark.py sweeps every sparsity at every pair in BLOCK_SHAPES and labels each row with its active block_shape. The frame(L,R,T,B) column shows the per-shape frame as a compact 4-tuple, or - when no frame applies (sparsity 0 ⇒ every block kept ⇒ frame irrelevant). Trimmed sample on Ascend910B2:

========================================================================================================================
  DTYPE=torch.bfloat16  INPUT_LAYOUT='BNSD'  SABI_SORTED=True  TORCH_REFERENCE='npu_fusion_attention'
========================================================================================================================
  block_shape   H   B    s_q   s_kv    D  frame(L,R,T,B)  sparsity   Outputs_equal Ref_Latency_[usec] Our_Latency_[usec]
------------------------------------------------------------------------------------------------------------------------
      128x128   3   1 118806 118806  128               -      0.00             yes          160647.57          186419.60
      128x128   3   1 118806 118806  128     (29,1,29,1)      0.50             N/A                N/A          126072.34
      128x128   3   1 118806 118806  128     (29,1,29,1)      0.90             N/A                N/A           23824.46
      128x256   3   1 118806 118806  128               -      0.00             yes          162336.54          187044.41
      128x256   3   1 118806 118806  128     (15,1,29,1)      0.50             N/A                N/A           94882.15
      128x256   3   1 118806 118806  128     (15,1,29,1)      0.90             N/A                N/A           18682.85
      128x512   3   1 118806 118806  128               -      0.00             yes          163961.43          186603.97
      128x512   3   1 118806 118806  128      (8,1,29,1)      0.50             N/A                N/A           85956.16
      128x512   3   1 118806 118806  128      (8,1,29,1)      0.90             N/A                N/A           17384.59
========================================================================================================================

Rows are trimmed to a few sparsities per shape; the actual sweep emits one row per sparsity for each block_shape. Narrow BLOCK_SHAPES at the top of benchmark.py to benchmark a single granularity. At S=118806 D=128 BF16, 128×256 breaks even with the dense PFA reference at sparsity ≈ 0.05, 128×512 at ≈ 0.1; the historic 1.89× speedup at sparsity 0.5 still holds for 128×512 (and 128×256 is within ~10% of it while keeping a 2× finer sabi resolution). See benchmark/README.md for the full table and a per-sparsity PFA-speedup summary.
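The speedup figures quoted above can be recomputed from the trimmed sample. The sketch below copies the latencies from the table and assumes that speedup is defined as the dense reference latency of a block shape (its sparsity 0.00 row) divided by our latency at the given sparsity; `speedup` is an illustrative helper, not part of benchmark.py.

```python
# Our latency in usec at sparsity 0.50, copied from the trimmed sample table.
ours = {
    ("128x128", 0.5): 126072.34,
    ("128x256", 0.5): 94882.15,
    ("128x512", 0.5): 85956.16,
}
# Dense reference latency in usec (Ref_Latency of the sparsity 0.00 rows).
ref_dense = {"128x128": 160647.57, "128x256": 162336.54, "128x512": 163961.43}


def speedup(shape, sparsity):
    """Reference latency divided by our latency (assumed definition)."""
    return ref_dense[shape] / ours[(shape, sparsity)]


for shape in ref_dense:
    print(f"{shape}: {speedup(shape, 0.5):.2f}x at sparsity 0.50")
```

On these sample numbers 128×512 comes out at roughly 1.9× and 128×256 at roughly 1.7×, in line with the summary above.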

To invoke our block-sparse prompt flash attention kernel from Python, use the provided torch_bsa interface. The call follows torch_npu conventions:

import torch
import torch_bsa

# Sabi granularity. Both values must be in {128, 256, 512, 1024}; smaller
# values give finer per-block control at the cost of a larger sabi tensor.
# Default (when block_shape is omitted) is [128, 128].
BLOCK_SIZE_Q, BLOCK_SIZE_KV = 128, 128

# sabi: torch.uint16, shape [B, N, ceil(S/BLOCK_SIZE_Q), ceil(S/BLOCK_SIZE_KV)].
# Each row lists the kept KV-block column indices for that Q-block, padded on
# the right with 0xFFFF (the uint16 "skip" sentinel).
sabi = ...  # build from your sparsity pattern

# Returns a tuple (attention_out, softmax_lse).
# softmax_lse is a [B, N, S] float32 tensor when softmax_lse_flag=True,
# or an empty tensor ({0}-shaped) when softmax_lse_flag=False (default).
attention_out, softmax_lse = torch_bsa.blitz_sparse_attention(
    q, k, v,
    sabi=sabi,
    actual_seq_lengths=actseqlen,
    actual_seq_lengths_kv=actseqlenkv,
    num_heads=h,
    num_key_value_heads=h,
    input_layout='BNSD',
    scale_value=scale,
    sparse_mode=0,
    softmax_lse_flag=False,                   # set True to also return the log-sum-exp output
    block_shape=[BLOCK_SIZE_Q, BLOCK_SIZE_KV],
)
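The `sabi = ...` placeholder above can be filled from a block-level keep pattern. The following is a minimal pure-Python sketch of the layout described in the comments (kept KV-block indices per Q-block, right-padded with 0xFFFF). `build_sabi_rows` and the `keep` callback are hypothetical names, the rows cover a single (batch, head) pair, and converting the nested list into a torch.uint16 tensor of shape [B, N, ...] is left to the caller.

```python
import math

SKIP = 0xFFFF  # uint16 "skip" sentinel
VALID_BLOCK_SIZES = {128, 256, 512, 1024}


def build_sabi_rows(keep, seq_len, block_q, block_kv):
    """Return sabi rows for one (batch, head).

    keep(qi, kj) -> bool decides whether Q-block qi attends to KV-block kj.
    """
    if block_q not in VALID_BLOCK_SIZES or block_kv not in VALID_BLOCK_SIZES:
        raise ValueError("block sizes must be in {128, 256, 512, 1024}")
    n_q = math.ceil(seq_len / block_q)
    n_kv = math.ceil(seq_len / block_kv)
    rows = []
    for qi in range(n_q):
        kept = [kj for kj in range(n_kv) if keep(qi, kj)]  # ascending order
        rows.append(kept + [SKIP] * (n_kv - len(kept)))     # pad on the right
    return rows


# Block-diagonal band: each Q-block attends to itself and its direct neighbours.
rows = build_sabi_rows(lambda qi, kj: abs(qi - kj) <= 1,
                       seq_len=512, block_q=128, block_kv=128)
for r in rows:
    print(r)
```

With S=512 and 128×128 blocks this yields a 4×4 grid, the same sabi shape as in the C++ example.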

softmax_lse output

| Property | Value |
| --- | --- |
| Controlled by | softmax_lse_flag (bool attr, default False) |
| Output index | 1 (always returned; empty when flag is False) |
| Shape when enabled | [B, N, S] |
| Dtype | float32 (regardless of Q/K/V dtype) |
| Layout | Non-TND layouts only (BNSD, BSH, BSND); TND returns {0} |
| Semantics | Per-query log-sum-exp: log(Σ exp(q·kᵀ / √d)) over all attended KV tokens |

The LSE is computed during the same kernel pass as the attention output at no additional memory-bandwidth cost. It is useful for ring attention, speculative decoding rescaling, and any application that needs to merge partial attention results across segments.

When softmax_lse_flag=False the kernel skips the LSE write-out path and returns a zero-element placeholder tensor; the caller does not need to allocate memory for it.
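Merging partial attention results with the LSE can be sketched as follows. This is a plain-Python illustration of the standard log-sum-exp merge for one query token, not the kernel's code path; `attend` and `merge` are hypothetical helpers, and `attend` stands in for any call that returns an (output, lse) pair for one KV segment.

```python
import math


def attend(q, keys, values, scale):
    """Dense attention for one query; returns (out, lse)."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                                   # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    lse = m + math.log(denom)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) / denom
           for i in range(dim)]
    return out, lse


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results computed over disjoint KV segments."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * x + wb * y for x, y in zip(out_a, out_b)], lse


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scale = 1.0 / math.sqrt(len(q))

full_out, full_lse = attend(q, keys, values, scale)
out_a, lse_a = attend(q, keys[:2], values[:2], scale)
out_b, lse_b = attend(q, keys[2:], values[2:], scale)
merged_out, merged_lse = merge(out_a, lse_a, out_b, lse_b)
print(full_out, merged_out)
```

Each partial output is weighted by exp(lse_i − lse_merged), so the merged result matches attention over the concatenated KV segments up to floating-point rounding.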

Example run in C++:

A plain C++ example is provided in the examples subdirectory. Run it with:

bash build.sh --experimental --run_example blitz_sparse_attention eager cust --soc=ascend910b --vendor_name=custom
The output should be:

```shell
[2026-05-20 10:44:46] Warning: The current environment is configured for ascend910b, Please use Atlas A2 series hardware for optimal performance.
[2026-05-20 10:44:47]
[2026-05-20 10:44:47] Start to run example,name:blitz_sparse_attention mode:eager
[2026-05-20 10:44:47] Start compile and run example file: ../experimental/attention/blitz_sparse_attention/examples/test_aclnn_blitz_sparse_attention.cpp
[2026-05-20 10:44:47] pkg_mode:cust vendor_name:custom
[2026-05-20 10:44:51] Initializing ACL...
[2026-05-20 10:44:51] Initializing tensors...
[2026-05-20 10:44:51] Tensor shapes:
[2026-05-20 10:44:51] query: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] key: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] value: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] sabi: [1, 8, 4, 4] (B, N, Q_tiles, KV_tiles) uint16
[2026-05-20 10:44:51] out: [1, 8, 512, 128] (B, N, S, D) fp16
[2026-05-20 10:44:51] lse: [1, 8, 512] (B, N, S) float32
[2026-05-20 10:44:51] Executing BlitzSparseAttention...
[2026-05-20 10:44:51] Synchronizing stream...
[2026-05-20 10:44:51] Processing results...
[2026-05-20 10:44:51] Output results (first 10 values as raw fp16 hex):
[2026-05-20 10:44:51] output[0] = 0x3C00
[2026-05-20 10:44:51] output[1] = 0x3C00
[2026-05-20 10:44:51] output[2] = 0x3C00
[2026-05-20 10:44:51] output[3] = 0x3C00
[2026-05-20 10:44:51] output[4] = 0x3C00
[2026-05-20 10:44:51] output[5] = 0x3C00
[2026-05-20 10:44:51] output[6] = 0x3C00
[2026-05-20 10:44:51] output[7] = 0x3C00
[2026-05-20 10:44:51] output[8] = 0x3C00
[2026-05-20 10:44:51] output[9] = 0x3C00
[2026-05-20 10:44:51] LSE results (first 10 values, expect ~17.5520 for all-ones input):
[2026-05-20 10:44:51] lse[0] = 17.550825
[2026-05-20 10:44:51] lse[1] = 17.550825
[2026-05-20 10:44:51] lse[2] = 17.550825
[2026-05-20 10:44:51] lse[3] = 17.550825
[2026-05-20 10:44:51] lse[4] = 17.550825
[2026-05-20 10:44:51] lse[5] = 17.550825
[2026-05-20 10:44:51] lse[6] = 17.550825
[2026-05-20 10:44:51] lse[7] = 17.550825
[2026-05-20 10:44:51] lse[8] = 17.550825
[2026-05-20 10:44:51] lse[9] = 17.550825
[2026-05-20 10:44:51] Cleaning up resources...
[2026-05-20 10:44:51] Test completed successfully!
[2026-05-20 10:44:51] run test_aclnn_blitz_sparse_attention, execute samples success
[2026-05-20 10:44:51] Example completed successfully
```
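The values in this log can be sanity-checked by hand. The sketch below assumes what the log suggests but does not state outright: Q, K and V are all ones, the scale is 1/√D, and every one of the S=512 KV tokens is attended. Under those assumptions each score is D/√D, so the LSE is √D + ln(S), and averaging all-ones V rows gives 1.0, whose fp16 bit pattern is 0x3C00.

```python
import math
import struct

S, D = 512, 128

# q·k = D for all-ones vectors; scaling by 1/sqrt(D) gives sqrt(D) per score.
score = D / math.sqrt(D)
# log(sum over S identical terms exp(score)) = score + ln(S)
expected_lse = score + math.log(S)
print(f"expected lse = {expected_lse:.4f}")

# Decode the raw fp16 pattern 0x3C00 using the struct half-precision format.
one = struct.unpack("<e", (0x3C00).to_bytes(2, "little"))[0]
print(f"0x3C00 as fp16 = {one}")
```

The logged 17.550825 differs from the closed-form value by about 0.001, which is within what reduced-precision intermediates would plausibly produce.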

Kernel integration plan

If this block-sparse kernel is of interest, please consider merging it with the official attention/prompt_flash_attention. The source is based on attention/prompt_flash_attention at git commit a574b5d71faa7c360934a6c7d1b4aa85e1a49147.

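Product support

Product Supported
Atlas A3 training series products / Atlas A3 inference series products
Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components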
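Function description

  • Operator function: a FlashAttention operator for the full (prefill) inference scenario. It supports sparse optimization, actualSeqLengthsKv optimization, INT8 quantization, and a choice between high-precision and high-performance modes.

  • Formula:

    Self-attention builds an attention model from the relationships within the input sample itself. Suppose there is an input sequence $x$ of length $n$, where each element of $x$ is a $d$-dimensional vector that can be regarded as a token embedding. Passing this sequence through three weight matrices yields three matrices of dimension $n \times d$.

    The self-attention formula is generally defined as follows, where $Q$, $K$ and $V$ are the key attributes of the input sample, obtained from the input by spatial transformations so that they can be unified into one feature space. "Attention" in the formula and in the operator name is short for "self-attention".

    $$Attention(Q,K,V)=Score(Q,K)V$$

    This operator uses Softmax as the Score function, so the self-attention formula is:

    $$Attention(Q,K,V)=Softmax(\frac{QK^T}{\sqrt{d}})V$$

    Here the product of $Q$ and $K^T$ represents the attention over the input $x$. To keep this value from growing too large, it is usually divided by the square root of $d$, then softmax-normalized per row, and finally multiplied by $V$ to give an $n \times d$ matrix.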
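The formula can be written out as a dense reference in a few lines. This is a minimal pure-Python sketch for small inputs, intended only to make the scaling, the per-row softmax and the multiplication by V concrete; `attention` is an illustrative name and has no relation to the operator's implementation.

```python
import math


def attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d)) V for row-major nested lists."""
    d = len(Q[0])
    out = []
    for q in Q:
        # One row of Q K^T, scaled by 1/sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]     # per-row softmax
        out.append([sum(p * v[j] for p, v in zip(probs, V))
                    for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]    # identical keys: uniform weights
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
print(result)
```

Because all keys are identical here, every row of the softmax is uniform and each output row is the mean of the rows of V.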

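Parameter description

| Parameter | Input/Output | Description | Data type | Data format |
| --- | --- | --- | --- | --- |
| query | Input | Input Q in the formula. | FLOAT16, BFLOAT16, INT8 | ND |
| key | Input | Input K in the formula. | FLOAT16, BFLOAT16, INT8 | ND |
| value | Input | Input V in the formula. | FLOAT16, BFLOAT16, INT8 | ND |
| attentionOut | Output | Output of the formula. | FLOAT16, BFLOAT16, INT8 | ND |
| softmax_lse | Output | Log-sum-exp value for each query token: log(Σ exp(q·kᵀ/√d)). Used in scenarios that merge partial attention results, such as ring attention. An empty tensor (numel=0) is returned when softmax_lse_flag is False. | FLOAT32 | ND, shape [B, N, S] |
| softmax_lse_flag | Attribute (input) | Whether to output softmax_lse. Passing False (the default) is recommended when the LSE is not needed. | BOOL | - |

  • Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components: the supported data types are FLOAT16, BFLOAT16 and INT8.
  • Atlas inference series accelerator card products: only FLOAT16 is supported.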

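Constraints

  • When this interface is used together with PyTorch, the versions of the CANN packages and the PyTorch packages must match.

  • Handling of empty inputs: the operator checks whether query is empty and returns immediately if it is. If query is a non-empty tensor while key and value are empty tensors, attentionOut is filled with zeros. When attentionOut is an empty tensor, the AscendCLNN framework handles it. For the other inputs marked "nullptr allowed" in the parameter description above, a null pointer is not processed.

  • The query, key and value inputs are subject to the following functional limits:

    • Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components:

      • The B axis may be at most 65536 (64k). When the input types include INT8 and the D axis is not 32-aligned, or when the input type is FLOAT16 or BFLOAT16 and the D axis is not 16-aligned, the B axis is supported only up to 128.

      • The N axis may be at most 256.

      • S may be at most 20971520 (20M). In some long-sequence scenarios an excessive computation load can cause the bsa operator to time out (an aicore error, with errorStr: timeout or trap error); splitting along S is recommended in that case. Note that the computation load depends on B, S, N, D and so on: the larger the values, the larger the load. Typical long-sequence scenarios that time out (that is, where the product of B, S, N and D is large) include but are not limited to:

        | B | Q_N | Q_S | D | KV_N | KV_S |
        | --- | --- | --- | --- | --- | --- |
        | 1 | 20 | 2097152 | 256 | 1 | 2097152 |
        | 1 | 2 | 20971520 | 256 | 2 | 20971520 |
        | 20 | 1 | 2097152 | 256 | 1 | 2097152 |
        | 1 | 10 | 2097152 | 512 | 1 | 2097152 |

      • The D axis may be at most 512. When inputLayout is BSH or BSND, N*D must be less than 65535.

    • Atlas A2 training series products / Atlas 800I A2 inference products / A200I A2 Box heterogeneous components: combined limits on the query, key and value inputs in the TND scenario:

      • T is at most 65536.
      • N is one of 8/16/32/64/128, and Q_N, K_N and V_N are equal.
      • Q_D and K_D equal 192; V_D equals 128 or 192.
      • Only the BFLOAT16 data type is supported.
      • The sparse mode supports only sparse=0 with no mask passed, or sparse=3 with a mask passed.
      • When sparse=3, each batch must individually satisfy actualSeqLengths < actualSeqLengthsKv.
  • When inputLayout is BNSD_BSND, the input query has shape BNSD and the output attentionOut has shape BSND; in all other cases the shape of attentionOut must match the shape of the input query.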
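Some of the non-TND shape limits above can be checked before launching the operator. The sketch below is a hypothetical pre-flight helper (`check_bsa_shape` is not part of torch_bsa or the aclnn interface). It covers only the B, N, S, D limits listed for the Atlas A2 products and simplifies "input types include INT8" to a single `dtype` string.

```python
def check_bsa_shape(b, n, s, d, layout="BNSD", dtype="BFLOAT16"):
    """Return a list of violated limits (empty list means none found)."""
    problems = []
    # D alignment decides how large B may be: 32 for INT8, 16 for FP16/BF16.
    align = 32 if dtype == "INT8" else 16
    max_b = 65536 if d % align == 0 else 128
    if b > max_b:
        problems.append(f"B={b} exceeds {max_b}")
    if n > 256:
        problems.append(f"N={n} exceeds 256")
    if s > 20971520:
        problems.append(f"S={s} exceeds 20971520")
    if d > 512:
        problems.append(f"D={d} exceeds 512")
    if layout in ("BSH", "BSND") and n * d >= 65535:
        problems.append(f"N*D={n * d} must be less than 65535 for {layout}")
    return problems


# Shape used by the benchmark above: B=1, N=3, S=118806, D=128.
print(check_bsa_shape(1, 3, 118806, 128))
print(check_bsa_shape(1, 256, 1024, 256, layout="BSH"))
```

Passing these checks does not rule out the timeout cases listed in the table, which depend on the overall computation load rather than on any single axis.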

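Invocation

| Invocation method | Sample code | Description |
| --- | --- | --- |
| aclnn interface | test_aclnn_BlitzSparseAttention | Calls the BlitzSparseAttention operator through aclnnBlitzSparseAttention |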