Select Attention Operators

High-performance Ascend-910B kernels that predict sparse attention patterns for efficient LLM decoding. The kernels are launched through the Python interface provided in this repository; see the experiments directory for kernel usage examples.

Repo structure

.
|-- experiments - per kernel: test (functional correctness) and benchmark (time and bandwidth)
|   |-- 2_quest_prefill_metadata - constructs the metadata after prefill
|   |-- 3_quest_block_select_paged - Quest sparse mask predictor using the metadata
|   |-- 4_quest_block_select_paged_w - Quest sparse mask predictor using the metadata, with extra sink+window features
|-- kernels - Python packages, each with one or more AscendC kernels and a single torch interface
|   |-- select_attn_ops - predictor kernels (Quest predictors of the sparse pattern during LLM decoding)
`-- scripts
    |-- build_kernels.sh - builds all kernels
    `-- init_cann.sh - initialize the environment and Ascend device version

Requirements

Tested to work with:

Ascend910B2, Ascend910B4
CANN versions 8.0.RC3.beta1, 8.2.RC2, 8.3.RC1
Python 3.11.10
torch-npu version 2.4.0, 2.5.1.post1
See requirements.txt for all other requirements.

Creating conda environment

Create and activate the conda environment, then install the Python requirements:

conda create -n sa python=3.11.10 -y
conda activate sa
pip install -r requirements.txt

Running (in conda environment)

Activate the conda and CANN environments, compile the operators, and build their Python API (as Python packages):

source scripts/init_cann.sh Ascend910B4 # change Ascend910B4 to your card model
bash scripts/build_kernels.sh

Run all tests found in the experiments subdirectory:

pytest -v experiments

Usage

The current best practice (proven in vllm-ascend) is to call the quest_prefill_metadata() kernel to create the metadata after prefill and again every 128 decoded tokens to update it, and to call quest_block_select_paged_in_out_w() to predict the important KV block indices from the query vector of the token being decoded. For detailed usage examples, refer to the experiments directory.
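
As a rough illustration of the schedule described above, the following pure-Python sketch tracks when the metadata would be refreshed and which value might be passed as tokens_since_metadata_update. The helper names (covered_tokens, decode_schedule) are hypothetical and not part of select_attn_ops, and resetting the counter to 0 on each refresh is an assumption; the tests in the experiments directory remain the authoritative reference.

```python
BLOCK_SIZE = 128       # kernel limitation stated in the built-in help
UPDATE_INTERVAL = 128  # "every 128 decoded tokens" from the usage note


def covered_tokens(seq_len_at_update: int) -> int:
    """Largest multiple of BLOCK_SIZE that is <= the sequence length.

    The built-in help states that a metadata update only covers such a
    multiple; this helper is a hypothetical illustration of that rule.
    """
    return (seq_len_at_update // BLOCK_SIZE) * BLOCK_SIZE


def decode_schedule(num_decode_steps: int):
    """Yield (step, refresh_metadata, tokens_since_metadata_update).

    Assumption: the counter restarts at 0 whenever the metadata is refreshed.
    """
    tokens_since_update = 0
    for step in range(num_decode_steps):
        refresh = tokens_since_update >= UPDATE_INTERVAL
        if refresh:
            # The caller would rerun quest_prefill_metadata() here.
            tokens_since_update = 0
        yield step, refresh, tokens_since_update
        tokens_since_update += 1


if __name__ == "__main__":
    for step, refresh, since in decode_schedule(260):
        if refresh:
            print(f"step {step}: refresh metadata")
```

In a real decode loop the third value of each tuple would be the candidate for the tokens_since_metadata_update argument of quest_block_select_paged_in_out_w().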

The kernels ship with built-in documentation:

import torch_npu
from select_attn_ops import quest_block_select_paged_in_out_w
help(quest_block_select_paged_in_out_w)

Prints:

Help on built-in function quest_block_select_paged_in_out_w in module select_attn_ops:

quest_block_select_paged_in_out_w(...) method of builtins.PyCapsule instance
    quest_block_select_paged_in_out_w(query: torch.Tensor, maxblocks: torch.Tensor, minblocks: torch.Tensor, metadata_block_tables: torch.Tensor, seq_lens: torch.Tensor, tokens_since_metadata_update: int, selected_indices: torch.Tensor) -> None
    
    
    Alternative interface to the `quest_block_select_paged` kernel which predicts 
    the sparsity mask during decoding in the form of top-k important kv-block 
    indices for every KV-head in every request. The returned KV block ids 
    are not the indices in the KV-cache, but rather from their enumeration 
    from 0 to number of blocks in the sequence length being decoded.
    
    FEATURE 1) WITH PREALLOCATED OUTPUT TENSOR (selected_indices)
    
    FEATURE 2) "w" 2 stands for "window" i.e. the kernel decides whether to add local 
    window blocks ids to the selected indices based on the number of tokens 
    since the last update and based on the sequence length
    
    Args:
        query (torch.Tensor): Query vector of shape [B, H, D] (fp16 or bf16)
        maxblocks (torch.Tensor): Quest metadata with maximum vectors of 
                                every key-cache block of shape 
                                [num_meta_blocks, BLOCK_SIZE, N, D] (fp16 or bf16)
                                important: zeroes must be in place of metadata of non-existing kv blocks
        minblocks (torch.Tensor): Quest metadata with minimum vectors of 
                                every key-cache block of shape 
                                [num_meta_blocks, BLOCK_SIZE, N, D] (fp16 or bf16)
                                important: zeroes must be in place of metadata of non-existing kv blocks
        metadata_block_tables (torch.Tensor): Metadata block tables of 
                                            shape [B, MMBPR] (int32)
        seq_lens (torch.Tensor): Sequence length of each request in the batch
                            of shape [B] (int32)
        tokens_since_metadata_update (int) - number of tokens that were decoded 
                            since the last metadata update (note metadata update is 
                            done only on the multiple of BLOCK_SIZE tokens which is 
                            lower or equal to the sequence length at the moment of update)
                            set to -1 to disable selection of KV blocks for which the 
                            metadata doesn't exist.
        selected_indices (torch.Tensor): Selected indices vector of shape [B, N, k] (int32): 
                                Number of highest indices to return for every KV head
    
    Returns:
        <fills out the selected_indices tensor>
        
    
    Limitations: due to kernel's internal buffer design on 910B:
        D = 128
        BLOCK_SIZE = 128
        H / N <= BLOCK_SIZE
        MMBPR <= 6
        k % 8 == 0
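
Because these limits are easy to violate by accident, it may help to check tensor shapes before launching the kernel. The sketch below is a hypothetical helper (check_quest_limits is not part of select_attn_ops) that takes plain shape tuples laid out as in the Args section above and reports which documented limits are violated. The example values are illustrative only.

```python
def check_quest_limits(query_shape, blocks_shape, block_tables_shape, selected_shape):
    """Return a list of violated limits (empty if all documented limits hold).

    Expected layouts, as documented in the built-in help:
      query_shape         [B, H, D]
      blocks_shape        [num_meta_blocks, BLOCK_SIZE, N, D]  (maxblocks / minblocks)
      block_tables_shape  [B, MMBPR]
      selected_shape      [B, N, k]
    """
    _, h, d = query_shape
    _, block_size, n, d_meta = blocks_shape
    _, mmbpr = block_tables_shape
    _, _, k = selected_shape

    problems = []
    if d != 128 or d_meta != 128:
        problems.append(f"D must be 128, got query D={d}, metadata D={d_meta}")
    if block_size != 128:
        problems.append(f"BLOCK_SIZE must be 128, got {block_size}")
    if h / n > block_size:
        problems.append(f"H / N must be <= BLOCK_SIZE, got {h}/{n}")
    if mmbpr > 6:
        problems.append(f"MMBPR must be <= 6, got {mmbpr}")
    if k % 8 != 0:
        problems.append(f"k must be a multiple of 8, got {k}")
    return problems


if __name__ == "__main__":
    # Illustrative shapes: batch 20, 32 query heads, 8 KV heads, top-8 blocks.
    print(check_quest_limits((20, 32, 128), (4, 128, 8, 128), (20, 6), (20, 8, 8)))
```

With real tensors the same check could be fed from tuple(tensor.shape) for each argument.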

Development Workflow for a new kernel "OP"

Add the new kernel implementation in the kernels/ directory in one of two ways:
1. Under an existing Python package, e.g. kernels/select_attn_ops/: add your kernel code as a new OP.cpp, add a compilation line to compile.sh, and add a torch interface in torch_interface.cpp.
2. As a new Python package, kernels/OP/, containing the OP.cpp kernel implementation, torch_interface.cpp, compile.sh and build.sh.
Then create a dedicated experiment directory experiments/5_OP and implement the following programs in it:
- ref_OP.py - start by implementing a reference Python model for correctness.
- gen_data_OP.py - a function that produces a set of input tensors for your kernel.
- test_OP.py - a smoke test (a single run first) that validates correctness on one focused input, then extended to automated pytest runs across a wide range of input shapes and data types.
- benchmark_OP.py - measures performance (time, bandwidth).
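
To give a feel for what a ref_OP.py reference model might look like, here is a minimal pure-Python sketch of Quest-style block selection for a single query vector and a single KV head. It follows the upper-bound score from the Quest paper (per channel, take the larger of q*min and q*max, then sum over channels). It is not the reference model shipped in the experiments directory, it ignores paging, GQA and the window feature, and all function names are hypothetical.

```python
def block_minmax(keys, block_size):
    """Channel-wise min and max of the key vectors in each block.

    keys: list of key vectors (each a list of D floats), in token order.
    Returns (mins, maxs), each a list with one D-length vector per block.
    """
    mins, maxs = [], []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mins.append([min(channel) for channel in zip(*block)])
        maxs.append([max(channel) for channel in zip(*block)])
    return mins, maxs


def quest_scores(query, mins, maxs):
    """Upper bound of q.k over each block: sum_d max(q_d * min_d, q_d * max_d)."""
    return [
        sum(max(q * lo, q * hi) for q, lo, hi in zip(query, block_min, block_max))
        for block_min, block_max in zip(mins, maxs)
    ]


def top_k_blocks(scores, k):
    """Indices of the k highest-scoring blocks, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    keys = [[1, 0], [2, -1], [-3, 4], [0, 5], [0.5, 0.5], [0.5, 0.5]]
    mins, maxs = block_minmax(keys, block_size=2)
    scores = quest_scores([1, -1], mins, maxs)
    print(top_k_blocks(scores, k=2))  # -> [0, 2]
```

A test_OP.py could then compare the kernel output against such a model on small inputs before moving on to wider shape and data-type sweeps.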