ops-transformer_8242/experimental/select_attention_operators/kernels · zhuzemao/ops-transformer_8242 - AtomGit

cann-robot fix select_attention_operators compatibility with aarch64

File	Last commit	Last updated
select_attn_ops	fix select_attention_operators compatibility with aarch64

Co-authored-by: kostyab <konstantin.berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge: !4752 merge select_attention_operators_requirements_fix into master
Created-by: kostyab
Commit-by: kostyab
Merged-by: cann-robot

## Description
Adapt `requirements.txt` and `setup.py` to the aarch64 architecture, verify that select_attention_operators (sparse pattern predictors based on Quest) runs on it, and fix minor README typos.

## Related Issue
https://gitcode.com/cann/ops-transformer/issues/2090

## Testing
Verified on both of the following machines (both with Ubuntu-22, CANN 8.3.RC1.alpha002):
- Host: x86, Device: Ascend910B4
- Host: aarch64, Device: Ascend910B2

Verification steps for the second setup (aarch64 + Ascend910B2):
```bash
cd experimental/select_attention_operators
source scripts/init_cann.sh Ascend910B2
bash scripts/build_kernels.sh
pytest -v experiments # all pass!
```

## Type labels
<!-- [x] means selected -->
- [x] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other (please describe):

See merge request: cann/ops-transformer!4752	21 days ago
__init__.py	select_attention_operators - "Quest"-based block-sparse mask predictor kernels for efficient LLM decoding

Co-authored-by: Konstantin Berestizshevsky <Konstantin.Berestizshevsky@huawei.com>
# message auto-generated for no-merge-commit merge: !472 merge select_attention_operators into master
Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

## Description
We propose a custom AscendC kernel implementation of the Quest predictor, [Tang et al. 2024](https://arxiv.org/abs/2406.10774).

Use case: during generative LLM decoding, our kernel `quest_block_select_paged` predicts the top-k important KV-cache blocks individually for each KV-head, allowing sparse LLM decoding (e.g. using only 8 blocks out of hundreds). We also provide an additional kernel, `quest_prefill_metadata`, that efficiently creates the metadata after prefill and can be used to maintain the metadata.

Benefits:
- standalone kernels: no dependencies
- VLLM-Ascend compatible: the kernels store and load the metadata in the same format in which VLLM-Ascend stores its KV-cache, so KV-blocks (pages) can be re-used for metadata maintenance
- PyTorch interface: ready to use immediately after a few seconds of installation
- high performance: both kernels run 20x to 300x faster than a plain Python implementation with default torch_npu backends, resulting in negligible performance overhead
- support for MHSA and GQA: we extended the original Quest predictor to accommodate multiple queries per KV-head
- bfloat16 and float16 support
- high-performance prefill metadata: 1.3 ms for batch size 20, 16k tokens, 8 KV-heads
- high-performance prediction: only 25 µs for float16 and 37 µs for bfloat16 (batch size 20, 16k tokens, top-8 prediction, 32 query heads, 8 KV-heads)

We provide tests, benchmark scripts, and a PyTorch interface for immediate usage.

Future release of sparse paged_attention: we intend to release the sparse version of the npu_paged_attention kernel in a separate pull request. It will be able to receive the top-k block ids selected by the current Quest predictor and perform a reduced computation, for a substantial 4x to 9x attention speedup (measured at batch size 20, sequence length 16000 tokens).

## Related Issue
[Issue #256](https://gitcode.com/cann/ops-transformer/issues/256)

## Testing
See README.md. Installation and testing instructions:
```shell
# create conda environment
cd experimental/select_attention_operators
conda create -n sa python=3.11.10 -y
conda activate sa
pip install -r requirements.txt
# build kernels
source scripts/init_cann.sh Ascend910B4 # change Ascend910B4 to your card model
bash scripts/build_kernels.sh
# run all tests
pytest -v experiments
```

## Documentation updates
All the documentation is in the README of the kernels.

## Type labels
<!-- [x] means selected -->
- [ ] Bug fix
- [x] New feature
- [ ] Performance optimization
- [ ] Documentation update
- [ ] Other (please describe):

See merge request: cann/ops-transformer!472	4 months ago
ascendc_extension.py	Fix copyright header

Co-authored-by: yayahello <zhaopenglei@hisilicon.com>
# message auto-generated for no-merge-commit merge: !3171 merge master into master
Created-by: yayahello
Commit-by: yayahello
Merged-by: cann-robot

## Description
The copyright notice was non-standard; it has been replaced with the standard header.

## Related Issue
Related Issue [#1100](https://gitcode.com/cann/ops-transformer/issues/1100)

## Testing

## Documentation updates

## Type labels
<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [x] ❓ Other (please describe):

See merge request: cann/ops-transformer!3171	2 months ago
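
The `__init__.py` commit message above describes two steps: building per-block metadata after prefill, and predicting the top-k KV-cache blocks per KV-head at decode time. The NumPy sketch below illustrates the idea from the Quest paper (per-block elementwise min/max of keys, then an upper bound on the query-key dot product per block). It is not the AscendC kernels' interface: the function names `build_block_metadata` and `select_top_k_blocks` are hypothetical, paging and block tables are omitted, and taking the maximum over the query heads that share a KV-head is an assumption, since the listing does not say how the GQA extension aggregates queries.

```python
import numpy as np


def build_block_metadata(keys, block_size):
    """Per-block elementwise min and max of the keys (hypothetical helper).

    keys: (num_kv_heads, seq_len, head_dim); seq_len is assumed to be a
    multiple of block_size to keep the sketch short.
    Returns (k_min, k_max), each of shape (num_kv_heads, num_blocks, head_dim).
    """
    num_heads, seq_len, head_dim = keys.shape
    assert seq_len % block_size == 0, "sketch assumes full blocks only"
    blocks = keys.reshape(num_heads, seq_len // block_size, block_size, head_dim)
    return blocks.min(axis=2), blocks.max(axis=2)


def select_top_k_blocks(queries, k_min, k_max, top_k):
    """Pick the top_k blocks per KV-head by an upper bound on q.k.

    queries: (num_kv_heads, group_size, head_dim), where group_size is the
    number of query heads sharing one KV-head (1 for plain multi-head attention).
    Returns block ids of shape (num_kv_heads, top_k), highest score first.
    """
    q = queries[:, :, None, :]  # (heads, group, 1, dim)
    # For each channel, q_i * k_i over a block is at most
    # max(q_i * min_i, q_i * max_i); summing channels bounds the dot product.
    bound = np.maximum(q * k_min[:, None], q * k_max[:, None]).sum(axis=-1)
    # Assumption: aggregate the query heads of a group with a max.
    scores = bound.max(axis=1)  # (heads, num_blocks)
    order = np.argsort(-scores, axis=-1, kind="stable")
    return order[:, :top_k]


# Usage: 2 KV-heads, 4 query heads per KV-head, 64 tokens in blocks of 16.
rng = np.random.default_rng(0)
demo_keys = rng.standard_normal((2, 64, 8)).astype(np.float32)
demo_queries = rng.standard_normal((2, 4, 8)).astype(np.float32)
demo_min, demo_max = build_block_metadata(demo_keys, block_size=16)
demo_ids = select_top_k_blocks(demo_queries, demo_min, demo_max, top_k=2)
print(demo_ids.shape)
```

Because the score is an upper bound on every token's attention logit inside a block, a block that contains a strongly matching key cannot be ranked below a block whose bound is lower, which is what makes the selection safe to use as a sparsity predictor.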