File | Last commit | Last updated
fix select_attention_operators compatibility with aarch64

Co-authored-by: kostyab <konstantin.berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge:
!4752 merge select_attention_operators_requirements_fix into master

Created-by: kostyab
Commit-by: kostyab
Merged-by: cann-robot

Description:

## Description

Adapt requirements.txt and setup.py, and verify that select_attention_operators (sparse pattern predictors based on Quest) runs on the aarch64 architecture. Also fixes minor README typos.

## Related issue

https://gitcode.com/cann/ops-transformer/issues/2090

## Testing

**Verified on both of the following machines (both with Ubuntu-22, CANN 8.3.RC1.alpha002):**

- Host: x86, Device: Ascend910B4
- Host: aarch64, Device: Ascend910B2

**Verification steps for the second setup (aarch64 + Ascend910B2):**

```bash
cd experimental/select_attention_operators
source scripts/init_cann.sh Ascend910B2
bash scripts/build_kernels.sh
pytest -v experiments # all pass!
```

## Type labels

<!-- [x] means selected -->
- [x] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [ ] ❓ Other, please describe:

See merge request: cann/ops-transformer!4752

21 days ago
select_attention_operators - "Quest"-based block-sparse mask predictor kernels for efficient LLM decoding

Co-authored-by: Konstantin Berestizshevsky <Konstantin.Berestizshevsky@huawei.com>

# message auto-generated for no-merge-commit merge:
!472 merge select_attention_operators into master

Created-by: kostyab
Commit-by: Konstantin Berestizshevsky
Merged-by: cann-robot

Description:

## Description

We propose a custom AscendC kernel implementation of the Quest predictor, [Tang et al. 2024](https://arxiv.org/abs/2406.10774).

**Use case**: During generative LLM decoding, our kernel quest_block_select_paged predicts the top-k most important KV-cache blocks individually for each KV-head, which enables sparse LLM decoding (e.g. using only 8 blocks out of hundreds). We also provide an additional kernel, quest_prefill_metadata, that efficiently creates the metadata after prefill and can also be used to maintain the metadata.

**Benefits:**

- Standalone kernels: no dependencies.
- vLLM-Ascend compatible: the kernels store and load the metadata in the same format in which vLLM-Ascend stores its KV-cache, so KV-blocks (pages) can be reused for metadata maintenance.
- PyTorch interface: ready to use immediately after a few seconds of installation.
- High performance: both kernels run 20x to 300x faster than a plain Python implementation with the default torch_npu backends, so the predictor adds negligible overhead.
- Support for MHSA and GQA: we extended the original Quest predictor to accommodate multiple queries per KV-head.
- bfloat16 and float16 support.
- High-performance prefill metadata: 1.3 ms for batch size 20, 16k tokens, 8 KV-heads.
- High-performance prediction: only 25 µs for float16 and 37 µs for bfloat16 (batch size 20, 16k tokens, top-8 prediction, 32 query heads, 8 KV-heads).

**We provide tests, benchmark scripts, and a PyTorch interface for immediate usage.**

**Future release of sparse paged_attention**

We intend to release a sparse version of the npu_paged_attention kernel in a separate pull request. It will accept the top-k block ids selected by the current Quest predictor and perform a reduced computation, giving a substantial 4x to 9x attention speedup (measured at batch size 20, sequence length 16000 tokens).

## Related issue

[Issue #256](https://gitcode.com/cann/ops-transformer/issues/256)

## Testing

See README.md for details. The installation and testing instructions are:

```shell
# create conda environment
cd experimental/select_attention_operators
conda create -n sa python=3.11.10 -y
conda activate sa
pip install -r requirements.txt

# build kernels
source scripts/init_cann.sh Ascend910B4 # change Ascend910B4 to your card model
bash scripts/build_kernels.sh

# run all tests
pytest -v experiments
```

## Documentation updates

All the documentation is in the README of the kernels.

## Type labels

<!-- [x] means selected -->
- [ ] Bug fix
- [x] New feature
- [ ] Performance optimization
- [ ] Documentation update
- [ ] Other, please describe:

See merge request: cann/ops-transformer!472

4 months ago
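To make the block-selection idea in the select_attention_operators description concrete, here is a minimal pure-Python sketch of Quest-style scoring as described by Tang et al. 2024: each KV block keeps the channel-wise minimum and maximum of its keys, and a query ranks blocks by an upper bound on its dot product with any key in the block. All function names and the list-based data layout are hypothetical and chosen for illustration only. This is not the AscendC kernels or their PyTorch interface, and it covers neither the paged metadata format nor the GQA extension mentioned above.

```python
# Illustrative sketch only: names and data layout are hypothetical,
# not the API of quest_prefill_metadata or quest_block_select_paged.
from typing import List, Tuple

Vector = List[float]
BlockMeta = Tuple[Vector, Vector]  # (channel-wise min, channel-wise max)


def build_block_metadata(keys: List[Vector], block_size: int) -> List[BlockMeta]:
    """For each block of keys, store the per-channel minimum and maximum."""
    metadata = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mins = [min(channel) for channel in zip(*block)]
        maxs = [max(channel) for channel in zip(*block)]
        metadata.append((mins, maxs))
    return metadata


def block_upper_bound(query: Vector, mins: Vector, maxs: Vector) -> float:
    """Upper bound on dot(query, k) for any k with mins <= k <= maxs per channel."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_top_k_blocks(query: Vector, metadata: List[BlockMeta], k: int) -> List[int]:
    """Return the ids of the k blocks with the largest upper-bound score."""
    scores = [block_upper_bound(query, lo, hi) for lo, hi in metadata]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # ascending block ids, for readability


# Toy data for one KV-head: 6 keys with 2 channels, grouped into blocks of 2.
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 3.0], [4.0, -1.0], [3.0, 0.0]]
metadata = build_block_metadata(keys, block_size=2)
selected = select_top_k_blocks([1.0, 0.0], metadata, k=2)
```

Attention would then be computed only over the keys and values in the selected blocks. Because the score is an upper bound, a block containing a high-scoring key is never ranked below that key's true score, which is the property that makes min/max metadata a usable proxy.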
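Update copyright header

Co-authored-by: yayahello <zhaopenglei@hisilicon.com>

# message auto-generated for no-merge-commit merge:
!3171 merge master into master

Created-by: yayahello
Commit-by: yayahello
Merged-by: cann-robot

Description:

## Description

<!-- Describe your change in detail here, including the reason for it and the approach taken. -->
The copyright notice was non-standard; it has been changed to the standard header.

## Related issue

<!-- If this PR resolves a specific issue, provide the issue link here. Example: Related Issue #000 -->
<!-- If this PR resolves a specific problem ticket, give the ticket number here. -->
Related Issue [#1100](https://gitcode.com/cann/ops-transformer/issues/1100)

## Testing

<!-- Describe the tests run to verify your change, including but not limited to level-2 smoke tests and operator generalization tests. -->

## Documentation updates

<!-- If this PR includes documentation updates, note them here. Example: updated README.md. -->

## Type labels

<!-- [x] means selected -->
- [ ] 🐛 Bug fix
- [ ] ✨ New feature
- [ ] ⚡ Performance optimization
- [ ] ♻️ Refactoring
- [ ] 🧪 Tests
- [ ] 📦 Build/CI
- [ ] 🔧 Configuration change
- [ ] 📝 Documentation update
- [ ] ⬆️ Dependency upgrade
- [ ] 🔒 Security fix
- [ ] 🧹 Code cleanup
- [x] ❓ Other, please describe:

See merge request: cann/ops-transformer!3171

2 months ago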