vllm_ascend/docs/source/user_guide/feature_guide · yilunh/vllm_ascend - AtomGit

GitHub: [Doc][Misc] Add KV Cache CPU Offload feature guide (#6961)

File	Last commit	Last updated
images	[Feature][Doc] Add AI QoS module, tuning tool, and user guide (#8706) ### What this PR does / why we need it? This PR adds AI QoS support for operator-facing tuning on Ascend: a Python tool to apply/undo and print UB switch–style configuration, unit tests, and an English user guide with platform and software constraints. - `csrc/ai_qos`: Exposes `set_qos` / `get_qos`, `set_bw` / `get_bw`, and fuse/global config helpers via pybind11; integrated into the build (CMake / setup.py as applicable in this tree). - `tools/ai_qos.py`: `apply` to snapshot baseline and program QoS state; `unset` to restore and remove state; supports auto/manual traffic priorities and prints command for UB switch configuration. - `tests/ut/test_ai_qos_tool.py`: Mocks `torch.npu` and `vllm_ascend.ai_qos`; covers device list, first-apply baseline reuse, and unset/restore. - Docs (`docs/source/user_guide/feature_guide/AI QoS Introduction_en.md`): Background, Auto/Manual usage, how to disable; Usage constraints including: - AIV H2D / AIV D2D host QoS: not effective with the current driver stack; delivery planned via module upgrade after driver support lands. - Software: Ascend HDK 26.0.0+, LingQu-based UB switch version as listed in the doc table. ### Does this PR introduce _any_ user-facing change? Yes. Operators get a new optional pre-inference step (`python tools/ai_qos.py `/` unset`) and a published English guide with version and constraint information. ### How was this patch tested? - `pytest -sv tests/ut/test_ai_qos_tool.py` (or full `pytest -sv tests/ut` as required by the project) - vLLM version: v0.19.0 - vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176 --------- Signed-off-by: gtl <gaotianlong6@h-partners.com> Co-authored-by: gtl <gaotianlong6@h-partners.com>	22 days ago
Ai_QoS_introduction_en.md	[Feature][Doc] Add AI QoS module, tuning tool, and user guide (#8706) ### What this PR does / why we need it? This PR adds AI QoS support for operator-facing tuning on Ascend: a Python tool to apply/undo and print UB switch–style configuration, unit tests, and an English user guide with platform and software constraints. - `csrc/ai_qos`: Exposes `set_qos` / `get_qos`, `set_bw` / `get_bw`, and fuse/global config helpers via pybind11; integrated into the build (CMake / setup.py as applicable in this tree). - `tools/ai_qos.py`: `apply` to snapshot baseline and program QoS state; `unset` to restore and remove state; supports auto/manual traffic priorities and prints command for UB switch configuration. - `tests/ut/test_ai_qos_tool.py`: Mocks `torch.npu` and `vllm_ascend.ai_qos`; covers device list, first-apply baseline reuse, and unset/restore. - Docs (`docs/source/user_guide/feature_guide/AI QoS Introduction_en.md`): Background, Auto/Manual usage, how to disable; Usage constraints including: - AIV H2D / AIV D2D host QoS: not effective with the current driver stack; delivery planned via module upgrade after driver support lands. - Software: Ascend HDK 26.0.0+, LingQu-based UB switch version as listed in the doc table. ### Does this PR introduce _any_ user-facing change? Yes. Operators get a new optional pre-inference step (`python tools/ai_qos.py `/` unset`) and a published English guide with version and constraint information. ### How was this patch tested? - `pytest -sv tests/ut/test_ai_qos_tool.py` (or full `pytest -sv tests/ut` as required by the project) - vLLM version: v0.19.0 - vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176 --------- Signed-off-by: gtl <gaotianlong6@h-partners.com> Co-authored-by: gtl <gaotianlong6@h-partners.com>	22 days ago
Fine_grained_TP.md	[Doc] Sensitive word modification (#8298) ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
Multi_Token_Prediction.md	[Spec Decode]clean up spec decode interface (#6947) This pull request refactors the speculative decoding proposer interface to align with upstream vLLM, removing the local `Proposer` interface and renaming methods to `propose`. This is the first step. In the future, we should remove the class register and add only a few Ascend-specific methods once the architecture in vLLM is ready. - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2 months ago
batch_invariance.md	[Doc] Sensitive word modification (#8298) ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
context_parallel.md	[Doc] Sensitive word modification (#8298) ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
cpu_binding.md	[BugFix][Doc] Avoid A2 CPU binding overlap from hidden NPUs and doc updates (#8792) ### What this PR does / why we need it? This PR fixes A2 CPU binding pool construction when a worker process only sees part of the logical NPU topology but its cpuset overlaps CPUs affiliated with non-visible NPUs. It also updates the CPU binding community docs following the v0.18.0 release. - CPU Binding Logic Improvement: Updated the CPU binding planner to consider all logical NPUs, including non-visible ones, when calculating CPU distribution to prevent potential overlaps in partial-visibility A2 worker environments. - Binding Pool Filtering: Ensured that the final CPU binding pool is strictly limited to visible/running NPUs, while using non-visible NPUs only as a reference to avoid conflicting assignments. - Test Coverage: Added new unit tests to verify that non-running NPUs are correctly skipped during pool construction while still respecting cpuset overlaps. This PR is built over https://github.com/vllm-project/vllm-ascend/pull/8645 while fixing some critical logic defects. Fixes issue #8600. Co-authored with @Rozwel-dx. ### Does this PR introduce _any_ user-facing change? No public API change. For Ascend A2 deployments that use CPU binding with partial NPU visibility, CPU assignment can change to avoid overlap with CPUs associated with non-visible logical NPUs. The final assignment remains limited to the visible/running NPUs for the worker. ### How was this patch tested? E2E test on A2 --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Signed-off-by: chenchuw886 <chenchuwei@huawei.com> Co-authored-by: Rozwel-dx <Rozwel-dx@users.noreply.github.com>	30 days ago
dynamic_batch.md	[Doc] Optimize documentation to prevent misleading information. (#7813) ### What this PR does / why we need it? Optimize some documentation to prevent misleading information. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029 Signed-off-by: wangli <wangli858794774@gmail.com>	1 month ago
dynamic_chunk_pipeline_parallel.md	[Doc] add dynamic chunked pipeline parallel guide (#8728) ### What this PR does / why we need it? This PR adds a comprehensive guide for the Dynamic Chunked Pipeline Parallel (CPP) feature in vLLM-Ascend. It includes an overview of the strategy, technical details on the quadratic latency model, and configuration instructions. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? Documentation changes were verified by reviewing the rendered markdown content. - vLLM version: v0.19.0 - vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176 --------- Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: wangyu <wy02300127@antgroup.com> Co-authored-by: wangyu <wy02300127@antgroup.com>	1 month ago
epd_disaggregation.md	[CI][Docs] Add scheduled Sphinx link check for docs (#8273) ### What this PR does / why we need it? This PR adds a scheduled documentation link check workflow to catch broken links and unexpected redirects in the Sphinx docs on a regular basis. Main changes: - add a GitHub Actions workflow to run `make -C docs linkcheck` weekly and on manual trigger - configure Sphinx `linkcheck` options in `docs/source/conf.py` to reduce flaky CI failures - document how to run the docs link check locally and where to find the generated reports - fix link in docs This helps us detect documentation link issues earlier and makes docs maintenance more proactive and repeatable. In docs/source/community/contributors.md, delete: ``` \| 17 \| [@dependabot[bot]](https://github.com/dependabot[bot]) \| 2025/02/27 \| [a5564ed](https://github.com/vllm-project/vllm-ascend/commit/a5564ed5d8fd9818936a22d9ea35951a27513b4c) \| \| 149 \| [@invalid-email-address](https://github.com/invalid-email-address) \| 2025/09/14 \| [c9da5de](https://github.com/vllm-project/vllm-ascend/commit/c9da5dea5c271187c0119848ede9c0518a0c41b2) \| \| 207 \| [@Copilot](https://github.com/Copilot) \| 2025/11/11 \| [24bca67](https://github.com/vllm-project/vllm-ascend/commit/24bca674412b56418c94bda7d659105315505a8e) \| \| 292 \| [@nomewang](https://github.com/nomewang) \| 2026/01/12 \| [348cdf9](https://github.com/vllm-project/vllm-ascend/commit/348cdf98aad7ae9b399bf8481fcf2bb3baa6a636) \| ``` ### Does this PR introduce _any_ user-facing change? No user-facing runtime change. This only affects documentation validation and contributor workflow. ### How was this patch tested? - Verified the workflow definition and Sphinx linkcheck-related configuration changes - Added local usage instructions in: - `docs/source/developer_guide/contribution/testing.md` If needed, the check can be run locally with: ```bash make -C docs linkcheck SPHINXOPTS="-W --keep-going" ``` - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: MrZ20 <2609716663@qq.com>	1 month ago
eplb_swift_balancer.md	[EPLB][Feature] EPLB Support w4a8 (#6263) ### What this PR does / why we need it? [EPLB][Feature] EPLB Support w4a8. Depends on #8411. ### Does this PR introduce _any_ user-facing change? GMM does not support inputs whose weight is a tensor list. Therefore, this feature must be used together with `export VLLM_ASCEND_ENABLE_FUSED_MC2=1`. ### How was this patch tested? Tested on dsv4, qwen3.5, and kimi2.5 - vLLM version: v0.14.1 - vLLM main: https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	10 days ago
external_dp.md	[Doc][Misc] Correcting the document and uploading the model deployment template (#8241) ### What this PR does / why we need it? Correcting the document and uploading the model deployment template ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
flash_attention.md	[BugFix] Update the package name from 'flash_attn_v3' to 'flash_attn_npu_v3' (#9303) ### What this PR does / why we need it? This PR updates the package name from `flash_attn_v3` to `flash_attn_npu_v3` to align with the `flash_attn_npu` open-source repository. It also updates the documentation with the repository link and installation instructions. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3 --------- Signed-off-by: wangx700 <wangxin700@huawei.com>	10 days ago
graph_mode.md	[Doc][Misc] Add static kernel compilation section to graph mode guide (#8882) ### What this PR does / why we need it? The graph mode documentation previously included `enable_static_kernel: True` in the Npugraph_ex explicit configuration example, which could mislead users into thinking static kernel is a recommended default setting. In reality, static kernel is disabled by default (`enable_static_kernel: bool = False`) and has significant first-run compilation overhead. This PR: - Separates static kernel documentation into its own dedicated subsection - Removes `enable_static_kernel` from the Npugraph_ex basic configuration example - Adds a clear note about compilation time cost (several minutes to tens of minutes depending on operator count) - Recommends accounting for this overhead in the warmup phase - Documents how to verify static kernel is active via Ascend Profiling (`op_statistic.csv`) - Documents the visible Python warning emitted during compilation ### Does this PR introduce _any_ user-facing change? Documentation only. No code changes. ### How was this patch tested? - Verified accuracy of claims against torchair source code (`static_kernel.py`) - Confirmed `warnings.warn()` is used (visible by default) for the compilation start message - Confirmed `enable_static_kernel` defaults to `False` in `vllm_ascend/ascend_config.py` - vLLM version: v0.19.1 - vLLM main: https://github.com/vllm-project/vllm/commit/d886c26d4d4fef7d079696beb4ece1cfb4b008a8 Signed-off-by: wangjiacheng <wangjiacheng13@huawei.com> Co-authored-by: wangjiacheng <wangjiacheng13@huawei.com>	23 days ago
index.md	[Doc][Misc] Add KV Cache CPU Offload feature guide (#6961) ## What this PR does / why we need it? Adds a feature guide document for KV Cache CPU Offload on Ascend NPU. This addresses #6791, in which a user asked how to use the KV cache CPU offloading feature. While a detailed answer was posted as a comment on the issue, this PR adds proper documentation to the feature guide so that all users can discover this information. The new document covers: - Overview of the feature and its benefits - Configuration parameters (`KVTransferConfig` with `OffloadingConnector` and `NPUOffloadingSpec`) - Python API usage example - Online serving CLI example - How the feature works internally (async D2H/H2D transfers, LRU eviction) - Optional KV cache events configuration ## Does this PR introduce _any_ user-facing change? Yes: adds a new documentation page at `docs/source/user_guide/feature_guide/kv_cache_cpu_offload.md`. ## How was this patch tested? - Documentation-only change - Verified the toctree entry in the feature guide index - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 Signed-off-by: NJX-njx <3771829673@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	9 days ago
kv_cache_cpu_offload.md	[Doc][Misc] Add KV Cache CPU Offload feature guide (#6961) ## What this PR does / why we need it? Adds a feature guide document for KV Cache CPU Offload on Ascend NPU. This addresses #6791, in which a user asked how to use the KV cache CPU offloading feature. While a detailed answer was posted as a comment on the issue, this PR adds proper documentation to the feature guide so that all users can discover this information. The new document covers: - Overview of the feature and its benefits - Configuration parameters (`KVTransferConfig` with `OffloadingConnector` and `NPUOffloadingSpec`) - Python API usage example - Online serving CLI example - How the feature works internally (async D2H/H2D transfers, LRU eviction) - Optional KV cache events configuration ## Does this PR introduce _any_ user-facing change? Yes: adds a new documentation page at `docs/source/user_guide/feature_guide/kv_cache_cpu_offload.md`. ## How was this patch tested? - Documentation-only change - Verified the toctree entry in the feature guide index - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 Signed-off-by: NJX-njx <3771829673@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	9 days ago
kv_pool.md	[Ops][BugFix] Update Yuanrong backend handling for KV Pool (#9203) ### What this PR does / why we need it? This PR updates the Yuanrong backend for KV Pool to improve reliability in distributed deployments and make Yuanrong setup guidance clearer. - Use `get_world_group().local_rank` for Yuanrong NPU device selection instead of `parallel_config.rank`, matching the Mooncake backend behavior. - Split large Yuanrong `exist`, `get`, and `put` requests into bounded batches before calling Datasystem APIs. - Aggregate failed-key logging in the Yuanrong `get` path to avoid excessive logs for large batches. - Update the Yuanrong KV Pool documentation with recommended Datasystem worker parameters, Remote H2D requirements, HugeTLB checks, and source-build guidance when the prebuilt package does not match the local CANN or Ascend driver version. ### Does this PR introduce _any_ user-facing change? No API or configuration compatibility change. The Yuanrong backend behavior is fixed internally, and the documentation now provides clearer deployment guidance. The Yuanrong `load_kvc took` info log is no longer emitted. ### How was this patch tested? - Ran syntax validation for the touched Python backend file: `python -m py_compile vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/yuanrong_backend.py` - Ran diff whitespace validation: `git diff --check` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/ce29c26b31d432b1b4bc028c46bb2c3b07a667d8 --------- Signed-off-by: yangsonglin13 <yangsonglin566@gmail.com>	12 days ago
large_scale_ep.md	[Doc][Misc] Correcting the document and uploading the model deployment template (#8241) ### What this PR does / why we need it? Correcting the document and uploading the model deployment template ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
layer_sharding.md	[Doc][Misc] Improve readability and fix typos in documentation (#8266) ### What this PR does / why we need it? This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: sunshine202600 <sunshine202600@163.com>	1 month ago
lmcache_ascend_deployment.md	[Misc][Upgrade] Upgrade CANN to 9.0.0 and triton-ascend to 3.2.1 (#9085) Upgrade CANN to 9.0.0 and triton-ascend to 3.2.1 - vLLM version: v0.20.1 - vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	17 days ago
lora.md	[Doc][Misc] Update translations and documentation links (#8942) ### What this PR does / why we need it This PR updates several documentation files and their Chinese translations. Key changes include: - Correcting broken links in `Hunyuan-A13B-Instruct.md` and `kv_pool.md`. - Updating terminology in Chinese translations (e.g., "功能指南" to "特性指南"). - Fixing formatting issues in `.po` files. - Updating issue references in the support matrix. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? Manual verification of updated links and terminology. - vLLM version: v0.19.1 - vLLM main: https://github.com/vllm-project/vllm/commit/d886c26d4d4fef7d079696beb4ece1cfb4b008a8 --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	21 days ago
netloader.md	[Doc] Optimize documentation to prevent misleading information. (#7813) ### What this PR does / why we need it? Optimize some documentation to prevent misleading information. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029 Signed-off-by: wangli <wangli858794774@gmail.com>	1 month ago
quantization.md	[CI][Docs] Add scheduled Sphinx link check for docs (#8273) ### What this PR does / why we need it? This PR adds a scheduled documentation link check workflow to catch broken links and unexpected redirects in the Sphinx docs on a regular basis. Main changes: - add a GitHub Actions workflow to run `make -C docs linkcheck` weekly and on manual trigger - configure Sphinx `linkcheck` options in `docs/source/conf.py` to reduce flaky CI failures - document how to run the docs link check locally and where to find the generated reports - fix link in docs This helps us detect documentation link issues earlier and makes docs maintenance more proactive and repeatable. In docs/source/community/contributors.md, delete: ``` \| 17 \| [@dependabot[bot]](https://github.com/dependabot[bot]) \| 2025/02/27 \| [a5564ed](https://github.com/vllm-project/vllm-ascend/commit/a5564ed5d8fd9818936a22d9ea35951a27513b4c) \| \| 149 \| [@invalid-email-address](https://github.com/invalid-email-address) \| 2025/09/14 \| [c9da5de](https://github.com/vllm-project/vllm-ascend/commit/c9da5dea5c271187c0119848ede9c0518a0c41b2) \| \| 207 \| [@Copilot](https://github.com/Copilot) \| 2025/11/11 \| [24bca67](https://github.com/vllm-project/vllm-ascend/commit/24bca674412b56418c94bda7d659105315505a8e) \| \| 292 \| [@nomewang](https://github.com/nomewang) \| 2026/01/12 \| [348cdf9](https://github.com/vllm-project/vllm-ascend/commit/348cdf98aad7ae9b399bf8481fcf2bb3baa6a636) \| ``` ### Does this PR introduce _any_ user-facing change? No user-facing runtime change. This only affects documentation validation and contributor workflow. ### How was this patch tested? - Verified the workflow definition and Sphinx linkcheck-related configuration changes - Added local usage instructions in: - `docs/source/developer_guide/contribution/testing.md` If needed, the check can be run locally with: ```bash make -C docs linkcheck SPHINXOPTS="-W --keep-going" ``` - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: MrZ20 <2609716663@qq.com>	1 month ago
rfork.md	[Doc] Optimize documentation to prevent misleading information. (#7813) ### What this PR does / why we need it? Optimize some documentation to prevent misleading information. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029 Signed-off-by: wangli <wangli858794774@gmail.com>	1 month ago
sequence_parallelism.md	[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8056) ### What this PR does / why we need it? This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow. ### Does this PR introduce _any_ user-facing change? No, this PR contains documentation-only updates. ### How was this patch tested? The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced. - vLLM version: v0.18.0 - vLLM main: https://github.com/vllm-project/vllm/commit/29e48707e8144b78dd5d756f793c26a405043f3d --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	1 month ago
sleep_mode.md	[Doc][Misc] Improve readability and fix typos in documentation (#8266) ### What this PR does / why we need it? This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: sunshine202600 <sunshine202600@163.com>	1 month ago
speculative_decoding.md	[SpecDecode][Feature] Implement AscendExtractHiddenStatesProposer for speculative decoding (#8799) ### What this PR does / why we need it? This PR introduces the `AscendExtractHiddenStatesProposer` to support the `extract_hidden_states` speculative decoding method on Ascend NPUs. It adapts the base proposer to use ACL graphs and implements Ascend-specific logic for preparing next token IDs. Additionally, the model runner is updated to support KV cache allocation and reshaping for `cache_only_layers`. ### Does this PR introduce _any_ user-facing change? Yes, users can now use the `extract_hidden_states` speculative decoding method on Ascend hardware. ### How was this patch tested? The changes were verified with new E2E tests in `tests/e2e/singlecard/spec_decode/test_extract_hidden_states.py` and unit tests in `tests/ut/spec_decode/test_extract_hidden_states_proposer.py`. - vLLM main: vllm-project/vllm@6f786f2 --------- Signed-off-by: Lin-Qingyang-Alec <895744968@qq.com>	23 days ago
structured_output.md	[Doc] Update structured output doc with upstream link (#4015) ### What this PR does / why we need it? Currently, the usage of the structured output feature in vllm-ascend is identical to that in vllm. Thus, IMO, it's better to remove this doc directly, to avoid cases where the upstream doc changes and ours is not updated in time, which can mislead users. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	5 months ago
ucm_deployment.md	[Doc][Feature] PD Disaggregation with UCM and Mooncake (#8338) ### What this PR does / why we need it? Refactors UCM deployment documentation with improved structure and content: - Adds new sections: "Why Use UCM" and "How UCM Works" explaining UCM architecture, capabilities, and design principles - Reorganizes PD Disaggregation into Centralized PD and Distributed PD (P2P) scenarios with updated examples using PipelineStore - Adds PD-Mixed Inference section with configuration and testing guidance - Adds Large-Scale Expert Parallelism PD Disaggregation example (DP4TP8 Prefill + DP8TP4 Decode) with benchmark results - Updates all configurations from deprecated UcmNfsStore to recommended PipelineStore with YAML config file approach - Improves documentation clarity and formal writing style ### Does this PR introduce _any_ user-facing change? Yes. Users should: - Use PipelineStore instead of deprecated UcmNfsStore - Provide UCM configuration via YAML file (UCM_CONFIG_FILE) instead of inline JSON parameters - Refer to the new document structure for deployment guidance Signed-off-by: sumingZero <469434916@qq.com> Co-authored-by: sumingZero <469434916@qq.com>	1 month ago
weight_prefetch.md	[Doc][Misc] Improve readability and fix typos in documentation (#8266) ### What this PR does / why we need it? This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: sunshine202600 <sunshine202600@163.com>	1 month ago