File | Last commit | Last updated
[Doc] Update Model Deployment Tutorial Template (#9331)

### What this PR does / why we need it?

This PR updates the Model-Deployment-Tutorial-Template.md to improve clarity and provide better guidance for users. Key changes include:

- Renaming "Environment Preparation" to "Prerequisites".
- Adding instructions and examples for linking to the Public FAQ for common issues.
- Reordering columns in the optimization techniques table for better readability.
- Clarifying requirements for the FAQ section.

### Does this PR introduce _any_ user-facing change?

No, this is a documentation template update.

### How was this patch tested?

The changes were reviewed for consistency and formatting.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

9 days ago
[Feat][SP] Support SP for VL MoE models (#7044)

### What this PR does / why we need it?

Second PR for https://github.com/vllm-project/vllm-ascend/issues/5712: extend SP to VL MoE models.

### Does this PR introduce _any_ user-facing change?

Removes sp_threshold from the additional config and reuses sp_min_token_num from vLLM.

### How was this patch tested?

- Model: Qwen3-VL-30B-A3B
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms), main | Mean TTFT (ms), this PR |
|------------|----------------------|-------------------------|
| 4k         | 429.40               | 323.3                   |
| 16k        | 1297.01              | 911.74                  |

- vLLM version: v0.16.0
- vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>

2 months ago
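The relative improvement implied by the TTFT figures reported above can be worked out directly. The following is a minimal sketch that uses only the four numbers from the PR description; the helper name `ttft_reduction_pct` is made up for illustration and is not part of vllm-ascend.

```python
# Mean TTFT (ms) as reported in the PR description: (main, this PR).
ttft_ms = {"4k": (429.40, 323.3), "16k": (1297.01, 911.74)}


def ttft_reduction_pct(baseline_ms: float, new_ms: float) -> float:
    """Relative reduction of new_ms versus baseline_ms, in percent."""
    return (baseline_ms - new_ms) / baseline_ms * 100.0


reductions = {seq: ttft_reduction_pct(*pair) for seq, pair in ttft_ms.items()}

for seq, pct in reductions.items():
    print(f"{seq}: {pct:.1f}% lower mean TTFT")
```

With these inputs the reduction is roughly a quarter at 4k and a little under 30% at 16k, so the benefit grows with sequence length in this particular test setup.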
[Doc][Feature] Add issue-workflow-guidelines.md (#8968)

### What this PR does / why we need it?

This guideline improves onboarding for new contributors and reduces ambiguity for maintainers when triaging issues.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The content was checked locally, and maintainers can review it via the GitHub preview. The result of the readthedocs CI workflow also needs to be checked.

- vLLM version: v0.18.0
- vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029

---------

Signed-off-by: Tian-Fantasea <tt553093031@gmail.com>
Signed-off-by: Tian-Fantasea <Tian-Fantasea@noreply.gitcode.com>
Signed-off-by: Tian <tt553093031@gmail.com>
Co-authored-by: Tian-Fantasea <Tian-Fantasea@noreply.gitcode.com>

8 days ago
[CI] Remove quantization e2e test case (#9160)

### What this PR does / why we need it?

**1. Remove quantization e2e test case**

To reduce the e2e running time, the e2e test cases related to quantization are deleted. The CPU UT and NPU UT of the quantization module are used to maintain the quantization feature.

**2. Add a llm-compressor quantization nightly case**

Added a nightly test case that verifies the accuracy of weights in the llm-compressor format.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------

Signed-off-by: wangkunpeng <1289706727@qq.com>

14 days ago
[Doc] Translated Doc files 2026-05-19 (#9308)

## Auto-Translation Summary

Translated **3** file(s):

- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/flash_attention.po</code>

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/26075935427)

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

Signed-off-by: wangxiyuan <wangxiyuan@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan@users.noreply.github.com>

10 days ago
[Doc] Add sphinx build for vllm-ascend (#55)

### What this PR does / why we need it?

This patch enables the doc build for vllm-ascend:

- Add sphinx build for vllm-ascend
- Enable readthedocs for vllm-ascend
- Fix CI:
  - Exclude vllm-empty/tests/mistral_tool_use to skip `You need to agree to share your contact information to access this model`, which was introduced in https://github.com/vllm-project/vllm/commit/314cfade02b28d50349c4df1a7ea0bbdaef589f1
  - Install test requirements to fix https://github.com/vllm-project/vllm-ascend/actions/runs/13304112758/job/37151690770:
    ```
    vllm-empty/tests/mistral_tool_use/conftest.py:4: in <module>
        import pytest_asyncio
    E   ModuleNotFoundError: No module named 'pytest_asyncio'
    ```
  - Exclude docs PR

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

1. Tested locally:
   ```bash
   # Install dependencies.
   pip install -r requirements-docs.txt

   # Build the docs and preview
   make clean; make html; python -m http.server -d build/html/
   ```
   Launch a browser and open http://localhost:8000/.
2. CI passed with preview: https://vllm-ascend--55.org.readthedocs.build/en/55/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

1 year ago
[Doc][310p] Add the 310p guide (#8640)

### What this PR does / why we need it?

Add a detailed 310 deployment tutorial.

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

NA

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>

8 days ago
[Doc][Misc] Add KV Cache CPU Offload feature guide (#6961)

## What this PR does / why we need it?

Adds a feature guide document for **KV Cache CPU Offload** on Ascend NPU.

This addresses #6791, in which a user asked how to use the KV cache CPU offloading feature. A detailed answer was posted as a comment on the issue; this PR adds proper documentation to the feature guide so that all users can discover this information.

The new document covers:

- Overview of the feature and its benefits
- Configuration parameters (KVTransferConfig with OffloadingConnector and NPUOffloadingSpec)
- Python API usage example
- Online serving CLI example
- How the feature works internally (async D2H/H2D transfers, LRU eviction)
- Optional KV cache events configuration

## Does this PR introduce _any_ user-facing change?

Yes: adds a new documentation page at docs/source/user_guide/feature_guide/kv_cache_cpu_offload.md.

## How was this patch tested?

- Documentation-only change
- Verified the toctree entry in the feature guide index
- vLLM version: v0.16.0
- vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: NJX-njx <3771829673@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

9 days ago
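The LRU eviction mentioned in the commit above can be illustrated with a toy host-side block pool. This is a generic sketch of the eviction policy only, not the vllm-ascend implementation: the class `LRUBlockPool`, its methods, and the block keys are hypothetical names chosen for illustration, and real offloading would move tensors between device and host rather than store bytes.

```python
from collections import OrderedDict


class LRUBlockPool:
    """Toy CPU-side pool that keeps at most `capacity` blocks (hypothetical)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._blocks = OrderedDict()  # block key -> payload, oldest first

    def put(self, key: str, payload: bytes) -> list:
        """Store a block and return the keys evicted to make room."""
        if key in self._blocks:
            self._blocks.move_to_end(key)  # refresh recency on overwrite
        self._blocks[key] = payload
        evicted = []
        while len(self._blocks) > self.capacity:
            old_key, _ = self._blocks.popitem(last=False)  # drop least recent
            evicted.append(old_key)
        return evicted

    def get(self, key: str):
        """Return the payload and mark the block as most recently used."""
        if key not in self._blocks:
            return None
        self._blocks.move_to_end(key)
        return self._blocks[key]


pool = LRUBlockPool(capacity=2)
pool.put("a", b"block-a")
pool.put("b", b"block-b")
pool.get("a")                         # "a" becomes most recently used
evicted = pool.put("c", b"block-c")   # pool is full, so "b" is evicted
print(evicted)
```

Reading a block counts as a use, so a block that is hit often stays in the pool while untouched blocks are dropped first.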
[CI] Main2main 0515 (#9176)

### What this PR does / why we need it?

Upstream PR [vllm-project/vllm#39568](https://github.com/vllm-project/vllm/pull/39568) is a complete rewrite of the routed-experts capture/transport pipeline. It supersedes both:

- The original 0.20.2 design: RoutedExpertsCapturer.get_instance() singleton, save_captured_experts(indices=...), shared-memory + fcntl.flock cross-process transport.
- The intermediate PR #39917 design: module-level get_global_experts_capturer(), init_routed_experts_capturer_with_shared_cache(), issue_routing_d2h_copy(), extract_routed_experts_for_current_batch(). This API existed in main for only a few days and was never in a stable release; it has been **fully removed**.

After the upgrade to vLLM 0515, vllm-ascend faces two API surfaces that are incompatible at the source level:

| Aspect | 0.20.2 | main |
|---|---|---|
| Capturer access | RoutedExpertsCapturer.get_instance() (singleton) | runner.routed_experts_capturer (per-runner instance, no global) |
| Per-step clear_buffer | via singleton | via runner attribute |
| Per-step D2H + ship | capturer.save_captured_experts(indices=cpu_slot_mapping) (sync, shm write) | runner-managed pinned routed_experts_cpu D2H + RoutedExpertsLists on ModelRunnerOutput.routed_experts |
| Output channel | shm/flock to scheduler | ModelRunnerOutput.routed_experts: RoutedExpertsLists (NamedTuple, msgpack + zmq IPC) |
| slot_mapping source | slot_mapping.cpu().numpy() saved to self.cpu_slot_mapping | private device snapshot routed_experts_slot_mapping_device, then pinned routed_experts_slot_mapping_cpu |
| Layer hook injection | select_experts calls singleton from inside apply() | module.router.set_capture_fn(...) from _bind_routed_experts_capturer |

## Strategy Overview

1. **Keep the 0.20.2 path intact.** It already works end-to-end. All 0.20.2-specific call sites stay byte-identical.
2. **Add a parallel main path** gated by `vllm_version_is("0.20.2") == False`. Reuse upstream `GPUModelRunner.init_routed_experts_capturer()` (inherited) for buffer allocation; override only _bind_routed_experts_capturer because Ascend's select_experts does not go through upstream BaseRouter.
3. **Async scheduling: piggyback on upstream AsyncGPUModelRunnerOutput.** vllm-ascend already constructs that wrapper directly, so adding the routed_experts= kwarg is enough: the wrapper handles to_cpu_nonblocking() on its copy stream and tolists() finalization in get_output() for free.
4. **No new compat module, no monkey patches.** Branching is inline at each call site; total surface is one new method (_bind_routed_experts_capturer) plus three branched call sites in model_runner_v1.py and one in fused_moe.py.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/ce29c26b31d432b1b4bc028c46bb2c3b07a667d8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>

12 days ago
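The inline, version-gated branching described in the strategy can be sketched as follows. This is a self-contained illustration under stated assumptions, not the code from the PR: `CapturerStub`, `LegacySingleton`, and `RunnerStub` are hypothetical stand-ins, and the `vllm_version_is` method here is a local stub that only mimics the idea of the helper named in the commit message.

```python
class CapturerStub:
    """Hypothetical stand-in for a routed-experts capturer."""

    def __init__(self):
        self.clear_calls = 0

    def clear_buffer(self):
        self.clear_calls += 1


class LegacySingleton(CapturerStub):
    """Mimics the 0.20.2-style process-wide singleton access (assumption)."""

    _instance = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


class RunnerStub:
    """Hypothetical model runner with one branched call site."""

    def __init__(self, installed_version: str):
        self.installed_version = installed_version
        # On the "main" path the capturer lives on the runner, not globally.
        self.routed_experts_capturer = CapturerStub()

    def vllm_version_is(self, target: str) -> bool:
        # Local stub: the real helper inspects the installed vLLM package.
        return self.installed_version == target

    def clear_routed_experts_buffer(self) -> str:
        if self.vllm_version_is("0.20.2"):
            # Legacy path is left untouched: go through the singleton.
            LegacySingleton.get_instance().clear_buffer()
            return "singleton"
        # Parallel path: use the per-runner attribute.
        self.routed_experts_capturer.clear_buffer()
        return "runner attribute"


legacy_runner = RunnerStub("0.20.2")
main_runner = RunnerStub("main")
print(legacy_runner.clear_routed_experts_buffer())
print(main_runner.clear_routed_experts_buffer())
```

Keeping the branch at the call site means the legacy code path is not rewritten at all, at the cost of repeating the version check wherever the two APIs differ.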
[Misc] Drop custom so for CANN8.5.1 (#9184)

### What this PR does / why we need it?

This PR removes the custom FIA (Flash Infer Attention) operator replacement scripts and their associated documentation. These scripts were previously used to provide optimized operators for CANN 8.5.1. With the optimization now integrated or no longer required as a manual step, these tools and the corresponding FAQ/notice sections are being dropped to simplify the codebase and documentation.

### Does this PR introduce _any_ user-facing change?

Yes, users will no longer see the instructions or FAQ entries regarding the manual installation of custom FIA operators for CANN 8.5.1. The scripts tools/install_flash_infer_attention_score_ops_a2.sh and tools/install_flash_infer_attention_score_ops_a3.sh are removed.

### How was this patch tested?

Documentation changes were verified for consistency. The removal of scripts is a cleanup task following the deprecation of the manual replacement process.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/ce29c26b31d432b1b4bc028c46bb2c3b07a667d8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

14 days ago
[Doc][Feature] Add issue-workflow-guidelines.md (#8968)

### What this PR does / why we need it?

This guideline improves onboarding for new contributors and reduces ambiguity for maintainers when triaging issues.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The content was checked locally, and maintainers can review it via the GitHub preview. The result of the readthedocs CI workflow also needs to be checked.

- vLLM version: v0.18.0
- vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029

---------

Signed-off-by: Tian-Fantasea <tt553093031@gmail.com>
Signed-off-by: Tian-Fantasea <Tian-Fantasea@noreply.gitcode.com>
Signed-off-by: Tian <tt553093031@gmail.com>
Co-authored-by: Tian-Fantasea <Tian-Fantasea@noreply.gitcode.com>

8 days ago
[CI] Solve the problems of slow download speed and UV (#9304)

### What this PR does / why we need it?

1. Replace the triton-ascend source.
2. Add uv.

### Does this PR introduce _any_ user-facing change?

Speeds up PR execution.

### How was this patch tested?

Checked the installation time of vllm-ascend in the PR.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

10 days ago
[Doc] link updates and formatting fixes (#7734)

### What this PR does / why we need it?

Correct the link error.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

- vLLM version: v0.18.0
- vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>

1 month ago
[Doc] optimize doc presentation (#9091)

### What this PR does / why we need it?

Optimize doc presentation:

1. Update the HDK version according to the CANN version.
2. Remove the Tsinghua mirror source and update pip before using pip.
3. Update the log shown for offline inference.
4. Install modelscope before using VLLM_USE_MODEL_SCOPE.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>

17 days ago