File | Last commit | Last updated
[BugFix][310p] Fixing the aclgraph error caused by blocktable (#8948)

### What this PR does / why we need it?

This PR fixes an ACL Graph error on Ascend 310P devices by moving the block table's slot mapping computation to the CPU. On 310P, certain device-side arithmetic operations used in the default slot mapping computation are unsupported or cause errors during graph execution.

Key changes:

- Overrode BlockTable for 310P to use NumPy for slot mapping computation.
- Updated NPUModelRunner to perform this computation on the CPU early in the input preparation phase.
- Avoided unsupported device-side additions for positions and seq_lens on 310P by using CPU buffers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Verified on Ascend 310P hardware with vLLM v0.19.1.

- vLLM version: v0.19.1
- vLLM main: https://github.com/vllm-project/vllm/commit/d886c26d4d4fef7d079696beb4ece1cfb4b008a8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>

Last updated: 9 days ago
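As a rough illustration of the CPU-side slot mapping that #8948 describes, the sketch below uses the usual paged-KV-cache formula (physical block number times block size, plus the offset inside the block). The function name `compute_slot_mapping_cpu` and its arguments are hypothetical and are not taken from the vllm-ascend source; the real BlockTable override may differ.

```python
import numpy as np


def compute_slot_mapping_cpu(block_table_row, positions, block_size):
    """Map token positions of one request to KV-cache slots using NumPy only.

    Hypothetical helper for illustration; not the vllm-ascend implementation.
    block_table_row[i] is the physical block id of the request's i-th logical block.
    """
    block_table_row = np.asarray(block_table_row, dtype=np.int64)
    positions = np.asarray(positions, dtype=np.int64)

    # Logical block index and offset inside that block for every token.
    block_indices = positions // block_size
    block_offsets = positions % block_size

    # Physical slot = physical block id * block size + offset in block.
    return block_table_row[block_indices] * block_size + block_offsets


if __name__ == "__main__":
    # Two logical blocks mapped to physical blocks 7 and 3, block size 4.
    print(compute_slot_mapping_cpu([7, 3], [0, 1, 4, 5], 4))
```

Because every step is a host-side NumPy operation, nothing here relies on device arithmetic; the result would then be copied to the device as a ready-made buffer.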
[Misc] Fix doc test (#8277)

### What this PR does / why we need it?

This patch normalizes the doc tests between nightly tests and PR tests and updates them to the latest daily-built images (main/v0.18.0).

- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>

Last updated: 1 month ago
[CI] Refactor light test cases and update test coverage (#9059)

### What this PR does / why we need it?

Main changes:

- Adds dedicated light E2E test files for:
  - singlecard basic light coverage, including Qwen3 dense, embedding, VLM, and W8A8 Eagle3 cases
  - 2-card light coverage, including Qwen3 MoE TP2/EP and Qwen3-VL PP2 multimodal cases
  - 4-card light coverage, including DeepSeek W8A8 TP/PP/EP/EPLB and PD disaggregation cases
- Adds RemotePDServer and DisaggPDProxy helpers for ordinary PD disaggregation E2E tests.
  - Supports launching multiple vLLM serve processes in one test.
  - Assigns ASCEND_RT_VISIBLE_DEVICES based on each server's tp * dp requirement.
  - Launches the disaggregated prefill proxy and validates requests through the proxy endpoint.
- Updates E2E CI config to run the new light suites.
  - Replaces old light suite entries with the new test_light.py cases.
  - Adds a 4-card light CI job for the new 4-card light coverage.
- Increase patch_qwen3_vl_moe_pp_layer_range until the commit of the vllm code includes:
  - https://github.com/vllm-project/vllm/commit/cee6751e548357478a9943cae5786062b7b95127

| Feature | Qwen3-0.6B | Qwen3-8B-W8A8 | Qwen3-Embedding | Qwen3.5-0.8B | Qwen3-30B | Qwen3-VL-30B | DeepSeek-V3.2-W8A8-Pruning | DeepSeek-V3.2-W8A8-Pruning |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Card count | 1 | 1 | 1 | 1 | 2 | 2 | 4 | 4 |
| Dense | ✅ | ✅ |  |  |  |  |  |  |
| Moe |  |  |  |  | ✅ |  | ✅ | ✅ |
| Embedding |  |  | ✅ |  |  |  |  |  |
| Mamba/SSM |  |  |  | ✅ |  |  |  |  |
| Multimodal Reasoning |  |  |  | ✅ |  | ✅ |  |  |
| TP |  |  |  |  | ✅ |  | ✅ |  |
| PP |  |  |  |  |  | ✅ | ✅ |  |
| EP |  |  |  |  | ✅ |  | ✅ |  |
| EPLB |  |  |  |  | ✅ |  |  |  |
| Full Graph |  | ✅ |  | ✅ | ✅ | ✅ | ✅ | ✅ |
| PIECEWISE Graph | ✅ |  | ✅ |  |  |  |  |  |
| PD disaggregation |  |  |  |  |  |  |  | ✅ |
| W8A8 |  | ✅ |  |  |  |  | ✅ | ✅ |
| MTP |  |  |  | ✅ |  |  |  |  |
| Eagle-3 |  | ✅ |  |  |  |  |  |  |
| SFA/DSA |  |  |  |  |  |  |  | ✅ |

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The updated E2E-Light cases passed on A2/A3.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 11 days ago
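The per-server device assignment mentioned in #9059 (each server gets tp * dp devices through ASCEND_RT_VISIBLE_DEVICES) could look roughly like the sketch below. The function name `assign_visible_devices` and the contiguous, zero-based allocation order are assumptions made for illustration; the actual RemotePDServer helper may allocate devices differently.

```python
def assign_visible_devices(servers):
    """Return one ASCEND_RT_VISIBLE_DEVICES value per server.

    Hypothetical helper for illustration only.
    servers: list of (tp, dp) tuples; each server needs tp * dp devices.
    Devices are handed out contiguously starting from id 0 (an assumption).
    """
    assignments = []
    next_device = 0
    for tp, dp in servers:
        count = tp * dp
        device_ids = range(next_device, next_device + count)
        assignments.append(",".join(str(i) for i in device_ids))
        next_device += count
    return assignments


if __name__ == "__main__":
    # A prefill server with tp=2, dp=1 and a decode server with tp=1, dp=2.
    for value in assign_visible_devices([(2, 1), (1, 2)]):
        print(f"ASCEND_RT_VISIBLE_DEVICES={value}")
```

Each returned string would be placed in the environment of the corresponding vLLM serve process before launch, so the servers in one test do not compete for the same cards.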
[Doc] add Mixtral-8x7B-Instruct-v0.1 model docs and config (#8537)

### What this PR does / why we need it?

This PR improves the scheduler profiling behavior for Mixtral workloads by refining the chunk handling logic. Previously, the profiling process could lead to inaccurate scheduling results under certain conditions. This change ensures more stable and consistent behavior.

- vLLM version: v0.19.0
- vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176

---------

Signed-off-by: lihaofei-2026 <haofei@isrc.iscas.ac.cn>

Last updated: 9 days ago
[CI]Main2main 0515 (#9176)

### What this PR does / why we need it?

Upstream PR [vllm-project/vllm#39568](https://github.com/vllm-project/vllm/pull/39568) is a complete rewrite of the routed-experts capture/transport pipeline. It supersedes both:

- The original 0.20.2 design: RoutedExpertsCapturer.get_instance() singleton, save_captured_experts(indices=...), shared-memory + fcntl.flock cross-process transport.
- The intermediate PR #39917 design: module-level get_global_experts_capturer(), init_routed_experts_capturer_with_shared_cache(), issue_routing_d2h_copy(), extract_routed_experts_for_current_batch(). This API existed in main for only a few days and was never in a stable release; it has been **fully removed**.

After the upgrade to vLLM 0515, vllm-ascend faces two API surfaces that are incompatible at the source level:

| Aspect | 0.20.2 | main |
|---|---|---|
| Capturer access | RoutedExpertsCapturer.get_instance() (singleton) | runner.routed_experts_capturer (per-runner instance, no global) |
| Per-step clear_buffer | via singleton | via runner attribute |
| Per-step D2H + ship | capturer.save_captured_experts(indices=cpu_slot_mapping) (sync, shm write) | runner-managed pinned routed_experts_cpu D2H + RoutedExpertsLists on ModelRunnerOutput.routed_experts |
| Output channel | shm/flock to scheduler | ModelRunnerOutput.routed_experts: RoutedExpertsLists (NamedTuple, msgpack + zmq IPC) |
| slot_mapping source | slot_mapping.cpu().numpy() saved to self.cpu_slot_mapping | private device snapshot routed_experts_slot_mapping_device, then pinned routed_experts_slot_mapping_cpu |
| Layer hook injection | select_experts calls singleton from inside apply() | module.router.set_capture_fn(...) from _bind_routed_experts_capturer |

## Strategy Overview

1. **Keep the 0.20.2 path intact.** It already works end-to-end. All 0.20.2-specific call sites stay byte-identical.
2. **Add a parallel main path** gated by `vllm_version_is("0.20.2") == False`. Reuse upstream `GPUModelRunner.init_routed_experts_capturer()` (inherited) for buffer allocation; override only _bind_routed_experts_capturer because Ascend's select_experts does not go through upstream BaseRouter.
3. **Async scheduling: piggyback on upstream AsyncGPUModelRunnerOutput.** vllm-ascend already constructs that wrapper directly, so adding the routed_experts= kwarg is enough: the wrapper handles to_cpu_nonblocking() on its copy stream and tolists() finalization in get_output() for free.
4. **No new compat module, no monkey patches.** Branching is inline at each call site; total surface is one new method (_bind_routed_experts_capturer) plus three branched call sites in model_runner_v1.py and one in fused_moe.py.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/ce29c26b31d432b1b4bc028c46bb2c3b07a667d8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>

Last updated: 12 days ago
[CI] add weekly case (#9380)

### What this PR does / why we need it?

We run the weekly test cases on a fixed schedule. This PR adds a weekly case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the test.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

---------

Signed-off-by: chen-commits <1636718796@qq.com>
Signed-off-by: chen <1636718796@qq.com>

Last updated: 8 days ago
[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149)

### What this PR does / why we need it?

Add E2E tests for Prefix Caching cp & Chunked Prefill cp.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>

Last updated: 3 months ago
[BugFix] Update the package name from 'flash_attn_v3' to 'flash_attn_npu_v3' (#9303)

### What this PR does / why we need it?

This PR updates the package name from flash_attn_v3 to flash_attn_npu_v3 to align with the flash_attn_npu open-source repository. It also updates the documentation with the repository link and installation instructions.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

---------

Signed-off-by: wangx700 <wangxin700@huawei.com>

Last updated: 9 days ago
[Misc][Upgrade] Upgrade CANN to 9.0.0 and triton-ascend to 3.2.1 (#9085)

Upgrade CANN to 9.0.0 and triton-ascend to 3.2.1.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Last updated: 16 days ago
[CI] fix nightly MiniMax-M2.5-w8a8 (#9042)

### What this PR does / why we need it?

This PR fixes the nightly MiniMax-M2.5-w8a8 perf config, which needs to be tested daily.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the test.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------

Signed-off-by: weixin <murongfengerxch@163.com>

Last updated: 8 days ago
[Test] Clean up duplicate test for ascend scheduler (#1819)

There are some duplicate tests for the ascend scheduler. This PR removes them to make the tests clearer. After this PR, the singlecard e2e run time is reduced from 47 min to 46 min.

- vLLM version: v0.9.2
- vLLM main: https://github.com/vllm-project/vllm/commit/1eb2b9c10205b68658dede9dac73390706ef2e05

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Last updated: 10 months ago
Increase doctest timeout to 300s and time print (#3041)

### What this PR does / why we need it?

Increase the doctest timeout to 300s and add time printing. According to the time print in https://github.com/vllm-project/vllm-ascend/pull/3045, most of the time is consumed in graph capturing, so I think it's fine to increase the doctest timeout.

This PR also adds a time log for each task.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Run /vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
- CI passed

- vLLM version: v0.10.2
- vLLM main: https://github.com/vllm-project/vllm/commit/a684c0124cb8ac04984b6fd621d99e1463016eac

Closes: https://github.com/vllm-project/vllm-ascend/issues/3045

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

Last updated: 8 months ago
[Bugfix][CI] Optimize the cleanup mechanism of RemoteOpenAIServer (#9356)

### What this PR does / why we need it?

- Extract the existing RemoteEPDServer process-tree cleanup logic into a shared _terminate_process_tree() helper.
- Reuse the helper in both RemoteOpenAIServer and RemoteEPDServer.
- Return standard exit code 1 for failed suites instead of -1, avoiding shell-side 255 exit codes.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 8 days ago
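The exit-code change in #9356 follows from how POSIX shells report status: only the low 8 bits of the exit value are visible, so -1 shows up as 255. The small sketch below illustrates that arithmetic; `exit_status_seen_by_shell` is a hypothetical name used only for this example and is not part of the test suite.

```python
def exit_status_seen_by_shell(code: int) -> int:
    """Return the status a POSIX shell reports for a given exit value.

    Hypothetical helper for illustration: the shell only sees the low 8 bits
    of the exit value, so negative values wrap around.
    """
    return code & 0xFF


if __name__ == "__main__":
    for code in (0, 1, -1):
        print(f"exit({code}) is reported as {exit_status_seen_by_shell(code)}")
```

Returning 1 keeps the value the suite intends and the value the shell reports identical, which makes failed runs easier to recognize in CI logs.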
[Lint]Style: Convert test/ to ruff format(Batch #1) (#6738)

### What this PR does / why we need it?

**Scope of Changes**:

| File Path |
| :--- |
| tests/e2e/310p/multicard/test_vl_model_multicard.py |
| tests/e2e/310p/singlecard/test_vl_model_singlecard.py |
| tests/e2e/310p/test_utils.py |
| tests/e2e/conftest.py |
| tests/e2e/model_utils.py |
| tests/e2e/models/conftest.py |
| tests/e2e/models/test_lm_eval_correctness.py |
| tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py |
| tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py |
| tests/e2e/multicard/2-cards/test_data_parallel.py |
| tests/e2e/multicard/2-cards/test_disaggregated_encoder.py |
| tests/e2e/multicard/2-cards/test_expert_parallel.py |
| tests/e2e/multicard/2-cards/test_external_launcher.py |
| tests/e2e/multicard/2-cards/test_full_graph_mode.py |
| tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py |
| tests/e2e/multicard/2-cards/test_offline_inference_distributed.py |
| tests/e2e/multicard/2-cards/test_offline_weight_load.py |
| tests/e2e/multicard/2-cards/test_pipeline_parallel.py |
| tests/e2e/multicard/2-cards/test_prefix_caching.py |
| tests/e2e/multicard/2-cards/test_quantization.py |
| tests/e2e/multicard/2-cards/test_qwen3_moe.py |
| tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py |
| tests/e2e/multicard/2-cards/test_qwen3_performance.py |
| tests/e2e/multicard/2-cards/test_shared_expert_dp.py |
| tests/e2e/multicard/2-cards/test_single_request_aclgraph.py |
| tests/e2e/multicard/2-cards/test_sp_pass.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

Last updated: 2 months ago
[Doc] Fix documentation formatting and improve code examples (#8660)

### What this PR does / why we need it?

This PR fixes various documentation issues and improves code examples throughout the project.

- vLLM version: v0.19.0
- vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176

---------

Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 1 month ago
[CI]Style: Convert test/ to ruff format(Batch #2) (#6739)

### What this PR does / why we need it?

| File Path |
| :--- |
| tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py |
| tests/e2e/multicard/4-cards/long_sequence/test_basic.py |
| tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill_cp.py |
| tests/e2e/multicard/4-cards/long_sequence/test_mtp.py |
| tests/e2e/multicard/4-cards/long_sequence/test_prefix_caching_cp.py |
| tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py |
| tests/e2e/multicard/4-cards/test_data_parallel_tp2.py |
| tests/e2e/multicard/4-cards/test_kimi_k2.py |
| tests/e2e/multicard/4-cards/test_qwen3_next.py |
| tests/e2e/nightly/multi_node/scripts/multi_node_config.py |
| tests/e2e/nightly/multi_node/scripts/test_multi_node.py |
| tests/e2e/nightly/multi_node/scripts/utils.py |
| tests/e2e/singlecard/pooling/test_classification.py |
| tests/e2e/singlecard/pooling/test_embedding.py |
| tests/e2e/singlecard/pooling/test_scoring.py |
| tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py |
| tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py |
| tests/e2e/utils.py |
| tests/e2e/vllm_interface/singlecard/test_sampler.py |
| tests/e2e/weekly/single_node/models/test_qwen3_30b_acc.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 1 month ago