File | Last commit | Last updated
[Misc] Upgrade torch-npu to 2.10.0 (#9128)

### What this PR does / why we need it?
Upgrade torch-npu to 2.10.0.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

Last updated: 15 days ago

[CI] Main2main 0514 (#9155)

### What this PR does / why we need it?
1. Fix https://github.com/vllm-project/vllm/issues/33322: override gpu_modelrunner.sync_and_gather_intermediate_tensors so that, in the pp+sp+tp scenario, scattering the residual is skipped on Ascend.
2. https://github.com/vllm-project/vllm/issues/35520: adapt to the interface-level changes that ModelRunner v2 made for hybrid attention. TODO: add Mamba support to the ModelRunner on Ascend; pull requests are welcome.
3. https://github.com/vllm-project/vllm/issues/40711
4. https://github.com/vllm-project/vllm/pull/42121
5. https://github.com/vllm-project/vllm/pull/41706
6. https://github.com/vllm-project/vllm/issues/39917: disable async_schedule when enable_return_routed_experts=True.
7. https://github.com/vllm-project/vllm/pull/41046
8. https://github.com/vllm-project/vllm/pull/41055
9. https://github.com/vllm-project/vllm/pull/41035
10. https://github.com/vllm-project/vllm/pull/42434

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------
Signed-off-by: wangli <wangli858794774@gmail.com>

Last updated: 15 days ago

[CI]Style: Convert test/ to ruff format(Batch #2) (#6739)

### What this PR does / why we need it?

| File Path |
| :--- |
| tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py |
| tests/e2e/multicard/4-cards/long_sequence/test_basic.py |
| tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill_cp.py |
| tests/e2e/multicard/4-cards/long_sequence/test_mtp.py |
| tests/e2e/multicard/4-cards/long_sequence/test_prefix_caching_cp.py |
| tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py |
| tests/e2e/multicard/4-cards/test_data_parallel_tp2.py |
| tests/e2e/multicard/4-cards/test_kimi_k2.py |
| tests/e2e/multicard/4-cards/test_qwen3_next.py |
| tests/e2e/nightly/multi_node/scripts/multi_node_config.py |
| tests/e2e/nightly/multi_node/scripts/test_multi_node.py |
| tests/e2e/nightly/multi_node/scripts/utils.py |
| tests/e2e/singlecard/pooling/test_classification.py |
| tests/e2e/singlecard/pooling/test_embedding.py |
| tests/e2e/singlecard/pooling/test_scoring.py |
| tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py |
| tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py |
| tests/e2e/utils.py |
| tests/e2e/vllm_interface/singlecard/test_sampler.py |
| tests/e2e/weekly/single_node/models/test_qwen3_30b_acc.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------
Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 1 month ago

[Feature] ngram_npu implementation compatible with async scheduler (#9008)

### What this PR does / why we need it?
Same as https://github.com/vllm-project/vllm-ascend/pull/8337, with the CI test error fixed.

### How was this patch tested?
ngram_gpu + async_scheduling script:

vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 --tensor-parallel-size 1 --gpu-memory-utilization 0.8 --max-model-len 32768 --trust-remote-code --no-enable-prefix-caching --async-scheduling --speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'

ngram script:

vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 --tensor-parallel-size 1 --gpu-memory-utilization 0.8 --max-model-len 32768 --trust-remote-code --no-enable-prefix-caching --speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'

Test script:

vllm bench serve --port 8898 --backend vllm --model /model/Qwen3-14B --endpoint /v1/completions --dataset-name sonnet --dataset-path /model/vllm/benchmarks/sonnet.txt --request-rate 2.0 --sonnet-input-len 128 --sonnet-output-len 100 --sonnet-prefix-len 10 --num-prompts 40 --ignore-eos --percentile-metrics "ttft,tpot,itl,e2el"

Test results:
- ngram + sync: TTFT 330 ms, TPOT 39 ms
- ngram + async (this PR): TTFT 136 ms, TPOT 33 ms

In cooperation with @CarterDuan.

- vLLM version: v0.19.1
- vLLM main: https://github.com/vllm-project/vllm/commit/4d51588e2381018348f1022dfa3a7698899805b7

---------
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: kx <1670186653@qq.com>

Last updated: 14 days ago

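To put the figures reported in #9008 in relative terms, the short calculation below derives the percentage reduction in TTFT and TPOT from the four numbers in that commit message. The helper name `relative_reduction` is made up for this illustration and is not part of vLLM or vllm-ascend.

```python
def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Return the latency reduction as a percentage of the 'before' value."""
    return (before_ms - after_ms) / before_ms * 100.0


# Figures copied from the commit message of #9008.
ttft_sync_ms, ttft_async_ms = 330.0, 136.0
tpot_sync_ms, tpot_async_ms = 39.0, 33.0

ttft_gain = relative_reduction(ttft_sync_ms, ttft_async_ms)
tpot_gain = relative_reduction(tpot_sync_ms, tpot_async_ms)

print(f"TTFT reduced by {ttft_gain:.1f}%")  # about 58.8%
print(f"TPOT reduced by {tpot_gain:.1f}%")  # about 15.4%
```

These percentages describe only the single benchmark run quoted above (Qwen3-14B, sonnet dataset, request rate 2.0); they are not a general claim about async scheduling.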
[CI] Add unit test framework (#1201)

This PR adds a unit test framework to enable UT for vLLM Ascend. Unit tests run on CPU machines. Like the e2e tests, they run once the lint check has passed.

For unit tests, this PR creates a new folder called ut under the tests module. The test files in ut should mirror the layout of the code in vllm-ascend, and each file name should start with the test_ prefix. For example, this PR adds test_ascend_config.py to test ascend_config.py.

A new file, worker/test_worker_v1.py, is also added as a placeholder. It should hold the unit tests for vllm-ascend/worker/worker_v1.py.

Additionally, a new fake_weight folder is added. It contains the config.json from facebook/opt-125m, so that the tests do not have to reach Hugging Face on every run.

TODO: add the remaining unit test files one by one.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Last updated: 11 months ago

[Misc] Refactor aclgraph accuracy test to use logprob-based comparison (#7455)

### What this PR does / why we need it?
Replace text-match assertions with a two-tier logprob accuracy check:
- Prefill (token 0): assert that the token ID is identical between the eager baseline and compiled mode, then verify that the logprob matches within atol.
- Decode (tokens 1-2): if the chosen tokens match, compare logprobs directly; if they differ, look up the baseline token in the compiled model's top-20 distribution and assert that the assigned logprob is within decode_atol (defaults to 2x atol).

This tolerates minor argmax drift caused by floating-point differences while still catching distribution divergence.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main: https://github.com/vllm-project/vllm/commit/8a680463fab3bc9e6760417cd5c0a6aa58283065

---------
Signed-off-by: wangli <wangli858794774@gmail.com>

Last updated: 2 months ago

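The two-tier check described in #7455 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the test file: the function names and argument layout are invented here, the matching-token decode case is assumed to use `atol`, and a baseline token that is missing from the compiled model's top-k is assumed to count as a failure.

```python
import math
from typing import Dict, Optional


def prefill_ok(base_tok: int, base_lp: float,
               comp_tok: int, comp_lp: float, atol: float) -> bool:
    """Tier 1: token IDs must be identical and logprobs within atol."""
    return base_tok == comp_tok and math.isclose(
        base_lp, comp_lp, rel_tol=0.0, abs_tol=atol)


def decode_ok(base_tok: int, base_lp: float,
              comp_tok: int, comp_lp: float,
              comp_topk: Dict[int, float],
              atol: float, decode_atol: Optional[float] = None) -> bool:
    """Tier 2: tolerate argmax drift, but not distribution divergence."""
    if decode_atol is None:
        decode_atol = 2 * atol  # default stated in the PR description
    if base_tok == comp_tok:
        # Same token chosen: compare the two logprobs directly
        # (using atol here is an assumption of this sketch).
        return math.isclose(base_lp, comp_lp, rel_tol=0.0, abs_tol=atol)
    # Different token chosen: cross-lookup the baseline token in the
    # compiled model's top-k logprobs (top-20 in the PR).
    if base_tok not in comp_topk:
        return False  # assumption: absent from top-k counts as a failure
    return math.isclose(base_lp, comp_topk[base_tok],
                        rel_tol=0.0, abs_tol=decode_atol)


# Argmax drifted from token 7 to token 9, but token 7 still has a
# similar logprob in the compiled distribution, so the check passes.
print(decode_ok(7, -0.70, 9, -0.65, {9: -0.65, 7: -0.75}, atol=0.05))
```

Returning a boolean instead of asserting keeps the sketch easy to exercise; a real test would presumably assert on the result and report the offending position.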
[CI] Main2main upgrade to 0324 (#7787)

### What this PR does / why we need it?
Main2main upgrade to vLLM 0324.

Breaking changes fixed:
1. PR [#37487](https://github.com/vllm-project/vllm/pull/37487) [V0 Deprecation] Refactor kv cache from list to element (c59a132f9): self.kv_cache changed from list[tensor] (one per virtual engine) to a single tensor.
2. PR [#37874](https://github.com/vllm-project/vllm/pull/37874) [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package (e3c6c10ca): LRUOffloadingManager + CPUBackend were refactored into CPUOffloadingManager.
3. PR [#32951](https://github.com/vllm-project/vllm/pull/32951) [Async][Spec Decoding] Zero-bubble async scheduling + spec decoding (fafe76b4a): a) changes self.positions and self.seq_lens from CpuGpuBuffer to plain GPU tensors; b) changes the output parameters of _get_cumsum_and_arange. In addition, _prepare_input_ids gains a num_reqs argument.
5. PR [#35007](https://github.com/vllm-project/vllm/pull/35007) [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (dc6908ac6): deletes vllm_is_batch_invariant() and the constant VLLM_BATCH_INVARIANT, replacing them with vllm.envs.

Known issues:
1. The 310p Qwen3.5 test fails because of a qwen3.5 patch failure; see issue #7976. @YangShuai52 is fixing it.

### Does this PR introduce _any_ user-facing change?
1. Zero-bubble async scheduling + spec decode needs the NPU _compute_slot_mapping_kernel and the corresponding support for delayed validation of accepted draft tokens (see PR #7640), so this PR disables the async scheduler whenever spec decode is enabled. This lets Main2Main proceed in parallel with the Spec Decode + Async scheduler work until the next release.

Co-Authored-By: zhaomingyu <zhaomingyu13@h-partners.com> wangbj127 <wangbj1207@126.com> SidaoY <1024863041@qq.com> 22dimensions <waitingwind@foxmail.com>

- vLLM main: https://github.com/vllm-project/vllm/commit/35141a7eeda941a60ad5a4956670c60fd5a77029

---------
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: wangbj127 <wangbj1207@126.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: wangbj127 <wangbj1207@126.com>

Last updated: 1 month ago

[Lint]Style: Convert test/ to ruff format(Batch #5) (#6747)

### What this PR does / why we need it?

| File Path |
| :--- |
| tests/e2e/singlecard/compile/backend.py |
| tests/e2e/singlecard/compile/test_graphex_norm_quant_fusion.py |
| tests/e2e/singlecard/compile/test_graphex_qknorm_rope_fusion.py |
| tests/e2e/singlecard/compile/test_norm_quant_fusion.py |
| tests/e2e/singlecard/model_runner_v2/test_basic.py |
| tests/e2e/singlecard/test_aclgraph_accuracy.py |
| tests/e2e/singlecard/test_aclgraph_batch_invariant.py |
| tests/e2e/singlecard/test_aclgraph_mem.py |
| tests/e2e/singlecard/test_async_scheduling.py |
| tests/e2e/singlecard/test_auto_fit_max_mode_len.py |
| tests/e2e/singlecard/test_batch_invariant.py |
| tests/e2e/singlecard/test_camem.py |
| tests/e2e/singlecard/test_completion_with_prompt_embeds.py |
| tests/e2e/singlecard/test_cpu_offloading.py |
| tests/e2e/singlecard/test_guided_decoding.py |
| tests/e2e/singlecard/test_ilama_lora.py |
| tests/e2e/singlecard/test_llama32_lora.py |
| tests/e2e/singlecard/test_models.py |
| tests/e2e/singlecard/test_multistream_overlap_shared_expert.py |
| tests/e2e/singlecard/test_quantization.py |
| tests/e2e/singlecard/test_qwen3_multi_loras.py |
| tests/e2e/singlecard/test_sampler.py |
| tests/e2e/singlecard/test_vlm.py |
| tests/e2e/singlecard/test_xlite.py |
| tests/e2e/singlecard/utils.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------
Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 3 months ago

[BugFix] Update the package name from 'flash_attn_v3' to 'flash_attn_npu_v3' (#9303)

### What this PR does / why we need it?
This PR updates the package name from flash_attn_v3 to flash_attn_npu_v3 to align with the flash_attn_npu open-source repository. It also updates the documentation with the repository link and installation instructions.

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

---------
Signed-off-by: wangx700 <wangxin700@huawei.com>

Last updated: 9 days ago

[Misc] Update vllm to v0.19.1 (#8448)

### What this PR does / why we need it?
1. Update transformers to v5.5.3.
   1.1 The lm-eval package needs to be upgraded to v0.4.11; otherwise the interfaces are incompatible.
   1.2 Transformers 5 drops add_bos_token/add_eos_token when a tokenizer_file is present, while TokenizersBackend defaults to add_bos_token=False, so DeepSeek string prompts no longer get BOS injected automatically and TP/EP or golden outputs diverge. See [tokenization_utils_base.py#L1783-L1785](https://github.com/huggingface/transformers/blob/ded2b747bde5e9933c140c29ca3615d759f5744d/src/transformers/tokenization_utils_base.py#L1783-L1785) and [tokenization_utils_tokenizers.py#L417-L419](https://github.com/huggingface/transformers/blob/ded2b747bde5e9933c140c29ca3615d759f5744d/src/transformers/tokenization_utils_tokenizers.py#L417-L419). This PR updates the corresponding golden values for the affected DeepSeek test cases.
2. Fix the WAITING_FOR_FSM error via https://github.com/vllm-project/vllm/pull/38048.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.19.0
- vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176

---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>

Last updated: 1 month ago

[CI] Main2main upgrade vllm to 0330 (#7962)

### What this PR does / why we need it?
Main2main upgrade vLLM to 0330.

Breaking changes fixed:
1. https://github.com/vllm-project/vllm/pull/37728: add a clear_row method for BlockTable.
2. https://github.com/vllm-project/vllm/pull/37975: adapt to the GatedDeltaNetAttention refactor.
3. https://github.com/vllm-project/vllm/pull/37698: update maybe_update_config in vllm_ascend/quantization/modelslim_config.py to adapt to this PR's change.
4. https://github.com/vllm-project/vllm/pull/37880: this PR adds the ability to set different MoE backends for the draft and target models, so it has to be overridden in the draft proposer.
5. https://github.com/vllm-project/vllm/pull/37853: for now, skip the test_cpu_offloading.py test case until this feature has been adapted.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

- vLLM version: v0.18.0
- vLLM main: https://github.com/vllm-project/vllm/commit/29e48707e8144b78dd5d756f793c26a405043f3d

---------
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Last updated: 1 month ago

[Feature] Support structured output for model runner v2 (#8443)

### What this PR does / why we need it?
This PR adds Ascend NPU support for structured outputs in model runner v2.

Upstream StructuredOutputsWorker.apply_grammar_bitmask cannot be used directly on Ascend because the original Triton kernel with BLOCK_SIZE=8192 may exceed UB capacity. Reducing BLOCK_SIZE avoids UB overflow, but increases the number of work-groups and may lead to unstable scheduling behavior on NPU.

To address this, this PR adds an Ascend-specific _apply_grammar_bitmask_kernel implementation that:
- introduces BLOCK_SIZE_SUB=1024 tiling inside each block to reduce UB usage
- applies the grammar bitmask to logits with an NPU-compatible Triton kernel
- patches _apply_grammar_bitmask_kernel for model runner v2

- vLLM version: v0.19.0
- vLLM main: https://github.com/vllm-project/vllm/commit/6f786f2c506cb07f4566771fdc62e640e2c4a176

---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>

Last updated: 16 days ago

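The sub-block tiling idea in #8443 can be illustrated in plain Python. This is only a CPU emulation of the loop structure, not the Triton kernel from the PR: the function name is invented, and the bitmask layout (32 tokens packed per word, bit set to 1 meaning "token allowed", disallowed logits set to negative infinity) is an assumption of this sketch.

```python
from typing import List

NEG_INF = float("-inf")


def apply_bitmask_tiled(logits: List[float], bitmask_words: List[int],
                        block_size: int = 8192,
                        block_size_sub: int = 1024) -> List[float]:
    """Mask disallowed tokens in place, walking each block in sub-tiles.

    The outer loop plays the role of one work-group per BLOCK_SIZE chunk;
    the middle loop processes BLOCK_SIZE_SUB tokens at a time, so only a
    sub-tile's worth of data would need to be resident at once.
    """
    vocab = len(logits)
    for block_start in range(0, vocab, block_size):
        block_end = min(block_start + block_size, vocab)
        for sub_start in range(block_start, block_end, block_size_sub):
            sub_end = min(sub_start + block_size_sub, block_end)
            for tok in range(sub_start, sub_end):
                word = bitmask_words[tok // 32]
                allowed = (word >> (tok % 32)) & 1
                if not allowed:
                    logits[tok] = NEG_INF
    return logits


# Tiny example: 40 tokens, only tokens 0, 2 and 32 are allowed.
example = apply_bitmask_tiled([0.0] * 40, [0b101, 0b1],
                              block_size=16, block_size_sub=4)
print([i for i, v in enumerate(example) if v == 0.0])
```

The number of outer blocks stays tied to BLOCK_SIZE, which is the point made in the PR description: the work-group count does not grow, while the per-step working set shrinks to BLOCK_SIZE_SUB elements.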
[bugfix][LoRA] Fix the LoRA accuracy issue introduced by an upstream vLLM change. (#6958)

### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue introduced by the upstream PR https://github.com/vllm-project/vllm/pull/32005.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py

- vLLM version: v0.16.0
- vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>

Last updated: 2 months ago

[Bugfix] Fix multi-instance serving OOM on single card (#7427)

### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/issues/7308.

Subtract init_non_torch_memory (which may already be in use by the first instance) from the total non_torch_memory when calculating available_kv_cache_memory. Use non_torch_memory_increase (already contained in non_kv_cache_memory) directly to calculate available_kv_cache_memory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Launch two vllm-ascend instances sequentially on a single card.

```bash
# Launch first instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
  --port 8100 \
  --host 0.0.0.0 \
  --additional-config='{"enable_cpu_binding":true}' \
  --gpu-memory-utilization 0.3 \
  --max-num-seqs 1 \
  --max-model-len 2048 \
  --max-num-batched-tokens 2048 \
  --no-enable-prefix-caching \
  --enforce-eager

# Launch second instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
  --port 8101 \
  --host 0.0.0.0 \
  --additional-config='{"enable_cpu_binding":true}' \
  --gpu-memory-utilization 0.3 \
  --max-num-seqs 1 \
  --max-model-len 2048 \
  --max-num-batched-tokens 2048 \
  --no-enable-prefix-caching \
  --enforce-eager
```

**Before this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340388298034668 GiB
init_non_torch_memory: 0.3616676330566406 GiB
non_torch_memory_before_empty_cache: 0.3896217346191406 GiB
non_torch_memory_increase: 0.0279541015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB
------------------------------------------------------------------
# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2336344718933105 GiB
init_non_torch_memory: 18.37220001220703 GiB
non_torch_memory_before_empty_cache: 18.399906158447266 GiB
non_torch_memory_increase: 0.02754974365234375 GiB
non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: -1.32 GiB
```

**After this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340540885925293 GiB
init_non_torch_memory: 0.36182403564453125 GiB
non_torch_memory_before_empty_cache: 0.38979339599609375 GiB
non_torch_memory_increase: 0.0279693603515625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------
# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.233344554901123 GiB
init_non_torch_memory: 18.74309539794922 GiB
non_torch_memory_before_empty_cache: 18.770355224609375 GiB
non_torch_memory_increase: 0.02725982666015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: 17.05 GiB
```

- vLLM version: v0.17.0
- vLLM main: https://github.com/vllm-project/vllm/commit/4497431df654e46fb1fb5e64bf8611e762ae5d87

---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>

Last updated: 2 months ago

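The before/after figures for the second instance in #7427 can be reproduced with the formula quoted in the log comments. The snippet below is a worked check of that arithmetic only; the function is a stand-in written for this illustration and is not the vllm-ascend implementation.

```python
def available_kv_cache_memory(requested_memory: float,
                              non_kv_cache_memory: float,
                              cleared_by_empty_cache: float) -> float:
    """Formula from the log comment, all values in GiB."""
    return requested_memory - non_kv_cache_memory - cleared_by_empty_cache


# Second instance, before the fix: memory held by the first instance is
# misattributed to "cleared by empty_cache" and the result goes negative.
before = available_kv_cache_memory(
    requested_memory=18.287109375,
    non_kv_cache_memory=1.2336344718933105,
    cleared_by_empty_cache=18.372356414794922,
)

# Second instance, after the fix: the cleared amount is 0.0 GiB.
after = available_kv_cache_memory(
    requested_memory=18.287109375,
    non_kv_cache_memory=1.233344554901123,
    cleared_by_empty_cache=0.0,
)

print(f"before: {before:.2f} GiB")  # -1.32 GiB, matching the log
print(f"after:  {after:.2f} GiB")   # 17.05 GiB, matching the log
```

The negative value before the fix comes almost entirely from the roughly 18.37 GiB that the first instance already occupied when the second one started.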
[Feature] Support LoRA with Qwen3.5 dense model. (#9023)

### What this PR does / why we need it?

This PR adds support for LoRA with Qwen3.5 dense models (e.g., Qwen3.5-4B, Qwen3.5-27B). It modifies AscendGatedDeltaNetAttention to support the projection layout used in these models and updates the LoRA replacement logic to handle merged linear layers with more than two packed modules.

The modification to AscendGatedDeltaNetAttention is modeled on vLLM's [GatedDeltaNetAttention](https://github.com/vllm-project/vllm/blob/bc150f50299199599673614f80d12a196f377655/vllm/model_executor/layers/mamba/gdn_linear_attn.py#L519).

Fixes [#40869](https://github.com/vllm-project/vllm/issues/40869)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested with pytest -sv test_qwen35_densemodel_lora.py.

- vLLM version: v0.19.1
- vLLM main: https://github.com/vllm-project/vllm/commit/4d51588e2381018348f1022dfa3a7698899805b7

---------

Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>

18 days ago
[BugFix] Fix _dummy_run warmup mismatch when using --language-model-only (#8556)

### What this PR does / why we need it?

- Fix _dummy_run warmup using self.is_multimodal_model while _preprocess uses self.supports_mm_inputs, which caused a torch.compile crash when --language-model-only is set
- Add a single-card e2e test for language_model_only=True on Qwen3-VL-8B-Instruct

When --language-model-only is set, is_multimodal_model remains True (the model architecture is still multimodal) but supports_mm_inputs becomes False (all modality limits are 0). The _dummy_run warmup used is_multimodal_model to choose the embeddings path (inputs_embeds=tensor), while _preprocess used supports_mm_inputs to choose the token-ids path (input_ids=tensor). This mismatch caused torch.compile/dynamo to crash with AttributeError: 'NoneType' object has no attribute 'size' on the first inference request.

### Does this PR introduce _any_ user-facing change?

Now we can use --language-model-only to load only the language component of a multimodal model, reducing HBM usage.

### How was this patch tested?

Changed the condition in _dummy_run (line 2574 of model_runner_v1.py) from self.is_multimodal_model to self.supports_mm_inputs to align with _preprocess.

- [x] API compatibility test: 100/100 stress test passed
- [x] Single-card e2e test: test_multimodal_vl_language_model_only with Qwen3-VL-8B-Instruct

Signed-off-by: underfituu <hzhucong@163.com>

1 month ago
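The flag mismatch described in #8556 can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in model_runner_v1.py: `RunnerFlags` and the three helper functions are hypothetical names, and only the two attribute names and the two input paths (`inputs_embeds` and `input_ids`) come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class RunnerFlags:
    """Hypothetical stand-in for the two attributes named in the commit message."""
    is_multimodal_model: bool
    supports_mm_inputs: bool


def preprocess_path(flags: RunnerFlags) -> str:
    # Per the commit message, _preprocess keys off supports_mm_inputs.
    return "inputs_embeds" if flags.supports_mm_inputs else "input_ids"


def dummy_run_path_before(flags: RunnerFlags) -> str:
    # Warmup before the fix: keyed off is_multimodal_model.
    return "inputs_embeds" if flags.is_multimodal_model else "input_ids"


def dummy_run_path_after(flags: RunnerFlags) -> str:
    # Warmup after the fix: same condition as _preprocess.
    return "inputs_embeds" if flags.supports_mm_inputs else "input_ids"


# --language-model-only: the architecture is still multimodal,
# but all modality limits are 0.
flags = RunnerFlags(is_multimodal_model=True, supports_mm_inputs=False)

print("before:", dummy_run_path_before(flags), "vs", preprocess_path(flags))
print("after: ", dummy_run_path_after(flags), "vs", preprocess_path(flags))
```

In the "before" case the warmup and the real request take different input paths, which is the condition the commit message links to the torch.compile crash. In the "after" case both take the token-ids path.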
[Misc] Refactor aclgraph accuracy test to use logprob-based comparison (#7455)

### What this PR does / why we need it?

Replace text-match assertions with a two-tier logprob accuracy check:

- Prefill (token 0): assert the token ID is identical between the eager baseline and compiled mode, then verify the logprob matches within atol.
- Decode (tokens 1-2): if the chosen tokens match, compare logprobs directly; if they differ, cross-lookup the baseline token in the compiled model's top-20 distribution and assert the assigned logprob is within decode_atol (defaults to 2x atol).

This tolerates minor argmax drift caused by floating-point differences while still catching distribution divergence.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main: https://github.com/vllm-project/vllm/commit/8a680463fab3bc9e6760417cd5c0a6aa58283065

---------

Signed-off-by: wangli <wangli858794774@gmail.com>

2 months ago
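The two-tier check described in #7455 could look roughly like the sketch below. It is not the test's actual code: the `Step` container, the function name, the default `atol=0.05`, and the sample values are hypothetical, and using `decode_atol` for the matching-token decode case is an assumption, since the commit message does not say which tolerance applies there.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One generated position: chosen token, its logprob, and the top-k logprobs."""
    token_id: int
    logprob: float
    top_logprobs: Dict[int, float] = field(default_factory=dict)


def compare_logprobs(
    baseline: List[Step],
    compiled: List[Step],
    atol: float = 0.05,  # hypothetical default
    decode_atol: Optional[float] = None,
) -> List[str]:
    """Return mismatch messages; an empty list means the check passed."""
    if decode_atol is None:
        decode_atol = 2 * atol
    errors: List[str] = []

    # Tier 1, prefill (token 0): token IDs must match exactly, logprobs within atol.
    b0, c0 = baseline[0], compiled[0]
    if b0.token_id != c0.token_id:
        errors.append("position 0: token IDs differ")
    elif abs(b0.logprob - c0.logprob) > atol:
        errors.append("position 0: logprob outside atol")

    # Tier 2, decode: tolerate argmax drift via a cross-lookup in the compiled top-k.
    for pos, (b, c) in enumerate(zip(baseline[1:], compiled[1:]), start=1):
        if b.token_id == c.token_id:
            diff = abs(b.logprob - c.logprob)
        elif b.token_id in c.top_logprobs:
            diff = abs(b.logprob - c.top_logprobs[b.token_id])
        else:
            errors.append(f"position {pos}: baseline token missing from compiled top-k")
            continue
        if diff > decode_atol:
            errors.append(f"position {pos}: logprob outside decode_atol")
    return errors


baseline = [Step(11, -0.10), Step(22, -0.70), Step(33, -1.20)]
# Argmax drift at position 2: compiled picks token 44, but still ranks token 33 closely.
compiled_ok = [Step(11, -0.12), Step(22, -0.73), Step(44, -1.15, {44: -1.15, 33: -1.26})]
# Real divergence: the baseline token is far less likely under the compiled model.
compiled_bad = [Step(11, -0.12), Step(22, -0.73), Step(44, -0.30, {44: -0.30, 33: -2.50})]

print(compare_logprobs(baseline, compiled_ok))
print(compare_logprobs(baseline, compiled_bad))
```

Returning a list of messages instead of asserting on the first failure is a design choice for this sketch: it reports every diverging position in one run.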