File | Last commit | Last updated
[MM][Doc] Update online serving tutorials for Qwen2-Audio (#3606)

### What this PR does / why we need it?
Update online serving tutorials for Qwen2-Audio. Part of https://github.com/vllm-project/vllm-ascend/issues/3508.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

Last updated: 7 months ago
[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301)

### What this PR does / why we need it?
This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test and by running CI.

- vLLM version: release/v0.12.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

Last updated: 3 months ago
[P/D] Check wildcard address for layerwise connector (#7389)

### What this PR does / why we need it?
Check wildcard address for layerwise connector.

- vLLM version: v0.17.0
- vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d

---------

Signed-off-by: liziyu <liziyu16@huawei.com>

Last updated: 2 months ago
[Lint] fix typos error in epd_load_balance_proxy_layerwise_server_example.py (#7199)

### What this PR does / why we need it?
This PR fixes a typo in two function names in the epd_load_balance_proxy_layerwise_server_example.py example script. The function names aquire_aborted_pd_requests and aquire_aborted_prefiller_requests were misspelled and have been corrected to acquire_aborted_pd_requests and acquire_aborted_prefiller_requests respectively. This improves code readability and correctness.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

Last updated: 2 months ago
[CI]Fixed the spell check function in typos.toml (#6753)

### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` is a character class, so it actually ignores all words containing any of the following characters: `u, e, 4, m, 3, |`

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw", ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*", ".*ot.*", ".*[Tt]h[rR].*"]
```

===fix===>

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw", ".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*", ".*ot.*", ".*[Tt]h[rR].*"]
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: MrZ20 <2609716663@qq.com>

Last updated: 3 months ago
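The difference between a character class and a grouped alternation, which is the root cause described in #6753, can be illustrated with a small sketch. It uses Python's `re` module purely for demonstration; typos itself is a Rust tool, and the grouped pattern `(UE4M3|ue4m3)` below is an assumption about the intended form, not a copy of the committed configuration. The helper name `is_ignored` and the sample identifiers are hypothetical.

```python
import re

# Pattern as originally written: square brackets form a character class, so it
# matches any identifier containing one of the characters U, E, 4, M, 3, |, u, e, m.
CHAR_CLASS = re.compile(r".*[UE4M3|ue4m3].*")

# Assumed intent: a group with alternation matches only the two literal names.
GROUPED = re.compile(r".*(UE4M3|ue4m3).*")


def is_ignored(pattern: re.Pattern, identifier: str) -> bool:
    """Return True if the whole identifier matches the ignore pattern."""
    return pattern.fullmatch(identifier) is not None


# Hypothetical identifiers: two real format names, two ordinary words
# (one of them misspelled), and one word with none of the class characters.
samples = ["float8_UE4M3", "scale_ue4m3", "recieve", "kv_transfer", "proxy"]

for name in samples:
    print(f"{name:14} char_class={is_ignored(CHAR_CLASS, name)!s:5} "
          f"grouped={is_ignored(GROUPED, name)}")
```

With the character class, a misspelled word such as `recieve` is silently skipped by the spell checker simply because it contains an `e`; with the grouped form only identifiers that contain the literal format name are skipped.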
[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889)

### What this PR does / why we need it?
When the LLM Compressor quantization tool from the vLLM community is used to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format.

1. Support MoE model W4A8 dynamic weight.

- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>

Last updated: 3 months ago
[ModelLoader][Feature] Add rfork support for fast model loading (#7392)

### What this PR does / why we need it?
Support a new load format: RFORK. For implementation details of this feature, please refer to #7441.

### Does this PR introduce _any_ user-facing change?
Adds a new option for load-format: rfork, e.g.

```bash
vllm serve /workspace/models/Qwen3-8B --load-format rfork
```

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d

Signed-off-by: Marck <1412354149@qq.com>

Last updated: 2 months ago
[Feature][Debug] Upgrade device-side debug print with launchHostFunc (#8079)

### What this PR does / why we need it?
This PR upgrades Ascend debug printing from the old acl_graph_print path to a new launchHostFunc-based device_print implementation.

The previous design had two major issues:
- it could hit a deadlock path like `main thread -> device sync -> callback -> GIL -> main thread`
- it fails under Dynamo / torch.compile

To address that, this PR adds a custom-op-based device print path that can be preserved through compile and graph execution, and removes the older graph-only helper.
- add device_print(str) and device_print_tensor(Tensor) custom ops in csrc/torch_binding.cpp
- add Meta registrations for the new print ops in csrc/torch_binding_meta.cpp
- mark the print ops as side-effectful in vllm_ascend/utils.py so FX/Inductor does not drop or reorder them
- remove acl_graph_print and its old stream subscription / cleanup path from vllm_ascend/utils.py
- keep device_print as the single Python debugging helper, with single-argument semantics
- add examples/device_print_demo.py to validate eager, torch.compile(backend="aot_eager"), and torch.npu.graph replay
- prefer CANN ACL headers over torch_npu's bundled ACL headers so aclrtLaunchHostFunc

### Does this PR introduce _any_ user-facing change?
Yes, for developers debugging Ascend execution:
- acl_graph_print is removed
- device_print(...) becomes the supported debug helper
- the new helper is intended to work in eager mode, torch.compile, and NPU graph replay

### How was this patch tested?
- added manual validation through examples/device_print_demo.py

- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

Last updated: 1 month ago
[Lint]Style: Convert example to ruff format (#5863)

### What this PR does / why we need it?
This PR fixes linting issues in the example/ directory to align with the project's Ruff configuration.

- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>

Last updated: 4 months ago
[Refactor]Refactor of vllm_ascend/distributed module (#5719)

### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604

This PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d

---------

Signed-off-by: lty <linhebiwen@gmail.com>

Last updated: 4 months ago
[Doc][Misc] Add metrics usage documentation and example (#6962)

## What this PR does / why we need it?
This PR addresses issue #5027 where users find that output.metrics returns None when using the vLLM offline inference API.

**Root Cause**: vLLM disables log stats by default (disable_log_stats=True), which causes output.metrics to be None.

**Changes**:
1. Added a NOTE comment in examples/offline_inference_npu.py explaining how to enable metrics
2. Created a new example examples/offline_inference_metrics.py demonstrating how to access request-level metrics (first_token_time, finished_time, etc.) by setting disable_log_stats=False

## Does this PR introduce _any_ user-facing change?
Yes - adds documentation and example code to help users understand how to access output metrics.

## How was this patch tested?
- Documentation/example change only
- Verified example code follows the same patterns as existing examples

Closes #5027

- vLLM version: v0.16.0
- vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: NJX-njx <3771829673@qq.com>

Last updated: 2 months ago
Drop torchair (#4814)

aclgraph is stable and fast now, so this PR drops the torchair graph mode.

TODO: some logic that adapts to torchair should be cleaned up as well. We'll do it in a following PR.

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>

Last updated: 5 months ago
[Feat] [310p] Support w8a8sc quantization method (#7075)

### What this PR does / why we need it?
New quantization method: introduces support for the W8A8SC static linear quantization scheme specifically for 310P hardware, enabling more efficient model compression. Also refactors save_sharded_state_310.py to avoid a multi-process issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
W8A8SC quant E2E test.

- vLLM version: v0.16.0
- vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

Last updated: 2 months ago