vllm-ascend/examples · qq_43804300/vllm-ascend - AtomGit

GitHub · [Feature][Debug] Upgrade device-side debug print with launchHostFunc (#8079)

File	Last commit	Last updated
chat_templates	[MM][Doc] Update online serving tutorials for `Qwen2-Audio` (#3606) ### What this PR does / why we need it? Update online serving tutorials for `Qwen2-Audio`. Part of https://github.com/vllm-project/vllm-ascend/issues/3508. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	7 months ago
disaggregated_encoder	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test by running ci - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	3 months ago
disaggregated_prefill_v1	[P/D] Check wildcard address for layerwise connector (#7389) ### What this PR does / why we need it? Check wildcard address for layerwise connector - vLLM version: v0.17.0 - vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2 months ago
epd_disaggregated	[Lint] fix typos error in epd_load_balance_proxy_layerwise_server_example.py (#7199) ### What this PR does / why we need it? This PR fixes a typo in two function names in the `epd_load_balance_proxy_layerwise_server_example.py` example script. The function names `aquire_aborted_pd_requests` and `aquire_aborted_prefiller_requests` were misspelled and have been corrected to `acquire_aborted_pd_requests` and `acquire_aborted_prefiller_requests` respectively. This improves code readability and correctness. Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2 months ago
eplb	[CI]Fixed the spell check function in `typos.toml` (#6753) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007 Signed-off-by: MrZ20 <2609716663@qq.com>	3 months ago
external_online_dp	[CI]Fixed the spell check function in `typos.toml` (#6753) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007 Signed-off-by: MrZ20 <2609716663@qq.com>	3 months ago
quantization	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889) ### What this PR does / why we need it? When the LLM Compressor quantization tool from the vLLM community is used to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support MoE model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	3 months ago
rfork	[ModelLoader][Feature] Add rfork support for fast model loading (#7392) ### What this PR does / why we need it? Support a new load format: RFORK. For implementation details of this feature, please refer to #7441 ### Does this PR introduce _any_ user-facing change? Adds a new option for load-format: rfork, e.g. ```bash vllm serve /workspace/models/Qwen3-8B --load-format rfork ``` ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d Signed-off-by: Marck <1412354149@qq.com>	2 months ago
device_print_demo.py	[Feature][Debug] Upgrade device-side debug print with launchHostFunc (#8079) ### What this PR does / why we need it? This PR upgrades Ascend debug printing from the old `acl_graph_print` path to a new `launchHostFunc`-based `device_print` implementation. The previous design had two major issues: - it could hit a deadlock path like `main thread -> device sync -> callback -> GIL -> main thread` - it failed under Dynamo / `torch.compile` To address that, this PR adds a custom-op-based device print path that can be preserved through compile and graph execution, and removes the older graph-only helper. - add `device_print(str)` and `device_print_tensor(Tensor)` custom ops in `csrc/torch_binding.cpp` - add Meta registrations for the new print ops in `csrc/torch_binding_meta.cpp` - mark the print ops as side-effectful in `vllm_ascend/utils.py` so FX/Inductor does not drop or reorder them - remove `acl_graph_print` and its old stream subscription / cleanup path from `vllm_ascend/utils.py` - keep `device_print` as the single Python debugging helper, with single-argument semantics - add `examples/device_print_demo.py` to validate eager, `torch.compile(backend="aot_eager")`, and `torch.npu.graph` replay - prefer CANN ACL headers over torch_npu's bundled ACL headers so `aclrtLaunchHostFunc` ### Does this PR introduce _any_ user-facing change? Yes, for developers debugging Ascend execution: - `acl_graph_print` is removed - `device_print(...)` becomes the supported debug helper - the new helper is intended to work in eager mode, `torch.compile`, and NPU graph replay ### How was this patch tested? - added manual validation through `examples/device_print_demo.py` - vLLM version: - vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	1 month ago
offline_data_parallel.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_disaggregated_prefill_npu.py	[Refactor]Refactor of vllm_ascend/distributed module (#5719) ### What this PR does / why we need it? Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604 This PR refactors vllm_ascend/distributed, moving all kv_transfer-related code into a dedicated folder, as has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d --------- Signed-off-by: lty <linhebiwen@gmail.com>	4 months ago
offline_embed.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_external_launcher.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_inference_audio_language.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_inference_metrics.py	[Doc][Misc] Add metrics usage documentation and example (#6962) ## What this PR does / why we need it? This PR addresses issue #5027 where users find that `output.metrics` returns `None` when using the vLLM offline inference API. Root Cause: vLLM disables log stats by default (`disable_log_stats=True`), which causes `output.metrics` to be `None`. Changes: 1. Added a NOTE comment in `examples/offline_inference_npu.py` explaining how to enable metrics 2. Created a new example `examples/offline_inference_metrics.py` demonstrating how to access request-level metrics (`first_token_time`, `finished_time`, etc.) by setting `disable_log_stats=False` ## Does this PR introduce _any_ user-facing change? Yes - adds documentation and example code to help users understand how to access output metrics. ## How was this patch tested? - Documentation/example change only - Verified example code follows the same patterns as existing examples Closes #5027 - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 Signed-off-by: NJX-njx <3771829673@qq.com>	2 months ago
offline_inference_npu.py	[Doc][Misc] Add metrics usage documentation and example (#6962) ## What this PR does / why we need it? This PR addresses issue #5027 where users find that `output.metrics` returns `None` when using the vLLM offline inference API. Root Cause: vLLM disables log stats by default (`disable_log_stats=True`), which causes `output.metrics` to be `None`. Changes: 1. Added a NOTE comment in `examples/offline_inference_npu.py` explaining how to enable metrics 2. Created a new example `examples/offline_inference_metrics.py` demonstrating how to access request-level metrics (`first_token_time`, `finished_time`, etc.) by setting `disable_log_stats=False` ## Does this PR introduce _any_ user-facing change? Yes - adds documentation and example code to help users understand how to access output metrics. ## How was this patch tested? - Documentation/example change only - Verified example code follows the same patterns as existing examples Closes #5027 - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 Signed-off-by: NJX-njx <3771829673@qq.com>	2 months ago
offline_inference_npu_long_seq.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_inference_npu_tp2.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_inference_sleep_mode_npu.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
offline_weight_load.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
prompt_embed_inference.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
prompt_embedding_inference.py	[Lint]Style: Convert `example` to `ruff format` (#5863) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` directory to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/bde38c11df0ea066a740efe9b77fff5418be45df Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	4 months ago
run_dp_server.sh	Drop torchair (#4814) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	5 months ago
save_sharded_state_310.py	[Feat] [310p] Support w8a8sc quantization method (#7075) ### What this PR does / why we need it? New Quantization Method: Introduced support for the W8A8SC static linear quantization scheme specifically for 310P hardware, enabling more efficient model compression. Refactored save_sharded_state_310.py to avoid a multi-process issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8SC quant E2E test. - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2 months ago
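
Note on the `typos.toml` fix (#6753) listed for `eplb` and `external_online_dp`: the commit message says a bracketed pattern ignored every word containing any one of the listed characters. The sketch below reproduces that pitfall with Python's `re` module. The two pattern strings and the sample identifiers are assumptions chosen for illustration; they are not copied from `typos.toml`, and the sketch does not model how the `typos` tool itself anchors its patterns.

```python
import re

# Assumed illustrative patterns, not the literal typos.toml entries.
# Square brackets define a SET of single characters, so "|" is just
# one more member of the set rather than an alternation operator.
char_class = re.compile(r"[UE4M3|ue4m3]")
# Parentheses group an alternation between two whole words.
alternation = re.compile(r"(UE4M3|ue4m3)")

# Hypothetical identifiers: one real dtype-like name, two misspellings,
# and one word that shares no character with the set.
identifiers = ["float8_ue4m3", "recieve", "mistke", "xyz"]

class_hits = [w for w in identifiers if char_class.search(w)]
group_hits = [w for w in identifiers if alternation.search(w)]

print("character class matches:", class_hits)
print("alternation matches:   ", group_hits)
```

With the character class, the misspelled words `recieve` and `mistke` are matched (and would therefore be ignored by a spell checker using such a pattern) simply because they contain an `e` or an `m`; with the grouped alternation only `float8_ue4m3` is matched.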