File | Last commit | Last updated
[MM][Doc] Update online serving tutorials for Qwen2-Audio (#3606) ### What this PR does / why we need it? Update online serving tutorials for Qwen2-Audio. Part of https://github.com/vllm-project/vllm-ascend/issues/3508. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: shen-shanshan <467638484@qq.com> 7 months ago
[feat] proxy support elastic scaling (#5063) **[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:** https://github.com/vllm-project/vllm-ascend/issues/3380 ### What this PR does / why we need it? Support elastic scaling for P/D instances based on Mooncake connector deployment. **Supported API routes** * /instances/add: add prefill nodes or decode nodes to the list. * /instances/remove: remove prefill nodes or decode nodes from the list. **Supported functions** * Support **adding** prefill nodes or decode nodes. - If a prefill or decode server is deployed **after the proxy is deployed**, the server can use the /instances/add API to join the proxy server. The prefill or decode server sends a signal to the proxy server, and the proxy server checks the status of the node until the node is available. * Support **removing** prefill nodes or decode nodes: - Use the /instances/remove API to **delete the node** from the proxy server. ### Does this PR introduce _any_ user-facing change? For examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py: **Add 2 params** When adding nodes to the proxy, the proxy waits for the nodes to start, retrying up to a set number of times.
| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries while waiting for nodes to start |
| waiting-retry-interval | float | 10 | Check interval (seconds) while waiting for nodes to start |

For example:

```shell
python load_balance_proxy_server_example.py \
  --host 0.0.0.0 --port 9000 \
  --prefiller-hosts 127.0.0.1 127.0.0.1 \
  --prefiller-ports 8100 8101 \
  --decoder-hosts 127.0.0.1 127.0.0.1 \
  --decoder-ports 8200 8201 \
  --max-waiting-retries 3 \
  --waiting-retry-interval 10
```

**Add 2 API routes**

* Add instances: instances/add

For example, add 2 prefiller instances:

```shell
curl -X POST http://localhost:9000/instances/add \
  -H "Content-Type: application/json" \
  -d '{ "type": "prefill", "instances": ["127.0.0.1:8102", "127.0.0.1:8103"] }'
```

Response:

```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```

If the node '127.0.0.1:8103' has not been started:

```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```

* Remove instances: instances/remove

For example, remove 1 decoder instance:

```shell
curl -X POST http://localhost:9000/instances/remove \
  -H "Content-Type: application/json" \
  -d '{ "type": "decode", "instances": "127.0.0.1:8201" }'
```

Response:

```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```

### How was this patch tested?
Run the proxy, then use the /instances/add API to add nodes and the /instances/remove API to remove nodes. * vLLM version: v0.11.0.rc3 * vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3 - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com> 5 months ago
[Misc] Cleanup useless print and logger (#5220) 1. Remove useless print calls. 2. Use the vLLM logger. 3. Change useless INFO messages to DEBUG level. - vLLM version: release/v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 5 months ago
[Bugfix] fix fastapi version (#5047) ### What this PR does / why we need it? Fix the fastapi version: == 0.123.10 (< 0.124.0). - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: hfadzxy <starmoon_zhang@163.com> 5 months ago
[Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036) ### What this PR does / why we need it? When the LLM Compressor quantization tool from the vLLM community is used to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm. 2. Support CompressedTensorsW8A8 static weight. - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric. 3. Support CompressedTensorsW8A8Dynamic weight. - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic. 4. Modify the override_quantization_method in AscendQuantConfig. Co-authored-by: taoqun110 taoqun@huawei.com Co-authored-by: chenxi-hh chen464822955@163.com - vLLM version: v0.11.2 --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: chenxi-hh <chen464822955@163.com> Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com> Co-authored-by: chenxi-hh <chen464822955@163.com> Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com> 6 months ago
[main][bugfix] bugfix for qwen3 moe quantization (#4599) ### What this PR does / why we need it? Fix the issue where the qwen3 moe service cannot be started after upgrading the vllm version. Error info: AttributeError: 'AscendFusedMoE' object has no attribute 'use_dp_chunking' ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.11.2 --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> 5 months ago
[Doc][P/D] Fix MooncakeConnector's name (#5172) ### What this PR does / why we need it? The vLLM community has integrated its own MooncakeConnector. The original scripts will now find that MooncakeConnector instead of the one from vLLM-Ascend, so all scripts that use the MooncakeConnector need to be changed to use another name. ### Does this PR introduce _any_ user-facing change? Yes, users need to use a new name to load the vLLM-Ascend MooncakeConnector. ### How was this patch tested? By CI. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> 5 months ago
[Misc] Update pooling example (#5002) ### What this PR does / why we need it? Since the param task has been deprecated, we should use the latest unified standard parameters for pooling models, which is clearer. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 --------- Signed-off-by: wangli <wangli858794774@gmail.com> 5 months ago
Drop 0.11.0 support (#4377) There is a lot of hack code for v0.11.0, which makes the code hard to upgrade to newer vLLM versions. Since v0.11.0 will release soon, let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2918c1b49c88c29783c86f78d2c4221cb9622379 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 6 months ago
Fix some ci issue and refactor modelrunner (#2445) ### What this PR does / why we need it? Fix some CI issues and refactor the model runner. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing tests. - vLLM version: v0.10.0 - vLLM main: https://github.com/vllm-project/vllm/commit/4d9c61993ac4209c97b3afef237b2387f2cd9b97 --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com> 9 months ago
[Misc][V0 Deprecation] Add __main__ guard to all offline examples (#1837) ### What this PR does / why we need it? Add __main__ guard to all offline examples. - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/76b494444fd864ffc53a623420668d1865c804b9 --------- Signed-off-by: shen-shanshan <467638484@qq.com> 10 months ago
[Doc]clean up ascend scheduler config from doc (#4612) Clean up the ascend scheduler config from the docs. - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 5 months ago
[Misc][V0 Deprecation] Add __main__ guard to all offline examples (#1837) ### What this PR does / why we need it? Add __main__ guard to all offline examples. - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/76b494444fd864ffc53a623420668d1865c804b9 --------- Signed-off-by: shen-shanshan <467638484@qq.com> 10 months ago
Drop 0.11.0 support (#4377) There is a lot of hack code for v0.11.0, which makes the code hard to upgrade to newer vLLM versions. Since v0.11.0 will release soon, let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2918c1b49c88c29783c86f78d2c4221cb9622379 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 6 months ago
[MOE]move weight transpose to wakeup for RL scenarios (#4626) ### What this PR does / why we need it? In reinforcement learning scenarios, the current inference applies a transpose operation to the weights. For a cleaner architecture, the weight transpose module was moved to wakeup. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: lhp-deep <liuhaopeng1@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com> 5 months ago
[feature] Prompt Embeddings Support for v1 Engine (#3026) ### What this PR does / why we need it? This PR, based on [19746](https://github.com/vllm-project/vllm/issues/19746), supports prompt embeddings for the v1 engine on NPU. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```python python examples/prompt_embed_inference.py ``` - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: jesse <szxfml@gmail.com> 7 months ago
[Misc][V0 Deprecation] Add __main__ guard to all offline examples (#1837) ### What this PR does / why we need it? Add __main__ guard to all offline examples. - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/76b494444fd864ffc53a623420668d1865c804b9 --------- Signed-off-by: shen-shanshan <467638484@qq.com> 10 months ago
Drop torchair (#4814) aclgraph is stable and fast now, so let's drop torchair graph mode. TODO: some logic that adapts to torchair should be cleaned up as well. We'll do it in a following PR. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> 5 months ago