File | Last commit | Last updated
[CI] fix nightly MiniMax-M2.5-w8a8 (#9042)

### What this PR does / why we need it?
This PR fixes the nightly MiniMax-M2.5-w8a8 perf config, which needs to be tested daily.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the test.

- vLLM version: v0.20.1
- vLLM main: https://github.com/vllm-project/vllm/commit/c7aa186d67b6f051680831418e957c67f34ba7a2

---------

Signed-off-by: weixin <murongfengerxch@163.com>

Last updated: 8 days ago
[BugFix][CI][310p] Fix CI error for 310p caused by DSV4 (#9402)

### What this PR does / why we need it?
Fixes the 310P online CI failure caused by an extra argument added to blocktable.py for DeepSeek v4.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
CI

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/0d4d334eaa583b9c09aa4eb7538c22db99fd84b3

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>

Last updated: 8 days ago
[SpecDecode] Add spec decode support (#500)

### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This adds speculative decoding support on Ascend, including speculating with a draft model, by matching n-grams in the prompt, using MLP speculators, and using EAGLE-based draft models.

Backport: https://github.com/vllm-project/vllm-ascend/pull/423
The spec decode MultiStepWorker now fully supports TP1DraftModelRunner: it can run the draft_model_runner with multi-step prepare directly on the NPU, and the draft_model_runner can use MLA.

1. Before this PR, MultiStepWorker never entered the branch that uses NPU prepare, only the branch that uses CPU prepare (line 52 of vllm_ascend/patch/patch_multi_step_worker.py). This does not affect the correctness of speculative decoding, and the performance of the two branches is basically the same in the current version, but this PR enables entering the NPU branch. In general, there are two main changes in patch_multi_step_worker.py: first, the is_cuda_like() check is removed and the TP1DraftModelRunner rewritten in vllm_ascend is used; second, the supports_gpu_multi_step() function returns true on NPU devices when the outer Multi_step_worker can work correctly.
3. Before this PR, TP1DraftModelRunner supported only Attention on the NPU, not MLA. The relevant adaptation is in vllm_ascend/worker/draft_model_runner.py. It is unclear why the input_positions of model_input.attn_metadata in vllm-ascend needs to be added in execute_model, but model_runner.py does so, and the corresponding change is made here as well. Otherwise, when atten_backend is MLA, an error reports that input_positions cannot be found.
4. Two lines in draft_model_runner.py at line 118 are commented out to support the K>1 scenario:
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
Comments were added there. In the future, when vllm-ascend supports the LoRA feature, these changes can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?
CI passed with the newly added tests.
- e2e test for medusa proposer: tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer: tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer: tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>

Last updated: 1 year ago