| [SpecDecode] Add spec decode support (#500)
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This support speculative decoding in Ascend, including speculating with
a draft model、by matching n-grams in the prompt、using MLP speculators
and using EAGLE based draft models.
Backport: https://github.com/vllm-project/vllm-ascend/pull/423
spec decode MultiStepWorker support TP1DraftModelRunner fully, support
run the draft_model_runner with multi-step prepare on the NPU directly
and support draft_model_runner use MLA.
1. before this pr, MultiStepWorker would not step into the branch
using NPU prepare, but only into the branch using CPU prepare (line 52
of vllm_ascend/patch/patch_multi_step_worker.py). Although this has
no effect on the correct operation of speculative decoding and the
performance of the two branches is basically the same as of the current
version, I support entering this branch in this PR. In general, there
are two main changes in patch_multi_step_worker.py: first, the
is_cuda_like() check is removed and the TP1DraftModelRunner
rewritten in vllm_ascend is used; second, the
supports_gpu_multi_step() function is made to return true on NPU
devices when outer Multi_step_worker could work correct.
3. before this pr, TP1DraftModelRunner only supports Attention on NPU,
but not MLA. The relevant adaptation is in
vllm_ascend/worker/draft_model_runner.py. Although I don’t know why
the input_positions of model_input.attn_metadata in vllm-ascend
needs to be added in execute_model, it is done in model_runner.py,
so I also made corresponding changes. Otherwise, when atten_backend is
MLA, it will prompt that input_positions cannot be found.
4. I commented out two lines in draft_model_runner.py in line118 to
support the scenario of K>1.
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
I added comments. In the future, when vllm-ascend supports lora feature,
the changes here can be restored.
TODO:
- [ ] revert the patch when the related issues are addressed in vllm
### How was this patch tested?
CI passed with new added test.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py
Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com> | 1 年前 |