| [SpecDecode][Feature] Implement AscendExtractHiddenStatesProposer for speculative decoding (#8799)
### What this PR does / why we need it?
This PR introduces the `AscendExtractHiddenStatesProposer` to support
the `extract_hidden_states` speculative decoding method on Ascend NPUs.
It adapts the base proposer to use ACL graphs and implements
Ascend-specific logic for preparing next token IDs. Additionally, the
model runner is updated to support KV cache allocation and reshaping for
`cache_only_layers`.
### Does this PR introduce _any_ user-facing change?
Yes, users can now use the `extract_hidden_states` speculative decoding
method on Ascend hardware.
### How was this patch tested?
The changes were verified with new E2E tests in
`tests/e2e/singlecard/spec_decode/test_extract_hidden_states.py` and
unit tests in
`tests/ut/spec_decode/test_extract_hidden_states_proposer.py`.
- vLLM main: vllm-project/vllm@6f786f2
---------
Signed-off-by: Lin-Qingyang-Alec <895744968@qq.com> | 23 days ago |