| [CI]Main2main 0515 (#9176)
### What this PR does / why we need it?
Upstream PR
[vllm-project/vllm#39568](https://github.com/vllm-project/vllm/pull/39568)
is a complete rewrite of the routed-experts capture/transport pipeline.
It supersedes both:
- The original 0.20.2 design: `RoutedExpertsCapturer.get_instance()`
singleton, `save_captured_experts(indices=...)`, shared-memory +
`fcntl.flock` cross-process transport.
- The intermediate PR #39917 design: module-level
`get_global_experts_capturer()`,
`init_routed_experts_capturer_with_shared_cache()`,
`issue_routing_d2h_copy()`,
`extract_routed_experts_for_current_batch()`. This API existed in main
for only a few days and was never in a stable release; it has been
**fully removed**.
After the upgrade to vLLM 0515, vllm-ascend faces two API surfaces that
are incompatible at the source level:
| Aspect | 0.20.2 | main |
|---|---|---|
| Capturer access | `RoutedExpertsCapturer.get_instance()` (singleton) | `runner.routed_experts_capturer` (per-runner instance, no global) |
| Per-step `clear_buffer` | via singleton | via runner attribute |
| Per-step D2H + ship | `capturer.save_captured_experts(indices=cpu_slot_mapping)` (sync, shm write) | runner-managed pinned `routed_experts_cpu` D2H + `RoutedExpertsLists` on `ModelRunnerOutput.routed_experts` |
| Output channel | shm/flock to scheduler | `ModelRunnerOutput.routed_experts: RoutedExpertsLists` (`NamedTuple`, msgpack + zmq IPC) |
| `slot_mapping` source | `slot_mapping.cpu().numpy()` saved to `self.cpu_slot_mapping` | private device snapshot `routed_experts_slot_mapping_device`, then pinned `routed_experts_slot_mapping_cpu` |
| Layer hook injection | `select_experts` calls singleton from inside `apply()` | `module.router.set_capture_fn(...)` from `_bind_routed_experts_capturer` |
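
The "Capturer access" and "Layer hook injection" rows can be illustrated with a minimal, dependency-free sketch. Everything below is a stand-in written for illustration only: `SingletonCapturer`, `PerRunnerCapturer`, `Router`, `Runner`, and the `capture` method are hypothetical classes that borrow names from the table (`get_instance`, `routed_experts_capturer`, `set_capture_fn`, `_bind_routed_experts_capturer`) and are not the vLLM or vllm-ascend implementations.

```python
from typing import Callable, List, Optional, Tuple

# (layer_id, expert ids selected for one token); an illustrative record shape.
Record = Tuple[int, List[int]]


class SingletonCapturer:
    """Stand-in for the 0.20.2 pattern: one process-wide capturer."""

    _instance: Optional["SingletonCapturer"] = None

    def __init__(self) -> None:
        self.records: List[Record] = []

    @classmethod
    def get_instance(cls) -> "SingletonCapturer":
        # Lazily create the single shared instance.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def capture(self, layer_id: int, topk_ids: List[int]) -> None:
        self.records.append((layer_id, list(topk_ids)))


class PerRunnerCapturer:
    """Stand-in for the main pattern: each runner owns its own capturer."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def capture(self, layer_id: int, topk_ids: List[int]) -> None:
        self.records.append((layer_id, list(topk_ids)))


class Router:
    """Stand-in for a per-layer router that accepts an injected hook."""

    def __init__(self, layer_id: int) -> None:
        self.layer_id = layer_id
        self._capture_fn: Optional[Callable[[int, List[int]], None]] = None

    def set_capture_fn(self, fn: Callable[[int, List[int]], None]) -> None:
        self._capture_fn = fn

    def select_experts(self, topk_ids: List[int]) -> List[int]:
        # The layer never looks up a global; it only calls the injected hook.
        if self._capture_fn is not None:
            self._capture_fn(self.layer_id, topk_ids)
        return topk_ids


class Runner:
    """Stand-in runner that binds its own capturer into every router."""

    def __init__(self, num_layers: int) -> None:
        self.routed_experts_capturer = PerRunnerCapturer()
        self.routers = [Router(i) for i in range(num_layers)]
        self._bind_routed_experts_capturer()

    def _bind_routed_experts_capturer(self) -> None:
        for router in self.routers:
            router.set_capture_fn(self.routed_experts_capturer.capture)
```

In this sketch two `Runner` objects never share captured records, whereas every caller of `SingletonCapturer.get_instance()` writes into the same list. That isolation is the property the "per-runner instance, no global" cell describes.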
## Strategy Overview
1. **Keep the 0.20.2 path intact.** It already works end-to-end. All
0.20.2-specific call sites stay byte-identical.
2. **Add a parallel main path** gated by
`vllm_version_is("0.20.2") == False`. Reuse upstream
`GPUModelRunner.init_routed_experts_capturer()` (inherited) for buffer
allocation; override only `_bind_routed_experts_capturer` because
Ascend's `select_experts` does not go through upstream `BaseRouter`.
3. **Async scheduling: piggyback on upstream
`AsyncGPUModelRunnerOutput`.** vllm-ascend already constructs that
wrapper directly, so adding the `routed_experts=` kwarg is enough: the
wrapper handles `to_cpu_nonblocking()` on its copy stream and
`tolists()` finalization in `get_output()` for free.
4. **No new compat module, no monkey patches.** Branching is inline at
each call site; total surface is one new method
(`_bind_routed_experts_capturer`) plus three branched call sites in
`model_runner_v1.py` and one in `fused_moe.py`.
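
The inline branching described in items 1, 2, and 4 can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the code in this PR: `make_version_check` builds a local stand-in for `vllm_version_is`, and `FakeLegacyCapturer` and `finish_step` are hypothetical names. Only the shape of the branch (legacy call untouched, new path attaching `routed_experts` to the output) is taken from the description above.

```python
from typing import Callable, Dict, List, Optional


def make_version_check(installed: str) -> Callable[[str], bool]:
    """Build a stand-in for vllm_version_is bound to a pretend installed version."""

    def vllm_version_is(target: str) -> bool:
        return installed == target

    return vllm_version_is


class FakeLegacyCapturer:
    """Hypothetical stand-in for the 0.20.2 capturer; records what was saved."""

    def __init__(self) -> None:
        self.saved_indices: Optional[List[int]] = None

    def save_captured_experts(self, indices: List[int]) -> None:
        self.saved_indices = list(indices)


def finish_step(
    vllm_version_is: Callable[[str], bool],
    legacy_capturer: FakeLegacyCapturer,
    cpu_slot_mapping: List[int],
    routed_experts: List[List[int]],
) -> Dict[str, Optional[List[List[int]]]]:
    """One branched call site: the legacy path is untouched, the main path is additive."""
    output: Dict[str, Optional[List[List[int]]]] = {"routed_experts": None}
    if vllm_version_is("0.20.2"):
        # Legacy path: the capturer ships the data itself (shm write upstream).
        legacy_capturer.save_captured_experts(indices=cpu_slot_mapping)
    else:
        # Main path: routed experts travel on the runner output instead.
        output["routed_experts"] = routed_experts
    return output
```

Keeping the check inline at each call site, as in `finish_step`, means the legacy branch can be deleted later by removing one `if` arm per site, with no compat layer to unwind.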
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.20.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/ce29c26b31d432b1b4bc028c46bb2c3b07a667d8
---------
Signed-off-by: wangli <wangli858794774@gmail.com> | 12 days ago |