Precision Data Collection in the vLLM Scenario
Overview
msProbe can collect intermediate precision data during model execution in the vLLM inference scenario.
The usage methods for different vLLM scenarios are as follows:
- For vLLM Ascend, see msProbe Debugging Guide.
- For native vLLM, add the PrecisionDebugger API to GPUModelRunner and start inference to collect data. For details, see the following sections.
For details about performance changes in dump statistics mode and the data volume collected in dump tensor mode, see [Dump Baseline](../baseline/pytorch_data_dump_perf_baseline.md).
Note:
- Currently, data can be collected only in vLLM eager mode. When starting the service, you need to specify the --enforce-eager parameter.
- In the vLLM-Ascend scenario, you are advised to directly reuse the capabilities that have been integrated in the official document and pass dump_config_path through --additional-config.
- PrecisionDebugger should be initialized as early as possible. It is recommended that the initialization be performed before the model execution logic in GPUModelRunner.__init__ starts, so that the related APIs can be wrapped by the tool.
- execute_model is typically a long method. You are advised to wrap the logic after start in a try...finally block, and call stop and step inside the finally clause. If wrapping the logic is inconvenient, you may also call stop and step before each return statement. Otherwise, result files such as dump.json may become incomplete.
- To collect L0 or mix data, the model parameter needs to be passed to start. To collect only L1 data, you can directly call start().
- If a dynamo-related error occurs, run export TORCHDYNAMO_DISABLE=1 on the NPU to disable dynamo globally.
- The tool provides a fixed API support list. If you need to delete or add dump APIs, manually modify support_wrap_ops.yaml. The following is an example:

      functional:  # functional indicates the operator type. Find the corresponding type and delete or add APIs in the following format:
        - conv1d
        - conv2d
        - conv3d

  Scenario where an API is deleted: The code logic of some models may involve native API type verification. When the tool performs the dump operation, the type of the API wrapped by the tool may differ from the native API type. In this case, the verification may fail. For details, see Exceptions in FAQs.
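To illustrate why a wrapped API can fail a type verification, the following minimal sketch uses only the Python standard library. It is an analogy, not msProbe code: wrap_for_dump is a hypothetical stand-in for any tool that replaces an API with a Python wrapper, math.sqrt stands in for a native API, and is_native stands in for a check inside model code.

```python
import functools
import math
import types


def wrap_for_dump(func):
    """Hypothetical stand-in for a dump tool's wrapper (not msProbe code)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # A real tool would record inputs and outputs around this call.
        return func(*args, **kwargs)

    return wrapper


def is_native(func):
    """Example of a model-side check that expects a built-in (C) function."""
    return isinstance(func, types.BuiltinFunctionType)


native_sqrt = math.sqrt
wrapped_sqrt = wrap_for_dump(math.sqrt)

print(is_native(native_sqrt))   # the unwrapped API passes the check
print(is_native(wrapped_sqrt))  # the wrapper is a plain Python function
print(wrapped_sqrt(9.0) == native_sqrt(9.0))  # the computed result is unchanged
```

If a model performs a check of this kind on an API, deleting that API from support_wrap_ops.yaml keeps it unwrapped, at the cost of not dumping its data.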
Preparations
Environment Preparation
Install msProbe by referring to msProbe Installation Guide.
Constraints
Only data of models implemented based on the PyTorch framework can be collected. Currently, the dynamo scenario where the PyTorch version is 2.7 or later is not supported.
Quick Start
The following describes how to use the tool in two vLLM scenarios.
vLLM-Ascend
In this scenario, you can directly use msProbe capabilities by simply using --additional-config to pass the dump configuration file path. The following is an example provided in the official document:
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--dtype float16 \
--enforce-eager \
--host 0.0.0.0 \
--port 8000 \
--additional-config '{"dump_config_path": "/data/msprobe_config.json"}'
Note:
- The official document can be found on the vLLM-Ascend latest documentation page at the following link: https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/msprobe_guide.html
- This method is applicable to vLLM Ascend that has integrated msProbe capabilities. You do not need to manually modify the GPUModelRunner implementation.
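Hand-writing the JSON value of --additional-config inside shell quotes is easy to get wrong. The following sketch, which is an illustration and not part of msProbe or vLLM, builds the same command as the example above in Python. The helper name build_serve_command is hypothetical; the flags and values are those shown in the example.

```python
import json
import shlex


def build_serve_command(model, dump_config_path, host="0.0.0.0", port=8000):
    """Return the vllm serve command from the example as an argument list."""
    # json.dumps guarantees that the value is valid JSON.
    additional_config = json.dumps({"dump_config_path": dump_config_path})
    return [
        "vllm", "serve", model,
        "--dtype", "float16",
        "--enforce-eager",
        "--host", host,
        "--port", str(port),
        "--additional-config", additional_config,
    ]


cmd = build_serve_command("Qwen/Qwen2.5-0.5B-Instruct", "/data/msprobe_config.json")

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The argument list can also be passed directly to subprocess.run, which avoids shell quoting altogether.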
Native vLLM
The following uses a simple example to show how to use msProbe to collect precision data in GPUModelRunner of the native vLLM framework.
- Create a configuration file.

  Create a config.json file in the current directory to configure dump parameters. The following is an example:

      {
          "task": "statistics",
          "dump_path": "/home/data_dump",
          "rank": [],
          "step": [],
          "level": "mix",
          "async_dump": false,
          "statistics": {
              "scope": [],
              "list": [],
              "data_mode": [
                  "all"
              ],
              "summary_mode": "statistics"
          }
      }

  For details about the config.json file and configuration examples, see Configuration File Introduction.

- Enable msProbe in the vLLM framework.

  Find the file to which the GPUModelRunner class of the vLLM framework belongs, for example, vllm/v1/worker/gpu_model_runner.py.

  - Add PrecisionDebugger to the __init__ method of GPUModelRunner and pass the config.json path.

        class GPUModelRunner:
            def __init__(
                self,
                vllm_config: VllmConfig,
                device: torch.device,
            ):
                ################################ msprobe ################################
                from msprobe.pytorch import PrecisionDebugger, seed_all
                seed_all(mode=True)
                self.debugger = PrecisionDebugger(config_path="./config.json")
                ################################ msprobe ################################
                self.vllm_config = vllm_config
                self.device = device
                ...

  - Add start, stop, and step APIs to the execute_model method of GPUModelRunner.

    execute_model is typically a long method. You are advised to use try...finally to handle the end logic in a unified manner. This approach requires minimal code modification and is less likely to miss a return branch.

        @torch.inference_mode()
        def execute_model(
            self,
            scheduler_output: "SchedulerOutput",
            intermediate_tensors: Optional[IntermediateTensors] = None,
        ) -> Union[ModelRunnerOutput, torch.Tensor]:
            ################################ msprobe ################################
            if hasattr(self, "debugger"):
                self.debugger.start(model=self.model)
            ################################ msprobe ################################
            try:
                ...
                return output
            finally:
                ################################ msprobe ################################
                if hasattr(self, "debugger"):
                    self.debugger.stop()
                    self.debugger.step()
                ################################ msprobe ################################

    If using try...finally to wrap the entire code block is inconvenient, you can also add the following call before each corresponding return statement.

        if hasattr(self, "debugger"):
            self.debugger.stop()
            self.debugger.step()
        return output

- Start the vLLM service and collect data.

  When starting the service, you need to enable the eager mode. The following is an example:

      #!/bin/bash
      export TORCHDYNAMO_DISABLE=1
      vllm serve Qwen/Qwen2.5-0.5B-Instruct \
      --dtype float16 \
      --enforce-eager \
      --host 0.0.0.0 \
      --port 8000

  After the service is started, send an inference request. During the request execution, dump is automatically triggered.

      curl http://127.0.0.1:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "Qwen/Qwen2.5-0.5B-Instruct",
          "prompt": "Explain gravity in one sentence.",
          "max_tokens": 32,
          "temperature": 0
      }'
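If curl is not available, the same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper names build_completion_request and send are hypothetical; the URL, headers, and payload mirror the curl request above.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build the same POST request as the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_completion_request(
    "http://127.0.0.1:8000",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "Explain gravity in one sentence.",
)
print(request.full_url)
```

Once the service is running, calling send(request) issues the request, which should trigger the dump in the same way as the curl command.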
Data Collection
vLLM uses the same precision data collection functions and data structure as PyTorch. For more details, see Precision Data Collection in PyTorch.
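After a request has been processed, a quick check that result files were written can save a debugging round trip. The following sketch is an assumption-laden helper, not part of msProbe: it searches recursively for files named dump.json under the dump_path from the example configuration, without assuming any particular directory layout or file schema, and reports each file's size and top-level keys.

```python
import json
from pathlib import Path


def find_dump_files(dump_path):
    """Return every dump.json found anywhere under dump_path, sorted by path."""
    return sorted(Path(dump_path).rglob("dump.json"))


def summarize(dump_file):
    """Load one dump.json and report its size and top-level keys."""
    with open(dump_file, encoding="utf-8") as f:
        content = json.load(f)
    keys = sorted(content) if isinstance(content, dict) else []
    return {
        "file": str(dump_file),
        "bytes": Path(dump_file).stat().st_size,
        "top_level_keys": keys,
    }


if __name__ == "__main__":
    # "/home/data_dump" is the dump_path used in the example config.json.
    for dump_file in find_dump_files("/home/data_dump"):
        try:
            print(summarize(dump_file))
        except json.JSONDecodeError as err:
            # Possibly an incomplete file, for example if stop and step were not reached.
            print(f"{dump_file}: not valid JSON ({err})")
```

If no file is listed, check that the service was started with --enforce-eager and that stop and step are reached on every return path of execute_model.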