Precision Data Collection in SGLang
Overview
msProbe can collect model precision data during mode execution by adding PrecisionDebugger to the core class ModelRunner
that is responsible for executing forward propagation in the SGLang framework and starting inference
For details about performance changes in dump statistics mode and data volume collected in dump tensor mode, see [Dump Baseline] (../baseline/pytorch_data_dump_perf_baseline.md).
Note:
-
Before collecting data, specify
--disable-cuda-graphof the SGLang framework to disable the graph mode. -
To collect data in online SGLang mode, specify
--skip-server-warmupof the SGLang framework to disable warmup, preventing data in the warmup phase from being collected. -
If a dynamo-related error occurs, set
export TORCHDYNAMO_DISABLE=1on NPU to disable dynamo globally. -
The tool provides a fixed API support list. If you need to delete or add dump APIs, manually modify support_wrap_ops.yaml. The following is an example:
functional: # functional indicates the operator type. Find the corresponding type and delete or add APIs in the following format: - conv1d - conv2d - conv3dScenario where an API is deleted: The code logic of some models may involve native API type verification. When the tool performs the dump operation, the API type encapsulated by the model may be different from the native API type. In this case, the verification may fail. For details, see FAQs.
Preparations
Environment Setup
Install msProbe by referring to msProbe Installation Guide.
Constraints
Only data of models implemented based on the PyTorch framework can be collected. Currently, the dynamo scenario where the PyTorch version is 2.7 or later is not supported.
Quick Start
The following uses a simple example to describe how to use msProbe to collect precision data in the SGLang framework.
-
Create a configuration file.
Create a
config.jsonfile in the current directory to configure dump parameters. The following is an example:{ "task": "statistics", "dump_path": "/home/data_dump", "rank": [], "step": [], "level": "mix", "async_dump": false, "statistics": { "scope": [], "list": [], "data_mode": [ "all" ], "summary_mode": "statistics" } }For details about config.json, see Configuration File Introduction.
-
Enable msProbe in the SGLang framework.
Find the file of the
ModelRunnerclass in the SGLang framework: sglang/srt/model_executor/model_runner.py.-
Add the
PrecisionDebuggerAPI to the__init__method of theModelRunnerclass and pass the path of theconfig.jsonfile.class ModelRunner(ModelRunnerKVCacheMixin): """ModelRunner runs the forward passes of the models.""" def __init__( self, model_config: ModelConfig, mem_fraction_static: float, gpu_id: int, tp_rank: int, ... ): ################################ msprobe ################################ from msprobe.pytorch import PrecisionDebugger, seed_all seed_all(mode=True) self.debugger = PrecisionDebugger(config_path="./config.json") ################################ msprobe ################################ # Parse args self.mem_fraction_static = mem_fraction_static self.device = server_args.device self.gpu_id = gpu_id self.tp_rank = tp_rank self.tp_size = tp_size ... -
Add
start,stop, andstepAPIs to theforwardmethod of theModelRunnerclass.Default scenario of the SGLang framework:
class ModelRunner(ModelRunnerKVCacheMixin): """ModelRunner runs the forward passes of the models.""" def __init__( ... def forward( self, forward_batch: ForwardBatch, skip_attn_backend_init: bool = False, pp_proxy_tensors: Optional[PPProxyTensors] = None, reinit_attn_backend: bool = False, split_forward_count: int = 1, ) -> ModelRunnerOutput: self.forward_pass_id += 1 ################################ msprobe ################################ if hasattr(self, 'debugger'): self.debugger.start(model=self.model) ################################ msprobe ################################ ... ################################ msprobe ################################ if hasattr(self, 'debugger'): self.debugger.stop() self.debugger.step() ################################ msprobe ################################ return outputIf DP is enabled in the SGLang framework (
--dp-size> 1), configurerank_idin thestartAPI.if hasattr(self, 'debugger'): self.debugger.start(model=self.model, rank_id=self.gpu_id)
-
-
Start model inference in the SGLang framework and start collecting data.
-
Online mode
-
Launch the server.
#!/bin/bash export TORCHDYNAMO_DISABLE=1 python3 -m sglang.launch_server \ --model-path Qwen/Qwen2.5-0.5B-Instruct \ --host 127.0.0.1 \ --port 1027 \ --disable-cuda-graph \ --skip-server-warmup -
Send a request to automatically start data dump.
curl -H "Content-type: application/json" \ -X POST \ -d '{ "model": "Qwen/Qwen2.5-0.5B-Instruct", "messages": [ { "role": "user", "content": "Hello, my name is" } ], "max_tokens": 10 }' \ http://127.0.0.1:1027/v1/chat/completions
-
-
Offline mode
The following is an example of the offline script. When the script is executed, data dump automatically starts.
import os import asyncio import sglang as sgl import sglang.test.doc_patch from sglang.utils import async_stream_and_merge, stream_and_merge def main(): llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct", disable_cuda_graph=True) prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] sampling_params = {"temperature": 0.8, "top_p": 0.95} outputs = llm.generate(prompts, sampling_params) for prompt, output in zip(prompts, outputs): print("===============================") print(f"Prompt: {prompt}\nGenerated text: {output['text']}") if __name__ == '__main__': main()
-
Data Collection
SGLang uses the same precision data collection functions and data structure as PyTorch. For more details, see Precision Data Collection in PyTorch.