calibrate_npu_gpu
Overview
MindStudio Profiler Analyze (msprof-analyze) provides the calibrate_npu_gpu feature to automatically compare NPU and GPU profile data, helping users perform cross-platform performance calibration and bottleneck analysis. This feature provides the following capabilities:
- Cross-platform analysis: supports profile data in NVIDIA GPU (Nsys SQLite) and Ascend NPU (PyTorch Profiler DB) formats.
- Module matching: uses rule-based matching and fuzzy matching (Levenshtein distance) to automatically align the module hierarchy of GPUs and NPUs.
- Performance difference analysis: accurately calculates the duration ratio between GPU and NPU within the same module to identify performance degradation points.
- Visualization report: generates a comparison report in Excel format.
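The fuzzy-matching step can be illustrated with a small, self-contained sketch. It is not the msprof-analyze implementation: the normalization (edit distance divided by the longer name's length), the use of exact name equality as a stand-in for rule-based matching, and the function names `levenshtein`, `similarity`, and `match_module` are all assumptions made for illustration. Only the 0.8 default threshold comes from this document.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical names, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_module(gpu_name, npu_names, threshold=0.8):
    """Return (npu_name, match_type, score), or None when nothing is close enough."""
    if gpu_name in npu_names:
        # Exact equality stands in for the tool's rule-based matching (assumption).
        return gpu_name, "exact", 1.0
    best = max(npu_names, key=lambda name: similarity(gpu_name, name), default=None)
    if best is not None and similarity(gpu_name, best) >= threshold:
        return best, "fuzzy", similarity(gpu_name, best)
    return None


# Hypothetical module names, used only to exercise the functions.
print(match_module("LlamaDecoderLayer", ["LlamaDecoderLayers", "LlamaMLP"]))
print(match_module("Qwen2Attention", ["AscendQwen2Attention"]))  # None: 1 - 6/20 = 0.7 < 0.8
```

Under this normalization, a platform-specific prefix such as the one in the second call can push an otherwise equivalent module below the threshold, which is the kind of case where lowering the threshold may help.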
Preparations
Environment Setup
Install msprof-analyze. For details, see MindStudio Profiler Analyze Installation Guide.
Data Preparation
- GPU Profile Data Collection

  For the GPU platform, you are advised to use NVIDIA Nsys to collect profile data of the PyTorch model. The following script shows how to use nsys profile to collect GPU profile data for vLLM inference:

      #!/bin/bash
      echo "Start Profiling"
      export CUDA_VISIBLE_DEVICES=0,1
      dir_model="/path/to/model"
      dir_output_prof="/path/to/model_profile_gpu"

      nsys profile \
          --stats=true \
          --trace-fork-before-exec=true \
          --cuda-graph-trace=node \
          --trace=cuda,nvtx \
          --capture-range=cudaProfilerApi \
          --pytorch=autograd-nvtx \
          -o ${dir_output_prof} \
          vllm bench latency \
          --enforce-eager \
          --model ${dir_model} \
          --num-iters-warmup 5 \
          --num-iters 1 \
          --batch-size 16 \
          --input-len 512 \
          --model-parallel-size 2 \
          --output-len 8

  Key Option Description

  - --stats=true: enables Nsys to automatically generate an SQLite database file after profile data collection is complete.
  - --trace=cuda,nvtx: enables CUDA and NVTX tracing. NVTX markers are used for module hierarchy parsing.
  - --pytorch=autograd-nvtx: enables NVTX markers for PyTorch autograd.
  - --capture-range=cudaProfilerApi: captures profile data between the cudaProfilerStart and cudaProfilerStop calls.
  - --enforce-eager: disables CUDA Graphs to ensure operators are executed sequentially for accurate timing.
- NPU Profile Data Collection

  For the NPU (Ascend) platform, use PyTorch Profiler to collect profile data and ensure that MSTX instrumentation is enabled. For details, see Ascend PyTorch Profiler.

  The following script shows how to collect NPU profile data for vLLM inference. Before running it, modify the vLLM benchmark latency script as described in Modifying the vLLM Benchmark Latency Script.

      #!/bin/bash
      # eager mode
      # Use the profiler capability of vLLM. To support MSTX, modify vllm-ascend/vllm_ascend/worker/worker_v1.py.
      # - Set mstx in experimental_config to True to enable custom instrumentation.
      # - Add db to export_type.
      # - Set enforce_eager to True during vLLM inference.
      dir_model="/path/to/model"
      dir_output_prof="/path/to/model_profile_npu"

      # Use the VLLM_TORCH_PROFILER_DIR environment variable to enable profile data collection
      # and specify the profile data output directory. This variable can also be set directly in the terminal.
      export VLLM_TORCH_PROFILER_DIR=${dir_output_prof}
      export ASCEND_RT_VISIBLE_DEVICES=0,1
      echo "Start Profiling"

      # Modify the benchmark code to add MSTX instrumentation.
      # Add a profile option to start llm.start_profile().
      vllm bench latency \
          --enforce-eager \
          --model ${dir_model} \
          --num-iters-warmup 5 \
          --num-iters 1 \
          --batch-size 16 \
          --input-len 512 \
          --output-len 8 \
          --model-parallel-size 2 \
          --profile

  Key Configuration Description

  - Environment variable: Use VLLM_TORCH_PROFILER_DIR to set the profile data output directory.
  - Instrumentation configuration: Modify the vLLM-Ascend code to set msprof_tx=True and export_type=['text', 'db'] in experimental_config.
  - Eager mode: Currently, only eager mode is supported. Use --enforce-eager to ensure operators are executed one by one.
- Data File Requirements

  Ensure that the collected profile data files meet the following requirements:

  - GPU file: SQLite database file exported by Nsys (typically with the .sqlite extension).
  - NPU file: DB file generated by PyTorch Profiler (typically ascend_pytorch_profiler_0.db).
  - Data integrity: The files must contain complete module hierarchy information (NVTX/MSTX instrumentation) and kernel execution details.
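Before running the comparison, it can be useful to confirm that a collected file actually contains instrumentation and kernel tables. Both input files are SQLite databases, so a quick check is possible with the Python standard library. The sketch below is not part of msprof-analyze. The helper names `list_tables` and `missing_tables` are hypothetical, and the table names in `ASSUMED_GPU_TABLES` are an assumption based on typical Nsight Systems SQLite exports; verify them against the schema of your Nsys version, and take the corresponding NPU table names from the Ascend PyTorch Profiler documentation.

```python
import os
import sqlite3
from contextlib import closing

# Assumed table names for an Nsys SQLite export (verify for your Nsys version).
ASSUMED_GPU_TABLES = ("NVTX_EVENTS", "CUPTI_ACTIVITY_KIND_KERNEL")


def list_tables(db_path):
    """Return the set of table names stored in a SQLite file."""
    if not os.path.isfile(db_path):
        # sqlite3.connect would silently create an empty file, so fail early instead.
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return {name for (name,) in rows}


def missing_tables(db_path, required):
    """Return the required table names that are absent from the file, sorted."""
    return sorted(set(required) - list_tables(db_path))
```

For example, `missing_tables("/path/to/gpu_profile.sqlite", ASSUMED_GPU_TABLES)` returns an empty list when every assumed table is present. A non-empty result suggests that tracing options were not enabled during collection.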
Function Description
Function
Compares the prepared GPU and NPU profile data and outputs the results.
Syntax
msprof-analyze cluster -m calibrate_npu_gpu --profiling_path <npu_profile_path> --baseline_profiling_path <gpu_profile_path> [--output_path <output_path>] [--export_type <export_type>] [--fuzzy_threshold <fuzzy_threshold>] [--dump_intermediate_results]
Command-line options
| Option | Mandatory (Yes/No) | Description |
|---|---|---|
| -m | Yes | Specifies the analysis mode to execute. Set it to calibrate_npu_gpu to enable NPU and GPU profile data breakdown and comparison. |
| --profiling_path | Yes | Specifies the directory of the NPU profile data. |
| --baseline_profiling_path | Yes | Specifies the directory of the GPU profile data files. |
| --output_path | No | Specifies the analysis output directory. By default, results are saved in the current directory. |
| --export_type | No | Specifies the export type. Valid values: db (default) or text. |
| --fuzzy_threshold | No | Specifies the threshold used for fuzzy matching of NPU and GPU module names. Default: 0.8. |
| --dump_intermediate_results | No | Saves the intermediate analysis results (GPU/NPU profile analysis results) in the {platform}_report_{rank_id}.xlsx file. By default, this option is not specified, and intermediate analysis results are not saved. |
Example
Run the following command to perform GPU and NPU calibration analysis:
    msprof-analyze cluster -m calibrate_npu_gpu \
        --profiling_path /path/to/npu_profile \
        --baseline_profiling_path /path/to/gpu_profile.sqlite \
        --output_path ./calibration_result \
        --export_type text \
        --dump_intermediate_results
Output Description
msprof-analyze generates the compare_profile_report_{rank_id}.xlsx file in the directory specified by the --output_path option. The following table describes fields in this file.
| Field | Description |
|---|---|
| (GPU) Parent Module | Name of the parent module on the GPU |
| (GPU) Module | Name of the module on the GPU |
| (NPU) Parent Module | Name of the parent module on the NPU |
| (NPU) Module | Name of the module on the NPU |
| Match Type | Matching type: rule (rule-based matching) or fuzzy (fuzzy matching) |
| (GPU) Op Name | List of operator names on the GPU |
| (GPU) Op Count | Number of operator occurrences on the GPU |
| (GPU) Kernel List | List of kernel names on the GPU |
| (GPU) Total Kernel Duration(us) | Total execution duration on the GPU (μs) |
| (GPU) Total Kernel Duration(%) | Total execution duration percentage on the GPU (%) |
| (GPU) Avg Kernel Duration(us) | Average execution duration on the GPU (μs) |
| (NPU) Op Name | List of operator names on the NPU |
| (NPU) Op Count | Number of operator occurrences on the NPU |
| (NPU) Kernel List | List of kernel names on the NPU |
| (NPU) Total Kernel Duration(us) | Total execution duration on the NPU (μs) |
| (NPU) Total Kernel Duration(%) | Total execution duration percentage on the NPU (%) |
| (NPU) Avg Kernel Duration(us) | Average execution duration on the NPU (μs) |
| (NPU/GPU) Module Time Ratio | Module-level NPU/GPU duration ratio |
| (NPU-GPU,us) Module Time Diff | Module-level NPU-GPU duration difference (μs) |
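The last two fields can be read as simple arithmetic on the per-module total kernel durations. The sketch below shows that arithmetic under stated assumptions: the ratio is NPU time divided by GPU time, the difference is NPU time minus GPU time, and the percentage field is each module's share of the platform's summed duration. The function names and the sample durations are hypothetical, and the tool's exact aggregation may differ.

```python
def compare_modules(gpu_us, npu_us):
    """Compute NPU/GPU ratio and NPU-GPU difference for modules present on both sides."""
    rows = {}
    for module in sorted(gpu_us.keys() & npu_us.keys()):
        gpu, npu = gpu_us[module], npu_us[module]
        rows[module] = {
            "ratio": npu / gpu if gpu else None,  # undefined when GPU time is zero
            "diff_us": npu - gpu,
        }
    return rows


def duration_share(durations_us):
    """Assumed meaning of the percentage field: share of the platform total, in percent."""
    total = sum(durations_us.values())
    return {
        name: (100.0 * value / total if total else 0.0)
        for name, value in durations_us.items()
    }


# Hypothetical durations in microseconds.
gpu = {"Attention": 120.0, "MLP": 80.0}
npu = {"Attention": 150.0, "MLP": 60.0}

print(compare_modules(gpu, npu))
# {'Attention': {'ratio': 1.25, 'diff_us': 30.0}, 'MLP': {'ratio': 0.75, 'diff_us': -20.0}}
print(duration_share(gpu))
# {'Attention': 60.0, 'MLP': 40.0}
```

With this reading, a ratio above 1 (and a positive difference) marks a module that takes longer on the NPU than on the GPU, so sorting by the difference tends to surface the modules that contribute most to the end-to-end gap.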
Appendix
Modifying the vLLM Benchmark Latency Script
The following is a code snippet from vllm/vllm/benchmarks/latency.py for the GPU environment, shown as a diff. To adapt the code for the NPU environment, remove the lines prefixed with - and add the lines prefixed with +:
     import argparse
     import dataclasses
     import json
     import os
     import time
    -from typing import Any
    +from typing import Any, Optional
     import numpy as np
     import torch
     import torch.nn as nn
     from tqdm import tqdm
     import vllm.envs as envs
    -from vllm.benchmarks.lib.utils import convert_to_pytorch_benchmark_format, write_to_json
    +from vllm.benchmarks.lib.utils import (convert_to_pytorch_benchmark_format, write_to_json)
     from vllm.engine.arg_utils import EngineArgs
     from vllm.inputs import PromptType
     from vllm.sampling_params import BeamSearchParams

     # Modification 1: Inject NVTX ranges into nn.Module calls for profiling
     # --- START OF INJECTION ---
    -import nvtx
    +import torch_npu
    +import torch.nn as nn
     import torch.cuda.profiler as cuda_profiler

     original_call = nn.Module.__call__

     def custom_call(self, *args, **kwargs):
    +    # Customize the call method and add MSTX instrumentation.
         module_name = self.__class__.__name__
    -    nvtx.push_range(module_name)
    +    mstx_id = torch_npu.npu.mstx.range_start(module_name, domain="Module")  # Start module instrumentation with domain set to Module
         tmp = original_call(self, *args, **kwargs)
    -    nvtx.pop_range()
    +    torch_npu.npu.mstx.range_end(mstx_id, domain="Module")  # End module instrumentation with domain set to Module
         return tmp

     nn.Module.__call__ = custom_call
    -# --- END OF INJECTION ---
    +# Replace the default call method

     ... # original code

     def main(args: argparse.Namespace):
         ... # original code

         def llm_generate():
             if not args.use_beam_search:
                 llm.generate(dummy_prompts, sampling_params=sampling_params, use_tqdm=False)
             else:
                 llm.beam_search(
                     dummy_prompts,
                     BeamSearchParams(
                         beam_width=args.n,
                         max_tokens=args.output_len,
                         ignore_eos=True,
                     ),
                 )

    -    def run_to_completion(profile_dir: str | None = None):
    +    def run_to_completion(profile_dir: Optional[str] = None):
             if profile_dir:
                 llm.start_profile()
                 llm_generate()
                 llm.stop_profile()
             else:
                 start_time = time.perf_counter()
                 llm_generate()
                 end_time = time.perf_counter()
                 latency = end_time - start_time
                 return latency

         print("Warming up...")
         for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"):
             run_to_completion(profile_dir=None)

         if args.profile:
             profile_dir = envs.VLLM_TORCH_PROFILER_DIR
             print(f"Profiling (results will be saved to '{profile_dir}')...")
             run_to_completion(profile_dir=profile_dir)
             return

    -    cuda_profiler.start()  # Modification 2: inform nsys to start profiling at this point
         # Benchmark.
         latencies = []
         for _ in tqdm(range(args.num_iters), desc="Profiling iterations"):
             latencies.append(run_to_completion(profile_dir=None))
    -    cuda_profiler.stop()  # Modification 3: inform nsys to stop profiling at this point
         ... # original code
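The patch above wraps every module call in a start/end range, and nesting of those ranges is what yields the module hierarchy. One detail worth considering is that a range is left open if the wrapped call raises. The following sketch shows the same wrapping pattern with a try/finally guard, using plain Python classes so that it runs without a GPU or NPU. The names `instrument_calls`, `fake_start`, and `fake_end` are hypothetical: the two callbacks stand in for the real range APIs (for example nvtx.push_range or torch_npu.npu.mstx.range_start) and only record events in a list.

```python
import functools


def instrument_calls(cls, range_start, range_end):
    """Wrap cls.__call__ so each call is bracketed by range_start/range_end."""
    original_call = cls.__call__

    @functools.wraps(original_call)
    def custom_call(self, *args, **kwargs):
        handle = range_start(type(self).__name__)
        try:
            return original_call(self, *args, **kwargs)
        finally:
            range_end(handle)  # runs even if the wrapped call raises

    cls.__call__ = custom_call
    return original_call  # returned so the caller can restore it later


# Minimal stand-ins for nn.Module subclasses (illustration only).
class Module:
    def __call__(self, x):
        return self.forward(x)


class Inner(Module):
    def forward(self, x):
        return x + 1


class Outer(Module):
    def __init__(self):
        self.inner = Inner()

    def forward(self, x):
        return self.inner(x) * 2


events = []


def fake_start(name):
    events.append(("start", name))
    return name  # the returned handle is passed back to fake_end


def fake_end(handle):
    events.append(("end", handle))


instrument_calls(Module, fake_start, fake_end)
result = Outer()(3)
print(result)  # 8
print(events)
# [('start', 'Outer'), ('start', 'Inner'), ('end', 'Inner'), ('end', 'Outer')]
```

The recorded events show the Inner range fully enclosed by the Outer range, which is the parent/child relationship the comparison relies on.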