calibrate_npu_gpu

Overview

MindStudio Profiler Analyze (msprof-analyze) provides the calibrate_npu_gpu feature to automatically compare NPU and GPU profile data, helping users perform cross-platform performance calibration and bottleneck analysis. This feature provides the following capabilities:

- Cross-platform analysis: supports profile data in NVIDIA GPU (Nsys SQLite) and Ascend NPU (PyTorch Profiler DB) formats.
- Module matching: uses rule-based matching and fuzzy matching (Levenshtein distance) to automatically align the GPU and NPU module hierarchies.
- Performance difference analysis: calculates the NPU/GPU duration ratio and the NPU-GPU duration difference for each matched module to identify performance degradation points.
- Visualization report: generates a comparison report in Excel format.
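
To make the fuzzy matching step concrete, here is a minimal sketch of name matching based on Levenshtein distance. It is not the msprof-analyze implementation: the helper names are hypothetical, and the normalization (similarity = 1 - distance / length of the longer name) is an assumption, since how the tool scales the distance is not described here. The default threshold of 0.8 mirrors the documented `--fuzzy_threshold` default.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical names, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def fuzzy_match(name: str, candidates: Iterable[str], threshold: float = 0.8) -> Optional[str]:
    """Return the most similar candidate, or None if it scores below the threshold."""
    best = max(candidates, key=lambda c: similarity(name, c), default=None)
    if best is not None and similarity(name, best) >= threshold:
        return best
    return None
```

With this normalization, two module names that differ by one character in an 18-character name score about 0.94 and match, whereas unrelated names fall below 0.8 and are left unmatched.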

Preparations

Environment Setup

Install msprof-analyze. For details, see the MindStudio Profiler Analyze Installation Guide.

Data Preparation

GPU Profile Data Collection

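For the GPU platform, use NVIDIA Nsys to collect profile data of the PyTorch model. The following script uses nsys profile to collect GPU profile data for vLLM inference. Because it captures only the range between cudaProfilerStart and cudaProfilerStop and relies on NVTX markers, it assumes that the benchmark script contains the GPU-side instrumentation shown in Modifying the vLLM Benchmark Latency Script: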
```
#!/bin/bash

echo "Start Profiling"
export CUDA_VISIBLE_DEVICES=0,1
dir_model="/path/to/model"
dir_output_prof="/path/to/model_profile_gpu"

nsys profile  \
    --stats=true \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
    --trace=cuda,nvtx \
    --capture-range=cudaProfilerApi \
    --pytorch=autograd-nvtx \
    -o ${dir_output_prof} \
vllm bench latency \
    --enforce-eager \
    --model ${dir_model} \
    --num-iters-warmup 5 \
    --num-iters 1 \
    --batch-size 16 \
    --input-len 512 \
    --model-parallel-size 2 \
    --output-len 8
```
Key Option Description
- --stats=true: enables Nsys to automatically generate an SQLite database file after profile data collection is complete.
- --trace=cuda,nvtx: enables CUDA and NVTX tracing. NVTX markers are used for module hierarchy parsing.
- --pytorch=autograd-nvtx: enables NVTX markers for PyTorch autograd.
- --capture-range=cudaProfilerApi: captures profile data between the cudaProfilerStart and cudaProfilerStop calls.
- --enforce-eager (vLLM option): disables CUDA Graphs to ensure operators are executed sequentially for accurate timing.

NPU Profile Data Collection

For the NPU (Ascend) platform, use PyTorch Profiler to collect profile data and ensure that MSTX instrumentation is enabled. For details, see Ascend PyTorch Profiler.

The following script shows how to collect NPU profile data for vLLM inference. Before running it, ensure you have modified the vLLM benchmark latency script by referring to Modifying the vLLM Benchmark Latency Script.

```
#!/bin/bash

# eager mode
# Use the profiler capability of vLLM. To support MSTX, modify vllm-ascend/vllm_ascend/worker/worker_v1.py.
# - Set mstx in experimental_config to True to enable custom instrumentation.
# - Add db to export_type.
# - Set enforce_eager to True during vLLM inference.

dir_model="/path/to/model"
dir_output_prof="/path/to/model_profile_npu"

# Use the VLLM_TORCH_PROFILER_DIR environment variable to enable profile data collection and specify the profile data output directory. This variable can also be set directly in the terminal.
export VLLM_TORCH_PROFILER_DIR=${dir_output_prof}
export ASCEND_RT_VISIBLE_DEVICES=0,1

echo "Start Profiling"
# The benchmark code must already contain the MSTX instrumentation (see the appendix).
# The --profile option makes the benchmark call llm.start_profile() and llm.stop_profile().
vllm bench latency \
    --enforce-eager \
    --model ${dir_model} \
    --num-iters-warmup 5 \
    --num-iters 1 \
    --batch-size 16 \
    --input-len 512 \
    --output-len 8 \
    --model-parallel-size 2 \
    --profile
```

Key Configuration Description

- Environment variable: Use VLLM_TORCH_PROFILER_DIR to set the profile data output directory.
- Instrumentation configuration: Modify the vLLM-Ascend code to set msprof_tx=True and export_type=['text', 'db'] in experimental_config.
- Eager mode: Currently, only eager mode is supported. Use --enforce-eager to ensure operators are executed one by one.
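
Because profile data collection depends on VLLM_TORCH_PROFILER_DIR, it can help to check the variable before launching a long benchmark. The following is a minimal sketch of such a pre-flight check. The function name `resolve_profiler_dir` is hypothetical and not part of vLLM or msprof-analyze; only the environment variable name comes from this document.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_profiler_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the profile output directory, creating it if it does not exist.

    Raises RuntimeError when VLLM_TORCH_PROFILER_DIR is unset or empty.
    """
    env = os.environ if env is None else env
    value = env.get("VLLM_TORCH_PROFILER_DIR", "").strip()
    if not value:
        raise RuntimeError(
            "VLLM_TORCH_PROFILER_DIR is not set; no profile data would be written"
        )
    path = Path(value).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return path
```

Calling `resolve_profiler_dir()` without arguments reads the real process environment. Passing a dictionary keeps the check easy to exercise in isolation.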

Data File Requirements

Ensure that the collected profile data files meet the following requirements:
- GPU file: SQLite database file exported by Nsys (typically with the .sqlite extension).
- NPU file: DB file generated by PyTorch Profiler (typically ascend_pytorch_profiler_0.db).
- Data integrity: The file must contain complete module hierarchy information (NVTX/MSTX instrumentation) and kernel execution details.
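
Both input files are SQLite databases, so the data integrity requirement can be spot-checked by listing their tables. The sketch below is a generic helper, not a feature of msprof-analyze. The table names in the usage note are assumptions about typical Nsys exports and should be verified against your Nsys version. No NPU table names are suggested here because they are not given in this document.

```python
import sqlite3
from pathlib import Path
from typing import Iterable, List


def missing_tables(db_path: str, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that are absent from the SQLite file."""
    # Open read-only so that a mistyped path does not create an empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    present = {name for (name,) in rows}
    return [table for table in required if table not in present]
```

For a GPU file, a call such as `missing_tables("/path/to/gpu_profile.sqlite", ["NVTX_EVENTS", "CUPTI_ACTIVITY_KIND_KERNEL"])` returns an empty list when both assumed tables are present. A non-empty result suggests that NVTX or kernel tracing was not captured.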

Function Description

Function

Compares the prepared GPU and NPU profile data and outputs the results.

Syntax

```
msprof-analyze cluster -m calibrate_npu_gpu --profiling_path <npu_profile_path> --baseline_profiling_path <gpu_profile_path> [--output_path <output_path>] [--export_type <export_type>] [--fuzzy_threshold <fuzzy_threshold>] [--dump_intermediate_results]
```

Command-Line Options

| Option | Mandatory (Yes/No) | Description |
| --- | --- | --- |
| -m | Yes | Specifies the analysis mode to execute. Set it to `calibrate_npu_gpu` to enable NPU and GPU profile data breakdown and comparison. |
| --profiling_path | Yes | Specifies the directory of the NPU profile data. |
| --baseline_profiling_path | Yes | Specifies the directory of the GPU profile data files. |
| --output_path | No | Specifies the analysis output directory. By default, results are saved in the current directory. |
| --export_type | No | Specifies the export type. Valid values: `db` (default) or `text`. |
| --fuzzy_threshold | No | Specifies the fuzzy match threshold for NPU/GPU module names, which defaults to `0.8`. |
| --dump_intermediate_results | No | Saves the intermediate analysis results (GPU/NPU profile analysis results) in the `{platform}_report_{rank_id}.xlsx` file. By default, this option is not specified, and intermediate analysis results are not saved. |

Example

Run the following command to perform GPU and NPU calibration analysis:

```
msprof-analyze cluster -m calibrate_npu_gpu \
  --profiling_path /path/to/npu_profile \
  --baseline_profiling_path /path/to/gpu_profile.sqlite \
  --output_path ./calibration_result \
  --export_type text \
  --dump_intermediate_results
```
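
When the comparison is run from a script, for example once per model configuration, the command can be assembled programmatically. The following is a minimal sketch that uses only the options documented above. The function `build_calibrate_command` is a hypothetical helper, not part of msprof-analyze.

```python
from typing import List, Optional


def build_calibrate_command(
    npu_path: str,
    gpu_path: str,
    output_path: Optional[str] = None,
    export_type: Optional[str] = None,
    fuzzy_threshold: Optional[float] = None,
    dump_intermediate_results: bool = False,
) -> List[str]:
    """Build the argument list for `msprof-analyze cluster -m calibrate_npu_gpu`."""
    cmd = [
        "msprof-analyze", "cluster", "-m", "calibrate_npu_gpu",
        "--profiling_path", npu_path,
        "--baseline_profiling_path", gpu_path,
    ]
    # Optional flags are added only when given, so the tool's defaults apply otherwise.
    if output_path is not None:
        cmd += ["--output_path", output_path]
    if export_type is not None:
        if export_type not in ("db", "text"):
            raise ValueError("export_type must be 'db' or 'text'")
        cmd += ["--export_type", export_type]
    if fuzzy_threshold is not None:
        cmd += ["--fuzzy_threshold", str(fuzzy_threshold)]
    if dump_intermediate_results:
        cmd.append("--dump_intermediate_results")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Using a list instead of a single shell string avoids quoting problems with paths that contain spaces.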

Output Description

msprof-analyze generates the compare_profile_report_{rank_id}.xlsx file in the directory specified by the --output_path option. The following table describes fields in this file.

| Field | Description |
| --- | --- |
| (GPU) Parent Module | Name of the parent module on the GPU |
| (GPU) Module | Name of the module on the GPU |
| (NPU) Parent Module | Name of the parent module on the NPU |
| (NPU) Module | Name of the module on the NPU |
| Match Type | Matching type: `rule` (rule-based matching) or `fuzzy` (fuzzy matching) |
| (GPU) Op Name | List of operator names on the GPU |
| (GPU) Op Count | Number of operator occurrences on the GPU |
| (GPU) Kernel List | List of kernel names on the GPU |
| (GPU) Total Kernel Duration(us) | Total execution duration on the GPU (μs) |
| (GPU) Total Kernel Duration(%) | Total execution duration percentage on the GPU (%) |
| (GPU) Avg Kernel Duration(us) | Average execution duration on the GPU (μs) |
| (NPU) Op Name | List of operator names on the NPU |
| (NPU) Op Count | Number of operator occurrences on the NPU |
| (NPU) Kernel List | List of kernel names on the NPU |
| (NPU) Total Kernel Duration(us) | Total execution duration on the NPU (μs) |
| (NPU) Total Kernel Duration(%) | Total execution duration percentage on the NPU (%) |
| (NPU) Avg Kernel Duration(us) | Average execution duration on the NPU (μs) |
| (NPU/GPU) Module Time Ratio | Module-level NPU/GPU duration ratio |
| (NPU-GPU,us) Module Time Diff | Module-level NPU-GPU duration difference (μs) |
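
To show how the last two fields can be read, the sketch below recomputes a ratio and a difference from per-module durations and ranks modules by how much slower they are on the NPU. It assumes that both fields are derived from the total kernel durations of a matched module, which is not stated explicitly above. The function names are hypothetical, and the sample rows are made-up values for illustration only, not measured results.

```python
from typing import Any, Dict, List, Tuple

NPU_COL = "(NPU) Total Kernel Duration(us)"
GPU_COL = "(GPU) Total Kernel Duration(us)"


def module_time_metrics(npu_us: float, gpu_us: float) -> Tuple[float, float]:
    """Return (NPU/GPU ratio, NPU-GPU difference in microseconds)."""
    # A zero GPU duration would divide by zero, so report an infinite ratio instead.
    ratio = npu_us / gpu_us if gpu_us > 0 else float("inf")
    return ratio, npu_us - gpu_us


def rank_by_time_diff(rows: List[Dict[str, Any]]) -> List[Tuple[str, float, float]]:
    """Return (module, ratio, diff) tuples, largest NPU-GPU difference first."""
    ranked = []
    for row in rows:
        ratio, diff = module_time_metrics(float(row[NPU_COL]), float(row[GPU_COL]))
        ranked.append((str(row["(NPU) Module"]), ratio, diff))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Made-up durations, for illustration only (not measured results).
example_rows = [
    {"(NPU) Module": "ModuleB", NPU_COL: 50.0, GPU_COL: 100.0},
    {"(NPU) Module": "ModuleA", NPU_COL: 300.0, GPU_COL: 200.0},
]
```

A ratio above 1 (or a positive difference) means the module takes longer on the NPU than on the GPU. Sorting by the absolute difference rather than the ratio puts the modules that cost the most wall time first, which is usually the more useful order for bottleneck analysis.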

Appendix

Modifying the vLLM Benchmark Latency Script

The following is a code snippet from vllm/vllm/benchmarks/latency.py for the GPU environment, shown as a diff. To adapt the code for the NPU environment, remove the lines prefixed with `-` (GPU version) and add the lines prefixed with `+` (NPU version):

```diff
import argparse
import dataclasses
import json
import os
import time
-from typing import Any
+from typing import Any, Optional

import numpy as np
import torch
import torch.nn as nn

from tqdm import tqdm

import vllm.envs as envs
-from vllm.benchmarks.lib.utils import convert_to_pytorch_benchmark_format, write_to_json
+from vllm.benchmarks.lib.utils import (convert_to_pytorch_benchmark_format, write_to_json)
from vllm.engine.arg_utils import EngineArgs
from vllm.inputs import PromptType
from vllm.sampling_params import BeamSearchParams

# Modification 1: Inject NVTX ranges into nn.Module calls for profiling
# --- START OF INJECTION ---
-import nvtx
+import torch_npu
+import torch.nn as nn
import torch.cuda.profiler as cuda_profiler
original_call = nn.Module.__call__

def custom_call(self, *args, **kwargs):
+    # Customize the call method and add MSTX instrumentation.
    module_name = self.__class__.__name__
-    nvtx.push_range(module_name)
+    mstx_id = torch_npu.npu.mstx.range_start(module_name, domain="Module")    # Start module instrumentation with domain set to Module
    tmp = original_call(self, *args, **kwargs)
-    nvtx.pop_range()
+    torch_npu.npu.mstx.range_end(mstx_id, domain="Module")    # End module instrumentation with domain set to Module
    return tmp
nn.Module.__call__ = custom_call
-# --- END OF INJECTION ---
+# Replace the default call method

... # original code

def main(args: argparse.Namespace):

    ... # original code

    def llm_generate():
        if not args.use_beam_search:
            llm.generate(dummy_prompts, sampling_params=sampling_params, use_tqdm=False)
        else:
            llm.beam_search(
                dummy_prompts,
                BeamSearchParams(
                    beam_width=args.n,
                    max_tokens=args.output_len,
                    ignore_eos=True,
                ),
            )

-    def run_to_completion(profile_dir: str | None = None):
+    def run_to_completion(profile_dir: Optional[str] = None):
        if profile_dir:
            llm.start_profile()
            llm_generate()
            llm.stop_profile()
        else:
            start_time = time.perf_counter()
            llm_generate()
            end_time = time.perf_counter()
            latency = end_time - start_time
            return latency

    print("Warming up...")
    for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"):
        run_to_completion(profile_dir=None)

    if args.profile:
        profile_dir = envs.VLLM_TORCH_PROFILER_DIR
        print(f"Profiling (results will be saved to '{profile_dir}')...")
        run_to_completion(profile_dir=profile_dir)
        return

-    cuda_profiler.start() # Modification 2: inform nsys to start profiling at this point

    # Benchmark.
    latencies = []
    for _ in tqdm(range(args.num_iters), desc="Profiling iterations"):
        latencies.append(run_to_completion(profile_dir=None))

-    cuda_profiler.stop() # Modification 3: inform nsys to stop profiling at this point

    ... # original code
```
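
The instrumentation above replaces `nn.Module.__call__` with a wrapper that opens a range before the original call and closes it afterwards. The following sketch isolates that pattern so it can be tried without torch or torch_npu. It is an illustration, not the vLLM or vllm-ascend code: `instrument_calls` and its callback parameters are hypothetical names. It also adds one thing the snippet above does not have: a `try`/`finally`, so that the range is closed even if the wrapped call raises.

```python
import functools
from typing import Any, Callable


def instrument_calls(
    cls: type,
    range_start: Callable[[str], Any],
    range_end: Callable[[Any], None],
) -> Callable[..., Any]:
    """Wrap cls.__call__ so each call is bracketed by range_start/range_end.

    Returns the original __call__ so that the patch can be undone later.
    """
    original_call = cls.__call__

    @functools.wraps(original_call)
    def custom_call(self, *args, **kwargs):
        # The range is named after the concrete class, as in the snippet above.
        handle = range_start(self.__class__.__name__)
        try:
            return original_call(self, *args, **kwargs)
        finally:
            range_end(handle)  # runs on normal return and on exceptions

    cls.__call__ = custom_call
    return original_call
```

On the NPU, the callbacks would wrap the calls used in the snippet above, for example `lambda name: torch_npu.npu.mstx.range_start(name, domain="Module")` and `lambda handle: torch_npu.npu.mstx.range_end(handle, domain="Module")`, with `nn.Module` passed as `cls`. Assigning the returned function back to `cls.__call__` removes the instrumentation.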