advisor

Overview

The expert suggestion (advisor) feature analyzes profile data collected by using Ascend PyTorch Profiler or MindSpore Profiler and provides performance tuning suggestions.

For details about how to collect profile data by using Ascend PyTorch Profiler, see Ascend PyTorch Profiler. For details about how to collect profile data using MindSpore Profiler, see Performance Profiling.

Preparations

Environment Setup

To use the advisor feature through the CLI, you must install MindStudio Profiler Analyze (msprof-analyze). For details, see msprof-analyze Installation Guide.
Using the advisor feature through Jupyter Notebook requires the following preparations:
1. Install Jupyter Notebook. For details about how to install and use Jupyter Notebook, visit the Jupyter Notebook official website.
```
pip install jupyter notebook
```
1. Download the msprof-analyze source code.
```
git clone https://gitcode.com/Ascend/msprof-analyze
```

Data preparation

msprof-analyze requires an input directory containing the collected profile data. For instructions on how to collect such data, see Data Preparation.

Constraints

CANN versions earlier than 8.0RC1 support only text format analysis. CANN 8.0RC1 and later versions support both text and db formats.
Jupyter Notebook is not supported for MindSpore.

`advisor` Functions (CLI)

Function

The msprof-analyze advisor command line includes the following subcommands:

all

Analyzes overall performance bottlenecks: including all functions listed in the following table.
computation

Analyzes compute bottlenecks: including the computation and Kernel compare functions in the following table.
schedule

Analyzes scheduling bottlenecks: including the schedule and API compare functions in the following table.

The following table describes the complete functions of advisor, which are controlled by all, computation, and schedule.

dimension	mode	Description	Supported Scenario
overall	Overall Summary	Breaks down profile data by dimensions such as computation, communication, and idle time.	PyTorch, MindSpore
	Environment Variable Issues	Recommends environment variable settings.	PyTorch
	slow rank	Identifies slow ranks.	PyTorch, MindSpore
	slow link	Identifies slow links.	PyTorch, MindSpore
computation	AICPU Issues	Identifies AICPU issues for performance tuning.	PyTorch, MindSpore
	Operator Dynamic Shape Issues	Identifies dynamic-shape operators.	PyTorch
	AI Core Performance Analysis	Analyzes the performance of MatMul, FlashAttentionScore, AI_VECTOR_CORE, and MIX_AIV operators.	PyTorch
	Block Dim Issues	Identifies Block Dim operator issues for performance tuning.	PyTorch, MindSpore
	Operator No Bound Issues	Analyzes operator bottlenecks.	PyTorch, MindSpore
	Fusion Issues	Analyzes operator fusion issues for graph tuning.	PyTorch, MindSpore
	AI Core Frequency Issues	Analyzes AI Core operator frequency reduction issues.	PyTorch, MindSpore
communication	Packet Analysis	Detects small communication packets.	PyTorch, MindSpore
	Bandwidth Contention Analysis	Detects bandwidth contention between communication and computation.	PyTorch, MindSpore
	Communication Retransmission Analysis	Detects communication retransmission.	PyTorch, MindSpore
	Byte Alignment Analysis	Detects byte alignment for communication operators. For communication operators using the SDMA transmission type, the data volume must be a multiple of 512 bytes to prevent bandwidth degradation.	PyTorch, MindSpore
schedule	Affinity API Issues	Analyzes affinity API replacement for performance tuning	PyTorch, MindSpore
	Operator Dispatch Issues	Identifies operator delivery issues (path 3/path 5).	PyTorch
	SyncBatchNorm Issues	Detects BatchNorm synchronization issues.	PyTorch, MindSpore
	Synchronize Stream Issues	Detects stream synchronization issues.	PyTorch, MindSpore
	GC Analysis	Identifies abnormal garbage collection events. This feature requires enabling `gc_detect_threshold` under `experimental_config` during data collection using Ascend PyTorch Profiler.	PyTorch
	Fusible Operator Analysis	Detects operator sequences with host or MTE bottlenecks for code optimization or the development of fusion operators.	PyTorch, MindSpore
dataloader	Slow Dataloader Issues	Detects abnormal DataLoaders.	PyTorch, MindSpore
memory	Memory Operator Issues	Identifies abnormal memory allocation and release operations.	PyTorch, MindSpore
comparison	Kernel compare of Rank* Step* and Rank* Step*	Identifies the kernel data of the benchmark and that of the profile data to be compared. In scenarios without a benchmark, it compares fast and slow ranks within a cluster. In scenarios with a benchmark, it compares identical ranks across clusters with significant performance gaps.	PyTorch, MindSpore
	Api compare of Rank* Step* and Rank* Step*	Identifies the API data of the benchmark and that of the profile data to be compared. In scenarios without a benchmark, it compares fast and slow ranks within a cluster. In scenarios with a benchmark, it compares identical ranks across clusters with significant performance gaps.	PyTorch

The tool automatically performs cluster and overall environment_variable_analysis in cluster scenarios, whereas it performs only the overall analysis in single-rank scenarios.

Precautions

None

Syntax

Overall performance bottlenecks

msprof-analyze advisor all -d <profiling_path> [-bp <benchmark_profiling_path>] [-o <output_path>] [-cv <cann_version>] [-tv <torch_version>] [-pt <profiling_type>] [--force] [-l <language>] [--debug] [-h]

Compute bottlenecks

msprof-analyze advisor computation -d <profiling_path> [-o <output_path>] [-cv <cann_version>] [-tv <torch_version>] [-pt <profiling_type>] [--force] [-l <language>] [--debug] [-h]

Scheduling bottlenecks

msprof-analyze advisor schedule -d <profiling_path> [-o <output_path>] [-cv <cann_version>] [-tv <torch_version>] [--force] [-l <language>] [--debug] [-h]

Command-line Options

Option	Mandatory (Yes/No)	Description
-d --profiling_path	Yes	Specifies the path to the profile data file or directory. For Ascend PyTorch Profiler, set this option to the `_ascend_pt` profile data result directory. For MindSpore Profiler, set it to the `_ascend_ms` profile data result directory. For cluster data, set it to the parent directory of `_ascend_pt` or `_ascend_ms`.
-bp --benchmark_profiling_path	No	Specifies the directory containing the benchmark profile data used for performance comparison. The profile data is obtained using the profiling tool. This option is not supported by the `computation` or `schedule` subcommands.
-o --output_path	No	Specifies the output path for analysis results. After the `advisor` analysis is complete, the results are saved to this directory. This option defaults to the current directory if not specified.
-cv --cann_version	No	Specifies the CANN software version corresponding to the profiling tool used for data collection. Supported compatible versions are 6.3.RC2, 7.0.RC1, 7.0.0, and 8.0.RC1. If this option is not specified, 8.0.RC1 is used by default. Profile data collected using other CANN versions may cause unpredictable issues during analysis. You can obtain the version by running the following command and checking the version field: `cat /usr/local/Ascend/cann/aarch64-linux/ascend_toolkit_install.info`.
-tv --torch_version	No	Specifies the PyTorch version of the runtime environment. The default value is `1.11.0`. Supported versions are `1.11.0` and `2.1.0`. If the runtime uses a different version (such as `1.11.3`), you can ignore minor version differences and select the closest supported version (such as `1.11.0`).
-pt --profiling_type	No	Specifies the type of the profiling tool used for profile data collection. Valid values: `pytorch` (default): used for profile data collected using the Ascend PyTorch Profiler API. `mindspore`: used for profile data collected using the MindSpore Profiler API. `mslite`: used for profile data collected using the Benchmark tool (not recommended). This option is not supported by the `schedule` subcommand.
--force	No	Forcibly executes `advisor`. This option forcibly skips the following checks: Ownership check: Proceed even if the current user is not the owner of the specified directory or files. File size check: Proceed even if a CSV file exceeds 5 GB, a JSON file exceeds 10 GB, or a DB file exceeds 8 GB. Specifying this option enables forced execution, which is disabled if not specified.
-l --language	No	Specifies the output language for the analysis results. Valid values: `cn` (default): Chinese `en`: English
--debug	No	Enables detailed stack trace printing if a tool error occurs. Specifying this option enables debug mode, which is disabled if not specified.
-h, -H --help	No	Displays help information for the subcommands and parameters of the current command.

Examples

Overall performance bottlenecks

msprof-analyze advisor all -d $HOME/profiling_data/

Compute bottlenecks

msprof-analyze advisor computation -d $HOME/profiling_data/

Scheduling bottlenecks

msprof-analyze advisor schedule -d $HOME/profiling_data/

In single-rank scenarios, specify the *_ascend_pt or *_ascend_ms directory of the profile data. In multi-rank and cluster scenarios, specify the parent directory of *_ascend_pt or *_ascend_ms.

Output Description

Brief analysis suggestions are displayed in the terminal, and mstt_advisor_{timestamp}.html and mstt_advisor_{timestamp}.xlsx files are generated for preview.
The content of the mstt_advisor_{timestamp}.xlsx file is identical to the terminal output.
For details about the analysis of the mstt_advisor_{timestamp}.html file, see Output File Description.
The following examples show the content format of the command output.

Overall performance bottlenecks

Compute bottlenecks

Scheduling bottlenecks

`advisor` Functions (Jupyter Notebook)

Go to the msprof_analyze/advisor directory and run the following command to start Jupyter Notebook:
```
jupyter notebook
```
If the command is executed successfully, the browser automatically opens the msprof_analyze/advisor directory, as shown in the following example.

In a Linux environment, the command output displays the URL for the Jupyter Notebook page. Copy this URL and open it in a browser to access the Jupyter Notebook interface. When using a remote server, replace the domain name localhost with the server IP address.
Open the required .ipynb file and copy the path to the profile data collected by Ascend PyTorch Profiler. Each .ipynb file represents a profile data analysis task. Then, specify the *_path parameter by using the copied path, as shown in the following figure.
Click Run to start profile data analysis.

Detailed analysis results will be displayed directly on the .ipynb page.

Output File Description

Report Analysis (Without Benchmark)

"Without benchmark" refers to executing msprof-analyze advisor without specifying the -bp option. In this scenario, the tool evaluates Computing Time and Free Time (idle time) to determine whether to compare the kernel and API profile data. Data from the slowest rank is used as the benchmark, while data from the fastest rank serves as the comparison target.

As shown in the following figure, the tool diagnoses issues across dimensions (including cluster, single-rank performance breakdown, scheduling, and computation) and provides the corresponding optimization suggestions. Red, yellow, and green indicators represent high, medium, and low issue priorities, respectively.

Input Image Description

Analysis of the overall Module

The overall module displays the identified issues but does not provide optimization suggestions.

In single-rank scenarios without a benchmark, the Environment Variable Issues section of the overall module provides environment variable setting suggestions.

For a detailed introduction to the environment variables shown in the preceding figure, see ACL_NN_CACHE_LIMIT and HOST_CACHE_CAPACITY.
In single-rank scenarios without a benchmark, the overall summary section of the overall module provides an analysis including the performance breakdown of the slow rank in the current training task. It displays duration statistics across three dimensions: computation, communication, and scheduling. This analysis helps identify whether the training performance bottleneck is a computation, communication, or scheduling issue. It does not provide optimization suggestions.
In cluster scenarios without a benchmark, the overall module provides fast/slow rank and fast/slow link analysis.

Analysis of the comparison Module

The following figure shows the content of the comparison module, which identifies kernel and API data for both the benchmark and that of the target profile data to be compared. In scenarios without a benchmark, this module presents the comparison results of the fast and slow rank profile data within the cluster, including the following sections:

Kernel compare of Rank* Step* and Rank* Step*: provides the target total, average, maximum, and minimum durations, the number of calls, the corresponding benchmark data, and the calculated Diff Total Ratio (benchmark total duration/target total duration) and Diff Avg Ratio (benchmark average duration/target average duration).

If the Diff Total Ratio or Diff Avg Ratio is greater than 1, the performance of the current environment is better. If the ratio is less than 1, the current environment requires optimization. If the ratio is equal to 1, the performance of the current environment is close to the benchmark environment.

In the preceding figure, inf indicates a denominator of 0 (target data not obtained or is zero); None indicates that no data was obtained.
Api compare of Rank* Step* and Rank* Step*: provides the target total duration, self-duration (excluding sub-API calls), average duration, and number of calls of the API data to be compared, as well as the corresponding data of the benchmark. This section also provides the calculated Diff Total Ratio (benchmark total duration/target total duration), Diff Self Ratio (benchmark self-duration/target self-duration), Diff Avg Ratio (benchmark average duration/target average duration), and Diff Calls Ratio (benchmark number of calls/target number of calls).

If the Diff Total Ratio, Diff Self Ratio, Diff Avg Ratio, or Diff Calls Ratio is greater than 1, the performance of the current environment is better. If the ratio is less than 1, the current environment requires optimization. If the ratio is equal to 1, the performance of the current environment is close to the benchmark environment.

In the preceding figure, inf indicates a denominator of 0 (target data not obtained or is zero); None indicates that no data was obtained.

The comparison module in the mstt_advisor_{timestamp}.html file displays only the top 10 kernel and API records. For details, refer to the mstt_advisor_{timestamp}.xlsx file.

Analysis of the performance problem analysis Module

The performance problem analysis module consists of the following submodules:

The memory module analyzes abnormal memory allocation and release operations.

memory

The communication module analyzes performance from the communication dimension. It currently supports detection of small communication packets, bandwidth contention between communication and computation, communication retransmission, and communication operator byte alignment.

communication

The meanings of Zero1, Zero2, and Zero3 in the preceding figure are as follows:

Zero1: Each NPU stores a complete set of gradients and model parameters but only 1/N of the optimizer states. Each NPU uses its data for forward and backward propagation. After backward propagation, each NPU synchronizes gradients across all ranks through all-reduce communication so that each rank has the gradients for all operators. Each rank updates the 1/N model parameters based on the gradients and the 1/N optimizer states. Then, it uses all-gather communication to send the updated 1/N model parameters to other ranks because each rank has a complete set of model parameters to be updated.
Zero2: Each NPU stores a complete set of model parameters but only 1/N of the optimizer states and 1/N of the gradients. Each NPU uses its data for forward propagation. After backward propagation, each rank calculates the local gradients and uses reduce-scatter communication to aggregate them, ensuring each rank stores only 1/N of the gradients. Each rank updates the 1/N model parameters based on the 1/N optimizer states and 1/N gradients. Then, it uses all-gather communication to send the updated model parameters to other ranks because each rank has a complete set of model parameters to be updated.
Zero3: Each NPU stores 1/N of the model parameters, 1/N of the optimizer states, and 1/N of the gradients. Before forward propagation, each rank obtains the complete model parameters through all-gather communication and then performs forward propagation computation. It evicts the model parameters part by part after use. Before backward propagation, each rank obtains complete model parameters through all-gather communication. It evicts the model parameters part by part after use. Gradients are aggregated using reduce-scatter. Each rank updates the 1/N model parameters based on the 1/N optimizer states and 1/N gradients. Since each rank stores only 1/N model parameters, there is no need to send the updated model parameters to other ranks.

Communication Retransmission Analysis: identifies communication groups where retransmission occurs and provides optimization suggestions.

As shown in the following figure, this module identifies communication retransmission issues in the current training task and provides the corresponding optimization suggestions.

cluster_2

Bandwidth Contention Analysis: detects communication bandwidth contention during concurrent computation and communication.

bandwidth

Byte Alignment Analysis: For communication operators using the SDMA transmission type, the data volume must be a multiple of 512 bytes to prevent bandwidth degradation.

byte_alignment

The computation module analyzes the compute performance of the device. It identifies potential bottlenecks within categories such as AICPU, dynamic shape, AI Core performance, Block Dim, operator fusion graphs, and AI Core frequency reduction, providing the corresponding suggestions. Perform performance tuning based on the report details. Examples:

computation_1

block_dim

op_no_bound

AI_Core_Performance_Analysis

For details about the torch_npu.npu.set_compile_mode API, see torch_npu.npu.set_compile_mode. For details, see AICPU Operator Replacement Examples.

When pipeline parallel (PP) stage issues exist, the computation module analyzes the issues by stage. Each stage represents a pipeline partition. For example, ranks 0–7 belong to stage-0 and ranks 8–15 belong to stage-1.

computation_2

The dataloader module detects Slow DataLoader Issues, primarily including abnormally high-latency calls, and provides optimization suggestions.

dataloader

In the preceding figure, the pin_memory (memory locking) and num_workers (number of data loading subprocesses) parameters are used for data loading optimization.

The schedule module presents analysis results for GC Analysis, Affinity API Issues, operator compile (aclOpCompile) issues, SyncBatchNorm Issues, Synchronize Stream Issues, and Fusible Operator Analysis.

The results of Fusible Operator Analysis are displayed only on the terminal and saved in the mstt_advisor_{timestamp}.xlsx file. This includes the host bottleneck-based operator sequence analysis and MTE bottleneck-based operator sequence analysis tabs, as shown in the following figure.

Fusible_Operator_Analysis

Field	Description
start index	Index position of the sequence start operator in `kernel_details.csv` or `op_summary.csv` (header excluded; the first index is `0`)
end index	Index position of the sequence end operator in `kernel_details.csv` or `op_summary.csv`
total time(us)	Total duration of the operator sequence (operator gaps included) (μs)
execution time(us)	Total execution duration of operators within the sequence (μs)
mte time(us)	Total data movement duration of operators within the sequence (μs)
occurrences	Number of sequence occurrences
mte bound	Flag indicating an MTE bottleneck
host bound	Flag indicating a host bottleneck

As shown in the following figure, GC Analysis indicates the existence of abnormal garbage collection events. You can address GC issues through effective Python memory management, using gc.set_threshold() to adjust GC thresholds, or using gc.disable() to disable GC.

The gc.set_threshold() and gc.disable() functions in the figure are described as follows:

In Python, the gc module provides control over the garbage collector.

gc.set_threshold(threshold0, threshold1, threshold2): sets the thresholds for garbage collection. The garbage collector divides all objects into three generations (Generation 0, 1, and 2), and objects in each generation move to the next generation after garbage collection. threshold0 controls the Generation 0 garbage collection frequency, threshold1 controls the Generation 1 frequency, and threshold2 controls the Generation 2 frequency. Setting threshold0 to 0 disables garbage collection.
gc.disable(): disables automatic garbage collection. After a gc.disable() call, the garbage collector will not run automatically until a manual gc.enable() call.

As shown in the following figure, the Affinity API Issues section identifies replaceable affinity APIs and provides the corresponding code stack. You can locate the code requiring modification based on the stack and refer to the provided modification examples (Examples for Fused Operator API Replacement During Migration to Ascend).

schedule_3

As shown in the following figure, the Synchronize Stream Issues section identifies time-consuming synchronization streams and provides the triggering code stack. The corresponding code must be modified based on the stack to eliminate these synchronization streams.

schedule_2

For details regarding the ASCEND_LAUNCH_BLOCKING environment variable, see ASCEND_LAUNCH_BLOCKING.

As shown in the following figure, the Operator Dispatch Issues section indicates that the following code must be added at the beginning of the execution script to eliminate aclOpCompile issues.

torch_npu.npu.set_compile_mode(jit_compile=False);
torch_npu.npu.config.allow_internal_format = False

For details regarding these APIs, see torch_npu.npu.set_compile_mode and torch_npu.npu.config.allow_internal_format.

Input Image Description

For details regarding the aclopCompileAndExecute API, see aclopCompileAndExecute.

Report Analysis (With Benchmark)

"With benchmark" refers to executing the msprof-analyze advisor command after specifying the -bp option with the directory of the benchmark profile data for comparison.

Single-rank scenarios with a benchmark: the overall module is not analyzed here. The performance problem analysis module yields the same results as the scenario without a benchmark.

Cluster scenarios with a benchmark:

The overall module analyzes fast/slow ranks and fast/slow links, which are consistent with the cluster scenarios without a benchmark. For details, see "Report Analysis (Without Benchmark) > Analysis of the overall Module".
The Environment Variable Issues section is provided and is the same as that in single-rank scenarios without a benchmark. For details, see "Report Analysis (Without Benchmark) > Analysis of the overall Module".
The comparison module is also provided. (In scenarios without a benchmark, the tool compares the profile data of the slowest rank against the fastest rank within a cluster. In scenarios with a benchmark, the comparison is between the same rank across two clusters where significant duration differences exist.)

The following figure shows an example of the comparison module, which identifies the comparison results of Kernel and API data for the benchmark and the target profile data, including:

Kernel compare of Target and Benchmark: provides the target total, average, maximum, and minimum durations, the number of calls, the corresponding benchmark data, and the calculated Diff Total Ratio (benchmark total duration/target total duration) and Diff Avg Ratio (benchmark average duration/target average duration)

If the Diff Total Ratio or Diff Avg Ratio is greater than 1, the performance of the current environment is better. If the ratio is less than 1, the current environment requires optimization. If the ratio is equal to 1, the performance of the current environment is close to the benchmark environment.

In the preceding figure, inf indicates a denominator of 0 (target data not obtained or is zero); None indicates that no data was obtained.
Api compare of Target and Benchmark: provides the target total duration, self-duration (excluding sub-API calls), average duration, and number of calls of the API data to be compared, as well as the corresponding data of the benchmark. This section also provides the calculated Diff Total Ratio (benchmark total duration/target total duration), Diff Self Ratio (benchmark self-duration/target self-duration), Diff Avg Ratio (benchmark average duration/target average duration), and Diff Calls Ratio (benchmark number of calls/target number of calls).

If the Diff Total Ratio, Diff Self Ratio, Diff Avg Ratio, or Diff Calls Ratio is greater than 1, the performance of the current environment is better. If the ratio is less than 1, the current environment requires optimization. If the ratio is equal to 1, the performance of the current environment is close to the benchmark environment.

In the preceding figure, inf indicates a denominator of 0 (target data not obtained or is zero); None indicates that no data was obtained.