ba5c68ba创建于 7 天前历史提交

文件	最后提交记录	最后更新时间
README.md	docs: 英文资料合入	7 天前
calibrate_npu_gpu_instruct.md	docs: 英文资料合入	7 天前
cluster_kernels_analysis_instruct.md	docs: 英文资料合入	7 天前
cluster_time_compare_summary_instruct.md	docs: 英文资料合入	7 天前
cluster_time_summary_instruct.md	docs: 英文资料合入	7 天前
communication_bottleneck_instruct.md	docs: 英文资料合入	7 天前
computational_op_masking_instruct.md	docs: 英文资料合入	7 天前
custom_analysis_guide.md	docs: 英文资料合入	7 天前
export_summary_instruct.md	docs: 英文资料合入	7 天前
free_analysis_instruct.md	docs: 英文资料合入	7 天前
module_statistic_instruct.md	docs: 英文资料合入	7 天前
pp_chart_instruct.md	docs: 英文资料合入	7 天前
recipe_output_format_introduct.md	docs: 英文资料合入	7 天前

Recipe Analysis Rules

This chapter summarizes the advanced analysis features of MindStudio Profiler Analyze (msprof-analyze) for cluster scenarios. Key areas of focus include multi-dimensional information summaries, breakdown comparisons, communication bottleneck identification, and delivery issue analysis. You can also perform custom analysis by using Recipe rules. For more information, see the Custom Analysis Rule Development Guide.

Preparations

The tool supports the following cluster data types:

DB-format cluster data collected by Ascend PyTorch Profiler
Lightweight cluster DB data collected by msMonitor

For details about profile data collection, see Profile Data Collection Guides.

When you use Ascend PyTorch Profiler, you must collect or parse db results offline. Example:

experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=[torch_npu.profiler.ExportType.Db]
)

You can also specify the export type during offline parsing.

from torch_npu.profiler.profiler import analyse

if __name__ == "__main__":
    analyse(profiler_path="./result_data", export_type=["db"])

Usage

Syntax for using msprof-analyze:

msprof-analyze -m <feature> -d <profiling_path> [options]

Example:

msprof-analyze -m cluster_time_summary -d ./cluster_data -o ./output
msprof-analyze -m free_analysis -d ./cluster_data -o ./output

Common options:

-m: specifies the analysis feature.
-d: specifies the directory of profile data.
-o: specifies the output path. If this option is not specified, the tool saves the results in the cluster_analysis_output directory within the input path.

For more information, see Command-line Options and Parameters.

Analysis Features

Breakdown and Comparison

Analysis Feature	Description	Document Link
cluster_time_summary	Provides a breakdown of iteration time during cluster training to help identify performance bottlenecks.	cluster_time_summary
cluster_time_compare_summary	Provides cluster-level profile data comparison for AI task execution to help identify performance bottlenecks.	cluster_time_compare_summary
module_statistic	Analyzes model hierarchical structures automatically for PyTorch models to help accurately locate performance bottlenecks.	module_statistic
calibrate_npu_gpu	Compares NPU and GPU profile data automatically to assist with cross-platform performance calibration and bottleneck analysis.	calibrate_npu_gpu

Computation

Analysis Feature	Description	Document Link
compute_op_sum	Summarizes computation operators executed on the device.	-
freq_analysis	Identifies whether the AI Core is idle (frequency at 800 MHz) or abnormal (frequency not at 1800 MHz or 800 MHz) and provides the analysis results.	-
ep_load_balance	Summarizes and analyzes Mixture of Experts (MoE) load information.	-
computational_op_masking	Calculates the overlap between operator execution durations during cluster training to help you identify performance bottlenecks.	computational_op_masking

Communication

Analysis Feature	Description	Document Link
communication_group_map	Displays the communication group and parallel strategy in cluster scenarios.	-
communication_time_sum	Summarizes and analyzes communication durations and bandwidth in cluster scenarios.	-
communication_matrix_sum	Summarizes and analyzes the communication matrix in cluster scenarios.	-
hccl_sum	Summarizes information for communication operators.	-
pp_chart	Analyzes and visualizes the duration for each phase of pipeline parallelism (PP).	pp_chart
slow_rank	Identifies the causes of slow ranks and shows how often each rank is affected based on the fast and slow rank statistics algorithm.	-
communication_bottleneck	Identifies fast and slow ranks for long-duration communication operators, and infers the host-side or device-side operations that cause communication waits.	communication_bottleneck

Host Delivery

Analysis Feature	Description	Document Link
cann_api_sum	Summarizes CANN APIs.	-
mstx_sum	Summarizes MSTX custom instrumentation.	-
free_analysis	Automatically analyzes large idle periods on the device to identify their causes and help you locate performance issues.	free_analysis

Other Features

Analysis Feature	Category	Description	Document Link
export_summary	Data export	Exports API statistics and kernel details for each rank in the cluster to generate the `api_statistic.csv` and `kernel_details.csv` files.	export_summary
mstx2commop	Data processing	Converts communication information from MSTX built-in communication instrumentation into the communication operator table format.	-
p2p_pairing	Data processing	Generates a global association index for P2P operators and attaches the output index to the `COMMUNICATION_OP` table as a new field `opConnectionId`.	-

Output File Description

For details about the output deliverables of the msprof-analyze features, see Table Structures of Recipe Results and cluster_analysis.db Deliverables.

Command-line Options and Parameters

Global Options and Parameters

The following table primarily describes the input, output, format, execution, and help options.

Option/Parameter	Mandatory (Yes/No)	Description
--profiling_path or -d	Yes	Specifies the profile data collection directory. If the `-o` option is not specified, running the analysis script automatically creates the `cluster_analysis_output` folder in this directory to save the analysis data.
--output_path or -o	No	Specifies a custom output path. Running the analysis script automatically creates the `cluster_analysis_output` folder in this directory to save the analysis data.
--mode or -m	No	Specifies the analysis mode to execute. For details, see Analysis Features.
--export_type	No	Sets the format for exported data. Valid values: `db` (.db file), `notebook` (Jupyter Notebook file), and `text` (text-based formats such as JSON, CSV, and Excel). Default value: `db`.
--force	No	Enables forced execution. The user assumes responsibility for the `force` action. This option bypasses the following checks: • Ownership check: Proceed even if the current user is not the owner of the specified directory or files. • File size check: Proceed even if a CSV file exceeds 5 GB, a JSON file exceeds 10 GB, or a DB file exceeds 8 GB. • Permission check: Proceed directly by ignoring read and write permission checks on the specified directory and files. Specifying this option enables forced execution, which is disabled if not specified.
--parallel_mode	No	Sets the concurrency mode for collecting multi-rank and multi-node database data. Set it to `concurrent` to use the `concurrent.feature` process pool.
-v, -V or --version	No	Displays the version number.
-h, -H or --help	No	Displays help information for command-line options.
auto-completion	No	Enables automatic completion and allows you to use the Tab key to automatically complete all sub-parameters for the `msprof-analyze` tool in the current view.

Analysis Feature Options

Option	Mandatory (Yes/No)	Description
--rank_list	No	Specifies a list of rank IDs for which the tool parses profile data. The default value is `all`, which indicates that the tool parses data for all ranks. Specify this option by using actual rank IDs. These IDs must be integers greater than or equal to 0. If a specified ID is larger than the range of ranks used in training, the tool parses only the data for valid rank IDs. For example, if the environment has ranks 0–7 but the training uses only ranks 0–3, and you set this option to `0,3,4,10`, the tool parses only the data for rank 0 and rank 3. Configuration example: `--rank_list 0,1,2`. This option is supported only when `-m` is set to `cann_api_sum`, `compute_op_sum`, `hccl_sum`, or `mstx_sum`.
--step_id	No	Specifies the step ID for profile data analysis. Only profile data for the specified step is analyzed. The step ID must exist within the actual profile data. By default, this option is not specified, which triggers full analysis. Configuration example: `--step_id=1`. This option is supported only when `-m` is set to `cann_api_sum`, `compute_op_sum`, `hccl_sum`, or `mstx_sum`.
--top_num	No	Sets the number of top-N time-consuming communication operators. Default value: `15`. Configuration example: `--top_num 20`. This option is available only when `-m` is set to `hccl_sum`.
--exclude_op_name	No	Specifies whether to include the operator name in the `compute_op_name` results. Example: `--exclude_op_name` (no additional argument is required). This option is available only when `-m` is set to `compute_op_sum`.
--bp	No	Specifies the path to the benchmark cluster profile data for comparison. For example, `--bp {bp_cluster_profiling_path}` compares data from `profiling_path` with data from `bp_cluster_profiling_path`. This option is available only when `-m` is set to `cluster_time_compare_summary`.