Recipe Analysis Rules
This chapter summarizes the advanced analysis features of MindStudio Profiler Analyze (msprof-analyze) for cluster scenarios. Key areas of focus include multi-dimensional information summaries, breakdown comparisons, communication bottleneck identification, and delivery issue analysis. You can also perform custom analysis by using Recipe rules. For more information, see the Custom Analysis Rule Development Guide.
Preparations
The tool supports the following cluster data types:
- DB-format cluster data collected by Ascend PyTorch Profiler
- Lightweight cluster DB data collected by msMonitor
For details about profile data collection, see Profile Data Collection Guides.
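When you use Ascend PyTorch Profiler, you must either export the results in DB format during collection or parse them into DB format offline. Example of setting the export type during collection: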
import torch_npu

# Export the collected profile data in DB format.
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=[torch_npu.profiler.ExportType.Db]
)
You can also specify the export type during offline parsing:
from torch_npu.profiler.profiler import analyse

if __name__ == "__main__":
    # Parse previously collected data offline and export it in DB format.
    analyse(profiler_path="./result_data", export_type=["db"])
Usage
Syntax for using msprof-analyze:
msprof-analyze -m <feature> -d <profiling_path> [options]
Example:
msprof-analyze -m cluster_time_summary -d ./cluster_data -o ./output
msprof-analyze -m free_analysis -d ./cluster_data -o ./output
Common options:
- -m: specifies the analysis feature.
- -d: specifies the directory of profile data.
- -o: specifies the output path. If this option is not specified, the tool saves the results in the cluster_analysis_output directory within the input path.
For more information, see Command-line Options and Parameters.
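When several features are run over the same dataset, it can help to assemble the command line in a script. The following is a minimal sketch, not part of the tool: `build_command` and `expected_output_dir` are hypothetical helper names, and the output location when -o is given follows the --output_path description in Global Options and Parameters.

```python
from pathlib import Path
from typing import List, Optional


def build_command(mode: str, profiling_path: str,
                  output_path: Optional[str] = None) -> List[str]:
    """Assemble the argument list for one msprof-analyze run (hypothetical helper)."""
    cmd = ["msprof-analyze", "-m", mode, "-d", profiling_path]
    if output_path is not None:
        # -o is optional; omit it to let the tool use the input path.
        cmd += ["-o", output_path]
    return cmd


def expected_output_dir(profiling_path: str,
                        output_path: Optional[str] = None) -> Path:
    """Return the directory where results are expected (hypothetical helper).

    Assumption taken from this document: the tool creates a
    cluster_analysis_output folder under -o if given, else under -d.
    """
    base = Path(output_path) if output_path is not None else Path(profiling_path)
    return base / "cluster_analysis_output"
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where msprof-analyze is installed.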
Analysis Features
Breakdown and Comparison
| Analysis Feature | Description | Document Link |
|---|---|---|
| cluster_time_summary | Provides a breakdown of iteration time during cluster training to help identify performance bottlenecks. | cluster_time_summary |
| cluster_time_compare_summary | Provides cluster-level profile data comparison for AI task execution to help identify performance bottlenecks. | cluster_time_compare_summary |
| module_statistic | Analyzes model hierarchical structures automatically for PyTorch models to help accurately locate performance bottlenecks. | module_statistic |
| calibrate_npu_gpu | Compares NPU and GPU profile data automatically to assist with cross-platform performance calibration and bottleneck analysis. | calibrate_npu_gpu |
Computation
| Analysis Feature | Description | Document Link |
|---|---|---|
| compute_op_sum | Summarizes computation operators executed on the device. | - |
| freq_analysis | Identifies whether the AI Core is idle (frequency at 800 MHz) or abnormal (frequency not at 1800 MHz or 800 MHz) and provides the analysis results. | - |
| ep_load_balance | Summarizes and analyzes Mixture of Experts (MoE) load information. | - |
| computational_op_masking | Calculates the overlap between operator execution durations during cluster training to help you identify performance bottlenecks. | computational_op_masking |
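The frequency rule described for freq_analysis can be illustrated with a small sketch. This is not the tool's implementation: the function name and the label "normal" for 1800 MHz are assumptions made for illustration, and only the 800 MHz and 1800 MHz values come from the table above.

```python
IDLE_MHZ = 800      # AI Core idle frequency, per the table above
NORMAL_MHZ = 1800   # assumed to be the normal working frequency


def classify_aicore_freq(freq_mhz: int) -> str:
    """Classify one AI Core frequency sample (hypothetical helper)."""
    if freq_mhz == NORMAL_MHZ:
        return "normal"
    if freq_mhz == IDLE_MHZ:
        return "idle"
    # Any value other than 1800 MHz or 800 MHz is treated as abnormal.
    return "abnormal"
```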
Communication
| Analysis Feature | Description | Document Link |
|---|---|---|
| communication_group_map | Displays the communication group and parallel strategy in cluster scenarios. | - |
| communication_time_sum | Summarizes and analyzes communication durations and bandwidth in cluster scenarios. | - |
| communication_matrix_sum | Summarizes and analyzes the communication matrix in cluster scenarios. | - |
| hccl_sum | Summarizes information for communication operators. | - |
| pp_chart | Analyzes and visualizes the duration for each phase of pipeline parallelism (PP). | pp_chart |
| slow_rank | Identifies the causes of slow ranks and shows how often each rank is affected based on the fast and slow rank statistics algorithm. | - |
| communication_bottleneck | Identifies fast and slow ranks for long-duration communication operators, and infers the host-side or device-side operations that cause communication waits. | communication_bottleneck |
Host Delivery
| Analysis Feature | Description | Document Link |
|---|---|---|
| cann_api_sum | Summarizes CANN APIs. | - |
| mstx_sum | Summarizes MSTX custom instrumentation. | - |
| free_analysis | Automatically analyzes large idle periods on the device to identify their causes and help you locate performance issues. | free_analysis |
Other Features
| Analysis Feature | Category | Description | Document Link |
|---|---|---|---|
| export_summary | Data export | Exports API statistics and kernel details for each rank in the cluster to generate the api_statistic.csv and kernel_details.csv files. | export_summary |
| mstx2commop | Data processing | Converts communication information from MSTX built-in communication instrumentation into the communication operator table format. | - |
| p2p_pairing | Data processing | Generates a global association index for P2P operators and attaches the output index to the COMMUNICATION_OP table as a new field opConnectionId. | - |
Output File Description
For details about the output deliverables of the msprof-analyze features, see Table Structures of Recipe Results and cluster_analysis.db Deliverables.
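Before consulting the table structure references, it can be useful to see which tables a given run actually produced. The sketch below assumes that the .db deliverable is a SQLite file, which this document does not state explicitly; `list_tables` is a hypothetical helper and uses only the Python standard library.

```python
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the sorted table names in a SQLite database file (hypothetical helper).

    Note: sqlite3.connect creates an empty file if db_path does not exist,
    so check the path first when that matters.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For example, `list_tables("./output/cluster_analysis_output/cluster_analysis.db")` would list the result tables of a run that used -o ./output, assuming the deliverable is written at that location.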
Command-line Options and Parameters
Global Options and Parameters
The following table primarily describes the input, output, format, execution, and help options.
| Option/Parameter | Mandatory (Yes/No) | Description |
|---|---|---|
| --profiling_path or -d | Yes | Specifies the profile data collection directory. If the -o option is not specified, running the analysis script automatically creates the cluster_analysis_output folder in this directory to save the analysis data. |
| --output_path or -o | No | Specifies a custom output path. Running the analysis script automatically creates the cluster_analysis_output folder in this directory to save the analysis data. |
| --mode or -m | No | Specifies the analysis mode to execute. For details, see Analysis Features. |
| --export_type | No | Sets the format for exported data. Valid values: db (.db file), notebook (Jupyter Notebook file), and text (text-based formats such as JSON, CSV, and Excel). Default value: db. |
| --force | No | Enables forced execution. Forced execution is disabled unless this option is specified, and the user assumes responsibility for its effects. This option bypasses the following checks: • Ownership check: proceeds even if the current user is not the owner of the specified directory or files. • File size check: proceeds even if a CSV file exceeds 5 GB, a JSON file exceeds 10 GB, or a DB file exceeds 8 GB. • Permission check: proceeds without checking read and write permissions on the specified directory and files. |
| --parallel_mode | No | Sets the concurrency mode for collecting multi-rank and multi-node database data. Set it to concurrent to use the concurrent.futures process pool. |
| -v, -V or --version | No | Displays the version number. |
| -h, -H or --help | No | Displays help information for command-line options. |
| auto-completion | No | Enables automatic completion and allows you to use the Tab key to automatically complete all sub-parameters for the msprof-analyze tool in the current view. |
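The file size check that --force bypasses can be pictured with the sketch below. It is an illustration of the thresholds listed in the table above, not the tool's own code: the helper name is hypothetical, and treating 1 GB as 1024³ bytes is an assumption.

```python
from pathlib import Path

GIB = 1024 ** 3  # assumption: the document's "GB" means 1024**3 bytes

# Per-format limits taken from the --force description above.
SIZE_LIMITS = {".csv": 5 * GIB, ".json": 10 * GIB, ".db": 8 * GIB}


def exceeds_size_limit(filename: str, size_bytes: int) -> bool:
    """Return True if a file of this type and size is over its limit (hypothetical helper)."""
    limit = SIZE_LIMITS.get(Path(filename).suffix.lower())
    # File types without a listed limit are never flagged by this sketch.
    return limit is not None and size_bytes > limit
```

A pre-flight script could call this with `os.path.getsize(path)` to learn in advance whether a run is likely to need --force.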
Analysis Feature Options
| Option | Mandatory (Yes/No) | Description |
|---|---|---|
| --rank_list | No | Specifies the rank IDs whose profile data the tool parses. The default value is all, which means that data for all ranks is parsed. Rank IDs must be integers greater than or equal to 0. If a specified ID falls outside the range of ranks used in training, the tool parses only the data for valid rank IDs. For example, if the environment has ranks 0–7 but the training uses only ranks 0–3, and you set this option to 0,3,4,10, the tool parses only the data for rank 0 and rank 3. Configuration example: --rank_list 0,1,2. This option is supported only when -m is set to cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum. |
| --step_id | No | Specifies the step ID for profile data analysis. Only profile data for the specified step is analyzed. The step ID must exist in the actual profile data. If this option is not specified, all steps are analyzed. Configuration example: --step_id=1. This option is supported only when -m is set to cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum. |
| --top_num | No | Sets the number of top-N time-consuming communication operators. Default value: 15. Configuration example: --top_num 20. This option is available only when -m is set to hccl_sum. |
| --exclude_op_name | No | Specifies whether to include the operator name in the compute_op_name results. Example: --exclude_op_name (no additional argument is required). This option is available only when -m is set to compute_op_sum. |
| --bp | No | Specifies the path to the benchmark cluster profile data for comparison. For example, --bp {bp_cluster_profiling_path} compares data from profiling_path with data from bp_cluster_profiling_path. This option is available only when -m is set to cluster_time_compare_summary. |
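The --rank_list filtering rule described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the tool's code: `select_ranks` is a hypothetical name, and rejecting negative IDs with an exception is a choice made for this sketch.

```python
from typing import Iterable, List


def select_ranks(rank_list: str, available: Iterable[int]) -> List[int]:
    """Return the ranks to parse, given a --rank_list value (hypothetical helper).

    `available` is the set of rank IDs that actually have profile data.
    """
    available_sorted = sorted(set(available))
    if rank_list == "all":
        return available_sorted
    requested = set()
    for token in rank_list.split(","):
        rank = int(token)  # raises ValueError for non-integer input
        if rank < 0:
            raise ValueError(f"rank ID must be >= 0, got {rank}")
        requested.add(rank)
    # IDs without data (for example 4 and 10 below) are silently dropped.
    return [r for r in available_sorted if r in requested]
```

With training on ranks 0–3, `select_ranks("0,3,4,10", range(4))` keeps only ranks 0 and 3, matching the example in the table.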