Cluster Operator Duration Analysis
Overview
The cluster operator duration analysis feature uses cluster_prof_info_analysis.py to collect and display statistics for the top N operators in cluster scenarios. It identifies operators with the fastest, slowest, and average duration, as well as the highest variance, on each rank based on the op_summary information of the multi-rank profile data.
Without this feature, operator information for multiple ranks can be obtained only by viewing the profile data of each rank individually, so differences in compute performance between operators on different ranks cannot be compared directly.
Preparations
Environment Setup
Copy cluster_prof_info_analysis.py to a directory and install the required Python libraries.
pip install pandas
pip install plotly
Data Preparation
Copy the profile data of all nodes to a single environment. The profile data must be placed under node* directories. For example, in a cluster scenario with 2 nodes and 16 ranks, where each node has 8 ranks, arrange the profile data in the following directory structure:
├── node0                          # It can be node0 or node0_xxx, indicating a node.
│   ├── PROF_XXXXX                 # Profile data of a single rank. msProf profile data parsing must be completed.
│   │   ├── SUMMARY
│   │   │   ├── op_summary_XX.csv
│   ......                         # Profile data of the other ranks (eight ranks in total under the node)
├── node1                          # It can be node1 or node1_xxx, indicating a node.
│   ├── PROF_XXXXX                 # Profile data of a single rank.
│   │   ├── SUMMARY
│   │   │   ├── op_summary_XX.csv  # op_summary table used for parsing.
│   ......
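To check the layout before running the analysis, the directory convention above can be sketched in a few lines of Python. This is only an illustrative helper, not part of cluster_prof_info_analysis.py: the function name find_op_summaries is hypothetical, and the recursive search assumes that every op_summary_*.csv file below a node* directory belongs to that node.

```python
from pathlib import Path


def find_op_summaries(data_path):
    """Map each node* directory name to the op_summary CSV files below it.

    Hypothetical helper: it assumes the layout described above
    (node*/PROF_*/SUMMARY/op_summary_*.csv) but searches recursively,
    so the exact nesting depth does not matter.
    """
    found = {}
    for node_dir in sorted(Path(data_path).glob("node*")):
        if not node_dir.is_dir():
            continue
        csv_files = sorted(node_dir.rglob("op_summary_*.csv"))
        if csv_files:
            # Node directories without op_summary files are skipped silently.
            found[node_dir.name] = csv_files
    if not found:
        raise FileNotFoundError(f"no op_summary CSV files found under {data_path}")
    return found
```

Calling find_op_summaries("./cluster_data") on a correctly prepared directory should return one entry per node that contains parsed profile data.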
Cluster Operator Duration Analysis
Function
Collects and displays statistics for the top N operators with the fastest, slowest, and average duration, as well as the highest variance, on each rank.
Precautions
None
Syntax
python3 cluster_prof_info_analysis.py -d <data_path> -t <type> [-n <top_n>]
Command-line Options
| Option | Mandatory (Yes/No) | Description |
|---|---|---|
| -d | Yes | Specifies the profile data directory for cluster scenarios. Enter the parent directory of the node* directories. • If op_summary does not exist in some directories, no information is displayed and no error is reported. • If no op_summary data exists in the specified directory, an error is reported indicating that the data files cannot be found. • If data in the op_summary column of a file is incorrect or cannot be read, the specific faulty file is identified. |
| -t | Yes | Specifies the output file type for the analysis results. The values can be: html (default), csv, or all. If the configuration is incorrect, an error message is displayed along with the correct configuration format. |
| -n | No | (HTML only) Specifies the number of top N (default: 10) operators to be displayed based on average duration. Values exceeding 30 may increase processing time.• The value must be greater than 0. If a value less than or equal to 0 is entered, data for only one operator is exported by default. • If the value specified exceeds the total number of operators, the total number of operators is used. |
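The rules for -n described in the table can be summarized as a small function. This is a sketch of the documented behavior only; the name effective_top_n is hypothetical and the script's actual implementation may differ.

```python
def effective_top_n(requested, total_ops):
    """Return the number of operators displayed for a requested -n value.

    Hypothetical helper mirroring the documented rules:
    - a value less than or equal to 0 falls back to a single operator;
    - a value above the number of available operators is capped at that number.
    """
    if requested <= 0:
        requested = 1
    return min(requested, total_ops)
```

For example, with 50 distinct operators, effective_top_n(0, 50) gives 1 and effective_top_n(100, 50) gives 50.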
Example
python3 cluster_prof_info_analysis.py -d ./cluster_data -t csv -n 5
Output File Description
cluster_op_time_analysis.csv
Classifies operators by op_name, input_shape, input_size, and output_shape. Statistics, such as the maximum, minimum, variance, average, and range of the duration, are collected and displayed for each operator category across different ranks and nodes.
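The grouping and statistics described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than pandas to stay self-contained. The function name summarize_durations and the field name duration are assumptions for illustration; the real column names in op_summary may differ, and the choice of population variance is also an assumption, since the document does not state which variance is used.

```python
import statistics
from collections import defaultdict

# Grouping keys named in the description above.
KEY_FIELDS = ("op_name", "input_shape", "input_size", "output_shape")


def summarize_durations(rows, value_field="duration"):
    """Group rows by operator category and compute duration statistics.

    rows: iterable of dicts holding the KEY_FIELDS plus value_field
    (value_field is an assumed name, not a confirmed op_summary column).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        groups[key].append(float(row[value_field]))

    summary = {}
    for key, values in groups.items():
        summary[key] = {
            "max": max(values),
            "min": min(values),
            "mean": statistics.mean(values),
            # Population variance is an assumption made for this sketch.
            "variance": statistics.pvariance(values),
            "range": max(values) - min(values),
        }
    return summary
```

Rows collected from all ranks and nodes can be passed in together, so each operator category gets one set of statistics across the whole cluster.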
xxx_info.html
HTML files for various features (time and ratio), displaying the box plots of the top N operators.
time and ratio indicate the duration and proportion fields in the performance metrics of AI Core and AI Vector Core operators.
The execution duration and proportion of top N operators are displayed as box plots in the HTML file.
One coordinate system is generated for each of the top N operators, and each coordinate system shows one operator feature. The coordinate systems are arranged from left to right and then from top to bottom, ordered by the average value of total_time.
- Horizontal coordinate: node_device represents the specific rank on a node, sorted in ascending order.
- Vertical coordinate: indicates the duration.
- Coordinate name: displayed below the coordinate system in the format of op_name-input_shape.
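The layout rules above can be sketched as a plain data transformation, independent of the plotting library. The function plan_box_plots and its input structure are hypothetical. Ordering the operators from the largest to the smallest average is an assumption, because the document does not state the sort direction.

```python
from statistics import mean


def plan_box_plots(samples, top_n=10):
    """Describe the coordinate systems to draw, one per top N operator.

    samples: {(op_name, input_shape): {(node, device): [durations, ...]}}
    This input structure is hypothetical, chosen for illustration only.
    """
    def average_duration(item):
        per_rank = item[1]
        return mean([value for values in per_rank.values() for value in values])

    # Assumption: operators with the largest average duration come first.
    ranked = sorted(samples.items(), key=average_duration, reverse=True)

    plans = []
    for (op_name, input_shape), per_rank in ranked[:top_n]:
        # Horizontal axis: node_device labels in ascending order.
        x_labels = [f"{node}_{device}" for node, device in sorted(per_rank)]
        # Coordinate name: op_name-input_shape.
        plans.append({"title": f"{op_name}-{input_shape}", "x": x_labels})
    return plans
```

Each returned entry corresponds to one coordinate system, and the order of the list is the order in which the systems would be placed left to right and then top to bottom.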