cluster_time_summary

Overview

Large-scale cluster scenarios involve multiple compute nodes and massive amounts of data. Statistics and analysis of a single rank's profile data cannot show how the cluster performs as a whole.

The original deliverable, cluster_step_trace_time.csv, has no dedicated execution command, which makes it inconvenient to use. It also does not provide metrics such as memory copy duration. The cluster_time_summary feature was added to address both gaps.

Fine-grained cluster profile data breakdown (cluster_time_summary) provides a breakdown of iteration duration during cluster training. By analyzing the computation, communication, and memory copy durations, it helps users identify performance bottlenecks.

Preparations

Environment Setup

Install msprof-analyze. For details, see MindStudio Profiler Analyze Installation Guide.

Data Preparation

msprof-analyze requires an input directory containing the collected profile data. For instructions on how to collect such data, see Data Preparation.

Fine-grained Cluster Profile Data Breakdown

Function

Analyzes the collected cluster data by using the cluster_time_summary feature of msprof-analyze.

Syntax

msprof-analyze -m cluster_time_summary -d <cluster_data> [-o <output_path>]

Command-line Options

Option	Mandatory (Yes/No)	Description
-m	Yes	Specifies the analysis mode to execute. Set it to `cluster_time_summary` to enable fine-grained breakdown of cluster profile data.
-d	Yes	Specifies the cluster profile data directory.
-o	No	Specifies the output directory. The default value is the directory specified by the `-d` option.

For details about more options, see Command-line Options and Parameters of msprof-analyze.

Example

Perform a fine-grained breakdown of cluster profile data:

msprof-analyze -m cluster_time_summary -d ./xxx/cluster_data -o ./xxx/output_path
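When the analysis is launched from an automation script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch, assuming msprof-analyze is available on PATH. It uses only the -m, -d, and -o options described above; the function names build_command and run_cluster_time_summary are hypothetical helpers, not part of msprof-analyze.

```python
import subprocess
from typing import List, Optional


def build_command(cluster_data: str, output_path: Optional[str] = None) -> List[str]:
    """Assemble the argument list for the cluster_time_summary mode.

    Only the documented -m, -d and -o options are used. When output_path is
    omitted, -o is left out so the tool falls back to its documented default.
    """
    cmd = ["msprof-analyze", "-m", "cluster_time_summary", "-d", cluster_data]
    if output_path is not None:
        cmd += ["-o", output_path]
    return cmd


def run_cluster_time_summary(cluster_data: str, output_path: Optional[str] = None) -> int:
    """Run the analysis and return the exit code of the msprof-analyze process."""
    completed = subprocess.run(build_command(cluster_data, output_path), check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems when the data directory contains spaces.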

Output Description

If the export type is set to db, cluster_analysis_output/cluster_analysis.db is generated in the output directory. If the export type is set to text, cluster_analysis_output/ClusterTimeSummary/cluster_time_summary_{timestamp}.csv is generated in the output directory.
Data table name: ClusterTimeSummary
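For the db export type, the ClusterTimeSummary table can be read with any SQLite client. The following is a minimal sketch using the Python standard library, assuming the .db file is a regular SQLite database and that the table contains the rank and step columns. The function name load_cluster_time_summary is a hypothetical helper.

```python
import sqlite3
from typing import Any, Dict, List


def load_cluster_time_summary(db_path: str) -> List[Dict[str, Any]]:
    """Return every row of the ClusterTimeSummary table as a dictionary.

    Rows are ordered by step and then by rank, so that the ranks of one
    iteration are adjacent in the result.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.row_factory = sqlite3.Row  # allows access to columns by name
        cursor = conn.execute('SELECT * FROM ClusterTimeSummary ORDER BY step, "rank"')
        return [dict(row) for row in cursor]
    finally:
        conn.close()
```

A typical call passes the path to cluster_analysis_output/cluster_analysis.db under the output directory.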

Output File Description

The following table describes the fields in the ClusterTimeSummary table.

Field	Type	Description
rank	INTEGER	Rank ID
step	INTEGER	Iteration number
stepTime	REAL	Total iteration duration
computation	REAL	Total computation duration of operators on the NPU
communicationNotOverlapComputation	REAL	Communication duration not overlapped by computation
communicationOverlapComputation	REAL	Overlap duration of computation and communication
communication	REAL	Total communication duration of operators on the NPU
free	REAL	Idle duration (total iteration duration minus computation, communication, and copy durations)
communicationWaitStageTime	REAL	Total wait duration during communication
communicationTransmitStageTime	REAL	Total transmission duration during communication
memory	REAL	Memory copy duration
memoryNotOverlapComputationCommunication	REAL	Memory copy duration not overlapped by computation or communication

Time-related fields in the preceding table are in microseconds (μs).

Except for the header format, the data in cluster_time_summary_{timestamp}.csv is consistent with that in the .db file.
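Because the CSV file name contains a timestamp, a script usually has to locate the file before reading it. The following is a minimal sketch with two assumptions: the most recently modified file is the one wanted, and the file is UTF-8 encoded. Column headers are returned exactly as they appear in the file, since the CSV header format differs from the .db field names. Both function names are hypothetical helpers.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def find_latest_summary_csv(output_dir: str) -> Optional[Path]:
    """Return the most recently modified cluster_time_summary_*.csv, or None."""
    folder = Path(output_dir) / "cluster_analysis_output" / "ClusterTimeSummary"
    candidates = list(folder.glob("cluster_time_summary_*.csv"))
    # Assumption: modification time identifies the newest export.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def read_summary_csv(csv_path: Path) -> List[Dict[str, str]]:
    """Read the CSV into dictionaries keyed by the header names found in the file."""
    # utf-8-sig also accepts files that start with a byte order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

Values are returned as strings. Convert the time-related columns to float before doing arithmetic on them.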

Output Analysis

Identify performance bottlenecks by analyzing the proportions of computation, communication, memory copy, and idle durations.
Compare duration metrics across ranks within the cluster to locate performance issues. For example, large fluctuations in computation duration across ranks typically indicate inter-rank desynchronization or uneven compute performance among ranks. Large variance in communication duration suggests checking first for parameter plane network congestion or configuration anomalies.
Use the cluster_time_compare_summary feature together with this one to locate the root cause of cluster performance deterioration.
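The two checks described above, duration proportions within an iteration and spread across ranks, can be sketched as follows. This is an illustrative example, not part of msprof-analyze. It assumes rows are dictionaries keyed by the ClusterTimeSummary field names with numeric values. Two choices here are assumptions of this sketch: using the non-overlapped fields so that overlapped time is not counted twice, and using the coefficient of variation as the spread measure.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, List

Row = Dict[str, float]

# Assumption: these fields describe non-overlapping portions of an iteration.
SHARE_FIELDS = (
    "computation",
    "communicationNotOverlapComputation",
    "memoryNotOverlapComputationCommunication",
    "free",
)


def duration_shares(row: Row) -> Dict[str, float]:
    """Return each selected duration as a fraction of stepTime for one rank and step."""
    step_time = row["stepTime"]
    if step_time <= 0:
        raise ValueError("stepTime must be positive")
    return {field: row[field] / step_time for field in SHARE_FIELDS}


def coefficient_of_variation(values: Iterable[float]) -> float:
    """Population standard deviation divided by the mean (0.0 when the mean is 0)."""
    data = list(values)
    if not data:
        raise ValueError("no values to compare")
    average = mean(data)
    return pstdev(data) / average if average else 0.0


def cross_rank_spread(rows: List[Row], step: int, field: str) -> float:
    """Spread of one duration field across all ranks of a given step."""
    return coefficient_of_variation(r[field] for r in rows if r["step"] == step)
```

A high share for free or for communicationNotOverlapComputation points to where an iteration spends its time. A high cross-rank spread for computation or communication identifies the steps where ranks should be compared individually.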