Table Structures of Recipe Results and cluster_analysis.db Deliverables
NOTE
- When msprof-analyze is configured with the --mode option, the profile data is analyzed and the cluster_analysis.db deliverables are generated. This topic describes the table structures and fields of these deliverables.
- Some analysis features do not generate the cluster_analysis.db file.
cluster_step_trace_time.csv
Generated when the data parsing mode is communication_matrix, communication_time, or all.
Column A: Steps. This column is set during profile data collection. Generally, profile data for a single step is sufficient for cluster performance analysis. If multiple steps are collected, filter them first.
Column B: Type. Valid values are rank and stage, which are closely related to the index. rank represents a single rank, while stage represents a rank group (PP parallel stage). If the type is stage, the information in columns D through K represents the maximum values within the rank group.
Column C: Index. This column is related to the type and indicates the device ID.
Column D: Computing. This column displays the computation duration.
Column E: Communication (Not Overlapped). This column displays the communication duration not overlapped by computation.
Column F: Overlapped. This column displays the duration where computation and communication overlap.
Column G: Communication. This column displays the total communication duration.
Column H: Free. This column displays the idle duration, which indicates the duration where the device is neither communicating nor computing. This may include the SDMA copy and idle wait durations.
Column I: Stage. This column and the following two columns are valid only for PP parallelism. Stage duration represents the total time excluding the duration of receive operators.
Column J: Bubble. This column displays the bubble time, which is the sum of the duration of all receive operators.
Column K: Communication (Not Overlapped and Exclude Receive). This column indicates the communication duration that is not overlapped and excludes the duration of receive operators.
Column L: Preparing. This column displays the duration from the start of an iteration to the execution of the first computation or communication operator.
Column M: DP Index. This column displays the index of the DP group to which the cluster data belongs after being partitioned based on the parallel strategy. If the data is not collected, this column is not displayed.
Column N: PP Index. This column displays the index of the PP group to which the cluster data belongs after being partitioned based on the parallel strategy. If not collected, this column is not displayed.
Column O: TP Index. This column displays the index of the TP group to which the cluster data belongs after being partitioned based on the parallel strategy. If not collected, this column is not displayed.
Tips: Filter Column B by the stage type to check for issues between stages. Then, filter Column B by the rank type to check for issues between ranks. Perform the following troubleshooting checks:
- Check for slow ranks or load imbalance based on the computation duration difference.
- Check for host-bound issues or uneven distribution based on the idle duration statistics.
- Check for excessive communication duration based on the duration displayed in the Communication (Not Overlapped and Exclude Receive) column.
- Check whether the bubble configuration is appropriate and whether imbalance exists between stages based on the proportion of bubble time and the theoretical calculation formula.
Theoretically, the values for these durations should remain relatively consistent. If the difference between the maximum and minimum values exceeds 5%, a slow rank may exist.
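
The 5% rule above can be checked mechanically. The following is a minimal sketch, not part of msprof-analyze. It assumes that the CSV header names match the column titles listed above (Type, Index, Computing) and that the difference is measured relative to the minimum value; this topic confirms neither assumption, and the sample rows are made up.

```python
import csv
import io

# Hypothetical excerpt of cluster_step_trace_time.csv. The header names are
# assumed to match the column titles described above.
SAMPLE_CSV = """Step,Type,Index,Computing,Free
1,rank,0,1000.0,50.0
1,rank,1,1020.0,52.0
1,rank,2,1100.0,49.0
1,stage,"(0, 1, 2)",1100.0,52.0
"""


def relative_spread(values):
    """Return (max - min) / min, the assumed definition of the difference."""
    lowest, highest = min(values), max(values)
    if lowest == 0:
        return float("inf") if highest > 0 else 0.0
    return (highest - lowest) / lowest


def check_rank_spread(csv_text, column="Computing", threshold=0.05):
    """Return (spread, index_of_largest_value) for rank rows, or None if within threshold."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["Type"] == "rank"]
    values = [float(r[column]) for r in rows]
    spread = relative_spread(values)
    if spread <= threshold:
        return None
    largest = max(rows, key=lambda r: float(r[column]))
    return spread, largest["Index"]


RESULT = check_rank_spread(SAMPLE_CSV)
```

With the sample rows, the computation durations of the three ranks differ by 10%, so the function reports rank 2 as the candidate to inspect first. Pass a different column name to apply the same check to the idle or communication durations.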
cluster_communication_matrix.json
Generated when the data parsing mode is communication_matrix or all.
Open the JSON file using VS Code or a JSON viewer and search for Total. There will be multiple results. Generally, the structure of the link bandwidth information is as follows:
```
{src_rank}-{dst_rank}: {
    "Transport Type": "LOCAL",
    "Transit Time(ms)": 0.02462,
    "Transit Size(MB)": 16.777216,
    "Bandwidth(GB/s)": 681.4466
}
```
Tips: You can identify slow link issues based on the rank interconnection bandwidth and the link type.
- LOCAL: represents on-chip copy, which provides the highest speed.
- HCCS or PCIE: represents intra-node inter-chip copy, which provides medium speed.
- RDMA: represents inter-node copy, which provides the lowest speed.
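
The sample values shown above are consistent with bandwidth being the transit size divided by the transit time (MB per ms equals GB per s in decimal units). The following sketch uses that relation to find the slowest link of each transport type. The link entries and the helper names are hypothetical, and the sketch does not parse the real file layout.

```python
# Hypothetical link entries shaped like the structure shown above.
LINKS = {
    "0-0": {"Transport Type": "LOCAL", "Transit Time(ms)": 0.02462, "Transit Size(MB)": 16.777216},
    "0-1": {"Transport Type": "HCCS", "Transit Time(ms)": 1.2, "Transit Size(MB)": 16.777216},
    "0-2": {"Transport Type": "HCCS", "Transit Time(ms)": 4.8, "Transit Size(MB)": 16.777216},
    "0-8": {"Transport Type": "RDMA", "Transit Time(ms)": 9.6, "Transit Size(MB)": 16.777216},
}


def bandwidth_gb_per_s(size_mb, time_ms):
    """MB per ms equals GB per s (decimal units), so the ratio needs no scaling."""
    return size_mb / time_ms


def slowest_link_per_type(links):
    """Return {transport_type: (link, bandwidth)} for the slowest link of each type."""
    slowest = {}
    for link, info in links.items():
        bw = bandwidth_gb_per_s(info["Transit Size(MB)"], info["Transit Time(ms)"])
        kind = info["Transport Type"]
        if kind not in slowest or bw < slowest[kind][1]:
            slowest[kind] = (link, bw)
    return slowest


SLOWEST = slowest_link_per_type(LINKS)
```

Comparing links only within the same transport type avoids flagging an RDMA link as slow merely because it is slower than an on-chip copy.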
cluster_communication.json
Generated when the data parsing mode is set to communication_time or all.
It mainly provides the communication duration data.
compute_op_sum
When -m compute_op_sum is set, the following tables are generated.
ComputeOpAllRankStats
Description:
Provides statistical analysis of computation duration for all ranks, grouped by OpType and TaskType. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| OpType | TEXT | Computation operator type |
| TaskType | TEXT | Accelerator type for operator execution |
| Count | INTEGER | Number of operators grouped by OpType and TaskType |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
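
The field types listed above (TEXT, INTEGER, REAL) suggest that cluster_analysis.db can be read as an SQLite database; this is an assumption, not something this topic states. Under that assumption, the following sketch builds an in-memory stand-in for the table with made-up rows and lists the operator types with the largest total duration.

```python
import sqlite3

# In-memory stand-in for cluster_analysis.db. The rows and TaskType values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ComputeOpAllRankStats ("
    "OpType TEXT, TaskType TEXT, Count INTEGER, MeanNs REAL, StdNs REAL, "
    "MinNs REAL, Q1Ns REAL, MedianNs REAL, Q3Ns REAL, MaxNs REAL, SumNs REAL)"
)
conn.executemany(
    "INSERT INTO ComputeOpAllRankStats VALUES (?,?,?,?,?,?,?,?,?,?,?)",
    [
        ("MatMul", "AI_CORE", 4, 250.0, 10.0, 240.0, 242.0, 250.0, 258.0, 260.0, 1000.0),
        ("Add", "AI_VECTOR_CORE", 10, 20.0, 2.0, 18.0, 19.0, 20.0, 21.0, 23.0, 200.0),
        ("Cast", "AI_VECTOR_CORE", 2, 30.0, 1.0, 29.0, 29.5, 30.0, 30.5, 31.0, 60.0),
    ],
)


def top_ops_by_total(connection, limit=2):
    """Return (OpType, TaskType, SumNs) rows ordered by total duration, largest first."""
    cursor = connection.execute(
        "SELECT OpType, TaskType, SumNs FROM ComputeOpAllRankStats "
        "ORDER BY SumNs DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


TOP = top_ops_by_total(conn)
```

To run the same query against real output, replace ":memory:" with the path to the generated database file and skip the CREATE and INSERT statements.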
ComputeOpPerRankStatsByOpType
Description:
Provides statistical analysis of computation duration for each rank, grouped by OpType and TaskType. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| OpType | TEXT | Computation operator type |
| TaskType | TEXT | Accelerator type for operator execution |
| Count | INTEGER | Number of operators grouped by OpType and TaskType |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| Rank | INTEGER | Rank ID |
ComputeOpPerRankStatsByOpName
Description:
Not generated when the --exclude_op_name option is specified.
It provides a statistical analysis of computation duration for each rank, grouped by OpName, OpType, TaskType, and InputShapes. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| OpName | TEXT | Computation operator name |
| OpType | TEXT | Computation operator type |
| TaskType | TEXT | Accelerator type for operator execution |
| InputShapes | TEXT | Input shape of the operator |
| Count | INTEGER | Number of operators in this group |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| Rank | INTEGER | Rank ID |
cann_api_sum
When -m cann_api_sum is set, the following tables are generated:
CannApiSum
Description:
Provides statistical analysis of the duration of each unique API across all ranks. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| name | TEXT | API name |
| timeRatio | REAL | Percentage of the duration of the API relative to the total duration of all APIs |
| totalTimeNs | INTEGER | Total duration of the API |
| totalCount | INTEGER | Number of calls to the API |
| averageNs | REAL | Average duration |
| Q1Ns | REAL | 25th percentile of duration |
| medNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| minNs | REAL | Minimum duration |
| maxNs | REAL | Maximum duration |
| stdev | REAL | Standard deviation of the duration |
| minRank | TEXT | A set of ranks corresponding to minNs |
| maxRank | TEXT | A set of ranks corresponding to maxNs |
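
As an illustration of the timeRatio field, the following sketch computes each API's share of the total duration of all APIs. It assumes the ratio is expressed as a percentage between 0 and 100; the API names and durations are hypothetical.

```python
def time_ratios(total_time_ns_by_api):
    """Return each API's share of the summed duration, as a percentage."""
    grand_total = sum(total_time_ns_by_api.values())
    if grand_total == 0:
        return {name: 0.0 for name in total_time_ns_by_api}
    return {
        name: 100.0 * total / grand_total
        for name, total in total_time_ns_by_api.items()
    }


# Hypothetical totalTimeNs values per API name.
SAMPLE_TOTALS = {"api_a": 600, "api_b": 300, "api_c": 100}
RATIOS = time_ratios(SAMPLE_TOTALS)
```

The ratios of all APIs add up to 100, so a single API with a large ratio points to where host-side time is concentrated.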
CannApiSumRank
Description:
Provides statistical analysis of the duration of each unique API on each rank. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| name | TEXT | API name |
| durationRatio | REAL | Percentage of the duration of the API relative to the total duration of all APIs on the rank |
| totalTimeNs | INTEGER | Total duration of the API |
| totalCount | INTEGER | Number of calls to the API |
| averageNs | REAL | Average duration |
| minNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| medNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| maxNs | REAL | Maximum duration |
| stdev | REAL | Standard deviation of the duration |
| rank | INTEGER | Rank ID |
hccl_sum
When -m hccl_sum is set, the following tables are generated:
HcclAllRankStats
Description:
Provides statistical analysis of the duration of each communication operator type (such as hcom_broadcast_) across all ranks. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| OpType | TEXT | Communication operator type |
| Count | INTEGER | Count |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
HcclPerRankStats
Description:
Provides statistical analysis of the duration of each communication operator type (such as hcom_broadcast_) on each rank. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| OpType | TEXT | Communication operator type |
| Count | INTEGER | Count |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| Rank | INTEGER | Rank ID |
HcclGroupNameMap
Description:
Provides a mapping of ranks contained within each communication group.
Table fields
| Field | Type | Description |
|---|---|---|
| GroupName | TEXT | Communication group, such as {ip_address}%enp67s0f5_60000_0_1708156014257149 |
| GroupId | TEXT | Last three digits of the hash value of the communication group |
| Ranks | TEXT | All ranks within the communication group |
HcclTopOpStats
Description:
Provides an analysis of the duration of all communication operators across all ranks. It displays data for the top N (default value: 15) communication operators with the largest average durations. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| OpName | TEXT | Communication operator name, such as hcom_allReduce__606_0_1 |
| Count | INTEGER | Count |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| MinRank | INTEGER | Rank with the minimum duration for the communication operator |
| MaxRank | INTEGER | Rank with the maximum duration for the communication operator |
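
The statistics columns and the top N selection described above can be reproduced on a small scale. The following sketch is not the tool's implementation: the quantile method (inclusive) and the population standard deviation are assumptions, and the durations are made up.

```python
import statistics


def summarize(durations_ns):
    """Compute the statistics columns for one operator (needs at least two samples)."""
    q1, median, q3 = statistics.quantiles(durations_ns, n=4, method="inclusive")
    return {
        "Count": len(durations_ns),
        "MeanNs": statistics.fmean(durations_ns),
        "StdNs": statistics.pstdev(durations_ns),
        "MinNs": min(durations_ns),
        "Q1Ns": q1,
        "MedianNs": median,
        "Q3Ns": q3,
        "MaxNs": max(durations_ns),
        "SumNs": sum(durations_ns),
    }


def top_ops_by_mean(durations_by_op, top_n=15):
    """Return (op_name, stats) pairs for the top_n operators by average duration."""
    stats = {op: summarize(values) for op, values in durations_by_op.items()}
    ranked = sorted(stats.items(), key=lambda item: item[1]["MeanNs"], reverse=True)
    return ranked[:top_n]


# Hypothetical durations collected across all ranks.
SAMPLE_DURATIONS = {
    "hcom_allReduce__606_0_1": [100, 200, 300, 400],
    "hcom_broadcast__1": [10, 20, 30, 40],
}
TOP_OPS = top_ops_by_mean(SAMPLE_DURATIONS, top_n=1)
```

Ranking by the average rather than the total keeps frequently called but cheap operators from crowding out the few operators that are slow on every call.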
mstx_sum
When -m mstx_sum is set, the following tables are generated:
MSTXAllFrameworkStats
Description:
Provides statistical analysis of the framework-side duration of MSTX instrumentation. This analysis is based on cluster profile data in db format and does not distinguish between ranks.
Table fields
| Field | Type | Description |
|---|---|---|
| Name | TEXT | Information carried by the MSTX instrumentation data |
| Count | INTEGER | Number of instrumentation events grouped by Name within the iteration |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| StepId | INTEGER | Iteration ID |
MSTXAllCannStats
Description:
Provides statistical analysis of the CANN-layer duration of MSTX instrumentation. This analysis is based on cluster profile data in db format and does not distinguish between ranks.
Table fields
| Field | Type | Description |
|---|---|---|
| Name | TEXT | Information carried by the MSTX instrumentation data |
| Count | INTEGER | Number of instrumentation events grouped by Name within the iteration |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| StepId | INTEGER | Iteration ID |
MSTXAllDeviceStats
Description:
Provides statistical analysis of the device-side duration of MSTX instrumentation. This analysis is based on cluster profile data in db format and does not distinguish between ranks.
Table fields
| Field | Type | Description |
|---|---|---|
| Name | TEXT | Information carried by the MSTX instrumentation data |
| Count | INTEGER | Number of instrumentation events grouped by Name within the iteration |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| StepId | INTEGER | Iteration ID |
MSTXMarkStats
Description:
Provides statistical analysis of the duration of MSTX instrumentation for each rank, grouped by Rank and StepId. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| Name | TEXT | Information carried by the MSTX instrumentation data |
| FrameworkDurationNs | REAL | Framework-side duration |
| CannDurationNs | REAL | CANN layer duration |
| DeviceDurationNs | REAL | Device-side duration |
| Rank | INTEGER | global rank |
| StepId | INTEGER | Iteration ID |
communication_group_map
When -m communication_group_map is set, the following tables are generated:
CommunicationGroupMapping
Description:
Provides the mapping between communication groups and parallel strategies based on the cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| type | TEXT | Operator type (collective or p2p). Operators with names containing send, recv, or receive are classified as p2p. |
| rank_set | TEXT | A set of ranks (global ranks) within the communication group. |
| group_name | TEXT | Hash value of the communication group, which maps to group_id. |
| group_id | TEXT | Communication group name defined within HCCL, such as {ip_address}%enp67s0f5_60000_0_1708156014257149 |
| pg_name | TEXT | Service-defined communication group name (such as dp, dp_cp, and mp). |
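
The rule for the type field can be sketched as a simple substring check. The function name is hypothetical, and matching case-insensitively is an assumption.

```python
# Substrings that mark an operator as point-to-point, per the type field description.
P2P_KEYWORDS = ("send", "recv", "receive")


def classify_op(op_name):
    """Return "p2p" if the operator name contains a P2P keyword, else "collective"."""
    lowered = op_name.lower()  # Case-insensitive matching is an assumption.
    if any(keyword in lowered for keyword in P2P_KEYWORDS):
        return "p2p"
    return "collective"
```

For example, classify_op("hcom_send__101") returns "p2p", while classify_op("hcom_allReduce__606_0_1") returns "collective".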
cluster_time_summary
When -m cluster_time_summary is set, the following tables are generated:
Note: The ClusterTimeSummary table is similar to cluster_step_trace_time.csv and is intended to replace it later.
ClusterTimeSummary
Description:
Provides statistical analysis of cluster duration for all ranks to facilitate performance issue identification. This analysis is based on cluster profile data in db format.
Table fields (time unit: μs)
| Field | Type | Description |
|---|---|---|
| rank | INTEGER | global rank |
| step | INTEGER | Iteration ID |
| stepTime | REAL | Total iteration duration |
| computation | REAL | Total computation duration |
| communicationNotOverlapComputation | REAL | Communication duration not overlapped by computation |
| communicationOverlapComputation | REAL | Duration of the overlap between computation and communication |
| communication | REAL | Total communication duration |
| free | REAL | Idle time (total duration when the device is neither communicating nor computing, excluding asynchronous memory copy) |
| communicationWaitStageTime | REAL | Total communication wait duration |
| communicationTransmitStageTime | REAL | Total communication transmission duration |
| memory | REAL | Total asynchronous memory copy duration |
| memoryNotOverlapComputationCommunication | REAL | Total duration of asynchronous memory copy not overlapped by computation or communication |
| taskLaunchDelayAvgTime | REAL | Delivery duration (average duration from the start of the host-side API to the start of the device-side task) |
cluster_time_compare_summary
When -m cluster_time_compare_summary is set, the following tables are generated.
Note: This analysis feature requires the cluster_time_summary results. Both cluster data and benchmark cluster data must contain a cluster_analysis.db file including the ClusterTimeSummary table.
ClusterTimeCompareSummary
Description: Provides a comparison between the current cluster and the benchmark cluster. For example, computationDiff indicates the difference in computation time between the current cluster and the benchmark cluster. A positive computationDiff value indicates the current cluster computation time exceeds that of the benchmark cluster, while a negative value indicates the opposite.
Table fields (time unit: μs)
| Field | Type | Description |
|---|---|---|
| rank | INTEGER | global rank |
| step | INTEGER | Iteration ID |
| stepTime | REAL | Iteration duration for current cluster data |
| stepTimeBase | REAL | Iteration duration for benchmark cluster data |
| stepTimeDiff | REAL | Difference in iteration duration |
| ...... | - | Some fields omitted (for the ClusterTimeSummary table, current cluster data, benchmark cluster data, and the difference between the two are displayed) |
| taskLaunchDelayAvgTime | REAL | Delivery duration for current cluster data |
| taskLaunchDelayAvgTimeBase | REAL | Delivery duration for benchmark cluster data |
| taskLaunchDelayAvgTimeDiff | REAL | Difference in delivery duration |
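
The naming pattern of the comparison fields (field, fieldBase, fieldDiff) can be sketched as follows. The function is hypothetical and only mirrors the sign convention described above, where the difference is the current value minus the benchmark value; the durations are made up.

```python
def compare_summaries(current, base):
    """Build one comparison row from two {field: duration_us} mappings."""
    row = {}
    for field, value in current.items():
        row[field] = value
        row[f"{field}Base"] = base[field]
        # Positive: the current cluster takes longer than the benchmark cluster.
        row[f"{field}Diff"] = value - base[field]
    return row


# Hypothetical timing fields for the same rank and step in both clusters.
CURRENT = {"stepTime": 1200.0, "computation": 800.0}
BASE = {"stepTime": 1000.0, "computation": 850.0}
ROW = compare_summaries(CURRENT, BASE)
```

In this sample, stepTimeDiff is positive (the current cluster is slower overall) while computationDiff is negative, which indicates that the regression comes from something other than computation.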
freq_analysis
Description:
Provides AI Core frequency analysis to enable one-click NPU frequency reduction detection. This analysis is based on cluster profile data in db format. There are three frequency scenarios:
- Normal: The frequency remains stable at 1800 MHz.
- Idle state: When the NPU is idle for an extended period, the device automatically reduces the frequency to 800 MHz.
- Abnormal reduction: When NPU frequency reduction occurs due to other factors, abnormal frequencies apart from 1800 MHz and 800 MHz are detected.
When -m freq_analysis is set, the following tables are generated if frequency reduction occurs.
FreeFrequencyRanks
Description:
Idle state: When the NPU is idle for an extended period, the device automatically reduces the frequency to 800 MHz.
Table fields
| Field | Type | Description |
|---|---|---|
| rankId | INTEGER | global rank |
| aicoreFrequency | TEXT | [800, 1800] |
AbnormalFrequencyRanks
Description:
Abnormal reduction: When NPU frequency reduction occurs due to other factors, abnormal frequencies apart from 1800 MHz and 800 MHz are detected.
Table fields
| Field | Type | Description |
|---|---|---|
| rankId | INTEGER | global rank |
| aicoreFrequency | TEXT | List of frequencies in abnormal reduction scenarios, such as [800, 1150, 1450, 1800] |
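
The three frequency scenarios can be expressed as a small classification function. This is a sketch of the rules stated above, not the tool's implementation; the function and constant names are hypothetical.

```python
NORMAL_MHZ = 1800  # Stable frequency in the normal scenario.
IDLE_MHZ = 800     # Frequency after the NPU has been idle for an extended period.


def classify_frequencies(frequencies_mhz):
    """Classify the AI Core frequencies observed on one rank."""
    observed = set(frequencies_mhz)
    if not observed:
        raise ValueError("no frequency samples")
    if observed == {NORMAL_MHZ}:
        return "normal"
    if observed <= {IDLE_MHZ, NORMAL_MHZ}:
        return "idle"
    # Any value other than 800 MHz and 1800 MHz indicates an abnormal reduction.
    return "abnormal"
```

Under these rules, [800, 1800] is classified as idle and [800, 1150, 1450, 1800] as abnormal, matching the example values in the two tables above.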
ep_load_balance
Description:
In cluster training scenarios, MoE load imbalance refers to the uneven distribution of tasks across different expert models in a distributed environment, causing some expert models to overload while others remain idle. This imbalance reduces overall system efficiency and creates potential performance bottlenecks.
When -m ep_load_balance is set, the following tables are generated.
EPTokensSummary
Description:
Provides GroupedMatmul operator shape analysis. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| rank | INTEGER | global rank |
| epRanks | TEXT | A set of ranks within the same Expert Parallelism (EP) group, such as [rank0,rank1] |
| inputShapesSummary | INTEGER | Sum of the first dimension of all input_shapes for the GroupedMatmul operator on this rank |
TopEPTokensInfo
Description:
Provides information about EP groups with load imbalance.
Table fields
| Field | Type | Description |
|---|---|---|
| epRanks | TEXT | A set of ranks within the EP group with load imbalance, such as [rank0, rank1] |
| tokensDiff | INTEGER | Difference between the maximum and minimum values within the same EP group |
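
The relationship between inputShapesSummary and tokensDiff can be sketched as follows. The sketch assumes each shape is already parsed into a list of integers, with one entry per GroupedMatmul call, and that tokensDiff is computed over the inputShapesSummary values; the function names and sample shapes are hypothetical.

```python
def input_shapes_summary(input_shapes):
    """Sum the first dimension of every non-empty input shape on one rank."""
    return sum(shape[0] for shape in input_shapes if shape)


def tokens_diff(summary_by_rank, ep_ranks):
    """Return max minus min of the per-rank summaries within one EP group."""
    values = [summary_by_rank[rank] for rank in ep_ranks]
    return max(values) - min(values)


# Hypothetical GroupedMatmul input shapes per rank.
SHAPES_BY_RANK = {
    0: [[4096, 7168], [2048, 7168]],
    1: [[1024, 7168]],
}
SUMMARY = {rank: input_shapes_summary(shapes) for rank, shapes in SHAPES_BY_RANK.items()}
DIFF = tokens_diff(SUMMARY, ep_ranks=[0, 1])
```

A large difference relative to the per-rank totals means that some experts in the group receive far more tokens than others.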
mstx2commop
When -m mstx2commop is set, cluster_analysis.db is not generated, and the built-in communication instrumentation data is converted into communication operators.
Note: This setting generates a new COMMUNICATION_OP table. You are advised to use it in combination with Level_none. Otherwise, the original table structure will be damaged.
Output:
When Level_none is set, the unified database does not contain a COMMUNICATION_OP table. This analysis feature converts built-in communication instrumentation data into communication operators for display in MindStudio Insight.
slow_rank
When -m slow_rank is set, the following tables are generated.
SlowRank
Description:
Provides slow rank analysis based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| rankId | INTEGER | Slow rank |
| slowAffectCount | INTEGER | Number of communications affected by this rank |
SlowOpStats
Description:
Provides communication operator statistics corresponding to slow rank bottleneck locations. This analysis is based on cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| SlowRank | TEXT | Slow rank ID |
| OpName | TEXT | Communication operator name |
| GroupName | TEXT | Communication group name |
| Timestamp | TEXT | Communication operator timestamp |
| Count | INTEGER | Count |
| MeanNs | REAL | Average duration |
| StdNs | REAL | Standard deviation of the duration |
| MinNs | REAL | Minimum duration |
| Q1Ns | REAL | 25th percentile of duration |
| MedianNs | REAL | 50th percentile of duration |
| Q3Ns | REAL | 75th percentile of duration |
| MaxNs | REAL | Maximum duration |
| SumNs | REAL | Total duration |
| MinRank | INTEGER | Rank with the minimum duration for the communication operator |
| MaxRank | INTEGER | Rank with the maximum duration for the communication operator |
p2p_pairing
When -m p2p_pairing is set, cluster_analysis.db is not generated.
This analysis feature displays P2P operator connection lines, allowing users to identify the source rank (src_rank) and destination rank (dst_rank) for send and receive operations. Currently, MindStudio Insight does not support this feature.
Output:
An opConnectionId column is added to the COMMUNICATION_OP table in the ascend_pytorch_profiler_{rank_id}.db file of the cluster data. P2P operators across different ranks can be linked based on this operator connection ID (opConnectionId).
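
The linking step can be sketched as a grouping by opConnectionId. The row layout, the connection ID values, and the use of "send" in the operator name to identify the source side are assumptions for illustration; the real COMMUNICATION_OP table has more columns.

```python
from collections import defaultdict

# Hypothetical rows gathered from each rank's database: (rank, op_name, opConnectionId).
OPS = [
    (0, "hcom_send__101", "conn_a"),
    (1, "hcom_receive__101", "conn_a"),
    (1, "hcom_send__102", "conn_b"),
    (2, "hcom_receive__102", "conn_b"),
    (3, "hcom_send__103", "conn_c"),  # No matching receive; left unpaired.
]


def pair_p2p_ops(rows):
    """Return {opConnectionId: (src_rank, dst_rank)} for IDs that have both sides."""
    by_connection = defaultdict(dict)
    for rank, op_name, connection_id in rows:
        role = "src_rank" if "send" in op_name.lower() else "dst_rank"
        by_connection[connection_id][role] = rank
    return {
        connection_id: (sides["src_rank"], sides["dst_rank"])
        for connection_id, sides in by_connection.items()
        if len(sides) == 2
    }


PAIRS = pair_p2p_ops(OPS)
```

Connection IDs that appear on only one rank are dropped from the result, which makes missing counterparts easy to spot by comparing against the full set of IDs.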
pp_chart
Note: This capability requires lightweight instrumentation before and after forward and backward passes. Use msprof-analyze for processing and MindStudio Insight for result visualization.
Instrumentation
Taking DualpipeV2 as an example, locate the forward and backward pass code and add the following code to dualpipev_schedules.py (for reference only; ensure the code is added at the correct location):
```python
import torch_npu

# WeightGradStore and the step functions wrapped below must already be
# available in dualpipev_schedules.py at the point where this code is added.


def step_wrapper(func, msg: str):
    """Wrap func in an MSTX range whose message carries msg and the microbatch."""
    def wrapper(*args, **kwargs):
        new_msg = {"name": msg}
        # Relabel forward steps that receive extra_block_kwargs.
        if msg == "forward_step_with_model_graph" and kwargs.get("extra_block_kwargs") is not None:
            new_msg["name"] = "forward_backward_overlaping"
        if "current_microbatch" in kwargs:
            new_msg["current_microbatch"] = kwargs["current_microbatch"]
        # Skip the range when the cache is empty. The label compared here
        # must match the one passed to step_wrapper for WeightGradStore.pop.
        if msg == "WeightGradStore.pop" and len(WeightGradStore.cache) == 0:
            mstx_state_step_range_id = None
        else:
            mstx_state_step_range_id = torch_npu.npu.mstx.range_start(str(new_msg), torch_npu.npu.current_stream())
        out = func(*args, **kwargs)
        if mstx_state_step_range_id is not None:
            torch_npu.npu.mstx.range_end(mstx_state_step_range_id)
            mstx_state_step_range_id = None
        return out
    return wrapper


forward_step_with_model_graph = step_wrapper(forward_step_with_model_graph, "forward_step_with_model_graph")
forward_step_no_model_graph = step_wrapper(forward_step_no_model_graph, "forward_step_no_model_graph")
backward_step_with_model_graph = step_wrapper(backward_step_with_model_graph, "backward_step_with_model_graph")
backward_step = step_wrapper(backward_step, "backward_step")
WeightGradStore.pop = step_wrapper(WeightGradStore.pop, "WeightGradStore.pop")
```
Add metadata when collecting profile data:
```python
import json

# prof is the profiler object used for data collection.
# Replace microbatch_num with the actual value.
prof.add_metadata('pp_info', json.dumps(
    {
        'pp_type': 'dualpipev',
        'microbatch_num': 10,
    }
))
```
StepTaskInfo
Description:
Provides a table for visualized display. This table is generated by processing the db format cluster profile data instrumented in the previous section.
Table fields
| Field | Type | Description |
|---|---|---|
| name | TEXT | Forward and backward propagation information |
| startNs | INTEGER | Start time on the device |
| endNs | INTEGER | End time on the device |
| type | INTEGER | Type (different types are displayed in different colors) |
Communication
When profiler_level is set to Level_none, the COMMUNICATION_OP table is not generated. Use the mstx2commop analysis feature to convert built-in communication instrumentation data into communication operators to generate this table. The PP chart can also display send and recv operators.
Once the COMMUNICATION_OP table exists, use the p2p_pairing analysis feature to display send and recv connection lines in the PP chart. However, this feature requires level 1 or higher.
communication_group.json
Records communication group information and is generated by parsing analysis.db. collective indicates a collective communication group, and P2P indicates point-to-point communication. You can ignore this file.
stats.ipynb
- Generated when the analysis feature is set to cann_api_sum and stored in the cluster_analysis_output/CannApiSum directory. Open this file using Jupyter Notebook or MindStudio Insight to view cluster API duration information.
- Generated when the analysis feature is set to compute_op_sum and stored in the cluster_analysis_output/ComputeOpSum directory. Open this file using Jupyter Notebook or MindStudio Insight to view cluster computation operator duration analysis results (summarizing all cluster computation operators in charts) and cluster rank computation operator duration analysis results (summarizing computation operators for each rank).
- Generated when the analysis feature is set to hccl_sum and stored in the cluster_analysis_output/HcclSum directory. Open this file using Jupyter Notebook or MindStudio Insight to view cluster communication operator duration analysis results (summarizing all cluster communication operators in charts), cluster rank communication operator duration analysis results (summarizing communication operators for each rank), and top communication operator information.
- Generated when the analysis feature is set to mstx_sum and stored in the cluster_analysis_output/MstxSum directory. Open this file using Jupyter Notebook or MindStudio Insight to view MSTX instrumentation information for cluster scenarios across framework, CANN, and device sides.
- Generated when the analysis feature is set to slow_link and stored in the cluster_analysis_output/SlowLink directory. Open this file using Jupyter Notebook or MindStudio Insight to view abnormal slow link data analysis results for cluster scenarios (summarizing all cluster links in charts) and cluster slow link total duration analysis results (displaying data for detected potential slow links).
export_summary
When -m export_summary is set, the following files are generated in the ASCEND_PROFILER_OUTPUT directory of each rank.
api_statistic.csv
Description:
Provides the API statistics of each rank based on the cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| API Name | TEXT | API name |
| Count | INTEGER | Call count |
| Total Time(us) | REAL | Total duration (μs) |
| Avg Time(us) | REAL | Average duration (μs) |
| Min Time(us) | REAL | Minimum duration (μs) |
| Max Time(us) | REAL | Maximum duration (μs) |
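
A file with these columns can be post-processed with the standard csv module. The following sketch ranks APIs by total duration; the sample rows are made up, and the header row is assumed to use exactly the field names listed above.

```python
import csv
import io

# Hypothetical contents of api_statistic.csv for one rank.
SAMPLE_API_CSV = """API Name,Count,Total Time(us),Avg Time(us),Min Time(us),Max Time(us)
api_a,4,400.0,100.0,80.0,130.0
api_b,10,250.0,25.0,20.0,40.0
api_c,1,900.0,900.0,900.0,900.0
"""


def top_apis_by_total(csv_text, limit=2):
    """Return the names of the APIs with the largest total duration."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["Total Time(us)"]), reverse=True)
    return [row["API Name"] for row in rows[:limit]]


TOP_APIS = top_apis_by_total(SAMPLE_API_CSV)
```

To process real output, read the file from the ASCEND_PROFILER_OUTPUT directory of a rank instead of using the in-memory sample.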
kernel_details.csv
Description:
Provides the kernel details of each rank based on the cluster profile data in db format.
Table fields
| Field | Type | Description |
|---|---|---|
| op_name | TEXT | Operator name |
| op_type | TEXT | Operator type |
| task_type | TEXT | Task type |
| task_duration | REAL | Task duration (μs) |
| input_shapes | TEXT | Input shape |
| output_shapes | TEXT | Output shape |
| block_dim | TEXT | Block dimension |
| input_data_types | TEXT | Input data type |
| output_data_types | TEXT | Output data type |