First Network Overflow/Underflow Node Analysis
Overview
If INF or NAN values occur, dump data from multiple steps and multiple ranks is collected. An exception that occurs early propagates to subsequent nodes on the same rank and, through communication operators, to subsequent nodes on other ranks. It is therefore important to locate the first node where INF/NAN occurs.
The overflow_check function (first network overflow/underflow node analysis) analyzes PyTorch dump data to detect the node that generates INF/NAN on each rank in a multi-rank scenario. If INF/NAN is caused by communication, it finds and analyzes the first rank and node where INF/NAN occurs.
Preparations
Environment Setup
Install msProbe by referring to msProbe Installation Guide.
Data Preparation
Collect precision data by referring to Precision Data Collection in PyTorch.
Constraints
Currently, only PyTorch dump data can be analyzed.
Function Description
Description
Analyzes the dump data of PyTorch.
Syntax
msprobe overflow_check -i <input_path> -o <output_path>
Parameters
| Parameter | Required/Optional | Description |
|---|---|---|
| -i or --input_path | Required | Dump data directory. The path must point to a step-level directory, for example, -i /xxx/dump/step0. |
| -o or --output_path | Optional | Output file directory. If this parameter is not specified, the ./output directory is created and used. |
Example
msprobe overflow_check -i /xxx/dump/step0 -o ./output
Output Description
When the following log is printed, no abnormal node exists and no analysis file is generated.
Cannot find any anomaly node, no need to generate analyze file.
If an abnormal node exists, the anomaly_analyze_{timestamp}.json file is generated. The structure is as follows:
{
"rank_0": [ // Rank ID
{
"op_name": "Tensor.op_name.0.forward", // Node name
"data_info": {
"input_args": [], // Positional input arguments
"input_kwargs": {}, // Keyword input arguments
"output": [] // Outputs
},
"construct_info": [], // Node-level data
"stack_info": {} // Stack information
}
]
}
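As a minimal sketch of how the analysis file could be consumed, the Python snippet below lists the reported node names per rank. It assumes the file on disk is plain JSON with the structure shown above (the // comments being annotations only). The helper names `load_report` and `summarize_anomalies` are hypothetical and not part of msProbe.

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read an anomaly_analyze_{timestamp}.json file from disk (assumed plain JSON)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_anomalies(report: dict) -> dict:
    """Map each rank key to the list of op_name values reported for that rank."""
    return {
        rank: [node.get("op_name", "<unknown>") for node in nodes]
        for rank, nodes in report.items()
    }


# Hypothetical in-memory report mirroring the structure described above.
example_report = {
    "rank_0": [
        {
            "op_name": "Tensor.op_name.0.forward",
            "data_info": {"input_args": [], "input_kwargs": {}, "output": []},
            "construct_info": [],
            "stack_info": {},
        }
    ]
}

summary = summarize_anomalies(example_report)
print(summary)
```

For a real run, `summarize_anomalies(load_report(path))` would be called with the path of the generated file in the output directory.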
Exception Determination
Determining Abnormal Compute Nodes
If a compute node has normal inputs (no INF/NAN in Max or Min) but produces abnormal outputs (INF/NAN in Max or Min), the overflow/underflow originates at this node and may propagate to subsequent nodes.
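The compute-node rule can be sketched in Python as follows. This is an illustration only: it assumes each input or output tensor is described by a dictionary with `Max` and `Min` statistics, which is an assumption about the data layout, and the function names are hypothetical rather than msProbe APIs.

```python
import math


def is_abnormal(value) -> bool:
    """Return True if a Max/Min statistic is INF or NAN (numeric or string form)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False  # Missing or non-numeric statistics are treated as normal.
    return math.isinf(number) or math.isnan(number)


def has_anomaly(tensors) -> bool:
    """Return True if any tensor has INF/NAN in its Max or Min statistic."""
    return any(
        is_abnormal(t.get("Max")) or is_abnormal(t.get("Min")) for t in tensors
    )


def originates_overflow(inputs, outputs) -> bool:
    """Normal inputs combined with abnormal outputs mark the originating node."""
    return not has_anomaly(inputs) and has_anomaly(outputs)


# Hypothetical statistics for illustration.
normal = [{"Max": 1.0, "Min": -1.0}]
overflowed = [{"Max": float("inf"), "Min": 0.0}]
print(originates_overflow(normal, overflowed))
```

A node whose inputs are already abnormal is not reported as the origin by this check, because the anomaly was inherited from an earlier node.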
Determining Abnormal Communication Nodes
Communication nodes are classified into directed nodes (e.g. send, recv, scatter, gather, broadcast, reduce) and undirected nodes (e.g. all_gather, all_reduce, reduce_scatter, all_to_all).
For a directed node, if the src node's input is abnormal, the abnormality comes from the input itself, meaning the exception originates on one or more compute nodes that precede the src node on its rank. If the src node's input is normal but its output is abnormal, or if the dst node's output is abnormal, the abnormal data may be generated by the communication node's operations.
For an undirected node, if the input is abnormal, the abnormality comes from the input data itself, meaning the exception originates on one or more compute nodes that precede the communication node on its rank. If the input is normal but the output is abnormal, the communication node's operation generates the abnormal data.
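The rules for communication nodes can be summarized in a small decision function. The sketch below is hypothetical: the operator sets come from the classification above, while the function name, the `is_src` flag, and the returned labels are assumptions made for illustration and do not reflect msProbe internals.

```python
DIRECTED = {"send", "recv", "scatter", "gather", "broadcast", "reduce"}
UNDIRECTED = {"all_gather", "all_reduce", "reduce_scatter", "all_to_all"}


def diagnose(op: str, input_abnormal: bool, output_abnormal: bool,
             is_src: bool = True) -> str:
    """Classify where an anomaly seen at a communication node comes from.

    Returns "upstream" (earlier compute nodes on the same rank),
    "communication" (the communication operation itself), or "normal".
    """
    if op not in DIRECTED and op not in UNDIRECTED:
        raise ValueError(f"unknown communication operator: {op}")

    # Undirected nodes and the src side of directed nodes follow the same rule.
    if op in UNDIRECTED or is_src:
        if input_abnormal:
            return "upstream"
        return "communication" if output_abnormal else "normal"

    # dst side of a directed node: only the output is considered.
    return "communication" if output_abnormal else "normal"


print(diagnose("broadcast", input_abnormal=True, output_abnormal=True))
print(diagnose("all_reduce", input_abnormal=False, output_abnormal=True))
```

Note that for directed nodes the source text only says the communication operation may be responsible, so a "communication" label here is a candidate for further analysis rather than a final verdict.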
Determining the Sequence
For connected directed communication operators, the exception of the src node occurs earlier than that of the dst node. For undirected communication operators, exceptions occur at the same time.
For compute nodes on the same rank, order them by their sequence in the dump.
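One way to combine these ordering rules is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9 or later). This is an assumed approach for illustration, not necessarily how overflow_check is implemented. Node names and the `order_anomalies` helper are hypothetical, and the sketch does not model the "same time" relation between undirected operators.

```python
from graphlib import TopologicalSorter


def order_anomalies(per_rank: dict, directed_links: list) -> list:
    """Order abnormal nodes so that earlier exceptions come first.

    per_rank: rank key -> node names in dump sequence.
    directed_links: (src_node, dst_node) pairs of connected directed operators.
    """
    sorter = TopologicalSorter()
    for nodes in per_rank.values():
        for node in nodes:
            sorter.add(node)
        # Within one rank, a node dumped earlier precedes a node dumped later.
        for earlier, later in zip(nodes, nodes[1:]):
            sorter.add(later, earlier)
    # Across ranks, the src node of a directed operator precedes the dst node.
    for src, dst in directed_links:
        sorter.add(dst, src)
    return list(sorter.static_order())


# Hypothetical anomalies on two ranks connected by a send/recv pair.
per_rank = {
    "rank_0": ["r0.matmul", "r0.send"],
    "rank_1": ["r1.recv", "r1.add"],
}
links = [("r0.send", "r1.recv")]
order = order_anomalies(per_rank, links)
print(order[0])
```

The first element of the returned list is the candidate for the first abnormal node. When several nodes have no predecessor, their relative order in the result is not meaningful.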