First Network Overflow/Underflow Node Analysis

Overview

When INF or NAN values occur, dump data is collected for multiple steps across multiple ranks. An exception that occurs early propagates to subsequent nodes on the same rank and, through communication operators, to subsequent nodes on other ranks. It is therefore important to locate the first node where NAN/INF occurs.

The overflow_check function (first network overflow/underflow node analysis) analyzes PyTorch dump data to detect the node that generates INF/NAN on each rank in a multi-rank scenario. If the INF/NAN is caused by communication, the function finds and analyzes the first rank and node where INF/NAN occurs.

Preparations

Environment Setup

Install msProbe by referring to msProbe Installation Guide.

Data Preparation

Collect precision data by referring to Precision Data Collection in PyTorch.

Constraints

Currently, only PyTorch dump data can be analyzed.

Function Description

Description

Analyzes the dump data of PyTorch.

Syntax

msprobe overflow_check -i <input_path> -o <output_path>

Parameters

Parameter | Required/Optional | Description
-i or --input_path | Required | Dump data directory, which must be specified down to the step level, for example, -i /xxx/dump/step0.
-o or --output_path | Optional | Output file directory. If this parameter is not specified, the ./output directory is created by default.

Example

msprobe overflow_check -i /xxx/dump/step0 -o ./output
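
When the analysis is driven from a script, the same command can be assembled programmatically. The following is a minimal sketch that uses only the -i and -o options documented above; the helper name build_overflow_check_cmd is hypothetical and is not part of msProbe.

```python
import shlex


def build_overflow_check_cmd(input_path, output_path=None):
    """Build the argument list for `msprobe overflow_check` (hypothetical helper)."""
    cmd = ["msprobe", "overflow_check", "-i", input_path]
    # -o is optional; when omitted, the tool falls back to its default directory.
    if output_path is not None:
        cmd += ["-o", output_path]
    return cmd


cmd = build_overflow_check_cmd("/xxx/dump/step0", "./output")
print(shlex.join(cmd))
# To run the command for real, pass the list to subprocess.run(cmd, check=True).
```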

Output Description

When the following log is printed, no abnormal node exists and no analysis file is generated.

Cannot find any anomaly node, no need to generate analyze file.

If an abnormal node exists, the anomaly_analyze_{timestamp}.json file is generated. The structure is as follows:

{
  "rank_0": [ // Rank ID
    {
      "op_name": "Tensor.op_name.0.forward",  // Node name
      "data_info": {
        "input_args": [],  // input_args
        "input_kwargs": {},  // input_kwargs
        "output": []  // output
      },
      "construct_info": [],  // Node-level data
      "stack_info": {}  // Stack information
    }
  ]
}
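
To get a quick overview of a generated report, the abnormal node names can be listed per rank. The sketch below assumes the file on disk is plain JSON with the keys shown above (the // comments are explanatory only); the function name summarize_anomalies is hypothetical, and the sample data simply mirrors the structure above.

```python
import json


def summarize_anomalies(report):
    """Return {rank key: [op_name, ...]} for every rank in the report."""
    return {
        rank: [node["op_name"] for node in nodes]
        for rank, nodes in report.items()
    }


# Sample mirroring the documented structure. For a real file, load it with:
#   with open(path_to_report) as f: report = json.load(f)
sample = json.loads("""
{
  "rank_0": [
    {
      "op_name": "Tensor.op_name.0.forward",
      "data_info": {"input_args": [], "input_kwargs": {}, "output": []},
      "construct_info": [],
      "stack_info": {}
    }
  ]
}
""")

for rank, op_names in summarize_anomalies(sample).items():
    print(rank, op_names)
```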

Exception Determination

Determining Abnormal Compute Nodes

If a compute node has normal inputs (no INF/NAN in Max or Min) but produces abnormal outputs, the overflow/underflow originates at this node and may propagate to subsequent nodes.
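
The rule for compute nodes can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: each tensor is assumed to be summarized as a dict with Max and Min entries (numbers or strings such as "inf" and "nan"), and the function names are hypothetical.

```python
import math


def has_inf_nan(tensors):
    """Return True if any tensor summary has INF/NAN in its Max or Min."""
    for tensor in tensors:
        for key in ("Max", "Min"):
            value = tensor.get(key)
            try:
                number = float(value)  # accepts 1.0, "inf", "-inf", "nan", ...
            except (TypeError, ValueError):
                continue  # missing or non-numeric statistic: skip it
            if math.isinf(number) or math.isnan(number):
                return True
    return False


def originates_at_node(inputs, outputs):
    """Normal inputs plus abnormal outputs means the anomaly starts here."""
    return not has_inf_nan(inputs) and has_inf_nan(outputs)


inputs = [{"Max": 3.5, "Min": -1.0}]
outputs = [{"Max": "inf", "Min": 0.0}]
print(originates_at_node(inputs, outputs))
```

A node whose inputs are already abnormal is not flagged by originates_at_node, because in that case the anomaly was produced upstream.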

Determining Abnormal Communication Nodes

Communication nodes are classified into directed nodes (e.g. send, recv, scatter, gather, broadcast, reduce) and undirected nodes (e.g. all_gather, all_reduce, reduce_scatter, all_to_all).

For a directed node, if the input of the src node is abnormal, the abnormality comes from the input data itself, meaning that the exception originates on one or more compute nodes that precede the src node on its rank. If the src node's input is normal but its output is abnormal, or if the dst node's output is abnormal, the abnormal data may have been generated by the communication operation itself.

For an undirected node, if the input is abnormal, the abnormality comes from the input data itself, meaning that the exception originates on one or more compute nodes that precede the communication node on the same rank. If the input is normal but the output is abnormal, the abnormal data is generated by the communication operation.
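
The rules for communication nodes can be condensed into a small decision function. This is a sketch under assumptions: the operator sets are the examples listed above, the boolean flags are assumed to come from a prior INF/NAN check on the node's input and output, and the function name and return labels are hypothetical.

```python
DIRECTED_OPS = {"send", "recv", "scatter", "gather", "broadcast", "reduce"}
UNDIRECTED_OPS = {"all_gather", "all_reduce", "reduce_scatter", "all_to_all"}


def diagnose_comm_node(op, input_abnormal, output_abnormal, role=None):
    """Classify one communication node as 'upstream', 'communication', or 'normal'.

    'upstream': the input is already abnormal, so look at preceding compute nodes.
    'communication': the communication operation may have generated the anomaly.
    role is 'src' or 'dst' and is only required for directed operators.
    """
    if op in UNDIRECTED_OPS or (op in DIRECTED_OPS and role == "src"):
        if input_abnormal:
            return "upstream"
        return "communication" if output_abnormal else "normal"
    if op in DIRECTED_OPS and role == "dst":
        # For the dst side, only the output is examined.
        return "communication" if output_abnormal else "normal"
    raise ValueError(f"unsupported operator or role: {op!r}, {role!r}")


print(diagnose_comm_node("broadcast", False, True, role="src"))
print(diagnose_comm_node("all_reduce", True, True))
```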

Determining the Sequence

For connected directed communication operators, the exception of the src node occurs earlier than that of the dst node. For undirected communication operators, exceptions occur at the same time.

Compute nodes are ordered according to their sequence in the dump.
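
The ordering rules can be illustrated with two small helpers. This is a sketch only: the record fields dump_index and role, as well as the function names, are hypothetical and do not reflect the tool's internal data model.

```python
def order_compute_nodes(nodes):
    """Sort abnormal compute nodes by their position in the dump."""
    return sorted(nodes, key=lambda node: node["dump_index"])


def earlier_of_pair(node_a, node_b, directed):
    """Return the earlier of two connected communication anomalies.

    Returns None when the two are treated as simultaneous, which is the
    case for undirected operators.
    """
    if not directed:
        return None
    for node in (node_a, node_b):
        if node["role"] == "src":
            return node  # the src side precedes the dst side
    return None


compute = [
    {"op_name": "op_b", "dump_index": 7},
    {"op_name": "op_a", "dump_index": 2},
]
print([node["op_name"] for node in order_compute_nodes(compute)])

src = {"op_name": "send_0", "role": "src"}
dst = {"op_name": "recv_0", "role": "dst"}
print(earlier_of_pair(dst, src, directed=True)["op_name"])
```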