msProbe: Functional Modules, Scenarios, and Limitations

1. PyTorch

Function	Overview	Application Scenario/Advantage	Limitations
Data dump	Collect the forward and backward input and output data of APIs or modules during model training, including the hierarchy, statistical values, real data, and call stacks.	1. Save the forward and backward input and output data of APIs or modules in a model for analysis. 2. When overflow or underflow occurs in a model, identify which APIs or modules overflowed or underflowed.	1. API-level data collection supports only APIs in the trustlist. 2. The tool performs some synchronization operations, which may mask synchronization issues in the model. 3. Only a limited number of APIs or modules with in-place operations are supported. 4. Parameters and parameter gradients cannot currently be collected.
Offline pre-check	Create test cases for each API in the network and use several comparison algorithms to determine whether each API's precision on the NPU meets the requirements, helping quickly identify APIs with precision differences.	1. Preliminarily check the precision of all APIs in a model. 2. The precision check is not affected by the model's accumulated error.	1. A GPU environment is required. 2. Communication operators are not supported. 3. Only some fusion operators are supported.
Network-wide comparison	Calculate precision error metrics between NPU data and benchmark data across the entire network, and mark APIs or modules with abnormal precision, helping quickly locate root causes.	1. Locate suspicious operators with precision issues across the entire network.	1. Suspicious operators are identified from the dump data of the entire network, so localization is affected by accumulated error. 2. The comparison takes a long time for large models.
Training status monitoring	Collect activations, gradients, and optimizer state during model training, helping analyze exceptions in the computing, communication, and optimizer stages.	1. Monitor module-level statistics to quickly identify abnormal modules, for example, those where the loss value is NaN.	1. Only module-level statistics can be analyzed. 2. Only the Megatron and DeepSpeed frameworks are supported. 3. A small amount of additional time and memory is required.
Visualized comparison	Parse dumped precision data, restore the model graph structure, and compare the precision data at each layer, helping you understand the model structure and analyze precision issues.	1. Compare network-wide precision to locate suspicious operators, display the comparison result in a browser, and quickly search for suspicious operators. 2. View the results of each model layer and compare differences in model layer structure.	1. Suspicious operators are identified from the dump data of the entire network, so localization is affected by accumulated error. 2. The comparison takes a long time for large models.
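
To make "precision error metrics" in the network-wide comparison row more concrete, the sketch below computes three metrics commonly used when comparing a device output against a benchmark: cosine similarity, maximum absolute error, and maximum relative error. It is only an illustration in plain Python. The function name `compare_tensors`, the `eps` parameter, and the choice of metrics are assumptions made for this example and are not taken from msProbe, whose actual metrics and thresholds are not described in this document.

```python
import math


def compare_tensors(npu, bench, eps=1e-12):
    """Return simple error metrics for two equally sized flat sequences.

    Hypothetical helper for illustration only; not an msProbe API.
    """
    if len(npu) != len(bench):
        raise ValueError("inputs must have the same number of elements")
    if not npu:
        raise ValueError("inputs must not be empty")

    # Cosine similarity: close to 1.0 when the two outputs point the same way.
    dot = sum(a * b for a, b in zip(npu, bench))
    norm_npu = math.sqrt(sum(a * a for a in npu))
    norm_bench = math.sqrt(sum(b * b for b in bench))
    if norm_npu and norm_bench:
        cosine = dot / (norm_npu * norm_bench)
    else:
        cosine = float("nan")  # undefined when either vector is all zeros

    # Element-wise absolute error and relative error against the benchmark.
    abs_errs = [abs(a - b) for a, b in zip(npu, bench)]
    max_abs_err = max(abs_errs)
    max_rel_err = max(e / (abs(b) + eps) for e, b in zip(abs_errs, bench))

    return {
        "cosine": cosine,
        "max_abs_err": max_abs_err,
        "max_rel_err": max_rel_err,
    }


if __name__ == "__main__":
    # Made-up values standing in for one API's NPU output and its benchmark.
    print(compare_tensors([1.0, 2.0, 4.0], [1.0, 2.0, 3.0]))
```

Because each layer consumes the output of the previous one, an error found this way in a late layer may have originated earlier in the network. This is the accumulated-error limitation listed for the network-wide and visualized comparisons.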

2. MindSpore

Function	Overview	Application Scenario/Advantage	Limitations
Data dump	Collect the forward and backward input and output data of APIs or cells during model training, including the hierarchy, statistical values, real data, and call stacks.	1. Save the forward and backward input and output data of APIs or cells in a model for analysis. 2. When overflow or underflow occurs in a model, identify which APIs or cells overflowed or underflowed.	1. API-level data collection supports only APIs in the trustlist. 2. Only a limited number of APIs or cells with in-place operations are supported. 3. Parameters and parameter gradients cannot currently be collected.
Offline pre-check	Create test cases for each API in the network and use several comparison algorithms to determine whether each API's precision on the NPU meets the requirements, helping quickly identify APIs with precision differences.	1. Preliminarily check the precision of all APIs in a model. 2. The precision check is not affected by the model's accumulated error.	1. Only `mindspore.mint` APIs are supported.
Network-wide comparison	Compare NPU precision data against benchmark data, with support for intra-framework comparison within MindSpore and cross-framework comparison with PyTorch. This helps you quickly locate the API or cell with precision issues.	1. Compare static graphs within MindSpore. 2. Compare dynamic graphs within MindSpore. 3. Compare dynamic graphs across MindSpore and PyTorch.	1. Some PyTorch APIs cannot be automatically mapped to MindSpore APIs. You must configure the mapping manually.
Visualized comparison	Parse dumped precision data, restore the model graph structure, and compare the precision data at each layer, helping you understand the model structure and analyze precision issues.	1. Compare network-wide precision to locate suspicious operators, display the comparison result in a browser, and quickly search for suspicious operators. 2. View the results of each model layer and compare differences in model layer structure.	1. Suspicious operators are identified from the dump data of the entire network, so localization is affected by accumulated error. 2. The comparison takes a long time for large models.
Training status monitoring	Collect activations, gradients, and optimizer state during model training, helping analyze exceptions in the computing, communication, and optimizer stages.	1. Monitor module-level statistics to quickly identify abnormal modules, for example, those where the loss value is NaN.	1. Only module-level statistics can be analyzed. 2. Only the Megatron and DeepSpeed frameworks are supported. 3. A small amount of additional time and memory is required.
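
The data dump and training status monitoring rows both rely on per-module or per-cell statistics to find where values first go wrong. The sketch below shows that idea in plain Python under stated assumptions: it summarizes a flat list of values, flags NaN or Inf, and returns the first entry in execution order that contains a non-finite value. The names `summarize` and `first_abnormal`, the record layout, and the layer names are hypothetical and do not reflect msProbe's dump format or API.

```python
import math


def summarize(values):
    """Compute basic statistics for a flat sequence of floats.

    Hypothetical helper for illustration only; not an msProbe API.
    Non-finite entries are flagged and excluded from min/max/mean/norm.
    """
    finite = [v for v in values if math.isfinite(v)]
    return {
        "has_nan": any(math.isnan(v) for v in values),
        "has_inf": any(math.isinf(v) for v in values),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "mean": sum(finite) / len(finite) if finite else None,
        "l2_norm": math.sqrt(sum(v * v for v in finite)) if finite else None,
    }


def first_abnormal(records):
    """Return the name of the first record containing NaN or Inf, else None.

    `records` is an iterable of (name, values) pairs in execution order.
    """
    for name, values in records:
        stats = summarize(values)
        if stats["has_nan"] or stats["has_inf"]:
            return name
    return None


if __name__ == "__main__":
    # Made-up outputs for three hypothetical layers, in execution order.
    records = [
        ("layer0", [0.5, -1.0]),
        ("layer1", [1.0, float("inf")]),
        ("layer2", [float("nan")]),
    ]
    print(first_abnormal(records))
```

Scanning in execution order matters: once one module produces Inf or NaN, later modules usually inherit it, so the earliest flagged entry is the most useful starting point for analysis.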