Graph Comparison in Hierarchical Visualization (PyTorch)

Overview

This function parses the precision data dumped by msProbe, restores the model graph structure, and compares the precision data at each model layer, helping you understand the model structure and analyze precision issues.

Concepts

msProbe: short for MindStudio Probe, a precision debugging toolkit that locates precision issues during model training or inference.
dump: a process of collecting precision data.

Usage Process

Install the tool and collect data. For details, see Preparations.
Use the command line tool to generate a graph structure file. For details, see Hierarchical Visualization Overview.
Start the TensorBoard service. For details, see Starting TensorBoard.
Use a browser to view the graph structure and analyze the model structure and precision data. For details, see Viewing Results in Browser.

Tool Features

Supports model structure reconstruction.
Supports comparison of the structure differences between two models.
Supports comparison of the precision data between two models.
Supports overflow/underflow detection of model data.
Supports batch graph construction in multi-rank scenarios, associates communication nodes of each rank, and analyzes data transfer among ranks.
Supports node name search, node filtering based on precision comparison results, and node filtering based on overflow/underflow detection results; automatically expands the level where a node is located.
Supports cross-suite model comparison.
Supports precision data comparison between two models under different parallelism policies.
Supports visualized conversion of dump data on the browser page. For details, see Visualized Dump Data Conversion.

vis_show

Preparations

Environment Setup

Install msProbe by referring to msProbe Installation Guide.

If you choose to compile and install msProbe, you must configure --include-mod tb_graph_ascend in the compilation command to build the hierarchical visualization plugin.

Note: msProbe has integrated tb_graph_ascend. If an earlier version of tb_graph_ascend has been installed in the current environment, run the pip uninstall tb_graph_ascend command to uninstall it to avoid conflicts.

Data Preparation

Collect model structure data: Set level to L0 (module information) or mix (module and API information). The content of the collection result file construct.json cannot be empty. For details, see Precision Data Collection in PyTorch.
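The non-empty requirement on construct.json can be checked before graph construction. The following is a minimal sketch, not part of msProbe: the function name `construct_is_usable` is hypothetical, and it assumes that construct.json sits directly in the directory passed in.

```python
import json
from pathlib import Path


def construct_is_usable(dump_dir):
    """Return True if dump_dir/construct.json exists and holds non-empty JSON.

    Assumption: construct.json is located directly under dump_dir.
    """
    path = Path(dump_dir) / "construct.json"
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # An empty or truncated file is treated as unusable.
        return False
    # An empty object or list means no structure data was collected.
    return bool(data)
```

If the check returns False, re-collect the data with level set to L0 or mix before building the graph.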

Constraints

Only the PyTorch framework is supported.
For details about supported PyTorch versions, see Release Notes.

Hierarchical Visualization Overview

Single-Graph Construction

Function

Displays the model structure, precision data, and stack information, and provides the overflow/underflow detection function. It is applicable to scenarios where the model structure and data overflow/underflow need to be analyzed.

Precautions

The model structure data to be collected must be available. Ensure that the dump level is set to L0 (module information) or mix (module and API information). The content of the collection result file construct.json cannot be empty.

Syntax

msprobe graph_visualize -tp <target_path> -o <output_path> [-oc] [-tensor_log] [-progress_log]

Parameters

Parameter	Mandatory (Yes/No)	Description
`-tp` or `--target_path`	Yes	Comparison path on the debugging side. The value is of the string type. The tool automatically performs single-rank build, multi-rank batch build, or multi-step batch build based on the path format.
`-o` or `--output_path`	Yes	Directory for storing the graph construction result file. The value is of the string type. The file name is automatically generated based on the timestamp in the format of `build_{timestamp}.vis.db`.
`-oc` or `--overflow_check`	No	Whether to enable overflow/underflow detection. After it is enabled, the overflow/underflow level of each overflowed/underflowed node is marked in the output .db file (`build_{timestamp}.vis.db`). If this parameter is configured, the function is enabled. By default, this parameter is not configured.
`-tensor_log` or `--is_print_compare_log`	No	Whether to enable log printing for a single module or API. Only the tensor data dumped by msProbe is supported. If this parameter is configured, the function is enabled. By default, this parameter is not configured.
`-progress_log` or `--is_print_progress_log`	No	Whether to enable log printing for the detailed task progress. If this parameter is configured, the function is enabled. By default, this parameter is not configured.

Example 1: Construct a single-rank graph.

msprobe graph_visualize -tp ./target_path/step0/rank0 -o ./output_path

The -tp format must comply with the single-rank format listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.

Example 2: Construct multi-rank graphs in batches.

msprobe graph_visualize -tp ./target_path/step0 -o ./output_path

The -tp format must comply with the multi-rank format listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.

Example 3: Construct multi-step graphs in batches.

msprobe graph_visualize -tp ./target_path -o ./output_path

The -tp format must comply with the multi-step format listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.
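To see which of the three examples a given path corresponds to, the directory layout can be inspected. The sketch below is only an illustration and does not reproduce the tool's own detection logic. It assumes that directories are named `step<N>` and `rank<N>` as in the example paths; the function name `guess_build_mode` is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming, taken from the example paths (step0, rank0).
STEP_DIR = re.compile(r"step\d+")
RANK_DIR = re.compile(r"rank\d*")


def guess_build_mode(target_path):
    """Guess whether a path looks single-rank, multi-rank, or multi-step."""
    path = Path(target_path)
    if RANK_DIR.fullmatch(path.name):
        return "single-rank"
    children = [p.name for p in path.iterdir() if p.is_dir()]
    if any(RANK_DIR.fullmatch(name) for name in children):
        return "multi-rank"
    if any(STEP_DIR.fullmatch(name) for name in children):
        return "multi-step"
    return "unknown"
```

The authoritative format rules remain those in Dump File Requirements for Graph Construction in Hierarchical Visualization.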

Example 4: Perform overflow/underflow detection on a single graph.

msprobe graph_visualize -tp ./target_path -o ./output_path -oc

In the output result, each graph node is marked with an overflow/underflow detection metric. The metrics are as follows:

medium: abnormal input; normal output
high: abnormal input; abnormal output. The norm value of the output is abnormally larger than that of the input.
critical: normal input; abnormal output
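The three levels can be read as a small decision rule over the input and output state of a node. The sketch below restates that rule for illustration only. The function and parameter names are hypothetical, and how the tool decides that a value or norm is "abnormal" is not described here, so those decisions are passed in as flags.

```python
def overflow_level(input_abnormal, output_abnormal, output_norm_much_larger=False):
    """Map the abnormal-input/abnormal-output state of a node to a level.

    Returns "medium", "high", "critical", or None when no level applies.
    """
    if input_abnormal and not output_abnormal:
        return "medium"
    if input_abnormal and output_abnormal:
        # Assumption: "high" additionally requires the output norm to be
        # abnormally larger than the input norm; otherwise no level is given.
        return "high" if output_norm_much_larger else None
    if output_abnormal:
        # Normal input but abnormal output: the node itself is suspect.
        return "critical"
    return None
```

Read this way, critical nodes are the first candidates to inspect, because the abnormal values originate inside the node rather than being inherited from its input.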

Output Description

In the configured output path, a .vis.db file is generated. The file name is automatically generated based on the timestamp in the format of build_{timestamp}.vis.db.
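Because each run writes a new timestamped file, the output directory can accumulate several results. The following sketch picks the most recent one. It is an illustration with a hypothetical function name, and it sorts by file modification time rather than parsing the timestamp, because the timestamp format is not specified here.

```python
from pathlib import Path


def newest_vis_db(output_path, prefix="build"):
    """Return the most recently modified <prefix>_*.vis.db file, or None."""
    candidates = list(Path(output_path).glob(f"{prefix}_*.vis.db"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Pass `prefix="compare"` to look for the files produced by dual-graph comparison instead.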

Dual-Graph Comparison

Function

Displays model structure, structural differences, precision data, precision comparison metrics, and suspected precision issues (where larger differences in precision metrics appear in deeper colors). Additionally, cross-suite comparison, overflow/underflow detection, and fuzzy matching are supported.

Currently, three types of dump data are supported. The hierarchical visualization tool automatically determines the data type during comparison:

Statistics: Only the input and output data statistics of APIs and modules are dumped, which occupies a small amount of drive space.
Real data (tensor): The input and output data statistics of APIs and modules are dumped, with tensors saved to drive. This type consumes significant drive space but provides more accurate comparison results.
MD5: The input and output data statistics and CRC-32 information of APIs and modules are dumped.

For details about how to configure the dump type, see Configuration File Introduction.

Precautions

Syntax

msprobe graph_visualize -tp <target_path> -gp <golden_path> -o <output_path> [-lm] [-oc] [-fm] [-tensor_log] [-progress_log]

Parameters

Parameter	Mandatory (Yes/No)	Description
`-tp` or `--target_path`	Yes	Comparison path on the debugging side. The value is of the string type. The tool automatically performs single-rank comparison, multi-rank batch comparison, or multi-step batch comparison based on the path format.
`-gp` or `--golden_path`	No (Mandatory in the dual-graph comparison scenario)	Comparison path on the benchmark side. The value is of the string type. If this parameter is not set, single-graph construction is performed.
`-o` or `--output_path`	Yes	Directory for storing the graph construction result file. The value is of the string type. The file name is automatically generated based on the timestamp in the format of `compare_{timestamp}.vis.db`.
`-lm` or `--layer_mapping`	No	Cross-suite comparison, for example, comparison of a model that uses both the DeepSpeed and Megatron suites. If this parameter is configured, cross-suite layer comparison is enabled. After the layers in the model code are specified, the corresponding modules or APIs of the dump data can be identified. You need to specify a custom mapping file in .yaml format. For details about the format and configuration of the custom mapping file, see Custom Layer Mapping File and Configuring Layer Mapping for Hierarchical Model Visualization. After this parameter is configured, comparison is performed only by node name, and the type and shape of a node are ignored. Node naming format: `{Module}.{module_name}.{class_name}.{forward/backward}.{number_of_calls}` • If the values of `module_name` are different, specify a custom mapping file using the `-lm` parameter, for example, `-lm mapping.yaml`. • If the values of `module_name` are the same but the values of `class_name` are different, directly configure the `-lm` parameter. • If the values of `module_name` and `class_name` are the same, you do not need to configure the `-lm` parameter. For details, see Data Collection and Automatic Comparison in MindSpeed and LLamaFactory.
`-oc` or `--overflow_check`	No	Whether to enable overflow/underflow detection. After it is enabled, the overflow/underflow level of each overflowed/underflowed node is marked in the output .db file (`compare_{timestamp}.vis.db`). If this parameter is configured, the function is enabled. By default, this parameter is not configured.
`-fm` or `--fuzzy_match`	No	Whether to enable fuzzy matching. If this parameter is configured, the function is enabled. By default, this parameter is not configured. For details about the differences between fuzzy matching and default matching, see Matching Description.
`-tensor_log` or `--is_print_compare_log`	No	Whether to enable log printing for a single module or API. Only the tensor data dumped by msProbe is supported. If this parameter is configured, the function is enabled. By default, this parameter is not configured.
`-progress_log` or `--is_print_progress_log`	No	Whether to enable log printing for the detailed task progress. If this parameter is configured, the function is enabled. By default, this parameter is not configured.

Example 1: Perform single-rank graph comparison.

msprobe graph_visualize -tp ./target_path/step0/rank0 -gp ./golden_path/step0/rank0 -o ./output_path

The -tp and -gp formats must comply with the single-rank format listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.

Example 2: Perform multi-rank batch graph comparison.

msprobe graph_visualize -tp ./target_path/step0 -gp ./golden_path/step0 -o ./output_path

The -tp and -gp formats must comply with the multi-rank format listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.

Example 3: Perform multi-step batch graph comparison.

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path

The -tp and -gp formats must comply with the multi-step format listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.

Example 4: Perform cross-suite comparison.

If the node names on the debugging side are the same as those on the benchmark side, specify only the -lm parameter.

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path -lm

If the node names on the debugging side are different from those on the benchmark side, you need to configure a custom mapping file. Pass the path of the custom mapping file to the -lm parameter. For details about how to configure the mapping file, see the provided parameter description.

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path -lm ./mapping.yaml
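The choice between plain `-lm`, `-lm` with a mapping file, and no `-lm` follows from the node naming format given in the parameter description. The sketch below applies that rule to one pair of corresponding node names. It is an illustration only: the function names are hypothetical, and it assumes that `module_name` may itself contain dots, so the name is parsed from the right.

```python
def split_node_name(name):
    """Split '{Module}.{module_name}.{class_name}.{forward/backward}.{calls}'.

    Assumption: module_name may contain dots, so fields are taken from the right.
    """
    parts = name.split(".")
    if len(parts) < 5:
        raise ValueError(f"unexpected node name: {name}")
    module_name = ".".join(parts[1:-3])
    class_name = parts[-3]
    return module_name, class_name


def layer_mapping_hint(target_node, golden_node):
    """Suggest how to set -lm for a pair of corresponding nodes."""
    t_module, t_class = split_node_name(target_node)
    g_module, g_class = split_node_name(golden_node)
    if t_module != g_module:
        return "-lm <mapping.yaml>"  # module names differ: mapping file needed
    if t_class != g_class:
        return "-lm"                 # only class names differ
    return ""                        # names match: -lm not needed
```

For example, a pair that differs only in `class_name` yields `-lm`, which corresponds to the first command in this example.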

Example 5: Perform overflow/underflow detection.

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path -oc

In the output result, each graph node is marked with an overflow/underflow detection metric. The metrics are as follows:

medium: abnormal input; normal output
high: abnormal input; abnormal output. The norm value of the output is abnormally larger than that of the input.
critical: normal input; abnormal output

Example 6: Perform fuzzy matching.

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path -fm

For details about the differences between fuzzy matching and default matching, see Matching Description.

Output Description

In the configured output path, a .vis.db file is generated. The file name is automatically generated based on the timestamp in the format of compare_{timestamp}.vis.db.

Model Structure Comparison

Function

Focuses on the model structure rather than the training process data. For example, this function ensures the consistency of the model structure before and after model migration, or determines whether the precision difference is caused by the model structure difference.

Precautions

When using msProbe to collect model data, collect only the model structure (task=structure). This configuration prevents the collection of model training process data, significantly reducing the collection time.

For details about the dump configuration, see [Dump Configuration Example](../dump/config_json_introduct.md#task = structure).

Syntax

See the syntax in Dual-Graph Comparison.

Parameters

See the parameter description in Dual-Graph Comparison.

Example

See examples 1, 2, and 3 in Dual-Graph Comparison.

Output Description

In the configured output path, a .vis.db file is generated. The file name is automatically generated based on the timestamp in the format of compare_{timestamp}.vis.db.

Graph Merging Under Different Parallelism Policies

Function

Different model parallelism policies lead to precision discrepancies between two models, requiring a network-wide data comparison. However, because partitioned data and model structures are distributed across multiple ranks, direct comparison is not possible. Therefore, the distributed data and model structures must be merged before comparison.

Precautions

The supported model parallelism policies include Tensor Parallelism (TP), Pipeline Parallelism (PP), and Virtual Pipeline Parallelism (VPP). Context Parallelism (CP) and Expert Parallelism (EP) are not supported.
Graph merging is supported for models based on Megatron and MindSpeed-LLM. The graph merging effect of models based on other suites is to be verified.
Only the statistics data dumped by msProbe is supported. The level must be set to L0 or mix.
During comparison in graph merging mode, ensure that the Data Parallelism (DP) configuration is the same on both sides, because each DP group produces one merged graph. For example, rank=8 with tp=1 and pp=8 gives dp=1 and produces a single merged graph, whereas rank=8 with tp=1 and pp=4 gives dp=2 and produces two merged graphs. Currently, comparison between sides that produce different numbers of merged graphs is not supported.
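The DP arithmetic above can be checked before running the tool. This sketch uses the relationship `rank_size = tp * pp * cp * dp` with `cp=1`, as stated for the `--rank_size` parameter. The function name is hypothetical.

```python
def merged_graph_count(rank_size, tp, pp, cp=1):
    """Return dp, which is also the number of merged graphs produced."""
    group = tp * pp * cp
    if rank_size % group != 0:
        raise ValueError("rank_size must be a multiple of tp * pp * cp")
    return rank_size // group


# The two configurations from the text above.
dp_a = merged_graph_count(rank_size=8, tp=1, pp=8)
dp_b = merged_graph_count(rank_size=8, tp=1, pp=4)
comparable = dp_a == dp_b  # both sides must yield the same number of graphs
```

Here the first configuration yields one merged graph and the second yields two, so the pair cannot be compared.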

Syntax

msprobe graph_visualize -tp <target_path> [-gp <golden_path>] -o <output_path> [options]

Parameters

Parameter	Mandatory (Yes/No)	Description
`-tp` or `--target_path`	Yes	Comparison path on the debugging side. The value is of the string type. The tool automatically performs single-rank comparison, multi-rank batch comparison, or multi-step batch comparison based on the path format.
`-gp` or `--golden_path`	No	Comparison path on the benchmark side. The value is of the string type. If this parameter is not set, single-graph construction is performed.
`-o` or `--output_path`	Yes	Directory for storing the graph construction result file. The value is of the string type. The file name is automatically generated based on the timestamp in the format of `compare_{timestamp}.vis.db`.
`-lm` or `--layer_mapping`	No	Cross-suite comparison, for example, comparison of a model that uses both the DeepSpeed and Megatron suites. If this parameter is configured, cross-suite layer comparison is enabled. After the layers in the model code are specified, the corresponding modules or APIs of the dump data can be identified. You need to specify a custom mapping file in .yaml format. For details about the format and configuration of the custom mapping file, see Custom Layer Mapping File and Configuring Layer Mapping for Hierarchical Model Visualization. After this parameter is configured, comparison is performed only by node name, and the type and shape of a node are ignored. Node naming format: `{Module}.{module_name}.{class_name}.{forward/backward}.{number_of_calls}` • If the values of `module_name` are different, specify a custom mapping file using the `-lm` parameter, for example, `-lm mapping.yaml`. • If the values of `module_name` are the same but the values of `class_name` are different, directly configure the `-lm` parameter, for example, `-lm`. • If the values of `module_name` and `class_name` are the same, you do not need to configure the `-lm` parameter. For details, see Data Collection and Automatic Comparison in MindSpeed and LLamaFactory.
`-oc` or `--overflow_check`	No	Whether to enable overflow/underflow detection. After it is enabled, the overflow/underflow level of each overflowed/underflowed node is marked in the output .db file (`compare_{timestamp}.vis.db`). If this parameter is configured, the function is enabled. By default, this parameter is not configured.
`-fm` or `--fuzzy_match`	No	Whether to enable fuzzy matching. If this parameter is configured, the function is enabled. By default, this parameter is not configured. For details about the differences between fuzzy matching and default matching, see Matching Description.
`-tensor_log` or `--is_print_compare_log`	No	Whether to enable log printing for a single module or API. Only the tensor data dumped by msProbe is supported. If this parameter is configured, the function is enabled. By default, this parameter is not configured.
`-progress_log` or `--is_print_progress_log`	No	Whether to enable log printing for the detailed task progress. If this parameter is configured, the function is enabled. By default, this parameter is not configured.
--rank_size	No (mandatory only in the graph merging scenario)	Number of accelerator cards used for model training. The value is of the int type. `rank_size=tp*pp*cp*dp`. CP is not supported currently. Therefore, `cp=1` is configured by default for graph merging.
--tp	No (mandatory only in the graph merging scenario)	TP size. The value is of the int type. In the actual training script, `--tensor-model-parallel-size T` needs to be specified, where `T` indicates the TP size, that is, the `tp` parameter required for graph merging (`tp=T`).
--pp	No (mandatory only in the graph merging scenario)	Number of pipeline parallel stages. The value is of the int type. In the actual training script, `--pipeline-model-parallel-size P` needs to be specified, where `P` indicates the number of pipeline parallel stages, that is, the `pp` parameter required for graph merging (`pp=P`).
--vpp	No	Number of virtual pipeline parallel stages. The value is of the int type. VPP depends on pipeline parallelism. In the actual training script, you need to specify `--num-layers-per-virtual-pipeline-stage V`, where `V` indicates the number of layers in each virtual pipeline stage, and specify `--num-layers L`, where `L` indicates the total number of model layers. For graph merging, `vpp=L/V/P` is required. The `vpp` parameter is optional. The default value is `1`, indicating that VPP is disabled.
--order	No	Sorting order of model parallelism policies. The value is of the string type. The default value for Megatron is `tp-cp-ep-dp-pp`. If msProbe is used to dump data and the specified level is `L0`, and `order` in the actual training script is not the default value (for example, `--use-tp-pp-dp-mapping` is specified in the actual training script), pass the modified `order`. If the specified level for data dumping is `mix`, no modification is required.
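The `vpp` value is derived from the training script arguments rather than passed through directly. The following sketch works through the `vpp=L/V/P` relationship from the `--vpp` description. The function name and the sample numbers are illustrative and are not taken from a real training script.

```python
def vpp_size(num_layers, layers_per_virtual_stage, pp):
    """Compute vpp = L / V / P for the --vpp option.

    num_layers               -> L (--num-layers)
    layers_per_virtual_stage -> V (--num-layers-per-virtual-pipeline-stage)
    pp                       -> P (--pipeline-model-parallel-size)
    """
    denominator = layers_per_virtual_stage * pp
    if num_layers % denominator != 0:
        raise ValueError("L must be divisible by V * P")
    return num_layers // denominator


# Illustrative numbers: 32 layers, 2 layers per virtual stage, 8 pipeline stages.
vpp = vpp_size(num_layers=32, layers_per_virtual_stage=2, pp=8)
```

With these numbers each pipeline stage holds two virtual stages, so `--vpp` is set to 2 for that side.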

Example

Example 1: Comparison in graph merging mode with different TP sizes

target_path: eight ranks (tp=8); golden_path: four ranks (tp=4)

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path --rank_size 8 4 --tp 8 4 --pp 1 1

Example 2: Comparison in graph merging mode with different PP sizes

target_path: eight ranks (pp=8); golden_path: one rank (pp=1)

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path --rank_size 8 1 --tp 1 1 --pp 8 1

Example 3: Comparison in graph merging mode with different VPP sizes

target_path: eight ranks (pp=8); golden_path: eight ranks (pp=8, vpp=2)

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path --rank_size 8 8 --tp 1 1 --pp 8 8 --vpp 1 2

Example 4: Comparison in graph merging mode with different PP and TP sizes

target_path: eight ranks (pp=8); golden_path: eight ranks (tp=8)

msprobe graph_visualize -tp ./target_path -gp ./golden_path -o ./output_path --rank_size 8 8 --tp 1 8 --pp 8 1

In all the preceding examples, the formats of target_path and golden_path must meet the multi-rank or multi-step format requirements listed in Dump File Requirements for Graph Construction in Hierarchical Visualization.
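In the examples above, each parallelism option takes two values: the first for target_path and the second for golden_path. The sketch below assembles such a command from two per-side settings so that the values cannot be swapped by accident. It only builds the argument list and does not run msProbe. The function name and the dictionary layout are hypothetical.

```python
import shlex


def merge_compare_argv(target_path, golden_path, output_path, target, golden):
    """Build the graph merging command; target/golden are per-side settings.

    Each settings dict needs "rank_size", "tp", and "pp"; "vpp" is optional
    and defaults to 1 (VPP disabled).
    """
    argv = ["msprobe", "graph_visualize",
            "-tp", target_path, "-gp", golden_path, "-o", output_path]
    for key in ("rank_size", "tp", "pp"):
        # Target value first, golden value second.
        argv += [f"--{key}", str(target[key]), str(golden[key])]
    if "vpp" in target or "vpp" in golden:
        argv += ["--vpp", str(target.get("vpp", 1)), str(golden.get("vpp", 1))]
    return argv


# Settings corresponding to Example 3.
command = shlex.join(merge_compare_argv(
    "./target_path", "./golden_path", "./output_path",
    target={"rank_size": 8, "tp": 1, "pp": 8},
    golden={"rank_size": 8, "tp": 1, "pp": 8, "vpp": 2},
))
```

The resulting list can be passed to `subprocess.run` once the paths and sizes have been checked.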

Starting TensorBoard

Server with Direct Connectivity

Pass the directory where the .vis.db file is generated (out_path in the following command) to --logdir.

tensorboard --logdir out_path --bind_all

Logs are printed after TensorBoard is started.

tensorboard_1

In the preceding figure, ubuntu is the server address, and 6008 is the port number.

Replace ubuntu with the actual server address. For example, if the actual server address is 10.123.456.78, enter http://10.123.456.78:6008 in the address box of the browser.

Server Without Direct Connectivity

If the link is inaccessible (for example, the server cannot be directly connected and a VPN is required), try one of the following methods:

Method 1: Manually set a proxy for the local computer network. For example, in Windows 10, add the server address (for example, 10.123.456.78) in the manual proxy settings.

Then, run the following command on the server:
```
tensorboard --logdir out_path --bind_all
```
Finally, enter http://10.123.456.78:6008 in the browser.

This method does not work if the firewall is enabled on the server. In that case, disable the firewall or use Method 2 or Method 3.

Method 2: Use Visual Studio Code to connect to the server and enter the following command in the Visual Studio Code terminal:
```
tensorboard --logdir out_path
```
Hold Ctrl and click the link printed in the terminal.

Method 3: Transfer the graph construction result file from the server to the local computer, and then enter the following command on the local computer:
```
tensorboard --logdir out_path
```
Hold Ctrl and click the link printed in the terminal.

Viewing Results in Browser

Open in Browser

Google Chrome is recommended. Enter the server address and port number in the address box of the browser and press Enter to access the TensorBoard page. The "/#graph_ascend" part is automatically appended.

vis_browser_1

If you have switched to another function of TensorBoard and want to return to the model hierarchical visualization page, click GRAPH_ASCEND in the upper left corner.

vis_browser_2

Result Check

The following figure shows the overall result.

Keyboard and mouse operations:

Left-click a node to select it, double-click a node to expand or collapse it, or right-click a node to expand the menu bar and select related functions.
Scroll the mouse wheel to move the graph up or down.
Press W/S to zoom in or out on the graph, and press A/D to move the graph left or right.

Click the icons on the left of the page to use different functions. The following figure shows the page, and the following table describes the basic operations.

No.	Description
1	Data selection: You can switch between `Directory`, `Step`, `Rank`, and `MicroStep`. A complete training iteration (step) can be divided into multiple micro steps, each of which is one forward and backward propagation pass performed before the complete weight update. The hierarchical visualization tool identifies a complete forward and backward propagation in the first layer of the model as a micro step.
2	Precision error filtering and overflow/underflow filtering: See Precision Filtering and Overflow/Underflow Filtering.
3	Node matching: See Selecting Nodes for Mapping.
4	Node search: See Search by Name.
5	Visualized dump data conversion: See Visualized Dump Data Conversion.
6	Theme switching: You can switch the current page to light or dark theme.
7	Language switching: You can switch the language of the current page to Chinese or English.

You can click the icons in the upper right corner of the page to use different functions. The following figure shows the page, and the following table describes the basic operations.

No.	Description
1	Thumbnail on the debugging side: enabled by default.
2	Thumbnail on the benchmark side: enabled by default.
3	Sync node expansion: enabled by default.
4	Shortcut key description
5	Adaptive display

At the bottom of the page, you can switch table headers to see node information, call stack information, and data parallel merging details. The data parallel merging details are displayed only after graph merging under different parallelism policies is triggered.