Trend Visualization
Overview
The trend visualization function parses precision data collected by msProbe, identifies the module names at the model layer, and displays where each module sits across iteration steps, ranks, and the network model. This helps you observe the precision data and analyze precision problems based on the overall data distribution.
Concepts
- msProbe: short for MindStudio Probe, a precision debugging toolkit that can locate precision issues during model training or inference.
- dump: data collection function of MindStudio Probe. The collected data is called dump data.
- monitor: training status monitoring function of MindStudio Probe. The collected data is called monitor data.
- Three dimensions: dimensions of the data to be observed in the trend visualization tool, including step, rank, and module name (tensor target, distinguished by the network layer or operator name).
Usage Process
- Install the tool and collect data. For details, see Preparations.
- Use the command line tool to parse the precision data and generate an SQLite database file in .db format. For details, see Precision Data Parsing.
- Start the TensorBoard service and set the --logdir parameter to the output path of the precision data parsing function.
- Use a browser to open the TensorBoard service page and view the data in the TREND ANALYZER plugin window.
Preparations
Environment Setup
Install msProbe by referring to msProbe Installation Guide.
If you choose to compile and install msProbe, you must configure --include-mod=trend_analyzer in the compile command to build the trend visualization plugin.
Data Preparation
- Dump data (collecting model data with level set to L0 or mix)
  - For details about how to collect data in the PyTorch framework, see Precision Data Collection in PyTorch.
  - For details about how to collect data in the MindSpore framework, see Precision Data Collection in MindSpore.
- monitor data (output format set to csv)
  - For details about the collection method, see Lightweight Training Status Monitoring Tool.
Constraints
- The PyTorch and MindSpore frameworks are supported.
Precision Data Parsing
Function
Parses dump or monitor data, identifies module names at each model layer, determines the positions of module names in the step, rank, and network model based on the dump data flushing sequence, and saves the parsing result to an SQLite database file in .db format.
Precautions
- dump data: Only the data collected at the L0 or mix level (specified by level) is supported.
- monitor data: Only the data collected when the output format is set to csv is supported.
- To effectively display the data trend, the flushed data range is [-1e9, 1e9]. Data beyond this range will be truncated. The inf value will be converted to 1e9+1, and the -inf value will be converted to -1e9-1.
- There may be multiple parsed files. When you start TensorBoard, the first .trend.db file passed to the directory is opened by default. Currently, file selection is not supported.
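The value-range rule above can be illustrated with a minimal Python sketch. The function name `clamp_for_trend` is hypothetical, and the sketch assumes that "truncated" means clamping to the nearest bound; NaN handling is not described in this document, so NaN is passed through unchanged here.

```python
import math

LIMIT = 1e9  # bound of the documented range [-1e9, 1e9]


def clamp_for_trend(value: float) -> float:
    """Map a raw statistic into the documented display range (illustrative only)."""
    if math.isnan(value):
        # Not covered by the documentation; left unchanged (assumption).
        return value
    if math.isinf(value):
        # inf -> 1e9+1 and -inf -> -1e9-1, as stated in the precautions.
        return LIMIT + 1 if value > 0 else -LIMIT - 1
    # Finite values outside the range are clamped to the nearest bound (assumption).
    return max(-LIMIT, min(LIMIT, value))
```

Because infinities map to values just outside the range, they remain distinguishable from finite values that were merely truncated.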
Syntax
msprobe data2db --db <db_path> --data <data_path> [--format <format>] [--mapping <mapping_json>] [--micro_step <use_micro_step>] [--process_num <process_num>]
Parameters
| Parameter | Mandatory (Yes/No) | Description |
|---|---|---|
| --db | Yes | Directory for storing the parsing result file. The value is of the string type. An SQLite file with the .trend.db extension is generated in this directory. |
| --data | Yes | Input data path. The value is of the string type. The dump data directory or monitor data directory is supported. Dump data should be configured to the parent directory of the step folder, and monitor data should be configured to the parent directory of the rank folder. |
| --format | No | Data format. The value is of the string type. The options are auto (automatic detection), dump, and monitor. The default value is auto. |
| --mapping | No | Path of the JSON mapping file (the JSON file name must be specified, for example, ./mapping.json). The value is of the string type. When parsing precision data, the program uses a mapping file to convert model layer names or operator names, simplifying them or aligning names across steps. For details about how to configure a mapping file, see Mapping File Configuration. |
| --micro_step | No | Whether to enable micro-step counting. The value is of the Boolean type. The default value is true. If micro-step counting is enabled, a step can be split into multiple micro steps for analysis. |
| --process_num | No | Number of parallel processes, which is of the int type. The default value is 1. This parameter is used only to accelerate the parallel processing of monitor data. |
Example
Parse data files in the /data/dump_path directory, automatically identify monitor data and dump data, and save the parsed SQLite database file in .db format to the /data/db_path directory. The preceding operation is executed in a single process and counted by micro-step. Mapping is not used.
msprobe data2db --data /data/dump_path --db /data/db_path
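When the parsing step is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch: the helper `build_data2db_command` is hypothetical, and it uses only flags listed in the parameter table. The --micro_step flag is omitted because the accepted spelling of its Boolean value is not specified here.

```python
from typing import List, Optional


def build_data2db_command(
    db_path: str,
    data_path: str,
    data_format: Optional[str] = None,
    mapping_json: Optional[str] = None,
    process_num: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for `msprobe data2db` (illustrative helper)."""
    cmd = ["msprobe", "data2db", "--db", db_path, "--data", data_path]
    if data_format is not None:
        # The documented options are auto, dump, and monitor.
        if data_format not in ("auto", "dump", "monitor"):
            raise ValueError(f"unsupported format: {data_format}")
        cmd += ["--format", data_format]
    if mapping_json is not None:
        cmd += ["--mapping", mapping_json]
    if process_num is not None:
        cmd += ["--process_num", str(process_num)]
    return cmd


# Equivalent to the example command above; run it with
# subprocess.run(cmd, check=True) on a machine where msProbe is installed.
cmd = build_data2db_command("/data/db_path", "/data/dump_path")
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.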
Output Description
After the dump data parsing command is executed successfully, the dump_data.trend.db file is generated in the /data/db_path directory.
After the monitor data parsing command is executed successfully, the monitor_data.trend.db file is generated in the /data/db_path directory.
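Because the output is a regular SQLite file, it can be inspected with Python's standard sqlite3 module. The table layout of the .trend.db file is not described in this document, so the sketch below only lists the table names; `list_tables` is a hypothetical helper.

```python
import sqlite3
from typing import List


def list_tables(db_file: str) -> List[str]:
    """Return the names of all tables stored in an SQLite file."""
    # Note: sqlite3.connect creates an empty file if db_file does not exist.
    conn = sqlite3.connect(db_file)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For example, `list_tables("/data/db_path/dump_data.trend.db")` would return whatever tables the parser created in that file.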
Trend Analysis
Function
Trend analysis visualizes the statistics of tensor targets along three dimensions: steps, ranks, and module names. This helps you observe the precision data and analyze precision issues based on the overall trend distribution.
GUI Description
The following figure shows the trend visualization GUI, including area 1 (toolbar), area 2 (heat map), and area 3 (line chart).

- Area 1: toolbar, which allows you to select metrics, statistics, display dimensions, and dimension values, and provides label filtering and heat map mode settings.
- Area 2: heat map, which displays distribution of precision data across the other two dimensions based on the selected dimension's value.
- Area 3: line chart, which shows the trend of precision data for a selected point as its dimension value varies, triggered by clicking a point in the heat map under the chosen dimension value.
Usage Description
Starting TensorBoard
Server with Direct Connectivity
Pass out_path where the .trend.db file is generated to --logdir.
tensorboard --logdir out_path --bind_all
Logs are printed after TensorBoard is started.

In the preceding figure, ubuntu is the server address, and 6008 is the port number. You can specify another port number using the --port parameter.
NOTE
Replace ubuntu with the actual server address. For example, if the actual server address is 10.123.456.78, enter http://10.123.456.78:6008 in the address box of the browser.
Server Without Direct Connectivity
If the link cannot be opened (for example, the server cannot be directly connected and a VPN is required), try one of the following methods:
- Manually set a proxy for the local computer network. For example, in Windows 10, add the server address (for example, 10.123.456.78) in the manual proxy settings.
  Then, run the following command on the server:
  tensorboard --logdir out_path --bind_all
  Finally, enter http://10.123.456.78:6008 in the browser's address bar.
  NOTE
  If the firewall is enabled on the server, this method will not work. In this case, disable the firewall or try the following methods.
- Use Visual Studio Code to connect to the server and enter the following command in the Visual Studio Code terminal:
  tensorboard --logdir out_path
  Press and hold CTRL and click the link.
- Transfer the image composition result file from the server to the local computer and install msProbe on the local computer to view the image composition result.
  Enter the following command on the PC:
  tensorboard --logdir out_path
  Press and hold CTRL and click the link.
Browser
Google Chrome is recommended. Perform the following operations to access the trend visualization page.

- Enter the server address and port number in the address box of the browser and press Enter to access the TensorBoard page.
- Click TREND ANALYZER in the upper left corner to access the trend visualization page.
Heat Map
By selecting the metric, statistic, display dimension, and dimension value on the toolbar, you can view a heat map showing how precision data is distributed across the other two dimensions under the selected dimension value. The following figure shows the page, and the following table describes the detailed operations.

| No. | Description |
|---|---|
| 1 | Select a data range to be displayed by selecting the metric, statistics, dimension, and dimension value in sequence. For details, see Item Description. After selection, the corresponding heat map is loaded. |
| 2 | (Optional) Select a label from the drop-down list box or enter a label to be filtered. Only the data of the related module name is displayed. Multiple labels can be selected. For details about the label types, see Item Description. |
| 3 | (Optional) Select a heat map mode from the drop-down list box, including: • Gradient mode: Precision data is displayed using a gradient from blue to red. Smaller values appear bluer. • Segmentation mode: Precision data is displayed in different colors. |
| 4 | (Optional) Drag the heat bar to adjust the value range displayed in the heat map. |
| 5 | (Optional) Drag the slider on the X-axis or Y-axis of the heat map to adjust the axis range. |
| 6 | (Optional) Hover the cursor over the heat map to view detailed information about the data block at the mouse position. |
| 7 | (Optional) Drag the dividing line between the heat map and line chart to adjust the proportion of the heat map on the page. |
When the dimension is set to Step and parallelism strategies are applied to a Megatron model, refer to Parallelism Visualization of a Megatron Model to understand how the network layer data collected under each rank maps to the actual network-wide location.
Item Description
| Item | Description |
|---|---|
| Metric | dump data: • forward: forward process data. The tensor belongs to a network layer with the suffix forward.X (X indicates the ID) or an operator API with the suffix .forward in the dump.json file. • backward: backward process data. The tensor belongs to a network layer with the suffix backward.X (X indicates the ID) or an operator API with the suffix .backward in the dump.json file. • recompute: recomputation process data. The tensor belongs to a network layer or operator API with the is_recompute attribute set to True in the dump.json file. • parameters_grad: parameter gradient data. The tensor belongs to network layer data with the suffix parameters_grad in the dump.json file. monitor data: • Automatically extracted based on the prefix of the monitor data file. Supported items include ["actv", "actv_grad", "exp_avg", "exp_avg_sq", "grad_unreduced", "grad_reduced", "param_origin", "param_updated"]. |
| Statistic | • dump data: The value is fixed to norm, max, mean, or min, indicating the L2 norm, maximum value, mean, or minimum value, respectively. • monitor data: Automatically extracted based on the content of the .csv file. If a column contains valid data, the column name is automatically extracted as a statistical option. |
| Dimension | The options are as follows: • Step: Display all data of a single step in a heat map (the X axis represents Rank, and the Y axis represents Module Name). You can click to view the trend line chart of a single tensor target in the step dimension. • Rank: Display all data of a single rank in a heat map (the X axis represents Step, and the Y axis represents Module Name). You can click to view the trend line chart of a single tensor target in the rank dimension. • Module Name: Display all data of a single tensor target in a heat map (the X axis represents Step, and the Y axis represents Rank). You can click to view the trend line chart of a single tensor target in the Module Name dimension. |
| Label | The following types are available: • default: Indicates that the tensor target belongs to network layer data, such as "Module" and "Cell". • layer: Indicates a label in the format of xxx.N extracted from the network layer name, where xxx is a string and N is an integer representing the network layer ID, for example, "layers.0". • index: Indicates the input/output position of a tensor. For example, the 0th tensor in the input is labeled as "input.0". • module: Indicates a string label extracted from the network layer name, representing the network layer type, for example, "TransformerLayer". • function: Indicates the name of the operator API for which no layer label is extracted. |
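The suffix rules for dump metrics in the table above can be sketched as a small classifier. This is an illustration only, not the tool's implementation: the function name, the example names in the comments, and the precedence (recompute first, then parameters_grad, then forward and backward) are assumptions.

```python
import re

# "forward.X" suffix for network layers, ".forward" suffix for operator APIs.
_FORWARD = re.compile(r"\.forward(\.\d+)?$")
_BACKWARD = re.compile(r"\.backward(\.\d+)?$")


def classify_dump_metric(name: str, is_recompute: bool = False) -> str:
    """Guess the metric of a dump.json entry from its name (illustrative only)."""
    if is_recompute:
        # Entries flagged with is_recompute=True are recomputation data.
        return "recompute"
    if name.endswith("parameters_grad"):
        return "parameters_grad"
    if _FORWARD.search(name):
        return "forward"  # e.g. a hypothetical "Module.layers.0.forward.0"
    if _BACKWARD.search(name):
        return "backward"  # e.g. a hypothetical "Functional.matmul.0.backward"
    return "unknown"
```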
Line Chart
By clicking a point in the heat map, you can view the trend of precision data for that point as its dimension value varies. The following figure shows the page, and the following table describes the detailed operations.

| No. | Description |
|---|---|
| 1 | Select the basic data range to be displayed as instructed in Heat Map, and wait until the heat map is loaded. |
| 2 | Click a point in the heat map to display the trend of its precision data as the dimension value changes. You can select multiple points, and several lines are loaded at the same time for comparison. |
| 3 | (Optional) Drag the dividing line of the line chart to adjust the proportion of the line chart on the page. |
| 4 | (Optional) Click the legend of the line chart to display or hide the corresponding line. |
| 5 | (Optional) Hover the cursor over the line chart to view detailed data information of all lines at the mouse position. |
| 6 | (Optional) Drag the slider on the X-axis or Y-axis of the line chart to adjust the axis range. |
| 7 | (Optional) Drag the line chart or scroll the mouse wheel to adjust the X-axis range. |
| 8 | (Optional) Click the clear button to clear the currently displayed line chart. |
Parallelism Visualization of a Megatron Model
Function
This function visualizes the mapping between the network layer of each rank and the entire network in a heat map.
Model parallelism in Megatron distributes a model across different ranks. As a result, the model layer data collected on each node may contain only a portion of the full model, and the location of those layers within the overall model is not immediately apparent. This function offers visualization support for multi‑node model parallelism, helping you rapidly identify how model layers are mapped to each device under the current parallelism configuration.
Precautions
Only tensor parallelism, pipeline parallelism, virtual pipeline parallelism, and data parallelism are supported in the Megatron scenario.
The number of ranks should be less than or equal to 1024 and the number of model layers should be less than or equal to 256. That is, world_size ≤ 1024 and num_layers ≤ 256 must be met.
Example
- Create a Python script, for example, plot_model.py. Copy the following code to plot_model.py and modify the configuration under ParallelConfig as required.

from msprobe.core.common.megatron_utils import ParallelConfig, plot_model_parallelism

config = ParallelConfig(
    world_size=32,
    num_layers=48,
    tensor_parallel_size=4,
    pipeline_parallel_size=4,
    num_layers_per_virtual_pipeline_stage=3,
    order="tp-cp-ep-dp-pp",
    standalone_embedding_stage=False,
    output_path='./'
)
plot_model_parallelism(config)

For details about the parameters, see plot_model_parallelism.
- Run the following command to start the conversion.

python plot_model.py
Output Description
After the plot_model_parallelism API is successfully called, a png file in the format of ws{world_size}_ln{num_layers}_tp{tensor_parallel_size}_pp{pipeline_parallel_size}_vpp{virtual_pipeline_parallel_size}.png is generated in the configured output_path. virtual_pipeline_parallel_size is the size of the virtual pipeline parallel group calculated based on the input parameters such as num_layers_per_virtual_pipeline_stage.
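To predict the output file name for a given configuration, the derived sizes can be computed by hand. The sketch below assumes the usual Megatron arithmetic: DP = world_size / (TP × PP), and VPP = (num_layers / PP) / num_layers_per_virtual_pipeline_stage. The helper names are hypothetical, msProbe's internal calculation may differ, and the VPP value used in the file name when virtual pipeline parallelism is disabled is a guess.

```python
from typing import Optional


def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1) -> int:
    """DP group size, assuming world_size = DP * TP * PP."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


def expected_png_name(
    world_size: int,
    num_layers: int,
    tp: int = 1,
    pp: int = 1,
    layers_per_virtual_stage: Optional[int] = None,
) -> str:
    """Build the documented file name pattern (illustrative only)."""
    # Documented limits: world_size <= 1024 and num_layers <= 256.
    if not 1 <= world_size <= 1024:
        raise ValueError("world_size must be in [1, 1024]")
    if not 1 <= num_layers <= 256:
        raise ValueError("num_layers must be in [1, 256]")
    if layers_per_virtual_stage is None:
        vpp = 1  # assumption: the value used when VPP is disabled is not documented
    else:
        # Layers held by one pipeline stage, split into virtual stages.
        vpp = (num_layers // pp) // layers_per_virtual_stage
    return f"ws{world_size}_ln{num_layers}_tp{tp}_pp{pp}_vpp{vpp}.png"
```

With the example configuration (world_size=32, num_layers=48, TP=4, PP=4, three layers per virtual stage), this arithmetic gives DP=2 and VPP=4.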
View the PNG file.

The table below describes the PNG file.
| Field | Description |
|---|---|
| Model Parallelism Configuration | Parallel configuration set or calculated by the user, including: • Total Layers: total number of layers in a model, corresponding to num_layers in the script. • DP: data parallel group size, calculated based on the input parallel parameters. • TP: tensor parallel group size, corresponding to tensor_parallel_size in the script. • PP: pipeline parallel group size, corresponding to pipeline_parallel_size in the script. • VPP: virtual pipeline parallel group size, corresponding to virtual_pipeline_parallel_size in the file name, calculated based on the input parallel parameters. |
| TP Group | Vertical coordinate, tensor parallel group, in the format of Group{num}: Rank{start}-{end}. num indicates the group ID, and start and end indicate the IDs of the first and last ranks in the group, respectively. For example, Group0: Rank0-3 indicates group 0, which contains four ranks: rank0 to rank3. |
| Virtual Pipeline Stage | Horizontal coordinate, pipeline parallel stage or virtual pipeline parallel stage, in the format of Stage {num}. num indicates the stage ID. |
| Model Copies | Model replica legend. In data parallelism, different model replicas of input data are marked in different colors. |
| Embed/ L{start}-{end}/ Out | Text in the color matrix, indicating model layers contained in a stage of a tensor parallel group. • Embed: first stage of the model, which usually contains the embedding layer. • L{start}-{end}: model layers from start to end. For example, L1-3 indicates that the current stage contains the first, second, and third model layers of the entire model. • Out: last stage of the model, which usually contains the output layer. If a stage meets multiple definitions, they are joined with a plus sign (+). |
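The TP Group labels on the vertical axis can be reproduced with a short sketch. It assumes that, under the default tp-cp-ep-dp-pp order, each tensor parallel group consists of consecutive ranks, which matches the Group0: Rank0-3 example; `tp_group_labels` is a hypothetical helper.

```python
from typing import List


def tp_group_labels(world_size: int, tp: int) -> List[str]:
    """List labels in the Group{num}: Rank{start}-{end} format."""
    if world_size % tp != 0:
        raise ValueError("world_size must be divisible by tp")
    labels = []
    for group in range(world_size // tp):
        start = group * tp      # first rank of the group
        end = start + tp - 1    # last rank of the group
        labels.append(f"Group{group}: Rank{start}-{end}")
    return labels
```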
Appendixes
Mapping File Configuration
Mapping configuration files provide input to the --mapping parameter of the precision data parsing function.
After the --mapping parameter is configured, the parser sequentially replaces model layer or operator names in each precision data file with the keys and values specified in mapping.json. This is intended for scenarios requiring name simplification or cross-step name alignment.
The JSON file format and an example are provided below; keys and values are strings.
{
".TE": ".",
".MindSeed": "."
}
In the preceding format, the field on the left is the key (for example, .TE), and the field on the right is the value (for example, .). The preceding configuration indicates that .TE is replaced with ., and .MindSeed is replaced with ..
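The effect of such a mapping can be previewed before parsing. The sketch below assumes plain substring replacement applied in the order the keys appear in the file, which is one reading of "sequentially replaces"; `apply_mapping` and the sample layer name are hypothetical.

```python
import json
from typing import Dict


def apply_mapping(name: str, mapping: Dict[str, str]) -> str:
    """Replace every key found in name with its value, one key after another."""
    for key, value in mapping.items():
        name = name.replace(key, value)
    return name


# Same content as the mapping.json example above.
mapping = json.loads('{".TE": ".", ".MindSeed": "."}')
simplified = apply_mapping("decoder.layers.0.TELinear", mapping)
```

Here the hypothetical name decoder.layers.0.TELinear becomes decoder.layers.0.Linear, because the .TE prefix of the last component is replaced with a single dot.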
Public API
plot_model_parallelism
Prototype
plot_model_parallelism(config: ParallelConfig) -> None
Parameters
Pass the following parameters when initializing a ParallelConfig instance.
| Parameter | Input/Output | Description |
|---|---|---|
| world_size | Input | (Mandatory; int) Total number of ranks for model deployment. The value range is [1, 1024]. |
| num_layers | Input | (Mandatory; int) Total number of layers in a model. The value range is [1, 256]. |
| tensor_parallel_size | Input | (Optional; int) Tensor parallel group size. The default value is 1. In the actual training script, --tensor-model-parallel-size T is specified, where T is the specified tensor parallel group size. |
| pipeline_parallel_size | Input | (Optional; int) Pipeline parallel group size. The default value is 1. In the actual training script, --pipeline-model-parallel-size P is specified, where P is the specified pipeline parallel group size. |
| num_layers_per_virtual_pipeline_stage | Input | (Optional; int) Number of layers in each virtual pipeline stage. The default value is None, indicating that virtual pipeline parallelism is disabled. In the actual training script, --num-layers-per-virtual-pipeline-stage V is specified, where V is the number of layers in each virtual pipeline stage. |
| order | Input | (Optional; str) Sorting order of model parallelism strategies. The default Megatron setting (tp-cp-ep-dp-pp) is used. |
| standalone_embedding_stage | Input | (Optional; bool) Whether to use the embedding layer as an independent pipeline stage. True for enabled; False for disabled. The default value is False. |
| output_path | Input | (Optional; str) Output path of the visualization result. The default value is './'. |
Returns
None
FAQ
- How to use the trend visualization tool to compare the precision data files of two different experiments?

The trend visualization tool does not distinguish between benchmark experiments and comparison experiments. It only compares the precision data files based on their input paths. To compare the precision data files of two different experiments, you need to manually move the subdirectories of the two groups of files to the same directory and then use the trend visualization tool to view and compare the files.

Assume there are two dump data files: dump_path1 and dump_path2.

├── dump_path1
│   ├── step0
│   |   ├── rank0
│   |   |   ├── dump.json
│   |   |   ├── stack.json
│   |   |   └── construct.json
│   |   ├── rank1
│   ├── step1
├── dump_path2
│   ├── step0
│   |   ├── rank0
│   |   ├── rank1
│   ├── step1

You can move the dump_path1 and dump_path2 subdirectories to the same directory by appending steps.

├── dump_path_compare
│   ├── step0    # Step 0 of the original dump_path1
│   ├── step1    # Step 1 of the original dump_path1
│   ├── step2    # Step 0 of the original dump_path2
│   ├── step3    # Step 1 of the original dump_path2

After the following command is executed, the precision data of four steps can be obtained. In this case, comparing the precision data trends of step 0 and step 2 is equivalent to comparing the precision data trends of step 0 in the original dump_path1 and dump_path2.

msprobe data2db --data dump_path_compare --db ./output --format dump

You can also move the dump_path1 and dump_path2 subdirectories to the same directory by appending ranks.

├── dump_path_compare
│   ├── step0
│   |   ├── rank0    # step 0/rank 0 of original dump_path1
│   |   ├── rank1    # step 0/rank 1 of original dump_path1
│   |   ├── rank2    # step 0/rank 0 of original dump_path2
│   |   ├── rank3    # step 0/rank 1 of original dump_path2
│   ├── step1
│   |   ├── rank0    # step 1/rank 0 of original dump_path1
│   |   ├── rank1    # step 1/rank 1 of original dump_path1
│   |   ├── rank2    # step 1/rank 0 of original dump_path2
│   |   ├── rank3    # step 1/rank 1 of original dump_path2

After the command is executed, the precision data visualization results of four ranks are obtained. In this case, comparing the precision data trends of rank 0 and rank 2 in step 0 is equivalent to comparing the precision data trends of rank 0 in step 0 of the original dump_path1 and dump_path2.
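The step-appending layout described above can be prepared with a short script instead of moving directories by hand. This is a minimal sketch: `merge_by_appending_steps` is a hypothetical helper, it copies rather than moves so that the original dumps stay intact, and it assumes that renaming the step directories is all that is needed.

```python
import re
import shutil
from pathlib import Path
from typing import List

_STEP_DIR = re.compile(r"^step(\d+)$")


def merge_by_appending_steps(sources: List[str], target: str) -> List[str]:
    """Copy the step directories of each source into target, renumbered in order."""
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    created = []
    next_step = 0
    for source in sources:
        # Sort numerically so that step10 follows step9, not step1.
        step_dirs = sorted(
            (p for p in Path(source).iterdir() if p.is_dir() and _STEP_DIR.match(p.name)),
            key=lambda p: int(_STEP_DIR.match(p.name).group(1)),
        )
        for step_dir in step_dirs:
            new_name = f"step{next_step}"
            shutil.copytree(step_dir, target_dir / new_name)
            created.append(new_name)
            next_step += 1
    return created
```

Calling `merge_by_appending_steps(["dump_path1", "dump_path2"], "dump_path_compare")` on the example above would produce step0 to step3 in dump_path_compare, ready to be passed to msprobe data2db.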