Lightweight Training Status Monitoring Tool

Overview

Monitor, as a lightweight training status monitoring tool, can collect and record activations, weights, gradients, optimizer status, and intermediate values of communication operators during model training with low performance loss, and display training status in real time.

Preparations

Installation

Install msProbe by referring to msProbe Installation Guide.

Constraints

PyTorch: 2.1 or later
MindSpore: 2.4.10 or later. Only MindSpore dynamic graphs are supported. The MSAdapter suite is also supported.

Quick Start

This tool monitors the corresponding objects as required. For example, during abnormal training with more losses but normal gradient norm, this tool monitors the model's forward process. During training with abnormal gradient norm, it monitors the gradients of weights and activations. It is recommended that this tool be enabled for a long time in scenarios where the performance loss of weight gradient monitoring is small (full monitoring of a 20B dense model; time increase < 1%; memory increase < 1%). In scenarios where the performance loss of activation monitoring is large, enable this tool only when necessary or monitor only some of activations.

Preparing the Configuration File

Create a config.json file in the current directory. For details about each field in the configuration file, see Detailed Configuration. The following uses the configuration of weight gradient collection as an example:

{
    "targets": {},
    "wg_distribution": true,
    "format": "csv",
    "ops": ["min","max", "mean", "norm"],
    "ndigits": 16
}

Tool Enablement

Locate where the model and optimizer are defined and where training begins in the actual training code, then add the tool enablement code. The enablement code varies in different scenarios.

PyTorch

# Megatron-LM(core_r0.6.0)  training.py
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider, model_type) 
...
# Insert the monitor tool immediately after the model and optimizer are defined.
+from msprobe.pytorch import TrainerMon
+monitor = TrainerMon(
+    config_file_path="./monitor_config.json",
+    params_have_main_grad=True,  # Whether to use main_grad for the weight. Generally, the value is True for Megatron and False for others. The default value is True.
+) 
+monitor.set_monitor(
+    model,
+    grad_acc_steps=args.global_batch_size//args.data_parallel_size//args.micro_batch_size,
+    optimizer=optimizer
+)

If DeepSpeed, Accelerate, and Transformers are used together, use optimizer=optimizer.optimizer. If DeepSpeed is not used and Accelerate and Transformers are used separately, use optimizer=optimizer.

When both DeepSpeed and Accelerate are used, the tool enabling position is as follows:

model, optimizer, trainloader, evalloader, schedular = accelerator.prepare(...)
...
+monitor = TrainerMon(...)
+monitor.set_monitor(....optimizer=optimizer.optimizer)

When both DeepSpeed and Transformers are used, the tool enabling position is as follows:

# src/transformers/trainer.py
class Trainer:
    def _inner_training_loop:
        ...
+       monitor = TrainerMon(...)
+       monitor.set_monitor(....optimizer=self.optimizer.optimizer)

        for epoch in range(epochs_trained, num_train_epochs):
            ...

MindSpore

# Megatron-LM(core_r0.6.0)  training.py
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider, model_type) 
...
# Insert the monitor tool immediately after the model and optimizer are defined.
+from msprobe.mindspore import TrainerMon
+monitor = TrainerMon(
+    config_file_path="./monitor_config.json",
+    process_group=None,
+    params_have_main_grad=True,  # Whether to use main_grad for weights. Generally, the value is True for Megatron and False for others. The default value is True.
+) 
# Mount objects to be monitored.
+monitor.set_monitor(
+    model,
+    grad_acc_steps=args.global_batch_size//args.data_parallel_size//args.micro_batch_size,
+    optimizer=optimizer
+)

Precautions

If the framework is FSDP1, ensure that use_orig_params is set to True when the model is wrapped by FSDP.

Functions of the Training Status Monitoring Tool

The following table lists the tool functions.

Function	Description	Supported Scenario
Weight Monitoring	Monitors weights.	PyTorch, MindSpore
Weight Gradient Monitoring	Monitors weight gradients.	PyTorch, MindSpore
Activation Monitoring	Monitors activations.	PyTorch, MindSpore
Optimizer Status Monitoring	Monitors optimizer status.	PyTorch, MindSpore
Collecting Module Stack Info	Collects the stack information of the module in the first step to facilitate fault locating.	PyTorch, MindSpore
Specifying Monitoring Objects	Specifies the `nn.Module(nn.Cell)` to be monitored and its inputs and outputs.	PyTorch, MindSpore
Printing the Model Structure	Prints the model structure.	PyTorch
L2 Feature Interpretability Monitoring	Monitors high-level model status.	PyTorch, MindSpore
mbs-Granularity Gradient Monitoring	When gradient monitoring is enabled, gradients can be collected at the `micro_batch_size` granularity.	PyTorch, MindSpore
Alarm Monitoring	Automatically generates alarms when the monitored object metrics are abnormal and supports data flushing.	PyTorch, MindSpore
Converting CSV Data to TensorBoard Files for Visual Display	Converts CSV data to TensorBoard files for display.	PyTorch
Dynamic Start and Stop	Dynamically modifies configurations to enable monitoring during training.	PyTorch, MindSpore
Function Overloading	Monitors activations during training. This function is to be deprecated. Use the dynamic start and stop function instead.	PyTorch

Weight Monitoring

This function can be used to monitor weights. The following is a configuration example:

{  
    "targets": {
    },
    "param_distribution": true,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"]
}

All weights of the module specified in targets are monitored. If targets is empty, all modules are monitored by default. You can set param_distribution to true to enable weight monitoring. The default value is false.

Weight Gradient Monitoring

This function can be used to monitor weight gradients before and after aggregation. The following is a configuration example:

{  
    "targets": {
    },
    "wg_distribution": true,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"]
}

All weights of the module specified in targets are monitored. If targets is empty, all modules are monitored by default. You can set wg_distribution (weight grad, noted as wg) to true to enable weight gradient monitoring. The default value is false.

Activation Monitoring

This function can be used to monitor activations. The following is a configuration example:

{  
    "targets": {
    },
    "xy_distribution": true,
    "forward_only": false,
    "backward_only": false,
    "all_xy": true,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"]
}

If all_xy is set to true, the activations of all modules are monitored. To set monitoring objects for a specified module, configure them in targets. For details, see Specifying Monitoring Objects.

If xy_distribution is set to true, activation monitoring is enabled. The default value is false.

Note: If both forward_only and backward_only are set to true, a warning is triggered and neither forward nor backward data is collected. If both forward_only and backward_only are set to false, both forward and backward data is collected.

Optimizer Status Monitoring

This function can be used to monitor optimizer status. The following is a configuration example:

{  
    "targets": {
    },
    "mv_distribution": true,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"]
}

All weights of the module specified in targets are monitored. If targets is empty, all modules are monitored by default. If mv_distribution (1st moment noted as m, 2nd moment noted as v) is set to true, optimization monitoring is enabled. The default value is false. To learn about mv, see this paper.

This tool is adapted to the distributed computing frameworks Megatron and DeepSpeed. Other frameworks are not supported.

Collecting Module Stack Information

This function can be used to collect detailed module stack information. The following is a configuration example:

{  
    "targets": {
    },
    "format": "csv",
    "stack_info": true
}

After stack_info is enabled, the stack information of all modules in the first step is collected. The output format can only be .csv.

Specifying Monitoring Objects

The tool can monitor the status of a specified nn.Module, which is specified in the targets field of the configuration file. The format of targets is {module_name: {}}.

module_name can be obtained through the named_modules() API of nn.Module.

Printing the Model Structure

The tool provides the print_struct option to print the model structure for targets configurations. The tool prints the structure after the first step and stops the training process. By default, the model structure on each rank is saved in $MONITOR_OUTPUT_DIR/module_struct/rank{rank}/module_struct.json, where {rank} indicates the rank ID.

{
    "print_struct": true
}

Output example:

"0:63.mlp.linear_fc2": {
    "input": {
        "config": "tuple[1]",
        "0": "size=(4096, 4, 1024), dtype=torch.bfloat16"
    },
    "output": {
        "config": "tuple[2]",
        "0": "size=(2048, 4, 512), dtype=torch.bfloat16",
        "1": "size=(512,), dtype=torch.bfloat16"
    },
    "input_grad": {
        "config": "tuple[1]",
        "0": "size=(4096, 4, 1024), dtype=torch.bfloat16"
    },
    "output_grad": {
        "config": "tuple[2]",
        "0": "size=(2048, 4, 512), dtype=torch.bfloat16",
        "1": "size=(512,), dtype=torch.bfloat16"
    }
},

For the module object, consider inputs and outputs of forward and backward propagation.

input: forward input
output: forward output
output_grad: backward input, indicating the gradient of the forward output
input_grad: backward output, indicating the gradient of the forward input

Specifying Monitoring Objects

The following is an example of specifying monitoring objects by using targets:

// Example: a module named "module.encoder.layers.0.mlp"
"targets": {
    "module.encoder.layers.0.mlp": {}
}

For the parameter object, pay attention to the gradient (weight grad) in a training iteration and the momentum (1st moment, 2nd moment) of an Adam optimizer. A parameter belongs to a module. You can specify module_name to monitor all parameters contained in a module.

You can obtain param_name by calling named_parameters() of nn.Module.

// Example: Monitor all parameters of "module.encoder.layers.0.mlp" and the "module.embedding.word_embedding.weight" parameter.
{
    "targets": {
        "module.encoder.layers.0.mlp": {},
        "module.embedding.word_embedding.weight": {}
    }
}

Full Monitoring

The tool provides a simple full mode to monitor module objects.

{
    "targets": {}
}

L2 Feature Interpretability Monitoring

This function can be used to monitor model status at a high level. The following is a configuration example:

{
    "l2_targets": {
        "attention_hook": ["0:0.self_attention.core_attention.flash_attention"],
        "linear_hook": ["0:0.self_attention.linear_qkv", "0:1.self_attention.linear_qkv"]
    },
    "recording_l2_features": true,
    "sa_order": "b,s,h,d"
}

Configuration Item	Type	Description	Mandatory (Yes/No)
l2_targets	Dict[str, List[str]]	Specifies the model layer to be monitored. Supported hook types: · `attention_hook`: attention layer. ▪️ Metrics: `entropy`, `softmax_max` ▪️ The accurate layer name must be obtained by printing the model structure. ▪️ If this parameter is not set or is set to an empty list, no data is collected. • `linear_hook`: linear layer ▪️ Metrics: `sr`, `kernel_norm` ▪️ The accurate layer name must be obtained by printing the model structure. If this parameter is not set, no data is collected. ▪️ If an empty list is configured, the system automatically identifies the layers that meet the conditions (including the layers with the `weight` or `wg 2D` attributes).	Yes
recording_l2_features	bool	Specifies whether to enable L2 feature data collection. The default value is `false`, indicating that L2 feature data is not collected.	No
sa_order	str	Specifies the tensor dimension sequence of the Attention input (Q, K) when calculating metrics in `attention_hook`. The value can be `s,b,h,d` or `b,s,h,d`. The default value is `s,b,h,d`, indicating that the input dimension sequence is `sequence_len->batch_size->num_heads->head_dim`.	No

L2 Feature Interpretability Monitoring Metrics

Metric	Applicable Hook	Mathematical Definition/Calculation Method	Monitoring Significance
entropy	attention_hook	$H(p)=−∑pilog⁡piH(p)=-\sum p_i \log p_i$ , where $p_i$ is the attention weight.	Measures the uncertainty of attention distribution. A low entropy value indicates that the attention is concentrated.
softmax_max	attention_hook	$max⁡(softmax(QKT/d))\max(\text{softmax}(QK^T/\sqrt{d}))$	Reflects the focus degree of the attention mechanism. A large value indicates that there is a dominant attention token.
sr(stable_rank)	linear_hook	$∣W∣F∣W∣2\frac{\|W\|_F}{\|W\|_2}$ (Stable rank, Frobenius norm divided by spectral norm)	Evaluates the effective rank of the weight matrix. A small value indicates that the matrix is close to the low-rank unstable state.
kernel_norm	linear_hook	$W\|_F$ (Frobenius norm)	Spectral norm of the weight matrix, which reflects the amplification coefficient of the input in the space formed by the maximum singular vector of the matrix.

mbs-Granularity Gradient Monitoring

When a gradient monitoring task is configured, the tool monitors gradients at the global_batch_size granularity by default. To monitor gradient information at the micro_batch_size granularity, set monitor_mbs_grad to true in the configuration file. The following is a configuration example:

{
    "wg_distribution": true,
    "monitor_mbs_grad": true
}

Application Scope

Only gradients before aggregation can be collected. In the gradient accumulation scenario, micro_batch data cannot be distinguished after aggregation.
In the PyTorch scenario, Megatron and DeepSpeed training frameworks are supported, while the FSDP training framework is not supported.
In the MindSpore scenario, the preceding training frameworks are supported.

Alarm Monitoring

The tool can automatically determine exceptions during training. You can configure alert in the configuration file to specify alarm rules. During training, the tool displays alarms on the screen in a timely manner based on the rules.

Alarm Rules

The table below lists the supported alarm rules.

Alarm	Description	Rule Name	args Required or Not
Historical mean deviation	Compare the current value with the historical mean. If the relative deviation exceeds the threshold, a message is displayed, indicating that the metric deviates. This rule is valid only for the `norm` and `mean` metrics.	AnomalyTurbulence	Required. It must be passed to `threshold`. If the metric exceeds `(1+threshold)*avg`, the metric deviates from the historical mean.
NaN value/Maximum value alarm	Determine the NaN value or maximum value based on whether `threshold` is provided.	AnomalyNan	Optional. If `args` or `threshold` is not configured, NaN is detected by default. If `threshold` is provided, the NaN value and the maximum value whose absolute value exceeds the threshold are detected.

In addition, the dump configuration item is supported in alert. If dump is enabled, the exception information is flushed to the monitor_output/anomaly_detected directory.

The following is an example of the historical mean deviation alarm:

    "alert": {
        "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}], // 0.5 indicates that a deviation message is displayed when the deviation is 50%.
        "dump": true
    },

The following is an example of the NaN value/maximum value alarm:

    "alert": {
        "rules": [{"rule_name": "AnomalyNan", "args": {"threshold": 1e10}}],
        "dump": true
    },

Note: When multiple alarm rules are configured, the first rule is preferentially reported. As shown in the following example, the AnomalyNan alarm is preferentially reported at each layer. (Generally, you are not advised to configure multiple rules.)

    "alert": {
        "rules": [
                  {"rule_name": "AnomalyNan", "args": {"threshold": 1e10}},
                  {"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}
        ],
        "dump": true
    },

Exception Description

During training, if an exception is detected, a message is displayed on the screen and the exception information is written into a JSON file by rank. The default file path is monitor_output/anomaly_detected. The following is an example of the exception information:

{
    "0:1.self_attention.core_attention_flash_0/rank0/input_grad_step_1_call_112": {
        "rank": 0,
        "step": 1,
        "micro_step": 0,
        "pp_stage": 0,
        "vpp_stage": 0,
        "call_id": 112,
        "tag_name": "0:1.self_attention.core_attention_flash_0/rank0/input_grad",
        "message": "Rule AnomalyTurbulence reports anomaly signal in ('0:1.self_attention.core_attention_flash_0/rank0/input_grad', 'min') at step 1.",
        "group_mates": [0, 1]
    },
    ...
}

xxx in call_{xxx} indicates the API execution sequence, which is used for subsequent exception sorting.

Exception Sorting

If a large amount of abnormal data is generated during model training, you need to sort the abnormal events. The tool provides the topk exception sorting capability to sort exceptions based on the API execution sequence, facilitating the demarcation of the first exception point. Example of the exception analysis command:

python3 -m msprobe.core.monitor.anomaly_processor -d $MONITOR_OUTPUT_DIR/anomaly_detected

After the exception analysis is complete, the topk events are written to anomaly_detected/anomaly_analyse.json. The following fields can be configured for exception analysis.

Field	Description	Mandatory (Yes/No)
-d or --data_path	Folder where exceptions are flushed, which is used to monitor function outputs. Generally, the value is $MONITOR_OUTPUT_DIR/anomaly_detected.	Yes
-o or --out_path	Path of the sorted exception files. By default, an anomaly_analyse.json file is flushed to the --data_path directory.	No
-k or --topk	Top K exceptions to be retained. The default value is 8.	No
-s or --step_list	Range of steps to be analyzed. The default value is [].	No

Converting CSV Data to TensorBoard for Visualization

The following describes how to convert CSV data to TensorBoard data.

from msprobe.pytorch.monitor.csv2tb import csv2tensorboard_by_step
# The first three parameters specify a batch of files to be converted, the monitor output directory, and a time range. Files within the range will be converted.
# process_num specifies the number of processes to be started. The default value is 1. More processes can accelerate conversion.
# data_type_list is a list that specifies the data types to be converted. By default, all data is converted. The data types should come from the prefix of the output file. Data types include:
#     ["actv", "actv_grad", "exp_avg", "exp_avg_sq", "grad_unreduced", "grad_reduced", "param_origin", "param_updated"]
# output_dirpath specifies the output directory. By default, the result is saved to the {curtime}_csv2tensorboard_by_step folder. curtime is the current timestamp that is automatically obtained.
csv2tensorboard_by_step(
    monitor_path="~/monitor_output,"  # Mandatory
    time_start="Dec03_21-34-40,"  # Mandatory
    time_end="Dec03_21-34-42,"  # Mandatory
    process_num=8,
    data_type_list=["param_origin"]
)

For details about the parameters, see "Converting CSV Output to TensorBoard Output" in Public APIs.

Dynamic Start and Stop

This function allows users to start or update monitoring operations at any time during training.

Before training, you can set DYNAMIC_MONITOR=True to trigger dynamic start and stop, which needs to be used together with dynamic_on in the config.json file.

In dynamic start and stop mode, the start and stop operations are controlled as follows:

Start:
- First monitoring: Check dynamic_on in the config.json file. If the value is true, go to the next step to enable monitoring.
- Non-first monitoring: Check the timestamp of the config.json file. If the timestamp is updated and dynamic_on is true, go to the next step to enable monitoring.
Stop: After the threshold specified by collect_times is reached, monitoring automatically stops and the value of dynamic_on is changed to false. You can perform the preceding operations to restart monitoring.

Precautions:

By default, monitoring is started after the configuration is initialized or the next step after an update is queried. That is, if the hook is attached in step n, collection is started in step n+1. To collect data in step 0, use the static mode.
If an error occurs when config.json is modified and monitoring is not performed, the modification does not take effect. If monitoring is performed, the original configuration is used.
When the value of collect_times is reached, the program automatically sets the parameter to false. The next time the value is changed to true, monitoring restarts.

The table below describes the supported application scenarios.

Scenario	Monitoring Mode	Procedure	Result
Scenario 1: default static mode	Static	1. Configure `export DYNAMIC_MONITOR=False` or do not set this environment variable.	Data is collected and saved in the default branch, which is not affected by `dynamic_on` in the config.json file.
Scenario 2: dynamic start and stop mode, with monitoring not started initially	Dynamic	1. Configure `export DYNAMIC_MONITOR=True`. 2. Set `dynamic_on: false` in the `config.json` file or do not set this field.	In the initial state, no monitoring is performed, and data is not collected or saved.
Scenario 3: dynamic start and stop mode, with monitoring started initially	Dynamic	1. Configure `export DYNAMIC_MONITOR=True`. 2. Set `dynamic_on: true` in the `config.json` file.	Enable monitoring and save the monitoring result based on the initial configuration in step 1 (where the initial count is 0). After the value of `collect_times` is reached, the monitoring ends.
Scenario 4: dynamic start and stop mode, with monitoring started during training	Dynamic	1. Configure `export DYNAMIC_MONITOR=True`. 2. Set `dynamic_on: false` in the `config.json` file or do not set this field. 3. Change `dynamic_on` to `true` during training.	Enable monitoring and save the monitoring result based on the latest configuration in the next step during training. After the value of `collect_times` is reached, the monitoring ends.
Scenario 5: dynamic start and stop mode, with `config.json` file modified before monitoring ends	Dynamic	1. Configure `export DYNAMIC_MONITOR=True`. 2. Set `dynamic_on: true` to start collection. 3. Modify the `config.json` file before the number of collection times reaches the value of `collect_times`	Before the update, data is collected and saved based on the old configuration. After the update, data is collected based on the latest `config.json` file and `collect_times` is counted from 0. This function can be used together with `collect_times` setting to 0 to stop monitoring in advance.
Scenario 6: dynamic start and stop mode, with monitoring restarted after monitoring ends by `collect_times`	Dynamic	1. Configure `export DYNAMIC_MONITOR=True`. 2. Set `dynamic_on: true` to start collection. 3. After the number of collection times reaches the value of `collect_times`, monitoring ends, and the program automatically changes the value of `dynamic_on` to `false`. 4. Set `dynamic_on` to `true` to restart monitoring.	Before the update, data is collected and saved based on the old configuration. After the monitoring is stopped, no data is collected. After monitoring restarts, the configuration in the latest `config.json` is used for collection and `collect_times` starts from 0.

Function Overloading

This feature will be deprecated in 2026. Use dynamic start and stop instead.

Statistics You can modify the ops attribute of the TrainerMon instance during training to adjust the monitored statistics.

if {some condition}:
    monitor.ops = ["min", "max"]

Enabling or disabling activation monitoring during training Activation monitoring has a large performance loss. It is recommended to enable it only when necessary. For example, if a loss spike occurs, enable activation monitoring based on the loss exception.

if {some condition}:
    monitor.reload_xy(xy_distribution=True)

Output Description

Output Path

You can set the MONITOR_OUTPUT_DIR environment variable to specify the monitor output path. The default value is ./monitor_output/.

export MONITOR_OUTPUT_DIR=/xxx/output_dir

Output Format

You can specify the output format by setting format, which supports csv, tensorboard, and api. csv is the default value.

tensorboard: The monitoring result is written to the event file of TensorBoard, which can be viewed using TensorBoard. The tag of an activation monitoring task is {vpp_stage}:{module_name}.{input or output}:{micro_step}/{rank}/{task}\_{ops}. The tag of other monitoring tasks is {vpp_stage}:{param_name}/{rank}/{task}\_{ops}.
```
tensorboard --logdir=$MONITOR_OUTPUT_DIR
```
Then, run the following SSH command to set up port forwarding. You can access TensorBoard locally through http://localhost:6006.
```
ssh -N -L localhost:6006:localhost:6006 your_username@remote_server_address
```
csv: The monitoring result is written into a .csv file. You can set the number of decimal places by using the ndigits field. The header is vpp_stage | name | step | micro_step(optional) | *ops |. micor_step is contained only in the output file of activation monitoring. name of activation monitoring is <module_name>.<input or output>, and name of other tasks is <param_name>.
api: The monitoring result is not flushed. During training, you can obtain the monitoring result by calling APIs such as generate_wgrad_metrics and generate_xy_metrics. For details, see Public APIs.

Merging .csv Output Files

Multiple .csv output files can be merged by setting step_count_per_record in the JSON configuration file, which specifies the number of steps whose monitoring data is stored in each .csv file. The default value is 1, indicating that the monitoring data of one step is recorded in each .csv file.

The following figure shows an example of the gradient monitoring result. If step_count_per_record is set to 5 and 10 steps are monitored continuously, each .csv file records the gradient data of five steps. grad_reduced_0-4.csv is the aggregated gradient data of five steps from step 0 to step 4, and grad_unreduced_0-4.csv is the gradient data before aggregation of five steps from step 0 to step 4.

step_count_per_record

Public APIs

monitor initialization

TrainerMon.__init__(config_file_path, process_group=None, params_have_main_grad=True) -> None

Parameter	Description	Mandatory (Yes/No)
config_file_path	JSON configuration file path.	Yes
process_group	ProcessGroup object, which is used to determine the time sequence of different rank exceptions in pipeline parallelism. It is obtained through `core.parallel_state.get_pipeline_model_parallel_group()` in Megatron. This parameter is used only to judge the abnormal time sequence.	No
params_have_main_grad	Whether to use main_grad for weights. Generally, the value is True for Megatron and False for DeepSpeed. The default value is True.	No
opt_ty (discarded)	Optimizer type.	No

monitor mounted to a model

TrainerMon.set_monitor(model, grad_acc_steps, optimizer, dp_group=None, tp_group=None, start_iteration=0) -> None

Parameter	Description	Mandatory (Yes/No)
model	Model to be monitored, which must be a `torch.nn.Module` or `mindspore.nn.Cell`.	Yes
grad_acc_steps	Gradient accumulation steps	Yes
optimizer	Optimizer to be patched	Yes
dp_group	Communication group for data parallelism. After DP domain communication, if no distributed optimizer is used, the gradients across all ranks in the group are identical, making the dumped data redundant. After dp_group is used, the tool retains only the gradient of the first rank in each dp_group.	No
tp_group	Communication group for tensor parallelism. After TP domain communication, the gradients of some parameters across all ranks in groups are identical, making the dumped data redundant. After tp_group is used, the tool retains only the gradient of redundant parameters on the first rank in each tp_group. Currently, Megatron core_r0.6.0 is supported. The weight attribute `tensor_model_parallel` is used to determine data redundancy.	No
start_iteration	Start iteration of training, which affects tool counting. This parameter is supported only in PyTorch scenarios.	No

Converting .csv Output Files to TensorBoard Output Files

csv2tensorboard_by_step(monitor_path, time_start, time_end, process_num=1, data_type_list=None) -> None

Parameter	Description	Mandatory (Yes/No)
monitor_path	Directory for storing the .csv files to be converted.	Yes
time_start	Start timestamp. This parameter is used together with `time_end`. It specifies a time range, and files within the range will be converted. The value is an inclusive range (closed on both ends).	Yes
time_end	End timestamp. This parameter is used together with `time_start`. It specifies a time range, and files within the range will be converted. The value is an inclusive range (closed on both ends).	Yes
process_num	Number of processes to be started. The default value is 1. More processes can accelerate conversion.	No
data_type_list	Data type to be converted. The data type should come from the prefix of the output file. Data types include: ["actv", "actv_grad", "exp_avg", "exp_avg_sq", "grad_unreduced", "grad_reduced", "param_origin", "param_updated"]. If this parameter is not specified, all data is converted.	No
output_dirpath	Output path after conversion. By default, the result is output to the `{curtime}_csv2tensorboard_by_step` folder. `curtime` is the current timestamp that is automatically obtained.	No

Obtain the gradient statistics of the current parameter at any position in the model.

TrainerMon.generate_wgrad_metrics() -> tuple[dict, dict]

Usage:

reduced, unreduced = monitor.generate_wgrad_metrics()

Obtain the activation statistics of the current parameter at any position in the model.

TrainerMon.generate_xy_metrics() -> tuple[dict, dict]

Usage:

actv, actv_grad = monitor.generate_xy_metrics()

Description of the old API, which will be deprecated in 2026:

TrainerMon.set_wrapped_optimizer(optimizer) -> None

Parameter	Description	Mandatory (Yes/No)
optimizer	Mixed precision optimizer provided by Megatron and DeepSpeed	Yes

TrainerMon.monitor_gnorm_with_ad(model, grad_acc_steps, optimizer, dp_group, tp_group, start_iteration) -> None

Parameter	Description	Mandatory (Yes/No)
model	Model to be monitored, which must be a `torch.nn.Module` or `mindspore.nn.Cell`.	Yes
grad_acc_steps	Gradient accumulation steps	Yes
optimizer	Optimizer to be patched	No
dp_group	Communication group for data parallelism. After DP domain communication, if no distributed optimizer is used, the gradients across all ranks in the group are identical, making the dumped data redundant. After dp_group is used, the tool retains only the gradient of the first rank in each dp_group.	No
tp_group	Communication group for tensor parallelism. After TP domain communication, the gradients of some parameters across all ranks in groups are identical, making the dumped data redundant. After tp_group is used, the tool retains only the gradient of redundant parameters on the first rank in each tp_group. Currently, Megatron core_r0.6.0 is supported. The weight attribute `tensor_model_parallel` is used to determine data redundancy.	No
start_iteration	Start iteration of training, which affects tool counting. This parameter is supported only in PyTorch scenarios.	No

The table below describes API changes.

Change	Description
Simplified initialization API	TrainerMon.init(config_file_path, process_group=None, param_have_main_grad=True)
Main API modified	`monitor_gnorm_with_ad(...)` is `renamed set_monitor(...)`, and `optimizer` is changed from an optional parameter to a mandatory parameter.
Optimizer packaging API deprecated	`set_wrapped_optimizer` is deprecated, and `optimizer` is passed by `set_monitor`.

Detailed Configuration

{  
    "targets": {  
        "language_model.encoder.layers.0": {"input": "tuple[2]:0", "output": "tensor", "input_grad":"tuple[2]:0", "output_grad":"tuple[1]:0"}  
    },
    "dynamic_on": false,  
    "start_step": 0,
    "collect_times": 100000000,
    "step_interval": 1,
    "print_struct": false,
    "module_ranks": [0,1,2,3],
    "ur_distribution": true,
    "xy_distribution": true,
    "all_xy": true,
    "forward_only": false,
    "backward_only": false,
    "mv_distribution": true,
    "param_distribution": true,
    "wg_distribution": true,
    "monitor_mbs_grad": true,
    "cc_distribution": {"enable":true, "cc_codeline":[]},
    "alert": {
        "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}],
        "dump": false
    },
    "format": "csv",
    "ops": ["min", "max", "norm", "zeros", "nans", "mean"],
    "eps": 1e-8,
    "ndigits": 12,
    "step_count_per_record": 1,
    "append_output": [],
    "squash_name": false
}

The table below describes the fields in detail.

Field	Mandatory (Yes/No)	Description
"targets"	No	Model layer and object to be monitored. For example, for the 0th layer `language_model.encoder.layers.0` of a transformer, you can monitor `input`, `output`, `input_grad`, and `output_grad`. If you are not clear about the model structure, set `print_struct` to `true`. The monitoring tool prints the name and detailed structure of the torch module in the model and exits after the first step. If this parameter is not set, full monitoring is performed by default.
"input"	No	`tuple[2]:0` means the forward input of the target module is a tuple of length 2, and we focus on the first element (index 0).
"output"	Yes	`tensor` means the forward output parameter of the target module is of the tensor type.
"input_grad"	No	`tuple[2]:0` means the backward `input_grad` of the target module is a tuple of length 2, and we focus on the first element (index 0).
"output_grad"	Yes	`tuple[1]:0` means the backward `output_grad` of the target module is a tuple of length 1, and we focus on the first element (index 0).
"dynamic_on"	No	This parameter is used for dynamic start and stop function. The value true indicates that monitoring is enabled, and the value false indicates that monitoring is disabled. The default value is false. When the value of `collect_times` is reached, the value is automatically set to false. Monitoring will be restarted after the value is changed to true.
"collect_times"	No	Number of collection times. When the number of collection times reaches the value of this parameter, monitoring stops. The default value is 100000000, indicating that collection is performed continuously.
"start_step"	No	Start step of collection. When the model training step reaches the value of this parameter, monitoring and collection start. The default value is 0, indicating that monitoring and collection start from step 0. Note: This parameter does not take effect in dynamic start/stop mode. Monitoring and collection start from the next step.
"step_interval"	No	Collection step interval. The default value is 1, indicating that monitoring data is collected at each step.
"print_struct"	No	If this parameter is set to true, the monitoring tool prints the name and detailed structure of the module on each rank and exits after the first step. If this parameter is left blank, the default value false is used.
"module_ranks"	No	Ranks on which module monitoring is enabled in distributed training scenarios. If this parameter is left blank, module monitoring is enabled on all ranks by default. The rank in the list must be of the int type.
"ur_distribution"	No	If this parameter is set to true, the value distribution of the `update` and `ratio` vectors of parameters of the specified module (specified in `targets`) of the Adam optimizer is collected and displayed in the heatmap. In addition, the `format` field must be set to `tensorboard`. The default value is false. CANN 8.0.rc2 or later must be used for the histc operator; otherwise, serious performance problems may occur. This parameter is supported only in PyTorch scenarios.
"xy_distribution"	No	If this parameter is set to true, the input and output tensors of the specified module (specified in `targets`) are monitored. The default value is false.
"all_xy"	No	This parameter is valid only when `xy_distribution` is enabled. If this parameter is set to true, all modules are monitored. The default value is false. This parameter takes effect together with `targets`. When `all_xy` is set to true and `module_xx` and specified objects are configured in `targets`, `module_xx` takes effect based on the `targets` configuration. For other modules, all objects, including `input`, `output`, `input_grad`, and `output_grad`, are monitored.
"forward_only"	No	This parameter is valid only when `xy_distribution` is enabled. If the value is true, only the forward propagation of the specified module is monitored, and `input_grad` and `output_grad` in `targets` do not take effect. The default value is false.
"backward_only"	No	This parameter is valid only when `xy_distribution` is enabled. If the value is true, only the backward propagation of the specified module is monitored, and `input` and `output` in `targets` do not take effect. The default value is false.
"mv_distribution"	No	If the value is True, the optimizer status of parameters in the specified module is monitored. The default value is False.
"wg_distribution"	No	If the value is True, the parameter gradient of the specified module is monitored. The default value is False.
"monitor_mbs_grad"	No	If the value is True, the mbs-granularity gradients are monitored. The default value is False.
"param_distribution"	No	If the value is True, the parameter of the specified module is monitored. The default value is False.
"alert"	No	`rules` specifies the exception detection mechanism and threshold for automatic alarm reporting. Currently, AnomalyTurbulence can be detected. If the statistical scalar deviates beyond the allowed floating range from the historical average value, an alarm is printed to the console. The `threshold` is set to 0.5, indicating that the allowable floating range is 50%. If the `dump` field is set to true, the exception is written into a file. The default value is false. This parameter is supported only in PyTorch scenarios.
"cc_distribution"	No	The `enable` field controls whether to enable the communication monitoring module, which is enabled only during multi-rank training. To monitor communication operators, instantiate `TrainerMon` as early as possible. The monitoring is implemented by hijacking the original function and then mounting a hook. During the initialization of some acceleration libraries, the original function is saved to prevent the monitoring from becoming invalid. The `cc_codeline field` specifies monitoring code lines, for example, `train.py\\[23\\]`. By default, that value is an empty list. The `cc_pre_hook` field controls whether to monitor inputs. The module prints communication logs before the second `optimize.step`, including the call stack, input dtype, and communication group of the communication API. If `cc_log_only` is set to true, only logs are printed, the input and output of communication are not monitored, and training is interrupted after the logs are printed. You can set `cc_codeline` based on communication logs to avoid communication irrelevant to the training process, such as time and metric synchronization.
"mg_direction"	No	If this parameter is set to true, the ratio of the weight gradient in alignment with the momentum direction is calculated. The default value is false.
"format"	No	Data flushing format. The value can be csv (default), tensorboard, or api. This parameter is supported only in Python and MindSpore dynamic graph scenarios. In MindSpore dynamic graph scenarios, only csv is supported.
"ops"	No	The value is a list, which works with `ur_distribution`, `xy_distribution`, `mv_distribution`, `wg_distribution`, `mg_direction`, and `cc_distribution` to monitor the statistical metrics of the selected tensor. Currently, `min`, `max`, `norm`, `mean`, `zeros`, and `nans` are supported. `zeros` represents the ratio of monitored elements in the selected tensor that is less than `eps`, and `nans` represents the number of NaN values in the tensor. If there is no valid metric in `ops`, the `norm` metric is monitored by default.
"eps"	No	If `ops` contains `zeros`, this parameter needs to be configured. The default value is 1e-8.
"ndigits"	No	Number of decimal places in the flushed file when `format` is csv. The default value is 6.
"step_count_per_record"	No	Number of step records in each .csv file. This parameter is valid only when `format` is csv. The default value is 1.
"append_output"	No	This parameter applies to resumable training scenarios. In multi-rank scenarios, it specifies a range of two timestamps. Outputs are written to the output files only for ranks with timestamps within the range; outputs from ranks outside the range are not written. The timestamp should be from the prefix of the original output directory, for example, ["Dec03_21-34-40", "Dec03_21-34-41"]. The default value is [], indicating that the output is not written. This parameter is supported only in PyTorch scenarios.
"squash_name"	No	Whether to simplify parameter or module names. It is recommended that this function be disabled in multimodal scenarios. The default value is False.