Lightweight Training Status Monitoring Tool

Overview

Monitor V2 is a lightweight training status monitoring tool of msProbe. It collects key intermediate values on demand during training (such as module input/output, weight gradients, optimizer momentum, and communication operator statistics) and writes them to disk in CSV format for training stability evaluation and exception locating.

Tool Usage Process

(1) Configure the monitoring items to be collected. (2) Initialize the monitor in the training code and call mon.step() once per training step. (3) View the CSV results in the output directory.

Applicable Scenarios

When the loss increases or spikes occur, use module to observe the value distribution of the model forward input/output and backward gradients to determine whether the exception is caused by a specific layer or phase.
When the gradient norm is abnormal, use weight_grad to locate the abnormal parameter and phase (unreduced: collected during backpropagation, that is, the gradient accumulation phase; reduced: collected just before optimizer.step() is called).
When convergence deteriorates or oscillation occurs, use optimizer to check whether the momentum (exp_avg/exp_avg_sq) distribution changes abruptly.
When a synchronization or communication exception occurs during distributed training, use cc to collect statistics on communication operators and narrow down the check scope based on code line filtering.

Recommended Tool Enabling Policy

Enable weight_grad for long-term monitoring. Then, enable module/cc as needed when an issue occurs to reduce the overhead by filtering targets.

Preparations

Installation

Install msProbe by referring to msProbe Installation Guide.

Constraints

PyTorch: torch >= 2.1
MindSpore: mindspore >= 2.4.10 (A dynamic graph environment is required. If MSAdapter or MindTorch is used in the project, refer to the actual project requirements.)
Monitor V2 currently supports only format=csv.

Version Requirements and Restrictions

To avoid confusion with Monitor V1, Monitor V2 must meet the following requirements:

Output: Only format=csv is supported (tensorboard/api not supported).
Configuration: monitors.<name> is used for organization, and v1 naming such as xy_distribution/wg_distribution/... is not used.
Toolchain: v1 toolchains such as print_struct, stack_info, alert, csv2tensorboard, and csv2db are not supported.

Quick Start

Configuration File Preparation

Create a monitor_v2_config.json file in the directory where the training script is located. In the following example, only weight_grad is collected (a recommended configuration for long-term low-overhead monitoring).

For more configuration fields and their meanings, see Detailed Configuration.

{
  "framework": "pytorch",
  "output_dir": "./monitor_v2_out",
  "rank": [0],
  "start_step": 0,
  "step_interval": 1,
  "step_count_per_record": 1,
  "collect_times": 100,
  "format": "csv",
  "monitors": {
    "weight_grad": {
      "enabled": true,
      "ops": ["min", "max", "mean", "norm", "nans"],
      "eps": 1e-8,
      "monitor_mbs_grad": false
    }
  }
}
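
The same configuration can also be generated from a script, which may help when several runs share most settings. The following is a minimal sketch that uses only the Python standard library. The helper name render_monitor_config is hypothetical and is not part of msProbe; the field values simply repeat the example above.

```python
import json


def render_monitor_config(output_dir="./monitor_v2_out", collect_times=100):
    """Return the weight_grad-only example config as JSON text.

    Hypothetical helper for illustration; not an msProbe API.
    """
    config = {
        "framework": "pytorch",
        "output_dir": output_dir,
        "rank": [0],
        "start_step": 0,
        "step_interval": 1,
        "step_count_per_record": 1,
        "collect_times": collect_times,
        "format": "csv",  # the only format Monitor V2 currently supports
        "monitors": {
            "weight_grad": {
                "enabled": True,
                "ops": ["min", "max", "mean", "norm", "nans"],
                "eps": 1e-8,
                "monitor_mbs_grad": False,
            }
        },
    }
    return json.dumps(config, indent=2)
```

Writing the returned text to monitor_v2_config.json (for example, with open(..., "w")) yields a file equivalent to the one shown above.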

Example in PyTorch Scenario

First, initialize the monitor after the model and optimizer are ready. Then, call mon.step() once at the end of each training step. Finally, call mon.stop() to release resources when the training is complete. The key is to ensure that the function is called once for each step (usually after optimizer.step() and before or after optimizer.zero_grad()).

NOTE

If patch_optimizer_step=true is configured (or optimizer is passed without explicit configuration), optimizer.step() is automatically wrapped to trigger data collection. In this case, do not manually call mon.step(). If you need to manually call it, explicitly set patch_optimizer_step=false.

from msprobe.core.monitor_v2.trainer import TrainerMonitorV2

mon = TrainerMonitorV2("./monitor_v2_config.json", fr="pytorch")  # fr can be omitted. By default, config.framework is read.
mon.start(model=model, optimizer=optimizer, grad_acc_steps=grad_acc_steps)

for _ in range(num_steps):
    loss = forward(...)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # Manual trigger: set "patch_optimizer_step": false in the configuration (see NOTE above).
    mon.step()

mon.stop()
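
The rule in the NOTE above can be written down as a small decision helper. This is only a sketch of the documented rule, not msProbe code: the function name needs_manual_step is hypothetical, and the case where patch_optimizer_step=true is set but no optimizer is passed is not described in this guide, so the sketch follows the configured value literally.

```python
def needs_manual_step(config, optimizer_passed):
    """Return True if the training loop should call mon.step() itself.

    config: the loaded monitor_v2_config.json as a dict.
    optimizer_passed: whether an optimizer is given to mon.start().
    Hypothetical helper mirroring the NOTE above; not an msProbe API.
    """
    patch = config.get("patch_optimizer_step")
    if patch is None:
        # Not configured: wrapping is enabled by default when an optimizer is passed.
        patch = bool(optimizer_passed)
    # When optimizer.step() is wrapped, collection is triggered automatically.
    return not patch
```

With the Quick Start configuration (no patch_optimizer_step field) and an optimizer passed to start(), the helper returns False, which matches the advice not to call mon.step() manually in that case.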

Example in MindSpore Scenario

from msprobe.core.monitor_v2.trainer import TrainerMonitorV2

mon = TrainerMonitorV2("./monitor_v2_config.json", fr="mindspore")
mon.start(model=model, optimizer=optimizer, grad_acc_steps=grad_acc_steps)
for _ in range(num_steps):
    ...
    mon.step()
mon.stop()
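
Both examples follow the same start/step/stop lifecycle. If training can raise an exception, one possible way to make sure mon.stop() still runs is a small context manager. This is a sketch under the assumption that the monitor object exposes start() and stop() as described in Public APIs; the name monitoring is hypothetical and not part of msProbe.

```python
from contextlib import contextmanager


@contextmanager
def monitoring(mon, **start_kwargs):
    """Start `mon` and guarantee that mon.stop() runs, even if training raises.

    Hypothetical wrapper; `mon` is any object with start(**kwargs) and stop().
    """
    mon.start(**start_kwargs)
    try:
        yield mon
    finally:
        # Release hooks and close the writer regardless of how the block exits.
        mon.stop()
```

A training loop would then be written as: with monitoring(mon, model=model, optimizer=optimizer): followed by the loop body.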

Usage Tips

To improve locating efficiency, you are advised to narrow down the problem scope in the following sequence:

If the loss is abnormal but the gradient norm is normal, preferentially enable module and use targets to focus on suspicious modules to observe the forward and backward input and output distributions.
If the gradient norm is abnormal, preferentially enable weight_grad and use unreduced or reduced to determine whether the anomaly occurred during backward accumulation or before the step.
If you suspect a communication-related issue, instantiate TrainerMonitorV2 and invoke start() as early in the training process as possible. Doing so minimizes the risk of interception failure due to the acceleration library caching the original communication API. Also, use cc_codeline for filtering.

Functions

Monitor V2 enables sub-functions as required by setting the monitors field in the configuration file. Each sub-function supports independent start, stop, and output operations, enabling flexible combinations tailored to specific problems.

module (Module Input/Output and Backward Gradient)

Function

Collects the forward input/output and backward grad_input/grad_output statistics of a specified module to locate the layer and type of tensor where problems such as spikes, NaN/Inf, and abrupt scale changes occur.

Precautions

The overhead is usually higher than that of weight_grad. You are advised to narrow down the monitoring scope (by using targets to monitor only suspicious modules).
module_name in the output contains input/output/grad_input/grad_output, which is used to distinguish the collection scope.

Example

The following displays monitors.module configurations. For details about how to combine them into a complete monitor_v2_config.json, see Configuration File Preparation.

{
  "monitors": {
    "module": {
      "enabled": true,
      "targets": ["encoder.layers.0", "mlp"],
      "ops": ["min", "max", "mean", "norm", "nans"],
      "eps": 1e-8
    }
  }
}

Configuration field description: See Detailed Configuration > module.

Code example: See Example in PyTorch Scenario or Example in MindSpore Scenario.
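
The monitors fragments shown in this chapter need to be combined with the top-level fields before use. One way to do this programmatically is sketched below with the standard library only. The helper name merge_monitor_fragment is hypothetical, and the key-by-key override behavior is a choice made for this sketch, not something Monitor V2 prescribes.

```python
import copy


def merge_monitor_fragment(base_config, fragment):
    """Return a copy of base_config with fragment["monitors"] merged in.

    Hypothetical helper; settings are merged per monitor name, key by key.
    """
    merged = copy.deepcopy(base_config)
    monitors = merged.setdefault("monitors", {})
    for name, settings in fragment.get("monitors", {}).items():
        # Settings from the fragment override existing keys of the same monitor.
        monitors.setdefault(name, {}).update(copy.deepcopy(settings))
    return merged
```

For example, merging the module fragment above into the Quick Start configuration produces one file in which both weight_grad and module are enabled.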

Output Description

For details, see module.csv.

weight_grad (Weight Gradient)

Function

Collects weight gradient statistics to determine which parameter's gradient becomes abnormal first and whether the exception occurs during backward accumulation or before the step.

unreduced: data collection during backpropagation (closer to the gradient generation process).
reduced: data collection before optimizer.step() is called (closer to the final gradient form before the step).

Precautions

If gradient accumulation or micro-batch exists, you are advised to use grad_acc_steps or micro_batch_number to specify the number of micro-batches. Enable monitor_mbs_grad if micro-batch is required.

Example

The following displays monitors.weight_grad configurations. For details about how to combine them, see Configuration File Preparation.

{
  "monitors": {
    "weight_grad": {
      "enabled": true,
      "monitor_mbs_grad": true,
      "grad_acc_steps": 8
    }
  }
}

Configuration field description: See Detailed Configuration > weight_grad.

Code example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

Output Description

See weight_grad.csv.

optimizer (Optimizer Momentum)

Function

Collects statistical metrics of the optimizer status. Currently, Adam momentum (exp_avg/exp_avg_sq) is mainly used to determine whether the optimizer status changes abruptly or whether abnormal distribution occurs.

Precautions

This function takes effect only when optimizer is provided (mon.start(model=..., optimizer=...)).
Currently, momentum information (mv_distribution) is mainly covered. Other extended capabilities depend on the version.

Example

The following displays monitors.optimizer configurations. For details about how to combine them, see Configuration File Preparation.

{
  "monitors": {
    "optimizer": {
      "enabled": true,
      "mv_distribution": true,
      "ops": ["min", "max", "mean", "norm", "nans"]
    }
  }
}

Configuration field description: See Detailed Configuration > optimizer.

Code example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

Output Description

See optimizer.csv.

param (Parameter Distribution)

Function

Collects statistics on the distribution of parameters before and after the optimizer step to locate parameter exceptions or update exceptions.

Precautions

This function takes effect only when optimizer is provided (mon.start(model=..., optimizer=...)).
param_distribution is enabled by default and can be disabled as required.

Example

The following displays monitors.param configurations. For details about how to combine them, see Configuration File Preparation.

{
  "monitors": {
    "param": {
      "enabled": true,
      "param_distribution": true,
      "ops": ["min", "max", "mean", "norm", "nans"]
    }
  }
}

Configuration field description: See Detailed Configuration > param.

Code example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

Output Description

See param.csv.

cc (Communication Operator)

Function

Collects statistics and logs of communication operators during distributed training to locate problems such as abnormal communication calls, abnormal input and output, and communication irrelevant to training. You can also filter code lines to narrow down the check scope.

Precautions

This function takes effect only after the distributed environment is initialized (for example, torch.distributed.is_initialized() returns True in PyTorch).
You are advised to perform instantiation and invoke start() as early as possible in the training process to prevent interception failure caused by some acceleration libraries caching the original communication API.
cc_log_only=true is suitable for scenarios where logs are recorded before filtering rules are applied. It may interrupt training, so use it with caution.

Example

The following displays monitors.cc configurations. For details about how to combine them, see Configuration File Preparation.

{
  "monitors": {
    "cc": {
      "enabled": true,
      "ops": ["min", "max", "mean", "norm", "nans"],
      "cc_codeline": [],
      "cc_pre_hook": false,
      "cc_log_only": false
    }
  }
}

The following monitors.cc configuration only prints communication logs, which can then be used to choose cc_codeline filters. For details about how to combine it, see Configuration File Preparation.

{
  "monitors": {
    "cc": {
      "enabled": true,
      "cc_log_only": true
    }
  }
}

Configuration field description: See Detailed Configuration > cc.

Code example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

Output Description

See cc.csv.

Output File Description

Output Directory Description

The output directory is specified by output_dir. To facilitate multi-rank analysis, each rank outputs independently to rank_<rank_id>/.

Currently, only format=csv is supported. Within each rank_<rank_id>/ directory, each enabled monitoring module writes its own CSV files.

Directory structure:

<output_dir>/
  rank_<rank_id>/
    module_step0-0.csv
    weight_grad_step0-0.csv
    optimizer_step0-0.csv
    param_step0-0.csv
    cc_step0-0.csv

NOTE

CSV files are generated only for modules enabled in monitors.
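
To get an overview of what a run produced, the directory layout above can be scanned with a few lines of Python. This is a sketch based only on the file names shown in the directory structure: the pattern <monitor>_step<a>-<b>.csv is inferred from that listing, and reading the two numbers as the first and last step in the file is an assumption. Both helper names are hypothetical.

```python
import re
from pathlib import Path

# Inferred from names such as weight_grad_step0-0.csv (assumption, see lead-in).
_NAME_RE = re.compile(r"^(?P<monitor>.+)_step(?P<first>\d+)-(?P<last>\d+)\.csv$")


def parse_output_name(file_name):
    """Return (monitor, first, last) for a Monitor V2 CSV name, or None if it does not match."""
    match = _NAME_RE.match(file_name)
    if match is None:
        return None
    return match.group("monitor"), int(match.group("first")), int(match.group("last"))


def list_outputs(output_dir):
    """Group recognized CSV files by rank directory name (for example, rank_0)."""
    found = {}
    for path in sorted(Path(output_dir).glob("rank_*/*.csv")):
        parsed = parse_output_name(path.name)
        if parsed is not None:
            found.setdefault(path.parent.name, []).append((path, *parsed))
    return found
```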

CSV Table Header and Field Description

To facilitate direct comparison of results across different monitoring items, the CSV output follows a unified rule:

Each line contains at least vpp_stage and step.
The common fields written by the monitoring module usually include module_name, scope, and micro_step (in some scenarios).
The statistical metrics are expanded into columns by ops: min/max/mean/norm/nans (only the metrics configured/enabled by the user are written).

Therefore, a common CSV table header consists of the common fields described below plus one column for each configured statistical metric.

Field description:

step: training step (incremented by each call to TrainerMonitorV2.step())
vpp_stage: stage ID in the multi-model/multi-stage scenario (deduced from the name prefix <idx><NAME_SEP>; 0 by default if there is no prefix)
module_name: tag of the monitored object (The tag rules vary depending on the module. For details, see Fields Specific to Each Function.)
scope: monitoring scope/phase (The semantics vary depending on the module. For details, see Fields Specific to Each Function.)
micro_step: This field is available only when micro-batch is enabled, for example, weight_grad.monitor_mbs_grad=true.
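
Because all monitors share these fields, a single scan routine can flag suspicious rows in any of the CSV files. The sketch below uses only the standard library and assumes that the columns are named step, module_name, and scope, and that the nans and norm columns are present when those ops are enabled. The threshold norm_limit is an arbitrary illustrative value, and the function name is hypothetical.

```python
import csv
import io
import math


def find_suspicious_rows(csv_text, norm_limit=1e4):
    """Return (step, module_name, scope) for rows with NaNs or an oversized norm.

    Missing nans/norm columns are treated as 0 so the scan still works when
    those ops are not configured.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        nans = float(row.get("nans") or 0)
        norm = float(row.get("norm") or 0)
        if nans > 0 or not math.isfinite(norm) or norm > norm_limit:
            hits.append((int(row["step"]), row["module_name"], row["scope"]))
    return hits
```

Pass the text of a file such as weight_grad_step0-0.csv to the function; the returned tuples point to the step and parameter to inspect first.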

Fields Specific to Each Function

module.csv

scope: forward/backward
module_name: The value is in the format of <module_name>.<io_kind>.<idx>, where io_kind may be input/output/grad_input/grad_output.
Tips: Check whether the module is abnormal in both the forward and backward scopes at the abnormal step. This helps determine whether the exception comes from forward activation propagation or from the gradient path.
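
Since module names themselves contain dots, the <module_name>.<io_kind>.<idx> tag is easiest to split from the right. The following sketch shows one way to do it; the function name split_module_tag is hypothetical.

```python
IO_KINDS = ("input", "output", "grad_input", "grad_output")


def split_module_tag(tag):
    """Split '<module_name>.<io_kind>.<idx>' into (module_name, io_kind, idx).

    Raises ValueError if the tag does not have that shape.
    """
    # Split from the right so dots inside the module name are preserved.
    module_name, io_kind, idx = tag.rsplit(".", 2)
    if io_kind not in IO_KINDS:
        raise ValueError(f"unexpected io_kind in tag: {tag!r}")
    return module_name, io_kind, int(idx)
```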

weight_grad.csv

scope: unreduced/reduced
module_name: parameter name (without the scope suffix). The micro_step field is used to distinguish micro-batches.
micro_step: records the current micro-batch index when micro-batch monitoring is enabled; records the total number of accumulated micro-batches (see micro_batch_number/grad_acc_steps) when micro-batch monitoring is disabled.
Tips: If unreduced is normal but reduced is abnormal, the gradient was likely changed or modified between backward and the step. If unreduced is abnormal, the exception likely occurs in the backward pass.
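
The tip above is a two-way decision that can be encoded directly. The sketch below assumes the caller has already decided, for one parameter at one step, whether its unreduced and reduced rows look abnormal (for example, nans > 0). The function name and the returned labels are illustrative only.

```python
def locate_grad_anomaly(unreduced_abnormal, reduced_abnormal):
    """Map the two weight_grad scopes to the phase where the anomaly likely arises."""
    if unreduced_abnormal:
        # The gradient is already abnormal while it is being produced.
        return "backward pass"
    if reduced_abnormal:
        # Normal during backward, abnormal right before optimizer.step().
        return "between backward and optimizer.step()"
    return "no gradient anomaly"
```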

optimizer.csv

scope: exp_avg/exp_avg_sq (same as the suffix of the module_name field)
module_name: in the format of <name>.exp_avg/<name>.exp_avg_sq
Tips: When the training fluctuates or does not converge, compare exp_avg/exp_avg_sq changes before and after the exception to determine whether it is caused by a sudden change in the optimizer status.

param.csv

scope: param_origin/param_updated
module_name: parameter name
Tips: Compare the parameter distribution before and after the step to locate update exceptions or abrupt value changes.

cc.csv

scope: comm
module_name: communication tag generated by communication monitoring (usually containing the communication operator, index, and code location information)
Tips: Use cc_log_only=true to obtain communication logs, set cc_codeline to filter out communication irrelevant to training, and then enable statistics collection to locate the input and output distribution of abnormal communication.

Public APIs

This section lists the main user-facing APIs of Monitor V2. Note: "Prototype" is for documentation purposes only; Any/Optional/Dict/Type are names from Python's typing module.

TrainerMonitorV2

Function: Training monitoring orchestrator. It loads the configuration, initializes the monitoring modules, and collects and writes monitoring results at each step.

Prototype:

TrainerMonitorV2(config_path: str, fr: Optional[str] = None) -> TrainerMonitorV2

Parameters:
- config_path: configuration file path (JSON).
- fr: framework type, which can be pytorch or mindspore (the aliases pt/torch/ms are also accepted). If this parameter is not passed, the framework field in the configuration file is read.
Returns: TrainerMonitorV2 instance.
Example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

TrainerMonitorV2.start

Function: Starts monitoring, that is, creates and starts the modules enabled in monitors based on the configuration, and establishes the write context.

Prototype:

TrainerMonitorV2.start(model: Any = None, optimizer: Any = None, **context: Any) -> None

Parameters:
- model: model object to be monitored (nn.Module for PyTorch/nn.Cell for MindSpore; model list in some scenarios is also supported).
- optimizer: optimizer object to be monitored. This parameter is mandatory when weight_grad/optimizer is enabled.
- context: optional context information, which is used to supplement the running parameters required by the monitoring module.
  - grad_acc_steps/micro_batch_number: gradient accumulation/micro-batch count (affecting the micro_step semantics of weight_grad).
  - Other custom fields: Will be transparently transmitted to each monitoring module.
Returns: None
Example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

TrainerMonitorV2.step

Function: Advances the training step and triggers the collection and writing of the current step (controlled by start_step/stop_step/step_interval/collect_times).
Prototype:
```
TrainerMonitorV2.step() -> None
```
Parameters: None
Returns: None
Example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

TrainerMonitorV2.stop

Function: Stops monitoring and releases resources (removes internal registration/interception operations within monitoring modules and closes the writer).
Prototype:
```
TrainerMonitorV2.stop() -> None
```
Parameters: None
Returns: None
Example: See Example in PyTorch Scenario or Example in MindSpore Scenario.

Detailed Configuration

monitor_v2_config.json

Field	Mandatory (Yes/No)	Type	Description
`framework`	No	string	Framework type: `pytorch`/`mindspore` (also supports `pt/torch/ms` alias).
`output_dir`	No	string	Output directory. The default value is `./`.
`format`	No	string	Output format. Currently, only `csv` is supported.
`async_write`	No	bool	Reserved (synchronous writing in CSV is used).
`rank`	No	int / list[int]	Rank to be monitored. If this parameter is left empty or not set, all ranks are monitored.
`start_step`	No	int	Start step (included) for writing. The default value is `0`.
`stop_step`	No	int	End step (excluded) for writing. If this parameter is not set, it can be derived from `collect_times`.
`step_interval`	No	int	Write frequency: Write once every N steps (default value: `1`).
`step_count_per_record`	No	int	Number of steps to be combined into a CSV file (default value: `1`).
`patch_optimizer_step`	No	bool	Whether to automatically wrap `optimizer.step()` to trigger collection. If this parameter is not explicitly configured and `optimizer` is passed, it is enabled by default.
`collect_times`	No	int	Maximum number of write times. When the maximum number is reached, write operations stop. (The default value is large, meaning collection is effectively always performed.)
`monitors`	No	dict	Configuration set of monitoring modules. The key is the module name (see the following table).
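
A configuration file can be sanity-checked against the table above before training starts. The following is a minimal sketch that covers only some of the top-level fields and checks only what the table states (allowed framework names, format=csv, and integer types). It is not the validation that Monitor V2 itself performs, and the function names are hypothetical.

```python
FRAMEWORKS = {"pytorch", "mindspore", "pt", "torch", "ms"}
INT_FIELDS = ("start_step", "stop_step", "step_interval", "step_count_per_record", "collect_times")


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_top_level_config(config):
    """Return a list of problems found in the top-level fields (empty list means OK)."""
    problems = []
    if "framework" in config and config["framework"] not in FRAMEWORKS:
        problems.append("framework must be pytorch/mindspore (or pt/torch/ms)")
    if config.get("format", "csv") != "csv":
        problems.append("format must be csv")
    rank = config.get("rank")
    if rank is not None and not _is_int(rank):
        if not (isinstance(rank, list) and all(_is_int(r) for r in rank)):
            problems.append("rank must be an int or a list of ints")
    for key in INT_FIELDS:
        if key in config and not _is_int(config[key]):
            problems.append(f"{key} must be an int")
    return problems
```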

monitors Configurations

The format of each subitem in monitors is as follows:

{
  "enabled": true,
  "...": "Custom fields for each module"
}

The table below lists common fields that are applicable to most modules.

Field	Mandatory (Yes/No)	Type	Description
`enabled`	No	bool	Whether to enable the module. If this parameter is not set, `module` is enabled by default, and other modules are disabled by default.
`ops`	No	list[string]	Statistical metric. The value can be `min/max/mean/norm/nans`. If there is no valid item, the default value is used.
`eps`	No	number	Small constant used for numerical stability. The default value is `1e-8`.
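
The default for enabled differs between module and the other monitors, which is easy to overlook. The rule from the table can be sketched as follows; the function name is hypothetical, and the sketch only covers monitors that appear under monitors in the configuration file.

```python
def is_monitor_enabled(name, monitor_cfg):
    """Apply the documented default: only `module` is enabled when `enabled` is omitted."""
    enabled = monitor_cfg.get("enabled")
    if enabled is None:
        return name == "module"
    return bool(enabled)
```

In practice, setting enabled explicitly for every monitor avoids relying on this default.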

module (Module Input/Output and Backward Gradient)

Field	Mandatory (Yes/No)	Type	Description
`targets`	No	list[string]	Filters target modules. If this parameter is left empty, all modules are included. Otherwise, modules whose names contain the specified keyword are included.
`ops`/`eps`	No	-	See description in the "common fields" table.
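
The targets rule described above (empty means all modules, otherwise substring match) can be previewed before a run to see which modules a keyword list would select. This is a sketch of that rule as documented; the function name is hypothetical and the module names in the usage note are made up.

```python
def matches_targets(module_name, targets):
    """Return True if module_name would be selected by the targets list."""
    if not targets:
        # An empty or missing list selects every module.
        return True
    return any(keyword in module_name for keyword in targets)
```

With targets ["encoder.layers.0", "mlp"], a module named decoder.layers.3.mlp.fc1 is selected because its name contains mlp, while decoder.layers.3.attn is not.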

weight_grad (Weight Gradient)

Field	Mandatory (Yes/No)	Type	Description
`monitor_mbs_grad`	No	bool	Whether to record micro-batch gradients. The default value is `false`. In the PyTorch FSDP scenario, `weight_grad` automatically detects and collects gradients (`scope=unreduced`) before `reduce`. No separate `fsdp_grad` module is required.
`micro_batch_number`	No	int	Number of micro-batches, with a higher priority than `grad_acc_steps`.
`grad_acc_steps`	No	int	Number of gradient accumulation steps, which can be passed through `TrainerMonitorV2.start(..., grad_acc_steps=...)`.
`ops`/`eps`	No	-	See description in the "common fields" table.
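
The priority between micro_batch_number and grad_acc_steps can be expressed in a few lines. The sketch below takes one dict of settings and applies the documented priority. How Monitor V2 combines values from the configuration file with values passed to start() is not described here, so that merging step is left to the caller; the function name is hypothetical.

```python
def resolve_micro_batch_count(settings):
    """Return the micro-batch count, preferring micro_batch_number over grad_acc_steps.

    Returns None when neither field is set.
    """
    for key in ("micro_batch_number", "grad_acc_steps"):
        value = settings.get(key)
        if value is not None:
            return int(value)
    return None
```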

NOTE

weight_grad records unreduced in the backward phase and captures and records reduced before optimizer.step() is called.

optimizer (Optimizer Momentum)

Field	Mandatory (Yes/No)	Type	Description
`mv_distribution`	No	bool	Whether to collect momentum (m/v, typically `exp_avg/exp_avg_sq` of Adam). The default value is `true`.
`ops`/`eps`	No	-	See description in the "common fields" table.

param (Parameter Distribution)

Field	Mandatory (Yes/No)	Type	Description
`param_distribution`	No	bool	Whether to collect parameter distribution. The default value is `true`.
`ops`/`eps`	No	-	See description in the "common fields" table.

NOTE

param collects parameter distribution before and after optimizer.step() and outputs scope=param_origin/param_updated.

cc (Communication Operator)

It takes effect only when the distributed environment has been initialized (for example, torch.distributed.is_initialized() returns True in PyTorch).

Field	Mandatory (Yes/No)	Type	Description
`cc_codeline`	No	list[string]	Monitors only the specified code lines (for example, `train.py[23]`). If this parameter is left empty, no filtering is performed.
`cc_log_only`	No	bool	Whether to print logs only without collecting data. (Some implementations may interrupt training after printing. Exercise caution when using this parameter.)
`cc_pre_hook`	No	bool	Whether to monitor communication input (pre-collection).
`module_ranks`	No	list[int]	Ranks on which this module takes effect. (If this parameter is not set, the list is empty by default.)
`ops`/`eps`	No	-	See description in the "common fields" table.
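
Entries for cc_codeline follow the file[line] shape shown in the table (train.py[23]). The sketch below builds such a string from a path and a line number. Using only the file's base name mirrors the example; whether Monitor V2 also accepts or expects longer paths is not stated in this guide, so treat that as an assumption. The function name is hypothetical.

```python
import os


def format_codeline(path, lineno):
    """Build a cc_codeline entry such as 'train.py[23]' from a path and a line number."""
    if lineno < 1:
        raise ValueError("line numbers start at 1")
    # Assumption: only the base file name is used, as in the documented example.
    return f"{os.path.basename(path)}[{lineno}]"
```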

NOTE

monitors.cc supports two formats: configuring the preceding fields directly, or nesting them within cc_distribution (for compatibility with the legacy structure).