Single-Point Saving Tool
Overview
The dump capabilities at the L0, L1, and mix levels have blind spots. The input and output of non-API or non-module on the network cannot be dumped in batches. The single-point saving tool provides functions similar to np.save and print, to save specified variables. In addition, the tool is enhanced for foundation model scenarios and has the following features:
- The backward gradient result of a variable can be saved.
- The nested structure data (such as list and dict) can be directly saved without manual traversal.
- Data can be automatically saved by rank.
- Data can be saved by step.
- The number of calls can be automatically counted.
- Statistics or tensors can be saved (not supported in MindSpore static graph mode).
- Asynchronous saving is supported.
The single-point saving tool may be used across files. For details about how to enable the tool, see Cross-File Data Collection.
Preparations
Environment Setup
Install msProbe by referring to msProbe Installation Guide.
Constraints
The PyTorch and MindSpore frameworks are supported.
Application Scenarios
Dynamic Graph (PyTorch & MindSpore)
Enablement Method
Configuration File Description
For more details, see General Configuration.
| Parameter | Description | Mandatory (Yes/No) |
|---|---|---|
| task | Dump task type, which is of the string type. In the single-point saving scenario, the value can only be statistics or tensor. | Yes |
| level | Dump level, which is of the string type. Data is collected at different levels. In the single-point saving scenario, the value is debug. | Yes |
| dump_path | Dump data directory, which is of the string type. | Yes |
| rank | Rank whose data is to be collected, which is of the list[Union[int, str]] type. | No |
| step | Step whose data is to be collected, which is of the list[Union[int, str]] type. | No |
| async_dump | Asynchronous dump switch, which is of the bool type. In this mode, summary_mode does not support MD5 values or statistical calculation of complex tensors. | No |
The table below lists a sub-configuration item for a statistics task.
| Parameter | Description | Mandatory (Yes/No) |
|---|---|---|
| summary_mode | Dump file output mode, which is of the string type. The value can be statistics or md5. For details, see [task = statistics](./config_json_introduct.md#task = statistics). | No |
There is no sub-configuration items for a tensor task.
API Call Description
Call PrecisionDebugger.save, pass the variables to be saved, specify variable names, and specify whether to save the backward data. For details about the input parameters of the API, see PyTorch Single-Point Saving API or MindSpore Single-Point Saving API.
Example
Configuration file: (The PyTorch scenario is used as an example. In the MindSpore scenario, you only need to import the package from the msprobe.mindspore module.)
{
"task": "statistics",
"dump_path": "./dump_path",
"rank": [],
"step": [],
"level": "debug",
"async_dump": false,
"statistics": {
"summary_mode": "statistics"
}
}
Initialization:
# Training startup script
from msprobe.pytorch import PrecisionDebugger
debugger = PrecisionDebugger("./config.json")
for data, label in data_loader:
# Perform model training.
train(data, label)
Initialization (without a configuration file):
# Training startup script
from msprobe.pytorch import PrecisionDebugger
debugger = PrecisionDebugger(dump_path="dump_path", level="debug")
for data, label in data_loader:
# Perform model training.
train(data, label)
Example of calling the save API (using PyTorch code as an example; the usage is the same for MindSpore):
import torch
import torch.nn as nn
import torch.nn.functional as F
from msprobe.pytorch import PrecisionDebugger, seed_all
# Instantiate PrecisionDebugger before model training.
debugger = PrecisionDebugger(dump_path="dump_path", level="debug")
# Define a network.
class ModuleOP(nn.Module):
def __init__(self) -> None:
super().__init__()
self.linear_1 = nn.Linear(in_features=8, out_features=4)
self.linear_2 = nn.Linear(in_features=4, out_features=2)
def forward(self, x):
x1 = self.linear_1(x)
x2 = self.linear_2(x1)
debugger.save(x2, "x2", save_backward=True) # Call the save API.
r1 = F.relu(x2)
return r1
if __name__ == "__main__":
module = ModuleOP()
x = torch.randn(10, 8)
out = module(x)
loss = out.sum()
loss.backward()
Save data by step (using PyTorch code as an example; the usage is the same for MindSpore):
import torch
import torch.nn as nn
import torch.nn.functional as F
from msprobe.pytorch import PrecisionDebugger
# Instantiate PrecisionDebugger before model training.
debugger = PrecisionDebugger(dump_path="dump_path", level="debug")
# Define a network.
class ModuleOP(nn.Module):
def __init__(self) -> None:
super().__init__()
self.linear_1 = nn.Linear(in_features=8, out_features=4)
self.linear_2 = nn.Linear(in_features=4, out_features=2)
def forward(self, x):
x1 = self.linear_1(x)
x2 = self.linear_2(x1)
debugger.save(x2, "x2", save_backward=True) # Call the save API.
r1 = F.relu(x2)
return r1
if __name__ == "__main__":
module = ModuleOP()
train_iter = 10
for i in range(train_iter):
x = torch.randn(10, 8)
out = module(x)
loss = out.sum()
loss.backward()
debugger.step() # Call debugger.step to distinguish steps for data saving.
Static Graph (MindSpore)
Enablement Method
API Description
The tool provides three external APIs for saving data during training.
| API | Description | Supported Device | MindSpore Version | Application Scenario |
|---|---|---|---|---|
| save | Saves tensor data of forward propagation. | Ascend/GPU/CPU | >= 2.6.0 | Ascend is supported only in graph mode. All platforms are supported in PyNative mode. |
| save_grad | Saves gradient data of backpropagation. | Ascend/GPU/CPU | >= 2.6.0 | Ascend is supported only in graph mode. All platforms are supported in PyNative mode. |
| step | Updates the number of training steps. | Ascend/GPU/CPU | >= 2.6.0 | Managing step directory for storing data |
API Details
1. save
save(save_dir: str, name: str, data: Union[Tensor, List, Tuple, Dict])
Parameters:
save_dir: data storage directoryname: data identifier (used as the prefix of the file name)data: multiple data types are supported.mindspore.Tensor: single tensorList/Tuple/Dict: nested structure (automatically expanded and saved)
Example:
from msprobe.mindspore import save
class Net(nn.Cell):
def construct(self, x):
save("./dump_data", 'input', x) # Save the input data.
return x * 2
2. save_grad
save_grad(save_dir: str, name: str, data: Tensor) -> Tensor
Parameters:
save_dir: gradient storage directoryname: gradient identifier (used as the prefix of the file name)data: Must be of typemindspore.Tensor.
Precautions:
- The return value must be received and passed back to the original computational graph.
- This operation does not affect computing precision.
Example:
from msprobe.mindspore import save_grad
class Net(nn.Cell):
def construct(self, x):
x = save_grad("./dump_data", 'grad', x) # Save the gradient data.
return x * 2
3. step
step()
Description:
- Counts the number of incremental training steps.
- Manage step directories (such as step0/ and step1/) for storing data.
- If this API is not called, all data is stored in the same step directory.
Example:
from msprobe.mindspore import save, step
# Training epoch
for epoch in range(epochs):
train_one_epoch()
step() # Update the step after each epoch.
Return Value
Dynamic Graph (PyTorch & MindSpore)
-
task = statistics:
debug.jsonthat contains variable statistics is generated in the dump directory. The key of the statistics indebug.jsonis named in the format of {{variable_name}{grad_flag}.{count}.debug. -
task = tensor: In addition to
debug.jsonthat contains variable statistics in the dump directory, a tensor binary file is stored indump_tensor_data. The file name is in the format of{variable_name}{grad_flag}.{count}.debug.{indexes}.{file_suffix}.- variable_name: name of the variable passed to the save API.
- grad_flag: backward flag. _grad indicates backward data, and "" indicates forward data.
- count: number of calls with the same variable name.
- indexes: index of the nested structure data to be saved. For example, if the nested structure is
{"key1": "value1", "key2": ["value2", "value3"]}, the index of value2 is key2.0. - file_suffix: file name extension. The value is pt in the PyTorch scenario and npy in the MindSpore scenario.
Static Graph (MindSpore)
The {step}/{rank} directory is generated in the specified directory save_dir, and a.npy file with the specified {name} is generated in the directory. If the save_grad API is called, a .npy file with {name}_grad is generated.
For example, save("./test_dump", 'x', x) -> ./test_dump/step0/rank0/x_float32_0.npy
or z = save_grad("./test_dump", 'z', z) -> ./test_dump/step0/rank0/z_grad_float32_0.npy.
The structure is as follows:
./save_dir/
├── step0/
│ ├── rank0/
│ │ ├── x_float32_0.npy # Forward data saved by save
│ │ └── z_grad_float32_0.npy # Gradient data saved by save_grad
├── step1/
│ ├── rank0/
│ │ ├── ...