Network-wide Operator Precision Comparison in TorchAir Mode

Overview

In TorchAir mode (Torch graph mode), the input and output data of a model's intermediate operators is collected during two inference runs and compared to check whether the results are consistent, which helps locate the operators where precision diverges. In this mode, network-wide operator precision comparison supports two scenarios: comparison between GE data and FX data, and comparison between GE data dumped in fusion mode and GE data dumped with fusion disabled.

Concepts

For details, see Inference in TorchAir Mode.

Preparations

Environment Setup and General Constraints

For details, see Environment Setup and Constraints in Inference in TorchAir Mode.

Additional Constraints

Specify a different dump path for each of two GE dump operations or two FX dump operations. Otherwise, the data from the two dumps may be mixed together and become indistinguishable, which affects data comparison and analysis.

Precision Comparison Between Dump Data in GE Fusion Mode (Default) and FX Dump Data

For details about the dump data collection method, API parameters, examples, and result directory structure, see Dumping Data in GE Fusion Mode and Dumping Data in FX Mode.

Precision Comparison

Run the msprobe compare --target_path <GE_dump_data_path> --golden_path <FX_dump_data_path> [--output_path <output_path>] --mode torchair command to generate the comparison result as a CSV file in the directory specified by --output_path. If --output_path is not specified, the result is saved in the current directory.

# Use a version later than Ascend Extension for PyTorch 7.1.0.
msprobe compare --target_path ${dump_path}/msprobe_ge_dump --golden_path ${dump_path}/msprobe_fx_dump --mode torchair

# Use Ascend Extension for PyTorch 7.1.0.
msprobe compare --target_path ${dump_path}/msprobe_ge_dump --golden_path ${dump_path}/msprobe_fx_dump/data_dump --mode torchair

# Use a version earlier than Ascend Extension for PyTorch 7.1.0.
msprobe compare --target_path ${dump_path}/msprobe_ge_dump --golden_path data_dump --mode torchair

Note: When a version earlier than Ascend Extension for PyTorch 7.1.0 is used, the name of the token ID directory in the dump result file in FX mode is one greater than the actual token ID. Therefore, during comparison, the directory name is decremented by one to obtain the correct token ID.
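
The version-dependent choice of --golden_path shown above can be captured in a small helper. The following is a minimal sketch and not part of msprobe: the function names parse_version, fx_golden_path, and build_compare_cmd are hypothetical, the version is assumed to be a plain dotted string such as "7.1.0", and the first example command is assumed to apply to versions later than 7.1.0.

```python
import os


def parse_version(version):
    """Turn a dotted version string such as "7.1.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def fx_golden_path(dump_path, version):
    """Return the FX dump directory to pass as --golden_path (hypothetical helper)."""
    ver = parse_version(version)
    if ver > (7, 1, 0):
        return os.path.join(dump_path, "msprobe_fx_dump")
    if ver == (7, 1, 0):
        return os.path.join(dump_path, "msprobe_fx_dump", "data_dump")
    # Earlier versions: the example above passes the relative path data_dump.
    return "data_dump"


def build_compare_cmd(dump_path, version, output_path=None):
    """Assemble the msprobe compare argument list without running it."""
    cmd = [
        "msprobe", "compare",
        "--target_path", os.path.join(dump_path, "msprobe_ge_dump"),
        "--golden_path", fx_golden_path(dump_path, version),
        "--mode", "torchair",
    ]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd
```

The returned list can be passed to subprocess.run if the command should be launched from Python. It is kept as a list rather than a single string so that paths containing spaces need no extra quoting.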

Precision Comparison Between Dump Data in GE Fusion Mode (Default) and in GE Mode with Fusion Disabled

For details about the dump collection methods, examples, and directory structures for GE fusion and non-fusion modes, see Dumping Data in GE Fusion Mode and Dumping Data in GE Mode with Fusion Disabled.

Precision Comparison

Run the msprobe compare --target_path <GE_dump_data_path> --golden_path <fusion_off_GE_dump_data_path> [--output_path <output_path>] --mode torchair command to generate the comparison result as a CSV file in the directory specified by --output_path. If --output_path is not specified, the result is saved in the current directory.

msprobe compare --target_path ${dump_path in GE dump}/msprobe_ge_dump --golden_path ${dump_path in fusion off GE dump}/msprobe_ge_dump --mode torchair

Result Viewing

For details about the fields, determination criteria, and color marks in the precision comparison result, see the Appendix.

Appendix

Converting Dump Data into Targeted Information to Reduce Volume

Note

This section applies only to specific scenarios.

The data generated during the dump process may occupy a large amount of drive space. To reduce it, you can run a background process during the dump that replaces each complete data file with a smaller summary of it. The following script demonstrates how to reduce each tensor to its minimum and maximum values and delete the original data.

#!/usr/bin/env python3
import os
import time
import argparse

suffix = "_min_max"  # Suffix appended to the names of converted files

# Define how a single tensor is reduced to summary information
def convert_data_to_info(data):
    return [data.min(), data.max()]

def convert(data_path):
    import numpy as np
    from components.utils.acc_cmp import parse_torchair_dump_data

    npz_suffix, npy_suffix = "{}.npz".format(suffix), "{}.npy".format(suffix)
    for cur_path, dirs, files in os.walk(data_path):
        for file in files:
            if file.endswith(npy_suffix):  # FX data that has already been converted
                continue

            cur = os.path.join(cur_path, file)
            if file.endswith(".npy"):  # npy data saved by the FX dump
                file_name = os.path.splitext(cur)[0]
                # np.save appends ".npy" to the given name
                np.save(file_name + suffix, convert_data_to_info(np.load(cur)))
                os.remove(cur)
                print("Converted: {} -> {}{}".format(cur, file_name, npy_suffix))
            elif not file.endswith((npz_suffix, ".txt", ".swp")):
                # Every other file is parsed as GE dump data
                inputs, outputs = parse_torchair_dump_data(cur)
                inputs = [convert_data_to_info(ii) for ii in inputs]
                outputs = [convert_data_to_info(ii) for ii in outputs]

                np.savez(cur + npz_suffix, inputs=inputs, outputs=outputs)
                os.remove(cur)
                print("Converted: {} -> {}{}".format(cur, cur, npz_suffix))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("data_path", help="GE or FX data dump path")
    args = parser.parse_args()
    # Keep scanning until the process is stopped manually
    while True:
        convert(args.data_path)
        time.sleep(0.5)
        print("Waiting...")

Execute this script (saved as convert.py in the examples below) in the background during the dump process to convert the dump data into summary information, thereby reducing drive space usage.

# Convert the GE dump data in msprobe_ge_dump into info.
python3 convert.py msprobe_ge_dump

# If Ascend Extension for PyTorch 7.1.0 or later is used, convert the FX dump data in msprobe_fx_dump into info.
python3 convert.py msprobe_fx_dump

# If an Ascend Extension for PyTorch version earlier than 7.1.0 is used, convert the FX dump data in data_dump into info.
python3 convert.py data_dump

Comparison Result File Format

The precision comparison result in the TorchAir scenario is output in CSV format, including the following information:

Basic Information

API Name: operator or API name.
Stack Info: stack information, which is used to locate the code.
Data Name: data name, in the format of [NPU data name, Bench data name].

Metrics in Real Data Mode

The table below lists metrics of dump data in real data mode.

Metric	Meaning	Normal Range
Cosine	Cosine that measures the similarity between the directions of two vectors.	0.99-1.0
EucDist	Euclidean distance that measures the absolute distance between two vectors.	The smaller, the better.
MaxAbsErr	Maximum absolute error.	The smaller, the better.
MaxRelativeErr	Maximum relative error.	< 0.01 (generally)
One Thousandth Err Ratio	Proportion of elements with relative errors less than one per thousand.	The higher, the better.
Five Thousandth Err Ratio	Proportion of elements with relative errors less than five per thousand.	The higher, the better.
Requires_grad Consistent	Whether the requires_grad attribute of the NPU data and the benchmark data is consistent.	True
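
To make the numeric metrics in the table concrete, the following pure-Python sketch computes them for two small vectors. It illustrates the textbook definitions only. The exact formulas used by msprobe are not given here, so the choice of the benchmark value as the denominator of the relative error and the function name compare_metrics are assumptions.

```python
import math


def compare_metrics(npu, bench):
    """Compute illustrative precision metrics for two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(npu, bench))
    norm_npu = math.sqrt(sum(a * a for a in npu))
    norm_bench = math.sqrt(sum(b * b for b in bench))
    cosine = dot / (norm_npu * norm_bench)  # assumes neither vector is all zeros

    abs_errs = [abs(a - b) for a, b in zip(npu, bench)]
    # Relative error is taken against the benchmark value; assumes no zeros in bench.
    rel_errs = [err / abs(b) for err, b in zip(abs_errs, bench)]

    return {
        "Cosine": cosine,
        "EucDist": math.sqrt(sum(err * err for err in abs_errs)),
        "MaxAbsErr": max(abs_errs),
        "MaxRelativeErr": max(rel_errs),
        "One Thousandth Err Ratio": sum(r < 1e-3 for r in rel_errs) / len(rel_errs),
        "Five Thousandth Err Ratio": sum(r < 5e-3 for r in rel_errs) / len(rel_errs),
    }
```

For example, compare_metrics([1.0, 2.0, 4.04], [1.0, 2.0, 4.0]) yields a maximum relative error of about 0.01 and both error ratios of about 0.67, because one of the three elements deviates by roughly 1%.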

Metrics in Statistics Mode

The table below lists metrics of dump data in statistics mode.

Metric	Meaning
Max diff	Maximum value difference
Min diff	Minimum value difference
Mean diff	Mean difference
L2norm diff	L2 norm difference
MaxRelativeErr	Maximum relative error
MinRelativeErr	Minimum relative error
MeanRelativeErr	Mean relative error
NormRelativeErr	Norm relative error
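
The statistics-mode columns can be read as differences between per-tensor summary statistics. The sketch below shows one plausible way to derive them. The formulas (difference = NPU value minus benchmark value, relative error = absolute difference divided by the absolute benchmark value) and the names summarize, COLUMNS, and statistics_diff are assumptions made for illustration, not the documented msprobe implementation.

```python
import math


def summarize(values):
    """Reduce a list of floats to the four statistics compared in this mode."""
    return {
        "Max": max(values),
        "Min": min(values),
        "Mean": sum(values) / len(values),
        "L2norm": math.sqrt(sum(v * v for v in values)),
    }


# (statistic key, difference column, relative error column)
COLUMNS = (
    ("Max", "Max diff", "MaxRelativeErr"),
    ("Min", "Min diff", "MinRelativeErr"),
    ("Mean", "Mean diff", "MeanRelativeErr"),
    ("L2norm", "L2norm diff", "NormRelativeErr"),
)


def statistics_diff(npu_stats, bench_stats):
    """Assumed formulas: diff = NPU - bench, relative error = |diff| / |bench|."""
    row = {}
    for key, diff_col, rel_col in COLUMNS:
        diff = npu_stats[key] - bench_stats[key]
        row[diff_col] = diff
        # Assumes a nonzero benchmark statistic; a zero would raise ZeroDivisionError.
        row[rel_col] = abs(diff) / abs(bench_stats[key])
    return row
```

Because only four numbers per tensor are compared, this mode cannot detect element-wise differences that leave the summary statistics unchanged.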

Metrics in MD5 Mode

The table below lists metrics of dump data in MD5 mode.

Metric	Meaning
NPU MD5	CRC-32 value of NPU data
BENCH MD5	CRC-32 value of benchmark data
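
As the table notes, the values in these columns are CRC-32 checksums. The following sketch shows how such a checksum can be computed with the Python standard library. How msprobe serializes a tensor into bytes before hashing is not described here, so the raw byte strings and the helper name crc32_hex are assumptions.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of data as an 8-digit lowercase hexadecimal string."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


npu_bytes = b"\x00\x00\x80\x3f"    # stand-in for the raw bytes of an NPU tensor
bench_bytes = b"\x00\x00\x80\x3f"  # stand-in for the raw bytes of the benchmark tensor

# Identical bytes always produce identical checksums.
print(crc32_hex(npu_bytes) == crc32_hex(bench_bytes))
```

A checksum comparison only answers whether two tensors are bit-identical. It gives no indication of how large a difference is.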

Result Information

Result: comparison result (PASS/FAIL)
Accuracy Reached or Not: precision meets the requirement or not (Yes/No)
Err_message: error message

Result Determination Criteria

Determination in Real Data Mode

PASS: Cosine ≥ 0.99 and MaxRelativeErr < 0.01
FAIL: Cosine < 0.99 or MaxRelativeErr ≥ 0.01
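
The two rules above translate directly into a predicate. This is a minimal sketch of the stated thresholds only, and the function name judge_real_data is hypothetical.

```python
def judge_real_data(cosine, max_relative_err):
    """Apply the stated thresholds: PASS needs Cosine >= 0.99 and MaxRelativeErr < 0.01."""
    if cosine >= 0.99 and max_relative_err < 0.01:
        return "PASS"
    # NaN inputs also end up here, because every comparison with NaN is False.
    return "FAIL"
```

For example, judge_real_data(0.999, 0.01) returns "FAIL" because the relative error limit is a strict inequality.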

Determination in Statistics Mode

PASS: difference metrics are within the acceptable range.
FAIL: significant differences exist.

Determination in MD5 Mode

PASS: NPU MD5 == BENCH MD5
FAIL: NPU MD5 != BENCH MD5

Color Marks

When highlight colors are used:

Red: abnormal precision, requiring special attention.
Yellow: suspicious precision, requiring further analysis.
Green: normal precision.

Special Value Processing

N/A: Metric value cannot be calculated.
NaN: The calculation result is not a number, usually because NaN values exist.
inf: The calculation result is infinite, usually due to division by zero.

If the dump data contains 0 or NaN, the maximum relative error in the comparison result may be inf or NaN, which is normal.
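
The following sketch shows why zeros and NaN values in the data lead to inf or NaN relative errors. It mimics IEEE 754 floating-point behavior in pure Python. The function name relative_error and the use of the benchmark value as the denominator are assumptions for illustration.

```python
import math


def relative_error(npu, bench):
    """Relative error of one element, following IEEE 754 rules for zeros and NaN."""
    diff = abs(npu - bench)
    if math.isnan(diff):
        return math.nan   # a NaN on either side propagates to the result
    if bench == 0:
        # 0 / 0 is undefined (NaN); a nonzero difference divided by 0 is infinite.
        return math.nan if diff == 0 else math.inf
    return diff / abs(bench)
```

A single such element is enough to make the maximum relative error of a whole tensor inf or NaN, even when all other elements match closely.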

Result File Location

The comparison result CSV file is saved in the current directory by default, or in the directory specified by the --output_path parameter.
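
Once the CSV file has been generated, the failing rows can be extracted with the standard csv module. The sketch below assumes that the header row uses the column names listed in this appendix ("API Name" and "Result"). The function name failed_rows and the in-memory sample data are hypothetical.

```python
import csv
import io


def failed_rows(csv_file):
    """Return the rows of an open comparison result file whose Result is FAIL."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader
            if (row.get("Result") or "").strip().upper() == "FAIL"]


# In-memory stand-in for a result file; real files contain many more columns.
sample = io.StringIO("API Name,Result\nAdd_1,PASS\nMatMul_2,FAIL\n")
failures = failed_rows(sample)
print([row["API Name"] for row in failures])
```

For a real result file, pass the object returned by open(path, newline="") instead of the in-memory sample.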