msOpProf Simulator Mode User Guide

Overview

MindStudio Ops Profiler (msOpProf, an operator tuning tool) is used to collect and analyze the key performance metrics of operators running on AI Processors. Based on the output profile data, you can quickly locate the hardware and software performance bottlenecks of operators, improving the efficiency of operator performance analysis.

Currently, profile data for different file formats (executable files or operator binary .o files) can be collected and automatically parsed in on-board (msopprof) and simulator (msopprof simulator) modes. This document describes how to use the msopprof simulator mode.

Features

msOpProf presents single-operator tuning capabilities, such as instruction pipeline charts, operator code hot spot maps, memory channel throughput waveform charts, and profile data files, through MindStudio Insight. For details, see Table 1 msopprof simulator mode features.

Table 1 msopprof simulator mode features

Feature	Link
Instruction pipeline chart	Instruction Pipeline Chart
Operator code hot spot map	Operator Code Hot Spot Map
Memory channel throughput waveform chart	Memory Channel Throughput Waveform Chart
Profile data files	msopprof Simulator Profile Data

Scenarios

The following scenarios are supported. For details, see Collecting Profile Data of Ascend C Operators and Collecting Profile Data of MC2 Operators.

NOTE

Refer to Chip SoC Type Acquisition Method to obtain the chip type, and use it as the value of the --soc-version parameter.

Kernel launch operator development: kernel launch
- For details about the kernel launch scenario, see Kernel Launch Operator Development in the Ascend C Operator Development Guide.
- In the kernel launch scenario, configure the prerequisites and then run the following command:
```
msprof op simulator --soc-version=Ascendxxxyy ./main # main indicates the name of the user operator program, including the program name of the operator to be tuned. xxxyy indicates the type of the processor used by the user.
```
- If you need to perform simulation-based tuning on an operator that runs on the board without recompilation, perform the following steps:
  - Create a soft link named libruntime.so pointing to libruntime_camodel.so in any directory.
```
ln -s {simulator_path}/lib/libruntime_camodel.so {so_path}/libruntime.so
# For example, if the CANN package is installed in the default path of the root user, simulator_path is /usr/local/Ascend/cann/tools/simulator/Ascendxxxyy.
```
  - Add the parent directory of the created soft link to the environment variable LD_LIBRARY_PATH.
```
export LD_LIBRARY_PATH={so_path}:$LD_LIBRARY_PATH
```
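The two steps above can also be scripted. The following is a minimal Python sketch, assuming a POSIX system. The helper name link_camodel_runtime and its arguments are hypothetical and are not part of msOpProf. It creates the libruntime.so soft link and returns an environment mapping in which the link's directory is prepended to LD_LIBRARY_PATH.

```python
import os
from pathlib import Path


def link_camodel_runtime(simulator_path, so_path, env=None):
    """Create {so_path}/libruntime.so -> {simulator_path}/lib/libruntime_camodel.so.

    Returns a copy of the environment with so_path prepended to LD_LIBRARY_PATH.
    Both paths are supplied by the caller; nothing is guessed here.
    """
    target = Path(simulator_path) / "lib" / "libruntime_camodel.so"
    link = Path(so_path) / "libruntime.so"
    if not target.is_file():
        raise FileNotFoundError(f"simulator runtime not found: {target}")
    # Replace a stale soft link, but refuse to overwrite a regular file.
    if link.is_symlink():
        link.unlink()
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a soft link")
    link.symlink_to(target)

    new_env = dict(os.environ if env is None else env)
    old = new_env.get("LD_LIBRARY_PATH", "")
    new_env["LD_LIBRARY_PATH"] = f"{so_path}:{old}" if old else str(so_path)
    return new_env
```

The returned mapping can be passed as the env argument of subprocess.run when the operator program is launched, so the current shell environment is left unchanged.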
Project-based operator development: single-operator API calling
- For details about the single-operator API execution scenario, see Project-based Operator Development > Single-Operator API Execution in the Ascend C Operator Development Guide.
- In the single-operator API execution scenario, configure the prerequisites and then run the following command:
```
msprof op simulator --soc-version=Ascendxxxyy ./main # main indicates the name of the user operator program, including the program name of the operator to be tuned. xxxyy indicates the type of the processor used by the user.
```
AI framework operator adaptation: PyTorch framework
- When msOpProf is used for simulation-based tuning of the operators in a PyTorch script on Atlas inference products, only the Kernels-based operator package calling mode is supported. Refer to the content related to Kernels operator package installation in Installing CANN in the CANN Software Installation Guide. Install the binary Kernels operator package, and modify the script entry file (such as main.py) by adding the torch_npu.npu.set_compile_mode(jit_compile=False) line below import torch_npu to ensure that the operators in the Kernels operator package are used.
```
import torch
import torch_npu
torch_npu.npu.set_compile_mode(jit_compile=False)
......
```
- For details about the single-operator calling scenario through the PyTorch framework, see OpPlugin in Ascend-developed Plugins of the Ascend Extension for PyTorch Suite and Third-party Library Support List.
- When the PyTorch framework is used to call a single operator, configure the prerequisites and then run the following command:
```
msprof op simulator --soc-version=Ascendxxxyy python a.py   # a.py indicates the name of the user operator program, including the program name of the operator to be tuned. xxxyy indicates the type of the processor used by the user.
```
Triton operator development: Triton operator calling
- Install and configure Triton and the Triton-Ascend plug-in. For details, see Triton Ascend.
- The Triton operator calling scenario does not apply to Atlas inference products.

Preparations

Preparing the environment

Configure related environment variables by referring to the MindStudio Ops Profiler Installation Guide.

To use MindStudio Insight for viewing, install the MindStudio Insight software package separately. For download links, see the MindStudio Insight Installation Guide.
For Atlas A2 training products/Atlas A2 inference products, if you want to use the template library for simulation, add the --simulator option to the compilation script to compile the operator in simulator mode. For details, see this link.
```
bash scripts/build.sh --simulator 00_basic_matmul
```

Constraints

You are advised to collect profile data within 5 minutes and ensure that the set memory size is greater than 20 GB (for example, container configuration docker run --memory=20g container_name).
Ensure that the profile data is stored in the current user directory that does not contain soft links. Otherwise, security issues may occur.

Precautions

msOpProf depends on the msopprof executable file in the CANN package. The API usage in this file is the same as that in msopprof. This file is provided by the CANN package and does not need to be installed separately.
After you press Ctrl+C, the operator execution stops, and the tool generates a profile data file based on existing information. If you do not need to generate the file, press Ctrl+C again.
If the --output option is not specified, ensure that other users do not have the write permission on the upper-level directory of the current path.
Before using msopprof simulator, ensure that the application functions properly.
Do not initiate more than one profile data collection task on the same device.
The simulation result of msopprof simulator in the document is for reference only. The actual running status of the operator is subject to the actual simulation data.
You need to ensure the execution security of executable files or applications.
- You are advised to restrict the operation permission on executable files or applications to avoid privilege escalation risks.
- Avoid high-risk operations (such as deleting files, deleting directories, changing passwords, and running privilege escalation commands) to prevent security risks.
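The directory permission precautions above (and the related requirements on --output in Command Reference) can be checked before a collection task is started. The following is a minimal Python sketch, assuming a POSIX system; the function name check_output_parent is hypothetical, and the check is not something msOpProf itself exposes.

```python
import os
import stat


def check_output_parent(path):
    """Return a list of problems found with the parent directory of `path`.

    Checks two conditions from this guide: the parent directory is owned by
    the current user, and group/others have no write permission on it.
    """
    parent = os.path.dirname(os.path.abspath(path))
    st = os.stat(parent)
    problems = []
    if st.st_uid != os.getuid():
        problems.append(f"{parent} is not owned by the current user")
    if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append(f"{parent} is writable by group or others")
    return problems
```

An empty list means that both conditions hold. If --output is not specified, the same check can be applied to a path inside the current working directory.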

Command Reference

Log in to the operating environment, use msopprof simulator to enable the operator simulation and tuning function, and use the optional simulation parameters and the program to be tuned (blockdim 1) for tuning. For details about the optional simulation parameters, see Table 1 Optional msopprof simulator parameters. Refer to Chip SoC Type Acquisition Method to obtain the chip type, and use it as the value of the --soc-version parameter. An example command is as follows:

msprof op simulator --soc-version=Ascendxxxyy --output=/home/projects/output /home/projects/MyApp/out/main blockdim 1   # --output is an optional parameter, /home/projects/MyApp/out/main is the used app, blockdim 1 is an optional parameter of the user application, and xxxyy is the type of the processor used by the user.

Table 1 Optional msopprof simulator parameters

Optional Parameter	Description	Mandatory
--application	Specifies the executable file to profile. You are advised to use `msprof op simulator --soc-version=Ascendxxxyy [msopprof simulator parameters] ./app`, where `xxxyy` indicates the processor type and `./app` is a user-specified executable file path. If no path is provided, the current directory is used. When using `./app`, add msopprof simulator parameters before `./app` to ensure that the related functions take effect. Currently, this command is compatible with `./app [arguments]`. In the future, it will be changed to `./app [arguments]`.	Yes. Choose one of `--application`, `--config`, or `--export`.
--config	Specifies the absolute or relative path of the binary file `.o` generated after operator compilation. For details, see JSON Configuration File Description. Before operator tuning, you can obtain the operator binary `.o` file in either of the following ways: Refer to Modifying and Executing One-Click Compilation and Execution Script in Kernel Launch Operator Development > Kernel Launch of the Ascend C Operator Development Guide to obtain the NPU executable file, and then manually extract the .o file from the executable file. Refer to Operator Compilation and Deployment. The .o file is automatically generated during operator compilation. Ensure that users in the group and other groups do not have the write permission on the JSON file specified by `--config` and its parent directory. In addition, ensure that the owner of the parent directory of the JSON file is the current user. You need to use the `LD_LIBRARY_PATH` environment variable to set the simulator type. export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH // xxxyy indicates the type of the processor used by the user.
--export	Specifies a folder containing single-operator simulation results, which will be directly parsed for MindStudio Insight to display the single-core or multi-core instruction pipeline chart of a single operator. Note: The specified folder should only contain multi-core data and operator kernel file `aicore_binary.o`. Rename the binary file specified in `--config` (`*.o`) to `aicore_binary.o`. If only dump files are provided, code line mapping cannot be generated in the instruction pipeline chart. To view code lines, store the operator kernel file `aicore_binary.o` in the dump. Ensure that users in the group and other groups do not have the write permission on the directory specified by `--export` and all files in the directory. In addition, ensure that the owner of the specified directory is the current user.
--kernel-name	Specifies the operator name to collect. Fuzzy matching using operator name prefixes is supported. If this option is not specified, only data of the first operator scheduled during program running is collected. Note: This option must be used with `--application`. The value can contain a maximum of 1,024 characters, restricted to letters, digits, and underscores (_). If multiple operators need to be collected, use vertical bars (|) to combine them. For example, `--kernel-name="add|abs"` indicates that operators whose prefixes are `add` and `abs` are collected. The number of operators collected is determined by the value of `--launch-count`. Wildcards (`*`) can be used to match strings of any length.	No
--launch-count	Sets the maximum number of operators that can be collected. The default value is 1, and the value is an integer ranging from 1 to 5000.	No
--aic-metrics	Enables operator performance metric collection. The following performance metrics can be collected. `PipeUtilization` (collected by default): computing and transfer instruction pipelines. When `--aic-metrics=PipeUtilization` is configured, `ResourceConflictRatio` is disabled. That is, only the instruction pipeline is displayed, and the details of synchronization event instructions are not included. `ResourceConflictRatio` (collected by default): displays details about synchronization event instructions. For Atlas A3 training products, Atlas A3 inference products, Atlas A2 training products, and Atlas A2 inference products, `SET_FLAG` and `WAIT_FLAG` instructions are displayed. For Atlas inference products, `set_event` and `wait_event` instructions are displayed. `PMSampling`: enables and visualizes the memory channel throughput waveform, for example, `--aic-metrics=PMSampling`. For details, see Memory Channel Throughput Waveform Chart. `--core-id` does not take effect for the `PMSampling` parameter. `PMSampling` parses all cores. This feature is disabled by default.	No
--core-id	This parameter is used when the operators are evenly distributed. You can use `--core-id` to specify the IDs of some logical cores to parse their simulation data. The core ID range is [0,49]. If the simulation data of multiple cores needs to be parsed, use vertical bars (|) to combine them. For example, `--core-id="0|31"` parses simulation data of cores whose IDs are 0 and 31. `--core-id` does not take effect for the `PMSampling` parameter. `PMSampling` parses all cores.	No
--timeout	This parameter is applicable to operators with a large amount of data and repetitive computation. Running such operators to completion takes significant time, but partial pipeline data provides sufficient information. Set `--timeout` to reduce running duration and capture necessary pipeline information. The implementation is as follows: When simulation duration reaches the `--timeout` value, msOpProf terminates the simulation process and begins parsing. Only the simulated data is analyzed. At the same time, msOpProf displays: [INFO] The timeout has reached and the application will be forcibly killed. If the process completes normally before reaching the timeout, the simulation ends and parsing proceeds. The value is an integer ranging from 1 to 2880, in minutes. An example is as follows: msprof op simulator --soc-version=Ascendxxxyy --timeout=1 ./add_custom // xxxyy indicates the type of the processor used by the user.	No
--mstx	Determines whether the operator tuning tool enables the mstx APIs used in the user code program. The default value is `off`, indicating that the mstx APIs are disabled. When `--mstx=on` is set, the operator tuning tool enables the mstx API used in the user program. For example: msprof op simulator --soc-version=Ascendxxxyy --mstx=on ./add_custom // xxxyy indicates the type of the processor used by the user. The `mstxRangeStartA` and `mstxRangeEnd` interfaces in the mstx API are supported, allowing for the enabling of operator tuning in specified ranges. For details about parameters, see the mstxRangeStartA and mstxRangeEnd interfaces in the MindStudio mstx API Reference.	No
--mstx-include	Enables the specified mstx APIs in msOpProf. If this parameter is not set, all mstx APIs used in user code are enabled by default. If this parameter is set, only the specified mstx APIs are enabled. The input of `--mstx-include` is the message character string transferred when the user calls the `mstx` function. Multiple character strings must be separated by vertical bars (|). For example: --mstx=on --mstx-include="hello|hi" // This enables only mstx APIs where the message parameter is "hello" or "hi". This parameter must be used with `--mstx`. The message can only contain letters, digits, and underscores (_). Use vertical bars (|) to combine multiple messages.	No
--soc-version	Use this parameter or the `LD_LIBRARY_PATH` environment variable to specify the simulator type. The details are as follows: `--soc-version`: specifies the simulator type in `--application` and `--export` modes. For details about the value range, see the simulator type in the `${INSTALL_DIR}/tools/simulator` directory. `LD_LIBRARY_PATH` environment variable: specifies the simulator type in `--config` mode or when `--soc-version` is not used. export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH Replace `${INSTALL_DIR}` with the file storage path after the CANN software is installed. For example, if the installation is performed by the `root` user, the default file storage path is `/usr/local/Ascend/cann`.	No
--output	Specifies the path for storing the collected performance data, which defaults to the current directory. Ensure that users in the group and other groups do not have the write permission on the parent directory of the path specified by `--output`. In addition, ensure that the owner of the parent directory of the directory specified by `--output` is the current user.	No
--dump	Specifies whether to generate the dump file of the simulator. The value can be `on` or `off`. The default value is `off`, indicating that the simulator dump file is not generated. Note: This parameter is valid only for Atlas A2 training products, Atlas A2 inference products, Atlas A3 training products, and Atlas A3 inference products. For Atlas inference products, this parameter does not take effect. The dump files are saved to drives as usual. This parameter applies only to the single-process scenario and does not support the scenario where two operators run at the same time.	No
-h, --help	Outputs help information.	No
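
The interaction between --kernel-name and --launch-count described in the table can be illustrated with a short Python sketch. This is only a model of the documented rules (prefix matching, vertical bars as separators, * as a wildcard, and --launch-count as an upper bound), not the tool's implementation. The function name select_kernels, case-sensitive matching, and the reading that the upper bound applies to the total number of collected operators are assumptions.

```python
from fnmatch import fnmatchcase


def select_kernels(scheduled, kernel_name=None, launch_count=1):
    """Pick operators from `scheduled` (in launch order) per the documented rules.

    kernel_name: "A|B" name prefixes, optionally containing "*" wildcards.
    launch_count: upper bound on the number of collected operators.
    """
    patterns = None if kernel_name is None else kernel_name.split("|")
    selected = []
    for name in scheduled:
        if len(selected) >= launch_count:
            break
        # Prefix match: "*" is appended so that "Add" also matches "AddCustom".
        if patterns is None or any(fnmatchcase(name, p + "*") for p in patterns):
            selected.append(name)
    return selected
```

With the defaults (no --kernel-name and --launch-count=1), the sketch returns only the first scheduled operator, which matches the default behavior stated in the table.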

Usage

msOpProf assists in identifying exceptions in the operator memory, code, and instructions, enabling comprehensive operator tuning. For details about the usage, see Table 1 msopprof simulator functions.

Table 1 msopprof simulator functions

Scenario	Usage	Displayed Graphs
It is applicable to the development and debugging phases for detailed simulation tuning, allowing you to analyze operator instructions and code hotspots.	Configure environment variables (such as `LD_LIBRARY_PATH`) and compilation options (such as `-g` to generate debugging information) as detailed in msopprof simulator configuration. This enables detailed analysis of operator behavior in a simulated environment.	Instruction Pipeline Chart Operator Code Hot Spot Map Memory Channel Throughput Waveform Chart

msopprof simulator configuration

NOTE

The simulation function of the msOpProf tool only supports single-device scenarios and cannot simulate multi-device environments. Refer to Chip SoC Type Acquisition Method to obtain the chip type, and use it as the value of the --soc-version parameter.

Before using msOpProf to perform operator simulation-based tuning in --config mode, run the following command to configure environment variables:
```
export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH 
```
Modify the preceding environment variables based on the actual installation path of the CANN package and the AI processor type.
Add the -g compilation option to enable the operator code hot spot map and code call stack features.
NOTE
- If the -g compilation option is added, the generated binary file contains debugging information. You are advised to restrict access to user programs with debugging information to authorized personnel only.
- If the functions provided by the llvm-symbolizer component are not used, do not include -g when compiling the program that is input to msOpProf. In this case, msOpProf does not call the functions of the llvm-symbolizer component.
- For an operator project created by referring to the msOpGen tool, edit the CMakeLists.txt file in the op_kernel directory of the operator project. For details, see Creating an Operator Project.
```
add_ops_compile_options(ALL OPTIONS -g)
```
- For a project created by referring to the complete example, for example, the sample here, add the following code to the cmake/npu_lib.cmake file in the sample project directory.
  NOTE
  - This sample project does not support Atlas A3 training products.
  - When downloading the code sample, run the following command to specify the branch version:
```
git clone https://gitee.com/ascend/samples.git -b v1.9-8.3.RC1
```
```
ascendc_compile_options(ascendc_kernels_${RUN_MODE} PRIVATE
-g
-O2
)
```
- For Triton operators, add -g by configuring the following environment variable.
```
export TRITON_DISABLE_LINE_INFO=0
```
When msOpProf is used to perform simulation-based tuning on the operator of the PyTorch script, the built-in print function of Python cannot print the variables and values on the device.
For the simulators of the Atlas A3 training products, Atlas A3 inference products, Atlas A2 training products, and Atlas A2 inference products, if the simulated blockdim exceeds the number of physical cores during running, the simulator may report an error. You can resolve this issue by configuring the core_ostd_num parameter in the pem_config_cloud.toml file. The path to the pem_config_cloud.toml file is ${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib/pem_config_cloud.toml.
```
[ARCH]
    cube_core_num           = 1
    vec_core_num            = 2
    core_ostd_num           = 2             # 2 early end  1 normal mode
```
When using the msOpProf tool for operator simulation and tuning on Ascend 950 products, you need to change the flush_level parameter in the config.json file to the info level. That is, change "flush_level": 3 to "flush_level": 2 in the file. The path of the config.json file is ${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib/config.json.
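The flush_level change can also be applied with a script. The following is a minimal Python sketch. The function name set_flush_level is hypothetical, and because this guide does not describe where the flush_level key sits inside config.json, the sketch searches the whole document recursively. Note that the file is rewritten with json.dump, so the original formatting is not preserved; back up the file first.

```python
import json


def set_flush_level(config_path, old=3, new=2):
    """Change every "flush_level": old entry in the JSON file to new.

    Returns the number of entries changed. The search is recursive because
    the nesting of config.json is not described in this guide.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    def walk(node):
        changed = 0
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "flush_level" and value == old:
                    node[key] = new  # value update only, dict size is unchanged
                    changed += 1
                else:
                    changed += walk(value)
        elif isinstance(node, list):
            for item in node:
                changed += walk(item)
        return changed

    count = walk(config)
    if count:
        with open(config_path, "w", encoding="utf-8") as f:
            json.dump(config, f, indent=4)
    return count
```

A return value of 0 means that no "flush_level": 3 entry was found and the file was left untouched.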

Startup

Configure msopprof simulator, and then perform the following steps to enable the simulation-based tuning function of the msOpProf tool. The operator tuning tool supports profile data collection and automatic parsing in a simulation environment.

NOTE

Currently, msOpProf does not support the -O0 compilation option.
The collection of MC2 and HCCL operators is not supported in the simulation environment.
The number of simulation cores set by the user cannot exceed the number of physical cores.
If you only need to focus on the performance of specific operators, invoke the TRACE_START and TRACE_STOP APIs within a single core on Atlas A3 training products, Atlas A3 inference products, Atlas inference products, Atlas A2 training products, and Atlas A2 inference products. These interfaces are described in the "Operator Debugging APIs" section of the Ascend C Operator Development API. Additionally, add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the method for adding -DASCENDC_TRACE_ON. Only after this can pipeline chart information for the specified range be generated. For details on the pipeline chart content, see Instruction Pipeline Chart.
Add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the following sample project. For the AddKernelInvocationNeo Operator Project, add the following code to the ${git_clone_path}/samples/operator/ascendc/0_introduction/3_add_kernellaunch/AddKernelInvocationNeo/cmake/npu_lib.cmake file.
```
ascendc_compile_definitions
(
    ...
    -DASCENDC_TRACE_ON
)
```

Log in to the operating environment. Use msopprof simulator to start operator simulation and tuning, combined with the optional simulation parameters and the program to be tuned (app [arguments]). For details about the optional simulation parameters, see Command Reference. You can use either of the following methods for operator simulation-based tuning:
- Based on an executable file
  - Single-operator scenario (using test as an example)
    
    NOTE
    The executable file name test in the example is for demonstration only. Use the actual name of the executable file generated by compilation in the current project.
```
msprof op simulator --soc-version=Ascendxxxyy --output=./output_data ./test # xxxyy indicates the type of the processor used by the user.
```
  - Multi-operator scenario
    
    If the test executable contains Add, MatMul, and Sub operators, you can use --launch-count and --kernel-name to specify collecting data for the Add and Sub operators only.
```
msprof op simulator --soc-version=Ascendxxxyy --launch-count=10 --kernel-name="Add|Sub" --output=./output_data ./test # xxxyy indicates the type of the processor used by the user. ./test must be placed at the end of the command.
```
- Based on a JSON configuration file of the input operator binary file *.o
  
  NOTE
  When using --config, you can specify the simulator type only via the LD_LIBRARY_PATH environment variable. The --soc-version parameter is not supported.
```
export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH # xxxyy indicates the type of the processor used by the user.
msprof op simulator --config=./add_test.json --output=./output_data
```
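
The executable-based commands above can also be assembled programmatically, which makes it easier to check option values before a long simulation is started. The following is a minimal Python sketch. The function name build_simulator_cmd and its keyword arguments are hypothetical; only the option names and value ranges are taken from Command Reference. Allowing | and * in the --kernel-name check is an interpretation of the documented separator and wildcard.

```python
import re

# Letters, digits, underscores, plus the documented "|" separator and "*" wildcard.
_NAME_RE = re.compile(r"^[A-Za-z0-9_*|]+$")


def build_simulator_cmd(app, app_args=(), soc_version=None, output=None,
                        kernel_name=None, launch_count=None,
                        core_ids=None, timeout=None):
    """Return an argv list for `msprof op simulator`, checking documented ranges."""
    cmd = ["msprof", "op", "simulator"]
    if soc_version is not None:
        cmd.append(f"--soc-version={soc_version}")
    if output is not None:
        cmd.append(f"--output={output}")
    if kernel_name is not None:
        if len(kernel_name) > 1024 or not _NAME_RE.match(kernel_name):
            raise ValueError("invalid --kernel-name value")
        cmd.append(f"--kernel-name={kernel_name}")
    if launch_count is not None:
        if not 1 <= launch_count <= 5000:
            raise ValueError("--launch-count must be in [1, 5000]")
        cmd.append(f"--launch-count={launch_count}")
    if core_ids:
        if any(not 0 <= c <= 49 for c in core_ids):
            raise ValueError("--core-id values must be in [0, 49]")
        cmd.append("--core-id=" + "|".join(str(c) for c in core_ids))
    if timeout is not None:
        if not 1 <= timeout <= 2880:
            raise ValueError("--timeout must be in [1, 2880] minutes")
        cmd.append(f"--timeout={timeout}")
    # The program to be tuned and its own arguments always go last.
    cmd.append(app)
    cmd.extend(app_args)
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Because no shell is involved, the vertical bars in --kernel-name and --core-id do not need quoting.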

After the command is executed, a folder named OPPROF_{timestamp}_XXX is generated in the specified --output directory. An example of the folder structure is as follows:

Collecting data of a single operator

OPPROF_{timestamp}_XXX
├── dump
└── simulator
    ├── core0.veccore0 // Stores data files for each core in directories named "core*.veccore*" or "core*.cubecore*".
    │   ├── core0.veccore0_code_exe.csv
    │   ├── core0.veccore0_instr_exe.csv
    │   └── trace.json     // Simulation instruction pipeline chart file of this core.
    ├── core0.veccore1
    │   ├── core0.veccore1_code_exe.csv
    │   ├── core0.veccore1_instr_exe.csv
    │   └── trace.json
    ├── core1.veccore0
    │   ├── core1.veccore0_code_exe.csv
    │   ├── core1.veccore0_instr_exe.csv
    │   └── trace.json
    ├── ... 
    ├── visualize_data.bin 
    └── trace.json // Simulation instruction pipeline chart file for all cores.

Collecting data of multiple operators

OPPROF_{timestamp}_XXX
├── OpName1               // "OpName1" is the name of the operator to be collected.
│   ├── 0                 // Sequence in which the operator is scheduled.
│   │   ├── dump          // Folder storing intermediate files, which functions in the same way as in single-operator collection.
│   │   └── simulator     // The content is the same as that in the single-operator simulator scenario, but the .csv files in the simulator folder have timestamp suffixes added, for example, core*_code_exe_20240429111143146.csv.
│   ├── 1
│   │   ├── dump
│   │   └── simulator
│   └── dump              // Folder storing intermediate files.
└── OpName2
    ├── 0
    │   ├── dump
    │   └── simulator
    └── dump

Table 2 msopprof simulator files

Name		Description
dump folder		Folder for storing dump data generated by the simulation.
simulator folder (storing analysis results of dump data files)	core*_code_exe.csv	Code line time consumption. The asterisk (*) represents cores 0 to n, allowing for quick identification of the most time-consuming sections of the code. For details, see Code Line Time Consumption Data Files.
	core*_instr_exe.csv	Records detailed code instruction information. The asterisk (*) represents cores 0 to n, allowing for quick identification of the most time-consuming instructions. For details, see Code Instruction Information Files.
	visualize_data.bin	Visualization file for information such as the simulation pipeline chart and simulation hot spot functions.
	trace.json	Simulation instruction pipeline chart file, including sub-files for each core and a summary file for all cores.

After the visualize_data.bin file is imported to MindStudio Insight, the instruction pipeline chart, operator code hot spot map, and memory channel throughput waveform chart are displayed.
After the trace.json file is imported to the Chrome browser or MindStudio Insight, the instruction pipeline chart and memory channel throughput waveform chart are displayed.
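
To locate the per-core result files in the folder structure shown above, a small script can walk the simulator folder. The following is a minimal Python sketch. The function name list_core_outputs is hypothetical, and the glob patterns are derived only from the example directory and file names in this section (core*.veccore*, core*.cubecore*, and the optional timestamp suffix on .csv files).

```python
from pathlib import Path


def list_core_outputs(simulator_dir):
    """Map each per-core directory name to the result files found in it."""
    root = Path(simulator_dir)
    result = {}
    # Matches directory names such as core0.veccore0 or core0.cubecore0.
    for core_dir in sorted(root.glob("core*.*core*")):
        if not core_dir.is_dir():
            continue
        result[core_dir.name] = {
            # Expected location of the per-core pipeline file; it may not exist.
            "trace": core_dir / "trace.json",
            # The patterns also cover timestamp-suffixed names from multi-operator runs.
            "code_csv": sorted(core_dir.glob("*_code_exe*.csv")),
            "instr_csv": sorted(core_dir.glob("*_instr_exe*.csv")),
        }
    return result
```

The summary trace.json and visualize_data.bin files sit directly in the simulator folder and are intentionally not part of the returned mapping.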

Instruction Pipeline Chart

Description

Visualizes the visualize_data.bin or trace.json files generated by msopprof simulator. The instruction pipeline chart displays the timing relationships between instructions and associates them with the call stack, helping you quickly locate bottlenecks.

Precautions

For detailed MindStudio Insight operations and field explanations, see Timeline in MindStudio Insight Operator Tuning.
If the -g compilation option is added, the generated binary file contains debugging information. You are advised to restrict access to user programs with debugging information to authorized personnel only.
If the functions provided by the llvm-symbolizer component are not used, do not include -g when compiling the program that is input to msOpProf. In this case, msOpProf does not call the functions of the llvm-symbolizer component.
If you only need to focus on the performance of specific operators, invoke the TRACE_START and TRACE_STOP APIs within a single core on Atlas A3 training products, Atlas A3 inference products, Atlas inference products, Atlas A2 training products, and Atlas A2 inference products. These interfaces are described in the "Operator Debugging APIs" section of the Ascend C Operator Development API. Additionally, add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the method for adding -DASCENDC_TRACE_ON. Only after this can pipeline chart information for the specified range be generated.

Usage Instructions

The trace.json file can be visualized using either the Chrome browser or MindStudio Insight, while the visualize_data.bin file can be visualized only using MindStudio Insight.

Chrome

Enter chrome://tracing in the address box, drag and drop the instruction pipeline chart file (trace.json) generated by msopprof simulator into the blank area to view the file. Use the keyboard shortcuts to navigate: W (zoom in), S (zoom out), A (pan left), and D (pan right). For details about the key fields, see Table 1 Key fields.

Table 1 Key fields

Field	Description
VECTOR	Vector unit.
SCALAR	Scalar unit.
Cube	Cube unit.
MTE1	Pipeline of data transfer from L1 to {L0A/L0B, UBUF}.
MTE2	Pipeline of data transfer from {DDR/GM, L2} to {L1, L0A/B, UBUF}.
MTE3	Pipeline of data transfer from UBUF to {DDR/GM, L2, L1} or L1 to {DDR/L2}.
FIXP	Pipeline of data transfer from FIXPIPE L0C to OUT/L1 (displayed only for Atlas A2 training products and Atlas A2 inference products).
FLOWCTRL	Control flow instruction.
CACHEMISS	iCache miss.
USEMASK	Custom instrumentation range. If there are nested ranges within the same USEMASK, or if there is only `TRACE_START` but no `TRACE_STOP`, the instruction pipeline chart cannot be drawn correctly.
ALL	Indicates that instructions in this channel are executed in all channels.
PUSHQ	VF/SMIT_VF instructions.
RVECLP	Vector register LOOP instructions.
RVECSU	Vector register ASU instructions, including jumps and scalar data processing.
RVECLD	Vector register LOAD instructions.
RVECEX	Vector register EXECUTE instructions.
RVECST	Vector register SET instructions.
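
Because trace.json can be opened in chrome://tracing, it can also be inspected with a script. The following is a minimal Python sketch that sums the durations of complete events per track. It assumes that the file follows the Chrome Trace Event Format (complete events with "ph": "X" and a "dur" field, and optional "thread_name" metadata events); whether the pipeline names in the table above are stored as thread names in msopprof's output is an assumption, and the function name busy_time_per_track is hypothetical.

```python
import json
from collections import defaultdict


def busy_time_per_track(trace_path):
    """Sum the durations of complete ("X") events per (pid, track) in a trace file."""
    with open(trace_path, encoding="utf-8") as f:
        data = json.load(f)
    # The Trace Event Format allows a bare event array or {"traceEvents": [...]}.
    events = data["traceEvents"] if isinstance(data, dict) else data

    names = {}
    totals = defaultdict(float)
    for ev in events:
        key = (ev.get("pid"), ev.get("tid"))
        if ev.get("ph") == "M" and ev.get("name") == "thread_name":
            names[key] = ev.get("args", {}).get("name")
        elif ev.get("ph") == "X":
            totals[key] += ev.get("dur", 0)
    # Label each track by its thread name, or by the raw tid if none was recorded.
    return {(pid, names.get((pid, tid), tid)): total
            for (pid, tid), total in totals.items()}
```

Comparing the totals of different tracks gives a rough view of which pipeline is busiest. It does not replace the dependency and call stack information shown in the chart.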

MindStudio Insight

Visualizes the generated trace.json or visualize_data.bin files.

MindStudio Insight provides a timeline view of instruction execution on Ascend AI Processors. You can identify the timing optimization opportunities of micro instructions by analyzing the instruction details, execution times, call stacks of the code associated with the instruction, and synchronization lines between instructions and pipelines. By observing pipeline arrangements on the timeline, you can identify potential performance issues during operator execution, such as ineffective parallelization between instructions.

Figure 1 Timeline page
- Shows the execution duration of each instruction within each pipeline and the instruction dependencies across different pipelines, helping you to identify potential performance optimization opportunities of pipelines.
- Associates pipeline instruction information with code to guide you through code-based pipeline layout optimization.
- Displays the data transfer volume for instructions related to GM in the selected details.

Operator Code Hot Spot Map

Description

Visualizes the visualize_data.bin files generated by msopprof simulator. On the page, you can view the mapping between operator source code and instructions, as well as the time consumption. This helps developers identify hot spot code distribution and analyze the feasibility of hot spot function optimization.

Precautions

For detailed MindStudio Insight operations and field explanations, see Source in MindStudio Insight Operator Tuning.
If the -g compilation option is added, the generated binary file contains debugging information. You are advised to restrict access to user programs with debugging information to authorized personnel only.
The operator program must be compiled with the -g option. Otherwise, msOpProf will not display the hot spot map and will not call the relevant functions of the llvm-symbolizer component to implement code-to-PC mapping.
Operator code hot spot maps cannot be generated for MC2 or LCCL operators.
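
Before collecting data, it can be useful to check whether a binary appears to have been built with -g. The sketch below is a heuristic only and is not part of msOpProf: it assumes the operator binary is an ELF file and simply looks for the names of common DWARF sections in its bytes. It cannot tell host-side debug information from device-side debug information, and the function names are hypothetical.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"
# Section names that compilers typically emit when -g is used.
DEBUG_SECTION_NAMES = (b".debug_info", b".debug_line")


def has_debug_sections(data: bytes) -> bool:
    """Heuristically report whether ELF bytes mention DWARF debug sections."""
    if data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    return any(name in data for name in DEBUG_SECTION_NAMES)


def file_has_debug_sections(path: str) -> bool:
    """Convenience wrapper that reads the file at path and applies the check."""
    return has_debug_sections(Path(path).read_bytes())
```

For example, calling file_has_debug_sections on an operator .o file gives a quick indication of whether the hot spot map is likely to be available. A False result suggests rebuilding with -g.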

Usage Instructions

The following figure shows the operator code hot spot map.

Figure 1 msopprof simulator source code page

On the top of the page, you can switch between compute units and kernel function files.
The left pane displays the time consumed by each line of code of the operator kernel, register usage, read and write conflicts of vector instructions on the UB Bank, Vector unit usage, and GM-related data transfer along with the number of corresponding instructions, helping developers quickly locate bottlenecks.
The right pane displays the time consumed by each instruction, register usage, GM-related data transfer, read and write conflicts of vector instructions on the UB Bank, Vector unit usage, execution counts, and code associations, helping developers further analyze the cause of long code execution times.
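
The relationship between the two panes can be illustrated with a short sketch: the per-line figures on the left are, conceptually, the per-instruction figures on the right grouped by source line. The Python example below uses hypothetical records of the form (PC address, source line, cycles) and treats cycles as the total cycles attributed to each instruction. The actual layout of visualize_data.bin is not described here, so the record shape is an assumption.

```python
from collections import defaultdict

# Hypothetical per-instruction records: (pc, source_line, cycles).
INSTRUCTIONS = [
    (0x1000, 12, 40),
    (0x1004, 12, 10),
    (0x1008, 15, 300),
    (0x100C, 18, 25),
]


def cycles_per_line(instructions):
    """Group instruction cycles by the source line they map to."""
    totals = defaultdict(int)
    for _pc, line, cycles in instructions:
        totals[line] += cycles
    return dict(totals)


def hottest_lines(instructions, top=3):
    """Return (line, cycles, share of all cycles) for the most expensive lines."""
    totals = cycles_per_line(instructions)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(line, cyc, cyc / grand_total) for line, cyc in ranked[:top]]


if __name__ == "__main__":
    for line, cyc, share in hottest_lines(INSTRUCTIONS):
        print(f"line {line}: {cyc} cycles ({share:.0%})")
```

In this sample, line 15 accounts for most of the cycles, so it would be the first place to inspect in the right pane.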

NOTE

The maximum number of general-purpose registers is 32. When the number of used registers reaches 32, the simulation can be performed only after the registers in use are released.
Register usage for certain operators using the TRACE_START and TRACE_STOP APIs cannot be displayed.
"NA" is displayed if no GM-related unit is involved when Process Bytes is checked.

For details about the features supported by msopprof simulator, see Table 1 msopprof simulator hot spot map features.

Table 1 msopprof simulator hot spot map features

Column	Atlas A2 training products/Atlas A2 inference products	Atlas A3 training products/Atlas A3 inference products	Atlas inference products	Ascend 950 products	Description
Source Code	Supported	Supported	Supported	Supported	-
Instruction PC Address	Supported	Supported	Supported	Supported	-
Pipeline	Supported	Supported	Supported	Supported	-
Execution Cycles	Supported	Supported	Supported	Supported	Execution time (cycles) of operator source code and instructions.
Execution Count	Supported	Supported	Supported	Supported	Execution count of operator source code and instructions.
GPR Count	Supported	Supported	Supported	Not supported	Register usage. Register usage for certain operators using the TRACE_START and TRACE_STOP APIs cannot be displayed.
UB Bank Conflict	Supported	Supported	Supported	Not supported	-
Vector Unit Utilization	Supported	Supported	Supported	Not supported	-
Process Bytes	Supported	Supported	Not supported	Not supported	GM-related data transfer volume.
Stall_Cycles (NOP Stall)	Not supported	Not supported	Not supported	Supported	Ratio chart comparing expected stalls with actual stalls. A stall refers to the waiting time incurred during instruction execution due to resource conflicts, data dependencies, or other reasons.

Memory Channel Throughput Waveform Chart

Description

Visualizes the visualize_data.bin files generated by msopprof simulator. On the page, you can view the statistical analysis of the memory bandwidth of the operator MTE log channel over time, helping you identify the bandwidth usage of the operator during different operator stages and evaluate the feasibility of bandwidth optimization.

Precautions

For detailed MindStudio Insight operations and field explanations, see Timeline in MindStudio Insight Operator Tuning.
Memory channel throughput waveform charts can only be displayed for Atlas A2 training products, Atlas A2 inference products, Atlas A3 training products, and Atlas A3 inference products.
This feature is disabled by default. The --core-id setting has no effect on this feature.

Usage Instructions

The following figure shows the memory channel throughput waveform chart.

Figure 1 Memory channel throughput waveform chart

Displays the data throughput (in MB/s) for various types of memory channels (currently limited to GM_TO_L1, GM_TO_TOTAL, GM_TO_UB, L1_TO_GM, TOTAL_TO_GM, and UB_TO_GM). For example, GM_TO_UB represents the throughput from GM to UB, while GM_TO_TOTAL represents the throughput from GM to each memory unit.
By combining this chart with MTE-related instructions, you can observe the throughput while those instructions execute, which helps identify operator performance issues.
NOTE
- Throughput is calculated from the data transferred when the multiple requests issued by a specific instruction complete.
- The throughput waveform may appear anywhere within the time range between the start and end of an instruction (inclusive). For example, for an instruction that runs from 1 µs to 3 µs, the throughput data might be distributed across three bars covering the 1 to 2 µs, 2 to 3 µs, and 3 to 4 µs intervals.
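
The inclusive-range behavior in the note above can be reproduced with a short sketch. The Python example below assumes 1 µs intervals (inferred from the example in the note, not a documented setting) and assumes that MB means 10^6 bytes. The function names are hypothetical.

```python
import math


def covered_buckets(start_us, end_us, bucket_us=1.0):
    """List the intervals that may hold throughput data for one instruction.

    Both endpoints are inclusive, so an instruction ending exactly on an
    interval boundary also touches the interval that starts there.
    """
    first = math.floor(start_us / bucket_us)
    last = math.floor(end_us / bucket_us)
    return [(i * bucket_us, (i + 1) * bucket_us) for i in range(first, last + 1)]


def throughput_mb_per_s(num_bytes, duration_us):
    """Convert bytes moved in duration_us microseconds to MB/s.

    One byte per microsecond equals 10^6 bytes per second, that is, 1 MB/s
    under the decimal-megabyte assumption.
    """
    return num_bytes / duration_us


if __name__ == "__main__":
    print(covered_buckets(1, 3))
    print(throughput_mb_per_s(2048, 2))
```

For the instruction in the note, covered_buckets(1, 3) yields the 1 to 2 µs, 2 to 3 µs, and 3 to 4 µs intervals, which matches the three bars described.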