msOpProf Simulator Mode User Guide

Overview

MindStudio Ops Profiler (msOpProf, an operator tuning tool) is used to collect and analyze the key performance metrics of operators running on AI Processors. Based on the output profile data, you can quickly locate the hardware and software performance bottlenecks of operators, improving the efficiency of operator performance analysis.

Currently, profile data for different file formats (executable files or operator binary .o files) can be collected and automatically parsed in on-board (msopprof) and simulator (msopprof simulator) modes. This document describes how to use the msopprof simulator mode.

Features

msOpProf provides single-operator tuning capabilities, such as the instruction pipeline chart, operator code hot spot map, memory channel throughput waveform chart, and profile data files, and presents them through MindStudio Insight. For details, see Table 1 msopprof simulator mode features.

Table 1 msopprof simulator mode features

Feature Link
Instruction pipeline chart Instruction Pipeline Chart
Operator code hot spot map Operator Code Hot Spot Map
Memory channel throughput waveform chart Memory Channel Throughput Waveform Chart
Profile data files msopprof Simulator Profile Data

Scenarios

The following scenarios are supported. For details, see Collecting Profile Data of Ascend C Operators and Collecting Profile Data of MC2 Operators.

NOTE

Refer to Chip SoC Type Acquisition Method to obtain the chip type, and use it as the value of the --soc-version parameter.

  • Kernel launch operator development: kernel launch

    • For details about the kernel launch scenario, see Kernel Launch Operator Development in the Ascend C Operator Development Guide.

    • In the kernel launch scenario, configure the prerequisites and then run the following command:

      msprof op simulator --soc-version=Ascendxxxyy ./main # main is the user program that contains the operator to be tuned. xxxyy indicates the type of the processor used by the user.
      
    • If you need to perform simulation-based tuning on an operator that runs on the board without recompilation, perform the following steps:

      • Create a soft link named libruntime.so pointing to libruntime_camodel.so in any directory.

        ln -s /{simulator_path}/lib/libruntime_camodel.so /{so_path}/libruntime.so  
         # For example, if the CANN package is installed in the default path of the root user, simulator_path is /usr/local/Ascend/cann/tools/simulator/Ascendxxxyy.
        
      • Add the parent directory of the created soft link to the environment variable LD_LIBRARY_PATH.

        export LD_LIBRARY_PATH={so_path}:$LD_LIBRARY_PATH
        
  • Project-based operator development: single-operator API calling

    • For details about the single-operator API execution scenario, see Project-based Operator Development > Single-Operator API Execution in the Ascend C Operator Development Guide.

    • In the single-operator API execution scenario, configure the prerequisites and then run the following command:

      msprof op simulator --soc-version=Ascendxxxyy ./main # main is the user program that contains the operator to be tuned. xxxyy indicates the type of the processor used by the user.
      
  • AI framework operator adaptation: PyTorch framework

    • When msOpProf is used for simulation-based tuning of the operators in a PyTorch script on Atlas inference products, only the Kernels-based operator package calling mode is supported. Install the binary Kernels operator package by referring to the Kernels operator package installation content in Installing CANN of the CANN Software Installation Guide. Then modify the script entry file (such as main.py) by adding the torch_npu.npu.set_compile_mode(jit_compile=False) line below import torch_npu, as shown in the following example, to ensure that the operators in the Kernels operator package are used.

      import torch
      import torch_npu
      torch_npu.npu.set_compile_mode(jit_compile=False)
      ......
      
    • For details about the single-operator calling scenario through the PyTorch framework, see OpPlugin in Ascend-developed Plugins of the Ascend Extension for PyTorch Suite and Third-party Library Support List.

    • When the PyTorch framework is used to call a single-operator, configure the prerequisites and then run the following command:

      msprof op simulator --soc-version=Ascendxxxyy python a.py   # a.py is the user script that calls the operator to be tuned. xxxyy indicates the type of the processor used by the user.
      
  • Triton operator development: Triton operator calling

    • Install and configure Triton and the Triton-Ascend plug-in. For details, see Triton Ascend.
    • The Triton operator calling scenario does not apply to Atlas inference products.
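
The soft-link steps described for the kernel launch scenario (simulating an on-board build without recompilation) can also be scripted. The following is a minimal sketch, assuming Python 3. The helper name link_camodel_runtime is hypothetical and is not part of msOpProf; its two arguments stand in for {simulator_path} and {so_path} in the commands above.

```python
import os
from pathlib import Path


def link_camodel_runtime(simulator_path, so_path, env=None):
    """Create libruntime.so -> libruntime_camodel.so and prepend so_path to LD_LIBRARY_PATH.

    Illustrative helper only: simulator_path and so_path play the role of
    {simulator_path} and {so_path} in the shell commands above.
    """
    env = dict(os.environ if env is None else env)
    target = Path(simulator_path) / "lib" / "libruntime_camodel.so"
    link = Path(so_path) / "libruntime.so"
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link left by an earlier run
    link.symlink_to(target)  # same effect as: ln -s <target> <link>
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{so_path}:{old}" if old else str(so_path)
    return link, env
```

The returned environment mapping can be passed to subprocess.run(..., env=env) when launching the operator program, so that the current shell environment is left unchanged.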

Preparations

Preparing the environment

Configure related environment variables by referring to the MindStudio Ops Profiler Installation Guide.

  • To use MindStudio Insight for viewing, install the MindStudio Insight software package separately. For download links, see the MindStudio Insight Installation Guide.

  • For Atlas A2 training products/Atlas A2 inference products, if you want to use the template library for simulation, add the --simulator option to the compilation script to compile the operator in simulator mode. For details, see this link.

    bash scripts/build.sh --simulator 00_basic_matmul
    

Constraints

  • You are advised to collect profile data within 5 minutes and ensure that the set memory size is greater than 20 GB (for example, container configuration docker run --memory=20g container_name).
  • Ensure that the profile data is stored in the current user directory that does not contain soft links. Otherwise, security issues may occur.
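
The soft-link constraint above can be checked before a collection run starts. The following is a minimal sketch, assuming Python 3. The helper find_symlink_component is hypothetical and only illustrates one way to detect a symbolic link anywhere along an output path; it does not reproduce any check that msOpProf itself performs.

```python
import os
from pathlib import Path


def find_symlink_component(path):
    """Return the outermost component of path that is a symbolic link, or None.

    The path is made absolute without resolving links, then every component
    from the root down to the path itself is inspected.
    """
    current = Path(os.path.abspath(path))
    for candidate in reversed([current, *current.parents]):
        if candidate.is_symlink():
            return candidate
    return None
```

A return value of None means that no component of the path is a soft link; any other value names the component to replace with a real directory.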

Precautions

  • msOpProf depends on the msopprof executable file in the CANN package. The API usage in this file is the same as that in msopprof. This file is provided by the CANN package and does not need to be installed separately.
  • After you press Ctrl+C, the operator execution stops, and the tool generates a profile data file based on existing information. If you do not need to generate the file, press Ctrl+C again.
  • If the --output option is not specified, ensure that other users do not have the write permission on the upper-level directory of the current path.
  • Before using msopprof simulator, ensure that the application functions properly.
  • Do not initiate more than one profile data collection task on the same device.
  • The simulation result of msopprof simulator in the document is for reference only. The actual running status of the operator is subject to the actual simulation data.
  • You need to ensure the execution security of executable files or applications.
    • You are advised to restrict the operation permission on executable files or applications to avoid privilege escalation risks.
    • Avoid high-risk operations (such as deleting files, deleting directories, changing passwords, and running privilege escalation commands) to prevent security risks.

Command Reference

Log in to the operating environment and use msopprof simulator to enable the operator simulation and tuning function. Specify the optional simulation parameters first, followed by the program to be tuned and its own arguments (blockdim 1 in the example below). For details about the optional simulation parameters, see Table 1 Optional msopprof simulator parameters. Refer to Chip SoC Type Acquisition Method to obtain the chip type, and use it as the value of the --soc-version parameter. An example command is as follows:

msprof op simulator --soc-version=Ascendxxxyy --output=/home/projects/output /home/projects/MyApp/out/main blockdim 1   # --output is an optional parameter, /home/projects/MyApp/out/main is the used app, blockdim 1 is an optional parameter of the user application, and xxxyy is the type of the processor used by the user.
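
When the command is assembled by a script, it helps to keep the tool options ahead of the application path. The following is a minimal sketch, assuming Python 3.8 or later. The helper build_simulator_command is hypothetical; it only uses the options and paths from the example above, and Ascendxxxyy remains a placeholder for the real processor type.

```python
import shlex


def build_simulator_command(soc_version, app, app_args=(), **options):
    """Assemble an msprof op simulator command line as an argument list.

    Tool options are placed before the application, and the application's own
    arguments after it. Keyword names map to option names by replacing
    underscores with hyphens (output -> --output).
    """
    cmd = ["msprof", "op", "simulator", f"--soc-version={soc_version}"]
    for name, value in options.items():
        cmd.append(f"--{name.replace('_', '-')}={value}")
    cmd.append(app)
    cmd.extend(str(arg) for arg in app_args)
    return cmd


cmd = build_simulator_command(
    "Ascendxxxyy",  # placeholder; substitute the real SoC type
    "/home/projects/MyApp/out/main",
    app_args=["blockdim", "1"],
    output="/home/projects/output",
)
print(shlex.join(cmd))
```

Returning a list rather than a single string allows the result to be passed directly to subprocess.run without shell quoting concerns.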

Table 1 Optional msopprof simulator parameters

Optional Parameter

Description

Mandatory

--application

Specifies the executable file to profile. You are advised to use msprof op simulator --soc-version=Ascendxxxyy [msopprof simulator parameters] ./app, where xxxyy indicates the processor type and ./app is a user-specified executable file path. If no path is provided, the current directory is used.

When using ./app, add msopprof simulator parameters before ./app to ensure that the related functions take effect.

Currently, the --application=./app [arguments] form is still accepted for compatibility. In the future, only ./app [arguments] will be supported.

Yes. Choose one of --application, --config, or --export.

--config

Specifies the absolute or relative path of the JSON configuration file that describes the binary file *.o generated after operator compilation. For details, see JSON Configuration File Description.

Before operator tuning, you can obtain the operator binary *.o file in either of the following ways:

  • Refer to Modifying and Executing One-Click Compilation and Execution Script in Kernel Launch Operator Development > Kernel Launch of the Ascend C Operator Development Guide to obtain the NPU executable file, and then manually extract the .o file from the executable file.
  • Refer to Operator Compilation and Deployment. The .o file is automatically generated during operator compilation.

Ensure that users in the group and other groups do not have the write permission on the JSON file specified by --config and its parent directory. In addition, ensure that the owner of the parent directory of the JSON file is the current user.

You need to use the LD_LIBRARY_PATH environment variable to set the simulator type.
export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH # xxxyy indicates the type of the processor used by the user.

--export

Specifies a folder containing single-operator simulation results, which will be directly parsed for MindStudio Insight to display the single-core or multi-core instruction pipeline chart of a single operator.

Note:

  • The specified folder should only contain multi-core data and operator kernel file aicore_binary.o. Rename the binary file specified in --config (*.o) to aicore_binary.o.
  • If only dump files are provided, code line mapping cannot be generated in the instruction pipeline chart. To view code lines, store the operator kernel file aicore_binary.o in the dump folder.
  • Ensure that users in the group and other groups do not have the write permission on the directory specified by --export and all files in the directory. In addition, ensure that the owner of the specified directory is the current user.

--kernel-name

Specifies the operator name to collect. Fuzzy matching using operator name prefixes is supported. If this option is not specified, only data of the first operator scheduled during program running is collected.

Note:

  • This option must be used with --application. The value can contain a maximum of 1,024 characters, restricted to letters, digits, and underscores (_).
  • If multiple operators need to be collected, use vertical bars (|) to combine them. For example, --kernel-name="add|abs" indicates that operators whose prefixes are add and abs are collected. The number of operators collected is determined by the value of --launch-count.
  • Wildcards (*) can be used to match strings of any length.

No

--launch-count

Sets the maximum number of operators that can be collected. The default value is 1, and the value is an integer ranging from 1 to 5000.

No

--aic-metrics

Enables operator performance metric collection. The following performance metrics can be collected.
  • PipeUtilization (collected by default): computing and transfer instruction pipelines.

    When --aic-metrics=PipeUtilization is configured, ResourceConflictRatio is disabled. That is, only the instruction pipeline is displayed, and the details of synchronization event instructions are not included.

  • ResourceConflictRatio (collected by default): displays details about synchronization event instructions.
    • For Atlas A3 training products, Atlas A3 inference products, Atlas A2 training products, and Atlas A2 inference products, SET_FLAG and WAIT_FLAG instructions are displayed.
    • For Atlas inference products, set_event and wait_event instructions are displayed.
  • PMSampling: enables and visualizes the memory channel throughput waveform, for example, --aic-metrics=PMSampling. For details, see Memory Channel Throughput Waveform Chart.
    • --core-id does not take effect for the PMSampling parameter. PMSampling parses all cores.
    • This feature is disabled by default.

No

--core-id

This parameter is used when the operators are evenly distributed. You can use --core-id to specify the IDs of some logical cores to parse their simulation data.

The core ID range is [0,49].

If the simulation data of multiple cores needs to be parsed, use vertical bars (|) to combine them. For example, --core-id="0|31" parses simulation data of cores whose IDs are 0 and 31.

--core-id does not take effect for the PMSampling parameter. PMSampling parses all cores.

No

--timeout

This parameter is applicable to operators with a large amount of data and repetitive computation. Running such operators to completion takes significant time, but partial pipeline data provides sufficient information. Set --timeout to reduce running duration and capture necessary pipeline information. The implementation is as follows:

  • When simulation duration reaches the --timeout value, msOpProf terminates the simulation process and begins parsing. Only the simulated data is analyzed. At the same time, msOpProf displays:
    [INFO]  The timeout has reached and the application will be forcibly killed.
  • If the process completes normally before reaching the timeout, the simulation ends and parsing proceeds.

The value is an integer ranging from 1 to 2880, in minutes. An example is as follows:

msprof op simulator --soc-version=Ascendxxxyy --timeout=1 ./add_custom # xxxyy indicates the type of the processor used by the user.

No

--mstx

Determines whether the operator tuning tool enables the mstx APIs used in the user code program.

The default value is off, indicating that the mstx APIs are disabled.

When --mstx=on is set, the operator tuning tool enables the mstx API used in the user program.

For example:

msprof op simulator --soc-version=Ascendxxxyy --mstx=on ./add_custom # xxxyy indicates the type of the processor used by the user.

The mstxRangeStartA and mstxRangeEnd interfaces in the mstx API are supported, allowing for the enabling of operator tuning in specified ranges. For details about parameters, see the mstxRangeStartA and mstxRangeEnd interfaces in the MindStudio mstx API Reference.

No

--mstx-include

Enables the specified mstx APIs in msOpProf.

If this parameter is not set, all mstx APIs used in user code are enabled by default.

If this parameter is set, only the specified mstx APIs are enabled. The input of --mstx-include is the message character string transferred when the user calls the mstx function. Multiple character strings must be separated by vertical bars (|).

For example:

--mstx=on --mstx-include="hello|hi" # This enables only mstx APIs where the message parameter is "hello" or "hi".

This parameter must be used with --mstx.

The message can only contain letters, digits, and underscores (_). Use vertical bars (|) to combine multiple messages.

No

--soc-version

Use this parameter or the LD_LIBRARY_PATH environment variable to specify the simulator type. The details are as follows:

  • --soc-version: specifies the simulator type in --application and --export modes. For details about the value range, see the simulator type in the ${INSTALL_DIR}/tools/simulator directory.
  • LD_LIBRARY_PATH environment variable: specifies the simulator type in --config mode or when --soc-version is not used.
    export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH 

    Replace ${INSTALL_DIR} with the file storage path after the CANN software is installed. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.

No

--output

Specifies the path for storing the collected performance data, which defaults to the current directory.

Ensure that users in the group and other groups do not have the write permission on the parent directory of the path specified by --output. In addition, ensure that the owner of the parent directory of the directory specified by --output is the current user.

No

--dump

Specifies whether to generate the dump file of the simulator.

The value can be on or off. The default value is off, indicating that the simulator dump file is not generated.

Note:

  • This parameter is valid only for Atlas A2 training products, Atlas A2 inference products, Atlas A3 training products, and Atlas A3 inference products. For Atlas inference products, this parameter does not take effect. The dump files are saved to drives as usual.
  • This parameter applies only to the single-process scenario and does not support the scenario where two operators run at the same time.

No

-h, --help

Outputs help information.

No
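
The value ranges listed in the table can be checked before a long simulation is launched. The following is a minimal sketch, assuming Python 3. The helper validate_options is hypothetical and is not part of msOpProf; it encodes only the limits stated above (name characters and length, --launch-count from 1 to 5000, --core-id in [0,49], --timeout from 1 to 2880 minutes), and the tool itself remains the authority on what it accepts.

```python
import re

# Letters, digits, and underscores, plus the documented wildcard (*).
_NAME_PART = re.compile(r"[A-Za-z0-9_*]+")


def validate_options(kernel_name=None, launch_count=1, core_ids=(), timeout=None):
    """Return a list of problems found in the option values (empty if none)."""
    problems = []
    if kernel_name is not None:
        if len(kernel_name) > 1024:
            problems.append("--kernel-name is longer than 1,024 characters")
        # Vertical bars separate the individual operator name prefixes.
        for part in kernel_name.split("|"):
            if not _NAME_PART.fullmatch(part):
                problems.append(f"--kernel-name part {part!r} contains unsupported characters")
    if not 1 <= launch_count <= 5000:
        problems.append("--launch-count must be an integer from 1 to 5000")
    for core in core_ids:
        if not 0 <= core <= 49:
            problems.append(f"--core-id {core} is outside [0,49]")
    if timeout is not None and not 1 <= timeout <= 2880:
        problems.append("--timeout must be an integer from 1 to 2880 (minutes)")
    return problems
```

For example, validate_options(kernel_name="add|abs", launch_count=10, core_ids=[0, 31], timeout=1) returns an empty list, whereas a core ID of 50 or a timeout of 0 produces one entry each.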

Usage

msOpProf assists in identifying exceptions in the operator memory, code, and instructions, enabling comprehensive operator tuning. For details about the usage, see Table 1 msopprof simulator functions.

Table 1 msopprof simulator functions

Scenario

Usage

Displayed Graphs

It is applicable to the development and debugging phases for detailed simulation tuning, allowing you to analyze operator instructions and code hotspots.

Configure environment variables (such as LD_LIBRARY_PATH) and compilation options (such as -g to generate debugging information) as detailed in msopprof simulator configuration. This enables detailed analysis of operator behavior in a simulated environment.

  • Instruction Pipeline Chart
  • Operator Code Hot Spot Map
  • Memory Channel Throughput Waveform Chart

msopprof simulator configuration

NOTE

The simulation function of the msOpProf tool only supports single-device scenarios and cannot simulate multi-device environments. Refer to Chip SoC Type Acquisition Method to obtain the chip type, and use it as the value of the --soc-version parameter.

  • Before using msOpProf to perform operator simulation-based tuning in --config mode, run the following command to configure environment variables:

    export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH 
    

    Modify the preceding environment variables based on the actual installation path of the CANN package and the AI processor type.

  • Add the -g compilation option to enable the operator code hot spot map and code call stack features.

    NOTE

    • If the -g compilation option is added, the generated binary file contains debugging information. You are advised to restrict access to user programs with debugging information to authorized personnel only.
    • If the functions provided by the llvm-symbolizer component are not used, do not include -g when compiling the program that is input to msOpProf. In this case, msOpProf does not call the functions of the llvm-symbolizer component.
    • For an operator project created by referring to the msOpGen tool, edit the CMakeLists.txt file in the op_kernel directory of the operator project. For details, see Creating an Operator Project.

      add_ops_compile_options(ALL OPTIONS -g)
      
    • For a project created by referring to the complete example, for example, the sample here, add the following code to the cmake/npu_lib.cmake file in the sample project directory.

      NOTE

      • This sample project does not support Atlas A3 training products.

      • When downloading the code sample, run the following command to specify the branch version:

        git clone https://gitee.com/ascend/samples.git -b v1.9-8.3.RC1
        
      ascendc_compile_options(ascendc_kernels_${RUN_MODE} PRIVATE
      -g
      -O2
      )
      
      • For Triton operators, add -g by configuring the following environment variable.

        export TRITON_DISABLE_LINE_INFO=0
        
  • When msOpProf is used to perform simulation-based tuning on the operator of the PyTorch script, the built-in print function of Python cannot print the variables and values on the device.

  • For the simulators of the Atlas A3 training products, Atlas A3 inference products, Atlas A2 training products, and Atlas A2 inference products, the simulator may report an error if the simulated blockdim exceeds the number of physical cores during running. To resolve this issue, configure the core_ostd_num parameter in the pem_config_cloud.toml file as shown below. The path to the pem_config_cloud.toml file is ${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib/pem_config_cloud.toml.

    [ARCH]
        cube_core_num           = 1
        vec_core_num            = 2
        core_ostd_num           = 2             # 2 early end  1 normal mode
    
  • When using the msProf tool for operator simulation and tuning on Ascend 950 products, you need to change the flush_level parameter in the config.json file to the info level. That is, change "flush_level": 3 to "flush_level": 2 in the file. The path of the config.json file is ${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib/config.json.
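
The flush_level change described in the last item can be applied programmatically. The following is a minimal sketch, assuming Python 3. The helper set_flush_level is hypothetical, and because the position of the flush_level key inside config.json is not described above, the sketch searches for it recursively. The sample content is invented purely to exercise the helper and does not reflect the real file layout.

```python
import json


def set_flush_level(config, level=2):
    """Set every "flush_level" key found in a parsed config to level.

    Returns the number of keys that were updated.
    """
    updated = 0
    if isinstance(config, dict):
        for key, value in config.items():
            if key == "flush_level":
                config[key] = level
                updated += 1
            else:
                updated += set_flush_level(value, level)
    elif isinstance(config, list):
        for item in config:
            updated += set_flush_level(item, level)
    return updated


# Hypothetical file content, used only to exercise the helper.
sample = json.loads('{"log": {"flush_level": 3, "name": "demo"}}')
changed = set_flush_level(sample)
print(changed, json.dumps(sample))
```

To apply the change to the real file, load it with json.load, call the helper, and write the result back with json.dump. Keeping a backup copy of the original config.json is advisable.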

Startup

Configure msopprof simulator, and then perform the following steps to enable the simulation-based tuning function of the msOpProf tool. The operator tuning tool supports profile data collection and automatic parsing in a simulation environment.

NOTE

  • Currently, msOpProf does not support the -O0 compilation option.

  • The collection of MC2 and HCCL operators is not supported in the simulation environment.

  • The number of simulation cores set by the user cannot exceed the number of physical cores.

  • If you only need to focus on the performance of specific operators, invoke the TRACE_START and TRACE_STOP APIs within a single core on Atlas A3 training products, Atlas A3 inference products, Atlas inference products, Atlas A2 training products, and Atlas A2 inference products. These interfaces are described in the "Operator Debugging APIs" section of the Ascend C Operator Development API. Additionally, add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the method for adding -DASCENDC_TRACE_ON. Only after this can pipeline chart information for the specified range be generated. For details on the pipeline chart content, see Instruction Pipeline Chart.

  • Add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the following sample project. For the AddKernelInvocationNeo Operator Project, add the following code to the ${git_clone_path}/samples/operator/ascendc/0_introduction/3_add_kernellaunch/AddKernelInvocationNeo/cmake/npu_lib.cmake file.

    ascendc_compile_definitions
    (
        ...
        -DASCENDC_TRACE_ON
    )
    
  1. Log in to the operating environment. Use msopprof simulator to start operator simulation and tuning, combined with the optional simulation parameters and the program to be tuned (app [arguments]). For details about the optional simulation parameters, see Command Reference. You can use either of the following methods for operator simulation-based tuning:

    • Based on an executable file

      • Single-operator scenario (using test as an example)

        NOTE

        The executable file name test in the example is for demonstration only. Use the actual name of the executable file generated by compilation in the current project.

        msprof op simulator --soc-version=Ascendxxxyy --output=./output_data ./test # xxxyy indicates the type of the processor used by the user.
        
      • Multi-operator scenario

        If the test executable contains Add, MatMul, and Sub operators, you can use --launch-count and --kernel-name to specify collecting data for the Add and Sub operators only.

        msprof op simulator --soc-version=Ascendxxxyy --launch-count=10 --kernel-name="Add|Sub" --output=./output_data ./test # xxxyy indicates the type of the processor used by the user. ./test must be placed at the end of the command.
        
    • Based on a JSON configuration file of the input operator binary file *.o

      NOTE

      When using --config, you can specify the simulator type only through the LD_LIBRARY_PATH environment variable. The --soc-version parameter is not supported.

      export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH # xxxyy indicates the type of the processor used by the user.
      msprof op simulator --config=./add_test.json --output=./output_data
      
  2. After the command is executed, a folder named OPPROF_{timestamp}_XXX is generated in the directory specified by --output. An example of the folder structure is as follows:

    • Collecting data of a single-operator

      OPPROF_{timestamp}_XXX
      ├── dump
      └── simulator
          ├── core0.veccore0 // Stores data files for each core in directories named "core*.veccore*" or "core*.cubecore*".
          │   ├── core0.veccore0_code_exe.csv
          │   ├── core0.veccore0_instr_exe.csv
          │   └── trace.json     // Simulation instruction pipeline chart file of this core.
          ├── core0.veccore1
          │   ├── core0.veccore1_code_exe.csv
          │   ├── core0.veccore1_instr_exe.csv
          │   └── trace.json
          ├── core1.veccore0
          │   ├── core1.veccore0_code_exe.csv
          │   ├── core1.veccore0_instr_exe.csv
          │   └── trace.json
          ├── ... 
          ├── visualize_data.bin 
          └── trace.json // Simulation instruction pipeline chart file for all cores.
      
    • Collecting data of multiple operators

      OPPROF_{timestamp}_XXX
      ├── OpName1               // "OpName1" is the name of the operator to be collected.
      │   ├── 0                 // Sequence in which the operator is scheduled.
      │   │   ├── dump          // Folder storing intermediate files, which functions in the same way as in single-operator collection.
      │   │   └── simulator     // The content is the same as that in the single-operator simulator scenario, but the .csv files in the simulator folder have timestamp suffixes added, for example, core*_code_exe_20240429111143146.csv.
      │   ├── 1
      │   │   ├── dump
      │   │   └── simulator
      │   └── dump              // Folder storing intermediate files.
      └── OpName2
          ├── 0
          │   ├── dump
          │   └── simulator
          └── dump
      

    Table 2 msopprof simulator files

    Name

    Description

    dump folder

    Folder for storing dump data generated by the simulation.

    simulator folder (storing analysis results of dump data files)

    core*_code_exe.csv

    Code line time consumption. The asterisk (*) represents cores 0 to n, allowing for quick identification of the most time-consuming sections of the code. For details, see Code Line Time Consumption Data Files.

    core*_instr_exe.csv

    Records detailed code instruction information. The asterisk (*) represents cores 0 to n, allowing for quick identification of the most time-consuming instructions. For details, see Code Instruction Information Files.

    visualize_data.bin

    Visualization file for information such as the simulation pipeline chart and simulation hot spot functions.

    trace.json

    Simulation instruction pipeline chart file, including sub-files for each core and a summary file for all cores.

  3. After the visualize_data.bin file is imported to MindStudio Insight, the instruction pipeline chart, operator code hot spot map, and memory channel throughput waveform chart are displayed.

  4. After the trace.json file is imported to the Chrome browser or MindStudio Insight, the instruction pipeline chart and memory channel throughput waveform chart are displayed.
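
The per-core files in the simulator folder can be gathered with a short script before they are imported or post-processed. The following is a minimal sketch, assuming Python 3 and the directory layout shown in step 2. The helper collect_core_outputs is hypothetical and is not part of msOpProf.

```python
from pathlib import Path


def collect_core_outputs(simulator_dir):
    """Map each per-core directory name to its trace.json path and CSV files.

    Per-core directories are expected to be named like core0.veccore0 or
    core0.cubecore0, as in the folder structure shown above.
    """
    result = {}
    for core_dir in sorted(Path(simulator_dir).glob("core*.*core*")):
        if not core_dir.is_dir():
            continue
        result[core_dir.name] = {
            "trace": core_dir / "trace.json",
            "csv": sorted(core_dir.glob("*.csv")),
        }
    return result
```

Because the CSV files are matched with *.csv, the same helper also picks up the timestamp-suffixed file names produced in the multi-operator scenario.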

Instruction Pipeline Chart

Description

Visualizes the visualize_data.bin or trace.json file generated by msopprof simulator. The instruction pipeline chart displays the timing relationships between instructions and associates each instruction with its call stack, helping you quickly locate bottlenecks.

Precautions

  • For detailed MindStudio Insight operations and field explanations, see Timeline in MindStudio Insight Operator Tuning.
  • If the -g compilation option is added, the generated binary file contains debugging information. You are advised to restrict access to user programs with debugging information to authorized personnel only.
  • If the functions provided by the llvm-symbolizer component are not used, do not include -g when compiling the program that is input to msOpProf. In this case, msOpProf does not call the functions of the llvm-symbolizer component.
  • If you only need to focus on the performance of specific operators, invoke the TRACE_START and TRACE_STOP APIs within a single core on Atlas A3 training products, Atlas A3 inference products, Atlas inference products, Atlas A2 training products, and Atlas A2 inference products. These interfaces are described in the "Operator Debugging APIs" section of the Ascend C Operator Development API. Additionally, add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the method for adding -DASCENDC_TRACE_ON. Only after this can pipeline chart information for the specified range be generated.

Usage Instructions

The trace.json file can be visualized using either the Chrome browser or MindStudio Insight, while the visualize_data.bin file can be visualized only using MindStudio Insight.

  • Chrome

    Enter chrome://tracing in the address box, drag and drop the instruction pipeline chart file (trace.json) generated by msopprof simulator into the blank area to view the file. Use the keyboard shortcuts to navigate: W (zoom in), S (zoom out), A (pan left), and D (pan right). For details about the key fields, see Table 1 Key fields.

    Table 1 Key fields

    Field Description
    VECTOR Vector unit.
    SCALAR Scalar unit.
    Cube Cube unit.
    MTE1 Pipeline of data transfer from L1 to {L0A/L0B, UBUF}.
    MTE2 Pipeline of data transfer from {DDR/GM, L2} to {L1, L0A/B, UBUF}.
    MTE3 Pipeline of data transfer from UBUF to {DDR/GM, L2, L1} or L1 to {DDR/L2}.
    FIXP Pipeline of data transfer from FIXPIPE L0C to OUT/L1 (displayed only for Atlas A2 training products and Atlas A2 inference products).
    FLOWCTRL Control flow instruction.
    CACHEMISS iCache miss.
    USEMASK Custom instrumentation range. If there are nested ranges within the same USEMASK, or if there is only TRACE_START but no TRACE_STOP, the instruction pipeline chart cannot be drawn correctly.
    ALL Indicates that instructions in this channel are executed in all channels.
    PUSHQ VF/SMIT_VF instructions.
    RVECLP Vector register LOOP instructions.
    RVECSU Vector register ASU instructions, including jumps and scalar data processing.
    RVECLD Vector register LOAD instructions.
    RVECEX Vector register EXECUTE instructions.
    RVECST Vector register SET instructions.
  • MindStudio Insight

    Visualizes the generated trace.json or visualize_data.bin files.

    MindStudio Insight provides a timeline view of instruction execution on Ascend AI Processors. You can identify the timing optimization opportunities of micro instructions by analyzing the instruction details, execution times, call stacks of the code associated with the instruction, and synchronization lines between instructions and pipelines. By observing pipeline arrangements on the timeline, you can identify potential performance issues during operator execution, such as ineffective parallelization between instructions.

    Figure 1 Timeline page

    • Shows the execution duration of each instruction within each pipeline and the instruction dependencies across different pipelines, helping you to identify potential performance optimization opportunities of pipelines.
    • Associates pipeline instruction information with code to guide you through code-based pipeline layout optimization.
    • Displays the data transfer volume for instructions related to GM in the selected details.
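
The trace.json file can also be inspected outside MindStudio Insight. The following is a minimal sketch, not the tool's own implementation, that sums instruction durations per pipeline. It assumes the file follows the Chrome trace event format, with "M" metadata events named thread_name labeling each track and "X" events carrying a dur field in microseconds. Whether msopprof simulator lays out its tracks exactly this way is an assumption; the function name pipeline_busy_time and the sample events are hypothetical.

```python
import json
from collections import defaultdict


def pipeline_busy_time(trace):
    """Sum the duration of complete ("X") events per named track.

    Assumption: "M" events named "thread_name" label each (pid, tid) track,
    and "X" events carry a "dur" field in microseconds.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    names = {}
    busy_by_track = defaultdict(float)
    for ev in events:
        key = (ev.get("pid"), ev.get("tid"))
        if ev.get("ph") == "M" and ev.get("name") == "thread_name":
            names[key] = ev["args"]["name"]
        elif ev.get("ph") == "X":
            busy_by_track[key] += ev.get("dur", 0)
    # Merge tracks that share a label; unlabeled tracks keep their (pid, tid).
    busy_by_name = defaultdict(float)
    for key, total in busy_by_track.items():
        busy_by_name[names.get(key, str(key))] += total
    return dict(busy_by_name)


# Hypothetical sample data; a real file would be read with json.load().
sample = {
    "traceEvents": [
        {"ph": "M", "name": "thread_name", "pid": 0, "tid": 1, "args": {"name": "MTE2"}},
        {"ph": "M", "name": "thread_name", "pid": 0, "tid": 2, "args": {"name": "VECTOR"}},
        {"ph": "X", "name": "copy_in", "pid": 0, "tid": 1, "ts": 0.0, "dur": 2.5},
        {"ph": "X", "name": "vadd", "pid": 0, "tid": 2, "ts": 2.5, "dur": 1.0},
        {"ph": "X", "name": "copy_in", "pid": 0, "tid": 1, "ts": 3.0, "dur": 2.5},
    ]
}
busy = pipeline_busy_time(sample)
print(busy)
```

A pipeline whose busy time is far higher than the others is a reasonable first place to look for poor overlap between data transfer and computation.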

Operator Code Hot Spot Map

Description

Visualizes the visualize_data.bin files generated by msopprof simulator. On the page, you can view the mapping between operator source code and instructions, as well as the time consumption. This helps developers identify hot spot code distribution and analyze the feasibility of hot spot function optimization.

Precautions

  • For detailed MindStudio Insight operations and field explanations, see Source in MindStudio Insight Operator Tuning.
  • If the -g compilation option is added, the generated binary file contains debugging information. You are advised to restrict access to user programs with debugging information to authorized personnel only.
  • The operator program must be compiled with the -g option. Otherwise, msOpProf will not display the hot spot map and will not call the relevant functions of the llvm-symbolizer component to implement code-to-PC mapping.
  • Operator code hot spot maps cannot be generated for MC2 or LCCL operators.

Usage Instructions

The following figure shows the operator code hot spot map.

Figure 1 msopprof simulator source code page

  • On the top of the page, you can switch between compute units and kernel function files.
  • The left pane displays the time consumed by each line of code of the operator kernel, register usage, read and write conflicts of vector instructions on the UB Bank, Vector unit usage, and GM-related data transfer along with the number of corresponding instructions, helping developers quickly locate bottlenecks.
  • The right pane displays the time consumed by each instruction, register usage, GM-related data transfer, read and write conflicts of vector instructions on the UB Bank, Vector unit usage, execution counts, and code associations, helping developers further analyze the cause of long code execution times.
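
The relationship between the two panes can be sketched as a simple aggregation: instruction-level records are grouped by the source line they map to, and lines are ranked by total cycles. This is only an illustration. The layout of visualize_data.bin is not described in this guide, so the record fields (pc, line, cycles) and the function hot_lines are hypothetical, and treating a line's cost as the sum of its instructions' cycles is an assumption.

```python
from collections import defaultdict


def hot_lines(instructions, top=3):
    """Rank source lines by the total cycles of the instructions mapped to them."""
    cycles_by_line = defaultdict(int)
    for inst in instructions:
        cycles_by_line[inst["line"]] += inst["cycles"]
    ranked = sorted(cycles_by_line.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


# Hypothetical records: PC address, source line, and execution cycles.
records = [
    {"pc": 0x100, "line": 42, "cycles": 120},
    {"pc": 0x104, "line": 42, "cycles": 80},
    {"pc": 0x108, "line": 57, "cycles": 150},
    {"pc": 0x10C, "line": 60, "cycles": 10},
]
ranking = hot_lines(records)
print(ranking)
```

In this sample, line 42 ranks first even though its individual instructions are cheaper than the one on line 57, which is why the per-line view and the per-instruction view are best read together.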

NOTE

  • A maximum of 32 general-purpose registers is available. When all 32 are in use, the simulation can continue only after a register in use is released.
  • Register usage for certain operators using the TRACE_START and TRACE_STOP APIs cannot be displayed.
  • When Process Bytes is selected, "NA" is displayed for entries that do not involve any GM-related unit.
  • For details about the features supported by msopprof simulator, see Table 1 msopprof simulator hot spot map features.

    Table 1 msopprof simulator hot spot map features

    Column | Atlas A2 training products/Atlas A2 inference products | Atlas A3 training products/Atlas A3 inference products | Atlas inference products | Ascend 950 products | Description
    Source Code | Supported | Supported | Supported | Supported | -
    Instruction PC Address | Supported | Supported | Supported | Supported | -
    Pipeline | Supported | Supported | Supported | Supported | -
    Execution Cycles | Supported | Supported | Supported | Supported | Execution time (cycles) of operator source code and instructions.
    Execution Count | Supported | Supported | Supported | Supported | Execution count of operator source code and instructions.
    GPR Count | Supported | Supported | Supported | Not supported | Register usage. Register usage for certain operators using the TRACE_START and TRACE_STOP APIs cannot be displayed.
    UB Bank Conflict | Supported | Supported | Supported | Not supported | -
    Vector Unit Utilization | Supported | Supported | Supported | Not supported | -
    Process Bytes | Supported | Supported | Not supported | Not supported | GM-related data transfer volume.
    Stall_Cycles (NOP Stall) | Not supported | Not supported | Not supported | Supported | Ratio chart comparing expected stalls with actual stalls. A stall refers to the waiting time incurred during instruction execution due to resource conflicts, data dependencies, or other reasons.

Memory Channel Throughput Waveform Chart

Description

Visualizes the visualize_data.bin files generated by msopprof simulator. On the page, you can view the statistical analysis of the memory bandwidth of the operator MTE log channel over time, helping you identify the bandwidth usage of the operator during different operator stages and evaluate the feasibility of bandwidth optimization.

Precautions

  • For detailed MindStudio Insight operations and field explanations, see Timeline in MindStudio Insight Operator Tuning.
  • Memory channel throughput waveform charts can only be displayed for Atlas A2 training products, Atlas A2 inference products, Atlas A3 training products, and Atlas A3 inference products.
  • This feature is disabled by default. The --core-id setting has no effect on this feature.

Usage Instructions

The following figure shows the memory channel throughput waveform chart.

Figure 1 Memory channel throughput waveform chart

  • Displays the data throughput (in MB/s) for various types of memory channels (currently limited to GM_TO_L1, GM_TO_TOTAL, GM_TO_UB, L1_TO_GM, TOTAL_TO_GM, and UB_TO_GM). For example, GM_TO_UB represents the throughput from GM to UB, while GM_TO_TOTAL represents the throughput from GM to each memory unit.

  • By correlating the waveform with MTE-related instructions, you can observe the throughput while those instructions execute, which helps identify operator performance issues.

    NOTE

    • The data used for throughput calculation corresponds to the completion of multiple requests for a specific instruction.
    • The throughput waveform may appear anywhere within the time range between the start and end of an instruction (inclusive). For example, for an instruction that runs from 1 µs to 3 µs, the throughput data might be distributed across three bars covering the 1 to 2 µs, 2 to 3 µs, and 3 to 4 µs intervals.
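
The bucketing described in the note can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual accounting: each request completion is modeled as a hypothetical (timestamp in µs, bytes) pair, bars are 1 µs wide, and 1 MB is taken as 10**6 bytes so that bytes per microsecond equals MB/s. The function name throughput_bars is also hypothetical.

```python
import math
from collections import defaultdict


def throughput_bars(completions, width_us=1.0):
    """Bucket (timestamp_us, bytes) request completions into fixed-width bars.

    Returns {bar_start_us: throughput in MB/s}, assuming 1 MB = 10**6 bytes.
    """
    bytes_per_bar = defaultdict(int)
    for ts_us, nbytes in completions:
        # A completion at exactly 3.0 us falls into the bar that starts at 3.0 us.
        bar_start = math.floor(ts_us / width_us) * width_us
        bytes_per_bar[bar_start] += nbytes
    # bytes per microsecond is numerically equal to MB/s when 1 MB = 10**6 bytes
    return {start: total / width_us for start, total in sorted(bytes_per_bar.items())}


# Hypothetical instruction running from 1 us to 3 us with three request completions.
completions = [(1.0, 256), (2.5, 512), (3.0, 256)]
bars = throughput_bars(completions)
print(bars)
```

Because the end time is inclusive, the completion at 3.0 µs lands in the 3 to 4 µs bar, so an instruction that spans 2 µs produces three bars in this sketch.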