nputrace

Overview

The nputrace tool is used to obtain detailed profile data of the framework, CANN, and devices.

Preparations

Install msMonitor. For details, see msMonitor Installation Guide. You are advised to download the software package for installation.

nputrace Functions

Function

Collects profile data.

Precautions

As a subcommand of the dyno command, nputrace requires --certs-dir. The value of --certs-dir must be the same as that of --certs-dir in dyno and dynolog.

Syntax

dyno --certs-dir <CERT_DIR> nputrace [options]

CERT_DIR indicates the certificate path. If the TLS certificate key is not used, set CERT_DIR to NO_CERTS. [options] is described as follows.

Option Description

Option Required/Optional Description Supported by PyTorch (Y/N) Supported by MindSpore (Y/N)
--job-id Optional ID of a collection task. The value is of the u64 type. The default value is 0. Native dynolog option. N N
--pids Optional PID list of a collection task. The value is of the string type. Multiple PIDs must be separated by commas (,). The default value is 0. Native dynolog option. N N
--process-limit Optional Maximum number of collection processes. The value is of the u64 type. The default value is 3. Native dynolog option. N N
--profile-start-time Optional Unix timestamp for synchronous collection. The value is of the u64 type, in milliseconds. The default value is 0. Native dynolog option. N N
--duration-ms Optional Collection period. The value is of the u64 type. The default value is 500, in milliseconds. Native dynolog option. N N
--iterations Mandatory Total number of steps for collection. The value is of the i64 type. The value must be a positive integer. Native dynolog option. Must be used together with the --start-step option. Y Y
--log-file Mandatory Path for outputting collected data. The value is of the string type. Y Y
--start-step Mandatory Start step for collection. The value is of the i64 type. The value must be a positive integer or -1. If the value is set to -1, the collection starts from the next step. Y Y
--record-shapes Optional InputShapes and InputTypes collection switch of an operator. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y Y
--profile-memory Optional Operator memory information collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y Y
--with-stack Optional Python call stack collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y Y
--with-flops Optional Operator flops collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y N
--with-modules Optional Python call stack collection switch at the modules level. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y N
--analyse Optional Automatic analysis switch after collection. The value is of the action type. If this parameter is set, automatic analysis is enabled. If this parameter is not set, automatic analysis is disabled. Y Y
--async-mode Optional Asynchronous analysis switch. The value is of the action type. If this parameter is set, asynchronous analysis is enabled. If this parameter is not set, synchronous analysis is used. This option does not take effect if --analyse is not configured. Y Y
--l2-cache Optional L2 cache data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y Y
--op-attr Optional Operator attribute information collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y N
--msprof-tx Optional mstx dotting data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled.
In the PyTorch or MindSpore scenario, after this function is enabled, the mstx dotting collects the time consumed by the communication operators (domain: communication) and dataloader, and saves the time consumed by the checkpoint APIs (domain: default) by default.
Y Y
--mstx-domain-include Optional When --msprof-tx is enabled to collect mstx dotting data, set this parameter to specify the domain range to be collected. By default, the domain range to be collected is not configured.
This option is mutually exclusive with the --mstx-domain-exclude option. If both options are set, only the --mstx-domain-include option takes effect.
You can configure one or more domains, for example, --mstx-domain-include domain1, domain2.
Y Y
--mstx-domain-exclude Optional When --msprof-tx is enabled to collect mstx dotting data, set this parameter to specify the domain range excluded from collection. By default, the domain range excluded from collection is not configured.
This option is mutually exclusive with the --mstx-domain-include option. If both options are set, only the --mstx-domain-include option takes effect.
You can configure one or more domains, for example, --mstx-domain-exclude domain1, domain2.
Y Y
--data-simplification Optional Data simplification mode. The value can be:
true: enables data simplification. After this function is enabled, redundant data is deleted after profile data is exported. Only the profiler_*.json file, ASCEND_PROFILER_OUTPUT directory, original profile data in the PROF_XXX directory, FRAMEWORK directory, and logs directory are retained to save storage space.
false: disables data simplification.
The default value is true.
Y Y
--activities Optional CPU and NPU event collection scope. The values are as follows:
CPU: data collection switch of the framework.
NPU: data collection switch of the CANN software stack and NPU.
By default, CPU and NPU events are collected concurrently. That is, --activities CPU,NPU is configured.
Y Y
--profiler-level Optional Collection level of profiler. The values are as follows:
Level_none: Does not collect data at all levels. That is, --profiler_level is disabled.
Level0: Collects upper-layer application data, bottom-layer NPU data, and information about operators executed on the NPU.
Level1: Collects the data at level 0, AscendCL data at the CANN layer, and AI Core performance metrics executed on the NPU, enables --aic-metrics PipeUtilization, and generates the communication.json, communication_matrix.json, and api_statistic.csv files of the communication operator.
Level2: Collects the data at level 1, runtime data at the CANN layer, and AI CPU data (data_preprocess.csv).
• The default value is Level0.
Y Y
--aic-metrics Optional AI Core metrics to be collected. The values are as follows:
AiCoreNone: Disables AI Core performance metric collection.
PipeUtilization: percentages of time taken by compute units and MTEs.
ArithmeticUtilization: percentages of arithmetic utilization.
Memory: ratio of external memory read/write instructions.
MemoryL0: ratio of internal memory L0 read/write instructions.
ResourceConflictRatio: percentages of pipeline queue instructions.
MemoryUB: ratio of internal memory UB read/write instructions.
L2Cache: cache re-allocations upon missing of the read/write cache hit count.
MemoryAccess: bandwidth of the operator's memory access on cores.
If --profiler-level is set to Level_none or Level0, the default value is AiCoreNone. If --profiler-level is set to Level1 or Level2, the default value is PipeUtilization.
Y Y
--export-type Optional Type of the data analyzed and exported by the profiler. The values are as follows:
Text: timeline and summary files in .json and .csv formats and .db files that summarize all profile data.
Db: Only .db files that summarize all profile data are analyzed and displayed using MindStudio Insight.
The default value is Text.
Y Y
--gc-detect-threshold Optional GC detection threshold. The value is of the Option<f32> type, in milliseconds. GC events are collected only when their occurrence exceeds the threshold. By default, GC detection is disabled when this option is not set. Y N
--host-sys Optional Host-side system data to be collected. The values are as follows:
cpu: process CPU usage
mem: process memory usage
disk: process disk I/O usage
network*: network I/O usage
osrt: process syscall and pthreadcall
You can set one or more types. Use commas (,) to separate multiple types, for example, --host-sys cpu,mem.
By default, this option is not set, indicating that host-side system data collection is disabled.
Y Y
--sys-io Optional NIC and RoCE data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y Y
--sys-interconnection Optional Collective communication bandwidth data (HCCS), PCIe, and inter-chip transmission bandwidth data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. Y Y

Example

  1. Start the dynolog daemon process. For details, see dynolog.

    # Enable the dynolog daemon in CLI mode.
    dynolog --enable-ipc-monitor --certs-dir /home/server_certs
    
  2. Enable the dynolog environment variable in the training or inference job startup window.

    export MSMONITOR_USE_DAEMON=1
    
  3. Start a training or inference job.

    # The PyTorch optimizer or native optimizer is required in the training job.
    bash train.sh
    
  4. Use the dyno CLI to dynamically trigger trace dump.

    # Example 1: Collect data of two steps starting from the 10th step, including the framework, CANN, and device data. After the collection is complete, the data is automatically analyzed and not simplified. The output path is /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step 10 --iterations 2 --activities CPU,NPU --analyse --data-simplification false --log-file /tmp/profile_data
    
    # Example 2: Collect data of two steps starting from the next step, including the framework, CANN, and device data. After the collection is complete, the data is automatically analyzed and not simplified. The output path is /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step -1 --iterations 2 --activities CPU,NPU --analyse --data-simplification false --log-file /tmp/profile_data
    
    # Example 3: Collect data of two steps starting from the 10th step, including only the CANN and device data. After the collection is complete, the data is automatically analyzed and simplified. The output path is /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step 10 --iterations 2 --activities NPU --analyse --data-simplification true --log-file /tmp/profile_data
    
    # Example 4: Collect data of two steps starting from the 10th step. Only CANN and device data is collected but not analyzed. The data is output to /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step 10 --iterations 2 --activities NPU --log-file /tmp/profile_data
    
    # Example 5: In the multi-server scenario, send parameter information to a specific server x.x.x.x. The parameters indicate that data of two steps starting from the 10th step is collected. Only CANN and device data is collected but not analyzed. The data is output to /tmp/profile_data.
    dyno --certs-dir /home/client_certs --hostname x.x.x.x nputrace --start-step 10 --iterations 2 --activities NPU --log-file /tmp/profile_data
    

Output File Description

For details about the output data format and deliverables of nputrace, see MindSpore and PyTorch framework profile data file reference.