nputrace
Overview
The nputrace tool collects detailed profile data from the framework, CANN, and devices.
Preparations
Install msMonitor. For details, see msMonitor Installation Guide. Installing from the downloaded software package is recommended.
nputrace Functions
Function
Collects profile data.
Precautions
As a subcommand of the dyno command, nputrace requires --certs-dir. The value of --certs-dir must be the same as that of --certs-dir in dyno and dynolog.
Syntax
dyno --certs-dir <CERT_DIR> nputrace [options]
CERT_DIR is the certificate path. If TLS certificates are not used, set CERT_DIR to NO_CERTS. The options that can replace [options] are described below.
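As a minimal sketch of the syntax above, the following shell function assembles a nputrace command line and prints it for inspection instead of running it. The function name nputrace_cmd and the use of a CERT_DIR environment variable are illustrative assumptions, not part of msMonitor. The default NO_CERTS corresponds to the case where TLS certificates are not used; the same --certs-dir value would then also have to be passed to dynolog.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of msMonitor): print a dyno nputrace command line.
# CERT_DIR falls back to NO_CERTS, the value used when TLS certificates are not used.
nputrace_cmd() {
    local cert_dir="${CERT_DIR:-NO_CERTS}"
    # "$*" joins the remaining arguments (the nputrace options) with spaces.
    echo "dyno --certs-dir ${cert_dir} nputrace $*"
}

# Print the command that collects one step starting from the next step.
nputrace_cmd --start-step -1 --iterations 1 --log-file /tmp/profile_data
```

With CERT_DIR unset, the last line prints `dyno --certs-dir NO_CERTS nputrace --start-step -1 --iterations 1 --log-file /tmp/profile_data`. Printing rather than executing keeps the sketch safe to run on a machine where dynolog is not installed.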
Option Description
| Option | Required/Optional | Description | Supported by PyTorch (Y/N) | Supported by MindSpore (Y/N) |
|---|---|---|---|---|
| --job-id | Optional | ID of a collection task. The value is of the u64 type. The default value is 0. Native dynolog option. | N | N |
| --pids | Optional | PID list of a collection task. The value is of the string type. Multiple PIDs must be separated by commas (,). The default value is 0. Native dynolog option. | N | N |
| --process-limit | Optional | Maximum number of collection processes. The value is of the u64 type. The default value is 3. Native dynolog option. | N | N |
| --profile-start-time | Optional | Unix timestamp for synchronous collection. The value is of the u64 type, in milliseconds. The default value is 0. Native dynolog option. | N | N |
| --duration-ms | Optional | Collection period. The value is of the u64 type. The default value is 500, in milliseconds. Native dynolog option. | N | N |
| --iterations | Mandatory | Total number of steps for collection. The value is of the i64 type. The value must be a positive integer. Native dynolog option. Must be used together with the --start-step option. | Y | Y |
| --log-file | Mandatory | Path for outputting collected data. The value is of the string type. | Y | Y |
| --start-step | Mandatory | Start step for collection. The value is of the i64 type. The value must be a positive integer or -1. If the value is set to -1, the collection starts from the next step. | Y | Y |
| --record-shapes | Optional | InputShapes and InputTypes collection switch of an operator. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | Y |
| --profile-memory | Optional | Operator memory information collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | Y |
| --with-stack | Optional | Python call stack collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | Y |
| --with-flops | Optional | Operator flops collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | N |
| --with-modules | Optional | Python call stack collection switch at the modules level. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | N |
| --analyse | Optional | Automatic analysis switch after collection. The value is of the action type. If this parameter is set, automatic analysis is enabled. If this parameter is not set, automatic analysis is disabled. | Y | Y |
| --async-mode | Optional | Asynchronous analysis switch. The value is of the action type. If this parameter is set, asynchronous analysis is enabled. If this parameter is not set, synchronous analysis is used. This option does not take effect if --analyse is not configured. | Y | Y |
| --l2-cache | Optional | L2 cache data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | Y |
| --op-attr | Optional | Operator attribute information collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | N |
| --msprof-tx | Optional | mstx dotting data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. In the PyTorch or MindSpore scenario, after this function is enabled, the mstx dotting collects the time consumed by the communication operators (domain: communication) and dataloader, and saves the time consumed by the checkpoint APIs (domain: default) by default. | Y | Y |
| --mstx-domain-include | Optional | When --msprof-tx is enabled to collect mstx dotting data, set this parameter to specify the domain range to be collected. By default, the domain range to be collected is not configured. This option is mutually exclusive with the --mstx-domain-exclude option. If both options are set, only the --mstx-domain-include option takes effect. You can configure one or more domains, for example, --mstx-domain-include domain1, domain2. | Y | Y |
| --mstx-domain-exclude | Optional | When --msprof-tx is enabled to collect mstx dotting data, set this parameter to specify the domain range excluded from collection. By default, the domain range excluded from collection is not configured. This option is mutually exclusive with the --mstx-domain-include option. If both options are set, only the --mstx-domain-include option takes effect. You can configure one or more domains, for example, --mstx-domain-exclude domain1, domain2. | Y | Y |
| --data-simplification | Optional | Data simplification mode. The value can be: • true: enables data simplification. After this function is enabled, redundant data is deleted after profile data is exported. Only the profiler_*.json file, ASCEND_PROFILER_OUTPUT directory, original profile data in the PROF_XXX directory, FRAMEWORK directory, and logs directory are retained to save storage space. • false: disables data simplification. The default value is true. | Y | Y |
| --activities | Optional | CPU and NPU event collection scope. The values are as follows: • CPU: data collection switch of the framework. • NPU: data collection switch of the CANN software stack and NPU. By default, CPU and NPU events are collected concurrently. That is, --activities CPU,NPU is configured. | Y | Y |
| --profiler-level | Optional | Collection level of profiler. The values are as follows: • Level_none: Does not collect data at all levels. That is, --profiler-level is disabled. • Level0: Collects upper-layer application data, bottom-layer NPU data, and information about operators executed on the NPU. • Level1: Collects the data at level 0, AscendCL data at the CANN layer, and AI Core performance metrics executed on the NPU, enables --aic-metrics PipeUtilization, and generates the communication.json, communication_matrix.json, and api_statistic.csv files of the communication operator. • Level2: Collects the data at level 1, runtime data at the CANN layer, and AI CPU data (data_preprocess.csv). The default value is Level0. | Y | Y |
| --aic-metrics | Optional | AI Core metrics to be collected. The values are as follows: • AiCoreNone: Disables AI Core performance metric collection. • PipeUtilization: percentages of time taken by compute units and MTEs. • ArithmeticUtilization: percentages of arithmetic utilization. • Memory: ratio of external memory read/write instructions. • MemoryL0: ratio of internal memory L0 read/write instructions. • ResourceConflictRatio: percentages of pipeline queue instructions. • MemoryUB: ratio of internal memory UB read/write instructions. • L2Cache: read/write cache hit counts and the number of cache re-allocations after misses. • MemoryAccess: bandwidth of the operator's memory access on cores. If --profiler-level is set to Level_none or Level0, the default value is AiCoreNone. If --profiler-level is set to Level1 or Level2, the default value is PipeUtilization. | Y | Y |
| --export-type | Optional | Type of the data analyzed and exported by the profiler. The values are as follows: • Text: timeline and summary files in .json and .csv formats and .db files that summarize all profile data. • Db: only the .db files that summarize all profile data, which are viewed using MindStudio Insight. The default value is Text. | Y | Y |
| --gc-detect-threshold | Optional | GC detection threshold. The value is of the Option<f32> type, in milliseconds. Only GC events that exceed the threshold are collected. If this option is not set, GC detection is disabled. | Y | N |
| --host-sys | Optional | Host-side system data to be collected. The values are as follows: • cpu: process CPU usage • mem: process memory usage • disk: process disk I/O usage • network*: network I/O usage • osrt: process syscall and pthreadcall. You can set one or more types. Use commas (,) to separate multiple types, for example, --host-sys cpu,mem. By default, this option is not set, indicating that host-side system data collection is disabled. | Y | Y |
| --sys-io | Optional | NIC and RoCE data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | Y |
| --sys-interconnection | Optional | Collective communication bandwidth data (HCCS), PCIe, and inter-chip transmission bandwidth data collection switch. The value is of the action type. If this option is set, the collection is enabled. If this option is not set, the collection is disabled. | Y | Y |
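The constraints on the two mandatory step options in the table (--start-step must be a positive integer or -1, --iterations must be a positive integer) can be checked before a trigger is sent. The following is a minimal sketch of such a pre-flight check; the function names are hypothetical and the check is not something nputrace itself is documented to provide.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of msMonitor): validate the mandatory
# step options against the constraints listed in the option table.

# --start-step must be a positive integer or -1.
valid_start_step() {
    [[ "$1" == "-1" || "$1" =~ ^[1-9][0-9]*$ ]]
}

# --iterations must be a positive integer.
valid_iterations() {
    [[ "$1" =~ ^[1-9][0-9]*$ ]]
}

# Return 0 only if both values satisfy their constraints; report the first failure.
check_step_options() {
    local start_step="$1" iterations="$2"
    valid_start_step "$start_step" || { echo "invalid --start-step: $start_step" >&2; return 1; }
    valid_iterations "$iterations" || { echo "invalid --iterations: $iterations" >&2; return 1; }
}

# Example: start from the next step (-1) and collect two steps.
check_step_options -1 2 && echo "ok"
```

A wrapper script could call check_step_options before invoking dyno, so that an invalid value such as --start-step 0 is rejected locally rather than sent to the daemon.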
Example
- Start the dynolog daemon process. For details, see dynolog.
    # Enable the dynolog daemon in CLI mode.
    dynolog --enable-ipc-monitor --certs-dir /home/server_certs
- Enable the dynolog environment variable in the training or inference job startup window.
    export MSMONITOR_USE_DAEMON=1
- Start a training or inference job.
    # The PyTorch optimizer or native optimizer is required in the training job.
    bash train.sh
- Use the dyno CLI to dynamically trigger trace dump.
    # Example 1: Collect data of two steps starting from the 10th step, including the framework, CANN, and device data. After the collection is complete, the data is automatically analyzed and not simplified. The output path is /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step 10 --iterations 2 --activities CPU,NPU --analyse --data-simplification false --log-file /tmp/profile_data
    # Example 2: Collect data of two steps starting from the next step, including the framework, CANN, and device data. After the collection is complete, the data is automatically analyzed and not simplified. The output path is /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step -1 --iterations 2 --activities CPU,NPU --analyse --data-simplification false --log-file /tmp/profile_data
    # Example 3: Collect data of two steps starting from the 10th step, including only the CANN and device data. After the collection is complete, the data is automatically analyzed and simplified. The output path is /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step 10 --iterations 2 --activities NPU --analyse --data-simplification true --log-file /tmp/profile_data
    # Example 4: Collect data of two steps starting from the 10th step. Only CANN and device data is collected but not analyzed. The data is output to /tmp/profile_data.
    dyno --certs-dir /home/client_certs nputrace --start-step 10 --iterations 2 --activities NPU --log-file /tmp/profile_data
    # Example 5: In the multi-server scenario, send parameter information to a specific server x.x.x.x. The parameters indicate that data of two steps starting from the 10th step is collected. Only CANN and device data is collected but not analyzed. The data is output to /tmp/profile_data.
    dyno --certs-dir /home/client_certs --hostname x.x.x.x nputrace --start-step 10 --iterations 2 --activities NPU --log-file /tmp/profile_data
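In the multi-server scenario, the same trigger usually has to be sent to each server in turn. The following is a minimal sketch that prints one command per server using the --hostname option from Example 5. The HOSTS list and its placeholder names node-a and node-b, and the trigger_cmd function, are hypothetical; the commands are only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch only: print one trigger command per server in a multi-server job.
# HOSTS is a hypothetical list; replace the placeholder names with real addresses.
HOSTS=("node-a" "node-b")

# Print the dyno command that targets a single server via --hostname.
trigger_cmd() {
    local host="$1"
    echo "dyno --certs-dir /home/client_certs --hostname ${host} nputrace" \
         "--start-step 10 --iterations 2 --activities NPU --log-file /tmp/profile_data"
}

# Emit the command for every server in the list.
for host in "${HOSTS[@]}"; do
    trigger_cmd "$host"
done
```

Each printed line can be reviewed and then run once the dynolog daemon is up on the corresponding server.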
Output File Description
For details about the output data format and deliverables of nputrace, see MindSpore and PyTorch framework profile data file reference.