MindStudio Insight Memory Tuning

Overview

MindStudio Insight visualizes detailed on-device memory allocation. Two types of memory data sources are currently supported:

memscope data source

The tool visualizes the lifetime of memory allocations and releases, and annotates each allocation with its Python call stack and custom tracing labels. This helps you locate memory problems, analyze peak values, and identify inefficient memory blocks.

PyTorch Snapshot data source

The tool visualizes the lifetime of memory allocations and releases, and uses the memory pool allocation status to locate and tune memory fragmentation problems.

NOTE

The PyTorch Snapshot function is referred to as the memory snapshot function in this document.

Preparations

Collecting memscope Data

Environment Setup

Install MindStudio Insight first. For details, see MindStudio Insight Installation Guide.

Data preparation

Import profile data in the correct format. For details about the data, see Data Description. For details about how to import data, see Importing Data.

Collecting PyTorch Snapshot Data

Basic Collection Process

torch_npu provides APIs for collecting memory snapshots. The basic collection process is as follows:

Enable the memory history function. Before running the model code, call torch_npu.npu.memory._record_memory_history() to enable the memory history function. This function records memory allocation and release events and stack information.
Run the target code, that is, the model code for analyzing the memory usage scenario (such as training or inference).
Export the memory snapshot. After the code execution is complete, call torch_npu.npu.memory._dump_snapshot("snapshot.pickle") to save the memory snapshot as a pickle file.

Core API Parameters Description

Parameters of the torch_npu.npu.memory._record_memory_history() function

enabled: controls the range of the content to be recorded.

None: disables memory history.
state: records only information about the currently allocated memory.
all: records the history of all memory allocation and deallocation events (default).

context: controls the scope of recorded stack information.

None: No stack information is recorded.
state: records only the stack information of the currently allocated memory.
alloc: also records the stack information of the memory allocation operation.
all: also records the stack information of the memory deallocation operation (default).

stacks: controls the depth of the stack information.

python: includes the stack information of the Python, TorchScript, and Inductor frameworks.
all: also includes the stack information of the C++ framework in addition (default).

max_entries: limits the number of recorded memory events. The default value is 9223372036854775807, indicating that there is no actual limit.

device: (Optional) specifies the device whose memory history is to be recorded.

Parameters of the torch_npu.npu.memory._dump_snapshot(file_path) function

file_path: specifies the path of the file that stores the memory snapshot. The file format is pickle.

The following is an example of the code for collecting PyTorch Snapshot (memory snapshot) data:

import torch
import torch_npu

# Enable memory history. stacks='python' limits the recorded stacks to
# Python, TorchScript, and Inductor frames; omit it to also record C++ frames.
torch_npu.npu.memory._record_memory_history(stacks='python')

# Run the model code.
def run_model():
    # Model definition and training/inference code
    model = torch.nn.Linear(1000, 1000).npu()
    inputs = torch.randn(1000, 1000).npu()
    output = model(inputs)
    loss = output.sum()
    loss.backward()

run_model()

# Export the memory snapshot.
torch_npu.npu.memory._dump_snapshot("model_memory_snapshot.pickle")
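
Once exported, the pickle file can also be inspected offline before it is imported into MindStudio Insight. The following is a minimal sketch, assuming the file follows the PyTorch memory snapshot layout (a dictionary with `segments` and `device_traces` keys, where each trace event carries an `action` key). The helper names `load_snapshot` and `summarize_actions` and the synthetic `example_snapshot` are illustrative and not part of torch_npu.

```python
import pickle
from collections import Counter


def load_snapshot(path):
    """Load a snapshot file written by _dump_snapshot()."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_actions(snapshot):
    """Count trace events per action across all devices.

    Assumes snapshot["device_traces"] is a list (one entry per device)
    of event dicts that each carry an "action" key.
    """
    counts = Counter()
    for trace in snapshot.get("device_traces", []):
        for event in trace:
            counts[event["action"]] += 1
    return dict(counts)


# Synthetic stand-in for a real snapshot, used only to show the shape.
example_snapshot = {
    "segments": [],
    "device_traces": [[
        {"action": "segment_alloc", "addr": 0x1000, "size": 2097152},
        {"action": "alloc", "addr": 0x1000, "size": 512},
        {"action": "free_requested", "addr": 0x1000, "size": 512},
        {"action": "free_completed", "addr": 0x1000, "size": 512},
    ]],
}
action_counts = summarize_actions(example_snapshot)
print(action_counts)
```

For a real file, `summarize_actions(load_snapshot("model_memory_snapshot.pickle"))` would give a quick view of how many events of each type were recorded. Only load pickle files that you produced yourself, because unpickling can execute arbitrary code.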

Data Description

memscope Data Description

You can import the memory result files in DB format collected by the msMemScope tool to display related content in graphics. For details about how to obtain the DB file, see section "Collection via CLI" in the msMemScope Memory Collection. For details about the memory data that can be imported, see Table 1 Memory data description.

Table 1 Memory data description

File Name	Description	Displayed Content
msMemScope_dump_{timestamp}.db	The value of `--events` must contain at least the alloc and free events.	Memory block lifetime chart (memory allocation/release curve and memory block chart)
	The value of `--analysis` contains `decompose`.	Memory details chart
	The tracer function is enabled using the Python API.	Python call stack chart

PyTorch Snapshot Data Description

PyTorch Memory Snapshot is a memory snapshot function provided by PyTorch. It is used to record and analyze the memory usage of the memory pool managed by PyTorch during model running. For details about the original data of memory snapshot, see "Memory Snapshot Application Scenarios" in Section "Model Development" in the Ascend Extension for PyTorch.

Common Memory Problems

Memory leak, overflow, or reallocation: During model training or inference, the memory usage keeps increasing step by step or request by request until an avalanche occurs (APP or PTA reserved memory suddenly decreases after reaching a certain point) or out of memory (OOM) occurs.
Memory fragmentation: During model training, there is a large gap between the operator reservation and operator allocation curves.
Peak memory tuning: During model training, the peak memory needs to be analyzed to determine which operators or tensors cause the peak memory. Then evaluate whether the memory peak can be reduced by adjusting the allocation order of tensors or the execution order of operators.
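
The first two symptoms above can be checked numerically once the curves are exported as plain lists. The sketch below is only illustrative: the function names, the sample values, and the choice of "strictly increasing peak per step" as the growth criterion are assumptions, not criteria used by MindStudio Insight.

```python
def fragmentation_gap(reserved, allocated):
    """Return the per-sample gap (reserved - allocated) in bytes."""
    return [r - a for r, a in zip(reserved, allocated)]


def grows_every_step(peaks_per_step):
    """True if peak usage strictly increases from each step to the next."""
    return all(b > a for a, b in zip(peaks_per_step, peaks_per_step[1:]))


# Illustrative numbers only (bytes); not measured data.
reserved = [4096, 8192, 8192, 8192]
allocated = [3072, 6144, 2048, 1024]
gaps = fragmentation_gap(reserved, allocated)
worst_ratio = max(g / r for g, r in zip(gaps, reserved))
print(gaps, worst_ratio)

# A peak that rises at every step hints at a leak; one dip breaks the pattern.
print(grows_every_step([100, 120, 150]), grows_every_step([100, 90, 150]))
```

A large and persistent gap ratio points toward fragmentation, while a per-step peak that never stops rising points toward a leak. Both are only starting points for the graphical analysis.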

NOTE

The preceding common memory problems can be observed on the memory tab page of the System Tuning page after profile data is collected.

Memory Details

memscope Data Memory Details

Function Description

During memory tuning, MindStudio Insight displays the memory status through the Python call stack and memory block lifetime charts, allowing developers to conveniently analyze and locate memory issues.

GUI Description

The memory details (msMemScope) page consists of the call stack flame graph (area 1), memory block lifetime graph (area 2), memory details disassembly diagram (area 3), and memory details table (area 4), as shown in Figure 1 Memory details page.

Figure 1 Memory details page

Area 1: In the call stack flame graph, you can select a thread ID to display the corresponding Python stack graph. To highlight functions, enter function names in the Search text box or select function names from the drop-down list.

NOTE
By default, the Allow Trim option is selected in this area. In this state, the tool compresses data without affecting the overall data display, improving the tool usability.
Area 2: The memory block lifetime graph displays the memory allocation/release line graph and memory block graph. You can select a color block in the memory block chart to view details about the memory block. You can also select a device ID and type to view the corresponding memory block lifetime graph.
Area 3: The memory details disassembly diagram is not displayed by default. When you hover the mouse pointer over the call stack flame graph or memory block lifetime graph, a timeline is displayed. In the memory block lifetime graph area, click the timeline to view the memory details disassembly graph at the corresponding time point. After the memory details disassembly graph is displayed, you can click the button in the upper left corner to hide or display the graph.
Area 4: The memory details table is classified into Block View and Event View for your selection. For details, see Memory Details.

Usage Description

The call stack flame graph and memory block lifetime graph can be used to view data within a specified time range.

In the call stack flame graph and memory block lifetime graph of MindStudio Insight, you can select a range by dragging the zoom slider in the trend chart to display data within a specified range, as shown in Figure 1 Zoom slider in the trend chart.

Figure 1 Zoom slider in the trend chart

The zoom slider in the trend chart supports trend chart visualization and diversified operations on the time domain.

Trend chart visualization:

The background of the zoom slider in the trend chart of MindStudio Insight displays the trend of the memory usage (Operator Allocated) in the overall time range, and intuitively displays the memory usage trend in the selected time range.

NOTE
The Operator Allocated curve indicates the change trend of the allocated memory collected when the operator allocates or releases the memory. It represents the total allocated memory of all operators.
Diversified operations on the time domain:

You can use the left and right arrow keys to select the start time and end time and view the memory usage.
You can select any range in the trend chart to select a time range and display the corresponding memory usage.
You can set a fixed time length and drag the slider leftwards or rightwards to view the memory usage in the fixed time length.
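
The Operator Allocated curve described above is a running total over allocation and release events. The sketch below rebuilds such a curve from a list of events and finds its peak. The tuple layout `(timestamp_ns, kind, size_bytes)` and the function names are assumptions made for illustration; they do not reflect the tool's internal data format.

```python
def allocated_curve(events):
    """Build (timestamp, total_allocated) points from alloc/free events.

    `events` is an iterable of (timestamp_ns, kind, size_bytes) tuples,
    where kind is "alloc" or "free".
    """
    total = 0
    curve = []
    for ts, kind, size in sorted(events, key=lambda e: e[0]):
        total += size if kind == "alloc" else -size
        curve.append((ts, total))
    return curve


def peak(curve):
    """Return the (timestamp, value) point with the highest total."""
    return max(curve, key=lambda point: point[1])


# Illustrative events only; not measured data.
events = [
    (10, "alloc", 512),
    (20, "alloc", 1024),
    (30, "free", 512),
    (40, "alloc", 2048),
    (50, "free", 1024),
]
curve = allocated_curve(events)
print(curve)
print(peak(curve))
```

With these sample events the peak is reached at timestamp 40 with 3072 bytes allocated, which is the point you would click in the lifetime graph to see which blocks are alive at the peak.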

The memory block lifetime graph and memory details disassembly diagram support dragging and scrolling.

The memory block lifetime graph and memory details disassembly diagram can be moved by dragging and zoomed in or out by scrolling.

Memory details disassembly diagram

When you hover the mouse pointer over the call stack flame graph or memory block lifetime graph, a timeline is displayed. In the memory block lifetime graph area, click the timeline to view the memory details disassembly diagram at the corresponding time point below the memory block lifetime graph, helping you view the memory usage. The content displayed in Memory Details Disassembly Diagram varies depending on the selected type.

To view a specified memory layer, click the layer bar below Memory Details Disassembly Diagram.

When the type is set to HAL, Memory Details Disassembly Diagram displays only the memory data classified at the CANN layer, as shown in Figure 2 Memory details disassembly diagram at the CANN layer.

Figure 2 Memory details disassembly diagram at the CANN layer
If Type is set to a value other than HAL, Memory Details Disassembly Diagram displays the memory type and layer of the memory pool on the corresponding framework. For example, if Type is set to PTA, Memory Details Disassembly Diagram displays only the memory information of the PTA framework, as shown in Figure 3 Memory details disassembly diagram of the PTA framework.

Figure 3 Memory details disassembly diagram of the PTA framework

NOTE

You can drag the Memory Details Disassembly Diagram left, right, up, and down, and zoom in or out on it.

Place the cursor on the diagram and hold down the left mouse button to drag the diagram left, right, up, or down.
On the Memory Details Disassembly Diagram, you can use the mouse wheel to zoom in or out. Alternatively, you can left-click a memory block to zoom in on the selected memory layer.

Memory Details

The Memory Details Table area displays memory details by Block View and Event View. By default, all memory information is displayed.

NOTE

The memory details are hidden by default. To view the details, click the expand button to display the memory details. If you do not need to view the memory details, click the collapse button to hide the memory details.

Filtering is supported for the Size, Malloc Timestamp, and Free Timestamp fields in Block View, and the Timestamp field in Event View. After you click the filter icon, you can enter integers between 0 and the value displayed in the table to set the minimum and maximum values of the filter range.

Block View: displays detailed information about memory blocks, as shown in Figure 4 Block View. For details about the fields, see Table 1 Block View fields.

When you select different device IDs and types in the memory block lifetime graph, the information displayed in the memory block view is updated accordingly. When you select an area in the memory block lifetime graph, the information displayed in the memory block view is also updated accordingly, showing information about all memory blocks that overlap with the selected time range.
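
Selecting blocks that overlap a time range is an interval intersection test. The following sketch shows one way to express it. The dictionary keys, the inclusive boundaries, and the use of `None` for a block that was never released are assumptions for illustration; the tool's exact boundary handling is not documented here.

```python
def overlaps(block, start_ns, end_ns):
    """True if the block is alive at any point in [start_ns, end_ns].

    A block with free_ns of None is treated as never released.
    """
    malloc_ns, free_ns = block["malloc_ns"], block["free_ns"]
    return malloc_ns <= end_ns and (free_ns is None or free_ns >= start_ns)


# Illustrative blocks only; not measured data.
blocks = [
    {"id": 1, "malloc_ns": 0, "free_ns": 50},
    {"id": 2, "malloc_ns": 40, "free_ns": 120},
    {"id": 3, "malloc_ns": 130, "free_ns": None},
    {"id": 4, "malloc_ns": 90, "free_ns": None},
]
selected = [b["id"] for b in blocks if overlaps(b, 60, 100)]
print(selected)
```

Block 2 is included even though it was requested before the range starts and released after it ends, because it is alive during the selected range.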

To filter inefficient memory blocks, you can click Filter Inefficient Memory Blocks in the upper right corner of the Block View table, and set thresholds for early request, delayed release, or long idle period.

Figure 4 Block View

Table 1 Block View fields

	Field	Description
	ID	Unique ID of a memory block.
	Addr	Memory block address, which corresponds to the address of the memory request, release, or access event.
	Size(bytes)	Size of the memory block, which corresponds to the memory request event. The unit is bytes.
	Malloc Timestamp(ns)	Time when the memory block is requested, which corresponds to the memory request event. The unit is nanoseconds.
	Free Timestamp(ns)	Time when the memory block is released, which corresponds to the memory release event. The unit is nanoseconds.
	Owner	Memory block owner.
	Process ID	Memory block process ID, which corresponds to the process ID of the memory request or release event.
	Thread ID	Memory block thread ID, which corresponds to the thread ID of the memory request or release event.
	First Access Timestamp(ns)	Time of the first access event.
	Last Access Timestamp(ns)	Time of the last access event.
	Max Access Interval(ns)	Maximum interval of access events.
	Attr	Extended attributes: - allocation_id: ID for the request, access, or release sequence of a memory block. It uniquely identifies a group of memory events. - lazy_used: early request. The value can be true or false. true indicates that the scenario has been detected. - delayed_free: delayed release. The value can be true or false. true indicates that the scenario has been detected. - long_Idle: long idle period. The value can be true or false. true indicates that the scenario has been detected.

NOTE

If the imported data is collected by the msMemScope tool of a version earlier than MindStudio 8.2.RC1 or no access event is collected, allocation_id is displayed as 0, First Access Timestamp(ns) and Last Access Timestamp(ns) are displayed as -1, and Max Access Interval(ns) is displayed as 0.
Currently, the msMemScope tool can collect memory access events only in the ATB and Ascend Extension for PyTorch operator scenarios. Therefore, First Access Timestamp(ns), Last Access Timestamp(ns), and Max Access Interval(ns) are available only in these scenarios. In other scenarios, First Access Timestamp(ns) and Last Access Timestamp(ns) are displayed as -1, and Max Access Interval(ns) is displayed as 0.
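
The three inefficiency categories can be approximated from the Block View columns. The sketch below is an interpretation, not the tool's algorithm: it assumes that early request means a long wait between request and first access, delayed release means a long wait between last access and release, and long idle means a large maximum access interval. The dictionary keys and threshold parameters are hypothetical.

```python
def classify_block(block, early_ns, delayed_ns, idle_ns):
    """Return the inefficiency labels that apply to one Block View row.

    Timestamps of -1 mean no access event was collected, so the
    access-based checks are skipped for such rows.
    """
    labels = []
    first, last = block["first_access_ns"], block["last_access_ns"]
    if first == -1 or last == -1:
        return labels
    if first - block["malloc_ns"] > early_ns:
        labels.append("early_request")
    if block["free_ns"] - last > delayed_ns:
        labels.append("delayed_release")
    if block["max_access_interval_ns"] > idle_ns:
        labels.append("long_idle")
    return labels


# Illustrative rows only; not measured data.
used_block = {
    "malloc_ns": 0, "first_access_ns": 5000, "last_access_ns": 6000,
    "free_ns": 20000, "max_access_interval_ns": 1000,
}
untracked_block = {
    "malloc_ns": 0, "first_access_ns": -1, "last_access_ns": -1,
    "free_ns": 20000, "max_access_interval_ns": 0,
}
used_labels = classify_block(used_block, 1000, 10000, 5000)
untracked_labels = classify_block(untracked_block, 1000, 10000, 5000)
print(used_labels, untracked_labels)
```

Rows without access events return no labels rather than a false positive, which mirrors the -1 placeholder values described in the note.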

Event View: displays detailed information about memory events, as shown in Figure 5 Event View. For details about the fields, see Table 2 Event View fields.

When you select different device IDs in the memory block lifetime graph, the information displayed in the event view is updated accordingly. When you select a specific area in the memory block lifetime graph, the information in the event view is also updated, displaying all memory events within the selected time range.

Figure 5 Event View

Table 2 Event View fields

	Field	Description
	ID	Event ID, which, together with Process ID, uniquely identifies an event.
	Event	Events recorded by msMemScope.
	Event Type	Event subtypes.
	Name	Event name, which is related to the value of Event.
	Timestamp(ns)	Time when a memory event occurs.
	Process ID	Process information.
	Thread ID	Thread ID.
	Addr	Memory address.
	Attr	Memory event attribute. Each event type has its own attribute.
	Call Stack(Python)	Python call stack. This field is displayed only when the information is collected.
	Call Stack(C)	C call stack. This field is displayed only when the information is collected.

NOTE

For details about the values of Event, Event Type, and Name fields, see the description of the msmemscope_dump_{timestamp}.csv result file in section "Collection via CLI" in msMemScope Memory Collection.

Slice Detail: displays details about the memory block, as shown in Figure 6 Slice Detail.

When you click the view of any time point in the memory block lifetime graph, the displayed information in the Slice Detail area is updated accordingly.

Figure 6 Slice Detail

PyTorch Snapshot Data Memory Details (Memory Snapshot)

Description

The tool visualizes the lifetime of memory allocations and releases, and uses the memory pool allocation status to locate and tune memory fragmentation problems.

GUI Description

The PyTorch Snapshot page consists of the memory block lifetime graph (area 1), memory pool status graph (area 2), and memory details table (area 3), as shown in Figure 1 PyTorch Snapshot.

Figure 1 PyTorch Snapshot

Area 1: The memory block lifetime graph displays the memory allocation/release line graph and memory block graph. You can select a color block in the memory block chart to view details about the memory block.
Area 2: When you hover the mouse pointer over the memory block lifetime graph, a timeline is displayed. In the memory block lifetime graph area, click the timeline to view the memory pool status graph at the corresponding time point. For details, see Memory Pool Status Graph.
Area 3: The memory snapshot details table is classified into Block View and Event View for your selection. For details, see Memory Snapshot Details.

Usage Description

The memory block lifetime graph can be used to view data within a specified time range.

In the memory block lifetime graph of MindStudio Insight, you can select a range by dragging the zoom slider in the memory snapshot trend chart to display data within a specified range, as shown in Figure 1 Zoom slider in the memory snapshot trend chart.

Figure 1 Zoom slider in the memory snapshot trend chart

The zoom slider in the trend chart supports trend chart visualization and diversified operations on the time domain.

Trend chart visualization:

The background of the zoom slider in the memory snapshot trend chart of MindStudio Insight displays the trend chart of the memory usage (Operator Allocated) in the overall time range, intuitively showing the trend of the entire memory usage in the selected time range.

NOTE
The Operator Allocated curve indicates the change trend of the allocated memory collected when the operator allocates or releases the memory. It represents the total allocated memory of all operators.
Diversified operations on the time domain:

You can use the left and right arrow keys to select the start time and end time and view the memory usage.
You can select any range in the trend chart to select a time range and display the corresponding memory usage.
You can set a fixed time length and drag the slider leftwards or rightwards to view the memory usage in the fixed time length.

The memory block lifetime graph and memory pool status graph support dragging and scrolling.

The memory block lifetime graph and memory pool status graph can be moved by dragging and zoomed in or out by scrolling.

Memory Pool Status Graph

Click a memory block in the memory block lifetime graph to display the event overview and the memory pool status graph at the corresponding time point below the memory block lifetime graph. You can also search for an address to locate the event details precisely, as shown in Figure 2 Memory pool status graph.

NOTE

If the memory pool status is not updated after you click a memory block, no allocation event was collected for that memory block during the collection period.

Figure 2 Memory pool status graph
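
The memory pool status at a point in time can also be summarized numerically. The following sketch assumes the PyTorch snapshot layout, in which each segment is a dictionary with `total_size` and a `blocks` list whose items carry `size` and `state`. The function name `pool_summary` and the sample data are illustrative.

```python
def pool_summary(segments):
    """Summarize reusable (inactive) memory across pool segments.

    Assumes each segment is a dict with "total_size" and a "blocks"
    list whose items carry "size" and "state".
    """
    reserved = sum(seg["total_size"] for seg in segments)
    inactive = [
        blk["size"]
        for seg in segments
        for blk in seg["blocks"]
        if blk["state"] == "inactive"
    ]
    return {
        "reserved": reserved,
        "inactive_total": sum(inactive),
        "largest_inactive": max(inactive, default=0),
    }


# Illustrative pool state only; not measured data.
segments = [
    {"total_size": 4096, "blocks": [
        {"size": 1024, "state": "active_allocated"},
        {"size": 1024, "state": "inactive"},
        {"size": 1024, "state": "active_allocated"},
        {"size": 1024, "state": "inactive"},
    ]},
    {"total_size": 2048, "blocks": [
        {"size": 512, "state": "active_pending_free"},
        {"size": 1536, "state": "inactive"},
    ]},
]
summary = pool_summary(segments)
print(summary)
```

In this sample, 3584 bytes are inactive in total but the largest single inactive block is 1536 bytes, so a larger request could not reuse any one of these blocks. A wide difference between the two numbers is one sign of fragmentation.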

Memory Snapshot Details

The memory details include the slice detail and system view. The slice detail displays the details of the event. In the system view, the memory block view and memory event view are used to display memory details. By default, all memory-related information is displayed.

NOTE

For the Size(bytes) and Requested Size(bytes) fields in the memory block view, and the Size(bytes), Allocated(bytes), Active(bytes), and Reserved(bytes) fields in the memory event view, you can click the filter icon to enter the minimum and maximum values for range filtering.

Block View: displays detailed information about memory blocks, as shown in Figure 3 Block View. For details about the fields, see Table 1 Block View fields.

Figure 3 Block View

Table 1 Block View fields

Field	Required/Optional	Description	Type	Example Value	Remarks
ID	Required	Unique ID of a memory block.	Integer	0	None
Requested Size(bytes)	Required	Size of the memory block to be allocated, in bytes.	Floating point number	12.5	Size of the memory required by the allocation event. The PTA performs padding and alignment based on the requested size. Therefore, the allocated size may be greater than the requested size.
Size(bytes)	Required	Actual size of the memory block, in bytes.	Floating point number	12.5	Actual size of the memory allocated to the memory block, in bytes. The value is greater than or equal to the requested size.
Address	Required	Memory block address.	0xhexadecimal address	0x7f9f00000000	Address of the memory block in the memory.
State	Required	Memory block status.	One of the enumerated values. For details, see the remarks.	`active_allocated`	Current status of the memory block. The options are as follows: `active_allocated`: The memory block has been allocated and is currently in use. It cannot be reused. `active_pending_free`: The memory block has been requested to be released, but the release is not complete (possibly due to cross-stream dependencies). It cannot be reused. `inactive`: The memory block is not allocated (or the memory block has been released) and can be reused.
Alloc Event ID	Optional	ID of the memory block allocation event.	Integer	1	Unique ID of the memory block allocation event. The value `-1` indicates that the memory block allocation event is not recorded in the memory snapshot collection lifetime.
Free Event ID	Optional	ID of the memory block release completion event.	Integer	2	Unique ID of the memory block release completion event. The value `-1` indicates that the memory block release completion event is not recorded in the memory snapshot collection lifetime.
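
Because Size(bytes) is greater than or equal to Requested Size(bytes), the difference between the two columns gives the padding and alignment overhead. The sketch below sums that overhead over a few rows. The key names and sample values are assumptions for illustration, and the actual alignment rule of the PTA allocator is not modeled.

```python
def padding_overhead(blocks):
    """Return total bytes of (size - requested_size) over the rows."""
    return sum(b["size"] - b["requested_size"] for b in blocks)


# Illustrative rows only; not measured data.
rows = [
    {"requested_size": 1000, "size": 1024},
    {"requested_size": 512, "size": 512},
    {"requested_size": 3000, "size": 3072},
]
overhead = padding_overhead(rows)
total_size = sum(b["size"] for b in rows)
print(overhead, total_size)
```

Here 96 of the 4608 allocated bytes are padding. Many small requests with a high padding share can be worth batching or resizing.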

Event View: displays details about memory events, as shown in Figure 4 Event View. For details about the fields, see Table 2 Event View fields.

Figure 4 Event View

Table 2 Event View fields

Field	Required/Optional	Description	Type	Example Value	Remarks
ID	Required	Unique ID of a memory event.	Integer	0	None
Action	Required	Memory event operation type.	One of the enumerated values. For details, see the remarks.	`alloc`	Memory segment operations: `segment_alloc`: memory segment allocation event, which triggers the PTA memory pool to allocate physical memory from the driver and triggers capacity expansion. `segment_free`: memory segment release event, which triggers the PTA memory pool to release physical memory and triggers capacity reduction. `segment_map`: memory segment mapping event. In the virtual memory scenario, this event triggers the PTA memory pool to map the physical memory to the virtual address space. `segment_unmap`: memory segment unmapping event. In the virtual memory scenario, this event triggers the PTA memory pool to cancel the mapping of the physical memory to the virtual address space. Memory block operations: `alloc`: memory block allocation event. The PTA searches the memory pool for available inactive memory blocks for secondary allocation. `free_requested`: memory block release request event. The PTA memory pool sets the memory block status to `active_pending_free` and waits for subsequent release. `free_completed`: memory block release completion event. The PTA memory pool sets the memory block status to `inactive` and releases the memory block to the memory pool. Operator workspace snapshot: `workspace_snapshot`.
Address	Required	Memory event address.	0xhexadecimal address	0x7f9f00000000	Memory address operated by the memory event.
Size(bytes)	Required	Size of the memory operated by the memory event, in bytes.	Floating point number	12.5	None
Stream	Required	ID of the stream to which the memory event belongs.	Integer	0	None
Allocated(bytes)	Required	Total size of secondary allocations from the PTA memory pool after the event, in bytes.	Floating point number	12.5	Total size of all `active_allocated` blocks in all segments, indicating the size of memory that has been secondarily allocated from the PTA memory pool to tensors when the event occurs.
Active(bytes)	Required	Total size of active memory in the PTA memory pool after the event occurs, in bytes.	Floating point number	12.5	Sum of the total size of all `active_allocated` blocks and the total size of all `active_pending_free` blocks in all segments, indicating the size of memory that is occupied and cannot be reused in the PTA memory pool when the event occurs.
Reserved(bytes)	Required	Total size of reserved memory in the PTA memory pool after the event occurs, in bytes.	Floating point number	12.5	The value is the total size of all memory segments, indicating the size of memory that is actually allocated from the driver and reserved in the PTA memory pool when the event occurs.
Call Stack	Optional	Call stack of a memory event.	String	`/home/xxx/test/demo.py: 60 main`	Call stack of a memory event, which displays the call stack triggered when the memory event occurs. If the value is empty, the possible causes are as follows: Stacks are not enabled when `_record_memory_history` is called. The event occurs in autograd during backward propagation, and there may be no call stack information.
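
The relationship between the Action values and the Allocated, Active, and Reserved counters can be illustrated by replaying events in order. The sketch below assumes that Allocated counts `active_allocated` blocks, Active counts `active_allocated` plus `active_pending_free` blocks, and Reserved counts segment memory. It ignores `segment_map`, `segment_unmap`, and `workspace_snapshot` for brevity, and the function name `replay` is illustrative.

```python
def replay(events):
    """Replay events and return (allocated, active, reserved) after each."""
    allocated = active = reserved = 0
    history = []
    for ev in events:
        action, size = ev["action"], ev["size"]
        if action == "segment_alloc":
            reserved += size       # pool grows
        elif action == "segment_free":
            reserved -= size       # pool shrinks
        elif action == "alloc":
            allocated += size      # block becomes active_allocated
            active += size
        elif action == "free_requested":
            allocated -= size      # block becomes active_pending_free
        elif action == "free_completed":
            active -= size         # block becomes inactive
        history.append((allocated, active, reserved))
    return history


# Illustrative events only; not measured data.
events = [
    {"action": "segment_alloc", "size": 2048},
    {"action": "alloc", "size": 512},
    {"action": "alloc", "size": 1024},
    {"action": "free_requested", "size": 512},
    {"action": "free_completed", "size": 512},
]
history = replay(events)
print(history)
```

After `free_requested`, Allocated drops to 1024 while Active stays at 1536, because the block is still pending release and cannot be reused until `free_completed` arrives.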

Slice Detail: displays details about the memory block, as shown in Figure 5 Slice Detail. For details about the fields, see Table 3 Slice Detail fields.

When you click the view of any time point in the memory block lifetime graph, the displayed information in the Slice Detail area is updated accordingly.

Figure 5 Slice Detail

Table 3 Slice Detail fields

Field	Required/Optional	Description	Type	Example Value	Remarks
ID	Required	Unique ID of a memory event.	Integer	0	None
Action	Required	Memory event operation type.	One of the enumerated values. For details, see the remarks.	`alloc`	Memory segment operations: `segment_alloc`: memory segment allocation event, which triggers the PTA memory pool to allocate physical memory from the driver and triggers capacity expansion. `segment_free`: memory segment release event, which triggers the PTA memory pool to release physical memory and triggers capacity reduction. `segment_map`: memory segment mapping event. In the virtual memory scenario, this event triggers the PTA memory pool to map the physical memory to the virtual address space. `segment_unmap`: memory segment unmapping event. In the virtual memory scenario, this event triggers the PTA memory pool to cancel the mapping of the physical memory to the virtual address space. Memory block operations: `alloc`: memory block allocation event. The PTA searches the memory pool for available inactive memory blocks for secondary allocation. `free_requested`: memory block release request event. The PTA memory pool sets the memory block status to `active_pending_free` and waits for subsequent release. `free_completed`: memory block release completion event. The PTA memory pool sets the memory block status to `inactive` and releases the memory block to the memory pool. Operator workspace snapshot: `workspace_snapshot`.
Address	Required	Memory event address.	0xhexadecimal address	0x7f9f00000000	Memory address operated by the memory event.
Size(MBytes)	Required	Size of the memory operated by the memory event, in MB.	Floating point number	12.5	None
Stream	Required	ID of the stream to which the memory event belongs.	Integer	0	None
Caching Allocated(MBytes)	Required	Total size of secondary allocations from the PTA memory pool after the event, in MB.	Floating point number	12.5	Total size of all `active_allocated` blocks in all segments, indicating the size of memory that has been secondarily allocated from the PTA memory pool to tensors when the event occurs.
Caching Active(MBytes)	Required	Total size of active memory in the PTA memory pool after the event occurs, in MB.	Floating point number	12.5	The value should be the sum of the total size of all `active_allocated` blocks and the total size of all `active_pending_free` blocks in all segments. Size of memory that is actually occupied but cannot be reused in the PTA memory pool when the event occurs.
Caching Reserved(MBytes)	Required	Total size of reserved memory in the PTA memory pool after the event occurs, in MB.	Floating point number	12.5	The value is the total size of all memory segments, indicating the size of memory that is actually allocated from the driver and reserved in the PTA memory pool when the event occurs.
Call Stack	Optional	Call stack of a memory event.	String	`/home/xxx/test/demo.py: 60 main`	Call stack of a memory event, which displays the call stack triggered when the memory event occurs. If the value is empty, the possible causes are as follows: Stacks are not enabled when `_record_memory_history` is called. The event occurs in autograd during backward propagation, and there may be no call stack information.

Comparison Between MemScope Data Collection and PyTorch Snapshot Data Collection

Compared with the memory tuning data collected by MemScope, memory snapshot data has the following advantages:
1. Collection performance overhead: During the collection of memory snapshot data, memory events are recorded only on the host side, and only after `_record_memory_history` is enabled. The PTA allocator segments are saved at dump time. This has little impact on model running performance.
2. Tuning data size: The core data consists of PTA memory events and segments at the collection end time. The data density is high, and the data size is smaller than that collected by MemScope.
3. Special memory pool status data: Compared with the memory tuning data collected by MemScope, the memory snapshot data also includes the status data of the PTA memory pool, such as the memory pool size and memory pool usage. This data helps users comprehensively analyze the memory usage. Especially in the case of memory fragmentation, the memory pool status data helps users intuitively analyze the memory fragmentation.
Compared with the memory tuning data collected by MemScope, memory snapshot data has the following disadvantages:
1. Tuning data analysis: Memory snapshot data is stored in Python pickle data files, which are difficult to use. The analysis relies heavily on visualization capabilities. To parse and analyze raw data, you need to have a basic understanding of memory events and data collection principles.
2. Limited capabilities/performance of community visualization tools: Although community-based online visualization enables analysis of memory snapshot data, notable deficiencies remain in stability, performance, and interactivity.
  - Stability: Online web pages may be unavailable due to client environment issues (such as being restricted to an intranet) or network problems (such as CDN resource exceptions).
  - Performance: When processing large-scale memory snapshots (with more than 15,000 events and a snapshot size exceeding 10 MB), the online visualization web page may experience performance issues, such as long loading times and delayed interaction responses.
  - Interactivity: During the analysis process on the online visualization web page, the only way to associate the memory block lifetime graph with the memory pool status graph is to manually copy the memory address and search for it on the web page. Additionally, it is not possible to reverse the association from the memory pool status graph to the memory block lifetime graph.
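To give a sense of what parsing the raw pickle data involves, the following minimal sketch round-trips a synthetic snapshot through pickle and counts events by action type. The `device_traces` key and the per-event fields (`action`, `addr`, `size`, `stream`) are assumptions modeled on the community PyTorch snapshot layout. A real snapshot.pickle file may differ, and the `count_actions` helper is hypothetical.

```python
import io
import pickle
from collections import Counter


def count_actions(snapshot):
    """Count memory events per action type across all devices (hypothetical helper)."""
    counts = Counter()
    for device_events in snapshot.get("device_traces", []):
        for event in device_events:
            counts[event["action"]] += 1
    return counts


# Synthetic snapshot for one device; the layout is an assumption, not real tool output.
synthetic = {
    "segments": [],
    "device_traces": [[
        {"action": "segment_alloc", "addr": 0x7F9F00000000, "size": 20971520, "stream": 0},
        {"action": "alloc", "addr": 0x7F9F00000000, "size": 13107200, "stream": 0},
        {"action": "free_requested", "addr": 0x7F9F00000000, "size": 13107200, "stream": 0},
        {"action": "free_completed", "addr": 0x7F9F00000000, "size": 13107200, "stream": 0},
    ]],
}

# Round-trip through pickle in memory, as a stand-in for reading a file from disk.
buffer = io.BytesIO()
pickle.dump(synthetic, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

action_counts = count_actions(loaded)
print(dict(action_counts))
```

Only load pickle files from trusted sources, because unpickling can execute arbitrary code.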

Memory Issue Analysis Cases

Overview

Memory issues are common in Ascend full-stack development activities. However, due to the complex software stack layers involved in memory issues (including the OS driver and runtime library, CANN, MindSpore/PyTorch_NPU, model training, and model inference), it is often challenging to locate and resolve these issues. For details about typical memory issue categories, see Table 1 Memory issue categories.

This document describes how to use MindStudio Insight to locate memory issues.

Table 1 Memory issue categories

Issue Category	Symptom	Scenario
Memory corruption	The precision is abnormal or NaN occurs, which usually occurs on the device.	Training, inference, and operator development
Excessive memory usage	Excessive memory usage is usually related to the following two situations: - Memory leak or out of memory (OOM): The memory usage on the host or device keeps increasing or even OOM occurs. - Great difference from the expected or baseline value: Actual measured memory usage far exceeds the expected or baseline data, with differences potentially reaching gigabytes. This behavior is typically observed on the device.	Training and inference

Analysis Process

For excessive memory usage or OOM on the device, the issue analysis process is as follows:

Use the profiling tools to collect profile data and import the data to MindStudio Insight.
On the Memory page, view the memory curve and memory allocation/release details of operators or components in the memory analysis area to perform basic analysis and determine the exception scope, step, or operator.
Use the memory tool (msMemScope) to collect memory details and memory disassembly data within the exception scope, and import the data to MindStudio Insight.
On the memory details (msMemScope) page, analyze the memory usage based on the call stack flame graph, memory block lifetime graph, and memory details table.

Preparation

Preparing Software

Download and install MindStudio Insight. For details, see MindStudio Insight Installation Guide.
Install the msMemScope tool. For details, see msMemScope Installation Guide.

Preparing Data

The following example collects data for a memory leak scenario.

Run the following msMemScope command. In each step, the test script allocates a 40 MB tensor (10 x 1024 x 1024 float32 elements, 4 bytes each) and appends it to the global list leak_mem_list, so the tensor is not released when train_one_step returns. Python trace data is collected for three steps.

msmemscope --level=0,1 --events=alloc,free,access,launch --analysis=decompose --data-format=db python test.py

The sample code of test.py is as follows:

import torch
import torch_npu
from torchvision.models import resnet50
import msmemscope
import msmemscope.describe as describe
leak_mem_list = []
def train_one_step(model, optimizer, loss_fn, device):
    # Mark the code block. The owner attribute of all memory allocation events inside it is labeled as leaks_mem.
    with describe.describer(owner="leaks_mem"):
        # Memory leak code segment: the tensor stays referenced by the global list.
        leak_mem_list.append(torch.randn(1024 * 1024 * 10, dtype=torch.float32).to(device))
    # Single training code segment
    inputs = torch.randn(1, 3, 224, 224).to(device)
    labels = torch.rand(1, 10).to(device)
    pred = model(inputs)
    loss_fn(pred, labels).backward()
    optimizer.step()
    optimizer.zero_grad()
def train(model, optimizer, loss_fn, device, steps=1):
    for i in range(steps):
        train_one_step(model, optimizer, loss_fn, device)
device = torch.device("npu:0")
torch.npu.set_device(device)  # Set the device.
model = resnet50(pretrained=False, num_classes=10).to(device)  # Load the model.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # Define the optimizer.
loss_fn = torch.nn.CrossEntropyLoss()  # Define the loss function.

# Enable the collection of Python function call data.
msmemscope.tracer.start()
train(model, optimizer, loss_fn, device, steps=3)  # Start training.

# Disable the collection of Python function call data.
msmemscope.tracer.stop()

After the collection is complete, a file in .db format is generated.
Download the file to the local host.
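Before importing the file, it can be useful to check that it is readable. The following minimal sketch assumes that the generated .db file is a SQLite database, which this document does not confirm. The `list_tables` helper and the `demo_events` table are hypothetical, and the demo runs against a temporary database rather than real msMemScope output.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the table names in a SQLite file, sorted by name (hypothetical helper)."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo on a throwaway database; the table name below is made up for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute("CREATE TABLE demo_events (id INTEGER PRIMARY KEY, size INTEGER)")
        conn.commit()
    tables = list_tables(demo_path)

print(tables)
```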

Memory Analysis

Importing Data

Open MindStudio Insight and click Import Data in the navigation tree on the left.
In the displayed File Explorer dialog box, select the .db file to be imported.
After the import is successful, the Leaks page is displayed.

Memory Analysis

Open the Leaks page and view the call stack flame graph and memory block lifetime graph.
Click and drag the mouse to box-select the step 2 area in the memory block lifetime graph, and release the mouse button to zoom in on the area.

As shown in Figure 1 Unreleased memory blocks, there is still an unreleased memory block when step 2 ends.

Figure 1 Unreleased memory blocks
The call stack flame graph shows that the memory block comes from a tensor object and is allocated before the forward propagation starts, as shown in Figure 2 Tensor object.

Figure 2 Tensor object
Cross-referencing the leaks_mem segment against the memory details disassembly diagram shows a clear increasing trend within the segment. In step 1, the leaks_mem segment first appears with a memory usage of 40 MB, as shown in Figure 3 Checking the memory usage in step 1.

Figure 3 Checking the memory usage in step 1

As shown in Figure 4 Checking the memory usage in step 2, the leaks_mem memory usage in step 2 increases from 40 MB to 80 MB.

Figure 4 Checking the memory usage in step 2

As shown in Figure 5 Checking the memory usage in step 3, the leaks_mem memory usage in step 3 increases from 80 MB to 120 MB.

Figure 5 Checking the memory usage in step 3
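The 40 MB, 80 MB, and 120 MB values follow directly from the tensor size in test.py. The following short calculation reproduces them. The constant and function names are illustrative only, and MB is taken to mean 1024 x 1024 bytes.

```python
BYTES_PER_FLOAT32 = 4
ELEMENTS_PER_STEP = 1024 * 1024 * 10  # size of the tensor appended in each step
MIB = 1024 * 1024


def leaked_mb_after(steps):
    """Cumulative size, in MB, of the tensors held by leak_mem_list after `steps` steps."""
    return steps * ELEMENTS_PER_STEP * BYTES_PER_FLOAT32 / MIB


growth = [leaked_mb_after(step) for step in (1, 2, 3)]
print(growth)  # [40.0, 80.0, 120.0]
```

A usage that grows by the same amount in every step, as it does here, is a typical sign that an allocation made in each step is never released.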