MindStudio Insight Serving Tuning

Overview

MindStudio Insight displays the end-to-end request execution in the timeline view, showing the duration of the request in each key phase and the status of the request. This helps users quickly identify service performance bottlenecks and adjust the profiling policy accordingly.

Preparations

Environment Preparation

Install MindStudio Insight first. For details, see MindStudio Insight Installation Guide.

Data preparation

Import profile data in the correct format. For details about the data, see Data Description. For details about how to import data, see Importing Data.

Data Description

MindStudio Insight allows users to import profile data files and displays related content in graphics. In the serving tuning scenario, two types of data can be imported: SQLite database file (profiler.db) of the visualization curve and JSON file (chrome_tracing.json) of the inference serving request trace data.

MindStudio Insight provides various import modes based on the file type. For details, see Table 1 Data import modes.

Table 1 Data import modes

File Name	Import Mode
chrome_tracing.json	Single file.
profiler.db	- Single file. - Batch import: To import the profiler.db files in multiple folders, you only need to select the parent directory.
DB files starting with ms_service_	Multiple DB files starting with ms_service_ in the same folder can be imported. These files represent multiple process files and a general DB file. You only need to select the folder.

Precautions

The profile data of system tuning and serving tuning can be imported at the same time. You need to place the data of the two scenarios in the same folder and select the folder when importing the data.
For details about how to obtain the data, see section "msServiceProfiler" > "Serving Tuning" > "Data Parsing" in the Profiling Tool Guide.

Timeline

Function Description

During serving tuning, MindStudio Insight displays the end-to-end request execution in the timeline view, showing the duration of the request in each key phase and the status of the request. By analyzing the timeline, you can quickly identify service performance bottlenecks and adjust the tuning policy based on the symptom.

You can check the duration and interval at each level in the timeline view to determine whether performance problems exist in the corresponding key phase.

GUI Description

GUI Display

The Timeline tab page consists of the toolbar (area 1), graphical display (area 2), and data pane (area 3), as shown in Figure 1 Timeline page.

Figure 1 Timeline page

Area 1: toolbar, which contains common shortcut keys. From left to right, the shortcut keys are Marker List, Filter (card or unit), Search, Flow Events, Reset (page restoration), Timeline Zoom Out, and Timeline Zoom In.
Area 2: graphical display. The profile data collected by service is displayed on the left. The first level is the process, and the second level is the key phase information of the request. Unit Information describes the unit information. The timeline view is displayed on the right line by line, including the execution sequence and duration of each key phase.
Area 3: data pane, which displays statistics or instruction details. If you select Slice Detail, the details of a single key phase are displayed. If you select Slice List, the key phase list information of the selected area in the unit is displayed.

Unit Information

Table 1 Unit information

Unit	Description
CPU Usage	Average CPU usage. This unit is displayed only when the host_system_usage_freq data collection function is enabled.
Memory Usage	System memory usage on the host. This unit is displayed only when the host_system_usage_freq data collection function is enabled.
NPU Usage	NPU memory usage. This unit is displayed only when the npu_memory_usage_freq data collection function is enabled.
KVCache	Usage of remaining KV cache over time.
BatchSchedule	Group batch time, in nanoseconds.
WAITING	Time when a request is in the WAITING state.
PENDING	Time when a request is in the PENDING state.
RUNNING	Time when a request is in the RUNNING state.
RUNNING2	Time when a request is in the RUNNING2 state.
SWAPPED	Time when a batch enters the SWAPPED state.
RECOMPUTE	Time when a request is in the RECOMPUTE state.
SUSPENDED	Time when a batch enters the SUSPENDED state.
END	Time when a request is in the END state.
END_PRE	Time when a request is in the END_PRE state.
STOP	Time when a batch enters the STOP state.
PREFILL_HOLD	Time when a batch is in the PREFILL_HOLD state.
http	HTTP request lifetime data, covering the receipt, encoding, and decoding of the request.
batchFrameworkProcessing	Batch data, including the batch creation time, current batch type (prefill or decode), request RID, and steps.
preprocessBatch	Time consumed for parameter injection to batches during IBIS data distribution, in nanoseconds.
SerializeExecuteMessage	Time consumed for serialization during IBIS data distribution, in nanoseconds.
setInferBuffer	Time consumed for buffer setting during IBIS data distribution, in nanoseconds.
grpcWriteToSlave	Time consumed for gRPC read during IBIS data distribution, in nanoseconds.
deserializeExecuteRequestsForInfer	Time consumed for deserialization during IBIS data distribution, in nanoseconds.
convertTensorBatchToBackend	Time consumed for request conversion during IBIS data distribution, in nanoseconds.
getInputMetadata	Time consumed for metadata obtaining during IBIS data distribution, in nanoseconds.
beforemodelExec	Processing time before model execution, in nanoseconds.
modelExec	Model execution data, in nanoseconds, including the execution time, current batch type (prefill or decode), request RID, and steps.
instanceExecute	Model instance execution time, in nanoseconds.
Queue	Time when the request is enqueued.
PDcommunication	PD disaggregation communication time, in nanoseconds. This unit exists only in the PD disaggregation scenario.
forward	Forward propagation time of model inference, in nanoseconds.
operatorExecute	Python-side model API execution time, in nanoseconds.
processPythonExecResult	Time consumed for response conversion, serialization, and writing to the shared memory during data receiving, in nanoseconds.
deserializeExecuteResponse	Time consumed for deserialization during data receiving, in nanoseconds.
saveoutAndContinueBatching	Time consumed for parsing responses as outputs during data receiving, in nanoseconds.
continueBatching	Time consumed for enqueuing requests during data receiving, in nanoseconds.
sendExecuteMessage	Time consumed for sending execution information, in nanoseconds.
postprocess	Postprocessing time of model inference, in nanoseconds.
preprocess	Preprocessing time of model inference, in nanoseconds.
processBroadcastMessage	Time consumed for broadcasting communication information, in nanoseconds.
sample	Sampling time, in nanoseconds.
PullKVCache	KV cache transfer time between PD nodes, in nanoseconds. This unit exists only in the PD disaggregation scenario.
CANN	Operator execution time, in nanoseconds. This unit is displayed only when the acl_task_time data collection function is enabled.
dpBatch	DP domain information corresponding to each request during model inference.
RequestState	Request status changes during model inference.

Usage Description

For details about how to use the Timeline tab page in the serving tuning scenario, see "Usage Description" in the MindStudio Insight System Tuning.

Slice Detail

When you select a key phase block, the details about the key phase are displayed on the Slice Detail tab page in the lower part. If res_list exists on the Slice Detail tab page, click any row in the rid list. The request details of the corresponding RID are displayed in the right pane of the Slice Detail tab page, as shown in Figure 1 Slice Detail. For details about the fields, see Table 1 Slice Detail fields.

Figure 1 Slice Detail

Table 1 Slice Detail fields

Chinese	Field	Description
Title	Title	Name.
Start	Start	Start time.
	Start(Raw Timestamp)	Original start time of data collection.
Duration	Wall Duration	Total duration.
	Args	Key phase parameters.

System View

On the System View tab page, when you select Stats System View, the Rank ID selection box and serving data are displayed. You can select the rank to be viewed from the Rank ID selection box.

The serving data includes the kvcache_usage, batch_info, request_data, and forward_info tab pages, as shown in Figure 2 System View.

When you select a serving data type, the corresponding details are displayed in the right area. For details about the fields, see Table 2 Servitization View fields. You can search for information by clicking next to the field name.

Figure 2 System View

Table 2 Servitization View fields

Chinese	Field	Description
kvcache_usage
rid	rid	Request ID.
name	name	Method that changes the graphics memory usage.
real_start_time_ms	real_start_time_ms	Time when the device memory usage changes, in milliseconds.
device_kvcache_left	device_kvcache_left	Number of left blocks in the graphics memory.
kvcache_usage_rate	kvcache_usage_rate	KV cache usage.
batch_info
name	name	Batch grouping or execution. batchFrameworkProcessing refers to batch grouping, while modelExec refers to batch execution.
res_list	res_list	Batch composition information.
start_time_ms	start_time_ms	Start time of batch grouping or batch execution, in milliseconds.
end_time_ms	end_time_ms	End time of batch grouping or batch execution, in milliseconds.
batch_size	batch_size	Number of requests in a batch.
batch_type	batch_type	Request status (`prefill` or `decode`) in a batch.
during_time_ms	during_time_ms	Execution time, in ms.
dp*_rid	dp*_rid	ID of the request contained in the DP domain. The asterisk (*) indicates the DP domain ID, and the value range is [0, n-1].
dp*_size	dp*_size	Batch size of the DP domain. The asterisk (*) indicates the DP domain ID, and the value range is [0, n-1].
dp*_forward_ms	dp*_forward_ms	The longest forward execution time in the DP domain, in milliseconds. The asterisk (*) indicates the DP domain ID, and the value range is [0, n-1].
request_data
http_rid	http_rid	HTTP request ID.
start_time_ms	start_time_ms	Request arrival time, in milliseconds.
recv_token_size	recv_token_size	Input token length of a request.
reply_token_size	reply_token_size	Output token length of a request.
execution_time_ms	execution_time_ms	End-to-end request duration, in ms.
queue_wait_time_ms	queue_wait_time_ms	The total waiting time of a request in the queue throughout the inference process includes both waiting and pending periods, measured in milliseconds.
first_token_latency	first_token_latency	Time to first token (TTFT), in milliseconds.
forward_info
name	name	Labels a forward event, indicating the model forward execution process.
relative_start_time(ms)	relative_start_time(ms)	Time elapsed since the initial forward on each device.
start_time(ms)	start_time(ms)	Forward start time.
end_time(ms)	end_time(ms)	Forward end time.
during_time(ms)	during_time(ms)	Execution duration of a forward event, in ms.
bubble_time(ms)	bubble_time(ms)	Bubble time between forward events, in ms.
batch_size	batch_size	Number of requests processed in a forward event.
batch_type	batch_type	Request status in a forward event.
forward_iter	forward_iter	Forward iteration number on each device.
dp_rank	dp_rank	DP information of the forward. The values for the same DP domain are the same.
prof_id	prof_id	Identifies different devices. This value is the same for the same device.
hostname	hostname	Identifies different hosts. This value is the same for the same host.

Generating Line Charts by Blocks

The duration and bubble line charts of blocks are available in the serving tuning scenario, facilitating fault analysis.

On the Timeline tab page, right-click a block in any unit and choose Generate Duration Line Chart By Block or Generate Bubble Line Chart By Block from the shortcut menu. The Curve tab page is displayed, showing the curve (duration and average duration) and data details of the unit where the block is located, as shown in Figure 3 Generating a curve by block.

Figure 3 Generating a curve by block

If you spot an anomaly in the curve, zoom into that area and click on the anomaly. Check the related information in the data details table below the curve. Right-click the data row and choose Find in Timeline from the shortcut menu. The Timeline page is displayed, as shown in Figure 4 Find in Timeline.

Figure 4 Find in Timeline

Curve

Function Description

Data changes can be displayed in curves and data details tables, facilitating analysis. The Curve tab page is displayed only when the profiler.db file is imported.

GUI Description

The Curve tab page consists of the parameter configuration area (area 1), curve data (area 2), and table data details (area 3), as shown in Figure 1 Curve page.

Figure 1 Curve page Figure 1 Curve page

Area 1: parameter configuration area, including the card ID and grouping mode.
Area 2: curve chart, showing data changes.
Area 3: table data details, showing the detailed data of the SQLite database. The table supports sorting and pagination. You can click the table header of each column to display data in ascending, descending, or default order.

Usage Description

Zooming In and Out on a Curve

MindStudio Insight allows you to left-click to drag select and zoom in on the selected part of the curve and right-click to zoom out on the curve. To improve the display performance, most points are hidden in the curve when the data volume is large. You can select a fine area to display all points or right-click the selected part to restore the original display effect.

In the curve, click and drag the mouse to the end point to be zoomed in and release the mouse. The selected region is zoomed in. If some points are still hidden, repeat the zoom-in operation to display the hidden points. Figure 1 Selected zoom-in region shows the selected zoom-in region.

You can click to dim a legend on the top to hide the curve. You can also click the dimmed legend to show the curve.

Figure 1 Selected zoom-in region

NOTE

Click in the upper right corner of the curve. If the button is dimmed, the curve is locked and cannot be zoomed in by clicking and dragging the mouse. You can click the button again or right-click the curve to restore the chart. The zoom-in function is enabled by default.
You can click in the upper right corner of the curve to cancel the last zoom-in operation.
Click in the upper right corner of the curve. The curve is restored to the initial state.