NPU High-Performance Programming Guide
Combining Grid Cores
I. Principles for Automatically Combining Grid Cores
Some scenarios requiring migration of Triton operators from GPUs to NPUs. Due to architectural differences, the Triton operators developed on GPUs often utilize large grid core counts. When executed on NPUs, these operators cannot be scheduled all at once. Delivering them in batches introduces significant latency and degrades performance. To optimize NPU-based Triton operators, you need to check the grid core counts first. In cases with large grid core counts, set the environment variable TRITON_ALL_BLOCKS_PARALLEL to improve operator execution performance.
Optimizing Instruction Parallelism
I. Core Principles of Instruction Parallelism Optimization
When executing Triton operators, NPUs leverage parallel mechanisms such as multi-buffer and instruction parallelism to parallelize data-in, computation, and data-out, thereby enhancing performance. However, in certain scenarios, the multi-buffer mechanism cannot be enabled, which reduces the degree of parallelism (DOP) and degrades operator execution performance. If this issue occurs during performance optimization, consider the following aspects and implement optimizations based on the provided code examples:\
- Data transfer and computation involve dependencies, which introduce synchronization. The memory transfer engine (MTE) can only be triggered after vector computation completes, resulting in low DOP.\
- In cases where the operator lacks multiple data loads or a single execution completes without tiling, the multi-buffer mechanism cannot be enabled.\
- The multi-buffer mechanism requires additional UB space. If the UB space is insufficient during computation, the multi-buffer mechanism cannot be enabled.
II. Code Examples
-
Example 1: Reducing synchronization for higher DOP
In operator optimization, increasing instruction-level parallelism (DOP) is a critical strategy. In the
tl.loadstatement below, whenN>M, the loaded data fills only a portion of the tensor memory space pointed to bydata. For the remaining unfilled portion, if users do not specify theothervalue, GPUs default to zero-padding. To reduce the adaptation workload of migration, NPUs maintain the same behavior as GPUs. NPUs first use the vector core to set all the memory space pointed to bydatato a specified value (defaulting to0if noothervalue is provided). Subsequently, the MTE2 instruction transfers data to part of the memory space pointed to bydata. This implementation results in a dependency between MTE2 and vector operations, which limits parallelism and degrades overall performance.@triton.jit def npu_vector_add_kernel( input, # [Tensor] input tensor (1 x col) output, # [Tensor] output tensor (1 x col) M: tl.constexpr, # len of the vector BLOCK_SIZE: tl.constexpr ): N :tl.constexpr = BLOCK_SIZE idx = tl.arange(0, N) mask = idx < M data = tl.load(input + idx, mask = mask) # Alternatively, specify a value such as other=-1.To increase DOP and enhance performance, when the loaded data fills only a portion of the memory space pointed to by
data, addcare_padding=Falseto the load statement to remove default-value padding, provided that the unfilled portion does not affect subsequent computation results. That is, the preceding operator can be optimized follows:@triton.jit def npu_vector_add_kernel( input, # [Tensor] input tensor (1 x col) output, # [Tensor] output tensor (1 x col) M: tl.constexpr, # len of the vector BLOCK_SIZE: tl.constexpr ): idx = tl.arange(0, N) mask = idx < M - data = tl.load(input + idx, mask = mask) # Alternatively, specify a value such as other=-1. + data = tl.load(input + idx, mask = mask, care_padding=False) # Alternatively, specify a value such as other=-1. -
Example 2: Using
forloops in Triton operators to increase tiling and enhance DOPIn Triton operator programming,
maskoperations are frequently employed in syntax such asload,store, andwhere. During performance optimization, you should prioritize identifying performance degradation caused by these operations. When the logic within Triton operators executes sequentially in a single pass (Start -> Data-in -> Computation -> Data-out -> End), instructions cannot be parallelized, resulting in low execution efficiency. By introducingforloops to increase tiling, you can process data in multiple passes (with each pass handling a reduced data volume), enabling parallel execution of data-in, computation, and data-out. This approach reduces serial waiting time and improves overall performance. Additionally, compared to monolithic (non-tiled) data processing, the use offorloops for tiling reduces UB space consumption. Note: Mathematical equivalence is an important aspect to consider when you increase data tiling.@triton.jit def alloc_extend_kernel( pre_lens_ptr, seq_lens_ptr, free_page_ptr, out_indices, bs_upper: tl.constexpr, page_size: tl.constexpr, max_num_extend_tokens: tl.constexpr, + BLOCK_SIZE: tl.constexpr = 1024, ): pid = tl.program_id(0) load_offset = tl.arange(0, bs_upper) seq_lens = tl.load(seq_lens_ptr + load_offset, mask=load_offset <= pid) pre_lens = tl.load(pre_lens_ptr + load_offset, mask=load_offset <= pid) extend_lens = seq_lens - pre_lens seq_len = tl.load(seq_lens_ptr + pid) pre_len = tl.load(pre_lens_ptr + pid) extend_len = seq_len - pre_len sum_extend_lens = tl.sum(extend_lens) output_start_loc = sum_extend_lens - extend_len num_pages_after = (seq_lens + page_size - 1) // page_size num_pages_before = (pre_lens + page_size - 1) // page_size num_new_pages = num_pages_after - num_pages_before num_page_start_loc_self = (seq_len + page_size - 1) // page_size - ( pre_len + page_size - 1 ) // page_size sum_num_new_pages = tl.sum(num_new_pages) new_page_start_loc = sum_num_new_pages - num_page_start_loc_self # Part 2: fill the new full pages num_part2 = ( seq_len // page_size * page_size - (pre_len + page_size - 1) // page_size * page_size ) - # load data at once - offset_many_page = tl.arange(0, max_num_extend_tokens) - page_start = tl.load( - free_page_ptr + new_page_start_loc + offset_many_page // page_size, - mask=offset_many_page < num_part2, - ) - tl.store( - out_indices + output_start_loc + offset_many_page, - page_start * page_size + offset_many_page % page_size, - mask=offset_many_page < num_part2, - ) + # load data using loop + num_loop = tl.cdiv(max_num_extend_tokens, BLOCK_SIZE) + blk_offset = tl.arange(0, BLOCK_SIZE) + for i in range(num_loop): + offset_many_page = blk_offset + i * BLOCK_SIZE + page_start = tl.load( + free_page_ptr + new_page_start_loc + offset_many_page // page_size, + mask=offset_many_page < num_part2, + ) + tl.store( + out_indices + output_start_loc + offset_many_page, + page_start * page_size + offset_many_page % page_size, + mask=offset_many_page < num_part2, + )
Optimizing Data Types
I. Core Principles of Data Type Optimization
Some operations of the A2/A3 vector units do not support certain data types. In this case, the corresponding vector operations will degrade to scalar operations, affecting performance. If the overall operator accuracy is not affected, it is advisable to use supported data types to improve performance. The following operations are involved.
| Operator Name | Unsupported Data Type |
|---|---|
| Vector Add | int64 |
| Vector Cmp | int64/int32 |
II. Code Examples
-
Example code of the Triton operator Vector Add
For the following Triton operator, when the input tensors
xandyutilize the int64 data type,x1 + y1is expanded into a scalar operation, which degrades performance. Provided that computational accuracy remains unaffected, it is advisable to use the int32 data type.@triton.jit def npu_vector_add_kernel( x, # [Tensor] input tensor (1 x col) y, # [Tensor] input tensor (1 x col) z, # [Tensor] output tensor (1 x col) vector_len: tl.constexpr, # len of the vector BLOCK_SIZE: tl.constexpr ): pid = tl.program_id(axis=0) offset = pid * BLOCK_SIZE + tl.arange(BLOCK_SIZE) len_mask = offset < vector_len x1 = tl.load(x + offset, mask=len_mask) y1 = tl.load(y + offset, mask=len_mask) z1 = x1 + y1 tl.store(z + offset, z1, mask=len_mask) -
Example code of the Triton operator Vector Cmp
In the following Triton operator, the
maskoperation utilizes Cmp. However, Cmp does not support the int64 or int32 data type, causing the conditioncols < Nto be expanded into a scalar operation, which reduces performance. Provided that computational accuracy remains unaffected, it is advisable to use the FP32 data type. In Triton operator programming,maskoperations are frequently employed in syntax such asload,store, andwhere. During performance optimization, you should prioritize identifying performance degradation caused by these operations.@triton.jit def npu_vector_cmp_kernel( X, # [Tensor] input tensor (row x col) Out, # [Tensor] output tensor (row x col) Mean, # [Vector] mean tensor (row, ) of X Rstd, # [Vector] std tensor (row, ) of X stride_x_row, # [Scalar] stride of row of x stride_out_row, # [Scalar] stride of row of out, normally equals to stride_x_row M, # [Scalar] row number N, # [Scalar] col number eps, # [Scalar] epsilon to avoid division by zeros BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr ): group_m = tl.program_id(0) group_n = tl.program_id(1) row = group_m # calculate index & offset Mean = Mean + group_n * M Rstd = Rstd + group_n * M X = X + row * stride_x_row + group_n * N Out = Out + row * stride_out_row + group_n * N cols = tl.arange(0, BLOCK_N) # cols is int64 x = tl.load(X + cols, mask=cols < N, other=0.0).to(tl.float32) # calculate mean & rstd mean = tl.sum(x, axis=0) / N tl.store(Mean + row, mean) # [Changed begin] - xbar = tl.where(cols < N, X - mean, 0.0) + cols_cmp = cols.to(tl.float32) + xbar = tl.where(cols_cmp < N, x - mean, 0.0) # [Changed end] var = tl.sum(xbar * xbar, axis=0) / N rstd = 1 / tl.sqrt(var + eps) tl.store(Rstd + row, rstd) # calculate Out mask = cols < N out = (x - mean) * rstd tl.store(Out + cols, out, mask=mask)