Vector Operator Development
Vector operators are mainly executed by Vector Cores. Typical examples include element-wise computation, row-wise reduction, type conversion, gather/scatter, masked update, and small fused operators without tl.dot. The key is not to create as many grid programs as possible, but to keep the launch close to the number of physical Vector Cores and let each program process multiple tiles in an inner loop.
Simple Vector Operator Development
For a simple Vector operator, start with the Vector Addition example or third_party/ascend/tutorials/01-vector-add.py. The basic pattern is:
- Build contiguous offsets for the current tile with
tl.arange. - Use a tail mask to guard load/store.
- Compute the element-wise expression and store the result.
- If the grid is much larger than the physical core count, set the grid to
num_vectorcoreand process tiles withrange(pid, num_blocks, num_core)inside the kernel.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
pid = tl.program_id(0)
num_core = tl.num_programs(0)
num_blocks = tl.cdiv(n_elements, BLOCK_SIZE)
for block_idx in range(pid, num_blocks, num_core):
offsets = block_idx * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements
x = tl.load(x_ptr + offsets, mask=mask)
y = tl.load(y_ptr + offsets, mask=mask)
tl.store(out_ptr + offsets, x + y, mask=mask)
Check these items first:
- Data type: Ascend Vector units have different performance for integer types. Prefer
int32for indices, lengths, and offsets when precision allows. Seetriton-ascend-ops/tutorial/basic/001-vector_add.zh.mdand002-vector_cmp.zh.md. - BLOCK_SIZE: Keep it as large as possible without exceeding UB capacity. If UB overflow occurs, reduce the tile size or split it into sub-blocks.
- Core count: GPU-style small tiles with very large grids often cause repeated dispatch overhead on NPUs.
Complex Vector Operator Development
Complex Vector operators usually combine irregular memory access, token reordering, multiple outputs, or long hidden dimensions. Useful references in Ascend/triton-ascend-ops include:
tutorial/best_practice/004-gather_scatter.pytutorial/best_practice/005-binned_gather_scatter.pytutorial/best_practice/006-padded_gather_scatter.py
Use this structure:
- Split outer tasks by physical Vector Core count.
- Split the hidden dimension by UB capacity with
BLOCK_X. - Use
SUB_BLOCK_SIZEto batch small irregular tasks. - Use
tl.insert_sliceto assemble UB-local blocks andtl.extract_sliceto scatter sub-blocks. - Keep index masks, column masks, and expert/bin boundary masks separate, and combine them only at load/store sites.
Typical UB budgeting:
num_core = get_npu_properties()["num_vectorcore"]
block_size = triton.cdiv(indices_length, num_core)
block_x = round_up(min(num_columns, max_block_x), 16)
sub_block_size = max((ub_budget - block_x * element_bytes) //
(block_x * element_bytes + index_bytes), 1)
When performance is poor, first check whether the grid is much larger than the physical Vector Core count, whether irregular GM access can be converted into "bulk load to UB and select in UB", whether the tail axis is 32B aligned, and whether BLOCK_X or SUB_BLOCK_SIZE causes UB overflow or too-small transfer granularity.