super_kernel Use Case Demonstration

Use Case Function

The model contains six super_kernel (sk) fragments. Some fragments are identical and reuse the compilation cache; others add the optional bias input or use different option configurations, which requires online compilation.

Compile the model both with and without super_kernel, and output performance data for each.

Operator Fusion

Wrap the operators in the following with statement (super_kernel). All operators inside the block are fused into a single super kernel for computation:

with torchair.scope.super_kernel("sk1"): 

For a detailed description of this feature, see "Mark SuperKernel Scope in Graph".

Execution Command

python3 superkernel_compare.py

Expected Execution Result

After execution, the following message is printed on success:

execute sample success

A prof_result folder is generated in the execution directory with the following structure. Once the data is collected, compare the time consumption of the two runs:

prof_result
├── sk_model                             # with superkernel result
│  ├── localhost.localdomain_ascend_pt   
│     ├── PROF_*                         
│        ├── mindstudio_profiler_output   
│           ├── op_statistic.csv         # profiling data
├── no_sk_model                          # without superkernel result
│  ├── localhost.localdomain_ascend_pt   
│     ├── PROF_*                         
│        ├── mindstudio_profiler_output   
│           ├── op_statistic.csv         # profiling data

Extract the following data from each of the two op_statistic.csv files:

no_sk_model (without super_kernel):

OP_Type                  Core Type        Total Time(us)
GroupedMatmul            MIX_AIC          126.26
Transpose                AI_VECTOR_CORE   90.02
MoeGatingTopK            AI_VECTOR_CORE   68.32
Tile                     AI_VECTOR_CORE   24.96
DequantSwigluQuant       AI_VECTOR_CORE   24.18
ReduceMeanD              MIX_AIV          16.36
ConcatV2D                AI_VECTOR_CORE   14.9
ReduceMeanD              AI_VECTOR_CORE   14.14
SplitVD                  AI_VECTOR_CORE   10.04
MatMul                   AI_CORE          6.26
Cast                     AI_VECTOR_CORE   3.96
Data                     AI_VECTOR_CORE   3.3
StridedSliceD            AI_VECTOR_CORE   3.18
AutomaticBufferFusionOp  AI_VECTOR_CORE   1.66
no_sk_model Total Time                    407.54

sk_model (with super_kernel):

OP_Type                  Core Type        Total Time(us)
SuperKernel              MIX_AIC          172.4
Transpose                AI_VECTOR_CORE   92.42
Tile                     AI_VECTOR_CORE   24.66
SuperKernel              MIX_AIV          18.48
ReduceMeanD              MIX_AIV          16.34
ConcatV2D                AI_VECTOR_CORE   14.74
ReduceMeanD              AI_VECTOR_CORE   14.24
SplitVD                  AI_VECTOR_CORE   10.12
MatMul                   AI_CORE          8.6
Cast                     AI_VECTOR_CORE   4.08
Data                     AI_VECTOR_CORE   3.72
StridedSliceD            AI_VECTOR_CORE   3.1
AutomaticBufferFusionOp  AI_VECTOR_CORE   1.76
sk_model Total Time                       384.66
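
The per-model totals can also be computed by a short script instead of by hand. The following is a minimal sketch, not part of the sample itself: the helper names find_op_statistic and total_time_us are hypothetical, the host-specific directory level (localhost.localdomain_ascend_pt in the tree above) is matched with a wildcard, and the CSV is assumed to carry a column named exactly "Total Time(us)". Adjust the column name if your profiler output differs.

```python
import csv
import glob
import os


def find_op_statistic(model_dir):
    """Return op_statistic.csv paths found under one model's result folder.

    The first wildcard stands for the host-specific directory
    (an assumption; e.g. localhost.localdomain_ascend_pt).
    """
    pattern = os.path.join(
        model_dir, "*", "PROF_*", "mindstudio_profiler_output", "op_statistic.csv"
    )
    return sorted(glob.glob(pattern))


def total_time_us(csv_path, column="Total Time(us)"):
    """Sum one numeric column over all rows of a CSV file."""
    with open(csv_path, newline="") as f:
        return sum(float(row[column]) for row in csv.DictReader(f))


if __name__ == "__main__":
    # Print one total per profiling file found for each model variant.
    for name in ("sk_model", "no_sk_model"):
        for path in find_op_statistic(os.path.join("prof_result", name)):
            print(name, round(total_time_us(path), 2))
```

Run the script from the execution directory so that the relative path prof_result resolves.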

Comparing the two totals, operator fusion with super_kernel reduces total time by (407.54 - 384.66) / 407.54 ≈ 5.61%.
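
As a quick check of this figure, the benefit can be computed as the relative reduction in total time against the run without super_kernel. This is a minimal sketch using the two totals reported above; the function name fusion_benefit_percent is hypothetical.

```python
def fusion_benefit_percent(baseline_us, fused_us):
    """Relative reduction in total time, as a percentage of the baseline."""
    return (baseline_us - fused_us) / baseline_us * 100.0


no_sk_total = 407.54  # total time without super_kernel, in microseconds
sk_total = 384.66     # total time with super_kernel, in microseconds

print(f"{fusion_benefit_percent(no_sk_total, sk_total):.2f}%")  # prints 5.61%
```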