super_kernel Use Case Demonstration

Use Case Function

Model contains 6 sk fragments. Some sk are identical and reuse cache, some add optional input bias or different option configurations requiring online compilation.

Compile models using super_kernel and not using super_kernel, output performance data.

Operator Fusion

Use the following with statement block (super_kernel). Operators within the statement block are all fused into one super kernel for computation:

with torchair.scope.super_kernel("sk1"):

For detailed function introduction, see Mark SuperKernel Scope in Graph.

Execution Command

python3 superkernel_compare.py

Expected Execution Result

After execution, print shows success:

execute sample success

A prof_result folder is generated in execution directory with the following structure. After obtaining data, compare time consumption:

prof_result
├── sk_model                             # with superkernel result
│  ├── localhost.localdomain_ascend_pt   
│     ├── PROF_*                         
│        ├── mindstudio_profiler_output   
│           ├── op_statistic.csv         # profiling data
├── no_sk_model                          # without superkernel result
│  ├── localhost.localdomain_ascend_pt   
│     ├── PROF_*                         
│        ├── mindstudio_profiler_output   
│           ├── op_statistic.csv         # profiling data

Extract the following data from both op_statistic.csv files respectively:

OP_Type	Core Type	Total Time(us)
GroupedMatmul	MIX_AIC	126.26
Transpose	AI_VECTOR_CORE	90.02
MoeGatingTopK	AI_VECTOR_CORE	68.32
Tile	AI_VECTOR_CORE	24.96
DequantSwigluQuant	AI_VECTOR_CORE	24.18
ReduceMeanD	MIX_AIV	16.36
ConcatV2D	AI_VECTOR_CORE	14.9
ReduceMeanD	AI_VECTOR_CORE	14.14
SplitVD	AI_VECTOR_CORE	10.04
MatMul	AI_CORE	6.26
Cast	AI_VECTOR_CORE	3.96
Data	AI_VECTOR_CORE	3.3
StridedSliceD	AI_VECTOR_CORE	3.18
AutomaticBufferFusionOp	AI_VECTOR_CORE	1.66
no_sk_model	Total Time	407.54

OP_Type	Core Type	Total Time(us)
SuperKernel	MIX_AIC	172.4
Transpose	AI_VECTOR_CORE	92.42
Tile	AI_VECTOR_CORE	24.66
SuperKernel	MIX_AIV	18.48
ReduceMeanD	MIX_AIV	16.34
ConcatV2D	AI_VECTOR_CORE	14.74
ReduceMeanD	AI_VECTOR_CORE	14.24
SplitVD	AI_VECTOR_CORE	10.12
MatMul	AI_CORE	8.6
Cast	AI_VECTOR_CORE	4.08
Data	AI_VECTOR_CORE	3.72
StridedSliceD	AI_VECTOR_CORE	3.1
AutomaticBufferFusionOp	AI_VECTOR_CORE	1.76
sk_model	Total Time	384.66

Compare to obtain that using super_kernel operator fusion benefit is 5.61%