
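super_kernel Sample Demo

Sample Overview

The model contains six SuperKernel (sk) segments. Some segments are identical and reuse the compilation cache; others take an additional optional bias input or use different option settings, and are therefore compiled online.

The sample compiles the model both with and without super_kernel and outputs the performance data for each.

Fusing Operators

Use the following with block (super_kernel). All operators inside the block are fused into a single SuperKernel for execution.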

with torchair.scope.super_kernel("sk1"):

For a detailed description of this feature, see "Marking the SuperKernel Scope Within a Graph" (图内标定SuperKernel范围).

Run Command

python3 superkernel_compare.py

Expected Result

After the script runs, it prints a success message:

execute sample success

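A prof_result folder is generated in the execution directory, with the structure shown below. Collect the data from it and compare the elapsed times.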

prof_result
├── sk_model                             # results with SuperKernel
│  ├── localhost.localdomain_ascend_pt
│     ├── PROF_*
│        ├── mindstudio_profiler_output
│           ├── op_statistic.csv         # profiling data
├── no_sk_model                          # results without SuperKernel
│  ├── localhost.localdomain_ascend_pt
│     ├── PROF_*
│        ├── mindstudio_profiler_output
│           ├── op_statistic.csv         # profiling data
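
The directory layout above can be searched programmatically rather than by hand. The following is a minimal sketch, not part of the sample itself: the helper name `find_op_statistic` is hypothetical, and the glob pattern simply mirrors the tree shown above (the trailing `*` in `op_statistic*.csv` is an assumption that tolerates a suffix in the file name).

```python
from pathlib import Path


def find_op_statistic(root, model_dir):
    """Return the op_statistic CSV files found under root/model_dir.

    Hypothetical helper: the pattern follows the tree shown above
    (<host>_ascend_pt / PROF_* / mindstudio_profiler_output).
    """
    pattern = f"{model_dir}/*/PROF_*/mindstudio_profiler_output/op_statistic*.csv"
    return sorted(Path(root).glob(pattern))
```

Calling `find_op_statistic("prof_result", "sk_model")` and `find_op_statistic("prof_result", "no_sk_model")` from the execution directory would then return the two files to compare.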

The following data is taken from the two op_statistic.csv files, respectively.

OP_Type	Core Type	Total Time(us)
GroupedMatmul	MIX_AIC	126.26
Transpose	AI_VECTOR_CORE	90.02
MoeGatingTopK	AI_VECTOR_CORE	68.32
Tile	AI_VECTOR_CORE	24.96
DequantSwigluQuant	AI_VECTOR_CORE	24.18
ReduceMeanD	MIX_AIV	16.36
ConcatV2D	AI_VECTOR_CORE	14.9
ReduceMeanD	AI_VECTOR_CORE	14.14
SplitVD	AI_VECTOR_CORE	10.04
MatMul	AI_CORE	6.26
Cast	AI_VECTOR_CORE	3.96
Data	AI_VECTOR_CORE	3.3
StridedSliceD	AI_VECTOR_CORE	3.18
AutomaticBufferFusionOp	AI_VECTOR_CORE	1.66
no_sk_model	Total time	407.54

OP_Type	Core Type	Total Time(us)
SuperKernel	MIX_AIC	172.4
Transpose	AI_VECTOR_CORE	92.42
Tile	AI_VECTOR_CORE	24.66
SuperKernel	MIX_AIV	18.48
ReduceMeanD	MIX_AIV	16.34
ConcatV2D	AI_VECTOR_CORE	14.74
ReduceMeanD	AI_VECTOR_CORE	14.24
SplitVD	AI_VECTOR_CORE	10.12
MatMul	AI_CORE	8.6
Cast	AI_VECTOR_CORE	4.08
Data	AI_VECTOR_CORE	3.72
StridedSliceD	AI_VECTOR_CORE	3.1
AutomaticBufferFusionOp	AI_VECTOR_CORE	1.76
sk_model	Total time	384.66
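
The two tables can also be compared by operator type. The sketch below hard-codes the OP_Type values from the tables above and takes their set difference. The reading that the operators missing from the sk_model table are the ones absorbed into SuperKernel is an inference from these tables, not something the profiling output states directly.

```python
# OP_Type values copied from the two tables above.
no_sk_ops = {
    "GroupedMatmul", "Transpose", "MoeGatingTopK", "Tile",
    "DequantSwigluQuant", "ReduceMeanD", "ConcatV2D", "SplitVD",
    "MatMul", "Cast", "Data", "StridedSliceD", "AutomaticBufferFusionOp",
}
sk_ops = {
    "SuperKernel", "Transpose", "Tile", "ReduceMeanD", "ConcatV2D",
    "SplitVD", "MatMul", "Cast", "Data", "StridedSliceD",
    "AutomaticBufferFusionOp",
}

# Operator types that appear in only one of the two runs.
only_without_sk = sorted(no_sk_ops - sk_ops)
only_with_sk = sorted(sk_ops - no_sk_ops)

print(only_without_sk)  # ['DequantSwigluQuant', 'GroupedMatmul', 'MoeGatingTopK']
print(only_with_sk)     # ['SuperKernel']
```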

Comparing the two totals, fusing operators with super_kernel yields a gain of 5.61%: (407.54 - 384.66) / 407.54 ≈ 5.61%.
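
The gain can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than part of the sample: the helper names `total_time_us` and `gain_percent` are hypothetical, the column name `Total Time(us)` is taken from the tables above, and the CSV is assumed to be comma-separated with a header row.

```python
import csv


def total_time_us(csv_path, column="Total Time(us)"):
    """Sum one numeric column of a profiling CSV (hypothetical helper).

    Assumes a comma-separated file whose header contains `column`.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(float(row[column]) for row in csv.DictReader(f))


def gain_percent(baseline_us, fused_us):
    """Time saved by the fused model, as a percentage of the baseline."""
    return (baseline_us - fused_us) / baseline_us * 100.0


# Totals taken from the two tables above.
print(f"{gain_percent(407.54, 384.66):.2f}%")  # prints 5.61%
```

With real profiling output, `total_time_us` would be applied to each op_statistic.csv and the two sums passed to `gain_percent`, with the no_sk_model total as the baseline.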