ops-math/conversion/clip_by_value_v2/op_kernel_aicpu · CANN/ops-math - AtomGit

cann-robot: optimize the performance of aicpu ClipByValueV2 operator

File	Last commit	Last updated
clip_by_value_v2_aicpu.cpp	optimize the performance of aicpu ClipByValueV2 operator

Co-authored-by: ZhaiPeiChao<zhaipeichao@huawei.com>

# message auto-generated for no-merge-commit merge:
!2460 merge clip_by_value_v2 into master

optimize the performance of aicpu ClipByValueV2 operator

Created-by: ZhaiPeiChao
Commit-by: ZhaiPeiChao
Merged-by: cann-robot
Description:

## Description

This PR optimizes the performance of the AICPU ClipByValueV2 operator by refactoring the core compute logic, tuning the parallel scheduling strategy, and separating the scalar-bound and element-wise-bound code paths. Measured speedups are listed under Testing.

### Main changes

1. Core compute refactor: the previous function-pointer call pattern is replaced by dedicated inline functions (ClampScalar, ClampComplex). `if constexpr` and type traits select the best path at compile time, which removes the runtime overhead of the indirect calls.
2. Parallel scheduling: the parallel thresholds change from element counts (256K/32K) to byte sizes (200KB/40KB). A unified DispatchParallel scheduling function is added, and the serial fallback for small inputs is improved.
3. Path separation: scalar bounds and element-wise bounds are handled as two distinct cases, each with dedicated kernel functions (KernelScalarBound, KernelElemBound, and others). The scalar case preloads the bound values into registers.
4. Compiler hints: `__attribute__((always_inline))`, `__attribute__((hot))`, `__restrict__`, and `__builtin_expect` are used extensively to improve instruction-level parallelism and branch prediction.
5. Type dispatch: the std::map function-pointer dispatch is replaced by a switch-case structure, which lets the compiler optimize the dispatch and gives a more compact code layout.

## Related Issue

issue [#1380](https://gitcode.com/cann/ops-math/issues/1380)

## Testing

- Frequently used fp32 scalar-bounds, medium size (64K): T=4 7.5× / T=6 5.7×;
- Integer elem-bounds (1M int32): T=6 1.37×;
- int8 scalar-bounds: T=1 1.29× / T=4 1.22×;
- Single thread in general: 1.1–1.4× (vectorization + branch hoisting);
- Large tensors where DRAM bandwidth is saturated: 0.93–0.97× (within noise).

### T=1 (single thread; reflects pure algorithm / memory access / vectorization gains)

| case | old(ms) | new(ms) | speedup | old(GB/s) | new(GB/s) | equal |
|---|---:|---:|---:|---:|---:|:---:|
| fp32 small 1K scalar-bounds | 0.0130 | 0.0110 | 1.18× | 0.60 | 0.71 | Y |
| fp32 small 1K elem-bounds | 0.0120 | 0.0130 | 0.92× | 1.30 | 1.20 | Y |
| fp32 64K scalar-bounds | 0.4419 | 0.3173 | 1.39× | 1.19 | 1.65 | Y |
| fp32 256K elem-bounds | 2.1340 | 2.2015 | 0.97× | 1.96 | 1.91 | Y |
| int32 256K scalar-bounds | 0.9320 | 0.8534 | 1.09× | 2.25 | 2.46 | Y |
| int32 1M elem-bounds | 8.7056 | 7.8415 | 1.11× | 1.93 | 2.14 | Y |
| int8 1M scalar-bounds | 2.2670 | 1.7555 | 1.29× | 0.92 | 1.19 | Y |
| fp32 16M elem-bounds | 162.6 | 175.2 | 0.93× | 1.65 | 1.53 | Y |

### T=4

| case | old(ms) | new(ms) | speedup | new(GB/s) | equal |
|---|---:|---:|---:|---:|:---:|
| fp32 64K scalar-bounds | 0.4513 | 0.0600 | 7.52× | 27.3 | Y |
| fp32 256K elem-bounds | 2.1500 | 1.6860 | 1.28× | 2.49 | Y |
| int32 1M elem-bounds | 8.6700 | 8.4860 | 1.02× | 1.98 | Y |
| int8 1M scalar-bounds | 2.2500 | 1.8450 | 1.22× | 1.13 | Y |

### T=6

| case | old(ms) | new(ms) | speedup | new(GB/s) | equal |
|---|---:|---:|---:|---:|:---:|
| fp32 64K scalar-bounds | 0.4450 | 0.0776 | 5.73× | 21.1 | Y |
| int32 1M elem-bounds | 8.6820 | 6.3410 | 1.37× | 2.65 | Y |
| int8 1M scalar-bounds | 2.2700 | 1.9800 | 1.15× | 1.06 | Y |

## Documentation updates

Not applicable.

## Type labels

- [ ] Bug fix
- [ ] New feature
- [x] Performance optimization
- [ ] Documentation update
- [ ] Other, please describe:

See merge request: cann/ops-math!2460	25 days ago
clip_by_value_v2_aicpu.h	optimize the performance of aicpu ClipByValueV2 operator (same commit and description as clip_by_value_v2_aicpu.cpp above; see merge request: cann/ops-math!2460)	25 days ago
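clip_by_value_v2_aicpu_def.cpp	feat: migrate math+conversion json ops to opdef pipeline

Co-authored-by: zhaowenrui666<zhaowenrui7@huawei.com>

# message auto-generated for no-merge-commit merge:
!2130 merge mathdef into master

feat: migrate math+conversion json ops to opdef pipeline

Created-by: zhaowenrui666
Commit-by: zhaowenrui666
Merged-by: cann-robot
Description:

## Description

The AICPU operator info library in the math repository is migrated from merged JSON to op def files, with the registration method and build integration updated accordingly.

## Related Issue

https://gitcode.com/cann/ops-math/issues/1271

## Testing

Acceptance used the custom path with ENABLE_CUSTOM=ON. The chain OpDef -> libaicpu_ops.so -> op_build --aicpu -> aicpu_kernel.ini -> parser_ini.py -> cust_aicpu_kernel.json was verified end to end, and the resulting artifacts were generated and validated.

The AICPU before/after and I/O comparison was re-evaluated under the new rules:

- input/output -> input0/output0 is treated as a normalization difference;
- RightShift, SearchSorted, and Sinh lacked I/O in the pre-migration JSON but now match the CANN baseline aicpu_kernel.ini, so they are not counted as missing;
- AddN: input0 -> dynamic_input0 and AddFake -> Add were also confirmed as accepted and are not blockers.

The AICore regression was also run; before/after show no differences after conversion to JSON by parser_ini.py. Overall conclusion: this commit meets the current OpDef migration acceptance criteria and can be merged.

## Documentation updates

## Type labels

- [ ] Bug fix
- [x] New feature
- [ ] Performance optimization
- [ ] Documentation update
- [ ] Other, please describe:

See merge request: cann/ops-math!2130	1 month ago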
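
The ClipByValueV2 commit description above names several techniques (compile-time path selection with `if constexpr`, separate scalar-bound and element-wise-bound kernels, a byte-based parallel threshold, and switch-case type dispatch) without showing code. The following is a minimal sketch of how those pieces might fit together, not the operator's actual implementation. The function names ClampScalar, KernelScalarBound, KernelElemBound, and DispatchParallel are taken from the description, but their signatures and bodies here are assumptions. `DType`, `ShardCount`, `RunScalar`, `ClipScalarBounds`, and `kParallelMinBytes` are hypothetical. The description lists two byte thresholds (200KB/40KB) without saying how each is used, so the sketch uses only the 200KB figure as a serial-fallback cutoff. Shards are run in a plain loop here; a real kernel would presumably hand them to the AICPU runtime's parallel facility.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// All names and signatures below are illustrative assumptions.
enum class DType { kFloat, kInt32, kInt8, kUnsupported };

// Assumed cutoff: inputs smaller than this many bytes run as one serial shard.
constexpr std::size_t kParallelMinBytes = 200 * 1024;

// Clamp one value. The branch is resolved at compile time per type (C++17).
// Using fmax/fmin for floats is an assumption; it affects NaN handling.
template <typename T>
inline T ClampScalar(T x, T lo, T hi) {
  if constexpr (std::is_floating_point_v<T>) {
    return std::fmin(std::fmax(x, lo), hi);
  } else {
    return std::min(std::max(x, lo), hi);
  }
}

// Scalar bounds: lo/hi are loop-invariant values the compiler can keep in registers.
// __restrict__ is a GCC/Clang extension.
template <typename T>
void KernelScalarBound(const T* __restrict__ x, T lo, T hi, T* __restrict__ y,
                       std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) y[i] = ClampScalar(x[i], lo, hi);
}

// Element-wise bounds: lo/hi are arrays with the same length as x.
template <typename T>
void KernelElemBound(const T* __restrict__ x, const T* __restrict__ lo,
                     const T* __restrict__ hi, T* __restrict__ y,
                     std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) y[i] = ClampScalar(x[i], lo[i], hi[i]);
}

// Choose the shard count from the input size in bytes, not the element count.
inline std::size_t ShardCount(std::size_t n, std::size_t elem_size, std::size_t threads) {
  if (threads <= 1 || n * elem_size < kParallelMinBytes) return 1;
  return std::min(threads, n);
}

// Split [0, n) into shards and call fn(begin, end) on each; returns the shard count.
// The shards run sequentially in this sketch.
template <typename Fn>
std::size_t DispatchParallel(std::size_t n, std::size_t elem_size, std::size_t threads, Fn fn) {
  const std::size_t shards = ShardCount(n, elem_size, threads);
  const std::size_t step = (n + shards - 1) / shards;
  for (std::size_t b = 0; b < n; b += step) fn(b, std::min(b + step, n));
  return shards;
}

template <typename T>
std::size_t RunScalar(const void* x, const void* lo, const void* hi, void* y,
                      std::size_t n, std::size_t threads) {
  const T* in = static_cast<const T*>(x);
  T* out = static_cast<T*>(y);
  const T l = *static_cast<const T*>(lo);  // load the scalar bounds once
  const T h = *static_cast<const T*>(hi);
  return DispatchParallel(n, sizeof(T), threads, [=](std::size_t b, std::size_t e) {
    KernelScalarBound<T>(in, l, h, out, b, e);
  });
}

// switch-case type dispatch; returns the shard count, or 0 for an unsupported type.
inline std::size_t ClipScalarBounds(DType dt, const void* x, const void* lo, const void* hi,
                                    void* y, std::size_t n, std::size_t threads) {
  assert(x != nullptr && lo != nullptr && hi != nullptr && y != nullptr);
  switch (dt) {
    case DType::kFloat: return RunScalar<float>(x, lo, hi, y, n, threads);
    case DType::kInt32: return RunScalar<std::int32_t>(x, lo, hi, y, n, threads);
    case DType::kInt8:  return RunScalar<std::int8_t>(x, lo, hi, y, n, threads);
    default:            return 0;
  }
}
```

Under this assumed rule, a 64K-element fp32 tensor occupies 256 KB and is split across the available threads, while a 1K-element tensor stays on the serial path. Because each `case` instantiates the kernel for a concrete type, the clamp call can be inlined into the loop rather than reached through a function pointer.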