ops-math/conversion/concat_v2/tests/ut · CANN/ops-math - AtomGit

cann-robot: optimize the performance of aicpu concatv2 operator

File	Last commit	Last updated
op_host	add ut to concat_v2 and concat and strided_slice_grad Co-authored-by: biabu111<hebaojing1@huawei.com> # message auto-generated for no-merge-commit merge: !1334 merge master into master add ut to concat_v2 and concat and strided_slice_grad Created-by: biabu111 Commit-by: biabu111 Merged-by: cann-robot Description: ## Description Add UTs for the concat_v2, concat, and strided_slice_grad operators. ## Related Issue https://gitcode.com/cann/ops-math/issues/795 ## Testing ## Documentation Updates ## Type Labels - [ ] Bug fix - [ ] New feature - [ ] Performance optimization - [ ] Documentation update - [x] Other, please describe: add UTs See merge request: cann/ops-math!1334	3 months ago
op_kernel	test: add kernel UTs for conversion and math operators Co-authored-by: fenglin28<fenglin28@huawei.com> # message auto-generated for no-merge-commit merge: !2272 merge patch_math_conversion_ut into master test: add kernel UTs for conversion and math operators Created-by: fenglin28 Commit-by: fenglin28 Merged-by: cann-robot Description: ## Description Add kernel UTs for the conversion and math operators. ## Related Issue https://gitcode.com/cann/ops-math/issues/1222 ## Testing Pipeline UT tests pass. ## Documentation Updates ## Type Labels - [ ] Bug fix - [ ] New feature - [ ] Performance optimization - [ ] Documentation update - [x] Other, please describe: add unit tests See merge request: cann/ops-math!2272	1 month ago
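op_kernel_aicpu	optimize the performance of aicpu concatv2 operator Co-authored-by: ZhaiPeiChao<zhaipeichao@huawei.com> # message auto-generated for no-merge-commit merge: !2395 merge master into master optimize the performance of aicpu concatv2 operator Created-by: ZhaiPeiChao Commit-by: ZhaiPeiChao Merged-by: cann-robot Description:
## Description
Performance-optimization refactor of the conversion/concat_v2 aicpu operator.
Files changed:
- `conversion/concat_v2/op_kernel_aicpu/concat_v2_aicpu.h`
- `conversion/concat_v2/op_kernel_aicpu/concat_v2_aicpu.cpp`
- `conversion/concat_v2/tests/ut/op_kernel_aicpu/test_concat_v2.cpp`
Rationale and approach: the original implementation had several performance and coding-standard problems on the hot path, which caused high latency for many small input tensors and limited multi-core scalability:
1. Remove the Eigen and shared_ptr abstraction layers: the original code built an accessor for every input with `std::make_shared<Eigen::TensorMap<...>>`, so a single Compute call performed N heap allocations, the hot path went through two levels of pointer dereference (`shared_ptr → TensorMap → data`), and copying the shared_ptr between shards triggered atomic read-modify-write updates of the reference count in its control block, causing cross-core cache-line ping-pong. It is replaced by the POD descriptor `ConcatV2InputDesc{const T* base; int64_t dim1}`, built once in the Prepare phase, so the hot path allocates nothing and dereferences a single pointer.
2. Parallelize on natural boundaries (whole rows / input ranges): the original `ParallelFor` call used an element-level grain (`sizeof(T)`), so each shard had to handle partial rows at its start and end, which meant many branches, unstable memcpy lengths, and orders of magnitude too many tasks. Work is now split by row (the ByRow path) or by input range (the ByInput path); `per_unit_size` is passed the byte size of a whole row, so shards always align to natural boundaries and every segment can take the libc memcpy ifunc fast path.
3. Dedicated branches for degenerate shapes: the `flat_dim0 == 1` case (including axis=0 and 1-D inputs) takes the ByInput path, where prefix offsets plus partitioning of the output range make each shard write to a disjoint interval, eliminating false sharing entirely; `N == 1`, empty inputs, and similar cases are handled up front.
4. Static template hot-kernel function + explicit-capture lambda: `CopyOneRow` becomes a `static` template function in the header, dropping the `this` pointer to ease inlining across translation units; the parallel lambda switches from the default capture `[this, &...]` to explicit captures, and shard errors are aggregated with `std::atomic<uint32_t> + memory_order_relaxed`, so `memcpy_s` failures are no longer silently ignored as they were in the original implementation.
5. Compliance and maintainability: `reinterpret_cast` is replaced by `aicpu::PtrToPtr`; `Compute` is split into small functions such as `CheckParams / ParseInputs / ValidateShapes / DoCompute<T>` (each function ≤ 50 lines, cyclomatic complexity ≤ 5); methods that do not modify members are marked `const`; `KERNEL_LOG_INFO` is added on key paths and `KERNEL_LOG_ERROR` with context information on error branches; the CPU core usage policy (whether cores are reserved) is unchanged from the original implementation.
6. UT expansion: 9 cases are added to the original 16 (25 in total), covering the new ByRow/ByInput dispatch paths, the axis=0 and 1-D degenerate shapes, N=1, empty inputs, multiple dtypes, and error paths, to ensure behavior is strictly equivalent before and after the optimization.
## Related Issue
[#1342](https://gitcode.com/cann/ops-math/issues/1342)
## Testing
- Full UT regression: `math_aicpu_op_kernel_ut` was built directly with cmake (working around the known `UT_TARGETS=()` bug in `build.sh --opkernel_aicpu_test`); all 25 UT cases pass, covering regular paths, degenerate shapes, error paths, and all supported dtypes.
- Equivalence check: a standalone microbenchmark compared outputs byte by byte with `std::memcmp` across 7 typical input scenarios (small tensors with large N, medium bandwidth, large tensors, axis=0, N=1, etc.); the new and old implementations produce identical output.
- Performance check: measurements were collected at 1 / 4 / 10 threads; typical scenarios speed up by 1.8×–7.6×.
## Documentation Updates
None.
## Type Labels
- [ ] Bug fix
- [ ] New feature
- [x] Performance optimization
- [ ] Documentation update
- [ ] Other, please describe:
See merge request: cann/ops-math!2395	1 month ago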
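
The restructuring described in items 1 to 4 of merge request !2395 can be illustrated with a minimal sketch. This is not the code from `concat_v2_aicpu.cpp`. The names `ConcatV2InputDesc` and `CopyOneRow` and the descriptor's two fields come from the commit message; everything else is an assumption made for illustration: `RunShards` is a hypothetical serial stand-in for the AICPU `ParallelFor`, `ConcatByRow` and `ConcatByInput` are invented names for the two paths, and a null-pointer check stands in for the `memcpy_s` failure that the real kernel reports.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// POD descriptor with the two fields named in the commit message.
template <typename T>
struct ConcatV2InputDesc {
    const T* base;  // start of this input's data
    int64_t dim1;   // elements this input contributes to each output row
};

// Hypothetical stand-in for ParallelFor: calls work(begin, end) once per shard.
// Shards run serially here; a real kernel would dispatch them across cores.
static void RunShards(int64_t total, int64_t shard_size,
                      const std::function<void(int64_t, int64_t)>& work) {
    for (int64_t begin = 0; begin < total; begin += shard_size) {
        work(begin, std::min(begin + shard_size, total));
    }
}

// Copies one whole output row: one contiguous memcpy per input, no allocation.
template <typename T>
static bool CopyOneRow(const std::vector<ConcatV2InputDesc<T>>& inputs,
                       int64_t row, T* out_row) {
    for (const auto& in : inputs) {
        if (in.dim1 == 0) continue;            // empty input contributes nothing
        if (in.base == nullptr) return false;  // assumed failure condition
        std::memcpy(out_row, in.base + row * in.dim1,
                    static_cast<size_t>(in.dim1) * sizeof(T));
        out_row += in.dim1;
    }
    return true;
}

// ByRow sketch: every shard covers whole rows, so no shard sees a partial row.
template <typename T>
uint32_t ConcatByRow(const std::vector<ConcatV2InputDesc<T>>& inputs,
                     int64_t flat_dim0, int64_t rows_per_shard, T* output) {
    int64_t out_dim1 = 0;
    for (const auto& in : inputs) out_dim1 += in.dim1;
    std::atomic<uint32_t> error{0};  // shards fold failures into one flag
    RunShards(flat_dim0, std::max<int64_t>(rows_per_shard, 1),
              [&inputs, &error, output, out_dim1](int64_t begin, int64_t end) {
                  for (int64_t row = begin; row < end; ++row) {
                      if (!CopyOneRow(inputs, row, output + row * out_dim1)) {
                          error.store(1, std::memory_order_relaxed);
                          return;
                      }
                  }
              });
    return error.load(std::memory_order_relaxed);
}

// ByInput sketch for flat_dim0 == 1: prefix offsets give each input its own
// disjoint output interval, so shards never write to the same region.
template <typename T>
uint32_t ConcatByInput(const std::vector<ConcatV2InputDesc<T>>& inputs, T* output) {
    std::vector<int64_t> offset(inputs.size() + 1, 0);
    for (size_t i = 0; i < inputs.size(); ++i) {
        offset[i + 1] = offset[i] + inputs[i].dim1;
    }
    std::atomic<uint32_t> error{0};
    RunShards(static_cast<int64_t>(inputs.size()), 1,
              [&inputs, &offset, &error, output](int64_t begin, int64_t end) {
                  for (int64_t i = begin; i < end; ++i) {
                      const auto& in = inputs[static_cast<size_t>(i)];
                      if (in.dim1 == 0) continue;
                      if (in.base == nullptr) {
                          error.store(1, std::memory_order_relaxed);
                          return;
                      }
                      std::memcpy(output + offset[static_cast<size_t>(i)], in.base,
                                  static_cast<size_t>(in.dim1) * sizeof(T));
                  }
              });
    return error.load(std::memory_order_relaxed);
}
```

Both functions return 0 on success and 1 if any shard reported a failure. In the sketch, the caller chooses the path (for example, `ConcatByInput` when `flat_dim0 == 1`, `ConcatByRow` otherwise); how the real kernel dispatches between the two is not shown in the commit message beyond that condition.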