68a3b43e创建于 4月27日历史提交

文件	最后提交记录	最后更新时间
simt_mul	stdt mul cumsum example fix	1 个月前
test	stdt mul cumsum example fix	1 个月前
README.md	stdt mul cumsum example fix	1 个月前
plan.md	stdt mul cumsum example fix	1 个月前
pyproject.toml	five example	1 个月前
requirements.txt	stdt mul cumsum example fix	1 个月前
setup.py	stdt mul cumsum example fix	1 个月前

mul 迁移说明

1. 算子说明

算子名称：torch.mul / torch.multiply
迁移模式：torch_npu / PrivateUse1 out-of-tree 扩展
原始 CUDA 文件：pytorch/aten/src/ATen/native/cuda/BinaryMulKernel.cu
原始 CUDA 入口：mul_kernel_cuda
目标平台：Ascend SIMT
交付目录：ported-ops/mul
功能概述：在 Ascend SIMT 平台承接 torch.mul、torch.multiply、Tensor.mul_、out=、Tensor/Tensor 和 Tensor/Scalar 逐元素乘法
一比一复刻结论：当前实现保留 TensorIterator、dtype dispatch、scalar wrapped-number 语义、广播 stride、非连续输出、out= 和原位路径；未引入 CPU fallback 或 ACL 高层 Mul 替代路径
用户决策结论：迁移过程中遇到降级决策点，一律选择不同意降级，坚持一对一迁移

2. 原始 CUDA 实现摘要

BinaryMulKernel.cu 的核心职责是注册 CUDA backend 的 mul_stub，并在 mul_kernel_cuda 中通过 opmath_symmetric_gpu_kernel_with_scalars 执行对称二元 pointwise kernel。该路径依赖 TensorIterator 处理 dtype promotion、广播、stride、scalar wrapped-number、out= 和 inplace 场景。

用户侧典型调用链：

torch.mul / torch.multiply / tensor.mul_ / out=
└── aten::mul.Tensor / mul.Scalar / mul.out / mul_.*
    └── mul_stub
        └── mul_kernel_cuda
            └── opmath_symmetric_gpu_kernel_with_scalars

3. Ascend SIMT 迁移摘要

当前实现将 CUDA helper 迁移为本地 SIMT counterpart，核心路径如下：

torch.mul / torch.multiply / tensor.mul_ / out=
└── aten PrivateUse1 override
    └── at::native::simt_mul::mul_privateuse1*
        └── TensorIterator::binary_op
            └── mul_kernel_simt
                └── simt_opmath_symmetric_gpu_kernel_with_scalars
                    ├── simt_binary_elementwise_kernel
                    └── simt_unary_elementwise_kernel

主要实现点：

mul_kernel_simt：对应 CUDA mul_kernel_cuda
simt_opmath_symmetric_gpu_kernel_with_scalars：对应 CUDA opmath_symmetric_gpu_kernel_with_scalars
SimtOffsetCalculator：承接 TensorIterator 的 shape 和 stride 信息，支持 broadcast 和 non-contiguous
SimtNumericTraits：处理 half、bfloat16、complex half、double、complex double 的 opmath 转换
OpCommand::RunOpApiV2：按 torch_npu 扩展方式发起 SIMT kernel launch
TORCH_LIBRARY_IMPL(aten, PrivateUse1, ...)：注册 mul.* 和 multiply.* overload

4. API 与语法映射

CUDA / ATen 项	Ascend SIMT 项	处理方式	说明
`mul_kernel_cuda`	`mul_kernel_simt`	migrate	保留 dtype dispatch 和 pointwise kernel 结构
`opmath_symmetric_gpu_kernel_with_scalars`	`simt_opmath_symmetric_gpu_kernel_with_scalars`	migrate	保留 scalar-aware symmetric binary kernel 语义
CUDA kernel launch	SIMT `__global__` kernel launch	migrate	使用 `c10_npu::getCurrentNPUStream().stream(true)` 获取当前 NPU stream
`__device__` callable	`__aicore__` callable	migrate	适配 Ascend SIMT 设备函数标注
`TensorIterator::binary_op`	`TensorIterator::binary_op`	reuse	复用 ATen 公共广播、stride、dtype 驱动抽象
`wrapped_scalar_tensor`	`wrapped_scalar_tensor`	reuse	scalar overload 进入 TensorIterator 路径
CUDA helper `Loops.cuh`	本地 SIMT helper	migrate	不降级为 host 计算或 ACL 高层算子

5. 能力覆盖矩阵

能力项	源能力	当前实现	状态
Tensor/Tensor	`mul.Tensor`	PrivateUse1 `mul.Tensor`	已注册
Tensor/Scalar	`mul.Scalar`	PrivateUse1 `mul.Scalar`	已注册，当前 float64 scalar 数值仍需修复
`out=`	`mul.out` / `mul.Scalar_out`	PrivateUse1 `mul.out` / `mul.Scalar_out`	已注册
inplace	`mul_.Tensor` / `mul_.Scalar`	PrivateUse1 inplace wrappers	已注册
alias	`multiply.*`	`multiply.Tensor/out/Scalar/Scalar_out/inplace`	已注册
broadcasting	TensorIterator	TensorIterator + SIMT offset calculator	float32 用例通过
non-contiguous output	TensorIterator stride	独立 output/input offset calculator	float32 `out=` 用例通过
32-bit indexing split	`with_32bit_indexing`	保留 split 递归	已实现
dtype dispatch	all / complex / half / bf16 / bool / unsigned	`AT_DISPATCH_V2` 覆盖	编译覆盖保留，部分 dtype 数值待修复
CPU fallback	无	无	未引入
ACL 高层替代	无	无	未引入

6. 目录结构

ported-ops/mul/
├── README.md
├── plan.md
├── pyproject.toml
├── requirements.txt
├── setup.py
├── simt_mul
│   ├── __init__.py
│   └── csrc
│       ├── mul_bindings.asc
│       ├── mul_simt.h
│       └── simt
│           └── mul_simt.asc
└── test
    └── test_mul.py

7. 构建方式

环境要求：

Python 环境可导入 torch 和 torch_npu
可调用 bisheng
已配置 Ascend CANN 环境变量，例如 ASCEND_TOOLKIT_HOME / ASCEND_OPP_PATH

构建命令：

cd ported-ops/mul
python -m pip install -e . --no-build-isolation

编译选项说明：

默认 OPT_LEVEL=0，即 bisheng -O0
DEBUG=1 时同样使用 -O0 并追加 -g
可通过 OPT_LEVEL=3 python -m pip install -e . --no-build-isolation 尝试优化构建
当前 CANN 9.0.0 的 bisheng 在 -O1/-O3 编译 FP64 / complex FP64 SIMT 路径时会触发后端寄存器宽度错误，因此默认保守使用 -O0
该编译规避不删 dtype、不删 overload、不改执行路径，不属于算子语义降级

8. 验证方式

Python 测试命令：

cd ported-ops/mul
python -m unittest discover -s test -p 'test_*.py'

测试覆盖：

Tensor/Tensor broadcast float32
torch.multiply alias float32
Tensor/Scalar float64
inplace Tensor/Tensor float16
inplace Tensor/Scalar bfloat16
non-contiguous out= float32

9. 当前验证结果

已执行：

python -m pip install -e . --no-build-isolation

结果：通过，扩展可构建和安装。

已执行：

python -m unittest discover -s test -p 'test_*.py'

结果：6 个用例中 3 个通过、3 个失败。

通过项：

test_mul_tensor_broadcast_float32
test_multiply_alias_matches_cpu
test_mul_out_non_contiguous

失败项：

test_mul_inplace_tensor_float16：当前输出 NaN
test_mul_inplace_scalar_bfloat16：当前输出 0
test_mul_scalar_float64：当前输出近似 [-4, -2, -1, 0, 1, 2, 4, 4]，与 CPU 期望 [-5.25, -3.5, -1.75, 0, 1.75, 3.5, 5.25, 7] 不一致

当前状态标签：implemented, builds, partially verified, dtype numeric fixes pending

10. 已知问题与后续定位

half 输出 NaN：需要继续定位 half_from_float / float_from_half_bits 在 bisheng -O0 下的设备侧 bit conversion 或 store 行为
bfloat16 scalar 输出 0：需要继续定位 CPU scalar operand、wrapped scalar tensor 与 bf16 from_opmath 写回路径
float64 scalar 数值不一致：需要继续修正当前软件 FP64 multiply/add/sub 的尾数、舍入或 exponent 组装逻辑
complex double：当前保留分发和软件 FP64 组合逻辑，但需要在 float64 标量修复后补充专项测试
OPT_LEVEL=1/3：当前会触发 bisheng 后端错误，后续需随工具链版本验证

11. 降级决策记录

决策点	用户选择	当前处理
是否用 CPU fallback 绕过 half/bfloat16/double 问题	不同意	保留 SIMT 路径，记录未通过项
是否删掉 double / complex double dispatch	不同意	保留 dtype dispatch，并实现软件 FP64 helper
是否用 ACL 高层 `Mul` 替代 SIMT kernel	不同意	保留本地 SIMT elementwise kernel
是否只覆盖 float32	不同意	保留完整 overload 和 dtype 覆盖

12. 后续建议

优先修复 half / bfloat16 设备侧转换，先让低精度 tensor/inplace 用例通过
再单独建立 FP64 helper 的 host/device 对照测试，定位软件乘法的舍入与 exponent 问题
FP64 修复后增加 complex64、complex128、complex32 覆盖测试
工具链升级后重新验证 OPT_LEVEL=1/3，若优化构建通过再恢复更高优化级别