torch.randn Ascend SIMT Migration Notes

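1. Operator Description

Operator name: torch.randn / normal_
Migration mode: torch_npu
Original CUDA file: pytorch/aten/src/ATen/native/cuda/DistributionNormal.cu
Delivery root directory: /home/q00954432/simt-buddy/ported-ops/randn
User decision: no downgrade; a one-to-one migration is required

This project provides: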

distribution_normal_randn.randn(...)
distribution_normal_randn.normal_(tensor, mean=0.0, std=1.0, generator=None)

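By default, the torch_npu aten::normal_ that already exists in the current environment is not overridden. If the target environment needs to verify the path where native torch.randn is taken over through aten::normal_, set DISTRIBUTION_NORMAL_REGISTER_ATEN=1 at build time.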

2. Original CUDA Call Chain

torch.randn
-> TensorFactories.cpp::randn
-> result.normal_(0, 1, generator)
-> normal_stub
-> DistributionNormal.cu::normal_kernel
-> DistributionTemplates.h::normal_kernel
-> normal_and_transform
-> distribution_nullary_kernel
-> distribution_elementwise_grid_stride_kernel
-> curand_normal4 / curand_normal2_double
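
The last two steps of the chain can be illustrated with a small host-side simulation. This is a minimal sketch, assuming the usual unrolled grid-stride pattern in which each thread draws one group of four normals per loop step and scatters them one grid-width apart. The function name `grid_stride_indices` and its parameters are hypothetical and do not come from the PyTorch sources or from this project.

```python
def grid_stride_indices(numel, block_dim, grid_dim, unroll=4):
    """Yield (thread_id, step, lane, linear_index) for every element written.

    Illustrative model only: `step` counts how many 4-value random groups
    a thread has consumed, and `lane` selects one value inside the group.
    """
    total_threads = block_dim * grid_dim
    span = total_threads * unroll
    # Round numel up to a multiple of span so every thread runs the same
    # number of loop steps (and therefore advances its RNG state equally).
    rounded = ((numel - 1) // span + 1) * span if numel > 0 else 0
    for tid in range(total_threads):
        step = 0
        for base in range(tid, rounded, span):
            for lane in range(unroll):
                linear_index = base + total_threads * lane
                if linear_index < numel:  # tail elements are masked out
                    yield tid, step, lane, linear_index
            step += 1


if __name__ == "__main__":
    for record in grid_stride_indices(numel=10, block_dim=2, grid_dim=2):
        print(record)
```

Under this pattern every output index is written exactly once, which is why the seed/offset bookkeeping can be expressed per thread and per step.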

3. Ascend SIMT Migration Summary

Keeps normal_kernel(const TensorBase&, double, double, optional<Generator>)
Keeps the REGISTER_DISPATCH(normal_stub, &normal_kernel) form
Keeps normal_and_transform
Keeps distribution_nullary_kernel
Keeps distribution_elementwise_grid_stride_kernel
Keeps the PhiloxCudaState seed/offset passing form
Uses curand_normal4_compat to take over the float32 normal4 path
No drop-in replacement headers for curand/Philox were found in the CANN SIMT include directory, so a local Philox 4x32 + Box-Muller is implemented
The device-side double path is excluded per the rules and is not rewritten as float
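
To make the Philox 4x32 + Box-Muller combination concrete, here is a minimal host-side sketch in Python. It is not the code in normal.asc. The round function and constants follow the published Philox4x32 design with 10 rounds; the counter layout (offset in the low words, subsequence in the high words), the uniform mapping used before Box-Muller, and the helper names `philox4x32`, `box_muller`, and `normal4` are assumptions made for illustration and may differ from what `curand_normal4_compat` actually does.

```python
import math

MASK32 = 0xFFFFFFFF
# Multiplier and key-increment constants of Philox4x32 (Random123 design).
M0, M1 = 0xD2511F53, 0xCD9E8D57
W0, W1 = 0x9E3779B9, 0xBB67AE85


def philox4x32(counter, key, rounds=10):
    """Map a 4-word counter and 2-word key to four 32-bit words."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for i in range(rounds):
        if i > 0:  # the key is bumped between rounds, not before the first
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        p0 = M0 * c0  # 64-bit products, split into high and low halves
        p1 = M1 * c2
        c0, c1, c2, c3 = (
            (p1 >> 32) ^ c1 ^ k0,
            p1 & MASK32,
            (p0 >> 32) ^ c3 ^ k1,
            p0 & MASK32,
        )
    return c0, c1, c2, c3


def box_muller(a, b):
    """Turn two 32-bit integers into two standard normal samples."""
    # Assumed mapping: shift by half a step so log() never receives zero.
    u1 = (a + 0.5) / 2**32
    u2 = (b + 0.5) / 2**32
    radius = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return radius * math.cos(theta), radius * math.sin(theta)


def normal4(seed, subsequence, offset):
    """Return four normals for one (seed, subsequence, offset) triple."""
    key = (seed & MASK32, (seed >> 32) & MASK32)
    # Assumed counter layout: offset in words 0-1, subsequence in words 2-3.
    counter = (
        offset & MASK32,
        (offset >> 32) & MASK32,
        subsequence & MASK32,
        (subsequence >> 32) & MASK32,
    )
    x0, x1, x2, x3 = philox4x32(counter, key)
    return box_muller(x0, x1) + box_muller(x2, x3)


if __name__ == "__main__":
    print(normal4(seed=42, subsequence=0, offset=0))
```

Because the generator is counter-based, the same (seed, subsequence, offset) triple always yields the same four values, which is the property the seed/offset passing form relies on.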

4. Directory Structure

ported-ops/randn/
├── plan.md
├── README.md
├── setup.py
├── pyproject.toml
├── requirements.txt
├── distribution_normal_randn/
│   ├── __init__.py
│   ├── ops.py
│   └── csrc/
│       ├── normal.asc
│       └── simt/
│           └── normal.asc
└── test/
    ├── __init__.py
    └── test_distribution_normal_randn.py

5. Build

cd /home/q00954432/simt-buddy/ported-ops/randn
source /usr/local/Ascend/cann-9.0.0/set_env.sh
python setup.py build_ext --inplace

Optional native aten::normal_ registration:

DISTRIBUTION_NORMAL_REGISTER_ATEN=1 python setup.py build_ext --inplace

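If the target environment already has a torch_npu aten::normal_ PrivateUse1 kernel, the optional registration may trigger a duplicate-registration conflict; the default build does not enable this path.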

6. Usage

import torch
import distribution_normal_randn

x = distribution_normal_randn.randn(1024, dtype=torch.float32, device="npu")
y = torch.empty(1024, dtype=torch.float32, device="npu")
distribution_normal_randn.normal_(y, mean=0.0, std=1.0)

7. Verification

cd /home/q00954432/simt-buddy/ported-ops/randn
python -m unittest discover -s test -p "test_*.py" -v

Verify from another directory after installation:

cd /tmp
python - <<'PY'
import torch
import distribution_normal_randn

print(distribution_normal_randn.randn(8, dtype=torch.float32, device="cpu"))
if hasattr(torch, "npu") and torch.npu.is_available():
    print(distribution_normal_randn.randn(8, dtype=torch.float32, device="npu").cpu())
PY

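8. Current Limitations

The NPU device side currently implements only the torch.float32 contiguous output path
The Half / BFloat16 NPU device paths are marked blocked and are not rewritten as float
The device-side double path is marked blocked and is not rewritten as float
Non-contiguous normal_ requires a full TensorIterator offset calculator and is currently marked blocked
The existing torch_npu aten::normal_ is not overridden by default, to avoid breaking the built-in system implementation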

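9. Current Verification Status

Item	Status	Notes
Build verification status	`passed`	Both the default build and the optional `DISTRIBUTION_NORMAL_REGISTER_ATEN=1` build pass
Software-side verification status	`passed`	`python -m unittest discover -s test -p "test_*.py" -v`: 6 tests OK
Ascend 950 PR hardware verification status	`validated on Ascend 950 PR`	Device `Ascend950PR_9599`; NPU float32 custom randn, import from `/tmp` after installation, and same-seed generator verification all passed

Actual verification summary: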

device Ascend950PR_9599
custom_mean_std 0.004098 0.995277
torch_randn_mean_std -0.016052 0.988166
same_seed_equal True
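
The three quantities in the summary (sample mean, sample standard deviation, and same-seed equality) can be computed with a few lines of Python. This is a minimal sketch of the shape of such a check, not the project's test code: `summarize`, `same_seed_equal`, and `stand_in_sampler` are hypothetical names, the sample count of 1024 is an arbitrary choice, and the sampler uses the standard-library `random` module as a stand-in so the sketch runs without an NPU.

```python
import random
import statistics


def summarize(samples):
    """Return (mean, population standard deviation) of a sample list."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def same_seed_equal(sampler, seed, n):
    """Check that two draws with the same seed are element-wise identical."""
    return sampler(seed, n) == sampler(seed, n)


def stand_in_sampler(seed, n):
    """Stand-in for a seeded randn call; uses the stdlib generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    mean, std = summarize(stand_in_sampler(0, 1024))
    print("mean_std", round(mean, 6), round(std, 6))
    print("same_seed_equal", same_seed_equal(stand_in_sampler, 0, 1024))
```

For a standard normal sampler the mean should sit near 0 and the standard deviation near 1, with the deviation from those targets shrinking as the sample count grows.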

Final conclusion: the verified scope is NPU float32 contiguous randn/normal_ under the torch_npu Python call path. Half, BFloat16, device-side double, and non-contiguous normal_ remain blocked, with no implicit downgrade.