ReduceSum

Product Support

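Product	Supported
Ascend 950PR/Ascend 950DT	√
Atlas A3 training series products/Atlas A3 inference series products	√
Atlas A2 training series products/Atlas A2 inference series products	√
Atlas 200I/500 A2 inference products	√
Atlas inference series products AI Core	√
Atlas inference series products Vector Core	x
Atlas training series products	√
Kirin X90	√
Kirin 9030	√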

Function Description

Header file path: "basic_api/kernel_operator_vec_reduce_intf.h".

The ReduceSum interface sums all input data. The computation proceeds as follows.

Figure 1 ReduceSum computation diagram

Function Prototype

Computation over the first n elements of a tensor:

template <typename T, bool isSetMask = true>
__aicore__ inline void ReduceSum(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t count)

High-dimensional slicing computation on a tensor:

Bit-wise mask mode:

template <typename T>
__aicore__ inline void ReduceSum(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const uint64_t mask[], const int32_t repeatTime, const int32_t srcRepStride)

Continuous mask mode:

template <typename T>
__aicore__ inline void ReduceSum(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t mask, const int32_t repeatTime, const int32_t srcRepStride)

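Parameters

Table 1 Template parameters

Parameter	Description
T	Data type of the operands.
isSetMask	Reserved parameter; setting it has no effect.

Table 2 Parameters

Parameter	Input/Output	Description
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT (stored in the Unified Buffer).
src	Input	Source operand. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT (stored in the Unified Buffer).
sharedTmpBuffer	Input	Working space that holds intermediate results while the instruction executes. Pay particular attention to its size; for details, see Key Features. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT (stored in the Unified Buffer).
count	Input	Number of elements that take part in the computation. For details about this parameter, see Continuous Computation. The amount of data processed must not exceed the Unified Buffer (UB) size.
mask/mask[]	Input	`mask` controls which source elements take part in the computation within each iteration. For details, see Mask Overview.
repeatTime	Input	Number of iterations. For details about this parameter, see High-Dimensional Slicing. Note: unlike in High-Dimensional Slicing, `repeatTime` supports a larger range here; it only needs to stay within the maximum value of int32_t.
srcRepStride	Input	Address stride of the source operand between adjacent iterations, that is, the number of DataBlocks the source operand advances per iteration. The value range is $[0, 2^{16}-1]$.

Note: for the high-dimensional slicing parameters mask, repeatTime, and srcRepStride above, see the description in High-Dimensional Slicing.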

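Data Types

The supported data types are as follows:

Ascend 950PR/Ascend 950DT: half, float, int64_t, uint64_t.
Atlas A3 training series products/Atlas A3 inference series products: half, float.
Atlas A2 training series products/Atlas A2 inference series products: half, float.
Atlas 200I/500 A2 inference products: half, float.
Atlas inference series products AI Core: half, float.
Atlas training series products: half.
Kirin X90: half, float.
Kirin 9030: half, float.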

Return Value

None

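Constraints

For the address alignment constraints on the source operand and sharedTmpBuffer, see General Address Alignment Constraints; their start addresses must be 32-byte aligned. For the start address alignment constraint on the destination operand, see Table 3 in ReduceRepeat.
For the operand address overlap constraints, see General Address Overlap Constraints.
When sharedTmpBuffer is required, dst and sharedTmpBuffer may overlap in address (dst usually needs less space than sharedTmpBuffer). In that case sharedTmpBuffer must still meet its own space requirement; for details, see Key Features.

For the following products, when mask = 0 or repeatTime = 0, no reduction is performed and the destination operand is not written; the call is treated as a NOP (no operation).
- Atlas A3 training series products/Atlas A3 inference series products
- Atlas A2 training series products/Atlas A2 inference series products

For Ascend 950PR/Ascend 950DT, the int64_t and uint64_t data types are supported only by the first-n-elements interface.
The value range of srcRepStride is $[0, 2^{16}-1]$; choose it with the actual UB size in mind so that accesses do not go out of bounds.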
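The out-of-bounds concern for srcRepStride can be checked on the host before launching a kernel. The following is a minimal sketch, not part of the Ascend C API: `SrcExtentBytes` and `FitsInBuffer` are hypothetical helper names, and the bound is deliberately conservative in assuming that every iteration may touch a full 256-byte repeat (8 DataBlocks of 32 bytes) regardless of the mask.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kBlockBytes = 32;    // size of one DataBlock
constexpr std::int64_t kRepeatBytes = 256;  // bytes covered by one iteration

// Hypothetical helper: conservative upper bound, in bytes from the start of
// src, on the region read by the high-dimensional slicing interface.
// Assumption: each iteration may read a full 256-byte repeat.
inline std::int64_t SrcExtentBytes(std::int64_t repeatTime, std::int64_t srcRepStride) {
    assert(repeatTime >= 0);
    assert(srcRepStride >= 0 && srcRepStride <= 65535);  // [0, 2^16 - 1]
    if (repeatTime == 0) {
        return 0;  // nothing is read
    }
    // The last iteration starts (repeatTime - 1) strides after the first one.
    return (repeatTime - 1) * srcRepStride * kBlockBytes + kRepeatBytes;
}

// Hypothetical helper: true if the region fits in a source buffer of the given size.
inline bool FitsInBuffer(std::int64_t repeatTime, std::int64_t srcRepStride,
                         std::int64_t srcBufferBytes) {
    return SrcExtentBytes(repeatTime, srcRepStride) <= srcBufferBytes;
}
```

For example, repeatTime = 65 with srcRepStride = 8 gives 64 * 8 * 32 + 256 = 16640 bytes, which is exactly 8320 half elements.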

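Key Features

ReduceSum accumulates in one of two modes:

Mode 1: values within a repeat are accumulated as a binary tree, and the results of different repeats are accumulated sequentially.
Mode 2: values within a repeat are accumulated as a binary tree, and the results of different repeats are also accumulated as a binary tree.

In both modes, ReduceSum computes the sum within each repeat by binary-tree accumulation; for details, see Key Features of ReduceDataBlock.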
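The difference between the two accumulation orders can be illustrated on the host with plain C++. This is a sketch under stated assumptions, not the hardware algorithm: the binary tree here adds adjacent pairs level by level, which is an assumption (the actual pairing is described in the ReduceDataBlock reference), and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Binary-tree sum: adjacent pairs are added level by level.
// Assumption: adjacent pairing; the hardware pairing may differ.
inline float TreeSum(std::vector<float> values) {
    if (values.empty()) {
        return 0.0f;
    }
    while (values.size() > 1) {
        std::vector<float> next;
        next.reserve((values.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < values.size(); i += 2) {
            next.push_back(values[i] + values[i + 1]);
        }
        if (values.size() % 2 == 1) {
            next.push_back(values.back());  // odd element is carried up unchanged
        }
        values.swap(next);
    }
    return values[0];
}

// One tree-summed partial result per repeat of elementsPerRepeat elements.
inline std::vector<float> RepeatPartials(const std::vector<float>& src,
                                         std::size_t elementsPerRepeat) {
    assert(elementsPerRepeat > 0);
    std::vector<float> partials;
    for (std::size_t begin = 0; begin < src.size(); begin += elementsPerRepeat) {
        const std::size_t end = std::min(begin + elementsPerRepeat, src.size());
        partials.push_back(TreeSum(std::vector<float>(
            src.begin() + static_cast<std::ptrdiff_t>(begin),
            src.begin() + static_cast<std::ptrdiff_t>(end))));
    }
    return partials;
}

// First order: per-repeat results are added one after another.
inline float SumRepeatsSequentially(const std::vector<float>& src,
                                    std::size_t elementsPerRepeat) {
    float acc = 0.0f;
    for (float partial : RepeatPartials(src, elementsPerRepeat)) {
        acc += partial;
    }
    return acc;
}

// Second order: per-repeat results are themselves combined as a binary tree.
inline float SumRepeatsAsTree(const std::vector<float>& src,
                              std::size_t elementsPerRepeat) {
    return TreeSum(RepeatPartials(src, elementsPerRepeat));
}
```

For inputs whose partial sums are exactly representable (such as all ones), both orders return the same value. For general floating-point data they may differ in the last bits, because floating-point addition is not associative.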

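The space that sharedTmpBuffer requires depends on the accumulation mode:

Mode 1: a sharedTmpBuffer of any size can be passed in; it is not used.

Mode 2: compute the required space with the following formula: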

int typeSize = 2;                           // 2 bytes for half, 4 bytes for float
int elementsPerBlock = 32 / typeSize;       // number of elements in one DataBlock
int elementsPerRepeat = 256 / typeSize;     // number of elements one repeat can process
int firstMaxRepeat = repeatTime;            // repeat count of the first round
// Note: for the high-dimensional slicing interface, firstMaxRepeat is repeatTime; for the first-n-elements interface, firstMaxRepeat is count/elementsPerRepeat, and it is 1 when count is smaller than elementsPerRepeat
int iter1OutputCount = firstMaxRepeat;                                              // number of elements produced by the first round
int iter1AlignEnd = DivCeil(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // first-round element count, rounded up to a whole DataBlock
int finalWorkLocalNeedSize = iter1AlignEnd; // space sharedTmpBuffer needs, in elements: the rounded-up first-round element count
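The formula above can be wrapped in small host-side functions so that the buffer size is computed the same way for both interfaces. This is a sketch only: `DivCeilHost`, `TmpElemsForRepeat`, and `TmpElemsForCount` are hypothetical names, and the count-based variant uses integer division exactly as the note in the formula states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Ascend C API): integer division rounded up.
inline std::int32_t DivCeilHost(std::int32_t a, std::int32_t b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// sharedTmpBuffer size in elements for the high-dimensional slicing interface:
// one partial result per repeat, rounded up to a whole 32-byte DataBlock.
inline std::int32_t TmpElemsForRepeat(std::int32_t typeSize, std::int32_t repeatTime) {
    assert(typeSize > 0 && typeSize <= 32);
    const std::int32_t elementsPerBlock = 32 / typeSize;
    return DivCeilHost(repeatTime, elementsPerBlock) * elementsPerBlock;
}

// sharedTmpBuffer size in elements for the first-n-elements interface:
// firstMaxRepeat is count / elementsPerRepeat, or 1 for a short input.
inline std::int32_t TmpElemsForCount(std::int32_t typeSize, std::int32_t count) {
    assert(typeSize > 0 && typeSize <= 32);
    const std::int32_t elementsPerRepeat = 256 / typeSize;
    const std::int32_t firstMaxRepeat =
        count < elementsPerRepeat ? 1 : count / elementsPerRepeat;
    return TmpElemsForRepeat(typeSize, firstMaxRepeat);
}
```

With half data (typeSize = 2), repeatTime = 65 gives DivCeil(65, 16) * 16 = 80 elements, and count = 288 gives firstMaxRepeat = 2 and therefore 16 elements.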

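The accumulation mode used on each hardware platform is as follows:

Ascend 950PR/Ascend 950DT: mode 1.
Atlas A3 training series products/Atlas A3 inference series products: the first-n-elements interface uses mode 2, and the high-dimensional slicing interface uses mode 1.
Atlas A2 training series products/Atlas A2 inference series products: the first-n-elements interface uses mode 2, and the high-dimensional slicing interface uses mode 1.
Atlas 200I/500 A2 inference products: mode 1.
Atlas inference series products AI Core: mode 1.
Atlas training series products: mode 1.
Kirin X90: mode 1.
Kirin 9030: mode 1.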

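Examples

For more examples, see the ReduceSum samples.

High-dimensional slicing example, continuous mask mode:

// dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 contiguous elements. The high-dimensional slicing interface is used with repeatTime = 65 and a mask that selects all elements.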
int32_t mask = 128;
AscendC::ReduceSum<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8);

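High-dimensional slicing example, bit-wise mask mode:

// dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 contiguous elements. The high-dimensional slicing interface is used with repeatTime = 65 and a mask that selects all elements.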
uint64_t mask[2] = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
AscendC::ReduceSum<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8);
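The two mask examples select the same elements: a continuous mask of 128 and a bit-wise mask with all 128 bits set. The following host-side sketch shows how a continuous mask value for a 16-bit type maps to the equivalent bit-wise array. `ContinuousToBitMask` is a hypothetical helper, and the bit layout (mask[0] covers elements 0 to 63, mask[1] covers elements 64 to 127, lowest bit first) is an assumption to be checked against the Mask Overview.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bit-wise mask that selects the first `count` elements
// of a repeat for a 16-bit data type (at most 128 elements per repeat).
// Assumption: bit i of mask[0] maps to element i, bit i of mask[1] to element 64 + i.
inline std::array<std::uint64_t, 2> ContinuousToBitMask(std::int32_t count) {
    assert(count >= 0 && count <= 128);
    std::array<std::uint64_t, 2> mask = {0, 0};
    for (std::int32_t i = 0; i < count; ++i) {
        mask[static_cast<std::size_t>(i / 64)] |= (std::uint64_t{1} << (i % 64));
    }
    return mask;
}
```

Under that assumption, a count of 128 yields two words of 0xFFFFFFFFFFFFFFFF, matching the bit-wise example above, and a count of 4 yields { 0xF, 0 }.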

First-n-elements example:

// dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 contiguous elements. The first-n-elements interface is used.
AscendC::ReduceSum<half>(dstLocal, srcLocal, sharedTmpBuffer, 8320);

Complete example using the high-dimensional slicing interface:

#include "kernel_operator.h"
int srcDataSize = 8320;
int dstDataSize = 16;
int mask = 128;
int repStride = 8;
int repeat = srcDataSize / mask; // 8320 / 128 = 65

// Initialize srcLocal, dstLocal, and sharedTmpBuffer
AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
// With mask = 128, each repeat processes 128 elements, so 65 repeats cover all 8320 elements
AscendC::ReduceSum<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, repeat, repStride);
// Enqueue the result and free the tensors
outQueueDst.EnQue<half>(dstLocal);
inQueueSrc.FreeTensor(srcLocal);
workQueue.FreeTensor(sharedTmpBuffer);

The result of the example is as follows:

Input data src_gm:

[1. 1. 1. ... 1. 1. 1.]

Output data dst_gm:

[8320. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Complete example using the first-n-elements interface:

#include "kernel_operator.h"

int srcDataSize = 288;
// Initialize srcLocal, dstLocal, and sharedTmpBuffer
AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();

// Use the level-2 (first-n-elements) interface to sum the first 288 elements
AscendC::ReduceSum<half>(dstLocal, srcLocal, sharedTmpBuffer, srcDataSize);
// Enqueue the result and free the tensors
outQueueDst.EnQue<half>(dstLocal);
inQueueSrc.FreeTensor(srcLocal);
workQueue.FreeTensor(sharedTmpBuffer);

The result of the example is as follows:

Input data src_gm:

[1. 1. 1. ... 1. 1. 1.]

Output data dst_gm:

[288. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]