aclnnGroupedMatmulFinalizeRoutingV3
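Product support
| Product | Supported |
|---|---|
| Ascend 950PR/Ascend 950DT | √ |
| Atlas A3 training series products/Atlas A3 inference series products | √ |
| Atlas A2 training series products/Atlas A2 inference series products | √ |
| Atlas 200I/500 A2 inference products | × |
| Atlas inference series products | × |
| Atlas training series products | × |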
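Function description
- Interface function: a fused operator that combines GroupedMatmul and MoeFinalizeRouting. The GroupedMatmul output is combined according to the given row indices.
  Compared with aclnnGroupedMatmulFinalizeRoutingV2, this interface adds the following:
  - Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products: a new input parameter, tuningConfigOptional, used for tuning. The first value in the array is the expected number of tokens processed by each expert. The operator uses this value to choose its tiling split, which improves performance.
  - Ascend 950PR/Ascend 950DT: the MX quantization scenario. For details, see the quantization introduction.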
- Calculation formulas:
  1. Grouped matrix multiplication (GMM):
     $$y_i=(x_i\times weight_i) * scale_i * perTokenScale_i$$
  2. Expert routing and expert output assignment:
     For each token j, route the token and accumulate the output of its expert:
     $$y[rowIndex[i], :] = y[rowIndex[i], :] + y_{i(j)}[j - start_{i(j)}]$$
     where $i(j)$ is the index of the expert that token j is assigned to, and $y_{i(j)}[j - start_{i(j)}]$ is the result computed for that token by that expert.
  3. Shared-expert output fusion:
     $$y[rowIndex[i],:] = y[rowIndex[i],:] + sharedInputWeight \times sharedInput[j, :]$$
  4. Final output: the final result merges the outputs of all routed experts with the shared-expert output according to rowIndex, computed as follows:
     $$y[rowIndex[i],:] = \sum_{i \in \mathcal{E}[j]} y_i [j - start_i] + sharedInputWeight \times sharedInput[j, :]$$
     where $\mathcal{E}[j]$ is the set of experts assigned to token j.
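Function prototype
Each operator uses a two-phase interface. First call aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize to obtain the workspace size required for the computation and an executor that encapsulates the operator's computation flow, then call aclnnGroupedMatmulFinalizeRoutingV3 to run the computation.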
aclnnStatus aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize(
const aclTensor *x1,
aclTensor *x2,
const aclTensor *scaleOptional,
const aclTensor *biasOptional,
const aclTensor *offsetOptional,
const aclTensor *antiquantScaleOptional,
const aclTensor *antiquantOffsetOptional,
const aclTensor *pertokenScaleOptional,
const aclTensor *groupListOptional,
const aclTensor *sharedInputOptional,
const aclTensor *logitOptional,
const aclTensor *rowIndexOptional,
int64_t dtype,
float sharedInputWeight,
int64_t sharedInputOffset,
bool transposeX1,
bool transposeX2,
int64_t groupListType,
const aclIntArray *tuningConfigOptional,
aclTensor *out,
uint64_t *workspaceSize,
aclOpExecutor **executor)
aclnnStatus aclnnGroupedMatmulFinalizeRoutingV3(
void *workspace,
uint64_t workspaceSize,
aclOpExecutor *executor,
aclrtStream stream)
aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize
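- Parameter description

| Parameter | Input/Output | Description | Usage notes | Data type | Data format | Shape | Non-contiguous tensor |
|---|---|---|---|---|---|---|---|
| x1 | Input | Input x (left matrix). | - | INT8, FLOAT8_E5M2, FLOAT8_E4M3FN, FLOAT4_E2M1 | ND | (m, k) | - |
| x2 | Input | Input weight (right matrix). | - | INT4, FLOAT8_E5M2, FLOAT8_E4M3FN, FLOAT4_E2M1 | ND | 3-D | - |
| scaleOptional | Input | Scaling factor among the quantization parameters; per-channel quantization parameter. | - | INT64, FLOAT8_E8M0 | ND | - | - |
| biasOptional | Input | Bias of the matrix. | - | FLOAT32, BF16 | ND | 2-D, (e, n) | - |
| offsetOptional | Input | Offset for asymmetric quantization. | - | FLOAT32 | ND | - | - |
| antiquantScaleOptional | Input | Scaling factor for pseudo-quantization. | Not yet enabled. | FLOAT32 | ND | - | - |
| antiquantOffsetOptional | Input | Offset for pseudo-quantization. | Not yet enabled. | FLOAT32 | ND | - | - |
| pertokenScaleOptional | Input | Dequantization parameter for the matrix computation. | - | FLOAT32, FLOAT8_E8M0 | ND | - | - |
| groupListOptional | Input | Distribution of matmul sizes along the grouping axis of the input and output. | - | INT64 | ND | 1-D, (e) | - |
| sharedInputOptional | Input | Output of the shared expert in the MoE computation; it must be combined with the outputs of the MoE experts. | - | BF16 | ND | - | - |
| logitOptional | Input | Logit of the MoE experts for each token. | - | FLOAT32 | ND | 1-D, (m) | - |
| rowIndexOptional | Input | The MoE expert outputs are combined according to rowIndex; its values are the indices used by the scatter-add in the combine step. | - | INT64 | ND | 1-D, (m) | - |
| dtype | Input | Output type of the computation: 0: FLOAT32; 1: FLOAT16; 2: BFLOAT16. Only 0 is currently supported. | - | INT64 | - | - | - |
| sharedInputWeight | Input | Coefficient for combining the shared expert with the MoE experts. sharedInput is first multiplied by this value and then accumulated with the MoE expert results. | - | FLOAT32 | - | - | - |
| sharedInputOffset | Input | Offset of the shared-expert output within the total output. | - | INT64 | - | - | - |
| transposeX1 | Input | Whether the left matrix is transposed. Only false is supported. | - | BOOL | - | - | - |
| transposeX2 | Input | Whether the right matrix is transposed. | - | BOOL | - | - | - |
| groupListType | Input | Grouping mode: 0 selects cumsum mode (prefix sums); 1 selects count mode. | - | INT64 | - | - | - |
| tuningConfigOptional | Input | The first element of the array is the expected number of tokens processed by each expert; the operator uses it to choose its tiling split, which improves performance. Elements from the second onward are reserved for future extension and need not be filled in. For compatibility with earlier versions, pass nullptr if this parameter is not used. | - | INT64 | - | - | - |
| out | Output | Output result. | - | FLOAT32 | ND | (batch, n) | - |
| workspaceSize | Output | Returns the workspace size that must be allocated on the Device side. | - | - | - | - | - |
| executor | Output | Returns the op executor, which contains the operator's computation flow. | - | - | - | - | - |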
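- Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products:
  - x1 supports INT8 only. The shape is (m, k); m must be in the range [1, 16*1024*8], and k must be 2048.
  - x2 supports INT4 only. When the input is passed as INT32, the shape is (e, k, n / 8); when viewed as INT4, the shape is (e, k, n). e must be in the range [1, 256], k must be 2048, and n must be 7168. Only the non-transposed layout is supported, so transposeX2 must be false.
  - scaleOptional supports INT64. The shape is 3-D, (e, 1, n), where e and n match those of x2 (the weight).
  - biasOptional supports FLOAT32. e and n match those of x2.
  - offsetOptional supports FLOAT32. The shape is 3-D, (e, 1, n), where e and n match those of x2.
  - pertokenScaleOptional supports FLOAT32. The shape is 1-D, (m), where m matches that of x1.
  - groupListOptional: e must match that of x2.
  - sharedInputOptional is 2-D, (bsdp, n). bsdp must be less than or equal to batchSize/e, and n matches that of x2.
  - logitOptional: m must match that of x1.
  - rowIndexOptional: m must match that of x1.
  - x1, x2, and groupListOptional are required. scaleOptional, pertokenScaleOptional, logitOptional, rowIndexOptional, biasOptional, and sharedInputOptional are optional.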
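- Ascend 950PR/Ascend 950DT:
  - x1 does not support INT8.
  - x2 does not support INT4. The shape is (e, k, n), or (e, n, k) when transposed. e must be in the range [1, 1024].
  - scaleOptional supports FLOAT8_E8M0 only. The shape is 4-D: (e, Ceil(k/64), n, 2) when x2 is not transposed, and (e, n, Ceil(k/64), 2) when x2 is transposed. Its transpose attribute must match that of x2.
  - biasOptional supports BF16.
  - sharedInputOptional is 2-D, (bsdp, n), where bsdp stands for batchSize / dataParallelSize.
  - pertokenScaleOptional supports FLOAT8_E8M0. The shape is 3-D, (m, Ceil(k/64), 2).
  - x1, x2, scaleOptional, pertokenScaleOptional, groupListOptional, logitOptional, and rowIndexOptional are required. biasOptional and sharedInputOptional are optional. offsetOptional is not yet supported. No parameter supports empty tensors.
  - The first dimension of out (batch) and sharedInputOffset must be greater than or equal to 0 and less than or equal to m.
  - x1 supports an empty tensor with M equal to 0.
  - x2 supports an empty tensor with N equal to 0.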
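The two groupListType modes differ only in how the per-expert token counts are encoded: cumsum mode stores prefix sums, count mode stores the counts themselves. A minimal sketch of converting between them on the host follows; `CumsumToCount` and `CountToCumsum` are hypothetical helper names, not part of the aclnn API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: groupListType 0 (cumsum) -> groupListType 1 (count).
std::vector<int64_t> CumsumToCount(const std::vector<int64_t> &cumsum)
{
    std::vector<int64_t> count(cumsum.size());
    int64_t prev = 0;
    for (size_t i = 0; i < cumsum.size(); ++i) {
        count[i] = cumsum[i] - prev;  // tokens handled by expert i
        prev = cumsum[i];
    }
    return count;
}

// Hypothetical helper: groupListType 1 (count) -> groupListType 0 (cumsum).
std::vector<int64_t> CountToCumsum(const std::vector<int64_t> &count)
{
    std::vector<int64_t> cumsum(count.size());
    int64_t running = 0;
    for (size_t i = 0; i < count.size(); ++i) {
        running += count[i];
        cumsum[i] = running;  // tokens handled by experts 0..i
    }
    return cumsum;
}
```

For example, three experts handling 3, 0, and 5 tokens are written as {3, 0, 5} in count mode and {3, 3, 8} in cumsum mode.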
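- Return values
  Returns an aclnnStatus status code. For details, see the aclnn return codes.
  The first-phase interface validates the input parameters and reports an error in the following cases:

| Return value | Error code | Description |
|---|---|---|
| ACLNN_ERR_PARAM_NULLPTR | 161001 | A required input, output, or required attribute is passed as a null pointer. |
| ACLNN_ERR_PARAM_INVALID | 161002 | The data type or data format of x1, x2, scaleOptional, biasOptional, offsetOptional, antiquantScaleOptional, antiquantOffsetOptional, pertokenScaleOptional, groupListOptional, sharedInputOptional, logitOptional, rowIndexOptional, sharedInputWeight, sharedInputOffset, transposeX1, transposeX2, or out is outside the supported range. |
| ACLNN_ERR_PARAM_INVALID | 161002 | The shape of x1, x2, scaleOptional, biasOptional, offsetOptional, antiquantScaleOptional, antiquantOffsetOptional, pertokenScaleOptional, groupListOptional, sharedInputOptional, logitOptional, rowIndexOptional, or out does not satisfy the validation conditions. |
| ACLNN_ERR_PARAM_INVALID | 161002 | x1, x2, scaleOptional, biasOptional, offsetOptional, antiquantScaleOptional, antiquantOffsetOptional, pertokenScaleOptional, groupListOptional, sharedInputOptional, logitOptional, rowIndexOptional, or out is an empty tensor. |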
aclnnGroupedMatmulFinalizeRoutingV3
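- Parameter description

| Parameter | Input/Output | Description |
|---|---|---|
| workspace | Input | Address of the workspace memory allocated on the Device side. |
| workspaceSize | Input | Size of the workspace allocated on the Device side, obtained from the first-phase interface aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize. |
| executor | Input | The op executor, which contains the operator's computation flow. |
| stream | Input | The Stream on which the task is executed. |

- Return values
  Returns an aclnnStatus status code. For details, see the aclnn return codes.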
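Constraints
- Deterministic computation:
  - Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products: aclnnGroupedMatmulFinalizeRoutingV3 is non-deterministic by default. Deterministic computation can be enabled through aclrtCtxSetSysParamOpt.
  - Ascend 950PR/Ascend 950DT: aclnnGroupedMatmulFinalizeRoutingV3 is non-deterministic by default and does not support enabling deterministic computation through aclrtCtxSetSysParamOpt.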
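- Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products: only the pseudo-quantization scenario is supported.
  - The following input and output data type combinations are supported:

| x1 | x2 | scaleOptional | biasOptional | offsetOptional | antiquantScaleOptional | antiquantOffsetOptional | pertokenScaleOptional | groupListOptional | sharedInputOptional | logitOptional | rowIndexOptional | out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INT8 | INT4 | INT64 | FLOAT32 | FLOAT32 | null | null | FLOAT32 | INT64 | BFLOAT16 | FLOAT32 | INT64 | FLOAT32 |
| INT8 | INT4 | INT64 | FLOAT32 | null | null | null | FLOAT32 | INT64 | BFLOAT16 | FLOAT32 | INT64 | FLOAT32 |

  - In this scenario, scaleOptional is the result of fusing the per-channel and per-group scales offline.
  - In this scenario, biasOptional is an auxiliary result computed offline. Its value must be $8 \times w \times scaleOptional$, accumulated along the first dimension.
  - This scenario supports both symmetric and asymmetric quantization. For symmetric quantization, offsetOptional must be set to null. For asymmetric quantization, offsetOptional is an auxiliary result computed offline, namely $antiquantOffsetOptional \times scaleOptional$.
  - In this scenario, antiquantScaleOptional and antiquantOffsetOptional must be set to null.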
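The offline auxiliary values for the pseudo-quantization scenario could be prepared along the following lines. This is a sketch under loud assumptions: `ComputeOffset` and `ComputeBias` are hypothetical helpers, floats stand in for the real scale encoding (the API takes scaleOptional as INT64, and that packing is not covered here), the scale is treated as one value per (expert, column), and "accumulated along the first dimension" is read as a sum over k of each expert's (k, n) weight matrix, which is what the (e, n) bias shape suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: offset = antiquantOffset * scale, element-wise.
// Both inputs are laid out as (e, n); float scale is an assumption.
std::vector<float> ComputeOffset(const std::vector<float> &antiquantOffset,
                                 const std::vector<float> &scale)
{
    std::vector<float> offset(scale.size());
    for (size_t i = 0; i < scale.size(); ++i) {
        offset[i] = antiquantOffset[i] * scale[i];
    }
    return offset;
}

// Hypothetical helper: bias[e][n] = sum over k of 8 * w[e][k][n] * scale[e][n].
// Assumption: the accumulation runs over k; w holds the INT4 values widened to int8_t.
std::vector<float> ComputeBias(const std::vector<int8_t> &w,  // (e, k, n)
                               const std::vector<float> &scale,  // (e, n)
                               int64_t e, int64_t k, int64_t n)
{
    std::vector<float> bias(static_cast<size_t>(e * n), 0.0f);
    for (int64_t ei = 0; ei < e; ++ei) {
        for (int64_t kk = 0; kk < k; ++kk) {
            for (int64_t c = 0; c < n; ++c) {
                bias[ei * n + c] += 8.0f * static_cast<float>(w[(ei * k + kk) * n + c]) * scale[ei * n + c];
            }
        }
    }
    return bias;
}
```

For symmetric quantization only the bias is needed, since offsetOptional is then passed as null.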
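- Ascend 950PR/Ascend 950DT: only the MX full-quantization scenario is supported.
  - The following input and output data type combinations are supported:

| MX quantization scenario | x1 | x2 | scaleOptional | biasOptional | pertokenScaleOptional | groupListOptional | sharedInputOptional | logitOptional | rowIndexOptional | out |
|---|---|---|---|---|---|---|---|---|---|---|
| MXFP8 | FLOAT8_E4M3FN / FLOAT8_E5M2 | FLOAT8_E4M3FN / FLOAT8_E5M2 | FLOAT8_E8M0 | BFLOAT16 / null | FLOAT8_E8M0 | INT64 | BFLOAT16 / null | FLOAT32 | INT64 | FLOAT32 |
| MXFP4 | FLOAT4_E2M1 | FLOAT4_E2M1 | FLOAT8_E8M0 | BFLOAT16 / null | FLOAT8_E8M0 | INT64 | BFLOAT16 / null | FLOAT32 | INT64 | FLOAT32 |

  - In the MXFP4 and MXFP8 scenarios, offsetOptional, antiquantScaleOptional, and antiquantOffsetOptional must be set to null.
  - In the MXFP4 scenario, k must be even. When x2 is not transposed, n must also be even.
  - In the MXFP4 and MXFP8 scenarios, x2 may be transposed or non-transposed. The transpose attributes of x2 and scale must match.
  - e must be less than or equal to 1024.
  - In the MXFP4 scenario, k must not be 2.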
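The shape and parity rules stated for Ascend 950PR/Ascend 950DT can be checked on the host before calling the first-phase interface. The following is a minimal sketch of such a pre-check; `CeilDiv`, `MxScaleShape`, `MxPerTokenScaleShape`, and `CheckMxConstraints` are hypothetical helpers, and the operator performs its own validation regardless.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Integer ceiling division for positive b.
int64_t CeilDiv(int64_t a, int64_t b)
{
    return (a + b - 1) / b;
}

// Expected scaleOptional shape: (e, Ceil(k/64), n, 2), or (e, n, Ceil(k/64), 2) when x2 is transposed.
std::vector<int64_t> MxScaleShape(int64_t e, int64_t k, int64_t n, bool transposeX2)
{
    if (transposeX2) {
        return {e, n, CeilDiv(k, 64), 2};
    }
    return {e, CeilDiv(k, 64), n, 2};
}

// Expected pertokenScaleOptional shape: (m, Ceil(k/64), 2).
std::vector<int64_t> MxPerTokenScaleShape(int64_t m, int64_t k)
{
    return {m, CeilDiv(k, 64), 2};
}

// Checks only the constraints listed in this section.
bool CheckMxConstraints(bool isMxfp4, int64_t e, int64_t k, int64_t n, bool transposeX2)
{
    if (e < 1 || e > 1024) {
        return false;  // e must be in [1, 1024]
    }
    if (isMxfp4) {
        if (k % 2 != 0 || k == 2) {
            return false;  // MXFP4: k must be even and must not be 2
        }
        if (!transposeX2 && n % 2 != 0) {
            return false;  // MXFP4, non-transposed x2: n must be even
        }
    }
    return true;
}
```

With k = 2048 and n = 7168, as in the sample further down, the non-transposed scale shape for one expert is (1, 32, 7168, 2).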
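Calling example
The sample code below is for reference only. For the build and run procedure, see the instructions for compiling and running samples.
- Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products: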

#include <cstdint>
#include <cstdio>
#include <iostream>
#include <memory>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_permute.h"
#include "aclnnop/aclnn_grouped_matmul_finalize_routing_v3.h"
#include "aclnnop/aclnn_trans_matmul_weight.h"

#define CHECK_RET(cond, return_expr) \
    do {                             \
        if (!(cond)) {               \
            return_expr;             \
        }                            \
    } while (0)

#define CHECK_FREE_RET(cond, return_expr) \
    do {                                  \
        if (!(cond)) {                    \
            Finalize(deviceId, stream);   \
            return_expr;                  \
        }                                 \
    } while (0)

#define LOG_PRINT(message, ...)         \
    do {                                \
        printf(message, ##__VA_ARGS__); \
    } while (0)

int64_t GetShapeSize(const std::vector<int64_t> &shape)
{
    int64_t shapeSize = 1;
    for (auto i : shape) {
        shapeSize *= i;
    }
    return shapeSize;
}

int Init(int32_t deviceId, aclrtStream *stream)
{
    // Boilerplate: initialize resources
    auto ret = aclInit(nullptr);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
    ret = aclrtSetDevice(deviceId);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
    ret = aclrtCreateStream(stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
    return 0;
}

// Release the stream, the device, and the ACL runtime
void Finalize(int32_t deviceId, aclrtStream stream)
{
    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();
}

template <typename T>
int CreateAclTensor(const std::vector<T> &hostData, const std::vector<int64_t> &shape, void **deviceAddr,
                    aclDataType dataType, aclTensor **tensor)
{
    auto size = GetShapeSize(shape) * sizeof(T);
    // Allocate device memory with aclrtMalloc
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
    // Copy the host data to device memory with aclrtMemcpy
    ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    // Compute the strides of a contiguous tensor
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    // Create the aclTensor with aclCreateTensor
    *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                              shape.data(), shape.size(), *deviceAddr);
    return 0;
}

template <typename T>
int CreateAclTensorWeight(const std::vector<T> &hostData, const std::vector<int64_t> &shape, void **deviceAddr,
                          aclDataType dataType, aclTensor **tensor)
{
    auto size = static_cast<uint64_t>(GetShapeSize(shape));
    const aclIntArray *mat2Size = aclCreateIntArray(shape.data(), shape.size());
    auto ret = aclnnCalculateMatmulWeightSizeV2(mat2Size, dataType, &size);
    aclDestroyIntArray(mat2Size);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnCalculateMatmulWeightSizeV2 failed. ERROR: %d\n", ret); return ret);
    size *= sizeof(T);
    // Allocate device memory with aclrtMalloc
    ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
    // Copy the host data to device memory with aclrtMemcpy
    ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    // Compute the strides of a contiguous tensor
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    std::vector<int64_t> storageShape;
    storageShape.push_back(GetShapeSize(shape));
    // Create the aclTensor with aclCreateTensor
    *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                              storageShape.data(), storageShape.size(), *deviceAddr);
    return 0;
}

// Builds the inputs, runs the operator, and prints the result.
// Tensors and device buffers are owned by the unique_ptr guards and are released on return,
// before main() resets the device.
int RunGroupedMatmulFinalizeRoutingV3(aclrtStream stream)
{
    // 2. Construct the inputs and the output according to the API definition
    int64_t m = 8;
    int64_t k = 2048;
    int64_t n = 7168;
    int64_t e = 1;
    int64_t batch = 8;
    int64_t bsdp = 1;
    int64_t dtype = 0;
    float shareInputWeight = 1.0f;
    int64_t sharedInputOffset = 0;
    bool transposeX = false;
    bool transposeW = false;
    int64_t groupListType = 1;
    std::vector<int64_t> xShape = {m, k};
    std::vector<int64_t> wShape = {e, k, n / 8};
    std::vector<int64_t> scaleShape = {e, 1, n};
    std::vector<int64_t> biasShape = {e, n};
    std::vector<int64_t> offsetShape = {e, 1, n};
    std::vector<int64_t> pertokenScaleShape = {m};
    std::vector<int64_t> groupListShape = {e};
    std::vector<int64_t> sharedInputShape = {bsdp, n};
    std::vector<int64_t> logitShape = {m};
    std::vector<int64_t> rowIndexShape = {m};
    std::vector<int64_t> outShape = {batch, n};
    std::vector<int64_t> tuningConfigVal = {1};
    void *xDeviceAddr = nullptr;
    void *wDeviceAddr = nullptr;
    void *biasDeviceAddr = nullptr;
    void *scaleDeviceAddr = nullptr;
    void *offsetDeviceAddr = nullptr;
    void *pertokenScaleDeviceAddr = nullptr;
    void *groupListDeviceAddr = nullptr;
    void *sharedInputDeviceAddr = nullptr;
    void *logitDeviceAddr = nullptr;
    void *rowIndexDeviceAddr = nullptr;
    void *outDeviceAddr = nullptr;
    aclTensor *x = nullptr;
    aclTensor *w = nullptr;
    aclTensor *bias = nullptr;
    aclTensor *groupList = nullptr;
    aclTensor *scale = nullptr;
    aclTensor *offset = nullptr;
    aclTensor *pertokenScale = nullptr;
    aclTensor *sharedInput = nullptr;
    aclTensor *logit = nullptr;
    aclTensor *rowIndex = nullptr;
    aclTensor *out = nullptr;
    // Host buffers; each element type matches the aclDataType used when the tensor is created
    std::vector<int8_t> xHostData(GetShapeSize(xShape));
    std::vector<int32_t> wHostData(GetShapeSize(wShape));
    std::vector<int64_t> scaleHostData(GetShapeSize(scaleShape));
    std::vector<float> biasHostData(GetShapeSize(biasShape));
    std::vector<float> offsetHostData(GetShapeSize(offsetShape));
    std::vector<float> pertokenScaleHostData(GetShapeSize(pertokenScaleShape));
    std::vector<int64_t> groupListHostData(GetShapeSize(groupListShape));
    std::vector<uint16_t> sharedInputHostData(GetShapeSize(sharedInputShape));
    std::vector<float> logitHostData(GetShapeSize(logitShape));
    std::vector<int64_t> rowIndexHostData(GetShapeSize(rowIndexShape));
    std::vector<float> outHostData(GetShapeSize(outShape));
    // Assign groupList (count mode: the single expert handles all 8 tokens)
    groupListHostData[0] = 8;

    // Create the x aclTensor
    auto ret = CreateAclTensor(xHostData, xShape, &xDeviceAddr, aclDataType::ACL_INT8, &x);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> xTensorPtr(x, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> xDeviceAddrPtr(xDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the w aclTensor as INT32; it is later converted to INT4
    ret = CreateAclTensorWeight(wHostData, wShape, &wDeviceAddr, aclDataType::ACL_INT32, &w);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> wTensorPtr(w, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> wDeviceAddrPtr(wDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the scale aclTensor
    ret = CreateAclTensor(scaleHostData, scaleShape, &scaleDeviceAddr, aclDataType::ACL_INT64, &scale);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> scaleTensorPtr(scale, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> scaleDeviceAddrPtr(scaleDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the bias aclTensor
    ret = CreateAclTensor(biasHostData, biasShape, &biasDeviceAddr, aclDataType::ACL_FLOAT, &bias);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> biasTensorPtr(bias, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> biasDeviceAddrPtr(biasDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the offset aclTensor
    ret = CreateAclTensor(offsetHostData, offsetShape, &offsetDeviceAddr, aclDataType::ACL_FLOAT, &offset);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> offsetTensorPtr(offset, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> offsetDeviceAddrPtr(offsetDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the pertokenScale aclTensor
    ret = CreateAclTensor(pertokenScaleHostData, pertokenScaleShape, &pertokenScaleDeviceAddr,
                          aclDataType::ACL_FLOAT, &pertokenScale);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> pertokenScaleTensorPtr(pertokenScale,
                                                                                         aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> pertokenScaleDeviceAddrPtr(pertokenScaleDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the groupList aclTensor
    ret = CreateAclTensor(groupListHostData, groupListShape, &groupListDeviceAddr, aclDataType::ACL_INT64,
                          &groupList);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> groupListTensorPtr(groupList, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> groupListDeviceAddrPtr(groupListDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the sharedInput aclTensor
    ret = CreateAclTensor(sharedInputHostData, sharedInputShape, &sharedInputDeviceAddr, aclDataType::ACL_BF16,
                          &sharedInput);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> sharedInputTensorPtr(sharedInput,
                                                                                       aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> sharedInputDeviceAddrPtr(sharedInputDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the logit aclTensor
    ret = CreateAclTensor(logitHostData, logitShape, &logitDeviceAddr, aclDataType::ACL_FLOAT, &logit);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> logitTensorPtr(logit, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> logitDeviceAddrPtr(logitDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the rowIndex aclTensor
    ret = CreateAclTensor(rowIndexHostData, rowIndexShape, &rowIndexDeviceAddr, aclDataType::ACL_INT64, &rowIndex);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> rowIndexTensorPtr(rowIndex, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> rowIndexDeviceAddrPtr(rowIndexDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the out aclTensor
    ret = CreateAclTensor(outHostData, outShape, &outDeviceAddr, aclDataType::ACL_FLOAT, &out);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> outTensorPtr(out, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> outDeviceAddrPtr(outDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the tuningConfig array
    aclIntArray *tuningConfig = aclCreateIntArray(tuningConfigVal.data(), tuningConfigVal.size());
    std::unique_ptr<aclIntArray, aclnnStatus (*)(const aclIntArray *)> tuningConfigPtr(tuningConfig,
                                                                                      aclDestroyIntArray);
    CHECK_RET(tuningConfig != nullptr, LOG_PRINT("aclCreateIntArray failed.\n"); return -1);

    // 3. Call the CANN operator library API
    uint64_t workspaceSize = 0;
    aclOpExecutor *executor = nullptr;
    void *workspaceAddr = nullptr;
    // Call the first-phase interface of aclnnGroupedMatmulFinalizeRoutingV3
    ret = aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize(
        x, w, scale, bias, offset, nullptr, nullptr, pertokenScale, groupList, sharedInput, logit, rowIndex, dtype,
        shareInputWeight, sharedInputOffset, transposeX, transposeW, groupListType, tuningConfig, out,
        &workspaceSize, &executor);
    CHECK_RET(ret == ACL_SUCCESS,
              LOG_PRINT("aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize failed. ERROR: %d\n", ret);
              return ret);
    // Allocate device memory for the workspace size returned by the first-phase interface
    std::unique_ptr<void, aclError (*)(void *)> workspaceAddrPtr(nullptr, aclrtFree);
    if (workspaceSize > 0) {
        ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
        workspaceAddrPtr.reset(workspaceAddr);
    }
    // Call the second-phase interface of aclnnGroupedMatmulFinalizeRoutingV3
    ret = aclnnGroupedMatmulFinalizeRoutingV3(workspaceAddr, workspaceSize, executor, stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnGroupedMatmulFinalizeRoutingV3 failed. ERROR: %d\n", ret);
              return ret);

    // 4. (Boilerplate) Wait for the task to finish
    ret = aclrtSynchronizeStream(stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);

    // 5. Copy the result from device memory to the host and print it
    auto size = GetShapeSize(outShape);
    std::vector<float> resultData(size, 0);
    ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), outDeviceAddr,
                      size * sizeof(resultData[0]), ACL_MEMCPY_DEVICE_TO_HOST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret);
              return ret);
    for (int64_t i = 0; i < size; i++) {
        LOG_PRINT("result[%lld] is: %f\n", static_cast<long long>(i), resultData[i]);
    }
    // 6. The aclTensor objects, the aclIntArray, and the device buffers are released by the guards above
    return ACL_SUCCESS;
}

int main()
{
    // 1. (Boilerplate) Initialize the device and the stream
    int32_t deviceId = 0;
    aclrtStream stream;
    auto ret = Init(deviceId, &stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("Init stream failed. ERROR: %d\n", ret); return ret);
    ret = RunGroupedMatmulFinalizeRoutingV3(stream);
    CHECK_FREE_RET(ret == ACL_SUCCESS, LOG_PRINT("sample failed. ERROR: %d\n", ret); return ret);
    // 7. Release the stream and the device
    Finalize(deviceId, stream);
    return 0;
}

- Ascend 950PR/Ascend 950DT:

#include <cstdint>
#include <cstdio>
#include <iostream>
#include <memory>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_grouped_matmul_finalize_routing_v3.h"

#define CHECK_RET(cond, return_expr) \
    do {                             \
        if (!(cond)) {               \
            return_expr;             \
        }                            \
    } while (0)

#define CHECK_FREE_RET(cond, return_expr) \
    do {                                  \
        if (!(cond)) {                    \
            Finalize(deviceId, stream);   \
            return_expr;                  \
        }                                 \
    } while (0)

#define LOG_PRINT(message, ...)         \
    do {                                \
        printf(message, ##__VA_ARGS__); \
    } while (0)

int64_t GetShapeSize(const std::vector<int64_t> &shape)
{
    int64_t shapeSize = 1;
    for (auto i : shape) {
        shapeSize *= i;
    }
    return shapeSize;
}

int Init(int32_t deviceId, aclrtStream *stream)
{
    // Boilerplate: initialize resources
    auto ret = aclInit(nullptr);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
    ret = aclrtSetDevice(deviceId);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
    ret = aclrtCreateStream(stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
    return ACL_SUCCESS;
}

// Release the stream, the device, and the ACL runtime
void Finalize(int32_t deviceId, aclrtStream stream)
{
    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();
}

template <typename T>
aclnnStatus CreateAclTensor(const std::vector<T> &hostData, const std::vector<int64_t> &shape, void **deviceAddr,
                            aclDataType dataType, aclTensor **tensor)
{
    auto size = GetShapeSize(shape) * sizeof(T);
    // Allocate device memory with aclrtMalloc
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
    // Copy the host data to device memory with aclrtMemcpy
    ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    // Compute the strides of a contiguous tensor
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    // Create the aclTensor with aclCreateTensor
    *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                              shape.data(), shape.size(), *deviceAddr);
    return ACL_SUCCESS;
}

template <typename T>
aclnnStatus CreateAclTensorWeight(const std::vector<T> &hostData, const std::vector<int64_t> &shape,
                                  void **deviceAddr, aclDataType dataType, aclTensor **tensor)
{
    auto size = static_cast<uint64_t>(GetShapeSize(shape));
    size *= sizeof(T);
    // Allocate device memory with aclrtMalloc
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
    // Copy the host data to device memory with aclrtMemcpy
    ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    // Compute the strides of a contiguous tensor
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    std::vector<int64_t> storageShape;
    storageShape.push_back(GetShapeSize(shape));
    // Create the aclTensor with aclCreateTensor
    *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                              storageShape.data(), storageShape.size(), *deviceAddr);
    return ACL_SUCCESS;
}

template <typename T1, typename T2>
auto Ceil(T1 a, T2 b) -> T1
{
    if (b == 0) {
        return a;
    }
    return (a + b - 1) / b;
}

// Builds the inputs, runs the operator, and prints the result.
// Tensors and device buffers are owned by the unique_ptr guards and are released on return,
// before main() resets the device.
int RunGroupedMatmulFinalizeRoutingV3(aclrtStream stream)
{
    // 2. Construct the inputs and the output according to the API definition
    int64_t m = 8;
    int64_t k = 2048;
    int64_t n = 7168;
    int64_t e = 1;
    int64_t batch = 8;
    int64_t bsdp = 1;
    int64_t dtype = 0;
    float shareInputWeight = 1.0f;
    int64_t sharedInputOffset = 0;
    bool transposeX = false;
    bool transposeW = false;
    int64_t groupListType = 1;
    std::vector<int64_t> xShape = {m, k};
    std::vector<int64_t> wShape = {e, k, n};
    std::vector<int64_t> scaleShape = {e, Ceil(k, 64), n, 2};
    std::vector<int64_t> biasShape = {e, n};
    std::vector<int64_t> pertokenScaleShape = {m, Ceil(k, 64), 2};
    std::vector<int64_t> groupListShape = {e};
    std::vector<int64_t> sharedInputShape = {bsdp, n};
    std::vector<int64_t> logitShape = {m};
    std::vector<int64_t> rowIndexShape = {m};
    std::vector<int64_t> outShape = {batch, n};
    void *xDeviceAddr = nullptr;
    void *wDeviceAddr = nullptr;
    void *biasDeviceAddr = nullptr;
    void *scaleDeviceAddr = nullptr;
    void *pertokenScaleDeviceAddr = nullptr;
    void *groupListDeviceAddr = nullptr;
    void *sharedInputDeviceAddr = nullptr;
    void *logitDeviceAddr = nullptr;
    void *rowIndexDeviceAddr = nullptr;
    void *outDeviceAddr = nullptr;
    aclTensor *x = nullptr;
    aclTensor *w = nullptr;
    aclTensor *bias = nullptr;
    aclTensor *groupList = nullptr;
    aclTensor *scale = nullptr;
    aclTensor *pertokenScale = nullptr;
    aclTensor *sharedInput = nullptr;
    aclTensor *logit = nullptr;
    aclTensor *rowIndex = nullptr;
    aclTensor *out = nullptr;
    // Host buffers; each element size matches the aclDataType used when the tensor is created
    // (1 byte for FLOAT8 types, 2 bytes for BF16)
    std::vector<uint8_t> xHostData(GetShapeSize(xShape));
    std::vector<uint8_t> wHostData(GetShapeSize(wShape));
    std::vector<uint8_t> scaleHostData(GetShapeSize(scaleShape));
    std::vector<uint16_t> biasHostData(GetShapeSize(biasShape));
    std::vector<uint8_t> pertokenScaleHostData(GetShapeSize(pertokenScaleShape));
    std::vector<int64_t> groupListHostData(GetShapeSize(groupListShape));
    std::vector<uint16_t> sharedInputHostData(GetShapeSize(sharedInputShape));
    std::vector<float> logitHostData(GetShapeSize(logitShape));
    std::vector<int64_t> rowIndexHostData(GetShapeSize(rowIndexShape));
    std::vector<float> outHostData(GetShapeSize(outShape));
    // Assign groupList (count mode: the single expert handles all 8 tokens)
    groupListHostData[0] = 8;

    // Create the x aclTensor
    auto ret = CreateAclTensor(xHostData, xShape, &xDeviceAddr, aclDataType::ACL_FLOAT8_E5M2, &x);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> xTensorPtr(x, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> xDeviceAddrPtr(xDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the w aclTensor
    ret = CreateAclTensorWeight(wHostData, wShape, &wDeviceAddr, aclDataType::ACL_FLOAT8_E5M2, &w);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> wTensorPtr(w, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> wDeviceAddrPtr(wDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the scale aclTensor
    ret = CreateAclTensor(scaleHostData, scaleShape, &scaleDeviceAddr, aclDataType::ACL_FLOAT8_E8M0, &scale);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> scaleTensorPtr(scale, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> scaleDeviceAddrPtr(scaleDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the bias aclTensor
    ret = CreateAclTensor(biasHostData, biasShape, &biasDeviceAddr, aclDataType::ACL_BF16, &bias);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> biasTensorPtr(bias, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> biasDeviceAddrPtr(biasDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the pertokenScale aclTensor
    ret = CreateAclTensor(pertokenScaleHostData, pertokenScaleShape, &pertokenScaleDeviceAddr,
                          aclDataType::ACL_FLOAT8_E8M0, &pertokenScale);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> pertokenScaleTensorPtr(pertokenScale,
                                                                                         aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> pertokenScaleDeviceAddrPtr(pertokenScaleDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the groupList aclTensor
    ret = CreateAclTensor(groupListHostData, groupListShape, &groupListDeviceAddr, aclDataType::ACL_INT64,
                          &groupList);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> groupListTensorPtr(groupList, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> groupListDeviceAddrPtr(groupListDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the sharedInput aclTensor
    ret = CreateAclTensor(sharedInputHostData, sharedInputShape, &sharedInputDeviceAddr, aclDataType::ACL_BF16,
                          &sharedInput);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> sharedInputTensorPtr(sharedInput,
                                                                                       aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> sharedInputDeviceAddrPtr(sharedInputDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the logit aclTensor
    ret = CreateAclTensor(logitHostData, logitShape, &logitDeviceAddr, aclDataType::ACL_FLOAT, &logit);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> logitTensorPtr(logit, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> logitDeviceAddrPtr(logitDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the rowIndex aclTensor
    ret = CreateAclTensor(rowIndexHostData, rowIndexShape, &rowIndexDeviceAddr, aclDataType::ACL_INT64, &rowIndex);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> rowIndexTensorPtr(rowIndex, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> rowIndexDeviceAddrPtr(rowIndexDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);
    // Create the out aclTensor
    ret = CreateAclTensor(outHostData, outShape, &outDeviceAddr, aclDataType::ACL_FLOAT, &out);
    std::unique_ptr<aclTensor, aclnnStatus (*)(const aclTensor *)> outTensorPtr(out, aclDestroyTensor);
    std::unique_ptr<void, aclError (*)(void *)> outDeviceAddrPtr(outDeviceAddr, aclrtFree);
    CHECK_RET(ret == ACL_SUCCESS, return ret);

    // 3. Call the CANN operator library API
    uint64_t workspaceSize = 0;
    aclOpExecutor *executor = nullptr;
    void *workspaceAddr = nullptr;
    // Call the first-phase interface of aclnnGroupedMatmulFinalizeRoutingV3.
    // offsetOptional, antiquantScaleOptional, antiquantOffsetOptional, and tuningConfigOptional are passed as nullptr.
    ret = aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize(
        x, w, scale, bias, nullptr, nullptr, nullptr, pertokenScale, groupList, sharedInput, logit, rowIndex, dtype,
        shareInputWeight, sharedInputOffset, transposeX, transposeW, groupListType, nullptr, out, &workspaceSize,
        &executor);
    CHECK_RET(ret == ACL_SUCCESS,
              LOG_PRINT("aclnnGroupedMatmulFinalizeRoutingV3GetWorkspaceSize failed. ERROR: %d\n", ret);
              return ret);
    // Allocate device memory for the workspace size returned by the first-phase interface
    std::unique_ptr<void, aclError (*)(void *)> workspaceAddrPtr(nullptr, aclrtFree);
    if (workspaceSize > 0) {
        ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
        workspaceAddrPtr.reset(workspaceAddr);
    }
    // Call the second-phase interface of aclnnGroupedMatmulFinalizeRoutingV3
    ret = aclnnGroupedMatmulFinalizeRoutingV3(workspaceAddr, workspaceSize, executor, stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnGroupedMatmulFinalizeRoutingV3 failed. ERROR: %d\n", ret);
              return ret);

    // 4. (Boilerplate) Wait for the task to finish
    ret = aclrtSynchronizeStream(stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);

    // 5. Copy the result from device memory to the host and print it
    auto size = GetShapeSize(outShape);
    std::vector<float> resultData(size, 0);
    ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), outDeviceAddr,
                      size * sizeof(resultData[0]), ACL_MEMCPY_DEVICE_TO_HOST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret);
              return ret);
    for (int64_t i = 0; i < size; i++) {
        LOG_PRINT("result[%lld] is: %f\n", static_cast<long long>(i), resultData[i]);
    }
    // 6. The aclTensor objects and the device buffers are released by the guards above
    return ACL_SUCCESS;
}

int main()
{
    // 1. (Boilerplate) Initialize the device and the stream; see the AscendCL external interface list
    int32_t deviceId = 0;
    aclrtStream stream;
    auto ret = Init(deviceId, &stream);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("Init stream failed. ERROR: %d\n", ret); return ret);
    ret = RunGroupedMatmulFinalizeRoutingV3(stream);
    CHECK_FREE_RET(ret == ACL_SUCCESS, LOG_PRINT("sample failed. ERROR: %d\n", ret); return ret);
    // 7. Release the stream and the device
    Finalize(deviceId, stream);
    return 0;
}