aclnnNpuFormatCast


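Product Support

Product | Supported
Ascend 950PR/Ascend 950DT | √
Atlas A3 training series products/Atlas A3 inference series products | √
Atlas A2 training series products/Atlas A2 inference series products | √
Atlas 200I/500 A2 inference products | ×
Atlas inference series products | ×
Atlas training series products | ×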

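Function Description

  • Interface functionality

    • Ascend 950PR/Ascend 950DT:

      • Converts data from the ND format to the FRACTAL_NZ format with a specified C0 size. C0 is the size of the last dimension of the FRACTAL_NZ format and is determined by additionalDtype.
      • Converts data from the FRACTAL_NZ format with a specified C0 size to the ND format. The supported NZ formats are FRACTAL_NZ, FRACTAL_NZ_C0_2, FRACTAL_NZ_C0_4, FRACTAL_NZ_C0_16, and FRACTAL_NZ_C0_32.
    • Atlas A3 training series products/Atlas A3 inference series products, Atlas A2 training series products/Atlas A2 inference series products:

      • Converts between ND and NZ in both directions. C0 is the size of the last dimension of the NZ format and is computed as C0 = 32 B / ge::GetSizeByDataType(additionalDtype), where additionalDtype is first cast (static_cast) to the data type that ge::GetSizeByDataType expects.
      • Converts between NCDHW and NDC1HWC0, and between NCDHW and FRACTAL_Z_3D, in both directions. C0 is tightly coupled to the microarchitecture and equals the size of the cube unit, for example 16. C1 is the number of C0-sized blocks along the C dimension: C1 = ceil(C / C0). If C is not divisible by C0, the last block is padded to C0. C0 is computed as C0 = 32 B / size of srcDataType (for example, FP16 occupies 2 bytes).
  • Computation flow

    aclnnNpuFormatCastCalculateSizeAndFormat takes the input tensor srcTensor, the data type additionalDtype, and the target data format dstFormat, and computes the shape and actual data format of the converted tensor dstTensor. These results are used to construct dstTensor. aclnnNpuFormatCast is then called to convert srcTensor into dstTensor in the actual data format.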
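The C0/C1 arithmetic described above can be illustrated with a small sketch that does not call the aclnn API. The helper names `ComputeC0`, `ComputeC1`, and `Ndc1hwc0Shape` are hypothetical, and the dimension order [N, D, C1, H, W, C0] is an assumption read off the format name NDC1HWC0. The authoritative shape is always the one returned by aclnnNpuFormatCastCalculateSizeAndFormat.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: C0 = 32 bytes / element size in bytes.
int64_t ComputeC0(int64_t elemSizeBytes) {
    assert(elemSizeBytes > 0 && elemSizeBytes <= 32);
    return 32 / elemSizeBytes;
}

// Hypothetical helper: C1 = ceil(C / C0); the last block is padded up to C0.
int64_t ComputeC1(int64_t c, int64_t c0) {
    assert(c > 0 && c0 > 0);
    return (c + c0 - 1) / c0;
}

// Assumed layout: NCDHW -> [N, D, C1, H, W, C0], as the name NDC1HWC0 suggests.
std::vector<int64_t> Ndc1hwc0Shape(const std::vector<int64_t>& ncdhw, int64_t elemSizeBytes) {
    assert(ncdhw.size() == 5);
    const int64_t c0 = ComputeC0(elemSizeBytes);
    const int64_t c1 = ComputeC1(ncdhw[1], c0);
    return {ncdhw[0], ncdhw[2], c1, ncdhw[3], ncdhw[4], c0};
}
```

Under these assumptions, a 2-byte type such as FP16 gives C0 = 16. A 4-byte INT32 tensor of shape [1, 17, 1, 2, 2] gives C0 = 8 and C1 = 3, so the sketch yields [1, 1, 3, 2, 2, 8]: 96 stored elements for 68 source elements, the rest being padding.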

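Function Prototypes

aclnnNpuFormatCastCalculateSizeAndFormat must be called first to compute the shape and actual data format of dstTensor. The two-stage interface is called afterwards: aclnnNpuFormatCastGetWorkspaceSize is called first to obtain the workspace size required for the computation and an executor that contains the operator computation flow, and then aclnnNpuFormatCast is called to perform the computation.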

aclnnStatus aclnnNpuFormatCastCalculateSizeAndFormat(
    const aclTensor* srcTensor,
    const int        dstFormat,
    int              additionalDtype,
    int64_t**        dstShape,
    uint64_t*        dstShapeSize,
    int*             actualFormat)
aclnnStatus aclnnNpuFormatCastGetWorkspaceSize(
    const aclTensor* srcTensor,
    aclTensor*       dstTensor,
    uint64_t*        workspaceSize,
    aclOpExecutor**  executor)
aclnnStatus aclnnNpuFormatCast(
    void*          workspace,
    uint64_t       workspaceSize,
    aclOpExecutor* executor,
    aclrtStream    stream)

aclnnNpuFormatCastCalculateSizeAndFormat

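  • Parameters

    Parameter | Input/Output | Description | Usage notes | Data type | Data format | Dimensions (shape) | Non-contiguous Tensor
    srcTensor(aclTensor*) | Input | Source tensor to convert. | - | INT8, UINT8, INT32, UINT32, FLOAT, FLOAT16, BFLOAT16 [2], FLOAT8_E4M3FN [2], FLOAT4_E2M1 [2], HIFLOAT8 | ND, NZ, NCDHW, NDC1HWC0, FRACTAL_Z_3D, NCL [2] | 2-6 | -
    dstFormat(int) | Input | Data format of the output tensor. | - | None | ND, NZ, NCDHW, NDC1HWC0, FRACTAL_Z_3D | None | -
    additionalDtype(int) | Optional input | Basic data type used to infer the C0 size when converting to the FRACTAL_NZ data format. | - | ACL_FLOAT16(1), ACL_BF16(27), ACL_INT8(2), ACL_FLOAT8_E4M3FN(36), ACL_HIFLOAT8(34) | None | None | -
    dstShape(int64_t**) | Output | Pointer that receives the shape array of dstTensor. The memory it points to is allocated by this interface and must be released by the caller. | - | None | None | 4-8 | -
    dstShapeSize(uint64_t*) | Output | Pointer that receives the size of the dstTensor shape array. | - | None | None | None | -
    actualFormat(int*) | Output | Pointer that receives the actual data format of dstTensor. | - | None | ACL_FORMAT_ND(2), ACL_FORMAT_FRACTAL_NZ(29), ACL_FORMAT_NCDHW(30), ACL_FORMAT_NDC1HWC0(32), ACL_FRACTAL_Z_3D(33), ACL_FORMAT_FRACTAL_NZ_C0_16(50) [2], ACL_FORMAT_FRACTAL_NZ_C0_32(51) [2] | None | -
    • Ascend 950PR/Ascend 950DT:

      • Entries marked [1] in the table above are data types or data formats that this series does not support.
    • Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products:

      • Entries marked [2] in the table above are data types or data formats that this series does not support.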
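  • Return value:

    aclnnStatus: the returned status code. For details, see "aclnn return codes".

    The interface validates its input parameters and reports an error in the following cases:

    Return value | Error code | Description
    ACLNN_ERR_PARAM_NULLPTR | 161001 | srcTensor is a null pointer.
    ACLNN_ERR_PARAM_INVALID | 161002 | The data format of srcTensor is not ND, NZ, NCDHW, NDC1HWC0, FRACTAL_Z_3D, or NCL, or its data type is not INT8, UINT8, INT32, UINT32, FLOAT, FLOAT16, BFLOAT16, FLOAT8_E4M3FN, FLOAT4_E2M1, or HIFLOAT8.
    ACLNN_ERR_PARAM_INVALID | 161002 | dstFormat is not ND, NZ, NCDHW, NDC1HWC0, or FRACTAL_Z_3D.
    ACLNN_ERR_PARAM_INVALID | 161002 | additionalDtype is not ACL_FLOAT16(1), ACL_BF16(27), ACL_INT8(2), ACL_FLOAT8_E4M3FN(36), or ACL_HIFLOAT8(34).
    ACLNN_ERR_PARAM_INVALID | 161002 | The number of dimensions of the srcTensor view shape is outside the range [2, 6].
    ACLNN_ERR_PARAM_INVALID | 161002 | srcTensor is an empty Tensor.
    ACLNN_ERR_RUNTIME_ERROR | 361001 | The product model is not supported.
    ACLNN_ERR_RUNTIME_ERROR | 361001 | The format conversion is not supported.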
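When a call fails, it can help to turn the numeric status into a hint before consulting the full return code reference. The sketch below is only an illustration: `DescribeFormatCastStatus` is a hypothetical helper, the three error codes are copied from the table above, and treating 0 as success is an assumption that matches the `ret == ACL_SUCCESS` checks in the samples.

```cpp
#include <string>

// Error codes copied from the return value table above.
constexpr int kErrParamNullptr = 161001;  // ACLNN_ERR_PARAM_NULLPTR
constexpr int kErrParamInvalid = 161002;  // ACLNN_ERR_PARAM_INVALID
constexpr int kErrRuntimeError = 361001;  // ACLNN_ERR_RUNTIME_ERROR

// Hypothetical helper: map a status code to a short troubleshooting hint.
std::string DescribeFormatCastStatus(int status) {
    switch (status) {
        case 0:  // assumed success value
            return "success";
        case kErrParamNullptr:
            return "null pointer: check that srcTensor is not nullptr";
        case kErrParamInvalid:
            return "invalid parameter: check format, data type, rank, and empty tensors";
        case kErrRuntimeError:
            return "runtime error: product model or format conversion not supported";
        default:
            return "unknown status: see the aclnn return code reference";
    }
}
```

A caller could log `DescribeFormatCastStatus(ret)` next to the raw code inside the failure branch of CHECK_RET.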

aclnnNpuFormatCastGetWorkspaceSize

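  • Parameters

    Parameter | Input/Output | Description | Usage notes | Data type | Data format | Dimensions (shape) | Non-contiguous Tensor
    srcTensor(aclTensor*) | Input | Input tensor. Only contiguous Tensors are supported. | - | INT8, UINT8, INT32, UINT32, FLOAT, FLOAT16, BFLOAT16 [2], FLOAT8_E4M3FN [2], FLOAT4_E2M1 [2], HIFLOAT8 | ND, NZ, NCDHW, NDC1HWC0, FRACTAL_Z_3D, NCL [2] | 2-6 | -
    dstTensor(aclTensor*) | Input | Destination tensor that holds the converted data. Only contiguous Tensors are supported. | - | INT8, UINT8, INT32, UINT32, FLOAT, FLOAT16, BFLOAT16 [2], FLOAT8_E4M3FN [2], FLOAT4_E2M1 [2], HIFLOAT8 | ND, NZ, NCDHW, NDC1HWC0, FRACTAL_Z_3D, ACL_FORMAT_FRACTAL_NZ_C0_16(50) [2], ACL_FORMAT_FRACTAL_NZ_C0_32(51) [2] | 4-8 | -
    workspaceSize(uint64_t*) | Output | Size of the workspace that must be allocated on the Device. | - | None | None | None | -
    executor(aclOpExecutor**) | Output | Op executor that contains the operator computation flow. | - | None | None | None | -
    • Ascend 950PR/Ascend 950DT:

      • Entries marked [1] in the table above are data types or data formats that this series does not support.
    • Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products:

      • Entries marked [2] in the table above are data types or data formats that this series does not support.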
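  • Return value:

    aclnnStatus: the returned status code. For details, see "aclnn return codes".

    The first-stage interface validates the input parameters and reports an error in the following cases:

    Return value | Error code | Description
    ACLNN_ERR_PARAM_NULLPTR | 161001 | srcTensor or dstTensor is a null pointer.
    ACLNN_ERR_PARAM_INVALID | 161002 | The data type of srcTensor is not INT8, UINT8, INT32, UINT32, FLOAT, FLOAT16, BFLOAT16, FLOAT8_E4M3FN, FLOAT4_E2M1, or HIFLOAT8, or its data format is not ND, NZ, NCDHW, NDC1HWC0, FRACTAL_Z_3D, or NCL.
    ACLNN_ERR_PARAM_INVALID | 161002 | The data type of dstTensor is not INT8, UINT8, INT32, UINT32, FLOAT, FLOAT16, BFLOAT16, FLOAT8_E4M3FN, FLOAT4_E2M1, or HIFLOAT8, or its data format is not ND, NZ, NCDHW, NDC1HWC0, or FRACTAL_Z_3D.
    ACLNN_ERR_PARAM_INVALID | 161002 | srcTensor or dstTensor is a non-contiguous Tensor.
    ACLNN_ERR_PARAM_INVALID | 161002 | The number of dimensions of the srcTensor view shape is outside the range [2, 6], or the number of dimensions of the dstTensor storage shape is outside the range [4, 8]. [2]
    ACLNN_ERR_RUNTIME_ERROR | 361001 | The product model is not supported.
    • Ascend 950PR/Ascend 950DT:

      • Entries marked [1] in the table above are checks that this series does not perform.
    • Atlas A2 training series products/Atlas A2 inference series products, Atlas A3 training series products/Atlas A3 inference series products:

      • Entries marked [2] in the table above are checks that this series does not perform.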

aclnnNpuFormatCast

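  • Parameters

    Parameter | Input/Output | Description
    workspace | Input | Address of the workspace memory allocated on the Device.
    workspaceSize | Input | Size of the workspace allocated on the Device, obtained from the first-stage interface aclnnNpuFormatCastGetWorkspaceSize.
    executor | Input | Op executor that contains the operator computation flow.
    stream | Input | Stream on which the task is executed.
  • Return value:

    aclnnStatus: the returned status code. For details, see "aclnn return codes".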

Constraints

  • Deterministic computation: aclnnNpuFormatCast is deterministic by default.

  • The following combinations of input and output data types are supported:

    Ascend 950PR/Ascend 950DT
    • aclnnNpuFormatCastCalculateSizeAndFormat interface:

      srcTensor data type | srcTensor data format | dstFormat | additionalDtype | actualFormat
      INT8 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_INT8(2) | ACL_FORMAT_FRACTAL_NZ(29)
      INT32 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT16(1), ACL_BF16(27) | ACL_FORMAT_FRACTAL_NZ_C0_16(50)
      FLOAT | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT16(1), ACL_BF16(27) | ACL_FORMAT_FRACTAL_NZ_C0_16(50)
      FLOAT | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT8_E4M3FN(36) | ACL_FORMAT_FRACTAL_NZ_C0_32(51)
      FLOAT16 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT16(1) | ACL_FORMAT_FRACTAL_NZ(29)
      BFLOAT16 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_BF16(27) | ACL_FORMAT_FRACTAL_NZ(29)
      FLOAT8_E4M3FN | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT8_E4M3FN(36) | ACL_FORMAT_FRACTAL_NZ(29)
      FLOAT4_E2M1 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT8_E4M3FN(36) | ACL_FORMAT_FRACTAL_NZ(29)
      FLOAT4_E2M1 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FLOAT4_E2M1(40) | ACL_FORMAT_FRACTAL_NZ(29)
      HIFLOAT8 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29) | ACL_HIFLOAT8(34) | ACL_FORMAT_FRACTAL_NZ(29)
      FLOAT4_E2M1 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_FLOAT4_E2M1(40) | ACL_FORMAT_ND(2)
      FLOAT4_E2M1 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2) | ACL_FLOAT4_E2M1(40) | ACL_FORMAT_ND(2)
      INT8 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_INT8(2) | ACL_FORMAT_ND(2)
      INT8 | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2) | ACL_INT8(2) | ACL_FORMAT_ND(2)
      INT8 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2) | ACL_INT8(2) | ACL_FORMAT_ND(2)
      UINT8 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_UINT8(4) | ACL_FORMAT_ND(2)
      UINT8 | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2) | ACL_UINT8(4) | ACL_FORMAT_ND(2)
      UINT8 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2) | ACL_UINT8(4) | ACL_FORMAT_ND(2)
      FLOAT8_E4M3FN | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_FLOAT8_E4M3FN(36) | ACL_FORMAT_ND(2)
      FLOAT16 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_FLOAT16(1) | ACL_FORMAT_ND(2)
      BFLOAT16 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_BF16(27) | ACL_FORMAT_ND(2)
      INT32 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_INT32(3) | ACL_FORMAT_ND(2)
      INT32 | ACL_FORMAT_FRACTAL_NZ_C0_2(52) | ACL_FORMAT_ND(2) | ACL_INT32(3) | ACL_FORMAT_ND(2)
      INT32 | ACL_FORMAT_FRACTAL_NZ_C0_4(53) | ACL_FORMAT_ND(2) | ACL_INT32(3) | ACL_FORMAT_ND(2)
      INT32 | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2) | ACL_INT32(3) | ACL_FORMAT_ND(2)
      INT32 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2) | ACL_INT32(3) | ACL_FORMAT_ND(2)
      FLOAT | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2) | ACL_FLOAT(0) | ACL_FORMAT_ND(2)
      FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_2(52) | ACL_FORMAT_ND(2) | ACL_FLOAT(0) | ACL_FORMAT_ND(2)
      FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_4(53) | ACL_FORMAT_ND(2) | ACL_FLOAT(0) | ACL_FORMAT_ND(2)
      FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2) | ACL_FLOAT(0) | ACL_FORMAT_ND(2)
      FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2) | ACL_FLOAT(0) | ACL_FORMAT_ND(2)
    • aclnnNpuFormatCastGetWorkspaceSize interface:

      srcTensor data type | dstTensor data type | srcTensor data format | dstTensor data format
      INT8 | INT8 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29)
      INT32 | INT32 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ_C0_16(50)
      FLOAT | FLOAT | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ_C0_16(50)/ACL_FORMAT_FRACTAL_NZ_C0_32(51)
      FLOAT16 | FLOAT16 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29)
      BFLOAT16 | BFLOAT16 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29)
      FLOAT8_E4M3FN | FLOAT8_E4M3FN | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29)
      FLOAT4_E2M1 | FLOAT4_E2M1 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ_C0_32(51)
      FLOAT4_E2M1 | FLOAT4_E2M1 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29)
      HIFLOAT8 | HIFLOAT8 | ACL_FORMAT_ND(2) | ACL_FORMAT_FRACTAL_NZ(29)
      INT8 | INT8 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      INT8 | INT8 | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2)
      INT8 | INT8 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2)
      UINT8 | UINT8 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      UINT8 | UINT8 | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2)
      UINT8 | UINT8 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2)
      INT32 | INT32 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      INT32 | INT32 | ACL_FORMAT_FRACTAL_NZ_C0_2(52) | ACL_FORMAT_ND(2)
      INT32 | INT32 | ACL_FORMAT_FRACTAL_NZ_C0_4(53) | ACL_FORMAT_ND(2)
      INT32 | INT32 | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2)
      INT32 | INT32 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2)
      FLOAT | FLOAT | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      FLOAT | FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_2(52) | ACL_FORMAT_ND(2)
      FLOAT | FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_4(53) | ACL_FORMAT_ND(2)
      FLOAT | FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_16(50) | ACL_FORMAT_ND(2)
      FLOAT | FLOAT | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2)
      FLOAT16 | FLOAT16 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      BFLOAT16 | BFLOAT16 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      FLOAT8_E4M3FN | FLOAT8_E4M3FN | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      FLOAT4_E2M1 | FLOAT4_E2M1 | ACL_FORMAT_FRACTAL_NZ(29) | ACL_FORMAT_ND(2)
      FLOAT4_E2M1 | FLOAT4_E2M1 | ACL_FORMAT_FRACTAL_NZ_C0_32(51) | ACL_FORMAT_ND(2)
    • C0 calculation: $C0 = \frac{32\ \mathrm{B}}{\text{size of additionalDtype}}$

      additionalDtype | C0
      ACL_INT8(2) | 32
      ACL_FLOAT16(1) | 16
      ACL_BF16(27) | 16
      ACL_FLOAT8_E4M3FN(36) | 32
      ACL_HIFLOAT8(34) | 32
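    • Special scenarios that are currently not supported:

      • When the data type of srcTensor is the same as additionalDtype, the srcTensor format is ND, and the type is FLOAT16 or BFLOAT16, a shape [k, n] with k equal to 1 is not yet supported.
      • After this interface converts a tensor to the Ascend-affinity data format FRACTAL_NZ, no operation that modifies the tensor, such as contiguous, pad, or slice, is supported.
      • When either of the last two dimensions of the srcTensor shape equals 1, no operation that modifies the tensor, including transpose, is allowed after the conversion to the Ascend-affinity data format FRACTAL_NZ.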
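The C0 table for Ascend 950PR/Ascend 950DT given earlier in this subsection can be reproduced with a small lookup, which may be handy for sanity-checking a dstShape returned by the interface. This is only a sketch: `C0FromAdditionalDtype` is a hypothetical helper, the enum values are the numeric codes shown in the table, and the element sizes are inferred from the C0 values listed there.

```cpp
#include <cstdint>
#include <map>

// Hypothetical helper: returns C0 = 32 B / element size for the additionalDtype
// codes listed in the C0 table, or -1 for a code the table does not list.
int64_t C0FromAdditionalDtype(int additionalDtype) {
    // Key: aclDataType numeric code from the table; value: element size in bytes.
    static const std::map<int, int64_t> kElemSizeBytes = {
        {2, 1},   // ACL_INT8(2)
        {1, 2},   // ACL_FLOAT16(1)
        {27, 2},  // ACL_BF16(27)
        {36, 1},  // ACL_FLOAT8_E4M3FN(36)
        {34, 1},  // ACL_HIFLOAT8(34)
    };
    const auto it = kElemSizeBytes.find(additionalDtype);
    if (it == kElemSizeBytes.end()) {
        return -1;  // not covered by the table
    }
    return 32 / it->second;
}
```

Codes that appear in the combination tables but not in the C0 table are deliberately left unmapped rather than guessed.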
    Atlas A3 training series products/Atlas A3 inference series products, Atlas A2 training series products/Atlas A2 inference series products
    • aclnnNpuFormatCastCalculateSizeAndFormat interface parameters:

      srcTensor data type | dstFormat | additionalDtype | actualFormat
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_FRACTAL_NZ(29) | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_FRACTAL_NZ(29)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_ND(2) | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_ND(2)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_NCDHW(30) | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_NCDHW(30)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_NDC1HWC0(32) | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_NDC1HWC0(32)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FRACTAL_Z_3D(33) | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FRACTAL_Z_3D(33)
    • aclnnNpuFormatCastGetWorkspaceSize interface:

      srcTensor data type | dstTensor data type | dstTensor data format
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_FRACTAL_NZ(29)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_ND(2)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_NCDHW(30)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FORMAT_NDC1HWC0(32)
      INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | INT8, UINT8, FLOAT, FLOAT16, BF16, INT32, UINT32, HIFLOAT8 | ACL_FRACTAL_Z_3D(33)
    • C0 calculation: $C0 = \frac{32\ \mathrm{B}}{\text{size of the srcTensor base type}}$

      srcTensor base type | C0
      ACL_FLOAT(0), ACL_INT32(3), ACL_UINT32(8) | 8
      ACL_FLOAT16(1), ACL_BF16(27) | 16
      ACL_INT8(2), ACL_UINT8(4), ACL_HIFLOAT8(34) | 32
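    • Special scenarios that are currently not supported:

      • After this interface converts a tensor to the Ascend-affinity data format FRACTAL_NZ, no operation that modifies the tensor, such as contiguous, pad, or slice, is supported.
      • No operation that modifies the tensor, including transpose, is allowed after the conversion to the Ascend-affinity data format FRACTAL_NZ.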

Usage Examples

  • Ascend 950PR/Ascend 950DT: The sample code below is for reference only. For the build and run procedure, see "Compiling and Running Samples".

    #include <cstdio>
    #include <iostream>
    #include <vector>
    #include "acl/acl.h"
    #include "aclnnop/aclnn_npu_format_cast.h"
    
    #define CHECK_RET(cond, return_expr) \
    do {                               \
        if (!(cond)) {                   \
        return_expr;                   \
        }                                \
    } while (0)
    
    #define LOG_PRINT(message, ...)     \
    do {                              \
        printf(message, ##__VA_ARGS__); \
    } while (0)
    
    #define CEIL_DIV(x, y) ((((x) + (y)) - 1) / (y))
    #define CEIL_ALIGN(x, y) ((((x) + (y)) - 1) / (y) * (y))
    
    int64_t GetShapeSize(const std::vector<int64_t>& shape) {
        int64_t shapeSize = 1;
        for (auto i : shape) {
            shapeSize *= i;
        }
        return shapeSize;
    }
    
    extern "C" aclnnStatus aclnnNpuFormatCastCalculateSizeAndFormat(const aclTensor* srcTensor, const int dstFormat, const int additionalDtype,  int64_t** dstShape, uint64_t* dstShapeSize, int* actualFormat);
    extern "C" aclnnStatus aclnnNpuFormatCastGetWorkspaceSize(const aclTensor* srcTensor, aclTensor* dstTensor,uint64_t* workspaceSize, aclOpExecutor** executor);
    extern "C" aclnnStatus aclnnNpuFormatCast(void* workspace, uint64_t workspaceSize, aclOpExecutor* executor, aclrtStream stream);
    
    int Init(int32_t deviceId, aclrtStream* stream) {
        // Boilerplate: initialize ACL resources
        auto ret = aclInit(nullptr);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
        ret = aclrtSetDevice(deviceId);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
        ret = aclrtCreateStream(stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
        return 0;
    }
    
    template <typename T>
    int CreateAclTensor(const std::vector<T>& hostData, const std::vector<int64_t>& shape, void** deviceAddr,
                        aclDataType dataType, aclTensor** tensor) {
        auto size = GetShapeSize(shape) * sizeof(T);
        // Allocate device memory with aclrtMalloc
        auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
        // Copy the host data to device memory with aclrtMemcpy
        ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    
        // Compute the strides of a contiguous tensor
        std::vector<int64_t> strides(shape.size(), 1);
        for (int64_t i = shape.size() - 2; i >= 0; i--) {
            strides[i] = shape[i + 1] * strides[i + 1];
        }
    
        // Create the aclTensor with aclCreateTensor
        *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                                    shape.data(), shape.size(), *deviceAddr);
        return 0;
    }
    
    template <typename T>
    int CreateAclTensorWithFormat(const std::vector<T>& hostData, const std::vector<int64_t>& shape, int64_t** storageShape, uint64_t* storageShapeSize, void** deviceAddr,
                                  aclDataType dataType, aclTensor** tensor, aclFormat format) {
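        // Size the device buffer from the storage shape: the converted layout may be padded
        // and can therefore hold more elements than hostData
        int64_t storageElems = 1;
        for (uint64_t i = 0; i < *storageShapeSize; i++) {
            storageElems *= (*storageShape)[i];
        }
        auto size = storageElems * sizeof(T);
        auto hostSize = hostData.size() * sizeof(T);
        auto copySize = hostSize < size ? hostSize : size;
        // Allocate device memory with aclrtMalloc
        auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
        // Copy the host data to device memory with aclrtMemcpy
        ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), copySize, ACL_MEMCPY_HOST_TO_DEVICE);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);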
    
        // Compute the strides of a contiguous tensor
        std::vector<int64_t> strides(shape.size(), 1);
        for (int64_t i = shape.size() - 2; i >= 0; i--) {
            strides[i] = shape[i + 1] * strides[i + 1];
        }
    
        *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0,
                                    format, *storageShape, *storageShapeSize, *deviceAddr);
        return 0;
    }
    
    int main() {
        // 1. (Boilerplate) Initialize the device and stream; see the acl API manual
        // Set deviceId to match your actual device
        int32_t deviceId = 0;
        aclrtStream stream;
        auto ret = Init(deviceId, &stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("Init acl failed. ERROR: %d\n", ret); return ret);
    
        // 2. Construct the input and output according to the API definition
        int64_t k = 64;
        int64_t n = 128;
        int64_t srcDim0 = k;
        int64_t srcDim1 = n;
        int dstFormat = 29;  // ACL_FORMAT_FRACTAL_NZ(29)
        aclDataType srcDtype = aclDataType::ACL_INT32;
        aclDataType additionalDtype = aclDataType::ACL_FLOAT16;
    
        std::vector<int64_t> srcShape = {srcDim0, srcDim1};
        void* srcDeviceAddr = nullptr;
        void* dstDeviceAddr = nullptr;
        aclTensor* srcTensor = nullptr;
        aclTensor* dstTensor= nullptr;
        std::vector<int32_t> srcHostData(k * n, 1);
        for (size_t i = 0; i < k; i++) {
            for (size_t j = 0; j < n; j++) {
                srcHostData[i * n + j] = (j + 1) % 128;
            }
        }
    
        std::vector<int32_t> dstTensorHostData(k * n, 1);
    
        int64_t* dstShape = nullptr;
        uint64_t dstShapeSize = 0;
        int actualFormat;
    
        // Create the source aclTensor
        ret = CreateAclTensor(srcHostData, srcShape, &srcDeviceAddr, srcDtype, &srcTensor);
        CHECK_RET(ret == ACL_SUCCESS, return ret);
    
        // 3. Call the CANN operator library API
        uint64_t workspaceSize = 0;
        aclOpExecutor* executor;
        void* workspaceAddr = nullptr;
    
        // Compute the shape and format of the destination tensor
        ret = aclnnNpuFormatCastCalculateSizeAndFormat(srcTensor, dstFormat, additionalDtype, &dstShape, &dstShapeSize, &actualFormat);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnNpuFormatCastCalculateSizeAndFormat failed. ERROR: %d\n", ret); return ret);
    
        ret = CreateAclTensorWithFormat(dstTensorHostData, srcShape, &dstShape, &dstShapeSize, &dstDeviceAddr, srcDtype, &dstTensor, static_cast<aclFormat>(actualFormat));
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("CreateAclTensorWithFormat failed. ERROR: %d\n", ret); return ret);
    
        // Call aclnnNpuFormatCastGetWorkspaceSize, the first-stage interface
        ret = aclnnNpuFormatCastGetWorkspaceSize(srcTensor, dstTensor, &workspaceSize, &executor);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnNpuFormatCastGetWorkspaceSize failed. ERROR: %d\n", ret); return ret);
        // Allocate device memory according to the workspaceSize computed by the first-stage interface
        if (workspaceSize > 0) {
            ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
            CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
        }
    
        // Call aclnnNpuFormatCast, the second-stage interface
        ret = aclnnNpuFormatCast(workspaceAddr, workspaceSize, executor, stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnNpuFormatCast failed. ERROR: %d\n", ret); return ret);
    
        // 4. (Boilerplate) Synchronize and wait for the task to finish
        ret = aclrtSynchronizeStream(stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);
    
        // 5. Fetch the output: copy the result from device memory to the host
        int64_t size = 1;
        for (size_t i = 0; i < dstShapeSize; i++) {
            size *= dstShape[i];
        }
    
        std::vector<int32_t> resultData(size, 0);
        ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), dstDeviceAddr,
                            size * sizeof(resultData[0]), ACL_MEMCPY_DEVICE_TO_HOST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
        for (int64_t i = 0; i < size; i++) {
            LOG_PRINT("result[%ld] is: %d\n", i, resultData[i]);
        }
    
        // 6. Release dstShape and the aclTensor objects
        delete[] dstShape;
        aclDestroyTensor(srcTensor);
        aclDestroyTensor(dstTensor);
    
        // 7. Release device resources
        aclrtFree(srcDeviceAddr);
        aclrtFree(dstDeviceAddr);
    
        if (workspaceSize > 0) {
            aclrtFree(workspaceAddr);
        }
        aclrtDestroyStream(stream);
        aclrtResetDevice(deviceId);
        aclFinalize();
        return 0;
    }
    
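  • Atlas A3 training series products/Atlas A3 inference series products, Atlas A2 training series products/Atlas A2 inference series products: The sample code below is for reference only. For the build and run procedure, see "Compiling and Running Samples".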

    #include <cstdio>
    #include <iostream>
    #include <vector>
    #include "acl/acl.h"
    #include "aclnnop/aclnn_npu_format_cast.h"
    
    #define CHECK_RET(cond, return_expr) \
    do {                               \
        if (!(cond)) {                   \
        return_expr;                   \
        }                                \
    } while (0)
    
    #define LOG_PRINT(message, ...)     \
    do {                              \
        printf(message, ##__VA_ARGS__); \
    } while (0)
    
    #define CEIL_DIV(x, y) ((((x) + (y)) - 1) / (y))
    #define CEIL_ALIGN(x, y) ((((x) + (y)) - 1) / (y) * (y))
    
    int64_t GetShapeSize(const std::vector<int64_t>& shape) {
        int64_t shapeSize = 1;
        for (auto i : shape) {
            shapeSize *= i;
        }
        return shapeSize;
    }
    
    extern "C" aclnnStatus aclnnNpuFormatCastCalculateSizeAndFormat(const aclTensor* srcTensor, const int dstFormat, const int additionalDtype,  int64_t** dstShape, uint64_t* dstShapeSize, int* actualFormat);
    extern "C" aclnnStatus aclnnNpuFormatCastGetWorkspaceSize(const aclTensor* srcTensor, aclTensor* dstTensor,uint64_t* workspaceSize, aclOpExecutor** executor);
    extern "C" aclnnStatus aclnnNpuFormatCast(void* workspace, uint64_t workspaceSize, aclOpExecutor* executor, aclrtStream stream);
    
    int Init(int32_t deviceId, aclrtStream* stream) {
        // Boilerplate: initialize ACL resources
        auto ret = aclInit(nullptr);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
        ret = aclrtSetDevice(deviceId);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
        ret = aclrtCreateStream(stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
        return 0;
    }
    
    template <typename T>
    int CreateAclTensor(const std::vector<T>& hostData, const std::vector<int64_t>& shape, void** deviceAddr,
                        aclDataType dataType, aclTensor** tensor) {
        auto size = GetShapeSize(shape) * sizeof(T);
        // Allocate device memory with aclrtMalloc
        auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
        // Copy the host data to device memory with aclrtMemcpy
        ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
    
        // Compute the strides of a contiguous tensor
        std::vector<int64_t> strides(shape.size(), 1);
        for (int64_t i = shape.size() - 2; i >= 0; i--) {
            strides[i] = shape[i + 1] * strides[i + 1];
        }
    
        // Create the aclTensor with aclCreateTensor
        // Change the source format here
        *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_NCDHW,
                                    shape.data(), shape.size(), *deviceAddr);
        return 0;
    }
    
    template <typename T>
    int CreateAclTensorWithFormat(const std::vector<T>& hostData, const std::vector<int64_t>& shape, int64_t** storageShape, uint64_t* storageShapeSize, void** deviceAddr,
                                  aclDataType dataType, aclTensor** tensor, aclFormat format) {
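        // Size the device buffer from the storage shape: the converted layout may be padded
        // (for example when C is not a multiple of C0) and can hold more elements than hostData
        int64_t storageElems = 1;
        for (uint64_t i = 0; i < *storageShapeSize; i++) {
            storageElems *= (*storageShape)[i];
        }
        auto size = storageElems * sizeof(T);
        auto hostSize = hostData.size() * sizeof(T);
        auto copySize = hostSize < size ? hostSize : size;
        // Allocate device memory with aclrtMalloc
        auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
        // Copy the host data to device memory with aclrtMemcpy
        ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), copySize, ACL_MEMCPY_HOST_TO_DEVICE);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);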
    
        // Compute the strides of a contiguous tensor
        std::vector<int64_t> strides(shape.size(), 1);
        for (int64_t i = shape.size() - 2; i >= 0; i--) {
            strides[i] = shape[i + 1] * strides[i + 1];
        }
    
        *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0,
                                    format, *storageShape, *storageShapeSize, *deviceAddr);
        return 0;
    }
    
    int main() {
        // 1. (Boilerplate) Initialize the device and stream; see the acl API manual
        // Set deviceId to match your actual device
        int32_t deviceId = 0;
        aclrtStream stream;
        auto ret = Init(deviceId, &stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("Init acl failed. ERROR: %d\n", ret); return ret);
    
        // 2. Construct the input and output according to the API definition

        // Change the target format here
        int dstFormat = 32;  // ACL_FORMAT_NDC1HWC0(32)
        aclDataType srcDtype = aclDataType::ACL_INT32;
        int additionalDtype = -1;
    
        int64_t N = 1;
        int64_t C = 17;
        int64_t D = 1;
        int64_t H = 2;
        int64_t W = 2;
    
        std::vector<int64_t> srcShape = {N, C, D, H, W};
        void* srcDeviceAddr = nullptr;
        void* dstDeviceAddr = nullptr;
        aclTensor* srcTensor = nullptr;
        aclTensor* dstTensor= nullptr;
        std::vector<int32_t> srcHostData(N * C * D * H * W, 1);
    
        int num = 0;
        for (int n = 0; n < N; ++n) {
            for (int c = 0; c < C; ++c) {
                for (int d = 0; d < D; ++d) {
                    for (int h = 0; h < H; ++h) {
                        for (int w = 0; w < W; ++w) {
                            // Row-major layout: compute the linear index
                            int index = (((n * C + c) * D + d) * H + h) * W + w;
                            srcHostData[index] = num;
                            num++;
                        }
                    }
                }
            }
        }
    
        std::vector<int32_t> dstTensorHostData(N * C * D * H * W, 1);
    
        int64_t* dstShape = nullptr;
        uint64_t dstShapeSize = 0;
        int actualFormat = -1;  // initialized so that printing it before the call is well defined
    
        // Create the source aclTensor
        ret = CreateAclTensor(srcHostData, srcShape, &srcDeviceAddr, srcDtype, &srcTensor);
        CHECK_RET(ret == ACL_SUCCESS, return ret);
    
        // 3. Call the CANN operator library API
        uint64_t workspaceSize = 0;
        aclOpExecutor* executor;
        void* workspaceAddr = nullptr;
        std::cout << "init actualFormat = " << actualFormat << std::endl;
        // Compute the shape and format of the destination tensor
        ret = aclnnNpuFormatCastCalculateSizeAndFormat(srcTensor, dstFormat, additionalDtype, &dstShape, &dstShapeSize, &actualFormat);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnNpuFormatCastCalculateSizeAndFormat failed. ERROR: %d\n", ret); return ret);
    
        std::cout << "actualFormat = " << actualFormat << std::endl;
        std::cout << "&dstShape = " << &dstShape << std::endl;
        std::cout << "dstShape = [ ";
        for (int64_t i = 0; i < dstShapeSize; ++i) {
            std::cout << dstShape[i] << " ";
        }
        std::cout << "]" << std::endl;
    
        ret = CreateAclTensorWithFormat(dstTensorHostData, srcShape, &dstShape, &dstShapeSize, &dstDeviceAddr, srcDtype, &dstTensor, static_cast<aclFormat>(actualFormat));
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("CreateAclTensorWithFormat failed. ERROR: %d\n", ret); return ret);
    
        // Call aclnnNpuFormatCastGetWorkspaceSize, the first-stage interface
        ret = aclnnNpuFormatCastGetWorkspaceSize(srcTensor, dstTensor, &workspaceSize, &executor);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnNpuFormatCastGetWorkspaceSize failed. ERROR: %d\n", ret); return ret);
        // Allocate device memory according to the workspaceSize computed by the first-stage interface
        if (workspaceSize > 0) {
            ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
            CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
        }
    
        // Call aclnnNpuFormatCast, the second-stage interface
        ret = aclnnNpuFormatCast(workspaceAddr, workspaceSize, executor, stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnNpuFormatCast failed. ERROR: %d\n", ret); return ret);
    
        // 4. (Boilerplate) Synchronize and wait for the task to finish
        ret = aclrtSynchronizeStream(stream);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);
    
        // 5. Fetch the output: copy the result from device memory to the host
        int64_t size = 1;
        for (size_t i = 0; i < dstShapeSize; i++) {
            size *= dstShape[i];
        }
    
        std::vector<int32_t> resultData(size, 0);
        ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), dstDeviceAddr,
                            size * sizeof(resultData[0]), ACL_MEMCPY_DEVICE_TO_HOST);
        CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
        for (int64_t i = 0; i < size; i++) {
            LOG_PRINT("result[%ld] is: %d\n", i, resultData[i]);
        }
    
        // 6. Release dstShape and the aclTensor objects
        delete[] dstShape;
        aclDestroyTensor(srcTensor);
        aclDestroyTensor(dstTensor);
    
        // 7. Release device resources
        aclrtFree(srcDeviceAddr);
        aclrtFree(dstDeviceAddr);
    
        if (workspaceSize > 0) {
            aclrtFree(workspaceAddr);
        }
        aclrtDestroyStream(stream);
        aclrtResetDevice(deviceId);
        aclFinalize();
        return 0;
    }