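Unified Training and Inference Example

Overview

Using the MultiScaleDeformableAttn operator as an example, this document walks through the full flow from exporting a model to ONNX to running the resulting om model. It serves as a basic reference for using ONNX with this repository.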

Single-operator ONNX export script

A single-operator model is exported with a Python script. The export script is as follows:

import os
import torch
import torch_npu
import mx_driving
from mx_driving import multi_scale_deformable_attn

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, value, shapes, level_start_index, sampling_locations, attention_weights):
        return multi_scale_deformable_attn(value, shapes, level_start_index, sampling_locations, attention_weights)


def onnx_export(model, inputs, onnx_model_name,
                input_names=None, output_names=None):
    if input_names is None:
        input_names = ["input_names"]
    if output_names is None:
        output_names = ["output_names"]
    model.eval()
    OPERATOR_EXPORT_TYPE = torch._C._onnx.OperatorExportTypes.ONNX
    with torch.no_grad():
        torch.onnx.export(model, inputs,
                            onnx_model_name,
                            opset_version=11,
                            operator_export_type=OPERATOR_EXPORT_TYPE,
                            input_names=input_names,
                            output_names=output_names)


def export_onnx(name):
    bs, num_levels, num_heads, num_points, num_queries, embed_dims = 2, 1, 8, 4, 40000, 32
    shapes = torch.tensor([[200, 200] * num_levels]).reshape(num_levels, 2).long()

    num_keys = sum((H * W).item() for H, W in shapes)
    value = torch.rand(bs, num_keys, num_heads, embed_dims) * 0.01
    sampling_locations = torch.rand(bs, num_queries, num_heads, num_levels, num_points, 2) * 1.2 - 0.1
    attention_weights = torch.rand(bs, num_queries, num_heads, num_levels, num_points) + 1e-5
    level_start_index = torch.cat((shapes.new_zeros((1, )), shapes.prod(1).cumsum(0)[:-1])).long()

    value = value.half()
    sampling_locations = sampling_locations.half()
    attention_weights = attention_weights.half()

    npu_value = value.clone().detach().npu()
    npu_sampling_locations = sampling_locations.clone().detach().npu()
    npu_attention_weights = attention_weights.clone().detach().npu()
    npu_level_start_index = level_start_index.clone().detach().npu()
    npu_shapes = shapes.clone().detach().npu()

    model = Model().npu()
    model(npu_value, npu_shapes, npu_level_start_index, npu_sampling_locations, npu_attention_weights)
    # Pass one name per model input; a single comma-joined string would only name the first input.
    onnx_export(model, (npu_value, npu_shapes, npu_level_start_index, npu_sampling_locations, npu_attention_weights), name,
        ["value", "shapes", "level_start_index", "sampling_locations", "attention_weights"], ["outputs"])

if __name__ == '__main__':
    export_onnx("./msda.onnx")

After the script runs, the file msda.onnx is generated in the current directory.
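
The export script derives num_keys and level_start_index from the per-level feature-map sizes. The following is a minimal NumPy sketch of that bookkeeping only, not of the operator itself. The helper name flatten_levels and the two-level pyramid are hypothetical and used for illustration.

```python
import numpy as np


def flatten_levels(spatial_shapes):
    """Return (num_keys, level_start_index) for a list of (H, W) feature-map sizes."""
    shapes = np.asarray(spatial_shapes, dtype=np.int64).reshape(-1, 2)
    keys_per_level = shapes.prod(axis=1)        # H * W for each level
    num_keys = int(keys_per_level.sum())        # length of the flattened key axis of `value`
    # Each level starts where the previous levels end (exclusive prefix sum).
    level_start_index = np.concatenate(([0], keys_per_level.cumsum()[:-1])).astype(np.int64)
    return num_keys, level_start_index


# Single 200 x 200 level, as in the export script.
print(flatten_levels([[200, 200]]))
# Hypothetical two-level pyramid, for illustration only.
print(flatten_levels([[200, 200], [100, 100]]))
```

With a single level the start index is just 0. With more levels, each entry is the offset of that level inside the flattened value tensor.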

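Domain conversion

If the error The model has 2 domain_version fields (see the FAQ) appears while converting the ONNX model to om, install the conversion repository and perform a domain conversion. The conversion repository is https://gitee.com/Ronnie_zheng/MagicONNX and the conversion script is: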

from magiconnx import OnnxGraph
graph = OnnxGraph('msda.onnx')

graph.keep_default_domain()
graph.save('msda.onnx')

After the script runs, the multiple domains in the ONNX model are unified into one.
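
To check whether a model needs this step, one option is to list the opset domains it declares. The sketch below assumes that the error corresponds to more than one opset_import entry in the model, which this document does not confirm. The helper names and the "custom" domain are hypothetical, and the stub objects only imitate the domain and version fields of an ONNX model.

```python
from types import SimpleNamespace


def opset_domains(model):
    """Map each opset_import domain of a ModelProto-like object to its version."""
    # The default ONNX domain is stored as an empty string, an alias of "ai.onnx".
    return {(entry.domain or "ai.onnx"): entry.version for entry in model.opset_import}


def has_extra_domains(model):
    """True if the model declares any opset domain besides the default one."""
    return any(domain != "ai.onnx" for domain in opset_domains(model))


# Stand-in for a loaded model; "custom" is a hypothetical domain name.
stub = SimpleNamespace(opset_import=[
    SimpleNamespace(domain="", version=11),
    SimpleNamespace(domain="custom", version=1),
])
print(opset_domains(stub), has_extra_domains(stub))
```

With the onnx package installed, the same helpers can be applied to a real file by passing onnx.load("msda.onnx") instead of the stub.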

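ONNX to om conversion

Use ATC to convert the ONNX model to an om model. The environment variables must be set before running it. The environment variables and the conversion command are shown below. Replace xxx with the installation location reported by pip3 show mx_driving; the value for soc_version can be looked up with npu-smi info: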

pip3 show mx_driving
export ASCEND_CUSTOM_OPP_PATH=xxx/site-packages/mx_driving/packages/vendors/customize/
export LD_LIBRARY_PATH=xxx/site-packages/mx_driving/packages/vendors/customize/op_api/lib/:$LD_LIBRARY_PATH
atc --framework 5 --output msda --soc_version Ascend910B2 --model msda.onnx --op_select_implmode high_precision --precision_mode must_keep_origin_dtype --log debug

If the output contains ATC run success, welcome to the next use. the om model was converted successfully.
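
The xxx prefix in the two export lines depends on where mx_driving is installed. As a minimal sketch, assuming mx_driving is importable from the current Python interpreter, the lines can be generated rather than typed by hand. The helper names are hypothetical; the packages/vendors/customize sub-path is taken from the commands above.

```python
import importlib.util
import posixpath


def custom_opp_exports(package_dir):
    """Build the two shell export lines for a given mx_driving package directory."""
    vendor_dir = posixpath.join(package_dir, "packages", "vendors", "customize")
    return [
        f"export ASCEND_CUSTOM_OPP_PATH={vendor_dir}/",
        f"export LD_LIBRARY_PATH={vendor_dir}/op_api/lib/:$LD_LIBRARY_PATH",
    ]


def find_package_dir(name="mx_driving"):
    """Return the directory of an installed package without importing it, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


package_dir = find_package_dir()
if package_dir is None:
    print("mx_driving is not importable from this interpreter")
else:
    print("\n".join(custom_opp_exports(package_dir)))
```

The printed lines can be pasted into the shell before running atc.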

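Input data generation

Build the input files to match the inputs used when the ONNX model was exported. The generation script is as follows: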

import os

import numpy as np
import torch
import torch_npu

bs, num_levels, num_heads, num_points, num_queries, embed_dims = 2, 1, 8, 4, 40000, 32
shapes = torch.tensor([[200, 200] * num_levels]).reshape(num_levels, 2).long()

num_keys = sum((H * W).item() for H, W in shapes)
level_start_index = torch.cat((shapes.new_zeros((1, )), shapes.prod(1).cumsum(0)[:-1])).long()

input1 = np.random.rand(bs, num_keys, num_heads, embed_dims).astype(np.float16)
input2 = shapes.numpy().astype(np.int64)
input3 = level_start_index.numpy().astype(np.int64)
input4 = np.random.rand(bs, num_queries, num_heads, num_levels, num_points, 2).astype(np.float16)
input5 = np.random.rand(bs, num_queries, num_heads, num_levels, num_points).astype(np.float16)

# Generate the input .bin files; tofile does not create the directory, so create it first
os.makedirs("./inputs", exist_ok=True)
input1.tofile("./inputs/input1.bin")
input2.tofile("./inputs/input2.bin")
input3.tofile("./inputs/input3.bin")
input4.tofile("./inputs/input4.bin")
input5.tofile("./inputs/input5.bin")

After the script runs, the model's input files are generated in the inputs folder.
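
Because the .bin files are raw buffers without shape or dtype headers, a quick sanity check is to compare each file size with the size implied by its shape and dtype. The sketch below computes the expected sizes for the configuration used in this example. The helper name and the EXPECTED table are hypothetical.

```python
import numpy as np

bs, num_levels, num_heads, num_points, num_queries, embed_dims = 2, 1, 8, 4, 40000, 32
num_keys = 200 * 200 * num_levels  # every level is 200 x 200 in this example

# File name -> (shape, dtype), mirroring the generation script.
EXPECTED = {
    "input1.bin": ((bs, num_keys, num_heads, embed_dims), np.float16),
    "input2.bin": ((num_levels, 2), np.int64),
    "input3.bin": ((num_levels,), np.int64),
    "input4.bin": ((bs, num_queries, num_heads, num_levels, num_points, 2), np.float16),
    "input5.bin": ((bs, num_queries, num_heads, num_levels, num_points), np.float16),
}


def expected_nbytes(shape, dtype):
    """Number of bytes a raw dump of an array with this shape and dtype occupies."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize


for name, (shape, dtype) in EXPECTED.items():
    print(name, expected_nbytes(shape, dtype))
```

Comparing these numbers with os.path.getsize on the generated files catches a wrong dtype or a mismatched shape before the files are passed to the om model.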

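Running the om model

Clone the repository https://gitee.com/ascend/tools/tree/master/msame and install it following its README.

Then use the msame tool to feed the generated input files to the om model and run it: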

./msame --model ./msda.om --input ./inputs/input1.bin,./inputs/input2.bin,./inputs/input3.bin,./inputs/input4.bin,./inputs/input5.bin --output ./msame/out/ --outfmt BIN --loop 1

After the command runs, the inference results of the om model are written under the ./msame/out folder.
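
The exact location of the result file inside ./msame/out can vary between runs. The sketch below assumes the result may sit in a subdirectory of the output folder and simply searches for the most recently modified msda_output_0.bin. The helper name is hypothetical.

```python
from pathlib import Path


def find_latest_output(out_dir, filename="msda_output_0.bin"):
    """Return the newest file called `filename` anywhere under out_dir, or None."""
    root = Path(out_dir)
    if not root.is_dir():
        return None
    # Search recursively and order the matches by modification time.
    matches = sorted(root.rglob(filename), key=lambda p: p.stat().st_mtime)
    return matches[-1] if matches else None


print(find_latest_output("./msame/out"))
```

The returned path can then be passed to np.fromfile in the verification step.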

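Accuracy verification

This part checks whether the inference results of the om model match the results of the single operator. Any tool can be used for the check; sample code is as follows: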

import numpy as np
import torch, torch_npu
import mx_driving
from mx_driving import multi_scale_deformable_attn

bs, num_levels, num_heads, num_points, num_queries, embed_dims = 2, 1, 8, 4, 40000, 32

input1 = np.fromfile("./inputs/input1.bin", dtype=np.float16).reshape(bs, -1, num_heads, embed_dims)
input2 = np.fromfile("./inputs/input2.bin", dtype=np.int64).reshape(num_levels, 2)
input3 = np.fromfile("./inputs/input3.bin", dtype=np.int64).reshape(num_levels)
input4 = np.fromfile("./inputs/input4.bin", dtype=np.float16).reshape(bs, num_queries, num_heads, num_levels, num_points, 2)
input5 = np.fromfile("./inputs/input5.bin", dtype=np.float16).reshape(bs, num_queries, num_heads, num_levels, num_points)

input1_npu = torch.from_numpy(input1).npu()
input2_npu = torch.from_numpy(input2).npu()
input3_npu = torch.from_numpy(input3).npu()
input4_npu = torch.from_numpy(input4).npu()
input5_npu = torch.from_numpy(input5).npu()

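# Reference result from running the operator directly on the NPU
golden = multi_scale_deformable_attn(input1_npu, input2_npu, input3_npu, input4_npu, input5_npu)
# om result produced by msame; adjust the path to where msda_output_0.bin was written under ./msame/out/
output = torch.from_numpy(np.fromfile("msda_output_0.bin", dtype=np.float16).reshape(bs, num_queries, num_heads * embed_dims))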

print(golden)
print(output)
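
Printing the two tensors only allows a visual check. As a minimal sketch of a numerical comparison, the helper below reports the maximum absolute error and the fraction of mismatching elements. The function name and the tolerances rtol=1e-3 and atol=1e-3 are assumptions chosen for float16 data, not values prescribed by this repository.

```python
import numpy as np


def compare(golden, output, rtol=1e-3, atol=1e-3):
    """Compare two arrays in float32 and summarise the difference."""
    g = np.asarray(golden, dtype=np.float32)
    o = np.asarray(output, dtype=np.float32).reshape(g.shape)
    abs_err = np.abs(g - o)
    return {
        "max_abs_err": float(abs_err.max()),
        # Same criterion as np.allclose: |o - g| <= atol + rtol * |g|
        "mismatch_ratio": float(np.mean(abs_err > atol + rtol * np.abs(g))),
        "allclose": bool(np.allclose(o, g, rtol=rtol, atol=atol)),
    }


# Synthetic data standing in for the real tensors.
rng = np.random.default_rng(0)
reference = rng.random((2, 8, 4)).astype(np.float16)
print(compare(reference, reference))
```

For the tensors in the sample code, the call would be compare(golden.cpu().numpy(), output.numpy()).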

FAQ

No parser is register for Op

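Possible cause 1: an mx_driving package built inside Docker may not include the ONNX plugin, and protoc needs to be built. Build and install it by following the Driving SDK installation guide.

Possible cause 2: the mx_driving environment variables from the ONNX to om conversion step were not set, so the operator could not be found during conversion and an error was reported.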
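
For possible cause 2, a quick way to rule out a missing setting is to inspect the environment of the shell that runs atc. The sketch below only checks that both variables mention the mx_driving custom operator directory used in the conversion step. The helper name and messages are hypothetical, and passing the check does not guarantee that the operator package itself is complete.

```python
import os

# Sub-path taken from the export commands in the conversion step.
REQUIRED_SUBPATH = "mx_driving/packages/vendors/customize"


def check_custom_opp_env(env=None):
    """Return a list of problems found in the ATC-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    if REQUIRED_SUBPATH not in env.get("ASCEND_CUSTOM_OPP_PATH", ""):
        problems.append("ASCEND_CUSTOM_OPP_PATH does not point at the mx_driving custom operator package")
    if f"{REQUIRED_SUBPATH}/op_api/lib" not in env.get("LD_LIBRARY_PATH", ""):
        problems.append("LD_LIBRARY_PATH does not include the mx_driving op_api/lib directory")
    return problems


for problem in check_custom_opp_env():
    print(problem)
```

An empty result means both variables are set as in the conversion step.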

Can not find Node xxx custom infer_datatype func

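Cause: currently only some operators in the repository implement inferShape and inferDtype. Refer to the IMPL_OP_INFERSHAPE-related lines in the MSDA operator's cpp file under kernels/multi_scale_deformable_attn/op_host/ and add similar logic to resolve the problem.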

Optype xxx of ops kernel is unsupported

Cause: the given input types do not match the input types supported by the operator. Check the input types of the model.

The model has 2 domain_version fields, but only one is allowed

Cause: domain conversion was not performed. See the domain conversion example above.