{
"cells": [
{
"cell_type": "markdown",
"id": "ab6342ba-9185-4806-b202-6009984e8c8d",
"metadata": {},
"source": [
"# 快速入门-基于CANN快速跑通第一个自定义算子\n",
"本快速入门指南主要介绍Kernel直调工程下Ascend C算子开发的基础流程,核心内容及步骤如下:\n",
"\n",
"1. 环境准备:配置程序运行所需的环境变量\n",
"\n",
"2. 算子分析:梳理目标算子的原型定义\n",
"\n",
"3. 核函数开发:实现算子核心功能代码\n",
"\n",
"4. 算子调用:开发算子调用逻辑代码\n",
"\n",
"5. 编译运行:编译算子与调用代码并执行,完成功能验证\n",
"\n",
"Kernel 直调工程的核心优势为**算子实现与调用代码在同一源文件中**,可快速完成算子的开发与调用测试,是开发者高效开展算子原型开发、功能快速验证的实用利器。本次快速入门的所有开发工作,均基于Sources/add.asc文件开展。\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "219aa56f",
"metadata": {},
"source": [
"## 1. 环境准备\n",
"正式开始快速入门之前,先要对jupyter环境进行初始化。以下代码完成了初始化并将环境中的变量导入jupyter环境,同时完成了代码目录的创建。保证能正常导入代码以及使用bisheng编译器,完成算子的开发及编译。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8edee15e-4626-4d53-9fc5-834e7492806b",
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p Sources\n",
"\n",
"import os, subprocess\n",
"env = subprocess.check_output(\"bash -l -c 'source $ASCEND_TOOLKIT_HOME/set_env.sh && env'\", shell=True, text=True)\n",
"for line in env.splitlines():\n",
" if \"=\" in line: os.environ.__setitem__(*line.split(\"=\", 1))\n",
"print(\"\\n🎉 Environment initialization process completed successfully!\")"
]
},
{
"cell_type": "markdown",
"id": "2333a756-f74e-447a-9551-4017a8cc1d39",
"metadata": {},
"source": [
"---\n",
"\n",
"## 2. 算子分析\n",
"快速入门场景,基于Add算子为例进行自定义算子开发全流程介绍,为简化学习流程,该自定义算子存在以下特点。\n",
"\n",
"- 输入shape固定为(8,2048)\n",
"\n",
"- 输入类型仅支持float\n",
"\n",
"- 固定使用8个核\n",
"\n",
"基于上述条件,算子设计规格如下所示:\n",
"\n",
"<table style=\"float: left; border-collapse: collapse; margin: 0 10px 10px 0; font-size: 14px;\">\n",
"<tr style=\"background: #f0f0f0;\">\n",
" <td align=\"center\">算子类型(OpType)</td>\n",
" <td colspan=\"4\" align=\"center\">Add</td>\n",
"</tr>\n",
"<tr>\n",
" <td rowspan=\"3\" align=\"center\">算子输入</td>\n",
" <td align=\"center\">name</td>\n",
" <td align=\"center\">shape</td>\n",
" <td align=\"center\">data type</td>\n",
" <td align=\"center\">format</td>\n",
"</tr>\n",
"<tr>\n",
" <td align=\"center\">x</td>\n",
" <td align=\"center\">(8, 2048)</td>\n",
" <td align=\"center\">float</td>\n",
" <td align=\"center\">ND</td>\n",
"</tr>\n",
"<tr>\n",
" <td align=\"center\">y</td>\n",
" <td align=\"center\">(8, 2048)</td>\n",
" <td align=\"center\">float</td>\n",
" <td align=\"center\">ND</td>\n",
"</tr>\n",
"<tr>\n",
" <td rowspan=\"1\" align=\"center\">算子输出</td>\n",
" <td align=\"center\">z</td>\n",
" <td align=\"center\">(8, 2048)</td>\n",
" <td align=\"center\">float</td>\n",
" <td align=\"center\">ND</td>\n",
"</tr>\n",
"<tr>\n",
" <td rowspan=\"1\" align=\"center\">核函数名</td>\n",
" <td colspan=\"4\" align=\"center\">add</td>\n",
"</tr>\n",
"<tr>\n",
" <td rowspan=\"1\" align=\"center\">使用核数</td>\n",
" <td colspan=\"4\" align=\"center\">8</td>\n",
"</tr>\n",
"</table>\n"
]
},
{
"cell_type": "markdown",
"id": "b1a8ec6f-1290-4e99-ad8a-1b02d3ae885b",
"metadata": {},
"source": [
"---\n",
"\n",
"## 3. 核函数开发\n",
"本样例中使用固定8个核并行计算,即把数据进行分片,分配到多个核上进行处理。 Ascend C核函数是在一个核上的处理函数,所以只处理部分数据。 \n",
"\n",
"分配方案是:数据整体长度为8 * 2048个元素,平均分配到8个核上运行,每个核上处理的数据大小为2048个元素。对于单核上的处理数据,也可以进行数据切块,实现对数据的流水并行处理。\n"
]
},
{
"cell_type": "markdown",
"id": "0fbbcc08-003a-4603-a081-cadccc95b1e9",
"metadata": {},
"source": [
"### 3.1 头文件引入\n",
"进行算子开发时,首先要在add.asc源文件中导入必要的头文件。这些头文件都是固定头文件,在进行其它自定义算子开发时可直接复用。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f642b9e2-906c-4628-acf9-9a1f479d6e54",
"metadata": {},
"outputs": [],
"source": [
"%%writefile Sources/add.asc\n",
"\n",
"#include <cstdint>\n",
"#include <iostream>\n",
"#include <vector>\n",
"#include <algorithm>\n",
"#include <iterator>\n",
"#include \"acl/acl.h\"\n",
"#include \"kernel_operator.h\""
]
},
{
"cell_type": "markdown",
"id": "dbdf5b3e-a00c-475d-95ff-fa5fde97f9c1",
"metadata": {},
"source": [
"### 3.2 定义BufferNum\n",
"由于AI Core上,矢量计算CopyIn、CopyOut过程使用MTE指令队列(MTE2、MTE3),Compute过程使用Vector指令队列(V),意味着CopyIn、CopyOut过程和Compute过程是可以并行的。\n",
"\n",
"DoubleBuffer机制将待处理的数据一分为二,比如Tensor1、Tensor2。如下图所示,当Vector对Tensor1中数据进行Compute时,Tensor2可以执行CopyIn的过程;而当Vector切换到计算Tensor2时,Tensor1可以执行CopyOut的过程。由此,数据的进出搬运和Vector计算实现并行执行。 \n",
"\n",
"<img src=\"./images/double_buffer.png\" alt=\"double_buffer\" width=\"300px\" >\n",
"\n",
"使用Double Buffer的简单代码示例如下:\n",
"\n",
"```\n",
"pipe.InitBuffer(inQueueX, 2, 256);\n",
"```\n",
"\n",
"而在实际代码中,一般会用宏进行替换,实际代码如下。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79923704-50a6-4b78-b952-e89214c10b5e",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"constexpr uint32_t BUFFER_NUM = 2; // tensor num for each queue"
]
},
{
"cell_type": "markdown",
"id": "d0f9d053-4a1e-44e8-9d2b-15969462fcce",
"metadata": {},
"source": [
"### 3.3 tiling结构体创建\n",
"根据分配方案,需要定义一个结构体,用于保存并行数据切分相关的参数。结构体可由开发者自行定义,但需要注意以下两点:\n",
"\n",
"1. 结构体命名需与算子配套,保证代码可读性\n",
"\n",
"2. 结构体只定义必要参数,若存在大量冗余定义,会由于结构体下发速度慢从而导致算子性能下降。\n",
"\n",
"本入门体验中,结构体命名为AddTilingData,该结构体定义了如下两个参数:\n",
"\n",
"- totalLength:指待处理的数据总大小,本入门体验中固定为(8 * 2048)个元素\n",
"\n",
"- tileNum:指每个核需要计算的数据块个数。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e988cab4-bbc1-4be4-a126-dc94b54547d9",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"struct AddCustomTilingData\n",
"{\n",
" uint32_t totalLength;\n",
" uint32_t tileNum;\n",
"};"
]
},
{
"cell_type": "markdown",
"id": "45c9a236-9fc0-47bc-b647-2f50bc6b832b",
"metadata": {},
"source": [
"### 3.4 进行核函数的定义\n",
"核函数(Kernel Function)是Ascend C算子设备侧实现的入口。Ascend C允许用户使用C/C++函数的语法扩展来编写设备端的运行代码,用户在核函数中进行数据访问和计算操作,由此实现该算子的所有功能。区别于普通的C++函数调用时仅执行一次,当核函数被调用时,多个核都执行相同的核函数代码,具有相同的函数入参,并行执行。其定义要求如下:\n",
"\n",
"- 使用__global__函数类型限定符来标识它是一个核函数。\n",
"\n",
"- 使用__aicore__函数类型限定符来标识该核函数在设备端AI Core上执行。\n",
"\n",
"- 指针入参变量需要增加变量类型限定符__gm__,表明该指针变量指向Global Memory上某处内存地址。\n",
"\n",
"- 核函数必须具有void返回类型。\n",
"\n",
"- 仅支持入参为指针或C/C++内置数据类型(Primitive data types),如:half* s0、float* s1、int32_t c。\n",
"\n",
"- 为了统一表达,建议使用GM_ADDR宏来修饰入参,其为是编译器中自带的宏,代表的含义为Global Memory中的地址,其定义如下: \n",
" ```\n",
" #define GM_ADDR __gm__ uint8_t*\n",
" ```\n",
"\n",
"示例如下:\n",
"```\n",
"extern \"C\" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)\n",
"```\n",
"\n",
"在本次体验中,由于涉及tiling结构体,所以核函数中增加了一个tiling结构体参数,并在核函数中调用算子类的Init和Process函数,算子类实现在后续步骤中介绍。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e2534e9-7c3e-40fe-8cb5-08f8105bf327",
"metadata": {},
"outputs": [],
"source": [
"%%writefile Sources/add_bak.asc\n",
"\n",
"__global__ __aicore__ void add(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)\n",
"{\n",
" KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置Kernel类型为Vector核(用于矢量计算)\n",
" KernelAdd op;\n",
" op.Init(x, y, z, tiling.totalLength, tiling.tileNum);\n",
" op.Process();\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "38278513-8ef2-4c6f-8d2c-36b73c503cd8",
"metadata": {},
"source": [
"### 3.5 创建核函数类\n",
"矢量编程范式把算子的实现流程分为3个基本任务:CopyIn,Compute,CopyOut。\n",
"\n",
"- **CopyIn**:将输入数据从Global Memory搬运到Local Memory,完成搬运后执行入队列操作;\n",
"\n",
"- **Compute**:完成队列出队后,从Local Memory获取数据并计算,计算完成后执行入队操作;\n",
"\n",
"- **CopyOut**:完成队列出队后,将计算结果从Local Memory搬运到Global Memory。\n",
"\n",
"<img src=\"./images/vector_programming_paradigm.png\" alt=\"vector_programming_paradigm\" width=\"700px\" >\n",
"\n",
"根据矢量编程范式实现算子类,本样例中定义KernelAdd算子类,其具体成员如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d30353b7-9037-4a48-a80b-014872bc2b0f",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"class KernelAdd {\n",
"public:\n",
" __aicore__ inline KernelAdd(){}\n",
" // 初始化函数,完成内存初始化相关操作\n",
" __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum);\n",
" // 核心处理函数,实现算子逻辑,调用私有成员函数CopyIn、Compute、CopyOut完成矢量算子的三级流水操作\n",
" __aicore__ inline void Process();\n",
"\n",
"private:\n",
" // 搬入函数,从Global Memory搬运数据至Local Memory,被核心Process函数调用\n",
" __aicore__ inline void CopyIn(int32_t progress);\n",
" // 计算函数,完成两个输入参数相加,得到最终结果,被核心Process函数调用\n",
" __aicore__ inline void Compute(int32_t progress);\n",
" // 搬出函数,将最终结果从Local Memory搬运到Global Memory上,被核心Process函数调用\n",
" __aicore__ inline void CopyOut(int32_t progress);\n",
"\n",
"private:\n",
" AscendC::TPipe pipe; // TPipe内存管理对象\n",
" AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueX, inQueueY; // 输入数据Queue队列管理对象,TPosition为VECIN\n",
" AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM> outQueueZ; // 输出数据Queue队列管理对象,TPosition为VECOUT\n",
" AscendC::GlobalTensor<float> xGm; // 管理输入输出Global Memory内存地址的对象,其中xGm, yGm为输入,zGm为输出\n",
" AscendC::GlobalTensor<float> yGm;\n",
" AscendC::GlobalTensor<float> zGm;\n",
" uint32_t blockLength; // 每个核的计算数据长度\n",
" uint32_t tileNum; // 每个核需要计算的数据块个数\n",
" uint32_t tileLength; // 每个核内每个数据块的长度\n",
"};"
]
},
{
"cell_type": "markdown",
"id": "fabcedda-4d1f-4939-850c-5a09ad1270a1",
"metadata": {},
"source": [
"内部函数的调用关系如下所示:\n",
"\n",
"<img src=\"./images/calling_relationship.png\" alt=\"calling_relationship\" width=\"700px\" >\n",
"由此可见除了Init函数完成初始化外,Process中完成了对流水任务“搬入、计算、搬出”的调用,开发者可以重点关注三个流水任务的实现。"
]
},
{
"cell_type": "markdown",
"id": "d2c6d8b0-a2d0-4c8f-9321-6b382d9020a7",
"metadata": {},
"source": [
"### 3.6 init函数实现\n",
"初始化函数Init主要完成以下内容:\n",
"\n",
"- 设置输入输出Global Tensor的Global Memory内存地址。\n",
"\n",
"- 通过TPipe内存管理对象为输入输出Queue分配内存。 \n",
"\n",
"本算子中,将数据切分成8块,平均分配到8个核上运行,每个核上处理的数据大小为2048个元素。通过将每个核上处理的数据地址在起始地址上增加GetBlockIdx() * blockLength(每个block处理的数据长度)的偏移来获取。来实现多核并行计算的数据切分。\n",
"\n",
"以输入x为例,x + blockLength * GetBlockIdx()即为单核处理程序中x在Global Memory上的内存偏移地址,获取偏移地址后,使用GlobalTensor类的SetGlobalBuffer接口设定该核上Global Memory的起始地址以及长度。具体示意图如下。\n",
"\n",
"<img src=\"./images/inter_kernel_partition.png\" alt=\"inter_kernel_partition\" width=\"700px\" >\n",
"\n",
"对于单核上的处理数据,可以进行数据切块(Tiling),在本示例中,仅作为参考,将数据切分成8块(并不意味着8块就是性能最优)。切分后的每个数据块再次切分成2块,即可开启double buffer,实现流水线之间的并行。\n",
"\n",
"这样单核上的数据(2048个数)被切分成16块,每块tileLength(128)个数据。TPipe为inQueueX分配了两块大小为tileLength * sizeof(float)个字节的内存块,每个内存块能容纳tileLength(128)个float类型数据。数据切分示意图如下。\n",
"\n",
"<img src=\"./images/intra_kernel_partition.png\" alt=\"intra_kernel_partition\" width=\"700px\" >\n",
"\n",
"具体的初始化函数代码如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a264a25-cace-448a-a5b7-4c424b8acde1",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"__aicore__ inline void KernelAdd::Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)\n",
"{\n",
" \n",
" this->blockLength = totalLength / AscendC::GetBlockNum(); // length computed of each core\n",
" this->tileNum = tileNum; // split data into 8 tiles for each core\n",
" this->tileLength = this->blockLength / tileNum / BUFFER_NUM; // separate to 2 parts, due to double buffer\n",
" // get start index for current core, core parallel\n",
" xGm.SetGlobalBuffer((__gm__ float *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);\n",
" yGm.SetGlobalBuffer((__gm__ float *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);\n",
" zGm.SetGlobalBuffer((__gm__ float *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);\n",
" // pipe alloc memory to queue, the unit is Bytes\n",
" pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(float));\n",
" pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(float));\n",
" pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(float));\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "997ca20f-b51d-4340-8635-7df53f63e167",
"metadata": {},
"source": [
"### 3.7 process函数实现\n",
"基于矢量编程范式,将核函数的实现分为3个基本任务:CopyIn,Compute,CopyOut。Process函数中通过如下方式调用这三个函数。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b94774be-580e-4160-ac49-ff6eac065c13",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"__aicore__ inline void KernelAdd::Process()\n",
"{\n",
" // loop count need to be doubled, due to double buffer\n",
" int32_t loopCount = this->tileNum * BUFFER_NUM;\n",
" // tiling strategy, pipeline parallel\n",
" for (int32_t i = 0; i < loopCount; i++) {\n",
" CopyIn(i);\n",
" Compute(i);\n",
" CopyOut(i);\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "d5d92b25-894a-4321-a9c6-349056de2cdf",
"metadata": {},
"source": [
"### 3.8 CopyIn函数实现\n",
"CopyIn函数的核心作用是将NPU端的数据拷贝至AICore,具体执行流程如下:\n",
"\n",
"- 使用DataCopy接口将GlobalTensor数据拷贝到LocalTensor。\n",
"\n",
"- 使用EnQue将LocalTensor放入VecIn的Queue中。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2204d1bb-e45f-4631-ae4c-e828048e7c43",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"__aicore__ inline void KernelAdd::CopyIn( int32_t progress)\n",
"{\n",
" // alloc tensor from queue memory\n",
" AscendC::LocalTensor<float> xLocal = inQueueX.AllocTensor<float>();\n",
" AscendC::LocalTensor<float> yLocal = inQueueY.AllocTensor<float>();\n",
" // copy progress_th tile from global tensor to local tensor\n",
" AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);\n",
" AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);\n",
" // enque input tensors to VECIN queue\n",
" inQueueX.EnQue(xLocal);\n",
" inQueueY.EnQue(yLocal);\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "de25efb8-ce99-4bb3-8eea-3929929dfe9e",
"metadata": {},
"source": [
"### 3.9 Compute函数实现\n",
"Compute为核心计算函数,通过调用Ascend C接口,对已搬入AICore的数据执行实际计算操作,具体流程如下:\n",
"\n",
"- 使用DeQue从VecIn中取出LocalTensor。\n",
"\n",
"- 使用Ascend C接口Add完成矢量计算。\n",
"\n",
"- 使用EnQue将计算结果LocalTensor放入到VecOut的Queue中。\n",
"\n",
"- 使用FreeTensor将释放不再使用的LocalTensor。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d1a7922-23ab-4045-8d71-185c1b3303ac",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"__aicore__ inline void KernelAdd::Compute(int32_t progress)\n",
"{\n",
" // deque input tensors from VECIN queue\n",
" AscendC::LocalTensor<float> xLocal = inQueueX.DeQue<float>();\n",
" AscendC::LocalTensor<float> yLocal = inQueueY.DeQue<float>();\n",
" AscendC::LocalTensor<float> zLocal = outQueueZ.AllocTensor<float>();\n",
" // call Add instr for computation\n",
" AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);\n",
" // enque the output tensor to VECOUT queue\n",
" outQueueZ.EnQue<float>(zLocal);\n",
" // free input tensors for reuse\n",
" inQueueX.FreeTensor(xLocal);\n",
" inQueueY.FreeTensor(yLocal);\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "b897c7fb-f8fb-40c9-a629-bb502a85a5ca",
"metadata": {},
"source": [
"### 3.10 CopyOut函数实现。\n",
"CopyOut 函数用于将 AICore 中完成计算的结果数据搬出,具体流程如下:\n",
"\n",
"- 使用DeQue接口从VecOut的Queue中取出LocalTensor。\n",
"\n",
"- 使用DataCopy接口将LocalTensor拷贝到GlobalTensor上。\n",
"\n",
"- 使用FreeTensor将不再使用的LocalTensor进行回收。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "468984a6-6afb-42da-8dc5-4140de449642",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"__aicore__ inline void KernelAdd::CopyOut(int32_t progress)\n",
"{\n",
" // deque output tensor from VECOUT queue\n",
" AscendC::LocalTensor<float> zLocal = outQueueZ.DeQue<float>();\n",
" // copy progress_th tile from local tensor to global tensor\n",
" AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);\n",
" // free output tensor for reuse\n",
" outQueueZ.FreeTensor(zLocal);\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "430b9592-f1f3-4c0c-bb0c-3a3cbf6c2194",
"metadata": {},
"source": [
"### 3.11 重新写入核函数\n",
"由于核函数会调用实现类中的函数,所以需要将2.4中暂存于add_bak.asc中的核函数代码写入add.asc"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "53bef99b-65e5-437b-bc51-b200ab0c6b6b",
"metadata": {},
"outputs": [],
"source": [
"!cat Sources/add_bak.asc >> Sources/add.asc\n",
"!rm Sources/add_bak.asc"
]
},
{
"cell_type": "markdown",
"id": "bc0ad978-f7ee-47e9-a18c-64fc8b61637d",
"metadata": {},
"source": [
"---\n",
"\n",
"## 4. 算子调用\n",
"完成Kernel侧核函数开发后,即可编写Host侧的核函数调用程序。实现从Host侧的APP程序调用算子,执行计算过程。本算子中主要分为以下三个部分。\n",
"\n",
"1. 核函数调用:通过<<<...>>>内核调用符进行算子调用\n",
"\n",
"2. 计算结果比对:比对golden数据核实际输出,验证算子精度\n",
"\n",
"3. 算子验证主程序:生成输入及golden数据,用来进行算子验证"
]
},
{
"cell_type": "markdown",
"id": "3fde0fb4-9990-4945-ae9d-c42cccb6dc7a",
"metadata": {},
"source": [
"### 4.1 核函数调用\n",
"常见的函数调用方式是如下的形式:\n",
"\n",
"```\n",
"function_name(argument list);\n",
"```\n",
"\n",
"核函数使用内核调用符<<<...>>>这种语法形式,来规定核函数的执行配置:\n",
"\n",
"```\n",
"kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);\n",
"```\n",
"\n",
"执行配置由3个参数决定:\n",
"\n",
"- **blockDim**: 规定了核函数将会在几个核上执行。每个执行该核函数的核会被分配一个逻辑ID,即block_idx,可以在核函数的实现中调用GetBlockIdx来获取block_idx;\n",
"\n",
"- **l2ctrl**: 保留参数,暂时设置为固定值nullptr,开发者无需关注;\n",
"\n",
"- **stream**: 类型为aclrtStream,stream用于维护一些异步操作的执行顺序,确保按照应用程序中的代码调用顺序在device上执行。\n",
"\n",
"如下名为add_custom的核函数,实现两个矢量的相加,调用示例如下:\n",
"\n",
"```\n",
"// blockDim设置为8表示在8个核上调用了add_custom核函数,每个核都会独立且并行地执行该核函数,该核函数的参数列表为x,y,z。\n",
"add_custom<<<8, nullptr, stream>>>(x, y, z);\n",
"``` \n",
"核函数的调用是异步的,核函数的调用结束后,控制权立刻返回给主机端,可以调用以下aclrtSynchronizeStream函数来强制主机端程序等待所有核函数执行完毕。\n",
"\n",
"```\n",
"aclError aclrtSynchronizeStream(aclrtStream stream);\n",
"```\n",
"\n",
"在整个调用过程中,需要申请Host和Device内存,在Host侧读入数据后,将数据拷贝到Device测,进而调用<<<..>>>进行算子执行,其逻辑流程图如下所示:\n",
"\n",
"<img src=\"./images/kernel_function_call_code_logic_block_diagram.png\" alt=\"kernel_function_call_code_logic_block_diagram\" width=\"200px\" >\n",
"\n",
"本样例实现如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "548a071d-01ec-49fe-a4f5-494332b0907c",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"std::vector<float> kernel_add(std::vector<float> &x, std::vector<float> &y)\n",
"{\n",
" constexpr uint32_t blockDim = 8;\n",
" uint32_t totalLength = x.size();\n",
" size_t totalByteSize = totalLength * sizeof(float);\n",
" int32_t deviceId = 0;\n",
" aclrtStream stream = nullptr;\n",
" AddCustomTilingData tiling = {/*totalLength:*/totalLength, /*tileNum:*/8};\n",
" uint8_t *xHost = reinterpret_cast<uint8_t *>(x.data());\n",
" uint8_t *yHost = reinterpret_cast<uint8_t *>(y.data());\n",
" uint8_t *zHost = nullptr;\n",
" uint8_t *xDevice = nullptr;\n",
" uint8_t *yDevice = nullptr;\n",
" uint8_t *zDevice = nullptr;\n",
"\n",
" // 初始化\n",
" aclInit(nullptr);\n",
" // 运行管理资源申请\n",
" aclrtSetDevice(deviceId);\n",
" aclrtCreateStream(&stream);\n",
" // 分配Host内存\n",
" aclrtMallocHost((void **)(&zHost), totalByteSize);\n",
" // 分配Device内存\n",
" aclrtMalloc((void **)&xDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);\n",
" aclrtMalloc((void **)&yDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);\n",
" aclrtMalloc((void **)&zDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);\n",
" // 将Host上的输入数据拷贝到Device侧\n",
" aclrtMemcpy(xDevice, totalByteSize, xHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);\n",
" aclrtMemcpy(yDevice, totalByteSize, yHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);\n",
" // 用内核调用符<<<...>>>调用核函数完成指定的运算\n",
" add<<<blockDim, nullptr, stream>>>(xDevice, yDevice, zDevice, tiling);\n",
" aclrtSynchronizeStream(stream);\n",
" // 将Device上的运算结果拷贝回Host\n",
" aclrtMemcpy(zHost, totalByteSize, zDevice, totalByteSize, ACL_MEMCPY_DEVICE_TO_HOST);\n",
" std::vector<float> z((float *)zHost, (float *)(zHost + totalLength));\n",
" // 释放申请的资源\n",
" aclrtFree(xDevice);\n",
" aclrtFree(yDevice);\n",
" aclrtFree(zDevice);\n",
" aclrtFreeHost(zHost);\n",
" // 去初始化\n",
" aclrtDestroyStream(stream);\n",
" aclrtResetDevice(deviceId);\n",
" aclFinalize();\n",
" return z;\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "191f5434-45f4-4563-acc6-2a5835e2e72e",
"metadata": {},
"source": [
"在实现其他算子时,这里大部分代码均可复用,需要修改的点如下:\n",
"\n",
"1. 需要根据实际输入及输出个数,调整Host核Device的内存申请、拷贝、释放以及<<<...>>>中的参数。\n",
"\n",
"2. 需要根据实际要使用的核数,定义blockDim。\n",
"\n",
"3. 需要根据实际的分块策略定义tileNum。"
]
},
{
"cell_type": "markdown",
"id": "9654cc03-9f4e-4eec-808a-6f2fbee17f5a",
"metadata": {},
"source": [
"### 4.2 计算结果比对\n",
"传入实际输出与golden值,进行一一比对,若不相同,则证明存在精度问题。本样例中会同时将output和golden中前20个数打印出来进行查看。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b4c8b79-6c2a-4536-a1ac-b87a902905bb",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"uint32_t VerifyResult(std::vector<float> &output, std::vector<float> &golden)\n",
"{\n",
" auto printTensor = [](std::vector<float> &tensor, const char *name) {\n",
" constexpr size_t maxPrintSize = 20;\n",
" std::cout << name << \": \";\n",
" std::copy(tensor.begin(), tensor.begin() + std::min(tensor.size(), maxPrintSize),\n",
" std::ostream_iterator<float>(std::cout, \" \"));\n",
" if (tensor.size() > maxPrintSize) {\n",
" std::cout << \"...\";\n",
" }\n",
" std::cout << std::endl;\n",
" };\n",
" printTensor(output, \"Output\");\n",
" printTensor(golden, \"Golden\");\n",
" if (std::equal(output.begin(), output.end(), golden.begin())) {\n",
" std::cout << \"[Success] Case accuracy is verification passed.\" << std::endl;\n",
" return 0;\n",
" } else {\n",
" std::cout << \"[Failed] Case accuracy is verification failed!\" << std::endl;\n",
" return 1;\n",
" }\n",
" return 0;\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "9e0c2d96-846a-4e5b-994b-23ecf3192bb3",
"metadata": {},
"source": [
"### 4.3 算子验证主程序\n",
"定义输入数据长度,并生成实际的golden数据。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "305f6acb-3f1d-4475-8d5e-cb2ea02e9840",
"metadata": {},
"outputs": [],
"source": [
"%%writefile -a Sources/add.asc\n",
"\n",
"int32_t main(int32_t argc, char *argv[])\n",
"{\n",
" constexpr uint32_t totalLength = 8 * 2048;\n",
" constexpr float valueX = 1.2f;\n",
" constexpr float valueY = 2.3f;\n",
" std::vector<float> x(totalLength, valueX);\n",
" std::vector<float> y(totalLength, valueY);\n",
"\n",
" std::vector<float> output = kernel_add(x, y);\n",
"\n",
" std::vector<float> golden(totalLength, valueX + valueY);\n",
" return VerifyResult(output, golden);\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "3129c331-445e-45fd-9be3-4939c340c8b7",
"metadata": {},
"source": [
"---\n",
"\n",
"## 5. 编译运行\n",
"完成所有代码的开发工作后,即可进行编译运行。首先执行以下代码编译可执行文件:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "128c9e15-4579-41aa-bf13-52f5e7e742a7",
"metadata": {},
"outputs": [],
"source": [
"!bisheng Sources/add.asc --npu-arch=dav-2201 -o add"
]
},
{
"cell_type": "markdown",
"id": "e5424895-db51-45a0-be54-9f245b9b39d9",
"metadata": {},
"source": [
"再执行以下代码,进行算子的实际运行。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a62ca22e-c6b8-46d7-845f-15962308b102",
"metadata": {},
"outputs": [],
"source": [
"!./add"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python-3.9.10",
"language": "python",
"name": "python-3.9.10"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}