{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ab6342ba-9185-4806-b202-6009984e8c8d",
   "metadata": {},
   "source": [
    "# 快速入门-基于CANN快速跑通第一个自定义算子\n",
    "本快速入门指南主要介绍Kernel直调工程下Ascend C算子开发的基础流程,核心内容及步骤如下:\n",
    "\n",
    "1. 环境准备:配置程序运行所需的环境变量\n",
    "\n",
    "2. 算子分析:梳理目标算子的原型定义\n",
    "\n",
    "3. 核函数开发:实现算子核心功能代码\n",
    "\n",
    "4. 算子调用:开发算子调用逻辑代码\n",
    "\n",
    "5. 编译运行:编译算子与调用代码并执行,完成功能验证\n",
    "\n",
    "Kernel 直调工程的核心优势为**算子实现与调用代码在同一源文件中**,可快速完成算子的开发与调用测试,是开发者高效开展算子原型开发、功能快速验证的实用利器。本次快速入门的所有开发工作,均基于Sources/add.asc文件开展。\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "219aa56f",
   "metadata": {},
   "source": [
    "## 1. 环境准备\n",
    "正式开始快速入门之前,先要对jupyter环境进行初始化。以下代码完成了初始化并将环境中的变量导入jupyter环境,同时完成了代码目录的创建。保证能正常导入代码以及使用bisheng编译器,完成算子的开发及编译。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8edee15e-4626-4d53-9fc5-834e7492806b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p Sources\n",
    "\n",
    "import os, subprocess\n",
    "env = subprocess.check_output(\"bash -l -c 'source $ASCEND_TOOLKIT_HOME/set_env.sh && env'\", shell=True, text=True)\n",
    "for line in env.splitlines():\n",
    "    if \"=\" in line: os.environ.__setitem__(*line.split(\"=\", 1))\n",
    "print(\"\\n🎉 Environment initialization process completed successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2333a756-f74e-447a-9551-4017a8cc1d39",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2. 算子分析\n",
    "快速入门场景,基于Add算子为例进行自定义算子开发全流程介绍,为简化学习流程,该自定义算子存在以下特点。\n",
    "\n",
    "- 输入shape固定为(8,2048)\n",
    "\n",
    "- 输入类型仅支持float\n",
    "\n",
    "- 固定使用8个核\n",
    "\n",
    "基于上述条件,算子设计规格如下所示:\n",
    "\n",
    "<table style=\"float: left; border-collapse: collapse; margin: 0 10px 10px 0; font-size: 14px;\">\n",
    "<tr style=\"background: #f0f0f0;\">\n",
    "    <td align=\"center\">算子类型(OpType)</td>\n",
    "    <td colspan=\"4\" align=\"center\">Add</td>\n",
    "</tr>\n",
    "<tr>\n",
    "    <td rowspan=\"3\" align=\"center\">算子输入</td>\n",
    "    <td align=\"center\">name</td>\n",
    "    <td align=\"center\">shape</td>\n",
    "    <td align=\"center\">data type</td>\n",
    "    <td align=\"center\">format</td>\n",
    "</tr>\n",
    "<tr>\n",
    "    <td align=\"center\">x</td>\n",
    "    <td align=\"center\">(8, 2048)</td>\n",
    "    <td align=\"center\">float</td>\n",
    "    <td align=\"center\">ND</td>\n",
    "</tr>\n",
    "<tr>\n",
    "    <td align=\"center\">y</td>\n",
    "    <td align=\"center\">(8, 2048)</td>\n",
    "    <td align=\"center\">float</td>\n",
    "    <td align=\"center\">ND</td>\n",
    "</tr>\n",
    "<tr>\n",
    "    <td rowspan=\"1\" align=\"center\">算子输出</td>\n",
    "    <td align=\"center\">z</td>\n",
    "    <td align=\"center\">(8, 2048)</td>\n",
    "    <td align=\"center\">float</td>\n",
    "    <td align=\"center\">ND</td>\n",
    "</tr>\n",
    "<tr>\n",
    "    <td rowspan=\"1\" align=\"center\">核函数名</td>\n",
    "    <td colspan=\"4\" align=\"center\">add</td>\n",
    "</tr>\n",
    "<tr>\n",
    "    <td rowspan=\"1\" align=\"center\">使用核数</td>\n",
    "    <td colspan=\"4\" align=\"center\">8</td>\n",
    "</tr>\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1a8ec6f-1290-4e99-ad8a-1b02d3ae885b",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 3. 核函数开发\n",
    "本样例中使用固定8个核并行计算,即把数据进行分片,分配到多个核上进行处理。 Ascend C核函数是在一个核上的处理函数,所以只处理部分数据。  \n",
    "\n",
    "分配方案是:数据整体长度为8 * 2048个元素,平均分配到8个核上运行,每个核上处理的数据大小为2048个元素。对于单核上的处理数据,也可以进行数据切块,实现对数据的流水并行处理。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fbbcc08-003a-4603-a081-cadccc95b1e9",
   "metadata": {},
   "source": [
    "### 3.1 头文件引入\n",
    "进行算子开发时,首先要在add.asc源文件中导入必要的头文件。这些头文件都是固定头文件,在进行其它自定义算子开发时可直接复用。  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f642b9e2-906c-4628-acf9-9a1f479d6e54",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile Sources/add.asc\n",
    "\n",
    "#include <cstdint>\n",
    "#include <iostream>\n",
    "#include <vector>\n",
    "#include <algorithm>\n",
    "#include <iterator>\n",
    "#include \"acl/acl.h\"\n",
    "#include \"kernel_operator.h\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbdf5b3e-a00c-475d-95ff-fa5fde97f9c1",
   "metadata": {},
   "source": [
    "### 3.2 定义BufferNum\n",
    "由于AI Core上,矢量计算CopyIn、CopyOut过程使用MTE指令队列(MTE2、MTE3),Compute过程使用Vector指令队列(V),意味着CopyIn、CopyOut过程和Compute过程是可以并行的。\n",
    "\n",
    "DoubleBuffer机制将待处理的数据一分为二,比如Tensor1、Tensor2。如下图所示,当Vector对Tensor1中数据进行Compute时,Tensor2可以执行CopyIn的过程;而当Vector切换到计算Tensor2时,Tensor1可以执行CopyOut的过程。由此,数据的进出搬运和Vector计算实现并行执行。 \n",
    "\n",
    "<img src=\"./images/double_buffer.png\" alt=\"double_buffer\"  width=\"300px\" >\n",
    "\n",
    "使用Double Buffer的简单代码示例如下:\n",
    "\n",
    "```\n",
    "pipe.InitBuffer(inQueueX, 2, 256);\n",
    "```\n",
    "\n",
    "而在实际代码中,一般会用宏进行替换,实际代码如下。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79923704-50a6-4b78-b952-e89214c10b5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "constexpr uint32_t BUFFER_NUM = 2; // tensor num for each queue"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0f9d053-4a1e-44e8-9d2b-15969462fcce",
   "metadata": {},
   "source": [
    "### 3.3 tiling结构体创建\n",
    "根据分配方案,需要定义一个结构体,用于保存并行数据切分相关的参数。结构体可由开发者自行定义,但需要注意以下两点:\n",
    "\n",
    "1. 结构体命名需与算子配套,保证代码可读性\n",
    "\n",
    "2. 结构体只定义必要参数,若存在大量冗余定义,会由于结构体下发速度慢从而导致算子性能下降。\n",
    "\n",
    "本入门体验中,结构体命名为AddTilingData,该结构体定义了如下两个参数:\n",
    "\n",
    "- totalLength:指待处理的数据总大小,本入门体验中固定为(8 * 2048)个元素\n",
    "\n",
    "- tileNum:指每个核需要计算的数据块个数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e988cab4-bbc1-4be4-a126-dc94b54547d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "struct AddCustomTilingData\n",
    "{\n",
    "    uint32_t totalLength;\n",
    "    uint32_t tileNum;\n",
    "};"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45c9a236-9fc0-47bc-b647-2f50bc6b832b",
   "metadata": {},
   "source": [
    "### 3.4 进行核函数的定义\n",
    "核函数(Kernel Function)是Ascend C算子设备侧实现的入口。Ascend C允许用户使用C/C++函数的语法扩展来编写设备端的运行代码,用户在核函数中进行数据访问和计算操作,由此实现该算子的所有功能。区别于普通的C++函数调用时仅执行一次,当核函数被调用时,多个核都执行相同的核函数代码,具有相同的函数入参,并行执行。其定义要求如下:\n",
    "\n",
    "- 使用__global__函数类型限定符来标识它是一个核函数。\n",
    "\n",
    "- 使用__aicore__函数类型限定符来标识该核函数在设备端AI Core上执行。\n",
    "\n",
    "- 指针入参变量需要增加变量类型限定符__gm__,表明该指针变量指向Global Memory上某处内存地址。\n",
    "\n",
    "- 核函数必须具有void返回类型。\n",
    "\n",
    "- 仅支持入参为指针或C/C++内置数据类型(Primitive data types),如:half* s0、float* s1、int32_t c。\n",
    "\n",
    "- 为了统一表达,建议使用GM_ADDR宏来修饰入参,其为是编译器中自带的宏,代表的含义为Global Memory中的地址,其定义如下:  \n",
    "  ```\n",
    "  #define GM_ADDR __gm__ uint8_t*\n",
    "  ```\n",
    "\n",
    "示例如下:\n",
    "```\n",
    "extern \"C\" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)\n",
    "```\n",
    "\n",
    "在本次体验中,由于涉及tiling结构体,所以核函数中增加了一个tiling结构体参数,并在核函数中调用算子类的Init和Process函数,算子类实现在后续步骤中介绍。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e2534e9-7c3e-40fe-8cb5-08f8105bf327",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile Sources/add_bak.asc\n",
    "\n",
    "__global__ __aicore__ void add(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)\n",
    "{\n",
    "    KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);    // 设置Kernel类型为Vector核(用于矢量计算)\n",
    "    KernelAdd op;\n",
    "    op.Init(x, y, z, tiling.totalLength, tiling.tileNum);\n",
    "    op.Process();\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38278513-8ef2-4c6f-8d2c-36b73c503cd8",
   "metadata": {},
   "source": [
    "### 3.5 创建核函数类\n",
    "矢量编程范式把算子的实现流程分为3个基本任务:CopyIn,Compute,CopyOut。\n",
    "\n",
    "- **CopyIn**:将输入数据从Global Memory搬运到Local Memory,完成搬运后执行入队列操作;\n",
    "\n",
    "- **Compute**:完成队列出队后,从Local Memory获取数据并计算,计算完成后执行入队操作;\n",
    "\n",
    "- **CopyOut**:完成队列出队后,将计算结果从Local Memory搬运到Global Memory。\n",
    "\n",
    "<img src=\"./images/vector_programming_paradigm.png\" alt=\"vector_programming_paradigm\"  width=\"700px\" >\n",
    "\n",
    "根据矢量编程范式实现算子类,本样例中定义KernelAdd算子类,其具体成员如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d30353b7-9037-4a48-a80b-014872bc2b0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "class KernelAdd {\n",
    "public:\n",
    "    __aicore__ inline KernelAdd(){}\n",
    "    // 初始化函数,完成内存初始化相关操作\n",
    "    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum);\n",
    "    // 核心处理函数,实现算子逻辑,调用私有成员函数CopyIn、Compute、CopyOut完成矢量算子的三级流水操作\n",
    "    __aicore__ inline void Process();\n",
    "\n",
    "private:\n",
    "    // 搬入函数,从Global Memory搬运数据至Local Memory,被核心Process函数调用\n",
    "    __aicore__ inline void CopyIn(int32_t progress);\n",
    "    // 计算函数,完成两个输入参数相加,得到最终结果,被核心Process函数调用\n",
    "    __aicore__ inline void Compute(int32_t progress);\n",
    "    // 搬出函数,将最终结果从Local Memory搬运到Global Memory上,被核心Process函数调用\n",
    "    __aicore__ inline void CopyOut(int32_t progress);\n",
    "\n",
    "private:\n",
    "    AscendC::TPipe pipe;  // TPipe内存管理对象\n",
    "    AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;  // 输入数据Queue队列管理对象,TPosition为VECIN\n",
    "    AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM> outQueueZ;  // 输出数据Queue队列管理对象,TPosition为VECOUT\n",
    "    AscendC::GlobalTensor<float> xGm;  // 管理输入输出Global Memory内存地址的对象,其中xGm, yGm为输入,zGm为输出\n",
    "    AscendC::GlobalTensor<float> yGm;\n",
    "    AscendC::GlobalTensor<float> zGm;\n",
    "    uint32_t blockLength; // 每个核的计算数据长度\n",
    "    uint32_t tileNum; // 每个核需要计算的数据块个数\n",
    "    uint32_t tileLength; // 每个核内每个数据块的长度\n",
    "};"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fabcedda-4d1f-4939-850c-5a09ad1270a1",
   "metadata": {},
   "source": [
    "内部函数的调用关系如下所示:\n",
    "\n",
    "<img src=\"./images/calling_relationship.png\" alt=\"calling_relationship\"  width=\"700px\" >\n",
    "由此可见除了Init函数完成初始化外,Process中完成了对流水任务“搬入、计算、搬出”的调用,开发者可以重点关注三个流水任务的实现。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2c6d8b0-a2d0-4c8f-9321-6b382d9020a7",
   "metadata": {},
   "source": [
    "### 3.6 init函数实现\n",
    "初始化函数Init主要完成以下内容:\n",
    "\n",
    "- 设置输入输出Global Tensor的Global Memory内存地址。\n",
    "\n",
    "- 通过TPipe内存管理对象为输入输出Queue分配内存。  \n",
    "\n",
    "本算子中,将数据切分成8块,平均分配到8个核上运行,每个核上处理的数据大小为2048个元素。通过将每个核上处理的数据地址在起始地址上增加GetBlockIdx() * blockLength(每个block处理的数据长度)的偏移来获取。来实现多核并行计算的数据切分。\n",
    "\n",
    "以输入x为例,x + blockLength * GetBlockIdx()即为单核处理程序中x在Global Memory上的内存偏移地址,获取偏移地址后,使用GlobalTensor类的SetGlobalBuffer接口设定该核上Global Memory的起始地址以及长度。具体示意图如下。\n",
    "\n",
    "<img src=\"./images/inter_kernel_partition.png\" alt=\"inter_kernel_partition\"  width=\"700px\" >\n",
    "\n",
    "对于单核上的处理数据,可以进行数据切块(Tiling),在本示例中,仅作为参考,将数据切分成8块(并不意味着8块就是性能最优)。切分后的每个数据块再次切分成2块,即可开启double buffer,实现流水线之间的并行。\n",
    "\n",
    "这样单核上的数据(2048个数)被切分成16块,每块tileLength(128)个数据。TPipe为inQueueX分配了两块大小为tileLength * sizeof(float)个字节的内存块,每个内存块能容纳tileLength(128)个float类型数据。数据切分示意图如下。\n",
    "\n",
    "<img src=\"./images/intra_kernel_partition.png\" alt=\"intra_kernel_partition\"  width=\"700px\" >\n",
    "\n",
    "具体的初始化函数代码如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a264a25-cace-448a-a5b7-4c424b8acde1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "__aicore__ inline void KernelAdd::Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)\n",
    "{\n",
    "    \n",
    "     this->blockLength = totalLength / AscendC::GetBlockNum();     // length computed of each core\n",
    "     this->tileNum = tileNum;                                      // split data into 8 tiles for each core\n",
    "     this->tileLength = this->blockLength / tileNum / BUFFER_NUM;  // separate to 2 parts, due to double buffer\n",
    "     // get start index for current core, core parallel\n",
    "     xGm.SetGlobalBuffer((__gm__ float *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);\n",
    "     yGm.SetGlobalBuffer((__gm__ float *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);\n",
    "     zGm.SetGlobalBuffer((__gm__ float *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);\n",
    "     // pipe alloc memory to queue, the unit is Bytes\n",
    "     pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(float));\n",
    "     pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(float));\n",
    "     pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(float));\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "997ca20f-b51d-4340-8635-7df53f63e167",
   "metadata": {},
   "source": [
    "### 3.7 process函数实现\n",
    "基于矢量编程范式,将核函数的实现分为3个基本任务:CopyIn,Compute,CopyOut。Process函数中通过如下方式调用这三个函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b94774be-580e-4160-ac49-ff6eac065c13",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "__aicore__ inline void KernelAdd::Process()\n",
    "{\n",
    "    // loop count need to be doubled, due to double buffer\n",
    "    int32_t loopCount = this->tileNum * BUFFER_NUM;\n",
    "    // tiling strategy, pipeline parallel\n",
    "    for (int32_t i = 0; i < loopCount; i++) {\n",
    "        CopyIn(i);\n",
    "        Compute(i);\n",
    "        CopyOut(i);\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5d92b25-894a-4321-a9c6-349056de2cdf",
   "metadata": {},
   "source": [
    "### 3.8 CopyIn函数实现\n",
    "CopyIn函数的核心作用是将NPU端的数据拷贝至AICore,具体执行流程如下:\n",
    "\n",
    "- 使用DataCopy接口将GlobalTensor数据拷贝到LocalTensor。\n",
    "\n",
    "- 使用EnQue将LocalTensor放入VecIn的Queue中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2204d1bb-e45f-4631-ae4c-e828048e7c43",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "__aicore__ inline void KernelAdd::CopyIn( int32_t progress)\n",
    "{\n",
    "    // alloc tensor from queue memory\n",
    "    AscendC::LocalTensor<float> xLocal = inQueueX.AllocTensor<float>();\n",
    "    AscendC::LocalTensor<float> yLocal = inQueueY.AllocTensor<float>();\n",
    "    // copy progress_th tile from global tensor to local tensor\n",
    "    AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);\n",
    "    AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);\n",
    "    // enque input tensors to VECIN queue\n",
    "    inQueueX.EnQue(xLocal);\n",
    "    inQueueY.EnQue(yLocal);\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de25efb8-ce99-4bb3-8eea-3929929dfe9e",
   "metadata": {},
   "source": [
    "### 3.9 Compute函数实现\n",
    "Compute为核心计算函数,通过调用Ascend C接口,对已搬入AICore的数据执行实际计算操作,具体流程如下:\n",
    "\n",
    "- 使用DeQue从VecIn中取出LocalTensor。\n",
    "\n",
    "- 使用Ascend C接口Add完成矢量计算。\n",
    "\n",
    "- 使用EnQue将计算结果LocalTensor放入到VecOut的Queue中。\n",
    "\n",
    "- 使用FreeTensor将释放不再使用的LocalTensor。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d1a7922-23ab-4045-8d71-185c1b3303ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "__aicore__ inline void KernelAdd::Compute(int32_t progress)\n",
    "{\n",
    "    // deque input tensors from VECIN queue\n",
    "    AscendC::LocalTensor<float> xLocal = inQueueX.DeQue<float>();\n",
    "    AscendC::LocalTensor<float> yLocal = inQueueY.DeQue<float>();\n",
    "    AscendC::LocalTensor<float> zLocal = outQueueZ.AllocTensor<float>();\n",
    "    // call Add instr for computation\n",
    "    AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);\n",
    "    // enque the output tensor to VECOUT queue\n",
    "    outQueueZ.EnQue<float>(zLocal);\n",
    "    // free input tensors for reuse\n",
    "    inQueueX.FreeTensor(xLocal);\n",
    "    inQueueY.FreeTensor(yLocal);\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b897c7fb-f8fb-40c9-a629-bb502a85a5ca",
   "metadata": {},
   "source": [
    "### 3.10 CopyOut函数实现。\n",
    "CopyOut 函数用于将 AICore 中完成计算的结果数据搬出,具体流程如下:\n",
    "\n",
    "- 使用DeQue接口从VecOut的Queue中取出LocalTensor。\n",
    "\n",
    "- 使用DataCopy接口将LocalTensor拷贝到GlobalTensor上。\n",
    "\n",
    "- 使用FreeTensor将不再使用的LocalTensor进行回收。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "468984a6-6afb-42da-8dc5-4140de449642",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "__aicore__ inline void KernelAdd::CopyOut(int32_t progress)\n",
    "{\n",
    "    // deque output tensor from VECOUT queue\n",
    "    AscendC::LocalTensor<float> zLocal = outQueueZ.DeQue<float>();\n",
    "    // copy progress_th tile from local tensor to global tensor\n",
    "    AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);\n",
    "    // free output tensor for reuse\n",
    "    outQueueZ.FreeTensor(zLocal);\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "430b9592-f1f3-4c0c-bb0c-3a3cbf6c2194",
   "metadata": {},
   "source": [
    "### 3.11 重新写入核函数\n",
    "由于核函数会调用实现类中的函数,所以需要将2.4中暂存于add_bak.asc中的核函数代码写入add.asc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "53bef99b-65e5-437b-bc51-b200ab0c6b6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat Sources/add_bak.asc >> Sources/add.asc\n",
    "!rm Sources/add_bak.asc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc0ad978-f7ee-47e9-a18c-64fc8b61637d",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 4. 算子调用\n",
    "完成Kernel侧核函数开发后,即可编写Host侧的核函数调用程序。实现从Host侧的APP程序调用算子,执行计算过程。本算子中主要分为以下三个部分。\n",
    "\n",
    "1. 核函数调用:通过<<<...>>>内核调用符进行算子调用\n",
    "\n",
    "2. 计算结果比对:比对golden数据核实际输出,验证算子精度\n",
    "\n",
    "3. 算子验证主程序:生成输入及golden数据,用来进行算子验证"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fde0fb4-9990-4945-ae9d-c42cccb6dc7a",
   "metadata": {},
   "source": [
    "### 4.1 核函数调用\n",
    "常见的函数调用方式是如下的形式:\n",
    "\n",
    "```\n",
    "function_name(argument list);\n",
    "```\n",
    "\n",
    "核函数使用内核调用符<<<...>>>这种语法形式,来规定核函数的执行配置:\n",
    "\n",
    "```\n",
    "kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);\n",
    "```\n",
    "\n",
    "执行配置由3个参数决定:\n",
    "\n",
    "- **blockDim**: 规定了核函数将会在几个核上执行。每个执行该核函数的核会被分配一个逻辑ID,即block_idx,可以在核函数的实现中调用GetBlockIdx来获取block_idx;\n",
    "\n",
    "- **l2ctrl**: 保留参数,暂时设置为固定值nullptr,开发者无需关注;\n",
    "\n",
    "- **stream**: 类型为aclrtStream,stream用于维护一些异步操作的执行顺序,确保按照应用程序中的代码调用顺序在device上执行。\n",
    "\n",
    "如下名为add_custom的核函数,实现两个矢量的相加,调用示例如下:\n",
    "\n",
    "```\n",
    "// blockDim设置为8表示在8个核上调用了add_custom核函数,每个核都会独立且并行地执行该核函数,该核函数的参数列表为x,y,z。\n",
    "add_custom<<<8, nullptr, stream>>>(x, y, z);\n",
    "```  \n",
    "核函数的调用是异步的,核函数的调用结束后,控制权立刻返回给主机端,可以调用以下aclrtSynchronizeStream函数来强制主机端程序等待所有核函数执行完毕。\n",
    "\n",
    "```\n",
    "aclError aclrtSynchronizeStream(aclrtStream stream);\n",
    "```\n",
    "\n",
    "在整个调用过程中,需要申请Host和Device内存,在Host侧读入数据后,将数据拷贝到Device测,进而调用<<<..>>>进行算子执行,其逻辑流程图如下所示:\n",
    "\n",
    "<img src=\"./images/kernel_function_call_code_logic_block_diagram.png\" alt=\"kernel_function_call_code_logic_block_diagram\"  width=\"200px\" >\n",
    "\n",
    "本样例实现如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "548a071d-01ec-49fe-a4f5-494332b0907c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "std::vector<float> kernel_add(std::vector<float> &x, std::vector<float> &y)\n",
    "{\n",
    "    constexpr uint32_t blockDim = 8;\n",
    "    uint32_t totalLength = x.size();\n",
    "    size_t totalByteSize = totalLength * sizeof(float);\n",
    "    int32_t deviceId = 0;\n",
    "    aclrtStream stream = nullptr;\n",
    "    AddCustomTilingData tiling = {/*totalLength:*/totalLength, /*tileNum:*/8};\n",
    "    uint8_t *xHost = reinterpret_cast<uint8_t *>(x.data());\n",
    "    uint8_t *yHost = reinterpret_cast<uint8_t *>(y.data());\n",
    "    uint8_t *zHost = nullptr;\n",
    "    uint8_t *xDevice = nullptr;\n",
    "    uint8_t *yDevice = nullptr;\n",
    "    uint8_t *zDevice = nullptr;\n",
    "\n",
    "    // 初始化\n",
    "    aclInit(nullptr);\n",
    "    // 运行管理资源申请\n",
    "    aclrtSetDevice(deviceId);\n",
    "    aclrtCreateStream(&stream);\n",
    "    // 分配Host内存\n",
    "    aclrtMallocHost((void **)(&zHost), totalByteSize);\n",
    "    // 分配Device内存\n",
    "    aclrtMalloc((void **)&xDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);\n",
    "    aclrtMalloc((void **)&yDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);\n",
    "    aclrtMalloc((void **)&zDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);\n",
    "    // 将Host上的输入数据拷贝到Device侧\n",
    "    aclrtMemcpy(xDevice, totalByteSize, xHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);\n",
    "    aclrtMemcpy(yDevice, totalByteSize, yHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);\n",
    "    // 用内核调用符<<<...>>>调用核函数完成指定的运算\n",
    "    add<<<blockDim, nullptr, stream>>>(xDevice, yDevice, zDevice, tiling);\n",
    "    aclrtSynchronizeStream(stream);\n",
    "    // 将Device上的运算结果拷贝回Host\n",
    "    aclrtMemcpy(zHost, totalByteSize, zDevice, totalByteSize, ACL_MEMCPY_DEVICE_TO_HOST);\n",
    "    std::vector<float> z((float *)zHost, (float *)(zHost + totalLength));\n",
    "    // 释放申请的资源\n",
    "    aclrtFree(xDevice);\n",
    "    aclrtFree(yDevice);\n",
    "    aclrtFree(zDevice);\n",
    "    aclrtFreeHost(zHost);\n",
    "    // 去初始化\n",
    "    aclrtDestroyStream(stream);\n",
    "    aclrtResetDevice(deviceId);\n",
    "    aclFinalize();\n",
    "    return z;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "191f5434-45f4-4563-acc6-2a5835e2e72e",
   "metadata": {},
   "source": [
    "在实现其他算子时,这里大部分代码均可复用,需要修改的点如下:\n",
    "\n",
    "1. 需要根据实际输入及输出个数,调整Host核Device的内存申请、拷贝、释放以及<<<...>>>中的参数。\n",
    "\n",
    "2. 需要根据实际要使用的核数,定义blockDim。\n",
    "\n",
    "3. 需要根据实际的分块策略定义tileNum。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9654cc03-9f4e-4eec-808a-6f2fbee17f5a",
   "metadata": {},
   "source": [
    "### 4.2 计算结果比对\n",
    "传入实际输出与golden值,进行一一比对,若不相同,则证明存在精度问题。本样例中会同时将output和golden中前20个数打印出来进行查看。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b4c8b79-6c2a-4536-a1ac-b87a902905bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "uint32_t VerifyResult(std::vector<float> &output, std::vector<float> &golden)\n",
    "{\n",
    "    auto printTensor = [](std::vector<float> &tensor, const char *name) {\n",
    "        constexpr size_t maxPrintSize = 20;\n",
    "        std::cout << name << \": \";\n",
    "        std::copy(tensor.begin(), tensor.begin() + std::min(tensor.size(), maxPrintSize),\n",
    "            std::ostream_iterator<float>(std::cout, \" \"));\n",
    "        if (tensor.size() > maxPrintSize) {\n",
    "            std::cout << \"...\";\n",
    "        }\n",
    "        std::cout << std::endl;\n",
    "    };\n",
    "    printTensor(output, \"Output\");\n",
    "    printTensor(golden, \"Golden\");\n",
    "    if (std::equal(output.begin(), output.end(), golden.begin())) {\n",
    "        std::cout << \"[Success] Case accuracy is verification passed.\" << std::endl;\n",
    "        return 0;\n",
    "    } else {\n",
    "        std::cout << \"[Failed] Case accuracy is verification failed!\" << std::endl;\n",
    "        return 1;\n",
    "    }\n",
    "    return 0;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e0c2d96-846a-4e5b-994b-23ecf3192bb3",
   "metadata": {},
   "source": [
    "### 4.3 算子验证主程序\n",
    "定义输入数据长度,并生成实际的golden数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "305f6acb-3f1d-4475-8d5e-cb2ea02e9840",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile -a Sources/add.asc\n",
    "\n",
    "int32_t main(int32_t argc, char *argv[])\n",
    "{\n",
    "    constexpr uint32_t totalLength = 8 * 2048;\n",
    "    constexpr float valueX = 1.2f;\n",
    "    constexpr float valueY = 2.3f;\n",
    "    std::vector<float> x(totalLength, valueX);\n",
    "    std::vector<float> y(totalLength, valueY);\n",
    "\n",
    "    std::vector<float> output = kernel_add(x, y);\n",
    "\n",
    "    std::vector<float> golden(totalLength, valueX + valueY);\n",
    "    return VerifyResult(output, golden);\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3129c331-445e-45fd-9be3-4939c340c8b7",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 5. 编译运行\n",
    "完成所有代码的开发工作后,即可进行编译运行。首先执行以下代码编译可执行文件:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "128c9e15-4579-41aa-bf13-52f5e7e742a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bisheng Sources/add.asc --npu-arch=dav-2201 -o add"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5424895-db51-45a0-be54-9f245b9b39d9",
   "metadata": {},
   "source": [
    "再执行以下代码,进行算子的实际运行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a62ca22e-c6b8-46d7-845f-15962308b102",
   "metadata": {},
   "outputs": [],
   "source": [
    "!./add"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python-3.9.10",
   "language": "python",
   "name": "python-3.9.10"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}