Basic API Contribution Guide

Overview

Basic API is the instruction-level API layer of the Ascend C programming framework. It wraps the hardware instructions of Ascend AI processors directly and exposes them as C++-style function interfaces. It is the foundation on which high-level APIs are built, and developers can implement complex algorithm logic by combining basic APIs.

Core Features of Basic API:

  • Instruction-level encapsulation: Each API maps to one or more hardware instructions.
  • LocalTensor abstraction: Uses the LocalTensor<T> type to operate on memory.
  • Template design: Supports multiple data types (half, float, int16_t, int32_t, and so on).
  • Dual interfaces: High-dimensional tiling computation (fine control) and first-n elements computation (simplified invocation).
  • Architecture adaptation: Supports different NPU architectures through architecture macro definitions.

Development Process

Requirement Analysis

  • Define API functionality (for example, Add, Mul, Relu).
  • Determine supported data types.
  • Analyze hardware instruction support.

API Design

  • Define function prototypes (using LocalTensor).
  • Design high-dimensional tiling computation and first-n elements computation interfaces.
  • Define parameter specifications (mask, repeat, stride, and so on).

Implementation Development

  • Write interface declarations (include/basic_api/).
  • Implement core logic (impl/basic_api/).
  • Handle architecture differences.

Test and Verification

  • Write unit tests.
  • Verify functional correctness.
  • Check boundary conditions.

Documentation

  • Complete API documentation.
  • Provide usage examples.
  • Explain constraints.

API Introduction

High-dimensional Tiling Computation vs First-n Elements Computation Interface

High-dimensional Tiling Computation (Fine Control)

// Requires manual setting of mask and repeat parameters
template <typename T, bool isSetMask = true>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           uint64_t mask[],           // mask array
                           const uint8_t repeatTime,  // repeat count
                           const BinaryRepeatParams& repeatParams);  // stride parameters

Applicable Scenarios:

  • Fine control over the computation process.
  • Non-contiguous memory access.
  • Performance optimization.
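
To make the roles of mask, repeatTime, and the stride parameters concrete, the following host-side sketch simulates the addressing in plain C++. It is not the device implementation: DemoRepeatParams and SimulateAdd are hypothetical names, and the model (eight 32-byte blocks per repeat, strides counted in blocks, bit e of the mask enabling element e of each repeat) is an assumption that should be checked against the hardware documentation for the target architecture.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side stride description. The field names mirror those of
// BinaryRepeatParams in this guide; the unit (32-byte blocks) is an assumption.
struct DemoRepeatParams {
    uint8_t dstBlkStride = 1, src0BlkStride = 1, src1BlkStride = 1;
    uint8_t dstRepStride = 8, src0RepStride = 8, src1RepStride = 8;
};

// Simulates a masked, repeated vector add on float data.
// Assumed model: 8 blocks per repeat and 32 bytes per block, so 64 floats per
// repeat; bit e of mask enables element e of every repeat.
inline void SimulateAdd(std::vector<float>& dst, const std::vector<float>& src0,
                        const std::vector<float>& src1, uint64_t mask,
                        uint8_t repeatTime, const DemoRepeatParams& p) {
    constexpr uint32_t kElemsPerBlock = 32 / sizeof(float);
    constexpr uint32_t kBlocksPerRepeat = 8;
    for (uint32_t r = 0; r < repeatTime; ++r) {
        for (uint32_t e = 0; e < kElemsPerBlock * kBlocksPerRepeat; ++e) {
            if (((mask >> e) & 1ULL) == 0) {
                continue;  // element disabled by the mask
            }
            const uint32_t blk = e / kElemsPerBlock;
            const uint32_t lane = e % kElemsPerBlock;
            // Element index for a given block stride and repeat stride.
            auto index = [&](uint8_t blkStride, uint8_t repStride) {
                return (r * repStride + blk * blkStride) * kElemsPerBlock + lane;
            };
            const uint32_t d = index(p.dstBlkStride, p.dstRepStride);
            const uint32_t s0 = index(p.src0BlkStride, p.src0RepStride);
            const uint32_t s1 = index(p.src1BlkStride, p.src1RepStride);
            assert(d < dst.size() && s0 < src0.size() && s1 < src1.size());
            dst[d] = src0[s0] + src1[s1];
        }
    }
}
```

Under this model, a mask of 0xFFFFFFFF with repeatTime = 2 and default strides writes only the first 32 floats of each 64-element repeat, a pattern that a single contiguous element count does not describe.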

First-n Elements Computation (Simplified Invocation)

// Automatically handles mask and repeat
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count);  // only element count needed

Applicable Scenarios:

  • Contiguous memory block computation.
  • Simplified code.
  • Rapid development.
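
The bookkeeping that the first-n interface hides can be sketched on the host as follows. This shows what a caller of the tiling interface would otherwise compute by hand for a contiguous range. RepeatPlan, PlanRepeats, and TailMask are hypothetical helpers, and the figure of 256 bytes per repeat is an assumption rather than something stated in this guide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types, not part of the Basic API.
struct RepeatPlan {
    uint32_t fullRepeats;  // repeats in which every element is active
    uint32_t tailElems;    // leftover elements that need a partial mask
};

// Splits count elements of type T into full repeats plus a tail.
template <typename T>
inline RepeatPlan PlanRepeats(int32_t count) {
    assert(count >= 0);
    // Assumption: one repeat covers 256 bytes of data.
    constexpr uint32_t kElemsPerRepeat = 256 / sizeof(T);
    const uint32_t n = static_cast<uint32_t>(count);
    return RepeatPlan{n / kElemsPerRepeat, n % kElemsPerRepeat};
}

// Returns a 64-bit mask with the lowest tailElems bits set.
// Covers a single mask word only, so tailElems must be at most 64.
inline uint64_t TailMask(uint32_t tailElems) {
    assert(tailElems <= 64);
    return tailElems == 64 ? ~0ULL : ((1ULL << tailElems) - 1);
}
```

For example, PlanRepeats<float>(257) yields four full repeats and a tail of one element. Because repeatTime is a uint8_t in the tiling interface, a plan with more than 255 full repeats would have to be split across several calls.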

Directory Planning

Directory Structure

asc-devkit/
├── include/
│   └── basic_api/                    # Basic API header files
│       ├── kernel_operator_common_intf.h        # Common interface
│       ├── kernel_operator_vec_binary_intf.h    # Vector binary operations
│       ├── kernel_operator_vec_unary_intf.h     # Vector unary operations
│       ├── kernel_operator_data_copy_intf.h     # Data movement
│       ├── kernel_operator_fixpipe_intf.h       # Fixpipe
│       ├── kernel_operator_mm_intf.h            # Matrix multiplication
│       ├── kernel_operator_scalar_intf.h        # Scalar operations
│       ├── kernel_operator_sys_var_intf.h       # System variables
│       ├── kernel_operator_atomic_intf.h        # Atomic operations
│       ├── kernel_tensor.h                      # Tensor definition
│       └── kernel_struct_*.h                    # Parameter structures
│
├── impl/
│   └── basic_api/                    # Basic API implementation
│       ├── dav_m200/                 # NPU ARCH 200x architecture
│       │   ├── kernel_operator_vec_binary_impl.h
│       │   └── ...
│       ├── dav_c220/                 # NPU ARCH 220x architecture
│       │   ├── kernel_operator_vec_binary_impl.h
│       │   └── ...
│       └── CMakeLists.txt
│
├── tests/
│   └── api/
│       └── basic_api/                # Basic API tests
│           ├── tikcpp_case_common/
│           │   └── test_operator_axpy.cpp
│           ├── tikcpp_case_ascend910/
│           │   └── ...
│           └── tikcpp_case_ascend910b1/
│               └── ...
│
└── docs/
    └── api/
        └── context/
            └── ...              # Basic API documentation

File Naming Conventions

File Type            Naming Convention                   Example
Interface header     kernel_operator_<category>_intf.h   kernel_operator_vec_binary_intf.h
Implementation file  kernel_operator_<category>_impl.h   kernel_operator_vec_binary_impl.h
Test file            test_operator_<category>.cpp        test_operator_vec_binary.cpp
Documentation file   <api>.md                            Add.md

API Categories

Category    Description               Example APIs
vec_binary  Vector binary operations  Add, Sub, Mul, Div, Max, Min
vec_unary   Vector unary operations   Relu, Exp, Cast, Abs
vec_reduce  Vector reduction          Sum, Max, Mean
data_copy   Data movement             DataCopy, LoadData
fixpipe     Pipeline control          Fixpipe
mm          Matrix multiplication     Mmad, Conv2D
scalar      Scalar operations         ToFloat
atomic      Atomic operations         AtomicAdd, AtomicCAS

Architecture Design

Implementation Layers

Layer 1: Interface Declaration Layer (include/basic_api/)

// include/basic_api/kernel_operator_vec_binary_intf.h
#ifndef ASCENDC_MODULE_OPERATOR_VEC_BINARY_INTERFACE_H
#define ASCENDC_MODULE_OPERATOR_VEC_BINARY_INTERFACE_H

#include "kernel_tensor.h"
#include "kernel_struct_binary.h"

namespace AscendC {

// Add - High-dimensional tiling computation
template <typename T, bool isSetMask = true>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           uint64_t mask[],
                           const uint8_t repeatTime,
                           const BinaryRepeatParams& repeatParams);

// Add - First-n elements computation
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count);

}  // namespace AscendC

#include "impl/basic_api/kernel_operator_vec_binary_intf_impl.h"
#endif

Layer 2: Instruction Implementation Layer (impl/basic_api/)

// impl/basic_api/dav_c220/kernel_operator_vec_binary_impl.h
#ifndef ASCENDC_MODULE_OPERATOR_VEC_BINARY_IMPL_H
#define ASCENDC_MODULE_OPERATOR_VEC_BINARY_IMPL_H

namespace AscendC {

// Add implementation - First-n elements computation
template <typename T>
__aicore__ inline void AddImpl(__ubuf__ T* dst, __ubuf__ T* src0,
                               __ubuf__ T* src1, const int32_t& count) {
    if ASCEND_IS_AIV {
        // 1. Set mask
        set_mask_count();
        set_vector_mask(0, count);

        // 2. Call underlying instruction
        vadd(dst, src0, src1, 1,
             DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE,
             DEFAULT_REPEAT_STRIDE, DEFAULT_REPEAT_STRIDE, DEFAULT_REPEAT_STRIDE);

        // 3. Restore mask
        set_mask_norm();
        set_vector_mask(static_cast<uint64_t>(-1), static_cast<uint64_t>(-1));
    }
}

// Add implementation - High-dimensional tiling computation
template <typename T, bool isSetMask = true>
__aicore__ inline void AddImpl(__ubuf__ T* dst, __ubuf__ T* src0,
                               __ubuf__ T* src1, const uint64_t mask[],
                               const uint8_t repeatTime,
                               const BinaryRepeatParams& repeatParams) {
    if ASCEND_IS_AIV {
        // Set mask (if needed)
        if (isSetMask) {
            AscendCUtils::SetMask<T, isSetMask>(mask[1], mask[0]);
        }

        // Call underlying instruction
        vadd(dst, src0, src1, repeatTime,
             repeatParams.dstBlkStride, repeatParams.src0BlkStride, repeatParams.src1BlkStride,
             repeatParams.dstRepStride, repeatParams.src0RepStride, repeatParams.src1RepStride);
    }
}

}  // namespace AscendC

#endif

Layer 3: Interface Wrapper Layer

// impl/basic_api/kernel_operator_vec_binary_intf_impl.h
namespace AscendC {

// First-n elements computation interface wrapper
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count) {
    AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count);
}

// High-dimensional tiling computation interface wrapper
template <typename T, bool isSetMask = true>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           uint64_t mask[],
                           const uint8_t repeatTime,
                           const BinaryRepeatParams& repeatParams) {
    AddImpl<T, isSetMask>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(),
                          mask, repeatTime, repeatParams);
}

}  // namespace AscendC

Architecture Adaptation

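Hardware instructions can differ across NPU architectures, so an API may need a separate implementation for each supported architecture. Place each implementation in the matching architecture directory under impl/basic_api/ (for example, dav_m200/ or dav_c220/).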
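
One way to keep a single call site while swapping implementations per architecture is a namespace alias chosen by the preprocessor. The sketch below is a host-side illustration only: DEMO_NPU_ARCH, the demo namespace, and ImplDir are hypothetical, and the mapping from architecture number to directory is an assumption based on the directory comments in this guide.

```cpp
#include <cassert>
#include <cstring>

// DEMO_NPU_ARCH is a hypothetical stand-in for the real architecture macro.
// The default value 2201 is one of the values used in this guide.
#ifndef DEMO_NPU_ARCH
#define DEMO_NPU_ARCH 2201
#endif

namespace demo {

// One namespace per architecture directory, each exposing the same function.
namespace dav_m200 {
inline const char* ImplDir() { return "dav_m200"; }
}  // namespace dav_m200

namespace dav_c220 {
inline const char* ImplDir() { return "dav_c220"; }
}  // namespace dav_c220

// Assumed mapping: 200x selects dav_m200, 220x selects dav_c220.
#if (DEMO_NPU_ARCH / 10) == 200
namespace impl = dav_m200;
#elif (DEMO_NPU_ARCH / 10) == 220
namespace impl = dav_c220;
#else
#error "Unsupported architecture in this sketch"
#endif

// Callers use one name; the preprocessor decides which implementation backs it.
inline const char* SelectedImplDir() {
    const char* dir = impl::ImplDir();
    assert(dir != nullptr);
    return dir;
}

}  // namespace demo
```

With the default value, demo::SelectedImplDir() returns "dav_c220". An unsupported value fails at compile time instead of silently falling back to another implementation.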


Development Example: Implementing Axpy Basic API

API Requirement Analysis

Implement vector multiply-add: dst = src * scalar + dst

  • Supported data types: half, float
  • Interface type: First-n elements computation (simplified invocation)
  • Hardware support: Confirm that the target NPU architectures provide the instructions needed for this operation

Review Existing API Structure

Basic APIs take LocalTensor<T> objects as parameters. Beyond the tensors, the first-n elements computation interface requires only the count parameter:

// Reference existing Add interface
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count);

Interface Design

Add the following declaration to include/basic_api/kernel_operator_vec_binary_intf.h:

/* **************************************************************************************************
 * Axpy                                             *
 * ************************************************************************************************* */
/*
 * @ingroup Axpy
 * @brief dst = dst + src * scalar
 * @param [out] dst output LocalTensor
 * @param [in] src input LocalTensor
 * @param [in] scalar scalar value
 * @param [in] count number of elements involved in the calculation
 */
template <typename T, typename U>
__aicore__ inline void Axpy(const LocalTensor<T>& dst,
                            const LocalTensor<U>& src,
                            const U scalar,
                            const int32_t& count);

Implementation Code

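Implement AxpyImpl in the architecture-specific implementation files under impl/basic_api/, using existing implementations such as AddImpl as a reference.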

Interface Wrapper

Add the following wrapper to impl/basic_api/kernel_operator_vec_binary_intf_impl.h:

template <typename T, typename U>
__aicore__ inline void Axpy(const LocalTensor<T>& dst,
                            const LocalTensor<U>& src,
                            const U scalar,
                            const int32_t& count) {
    AxpyImpl<T, U>(dst.GetPtr(), src.GetPtr(), scalar, count);
}

Test Code

Add test code for the new interface under tests/api/basic_api/ (for example, tikcpp_case_common/test_operator_axpy.cpp).


Test and Verification Requirements

Functional Testing

Verify API computation correctness.
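
A common way to check correctness is to compare device output against a host-side reference within a tolerance. The following is a minimal sketch for the Axpy example, assuming float data. AxpyReference and AllClose are hypothetical helper names, and the default tolerances are placeholders to tune per data type.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference: dst = dst + src * scalar over the first count elements.
inline void AxpyReference(std::vector<float>& dst, const std::vector<float>& src,
                          float scalar, std::size_t count) {
    assert(count <= dst.size() && count <= src.size());
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] += src[i] * scalar;
    }
}

// Returns true when every element satisfies
// |actual - expected| <= absTol + relTol * |expected|.
inline bool AllClose(const std::vector<float>& actual,
                     const std::vector<float>& expected,
                     float relTol = 1e-3f, float absTol = 1e-5f) {
    if (actual.size() != expected.size()) {
        return false;
    }
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        if (diff > absTol + relTol * std::fabs(expected[i])) {
            return false;
        }
    }
    return true;
}
```

In a test, the expected vector would come from AxpyReference and the actual vector from the buffer written by the kernel. Elements beyond count should be left unchanged, which the same comparison also catches.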

Boundary Testing

TEST_F(TestAxpy, BoundaryTest) {
    // Test boundary values: count=0, 1, 256, 257
    // Test different data type combinations
    // Test special values (NaN, Inf)
}
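
Special values need their own comparison rule, because NaN never compares equal to itself and a tolerance check on infinities is meaningless. A minimal host-side sketch of such a rule is shown below. SameSpecialAware is a hypothetical helper, and the default tolerance is a placeholder.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Compares one output element against its expected value.
// NaN matches only NaN, an infinity matches only the same signed infinity,
// and finite values are compared with an absolute tolerance.
inline bool SameSpecialAware(float actual, float expected, float tol = 1e-5f) {
    assert(tol >= 0.0f);
    if (std::isnan(expected)) {
        return std::isnan(actual);
    }
    if (std::isinf(expected)) {
        return actual == expected;
    }
    if (std::isnan(actual) || std::isinf(actual)) {
        return false;
    }
    return std::fabs(actual - expected) <= tol;
}
```

For Axpy, a NaN in either src or dst is expected to produce NaN in the corresponding output element, so the expected values for such tests can be generated with ordinary host arithmetic.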

Data Type Testing

INSTANTIATE_TEST_CASE_P(TEST_AXPY_TYPES, AxpyTestsuite,
    ::testing::Values(
        BinaryTestParams { 256, 2, 2, main_axpy<half, half> },
        BinaryTestParams { 256, 4, 2, main_axpy<float, half> },
        BinaryTestParams { 256, 4, 4, main_axpy<float, float> }
    )
);

Code Standards

Naming Conventions

// Function name: PascalCase, first letter uppercase
void Add(...);
void Relu(...);
void Axpy(...);

// Parameter name: camelCase
LocalTensor<T> dstTensor;
int32_t elementCount;

// Macro definition: UPPERCASE_WITH_UNDERSCORES
#define ASCENDC_ASSERT(cond, msg) ...

// Type name: PascalCase
struct BinaryRepeatParams;
class LocalTensor;

Code Style

// 1. Indentation: 4 spaces
// 2. Braces: K&R style
// 3. Spaces: Spaces around operators
// 4. Comments: Doxygen style

/**
 * @brief Vector addition operation
 * @param dst Destination LocalTensor
 * @param src0 Source LocalTensor 0
 * @param src1 Source LocalTensor 1
 * @param count Element count
 */
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count) {
    // Parameter validation
    ASCENDC_ASSERT(count > 0, "count must be positive");

    // Call implementation
    AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count);
}

Error Handling

// 1. Parameter validation (Debug mode)
ASCENDC_ASSERT(count > 0, "count must be greater than 0");
ASCENDC_ASSERT(dst != nullptr, "dst cannot be nullptr");
ASCENDC_ASSERT(src != nullptr, "src cannot be nullptr");

// 2. Type checking
static_assert(SupportType<T, half, float, int16_t, int32_t>(),
              "Unsupported data type");

// 3. Architecture checking
#if !defined(__NPU_ARCH__) || (__NPU_ARCH__ != 2201 && __NPU_ARCH__ != 3510)
    #error "Unsupported NPU architecture"
#endif
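
The type check above relies on a compile-time membership test. How SupportType is actually implemented is not shown in this guide; the sketch below is one possible way to write such a trait in standard C++. IsOneOf and CheckedAdd are hypothetical names, and the type list omits half because it is not a standard host type.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait: true when T is one of the Allowed types.
// Primary template handles the empty list.
template <typename T, typename... Allowed>
struct IsOneOf : std::false_type {};

// Recursive case: match the first candidate or keep searching the rest.
template <typename T, typename First, typename... Rest>
struct IsOneOf<T, First, Rest...>
    : std::integral_constant<bool, std::is_same<T, First>::value ||
                                       IsOneOf<T, Rest...>::value> {};

// Example function that rejects unsupported element types at compile time.
template <typename T>
constexpr T CheckedAdd(T a, T b) {
    static_assert(IsOneOf<T, float, int16_t, int32_t>::value,
                  "Unsupported data type");
    return a + b;
}
```

Calling CheckedAdd<double>(1.0, 2.0) would stop compilation with the "Unsupported data type" message, so a misuse is reported at build time and not on the device.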