cann-robotdocs: add English translation for power README

Basic API Contribution Guide

Overview

Basic API is the instruction-level API layer in the Ascend C programming framework. It directly wraps hardware instructions of Ascend AI processors and uses C++ style function interfaces. Basic API serves as the foundation for building high-level APIs. Developers can implement complex algorithm logic by combining basic APIs.

Core Features of Basic API:

Instruction-level encapsulation: Each API maps to one or more hardware instructions.
LocalTensor abstraction: Uses the LocalTensor<T> type to operate on memory.
Template design: Supports multiple data types (half, float, int16_t, int32_t, and so on).
Dual interfaces: High-dimensional tiling computation (fine control) and first-n elements computation (simplified invocation).
Architecture adaptation: Supports different NPU architectures through architecture macro definitions.

Development Process

Requirement Analysis

Define API functionality (for example, Add, Mul, Relu).
Determine supported data types.
Analyze hardware instruction support.

API Design

Define function prototypes (using LocalTensor).
Design high-dimensional tiling computation and first-n elements computation interfaces.
Define parameter specifications (mask, repeat, stride, and so on).

Implementation Development

Write interface declarations (include/basic_api/).
Implement core logic (impl/basic_api/).
Handle architecture differences.

Test and Verification

Write unit tests.
Verify functional correctness.
Check boundary conditions.

Documentation

Complete API documentation.
Provide usage examples.
Explain constraints.

API Introduction

High-dimensional Tiling Computation vs First-n Elements Computation Interface

High-dimensional Tiling Computation (Fine Control)

// Requires manual setting of mask and repeat parameters
template <typename T, bool isSetMask = true>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           uint64_t mask[],           // mask array
                           const uint8_t repeatTime,  // repeat count
                           const BinaryRepeatParams& repeatParams);  // stride parameters

Applicable Scenarios:

Fine control over the computation process is required.
Non-contiguous memory access.
Performance optimization.
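To make the roles of mask, repeatTime, and the repeat stride concrete, the following host-side model may help. It is only a sketch under stated assumptions: one repeat is modeled as 64 float lanes, bit i of a single 64-bit mask word enables lane i, and the stride is counted in elements rather than in hardware blocks. MaskedAddModel and kLanes are hypothetical names, not part of Basic API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model only. Assumptions: one repeat covers 64 float lanes,
// bit i of laneMask enables lane i, and repeatStride is counted in elements.
constexpr int kLanes = 64;

inline void MaskedAddModel(std::vector<float>& dst,
                           const std::vector<float>& src0,
                           const std::vector<float>& src1,
                           std::uint64_t laneMask,
                           std::uint8_t repeatTime,
                           int repeatStride) {
    for (int r = 0; r < repeatTime; ++r) {
        const int base = r * repeatStride;  // start of this repeat
        for (int lane = 0; lane < kLanes; ++lane) {
            const bool enabled = ((laneMask >> lane) & 1ULL) != 0;
            if (!enabled) {
                continue;  // masked-off lanes keep their old dst value
            }
            const std::size_t i = static_cast<std::size_t>(base + lane);
            assert(i < dst.size() && i < src0.size() && i < src1.size());
            dst[i] = src0[i] + src1[i];
        }
    }
}
```

With laneMask = 0x3, repeatTime = 2, and repeatStride = 64, only elements 0, 1, 64, and 65 of dst are written in this model.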

First-n Elements Computation (Simplified Invocation)

// Automatically handles mask and repeat
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count);  // only element count needed

Applicable Scenarios:

Contiguous memory block computation.
Simplified code.
Rapid development.
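The link between the two interface styles can be illustrated by splitting a contiguous element count into full repeats plus a masked tail. This is a host-side sketch, assuming that one repeat processes 256 bytes; RepeatPlan and PlanRepeats are hypothetical helpers introduced for illustration and do not describe how the first-n interface is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: one vector repeat processes 256 bytes of data.
constexpr std::size_t kBytesPerRepeat = 256;

// Hypothetical helper type: how `count` contiguous elements split into
// full repeats plus one partially masked tail repeat.
struct RepeatPlan {
    std::int32_t fullRepeats;   // repeats with every lane enabled
    std::int32_t tailElements;  // elements left for a final masked repeat
};

template <typename T>
inline RepeatPlan PlanRepeats(std::int32_t count) {
    assert(count >= 0);
    const std::int32_t perRepeat =
        static_cast<std::int32_t>(kBytesPerRepeat / sizeof(T));
    return RepeatPlan{count / perRepeat, count % perRepeat};
}
```

Under this assumption, 257 float elements split into 4 full repeats of 64 elements and a tail of 1 element, which is why boundary tests around multiples of the repeat size are useful.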

Directory Planning

Directory Structure

asc-devkit/
├── include/
│   └── basic_api/                    # Basic API header files
│       ├── kernel_operator_common_intf.h        # Common interface
│       ├── kernel_operator_vec_binary_intf.h    # Vector binary operations
│       ├── kernel_operator_vec_unary_intf.h     # Vector unary operations
│       ├── kernel_operator_data_copy_intf.h     # Data movement
│       ├── kernel_operator_fixpipe_intf.h       # Fixpipe
│       ├── kernel_operator_mm_intf.h            # Matrix multiplication
│       ├── kernel_operator_scalar_intf.h        # Scalar operations
│       ├── kernel_operator_sys_var_intf.h       # System variables
│       ├── kernel_operator_atomic_intf.h        # Atomic operations
│       ├── kernel_tensor.h                      # Tensor definition
│       └── kernel_struct_*.h                    # Parameter structures
│
├── impl/
│   └── basic_api/                    # Basic API implementation
│       ├── dav_m200/                 # NPU ARCH 200x architecture
│       │   ├── kernel_operator_vec_binary_impl.h
│       │   └── ...
│       ├── dav_c220/                 # NPU ARCH 220x architecture
│       │   ├── kernel_operator_vec_binary_impl.h
│       │   └── ...
│       └── CMakeLists.txt
│
├── tests/
│   └── api/
│       └── basic_api/                # Basic API tests
│           ├── tikcpp_case_common/
│           │   └── test_operator_axpy.cpp
│           ├── tikcpp_case_ascend910/
│           │   └── ...
│           └── tikcpp_case_ascend910b1/
│               └── ...
│
└── docs/
    └── api/
        └── context/
            └── ...              # Basic API documentation

File Naming Conventions

| File Type | Naming Convention | Example |
| --- | --- | --- |
| Interface header | `kernel_operator_<category>_intf.h` | `kernel_operator_vec_binary_intf.h` |
| Implementation file | `kernel_operator_<category>_impl.h` | `kernel_operator_vec_binary_impl.h` |
| Test file | `test_operator_<category>.cpp` | `test_operator_vec_binary.cpp` |
| Documentation file | `<api>.md` | `Add.md` |

API Categories

| Category | Description | Example APIs |
| --- | --- | --- |
| vec_binary | Vector binary operations | Add, Sub, Mul, Div, Max, Min |
| vec_unary | Vector unary operations | Relu, Exp, Cast, Abs |
| vec_reduce | Vector reduction | Sum, Max, Mean |
| data_copy | Data movement | DataCopy, LoadData |
| fixpipe | Post-processing and movement of matrix results | Fixpipe |
| mm | Matrix multiplication | Mmad, Conv2D |
| scalar | Scalar operations | ToFloat |
| atomic | Atomic operations | AtomicAdd, AtomicCAS |

Architecture Design

Implementation Layers

Layer 1: Interface Declaration Layer (include/basic_api/)

// include/basic_api/kernel_operator_vec_binary_intf.h
#ifndef ASCENDC_MODULE_OPERATOR_VEC_BINARY_INTERFACE_H
#define ASCENDC_MODULE_OPERATOR_VEC_BINARY_INTERFACE_H

#include "kernel_tensor.h"
#include "kernel_struct_binary.h"

namespace AscendC {

// Add - High-dimensional tiling computation
template <typename T, bool isSetMask = true>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           uint64_t mask[],
                           const uint8_t repeatTime,
                           const BinaryRepeatParams& repeatParams);

// Add - First-n elements computation
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count);

}  // namespace AscendC

#include "impl/basic_api/kernel_operator_vec_binary_intf_impl.h"
#endif

Layer 2: Instruction Implementation Layer (impl/basic_api/)

// impl/basic_api/dav_c220/kernel_operator_vec_binary_impl.h
#ifndef ASCENDC_MODULE_OPERATOR_VEC_BINARY_IMPL_H
#define ASCENDC_MODULE_OPERATOR_VEC_BINARY_IMPL_H

namespace AscendC {

// Add implementation - First-n elements computation
template <typename T>
__aicore__ inline void AddImpl(__ubuf__ T* dst, __ubuf__ T* src0,
                               __ubuf__ T* src1, const int32_t& count) {
    if ASCEND_IS_AIV {
        // 1. Set mask
        set_mask_count();
        set_vector_mask(0, count);

        // 2. Call underlying instruction
        vadd(dst, src0, src1, 1,
             DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE,
             DEFAULT_REPEAT_STRIDE, DEFAULT_REPEAT_STRIDE, DEFAULT_REPEAT_STRIDE);

        // 3. Restore mask
        set_mask_norm();
        set_vector_mask(static_cast<uint64_t>(-1), static_cast<uint64_t>(-1));
    }
}

// Add implementation - High-dimensional tiling computation
template <typename T, bool isSetMask = true>
__aicore__ inline void AddImpl(__ubuf__ T* dst, __ubuf__ T* src0,
                               __ubuf__ T* src1, const uint64_t mask[],
                               const uint8_t repeatTime,
                               const BinaryRepeatParams& repeatParams) {
    if ASCEND_IS_AIV {
        // Set mask (if needed)
        if (isSetMask) {
            AscendCUtils::SetMask<T, isSetMask>(mask[1], mask[0]);
        }

        // Call underlying instruction
        vadd(dst, src0, src1, repeatTime,
             repeatParams.dstBlkStride, repeatParams.src0BlkStride, repeatParams.src1BlkStride,
             repeatParams.dstRepStride, repeatParams.src0RepStride, repeatParams.src1RepStride);
    }
}

}  // namespace AscendC

#endif

Layer 3: Interface Wrapper Layer

// impl/basic_api/kernel_operator_vec_binary_intf_impl.h
namespace AscendC {

// First-n elements computation interface wrapper
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count) {
    AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count);
}

// High-dimensional tiling computation interface wrapper
template <typename T, bool isSetMask = true>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           uint64_t mask[],
                           const uint8_t repeatTime,
                           const BinaryRepeatParams& repeatParams) {
    AddImpl<T, isSetMask>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(),
                          mask, repeatTime, repeatParams);
}

}  // namespace AscendC
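The layering can be tried out on a host machine with a small mock. The sketch below is not the device code: LocalTensorMock is a hypothetical stand-in for LocalTensor<T>, and a plain loop replaces the mask setup and the vadd instruction. It only shows how the wrapper unpacks tensors and forwards raw pointers to the implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical stand-in for LocalTensor<T>: a non-owning view with GetPtr().
template <typename T>
class LocalTensorMock {
public:
    explicit LocalTensorMock(std::vector<T>& storage) : ptr_(storage.data()) {}
    T* GetPtr() const { return ptr_; }

private:
    T* ptr_;
};

// Implementation layer stand-in: works on raw pointers. A plain loop
// replaces the mask setup and the vadd instruction used on device.
template <typename T>
inline void AddImpl(T* dst, const T* src0, const T* src1,
                    const std::int32_t& count) {
    for (std::int32_t i = 0; i < count; ++i) {
        dst[i] = src0[i] + src1[i];
    }
}

// Wrapper layer stand-in: unpacks the tensors and forwards to AddImpl.
template <typename T>
inline void Add(const LocalTensorMock<T>& dst,
                const LocalTensorMock<T>& src0,
                const LocalTensorMock<T>& src1,
                const std::int32_t& count) {
    assert(count >= 0);
    AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count);
}

}  // namespace sketch
```

Keeping the wrapper this thin means architecture-specific differences stay inside the implementation layer, while the public signature remains the same.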

Architecture Adaptation

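Hardware instructions can differ across NPU architectures, so an API may need a separate implementation for each supported architecture, placed in the matching subdirectory of impl/basic_api/ (for example, dav_m200/ or dav_c220/).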
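One common way to organize per-architecture code is compile-time selection on an architecture macro. The sketch below is a host-side illustration only: SKETCH_NPU_ARCH, the value 2002, and ImplDir are hypothetical names and values chosen for this example and are not identifiers from the code base.

```cpp
#include <cassert>
#include <string>

// Assumption: SKETCH_NPU_ARCH is a hypothetical stand-in for the real
// architecture macro. It defaults to 2201 when the build does not set it.
#ifndef SKETCH_NPU_ARCH
#define SKETCH_NPU_ARCH 2201
#endif

namespace arch_sketch {

#if SKETCH_NPU_ARCH == 2002
// Illustrative value for the 200x family (impl/basic_api/dav_m200/).
inline std::string ImplDir() { return "dav_m200"; }
#elif SKETCH_NPU_ARCH == 2201
// 220x family (impl/basic_api/dav_c220/).
inline std::string ImplDir() { return "dav_c220"; }
#else
#error "Unsupported architecture in this sketch"
#endif

}  // namespace arch_sketch
```

Only one branch is compiled for a given target, so each architecture directory can provide the same function names with different bodies.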

Development Example: Implementing Axpy Basic API

API Requirement Analysis

Implement vector multiply-add: dst = src * scalar + dst

Supported data types: half, float
Interface type: First-n elements computation (simplified invocation)
Hardware support: Confirm that the target hardware provides a corresponding instruction.

Review Existing API Structure

Basic APIs take LocalTensor<T> objects as parameters. Beyond the tensors, the first-n elements computation interface requires only the count parameter:

// Reference existing Add interface
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count);

Interface Design

Add in include/basic_api/kernel_operator_vec_binary_intf.h:

/* **************************************************************************************************
 * Axpy                                             *
 * ************************************************************************************************* */
/**
 * @ingroup Axpy
 * @brief dst = dst + src * scalar
 * @param [out] dst output LocalTensor
 * @param [in] src input LocalTensor
 * @param [in] scalar scalar value
 * @param [in] count number of elements involved in the computation
 */
template <typename T, typename U>
__aicore__ inline void Axpy(const LocalTensor<T>& dst,
                            const LocalTensor<U>& src,
                            const U scalar,
                            const int32_t& count);

Implementation Code

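Implement AxpyImpl in the architecture-specific implementation file (for example, impl/basic_api/dav_c220/kernel_operator_vec_binary_impl.h), following existing implementations such as AddImpl.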
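Before writing the device implementation, it can help to pin down the expected semantics with a host-side reference. The sketch below is not AxpyImpl: AxpyReference is a hypothetical helper that uses a plain loop, with T as the destination type and U as the source and scalar type, mirroring the template parameters of the interface.

```cpp
#include <cassert>
#include <cstdint>

namespace axpy_sketch {

// Host-side reference only: dst[i] = dst[i] + src[i] * scalar for the
// first `count` elements. The multiply is carried out in type T here,
// which is an assumption of this sketch, not a statement about hardware.
template <typename T, typename U>
inline void AxpyReference(T* dst, const U* src, const U scalar,
                          const std::int32_t& count) {
    assert(count >= 0);
    for (std::int32_t i = 0; i < count; ++i) {
        dst[i] = dst[i] + static_cast<T>(src[i]) * static_cast<T>(scalar);
    }
}

}  // namespace axpy_sketch
```

A reference like this can later serve as the golden model in unit tests.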

Interface Wrapper

Add in impl/basic_api/kernel_operator_vec_binary_intf_impl.h:

template <typename T, typename U>
__aicore__ inline void Axpy(const LocalTensor<T>& dst,
                            const LocalTensor<U>& src,
                            const U scalar,
                            const int32_t& count) {
    AxpyImpl<T, U>(dst.GetPtr(), src.GetPtr(), scalar, count);
}

Test Code

Add test code for the new interface under tests/api/basic_api/ (for example, tikcpp_case_common/test_operator_axpy.cpp).

Test and Verification Requirements

Functional Testing

Verify API computation correctness.
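Correctness checks for floating-point results usually compare against a golden reference with a tolerance rather than requiring exact equality. The following is a minimal sketch of such a comparison. AllClose is a hypothetical helper, and the tolerance values a test passes to it are placeholders that should be chosen per data type.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace verify_sketch {

// Returns true when every element of `actual` is within
// absTol + relTol * |expected| of the corresponding golden value.
inline bool AllClose(const std::vector<float>& actual,
                     const std::vector<float>& expected,
                     float relTol, float absTol) {
    if (actual.size() != expected.size()) {
        return false;
    }
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        if (!(diff <= absTol + relTol * std::fabs(expected[i]))) {
            return false;  // also rejects NaN, since NaN comparisons are false
        }
    }
    return true;
}

}  // namespace verify_sketch
```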

Boundary Testing

TEST_F(TestAxpy, BoundaryTest) {
    // Test boundary values: count=0, 1, 256, 257
    // Test different data type combinations
    // Test special values (NaN, Inf)
}
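The boundary cases listed in the comments can first be exercised against a host reference, which documents the expected behavior before the device test is written. This is a sketch only: AxpyReference and OnlyFirstCountChanged are hypothetical helpers, and the device result for NaN and Inf inputs should be confirmed on hardware rather than assumed from this model.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

namespace boundary_sketch {

// Host reference for dst = dst + src * scalar over the first `count` elements.
inline void AxpyReference(std::vector<float>& dst, const std::vector<float>& src,
                          float scalar, std::int32_t count) {
    assert(count >= 0 && static_cast<std::size_t>(count) <= dst.size());
    assert(static_cast<std::size_t>(count) <= src.size());
    for (std::int32_t i = 0; i < count; ++i) {
        dst[i] = dst[i] + src[i] * scalar;
    }
}

// Runs the reference for one boundary count and reports whether exactly
// the first `count` elements changed from 1.0f to 3.0f.
inline bool OnlyFirstCountChanged(std::int32_t count, std::int32_t capacity) {
    std::vector<float> dst(capacity, 1.0f);
    const std::vector<float> src(capacity, 1.0f);
    AxpyReference(dst, src, 2.0f, count);
    for (std::int32_t i = 0; i < capacity; ++i) {
        const float expected = (i < count) ? 3.0f : 1.0f;
        if (dst[i] != expected) {
            return false;
        }
    }
    return true;
}

}  // namespace boundary_sketch
```

Checking that elements beyond count stay untouched is as important as checking the computed ones, because an incorrectly restored mask typically shows up there.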

Data Type Testing

INSTANTIATE_TEST_CASE_P(TEST_AXPY_TYPES, AxpyTestsuite,
    ::testing::Values(
        BinaryTestParams { 256, 2, 2, main_axpy<half, half> },
        BinaryTestParams { 256, 4, 2, main_axpy<float, half> },
        BinaryTestParams { 256, 4, 4, main_axpy<float, float> }
    )
);
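In the parameter sets above, the second and third values match the byte sizes of the destination and source types (2 for half, 4 for float). Assuming that reading is correct, the values can be derived from the template arguments instead of being typed by hand. SizeParams, MakeSizeParams, and HalfStandIn are hypothetical names for this host-side sketch and are not the fields of BinaryTestParams.

```cpp
#include <cassert>
#include <cstdint>

namespace params_sketch {

// Hypothetical 2-byte stand-in for the device `half` type.
struct HalfStandIn {
    std::uint16_t bits;
};

// Hypothetical mirror of the first three values in each parameter set:
// element count, destination type size, source type size (in bytes).
struct SizeParams {
    std::int32_t count;
    std::uint32_t dstTypeSize;
    std::uint32_t srcTypeSize;
};

template <typename T, typename U>
inline SizeParams MakeSizeParams(std::int32_t count) {
    return SizeParams{count,
                      static_cast<std::uint32_t>(sizeof(T)),
                      static_cast<std::uint32_t>(sizeof(U))};
}

}  // namespace params_sketch
```

Deriving the sizes from the types removes one way for a parameter set and its kernel instantiation to drift apart.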

Code Standards

Naming Conventions

// Function name: PascalCase
void Add(...);
void Relu(...);
void Axpy(...);

// Parameter name: camelCase
LocalTensor<T> dstTensor;
int32_t elementCount;

// Macro definition: UPPERCASE_WITH_UNDERSCORES
#define ASCENDC_ASSERT(cond, msg) ...

// Type name: PascalCase
struct BinaryRepeatParams;
class LocalTensor;

Code Style

// 1. Indentation: 4 spaces
// 2. Braces: K&R style
// 3. Spaces: Spaces around operators
// 4. Comments: Doxygen style

/**
 * @brief Vector addition operation
 * @param dst Destination LocalTensor
 * @param src0 Source LocalTensor 0
 * @param src1 Source LocalTensor 1
 * @param count Element count
 */
template <typename T>
__aicore__ inline void Add(const LocalTensor<T>& dst,
                           const LocalTensor<T>& src0,
                           const LocalTensor<T>& src1,
                           const int32_t& count) {
    // Parameter validation
    ASCENDC_ASSERT(count > 0, "count must be positive");

    // Call implementation
    AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count);
}

Error Handling

// 1. Parameter validation (Debug mode)
ASCENDC_ASSERT(count > 0, "count must be greater than 0");
ASCENDC_ASSERT(dst != nullptr, "dst cannot be nullptr");
ASCENDC_ASSERT(src != nullptr, "src cannot be nullptr");

// 2. Type checking
static_assert(SupportType<T, half, float, int16_t, int32_t>(),
              "Unsupported data type");

// 3. Architecture checking
#if !defined(__NPU_ARCH__) || (__NPU_ARCH__ != 2201 && __NPU_ARCH__ != 3510)
    #error "Unsupported NPU architecture"
#endif
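The compile-time type check can be reproduced on a host compiler to see how it behaves. The sketch below is a plain C++11 illustration of a type-list membership test. SupportTypeSketch and CheckedAdd are hypothetical names, and the code is not the SupportType implementation used by the framework.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

namespace check_sketch {

// Base case: an empty list of allowed types supports nothing.
template <typename T>
constexpr bool SupportTypeSketch() {
    return false;
}

// Recursive case: T is supported if it equals First or any type in Rest.
template <typename T, typename First, typename... Rest>
constexpr bool SupportTypeSketch() {
    return std::is_same<T, First>::value || SupportTypeSketch<T, Rest...>();
}

// Example of guarding a template with the check. Instantiating it with an
// unlisted type (for example double) fails at compile time.
template <typename T>
inline T CheckedAdd(T a, T b) {
    static_assert(SupportTypeSketch<T, float, std::int16_t, std::int32_t>(),
                  "Unsupported data type");
    return a + b;
}

}  // namespace check_sketch
```

Because the check runs at compile time, an unsupported instantiation is rejected before any kernel is built, which is cheaper than a runtime assertion.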