msKPP API Reference

APIs

The msKPP tool provides two types of APIs: basic APIs and instruction APIs. Basic APIs simulate the chip platform and basic data for operator computing. Instruction APIs simulate specific operator instruction operations, including Vector and Cube computing instructions.

Table 1 msKPP API list

API Description
Basic APIs -
Chip Chip platform for performance modeling, which initializes the profile data of the chip.
Core Simulated AI Core inside the chip.
Tensor Basic data type for operator execution.
Tensor.load Data transfer API, which models data transfer between different units.
Synchronization Instruction APIs -
set_flag Instruction API for synchronization between pipes in a core, which is used with wait_flag.
wait_flag Instruction API for synchronization between pipes in a core, which is used with set_flag.
Instruction APIs -
mmad API for modeling mmad performance of cube instructions.
vadd API for modeling vadd performance of vector instructions.
vbrcb API for modeling vbrcb performance of vector instructions.
vconv API for modeling vconv performance of vector instructions.
vconv_deq API for modeling vconv_deq performance of vector instructions.
vconv_vdeq API for modeling vconv_vdeq performance of vector instructions.
vector_dup API for modeling vector_dup performance of vector instructions.
vexp API for modeling vexp performance of vector instructions.
vln API for modeling vln performance of vector instructions.
vmax API for modeling vmax performance of vector instructions.
vmul API for modeling vmul performance of vector instructions.
vmuls API for modeling vmuls performance of vector instructions.
vsub API for modeling vsub performance of vector instructions.
vdiv API for modeling vdiv performance of vector instructions.
vcadd API for modeling vcadd performance of vector instructions.
vabs API for modeling vabs performance of vector instructions.
vaddrelu API for modeling vaddrelu performance of vector instructions.
vaddreluconv API for modeling vaddreluconv performance of vector instructions.
vadds API for modeling vadds performance of vector instructions.
vand API for modeling vand performance of vector instructions.
vaxpy API for modeling vaxpy performance of vector instructions.
vbitsort API for modeling vbitsort performance of vector instructions.
vcgadd API for modeling vcgadd performance of vector instructions.
vcgmax API for modeling vcgmax performance of vector instructions.
vcgmin API for modeling vcgmin performance of vector instructions.
vcmax API for modeling vcmax performance of vector instructions.
vcmin API for modeling vcmin performance of vector instructions.
vcmp_xxx API for modeling vcmp_xxx performance of vector instructions.
vcmpv_xxx API for modeling vcmpv_xxx performance of vector instructions.
vcmpvs_xxx API for modeling vcmpvs_xxx performance of vector instructions.
vcopy API for modeling vcopy performance of vector instructions.
vcpadd API for modeling vcpadd performance of vector instructions.
vgather API for modeling vgather performance of vector instructions.
vgatherb API for modeling vgatherb performance of vector instructions.
vlrelu API for modeling vlrelu performance of vector instructions.
vmadd API for modeling vmadd performance of vector instructions.
vmaddrelu API for modeling vmaddrelu performance of vector instructions.
vmaxs API for modeling vmaxs performance of vector instructions.
vmin API for modeling vmin performance of vector instructions.
vmins API for modeling vmins performance of vector instructions.
vmla API for modeling vmla performance of vector instructions.
vmrgsort API for modeling vmrgsort performance of vector instructions.
vmulconv API for modeling vmulconv performance of vector instructions.
vnot API for modeling vnot performance of vector instructions.
vor API for modeling vor performance of vector instructions.
vrec API for modeling vrec performance of vector instructions.
vreduce API for modeling vreduce performance of vector instructions.
vreducev2 API for modeling vreducev2 performance of vector instructions.
vrelu API for modeling vrelu performance of vector instructions.
vrsqrt API for modeling vrsqrt performance of vector instructions.
vsel API for modeling vsel performance of vector instructions.
vshl API for modeling vshl performance of vector instructions.
vshr API for modeling vshr performance of vector instructions.
vsqrt API for modeling vsqrt performance of vector instructions.
vsubrelu API for modeling vsubrelu performance of vector instructions.
vsubreluconv API for modeling vsubreluconv performance of vector instructions.
vtranspose API for modeling vtranspose performance of vector instructions.

Basic APIs

Chip

Function

Processor abstraction, which is instantiated and used in the with statement to explicitly model a type of the Ascend AI Processor.

Prototype

class Chip(name, debug_mode=False)

Parameter Description

Parameter Input Type Description
name String Processor name. Currently, most data is collected from the Atlas A2 training products/Atlas A2 inference products. You can use npu-smi info to view the Ascend AI Processor type of a device.
debug_mode Bool Whether to enable the debug mode. The default value is False. After the debug mode is enabled, you can view the instructions that are not properly executed, but no output is generated.
True: enabled
False: disabled

Member Description

Member Description
chip.enable_trace() Enables the operator simulation pipeline function to generate the pipeline chart file trace.json.
chip.enable_metrics() Enables single instruction and pipeline information, and generates instruction statistics (Instruction_statistic.csv), transfer pipeline statistics (Pipe_statistic.csv), and instruction proportion pie chart (instruction_cycle_consumption.html).
chip.set_cache_hit_ratio(config) Enables manual adjustment of the L2 cache hit ratio. config is a dictionary such as {"cache_hit_ratio": 0.6}. For details, see the cache hit ratio modeling section.
chip.set_prof_summary_path("xxx/PipeUtilization.csv") Sets the path of the msProf result file (PipeUtilization.csv in this example), which is used to compare the theoretical values of the pipeline information with the values measured by msProf. For details, see "comparison between theoretical values of pipeline information and values measured by msProf".
chip.disable_instr_log() After this function is called, log printing is suppressed when instruction tasks are added and scheduled.

Constraints

This class needs to be initialized under the with statement.

Example

from mskpp import Chip
# For details about how to view the Ascend AI Processor type of the current device, see the following description.
with Chip("Ascendxxxyy") as chip:    # Ascendxxxyy needs to be replaced with the actual processor type.
    chip.enable_trace()   # Call this function to enable the operator simulation pipeline function and generate a pipeline chart file.
    chip.enable_metrics()  # Call this function to enable the single instruction and pipeline information, and generate the transfer pipeline statistics, instruction information statistics, and instruction proportion pie chart.

NOTE

For servers other than the Atlas A3 training products/Atlas A3 inference products: Run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the chip name. The actual value is Ascend followed by the chip name. For example, if the chip name is xxxyy, the actual value is Ascendxxxyy. If Ascendxxxyy is the path of the code sample, set this parameter to ascendxxxyy.
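The naming rule in this note can be sketched as follows. chip_model_name is a hypothetical helper, not an msKPP API; it only prepends Ascend to the chip name reported by npu-smi info.

```python
def chip_model_name(chip_name: str) -> str:
    """Return the name to pass to Chip(), given the chip name from npu-smi info."""
    name = chip_name.strip()
    if not name:
        raise ValueError("chip name must not be empty")
    # Avoid a doubled prefix if the caller already passes the full name.
    if name.startswith("Ascend"):
        return name
    return "Ascend" + name


print(chip_model_name("xxxyy"))  # Ascendxxxyy
```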

Returns

None

Core

Function

AI Core abstraction, which is instantiated and used in the with statement to model an AI Core type.

Prototype

class Core(core_type_name)

Parameter Description

Parameter Input Type Description
core_type_name String Name of the Ascend compute unit type, expressed as AICx or AIVx, where x is the sequence number of the AI Cube Core/AI Vector Core in use. The name must consist of one or more characters from [A-Za-z0-9].
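The naming rule for core_type_name can be checked up front. The following is a minimal sketch, assuming the AICx/AIVx convention described above; parse_core_name is a hypothetical helper and is not provided by msKPP.

```python
import re

# Character set stated in the parameter description.
_ALLOWED = re.compile(r"[A-Za-z0-9]+")
# Assumed convention: AIC or AIV followed by a decimal sequence number.
_TYPED = re.compile(r"(AIC|AIV)([0-9]+)")


def parse_core_name(name: str):
    """Return (core_type, index) for AICx/AIVx names, or None for other valid names."""
    if _ALLOWED.fullmatch(name) is None:
        raise ValueError(f"invalid core name: {name!r}")
    match = _TYPED.fullmatch(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_core_name("AIC0"))
print(parse_core_name("AIV12"))
```

A name such as AIC-0 is rejected because the hyphen is outside [A-Za-z0-9].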

Constraints

This class needs to be initialized under the with statement.

Example

from mskpp import Core
with Core("AIC0") as aic:
    # Code related to the operator compute logic on AI Cube Core 0.
    ...

Returns

None

Tensor

Function

Onboard tensor abstraction, in which the memory location, data type, size, and format of tensors can be specified as the data dependency identifiers of instructions.

Prototype

class Tensor(mem_type, dtype=None, size=None, format=None, is_inited=False)

Parameter Description

Parameter Input Type Description
mem_type String Location of the memory space where the abstracted tensor is located, such as GM, UB, L1, L0A, L0B, L0C, FB, and BT.
dtype String Data type, such as BOOL, UINT1, UINT2, UINT8, UINT16, UINT32, BF16, UINT64, INT4, INT8, INT16, INT32, INT64, FP16, and FP32.
size List Shape of a tensor.
format String Data layout format. For details, see "Programming Guide" > "Concepts and Terms" > "Neural Networks and Operators" > "Data Layout Format" in Ascend C Operator Development Guide.
is_inited Bool Switch that indicates whether the tensor is ready. Once enabled, instructions that use the tensor as input can be initiated.

Member Description

Member Description
tensor.set_valid() Marks the current tensor as ready. Instructions that use the tensor as input can then be initiated immediately.
tensor.set_invalid() Marks the current tensor as not ready. Instructions that use the tensor as input can no longer be initiated immediately.
tensor.is_valid() Returns whether the current tensor is ready.

Constraints

To create a scalar, create a tensor whose shape is [1] with is_inited=True.

Example

from mskpp import Tensor, Core
gm_tmp = Tensor("GM", "FP16", [48, 16], format="ND")
with Core("AIV0") as aiv:  # Computing logic on AIV0.
    ...  # result is produced by the computing logic elided here.
    gm_tmp.load(result, set_value=0)
with Core("AIC0") as aic:
    in_x = Tensor("GM", "FP16", [48, 16], format="ND")
    in_x.load(gm_tmp, expect_value=0) # Computing logic on AIC0.
    ...

Returns

None

Tensor.load

Function

All data transfer instructions in the msKPP tool are abstracted as the load method. You only need to choose a valid transfer channel of the Ascend AI Processor and do not need to handle the complex stride settings of the transfer instructions.

Prototype

Tensor.load(tensor, repeat=1, set_value=-1, expect_value=-1)

Parameter Description

Parameter Input Type Description
tensor Variable Source tensor from which data is loaded into the current tensor. It is a Tensor object as defined in the Tensor API.
repeat int This parameter simulates the transfer instruction repeat. You can input this parameter to obtain the bandwidth of each transfer channel with varying repeat values. The bandwidth is used to calculate the time consumed by the transfer instruction. This parameter is optional. The default value is 1. You are advised to set it to an integer within the range of [1, 255]. If the input value of repeat does not meet the requirement, the system throws exception "input repeat = xx invalid." where xx is the input abnormal value of repeat.
set_value int Identifier indicating that other data depends on this tensor data. This parameter can be customized and must be used in conjunction with expect_value. This parameter is optional. If it is not specified, the dependency relationship is not enabled.
expect_value int Identifier indicating that loading of the tensor data depends on other data. This parameter can be customized and must be used in conjunction with set_value. This parameter is optional. If it is not specified, the dependency relationship is not enabled.
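The pairing of set_value and expect_value can be checked before a model is run. The sketch below is a hypothetical helper, not part of msKPP. It assumes that each load call is recorded as a (set_value, expect_value) tuple and treats -1, the default in the prototype, as not used.

```python
def unpaired_dependencies(loads):
    """Return (set-only ids, expect-only ids) for the recorded load calls."""
    records = list(loads)
    set_ids = {s for s, _ in records if s != -1}
    expect_ids = {e for _, e in records if e != -1}
    return sorted(set_ids - expect_ids), sorted(expect_ids - set_ids)


# Identifier 0 is set and expected; identifier 1 is set but never expected.
loads = [(0, -1), (-1, 0), (1, -1)]
print(unpaired_dependencies(loads))  # ([1], [])
```

Any identifier reported in either list has no counterpart and is a candidate for the pipeline blocking mentioned in the constraints.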

Constraints

set_value and expect_value must be used in pairs. Otherwise, pipeline blocking may occur.

The repeat parameter supports only the following four transfer channels: L1_TO_L0A, L1_TO_L0B, GM_TO_L0A, and GM_TO_L0B.
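The repeat rules can be summarized in a small pre-check. This is a sketch under stated assumptions: check_repeat is a hypothetical helper, the channel names are split into (source, destination) memory types, and the error text mirrors the message quoted in the parameter description.

```python
# The four channels on which repeat is supported, as (source, destination).
REPEAT_CHANNELS = {("L1", "L0A"), ("L1", "L0B"), ("GM", "L0A"), ("GM", "L0B")}


def check_repeat(repeat, src_mem, dst_mem):
    """Validate repeat and report whether the channel supports it."""
    is_int = isinstance(repeat, int) and not isinstance(repeat, bool)
    if not is_int or not 1 <= repeat <= 255:
        raise ValueError(f"input repeat = {repeat} invalid.")
    return (src_mem, dst_mem) in REPEAT_CHANNELS


print(check_repeat(4, "L1", "L0A"))
print(check_repeat(1, "GM", "UB"))
```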

Returns

None

Synchronization Instruction APIs

set_flag

Function

Synchronizes instructions between different pipelines in a core. After pipe_src is scheduled, pipe_dst is unblocked. With set_flag and wait_flag in place, the instruction pipeline chart (see "Instruction Pipeline Chart (Using MindStudio Insight as an Example)") better matches the expected execution order.

Prototype

set_flag(pipe_src, pipe_dst, event_id)

Parameter Description

Parameter Input/Output Description
pipe_src Input Source pipeline. After pipe_src is scheduled, set event_id.
The input format is aicore_PIPE, for example, aic0_PIPE-MTE1. For details about the value range of aicore, see the basic function API Core. The value range of PIPE is PIPE-MTE1, PIPE-MTE2, PIPE-MTE3, PIPE-FIX, PIPE-M, PIPE-V, and PIPE-S. If aicore is not specified, you can directly enter the value of PIPE.
Data type: string.
This parameter is required.
pipe_dst Input Destination pipeline. After pipe_src is scheduled, pipe_dst is unblocked.
The input format is aicore_PIPE, for example, aic0_PIPE-MTE1. For details about the value range of aicore, see the basic function API Core. The value range of PIPE is PIPE-MTE1, PIPE-MTE2, PIPE-MTE3, PIPE-FIX, PIPE-M, PIPE-V, and PIPE-S. If aicore is not specified, you can directly enter the value of PIPE.
Data type: string.
This parameter is required.
event_id Input Unique value of the dependency across synchronization instructions.
Value range: [0, 65535]
Data type: int.
This parameter is required.

Constraints

  • The number of set_flag instructions must match the number of wait_flag instructions in the same core.
  • Duplicate set_flag instructions should not exist in the same core.
  • In the same core, if the values of pipe_src and pipe_dst in set_flag and wait_flag are the same, event_id must be unique.
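The constraints above can be checked mechanically on a recorded instruction list. The following is a minimal sketch; check_sync and the (kind, pipe_src, pipe_dst, event_id) tuple layout are assumptions made for illustration and are not msKPP APIs. The list is assumed to contain the instructions of a single core.

```python
from collections import Counter


def check_sync(instructions):
    """Return a list of constraint violations for one core's sync instructions."""
    problems = []
    sets, waits = Counter(), Counter()
    for kind, src, dst, event_id in instructions:
        if not 0 <= event_id <= 65535:
            problems.append(f"event_id {event_id} out of range [0, 65535]")
        key = (src, dst, event_id)
        if kind == "set":
            sets[key] += 1
        elif kind == "wait":
            waits[key] += 1
        else:
            problems.append(f"unknown instruction kind {kind!r}")
    # The number of set_flag and wait_flag instructions must match.
    if sum(sets.values()) != sum(waits.values()):
        problems.append("set_flag and wait_flag counts differ")
    # The same (pipe_src, pipe_dst, event_id) must not be set twice.
    for key, count in sets.items():
        if count > 1:
            problems.append(f"duplicate set_flag {key}")
    return problems


ok = [("set", "PIPE-MTE2", "PIPE-MTE1", 1), ("wait", "PIPE-MTE2", "PIPE-MTE1", 1)]
print(check_sync(ok))
```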

Example

from mskpp import Tensor, Chip, set_flag, wait_flag
with Chip("Ascendxxxyy") as chip:  # Ascendxxxyy needs to be replaced with the actual processor type.
    gm_weight = Tensor("GM", "FP16", [128, 256], format="ND")
    l1_weight = Tensor("L1", "FP16", [128, 256], format="ND")
    for conv_idx in range(4):  # Before data is loaded to L0A, the GM is loaded to L1 in batches.
        gm_weight_part = gm_weight[:, 64]
        l1_weight_part = l1_weight[:, 64]
        l1_weight_part.load(gm_weight_part)
        if conv_idx == 3:
            set_flag("PIPE-MTE2", "PIPE-MTE1", 1)  # MTE1 can be executed only after MTE2 execution is complete.
    x = Tensor("L0A")   # L0A
    # MTE2 is being executed. MTE1 can be executed only after MTE2 execution is complete.
    l1_weight.set_valid()  # Manually enable L1.
    wait_flag("PIPE-MTE2", "PIPE-MTE1", 1)
    x.load(l1_weight)

Returns

None

wait_flag

Function

Ensures the synchronization of different instructions across pipelines in a core. pipe_dst is unblocked after pipe_src is scheduled.

Prototype

wait_flag(pipe_src, pipe_dst, event_id)

Parameter Description

Parameter Input/Output Description
pipe_src Input Source pipeline. After pipe_src is scheduled, set event_id.
The input format is aicore_PIPE, for example, aic0_PIPE-MTE1. For details about the value range of aicore, see the basic function API Core. The value range of PIPE is PIPE-MTE1, PIPE-MTE2, PIPE-MTE3, PIPE-FIX, PIPE-M, PIPE-V, and PIPE-S. If aicore is not specified, you can directly enter the value of PIPE.
Data type: string.
This parameter is required.
pipe_dst Input Destination pipeline. After pipe_src is scheduled, pipe_dst is unblocked.
The input format is aicore_PIPE, for example, aic0_PIPE-MTE1. For details about the value range of aicore, see the basic function API Core. The value range of PIPE is PIPE-MTE1, PIPE-MTE2, PIPE-MTE3, PIPE-FIX, PIPE-M, PIPE-V, and PIPE-S. If aicore is not specified, you can directly enter the value of PIPE.
Data type: string.
This parameter is required.
event_id Input Unique value of the dependency across synchronization instructions.
Value range: [0, 65535]
Data type: int.
This parameter is required.
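The aicore_PIPE input format described for pipe_src and pipe_dst can be parsed as shown below. parse_pipe is a hypothetical helper written for illustration; only the PIPE values listed in the table are taken from this reference.

```python
PIPES = ("PIPE-MTE1", "PIPE-MTE2", "PIPE-MTE3", "PIPE-FIX", "PIPE-M", "PIPE-V", "PIPE-S")


def parse_pipe(name):
    """Split 'aic0_PIPE-MTE1' into ('aic0', 'PIPE-MTE1'); core is None if omitted."""
    core, sep, pipe = name.rpartition("_")
    if not sep:
        # No core prefix: the whole string is the PIPE value.
        core, pipe = None, name
    elif not core:
        raise ValueError(f"empty core name in {name!r}")
    if pipe not in PIPES:
        raise ValueError(f"unknown pipe in {name!r}")
    return core, pipe


print(parse_pipe("aic0_PIPE-MTE1"))
print(parse_pipe("PIPE-V"))
```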

Constraints

  • The number of set_flag instructions must match the number of wait_flag instructions in the same core.
  • Duplicate set_flag instructions should not exist in the same core.
  • In the same core, if the values of pipe_src and pipe_dst in set_flag and wait_flag are the same, event_id must be unique.

Example

from mskpp import Tensor, Chip, set_flag, wait_flag
with Chip("Ascendxxxyy") as chip:  # Ascendxxxyy needs to be replaced with the actual processor type.
    gm_weight = Tensor("GM", "FP16", [128, 256], format="ND")
    l1_weight = Tensor("L1", "FP16", [128, 256], format="ND")
    for conv_idx in range(4):  # Before data is loaded to L0A, the GM is loaded to L1 in batches.
        gm_weight_part = gm_weight[:, 64]
        l1_weight_part = l1_weight[:, 64]
        l1_weight_part.load(gm_weight_part)
        if conv_idx == 3:
            set_flag("PIPE-MTE2", "PIPE-MTE1", 1)  # MTE1 can be executed only after MTE2 is executed.
    x = Tensor("L0A")   # L0A
    # MTE2 is being executed. MTE1 can be executed only after MTE2 execution is complete.
    l1_weight.set_valid()  # Manually enable L1.
    wait_flag("PIPE-MTE2", "PIPE-MTE1", 1)
    x.load(l1_weight)

Returns

None

Instruction APIs

mmad

Function

Performs matrix multiplication and addition.

Prototype

class mmad(x, y, b, is_inited=False)

Parameter Description

Parameter Data Type Description
x Tensor variable Left matrix in the L0A space. FP16 is supported.
y Tensor variable Right matrix in the L0B space. FP16 is supported.
b Tensor variable Bias, which can be in the L0C space or bias table space. FP32 is supported.
is_inited Bool When the input is in the L0C space, is_inited=True needs to be added because there is no direct channel to transfer data from the GM to the L0C.

Constraints

When the bias term is in the bias table space, the tensor data format must be ND and shape must be [n, ].
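The shape relationship between the mmad operands can be sketched as follows. mmad_output_shape is a hypothetical helper, not an msKPP API. It assumes standard matrix multiplication ([m, k] times [k, n] gives [m, n]) and, based on the example in this section, that a bias in the L0C space has shape [m, n], while a bias in the bias table space has shape [n].

```python
def mmad_output_shape(x_shape, y_shape, bias_shape=None):
    """Return [m, n] for x of shape [m, k] and y of shape [k, n]."""
    m, k = x_shape
    k2, n = y_shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    # Assumed bias layouts: [n] for the bias table, [m, n] for L0C.
    if bias_shape is not None and list(bias_shape) not in ([n], [m, n]):
        raise ValueError(f"unexpected bias shape {list(bias_shape)}")
    return [m, n]


print(mmad_output_shape([32, 48], [48, 16], [32, 16]))  # [32, 16]
```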

Example

from mskpp import mmad, Tensor
in_x = Tensor("GM", "FP16", [32, 48], format="ND")
in_y = Tensor("GM", "FP16", [48, 16], format="ND")
in_z = Tensor("GM", "FP32", [32, 16], format="NC1HWC0")
out_z = mmad(in_x, in_y, in_z)()

Returns

None

vadd

Function

vadd instruction abstraction.

z = x + y, where x and y are added by element.

Prototype

class vadd(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, INT16, and INT32 are supported.
z Output Tensor variable Output vector tensor.

Constraints

The tensors of all input and output data of vector instructions are in the UB space, and their shapes must be the same.

Example

from mskpp import vadd, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vadd(ub_x, ub_y, ub_z)()

Returns

None

vbrcb

Function

vbrcb instruction abstraction.

Expands the dimensions of the tensors based on the instruction stride. However, the msKPP instruction system does not support stride. Therefore, you need to specify the dimension expansion factor and ensure that the shapes of the input and output tensors are the same.

Prototype

class vbrcb(x, y, broadcast_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16 and UINT32 are supported.
y Output Tensor variable Output y vector tensor. UINT16 and UINT32 are supported.
broadcast_num Input Int Number of times that the last dimension is expanded. Empirical profile data shows that different expansion factors have little impact on performance. Therefore, the standard expansion factor of 16, denoted as dstBlockStride=1 and dstRepeatStride=8 in the instruction, is commonly employed.

Example

from mskpp import vbrcb, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
broadcast_num = 16
ub_x.load(gm_x)
out = vbrcb(ub_x, ub_y, broadcast_num)()

Returns

None

vconv

Function

vconv instruction abstraction.

y = vconv(x, dtype), where vconv indicates the vector calculation for type conversion of input data.

Currently, the following type conversion is supported: BF16->FP32, FP16->FP32, FP16->INT16, FP16->INT32, FP16->INT4, FP16->INT8, FP16->UINT8, FP32->BF16, FP32->FP16, FP32->INT32, FP32->INT64, INT4->FP16, INT64->FP32, INT8->FP16, and UINT8->FP16.
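The supported conversions listed above can be kept in a lookup table so that a source/target pair is checked before vconv is called. The helper names below are hypothetical; the pairs are copied from the list in this section.

```python
SUPPORTED_VCONV = {
    ("BF16", "FP32"), ("FP16", "FP32"), ("FP16", "INT16"), ("FP16", "INT32"),
    ("FP16", "INT4"), ("FP16", "INT8"), ("FP16", "UINT8"), ("FP32", "BF16"),
    ("FP32", "FP16"), ("FP32", "INT32"), ("FP32", "INT64"), ("INT4", "FP16"),
    ("INT64", "FP32"), ("INT8", "FP16"), ("UINT8", "FP16"),
}


def vconv_supported(src_dtype, dst_dtype):
    """Return True if src_dtype -> dst_dtype is in the documented list."""
    return (src_dtype.upper(), dst_dtype.upper()) in SUPPORTED_VCONV


def vconv_targets(src_dtype):
    """Return the documented target types for a source type, sorted by name."""
    return sorted(d for s, d in SUPPORTED_VCONV if s == src_dtype.upper())


print(vconv_supported("FP16", "FP32"))
print(vconv_targets("FP32"))
```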

Prototype

class vconv(x, y, dtype)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor.
y Output Tensor variable Output y vector tensor.
dtype Input String Data type of the target tensor.

Example

from mskpp import vconv, Tensor
ub_x, ub_y = Tensor("UB", "FP16"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vconv(ub_x, ub_y, "FP32")()

Returns

None

vconv_deq

Function

vconv_deq instruction abstraction.

y = vconv_deq(x, dtype), where vconv_deq indicates vector calculation for quantization on input data.

Currently, conversions from FP16 to INT8 and from INT32 to FP16 are supported.

Prototype

class vconv_deq(x, y, dtype)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor.
y Output Tensor variable Output y vector tensor.
dtype Input String Data type of the target tensor.

Example

from mskpp import vconv_deq, Tensor
ub_x, ub_y = Tensor("UB", "FP16"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vconv_deq(ub_x, ub_y, "INT8")()  # FP16 -> INT8 is one of the supported conversions.

Returns

None

vconv_vdeq

Function

vconv_vdeq instruction abstraction.

y = vconv_vdeq(x, dtype), where vconv_vdeq indicates vector calculation for quantization on input data.

Currently, the conversion from INT16 to INT8 is supported.

Prototype

class vconv_vdeq(x, y, dtype)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor.
y Output Tensor variable Output y vector tensor.
dtype Input String Data type of the target tensor.

Example

from mskpp import vconv_vdeq, Tensor
ub_x, ub_y = Tensor("UB", "INT16"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vconv_vdeq(ub_x, ub_y, "INT8")()  # INT16 -> INT8 is the supported conversion.

Returns

None

vector_dup

Function

vector_dup instruction abstraction.

y = vector_dup(x), where the scalar x is copied to every element of y.

Prototype

class vector_dup(x, y, fill_shape)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, INT16, INT32, UINT16, and UINT32 are supported.
y Output Tensor variable Output y vector tensor. FP16, FP32, INT16, INT32, UINT16, and UINT32 are supported.
fill_shape Input List Shape to which the target tensor is expanded.

Constraints

Since the input to this instruction is only a scalar, you need to create a tensor whose shape is [1] and is_inited=True as the simulated scalar input, without increasing performance overhead.
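The fill semantics can be illustrated with plain Python lists. vector_dup_reference is a hypothetical reference function that only mirrors the result of the instruction; it does not model performance and is not part of msKPP.

```python
def vector_dup_reference(scalar, fill_shape):
    """Return a nested list of shape fill_shape in which every element is scalar."""
    if not fill_shape:
        return scalar
    head, *rest = fill_shape
    return [vector_dup_reference(scalar, rest) for _ in range(head)]


print(vector_dup_reference(1.5, [2, 3]))  # [[1.5, 1.5, 1.5], [1.5, 1.5, 1.5]]
```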

Example

from mskpp import vector_dup, Tensor
ub_x = Tensor("UB", "FP16", [1], format="ND", is_inited=True)
ub_y = Tensor("UB")
out = vector_dup(ub_x, ub_y, [8, 2048])()

Returns

None

vexp

Function

vexp instruction abstraction.

y = vexp(x), where y is the exponential of x by element.

Prototype

class vexp(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vexp, Tensor
ub_x = Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
ub_y = Tensor("UB")
out = vexp(ub_x, ub_y)()

Returns

None

vln

Function

vln instruction abstraction.

y = vln(x), where y is the natural logarithm of x by element.

Prototype

class vln(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vln, Tensor
ub_x = Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
ub_y = Tensor("UB")
out = vln(ub_x, ub_y)()

Returns

None

vmax

Function

vmax instruction abstraction.

z = vmax(x, y), where z is the maximum of x and y by element.

Prototype

class vmax(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, INT16, and INT32 are supported.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.

Example

from mskpp import vmax, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmax(ub_x, ub_y, ub_z)()

Returns

None

vmul

Function

vmul instruction abstraction.

z = x * y, where x and y are multiplied by element.

Prototype

class vmul(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, INT16, and INT32 are supported.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.

Example

from mskpp import vmul, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmul(ub_x, ub_y, ub_z)()

Returns

None

vmuls

Function

vmuls instruction abstraction.

z = vmuls(x, y), where vmuls evaluates the product of vector x and scalar y.

Prototype

class vmuls(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Python scalar Input scalar. The program does not process this parameter.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.

Example

from mskpp import vmuls, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vmuls(ub_x, 5, ub_z)()  # 5 is the value of scalar y.

Returns

None

vsub

Function

vsub instruction abstraction.

z = x - y, where y is subtracted from x by element.

Prototype

class vsub(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, INT16, and INT32 are supported.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.

Example

from mskpp import vsub, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vsub(ub_x, ub_y, ub_z)()

Returns

None

vdiv

Function

vdiv instruction abstraction.

z = x/y, where x is divided by y by element.

Prototype

class vdiv(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input y vector tensor. FP16 and FP32 are supported.
z Output Tensor variable Output vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vdiv, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vdiv(ub_x, ub_y, ub_z)()

Returns

None

vcadd

Function

vcadd instruction abstraction.

Reduces the tensor dimensions based on the input parameters of the instruction. In the msKPP instruction system, reduce_num controls the shape reduction multiple and ensures that the shapes of the input and output tensors are the same. When the last dimension of the shape is reduced to 1, the dimension is eliminated. Ensure that the last dimension of the shape can be exactly divided by reduce_num and is not 0.

Prototype

class vcadd(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.
reduce_num Input Int Number of times that the last dimension is reduced. The value of this parameter does not affect the instruction performance.

Example

from mskpp import vcadd, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
reduce_num = 16
ub_x.load(gm_x)
out = vcadd(ub_x, ub_y, reduce_num)()

Constraints

The value of reduce_num cannot be 0.
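The effect of reduce_num on the tensor shape can be sketched as follows. vcadd_output_shape is a hypothetical helper that follows the description in the Function section: the last dimension is divided by reduce_num and is eliminated when it becomes 1. Keeping [1] for a one-dimensional input is an assumption of this sketch.

```python
def vcadd_output_shape(shape, reduce_num):
    """Return the shape after reducing the last dimension by reduce_num."""
    if reduce_num <= 0:
        raise ValueError("reduce_num must be a positive integer")
    last = shape[-1]
    if last == 0 or last % reduce_num != 0:
        raise ValueError("last dimension must be a non-zero multiple of reduce_num")
    head = list(shape[:-1])
    reduced = last // reduce_num
    if reduced == 1 and head:
        # The last dimension is eliminated once it has been reduced to 1.
        return head
    return head + [reduced]


print(vcadd_output_shape([8, 2048], 16))  # [8, 128]
```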

Returns

None

vabs

Function

vabs instruction abstraction.

y = vabs(x), where y is the absolute value of x by element.

Prototype

class vabs(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vabs, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vabs(ub_x, ub_y)()

Returns

None

vaddrelu

Function

vaddrelu instruction abstraction.

z = vaddrelu(x, y), where x and y are added by element and relu is then applied to the sum.

Prototype

class vaddrelu(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, and INT16 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, and INT16 are supported.
z Output Tensor variable Output vector tensor. FP16, FP32, and INT16 are supported.

Example

from mskpp import vaddrelu, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vaddrelu(ub_x, ub_y, ub_z)()

Returns

None

vaddreluconv

Function

vaddreluconv instruction abstraction.

z = vaddreluconv(x, y), where x and y are added by element, the relu value is calculated, and the output is quantized.

The following conversion types are supported: FP16->INT8, FP32->FP16, and INT16->INT8.
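The add, relu, and conversion steps can be illustrated with a plain Python reference. Both functions below are hypothetical and are not msKPP APIs. The dtype mapping is copied from the list above; the arithmetic of the quantization step itself is not modeled.

```python
# Input dtype -> output dtype, as listed for vaddreluconv.
VADDRELUCONV_DTYPES = {"FP16": "INT8", "FP32": "FP16", "INT16": "INT8"}


def vaddreluconv_out_dtype(in_dtype):
    """Return the output dtype for a supported input dtype."""
    return VADDRELUCONV_DTYPES[in_dtype.upper()]


def add_relu_reference(x, y):
    """Element-wise x + y followed by relu, before any type conversion."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [max(a + b, 0) for a, b in zip(x, y)]


print(vaddreluconv_out_dtype("FP32"))
print(add_relu_reference([1, -3], [2, 1]))  # [3, 0]
```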

Prototype

class vaddreluconv(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, and INT16 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, and INT16 are supported.
z Output Tensor variable Output vector tensor. FP16 and INT8 are supported.

Example

from mskpp import vaddreluconv, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vaddreluconv(ub_x, ub_y, ub_z)()

Returns

None

vadds

Function

vadds instruction abstraction.

z = vadds(x, y), where vadds evaluates the sum of vector x and scalar y.

Prototype

class vadds(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Python scalar Input scalar. The program does not process this parameter.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.

Example

from mskpp import vadds, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vadds(ub_x, 5, ub_z)()  # 5 is the value of scalar y.

Returns

None

vand

Function

vand instruction abstraction.

z = vand(x, y), where z is the AND of x and y by element.

Prototype

class vand(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. INT16 and UINT16 are supported.
y Input Tensor variable Input y vector tensor. INT16 and UINT16 are supported.
z Output Tensor variable Output vector tensor. INT16 and UINT16 are supported.

Example

from mskpp import vand, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vand(ub_x, ub_y, ub_z)()

Returns

None

vaxpy

Function

vaxpy instruction abstraction.

z = x * y + z. vaxpy multiplies vector x by scalar y and adds the product to the vector at the target address z. The output data type can be set to FP32 by using if_mix.
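
As a reference for the arithmetic only, the following plain-Python sketch evaluates z = x * y + z on ordinary lists. It does not call msKPP, which models performance rather than computing values, and vaxpy_reference is a hypothetical helper name introduced for illustration.

```python
def vaxpy_reference(x, y, z):
    """Return x[i] * y + z[i] for every element, where y is a scalar."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    return [xi * y + zi for xi, zi in zip(x, z)]


result = vaxpy_reference([1.0, 2.0, 3.0], 2.0, [10.0, 20.0, 30.0])
print(result)  # [12.0, 24.0, 36.0]
```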

Prototype

class vaxpy(x, y, z, if_mix=False)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Tensor variable Input scalar. The program does not process this parameter.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.
if_mix Input Bool The default value is False. If this parameter is set to True, the output data type is FP32.

Example

from mskpp import vaxpy, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vaxpy(ub_x, 5, ub_z)()  # 5 is the value of scalar y.

Returns

None

vbitsort

Function

vbitsort instruction abstraction.

Sorts data based on the x input and provides the original index data of the elements after sorting. Therefore, the shape of the output vector tensor is twice that of the x data.

Prototype

class vbitsort(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input vector tensor. UINT32 is supported.
z Output Tensor variable Output vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vbitsort, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM") 
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vbitsort(ub_x, ub_y, ub_z)()

Returns

None

vcgadd

Function

vcgadd instruction abstraction.

Calculates the sum of elements in each block. There are eight blocks in total. Mixed addresses are not supported.
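
The block-wise reduction can be pictured with a small plain-Python sketch. It is not msKPP code: vcgadd_reference is a hypothetical helper, and the figure of 16 elements per block is an assumption (32-byte blocks holding FP16 values) that the text above does not state.

```python
BLOCKS = 8            # stated above: eight blocks in total
ELEMS_PER_BLOCK = 16  # assumption: 32-byte blocks holding FP16 values


def vcgadd_reference(x):
    """Return one sum per block for a flat list of 8 * 16 values."""
    if len(x) != BLOCKS * ELEMS_PER_BLOCK:
        raise ValueError("expected exactly 128 elements")
    return [
        sum(x[b * ELEMS_PER_BLOCK:(b + 1) * ELEMS_PER_BLOCK])
        for b in range(BLOCKS)
    ]


# Block b is filled with the value b, so its sum is 16 * b.
data = [float(b) for b in range(BLOCKS) for _ in range(ELEMS_PER_BLOCK)]
block_sums = vcgadd_reference(data)
print(block_sums)
```

Replacing sum with max or min in the same loop gives the corresponding picture for the vcgmax and vcgmin instructions.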

Prototype

class vcgadd(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.
reduce_num Input Int Reduction factor for the shape.

Constraints

The value of reduce_num cannot be 0.

Example

from mskpp import vcgadd, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
reduce_num = 16
ub_x.load(gm_x)
out = vcgadd(ub_x, ub_y, reduce_num)()

Returns

None

vcgmax

Function

vcgmax instruction abstraction.

Calculates the maximum element of each block. There are eight blocks in total. Mixed addresses are not supported.

Prototype

class vcgmax(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.
reduce_num Input Int Number of times that the last dimension is reduced. The value of this parameter does not affect the instruction performance.

Constraints

The value of reduce_num cannot be 0.

Example

from mskpp import vcgmax, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
reduce_num = 16
ub_x.load(gm_x)
out = vcgmax(ub_x, ub_y, reduce_num)()

Returns

None

vcgmin

Function

vcgmin instruction abstraction.

Calculates the minimum element of each block. There are eight blocks in total. Mixed addresses are not supported.

Prototype

class vcgmin(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 is supported.
y Output Tensor variable Output y vector tensor. FP16 is supported.
reduce_num Input Int Reduction factor for the last dimension. Empirical profile data shows that the reduction has no impact on performance.

Constraints

The value of reduce_num cannot be 0.

Example

from mskpp import vcgmin, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
reduce_num = 16
ub_x.load(gm_x)
out = vcgmin(ub_x, ub_y, reduce_num)()

Returns

None

vcmax

Function

vcmax instruction abstraction.

Calculates the maximum element value in the input vector.

Prototype

class vcmax(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.
reduce_num Input Int Reduction factor for the last dimension. Empirical profile data shows that the reduction has no impact on performance.

Constraints

The value of reduce_num cannot be 0.

Example

from mskpp import vcmax, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
reduce_num = 16
ub_x.load(gm_x)
out = vcmax(ub_x, ub_y, reduce_num)()

Returns

None

vcmin

Function

vcmin instruction abstraction.

Calculates the minimum element value in the input vector.

Prototype

class vcmin(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.
reduce_num Input Int Reduction factor for the last dimension. Empirical profile data shows that the reduction has no impact on performance.

Constraints

The value of reduce_num cannot be 0.

Example

from mskpp import vcmin, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
reduce_num = 16
ub_x.load(gm_x)
out = vcmin(ub_x, ub_y, reduce_num)()

Returns

None

vcmp_xxx

Function

vcmp_[eq|ge|gt|le|lt|ne] instruction abstraction. The following six instructions have the same performance.

vcmp_eq: z = (x == y), where z holds the element-wise result of testing whether x is equal to y.

vcmp_ge: z = (x >= y), where z holds the element-wise result of testing whether x is greater than or equal to y.

vcmp_gt: z = (x > y), where z holds the element-wise result of testing whether x is greater than y.

vcmp_le: z = (x <= y), where z holds the element-wise result of testing whether x is less than or equal to y.

vcmp_lt: z = (x < y), where z holds the element-wise result of testing whether x is less than y.

vcmp_ne: z = (x != y), where z holds the element-wise result of testing whether x is not equal to y.

Prototype

class vcmp(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input y vector tensor. FP16 and FP32 are supported.

Constraints

The tensors of all input and output data of vector instructions are in the UB space, and their shapes must be the same.

Example

from mskpp import vcmp, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vcmp(ub_x, ub_y)()

Returns

None

vcmpv_xxx

Function

vcmpv_[eq|ge|gt|le|lt|ne] instruction abstraction. The following six instructions have the same performance.

vcmpv_eq: z = (x == y), where z holds the element-wise result of testing whether x is equal to y.

vcmpv_ge: z = (x >= y), where z holds the element-wise result of testing whether x is greater than or equal to y.

vcmpv_gt: z = (x > y), where z holds the element-wise result of testing whether x is greater than y.

vcmpv_le: z = (x <= y), where z holds the element-wise result of testing whether x is less than or equal to y.

vcmpv_lt: z = (x < y), where z holds the element-wise result of testing whether x is less than y.

vcmpv_ne: z = (x != y), where z holds the element-wise result of testing whether x is not equal to y.
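
The six comparison variants differ only in the operator applied to each element pair, which a short plain-Python sketch can make concrete. This is not msKPP code: vcmpv_reference and COMPARE are hypothetical names, and because the encoding of z is not described here, plain booleans are used as an assumption.

```python
import operator

# Suffix -> Python comparison; the keys mirror the six variants listed above.
COMPARE = {
    "eq": operator.eq,
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
    "lt": operator.lt,
    "ne": operator.ne,
}


def vcmpv_reference(kind, x, y):
    """Compare x and y element by element and return a list of booleans."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    compare = COMPARE[kind]
    return [compare(xi, yi) for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0]
y = [2.0, 2.0, 2.0]
results = {kind: vcmpv_reference(kind, x, y) for kind in COMPARE}
print(results["gt"])  # [False, False, True]
```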

Prototype

class vcmpv(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input y vector tensor. FP16 and FP32 are supported.
z Output Tensor variable Output vector tensor.

Constraints

The tensors of all input and output data of vector instructions are in the UB space, and their shapes must be the same.

Example

from mskpp import vcmpv, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vcmpv(ub_x, ub_y, ub_z)()

Returns

None

vcmpvs_xxx

Function

vcmpvs_[eq|ge|gt|le|lt|ne] instruction abstraction. The following six instructions have the same performance.

vcmpvs_eq: z = (x == y), where z holds the element-wise result of testing whether x is equal to the scalar stored in y.

vcmpvs_ge: z = (x >= y), where z holds the element-wise result of testing whether x is greater than or equal to the scalar stored in y.

vcmpvs_gt: z = (x > y), where z holds the element-wise result of testing whether x is greater than the scalar stored in y.

vcmpvs_le: z = (x <= y), where z holds the element-wise result of testing whether x is less than or equal to the scalar stored in y.

vcmpvs_lt: z = (x < y), where z holds the element-wise result of testing whether x is less than the scalar stored in y.

vcmpvs_ne: z = (x != y), where z holds the element-wise result of testing whether x is not equal to the scalar stored in y.

Prototype

class vcmpvs(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input y vector tensor. FP16 and FP32 are supported.
z Output Tensor variable Output vector tensor.

Constraints

The tensors of all input and output data of vector instructions are in the UB space, and their shapes must be the same.

Example

from mskpp import vcmpvs, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vcmpvs(ub_x, ub_y, ub_z)()

Returns

None

vcopy

Function

vcopy instruction abstraction.

Copies tensors at the source address to the destination address.

Prototype

class vcopy(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input vector tensor. int16, int32, uint16, and uint32 are supported.
y Output Tensor variable Output vector tensor. int16, int32, uint16, and uint32 are supported.

Example

from mskpp import vcopy, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vcopy(ub_x, ub_y)()

Returns

None

vcpadd

Function

vcpadd instruction abstraction.

Adds each pair of adjacent elements x[n] and x[n+1] of the input x vector, where n is an even index, and writes the results to y. reduce_num controls the output type.
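
The pairwise addition can be sketched in plain Python as follows. This is a reference for the arithmetic only, not msKPP code, and vcpadd_reference is a hypothetical helper name.

```python
def vcpadd_reference(x):
    """Add x[n] and x[n + 1] for every even index n."""
    if len(x) % 2 != 0:
        raise ValueError("x must contain an even number of elements")
    return [x[n] + x[n + 1] for n in range(0, len(x), 2)]


pair_sums = vcpadd_reference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(pair_sums)  # [3.0, 7.0, 11.0]
```

The output has half as many elements as the input, because every result consumes two source elements.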

Prototype

class vcpadd(x, y, reduce_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. fp16 and fp32 are supported.
y Output Tensor variable Output y vector tensor. fp16 and fp32 are supported.
reduce_num Input Int Reduction factor for the shape.

Example

from mskpp import vcpadd, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
reduce_num = 2  # example value: each output element is the sum of two inputs
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vcpadd(ub_x, ub_y, reduce_num)()

Returns

None

vgather

Function

Gathers given input tensors by element to the result tensor based on the offset address tensor provided.
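
A gather can be illustrated in plain Python as shown below. This is only a sketch of the idea: vgather_reference, source, and offsets are hypothetical names, and the offsets are treated as element indices, which is an assumption because the unit of the offset addresses is not stated here.

```python
def vgather_reference(source, offsets):
    """Pick source[o] for each offset o, treating offsets as element indices."""
    return [source[o] for o in offsets]


gathered = vgather_reference([10, 20, 30, 40], [3, 0, 0, 2])
print(gathered)  # [40, 10, 10, 30]
```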

Prototype

class vgather(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16 and UINT32 are supported.
y Output Tensor variable Output y vector tensor. UINT16 and UINT32 are supported.

Example

from mskpp import vgather, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vgather(ub_x, ub_y)()

Returns

None

vgatherb

Function

Gathers a given input tensor to the result tensor based on the offset address tensor provided.

Prototype

class vgatherb(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16 and UINT32 are supported.
y Output Tensor variable Output y vector tensor. UINT16 and UINT32 are supported.

Example

from mskpp import vgatherb, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vgatherb(ub_x, ub_y)()

Returns

None

vlrelu

Function

vlrelu instruction abstraction.

If x is greater than or equal to 0, z = x. If x is less than 0, z = x * y, where y is a scalar and the multiplication is applied element by element.
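
The rule above is a leaky ReLU with slope y for negative inputs. A minimal plain-Python sketch of the arithmetic follows. It is not msKPP code, and vlrelu_reference is a hypothetical helper name.

```python
def vlrelu_reference(x, y):
    """Keep non-negative elements; multiply negative elements by scalar y."""
    return [xi if xi >= 0 else xi * y for xi in x]


activated = vlrelu_reference([-2.0, -0.5, 0.0, 3.0], 0.5)
print(activated)  # [-1.0, -0.25, 0.0, 3.0]
```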

Prototype

class vlrelu(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16 and float32 are supported.
y Input Tensor variable Input y scalar. float16 and float32 are supported.
z Output Tensor variable Output vector tensor. float16 and float32 are supported.

Example

from mskpp import vlrelu, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
scalar_y = 5  # 5 is the value of scalar y.
ub_x.load(gm_x)
out = vlrelu(ub_x, scalar_y, ub_z)()

Returns

None

vmadd

Function

vmadd instruction abstraction.

z = x × z + y. Performs multiplication and addition on each element of the two vectors.
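
Note that z appears on both sides of the formula: it is read as a multiplicand and then overwritten with the result. The plain-Python sketch below shows the arithmetic only. It is not msKPP code, and vmadd_reference is a hypothetical helper name.

```python
def vmadd_reference(x, y, z):
    """Return x[i] * z[i] + y[i] for every element."""
    if not (len(x) == len(y) == len(z)):
        raise ValueError("x, y and z must have the same length")
    return [xi * zi + yi for xi, yi, zi in zip(x, y, z)]


updated_z = vmadd_reference([1.0, 2.0], [0.5, 0.5], [4.0, 5.0])
print(updated_z)  # [4.5, 10.5]
```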

Prototype

class vmadd(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16 and float32 are supported.
y Input Tensor variable Input y vector tensor. float16 and float32 are supported.
z Output Tensor variable Output vector tensor. float16 and float32 are supported.

Example

from mskpp import vmadd, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmadd(ub_x, ub_y, ub_z)()

Returns

None

vmaddrelu

Function

vmaddrelu instruction abstraction.

z = RELU(x * z + y): Multiplies x and z element by element, adds y, and then applies a ReLU operation to each element of the result.

Prototype

class vmaddrelu(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16 and float32 are supported.
y Input Tensor variable Input y vector tensor. float16 and float32 are supported.
z Output Tensor variable Output vector tensor. float16 and float32 are supported.

Example

from mskpp import vmaddrelu, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmaddrelu(ub_x, ub_y, ub_z)()

Returns

None

vmaxs

Function

vmaxs instruction abstraction.

Compares each element in the vector with a scalar and returns the larger one.

Prototype

class vmaxs(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16, float32, int16, and int32 are supported.
y Input Tensor variable Input scalar. The program does not process this parameter.
z Output Tensor variable Output vector tensor. float16, float32, int16, and int32 are supported.

Example

from mskpp import vmaxs, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vmaxs(ub_x, 5, ub_z)()

Returns

None

vmin

Function

vmin instruction abstraction.

Compares the two vectors element by element and returns the smaller value of each pair.

Prototype

class vmin(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16, float32, int16, and int32 are supported.
y Input Tensor variable Input y vector tensor. float16, float32, int16, and int32 are supported.
z Output Tensor variable Output vector tensor. float16, float32, int16, and int32 are supported.

Example

from mskpp import vmin, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmin(ub_x, ub_y, ub_z)()

Returns

None

vmins

Function

vmins instruction abstraction.

Compares each element in the vector with a scalar and returns the smaller one.

Prototype

class vmins(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16, float32, int16, and int32 are supported.
y Input Tensor variable Input scalar. The program does not process this parameter.
z Output Tensor variable Output vector tensor. float16, float32, int16, and int32 are supported.

Example

from mskpp import vmins, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vmins(ub_x, 5, ub_z)()  # 5 is the value of scalar y.

Returns

None

vmla

Function

vmla instruction abstraction.

z = x * y + z, where x and y are multiplied by element, and the multiplication result is added to z by element. The output data type can be specified as FP32 by using if_mix.

The supported type combinations are:

type = f16: f16 = f16 × f16 + f16

type = f32: f32 = f32 × f32 + f32

if_mix = True: f32 = f16 × f16 + f32. The x and y vectors each supply 64 f16 elements. Only the lower four blocks of each source vector are used, and the upper four blocks are ignored. z (Xd) is a 64-element f32 vector of eight blocks, and is used as both the target vector and the third source vector.

Prototype

class vmla(x, y, z, if_mix=False)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input y vector tensor. FP16 and FP32 are supported.
z Output Tensor variable Output vector tensor. FP16 and FP32 are supported.
if_mix Input Bool The default value is False. If this parameter is set to True, the output data type is FP32.

Constraints

The tensors of input and output data of vector instructions are in the UB space.

Example

from mskpp import vmla, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmla(ub_x, ub_y, ub_z)()

Returns

None

vmrgsort

Function

Merges at most four sorted Region Proposal lists into one. The results are sorted in descending order of the score fields.
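
The merge step can be sketched in plain Python with heapq.merge from the standard library. This is not msKPP code: vmrgsort_reference is a hypothetical helper, and each proposal is represented as a (score, index) tuple, which is an assumption made for illustration.

```python
import heapq


def vmrgsort_reference(*sorted_lists):
    """Merge one to four lists that are already sorted by descending score."""
    if not 1 <= len(sorted_lists) <= 4:
        raise ValueError("between one and four lists are expected")
    # reverse=True tells heapq.merge that inputs run from largest to smallest.
    return list(heapq.merge(*sorted_lists, key=lambda p: p[0], reverse=True))


list_a = [(0.9, 0), (0.4, 1)]
list_b = [(0.8, 2), (0.1, 3)]
list_c = [(0.5, 4)]
merged = vmrgsort_reference(list_a, list_b, list_c)
print([score for score, _ in merged])  # [0.9, 0.8, 0.5, 0.4, 0.1]
```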

Prototype

class vmrgsort(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Input Tensor variable Input y vector tensor. UINT64 is supported.
z Output Tensor variable Output vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vmrgsort, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmrgsort(ub_x, ub_y, ub_z)()

Returns

None

vmulconv

Function

vmulconv instruction abstraction.

z = vmulconv(x, y), where x and y are multiplied by element, and the output is quantized.

Prototype

class vmulconv(x, y, z, dtype)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 is supported.
y Input Tensor variable Input y vector tensor. FP16 is supported.
z Output Tensor variable Output vector tensor.
dtype Input String Target data type of the quantized output z. UINT8 and INT8 are supported.

Example

from mskpp import vmulconv, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vmulconv(ub_x, ub_y, ub_z, 'UINT8')()

Returns

None

vnot

Function

vnot instruction abstraction.

Performs bitwise NOT on input vectors. Each vector has 8 × 256 bits.
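
For 16-bit elements, bitwise NOT flips every bit within the 16-bit width. The plain-Python sketch below shows this for values viewed as UINT16. It is not msKPP code, and vnot_reference is a hypothetical helper name.

```python
MASK16 = 0xFFFF  # keep results inside 16 bits, matching the UINT16 view


def vnot_reference(x):
    """Invert every bit of each 16-bit element."""
    return [~xi & MASK16 for xi in x]


inverted = vnot_reference([0x0000, 0xFFFF, 0x00FF, 0x1234])
print([hex(v) for v in inverted])  # ['0xffff', '0x0', '0xff00', '0xedcb']
```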

Prototype

class vnot(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. INT16 and UINT16 are supported.
y Output Tensor variable Output y vector tensor. INT16 and UINT16 are supported.

Example

from mskpp import vnot, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vnot(ub_x, ub_y)()

Constraints

This instruction supports only the common mask mode and counter mode.

Returns

None

vor

Function

vor instruction abstraction.

Performs bitwise OR on input vectors. Each vector has 8 × 256 bits.

Prototype

class vor(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. INT16 and UINT16 are supported.
y Input Tensor variable Input y vector tensor. INT16 and UINT16 are supported.
z Output Tensor variable Output z vector tensor. INT16 and UINT16 are supported.

Example

from mskpp import vor, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vor(ub_x, ub_y, ub_z)()

Constraints

This instruction supports only the common mask mode and counter mode.

Returns

None

vrec

Function

vrec instruction abstraction.

Performs floating-point reciprocal estimation and finds an approximate reciprocal for each element of the vector.

Prototype

class vrec(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16 and FP32 are supported.
y Output Tensor variable Output y vector tensor. FP16 and FP32 are supported.

Example

from mskpp import vrec, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vrec(ub_x, ub_y)()

Returns

None

vreduce

Function

vreduce instruction abstraction.

Determines which elements of the x vector are to be stored in the z vector based on the mask data of the input y vector. Because the tensor in msKPP lacks actual elements, the reserve_num parameter is added to determine the shape of the z output.
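
The reason for reserve_num becomes clearer with real data: the output length equals the number of set mask bits, which is only known once the mask values exist. The plain-Python sketch below illustrates this. It is not msKPP code, vreduce_reference is a hypothetical helper, and the mask is represented as one 0/1 entry per element as an assumption.

```python
def vreduce_reference(x, mask_bits):
    """Keep the elements of x whose corresponding mask bit is set."""
    if len(x) != len(mask_bits):
        raise ValueError("x and mask_bits must have the same length")
    return [xi for xi, keep in zip(x, mask_bits) if keep]


kept = vreduce_reference([10, 11, 12, 13, 14, 15], [1, 0, 1, 1, 0, 0])
reserve_num = len(kept)  # the count that has to be supplied by hand in msKPP
print(kept, reserve_num)  # [10, 12, 13] 3
```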

Prototype

class vreduce(x, y, z, reserve_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16 and UINT32 are supported.
y Input Tensor variable Input y vector tensor. UINT16 and UINT32 are supported.
z Output Tensor variable Output z vector tensor. UINT16 and UINT32 are supported.
reserve_num Input Int Number of output elements.

Example

from mskpp import vreduce, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y, gm_z = Tensor("GM"), Tensor("GM"), Tensor("GM")
reserve_num = 16
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vreduce(ub_x, ub_y, ub_z, reserve_num)()
gm_z.load(out[0])

Returns

None

vreducev2

Function

vreducev2 instruction abstraction.

Determines which block-level elements of the x vector are to be stored in the z vector based on the mask data of the input y vector. Because tensors in msKPP hold no actual element data, the reserve_num parameter is added to determine the shape of the z output.

Prototype

class vreducev2(x, y, z, reserve_num)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16 and UINT32 are supported.
y Input Tensor variable Input y vector tensor. UINT16 and UINT32 are supported.
z Output Tensor variable Output z vector tensor. UINT16 and UINT32 are supported.
reserve_num Input Int Number of output elements.

Example

from mskpp import vreducev2, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y, gm_z = Tensor("GM"), Tensor("GM"), Tensor("GM")
reserve_num = 16
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vreducev2(ub_x, ub_y, ub_z, reserve_num)()
gm_z.load(out[0])

Returns

None

vrelu

Function

vrelu instruction abstraction.

Performs the relu operation on each element, which takes 0 if the element is less than 0, and takes the element itself if it is greater than or equal to 0.

Prototype

class vrelu(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16, float32, and int32 are supported.
y Output Tensor variable Output vector tensor. float16, float32, and int32 are supported.

Example

from mskpp import vrelu, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vrelu(ub_x, ub_y)()

Returns

None

vrsqrt

Function

vrsqrt instruction abstraction.

Calculates the reciprocal square root of a floating point number.

Prototype

class vrsqrt(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16 and float32 are supported.
y Output Tensor variable Output vector tensor. float16 and float32 are supported.

Example

from mskpp import vrsqrt, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vrsqrt(ub_x, ub_y)()

Returns

None

vsel

Function

vsel instruction abstraction.

This instruction is usually used together with vcmp: at each position, it selects the element from x or from y based on the cmp_mask that vcmp produces.
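
The compare-then-select pattern can be sketched in plain Python. This is not msKPP code: vsel_reference is a hypothetical helper, and the convention that a set mask bit picks x while a cleared bit picks y is an assumption.

```python
def vsel_reference(x, y, cmp_mask):
    """Pick x[i] where the mask is set and y[i] where it is cleared."""
    # Assumption: a set mask bit selects x, a cleared bit selects y.
    return [xi if m else yi for xi, yi, m in zip(x, y, cmp_mask)]


x = [1.0, 5.0, 3.0]
y = [4.0, 2.0, 3.5]
cmp_mask = [xi > yi for xi, yi in zip(x, y)]  # plays the role of vcmp_gt
selected = vsel_reference(x, y, cmp_mask)
print(selected)  # [4.0, 5.0, 3.5]
```

With a greater-than mask, the selection yields the element-wise maximum of x and y.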

Prototype

class vsel(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, INT16, and INT32 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, INT16, and INT32 are supported.
z Output Tensor variable Output vector tensor. FP16, FP32, INT16, and INT32 are supported.

Example

from mskpp import vsel, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vsel(ub_x, ub_y, ub_z)()

Returns

None

vshl

Function

vshl instruction abstraction.

Performs logical left shift or arithmetic left shift based on the input type.

Prototype

class vshl(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16, UINT32, INT16, and INT32 are supported.
y Output Tensor variable Output y vector tensor. UINT16, UINT32, INT16, and INT32 are supported.

Example

from mskpp import vshl, Tensor
ub_x, ub_z = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vshl(ub_x, ub_z)()

Returns

None

vshr

Function

vshr instruction abstraction.

Performs logical right shift or arithmetic right shift based on the input type.
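
The difference between the two shift kinds shows up only for values whose top bit is set. The plain-Python sketch below assumes that signed types use an arithmetic shift and unsigned types use a logical shift. vshr_reference and its shift and signed parameters are hypothetical and do not appear in the msKPP prototype.

```python
def vshr_reference(values, shift, signed):
    """Shift 16-bit values right: arithmetic if signed, logical otherwise."""
    out = []
    for v in values:
        if signed:
            out.append(v >> shift)             # Python's >> preserves the sign
        else:
            out.append((v & 0xFFFF) >> shift)  # zero-fill from the left
    return out


arithmetic = vshr_reference([-8, 8], 1, signed=True)     # INT16 view
logical = vshr_reference([0xFFF8, 8], 1, signed=False)   # UINT16 view
print(arithmetic, [hex(v) for v in logical])  # [-4, 4] ['0x7ffc', '0x4']
```

The bit pattern 0xFFF8 is -8 when read as INT16, so the same bits give -4 under an arithmetic shift and 0x7FFC under a logical shift.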

Prototype

class vshr(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. UINT16, UINT32, INT16, and INT32 are supported.
y Output Tensor variable Output y vector tensor. UINT16, UINT32, INT16, and INT32 are supported.

Example

from mskpp import vshr, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vshr(ub_x, ub_y)()

Returns

None

vsqrt

Function

vsqrt instruction abstraction.

y = √x, which takes the square root of x by element.

Prototype

class vsqrt(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16 and float32 are supported.
y Output Tensor variable Output y vector tensor. float16 and float32 are supported.

Example

from mskpp import vsqrt, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vsqrt(ub_x, ub_y)()

Constraints

The input value must be a positive number. Otherwise, the result is unknown and an exception occurs.

Returns

None

vsubrelu

Function

vsubrelu instruction abstraction.

z = vsubrelu(x, y), where y is subtracted from x element by element and the ReLU of the difference is then calculated.

Prototype

class vsubrelu(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. float16 and float32 are supported.
y Input Tensor variable Input y vector tensor. float16 and float32 are supported.
z Output Tensor variable Output vector tensor. float16 and float32 are supported.

Example

from mskpp import vsubrelu, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vsubrelu(ub_x, ub_y, ub_z)()

Returns

None

vsubreluconv

Function

vsubreluconv instruction abstraction.

z = vsubreluconv(x, y), where y is subtracted from x element by element, the ReLU of the difference is calculated, and the output is quantized.

The following conversion types are supported: FP16->INT8, FP32->FP16, and INT16->INT8.

Prototype

class vsubreluconv(x, y, z)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. FP16, FP32, and INT16 are supported.
y Input Tensor variable Input y vector tensor. FP16, FP32, and INT16 are supported.
z Output Tensor variable Output vector tensor. FP16 and INT8 are supported.

Example

from mskpp import vsubreluconv, Tensor
ub_x, ub_y, ub_z = Tensor("UB"), Tensor("UB"), Tensor("UB")
gm_x, gm_y = Tensor("GM"), Tensor("GM")
ub_x.load(gm_x)
ub_y.load(gm_y)
out = vsubreluconv(ub_x, ub_y, ub_z)()

Returns

None

vtranspose

Function

vtranspose instruction abstraction.

Transposes a 16 × 16 matrix starting from the input address x (32-byte aligned). Each element has 16 bits. The result is output to y. The input and output are each a contiguous 512-byte storage space.
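
The layout can be checked with a plain-Python sketch that transposes a flat, row-major list of 256 elements. This is not msKPP code, vtranspose_reference is a hypothetical helper, and row-major storage is an assumption.

```python
N = 16               # matrix dimension stated above
BYTES_PER_ELEM = 2   # each element has 16 bits


def vtranspose_reference(flat):
    """Transpose a row-major 16 x 16 matrix stored as a flat list."""
    if len(flat) != N * N:
        raise ValueError("expected exactly 256 elements")
    # Output position c * N + r receives input position r * N + c.
    return [flat[r * N + c] for c in range(N) for r in range(N)]


matrix = list(range(N * N))
transposed = vtranspose_reference(matrix)
size_bytes = len(matrix) * BYTES_PER_ELEM
print(transposed[1], transposed[16], size_bytes)  # 16 1 512
```

The 512-byte figure follows from 16 × 16 elements of 2 bytes each.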

Prototype

class vtranspose(x, y)

Parameter Description

Parameter Input/Output Data Type Description
x Input Tensor variable Input x vector tensor. INT16 is supported.
y Output Tensor variable Output vector tensor. INT16 is supported.

Example

from mskpp import vtranspose, Tensor
ub_x, ub_y = Tensor("UB"), Tensor("UB")
gm_x = Tensor("GM")
ub_x.load(gm_x)
out = vtranspose(ub_x, ub_y)()

Returns

None