TADDReluConv
Tile Operation Diagram
Introduction
Fused elementwise add, ReLU clamp, and type conversion. Per element: dst = convert(max(0, src0 + src1)).
At the ISA level, this is a single fused instruction (TADDRELUCONV): add + ReLU clamp + down-cast in one semantic step. Backend realization is architecture-dependent (for example, A2A3 can use a single fused intrinsic, while A5 lowers to an equivalent VF sequence), but user-visible semantics are identical.
Math Interpretation
For each element (i, j) in the valid region:
$$\mathrm{dst}_{i,j} = \mathrm{convert}\!\left(\max\!\left(0,\;\mathrm{src0}_{i,j} + \mathrm{src1}_{i,j}\right)\right)$$
where convert narrows the result from the source type to the destination type with saturating behavior. For floating-point down-casts, rounding follows round-to-nearest-even.
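As a host-side illustration of the formula above, the following minimal sketch models the int16_t -> int8_t case one element at a time. The names add_relu_conv_ref and saturate_to_int8 are hypothetical and not part of the PTO API. The sketch also assumes the sum is formed in a wider type before clamping, since the text does not say how intermediate overflow is handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a PTO function): saturating narrow to int8_t.
inline std::int8_t saturate_to_int8(std::int32_t v) {
    return static_cast<std::int8_t>(std::clamp<std::int32_t>(v, INT8_MIN, INT8_MAX));
}

// Hypothetical scalar reference: dst[i] = convert(max(0, src0[i] + src1[i])).
// Assumption: the sum is computed in int32_t, so it cannot wrap at 16 bits.
inline std::vector<std::int8_t> add_relu_conv_ref(const std::vector<std::int16_t>& src0,
                                                  const std::vector<std::int16_t>& src1) {
    const std::size_t n = std::min(src0.size(), src1.size());
    std::vector<std::int8_t> dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::int32_t sum = std::int32_t{src0[i]} + std::int32_t{src1[i]};
        const std::int32_t relu = std::max<std::int32_t>(0, sum);  // ReLU clamp at zero
        dst[i] = saturate_to_int8(relu);                           // saturating down-cast
    }
    return dst;
}
```

A model like this can serve as a reference when checking device output for the integer path. The float -> half path would additionally need a round-to-nearest-even half conversion, which is not sketched here.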
Assembly Syntax
PTO-AS form: see PTO-AS Specification.
Synchronous form:
%dst = taddreluconv %src0, %src1 : !pto.tile<...>
AS Level 1 (SSA)
%dst = pto.taddreluconv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
AS Level 2 (DPS)
pto.taddreluconv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic
Declared in include/pto/common/pto_instr.hpp:
template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
PTO_INST RecordEvent TADDRELUCONV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
Constraints
- Source type pair: src0 and src1 must have the same element type (ST).
- Supported source -> destination type pairs:
  - float -> half
  - half -> int8_t
  - int16_t -> int8_t
- Layout: All tiles must be row-major (TileData::isRowMajor).
- Location: All tiles must live in TileType::Vec.
- Valid region: validRow > 0 and validCol > 0; the valid shapes of src0 and src1 must match the valid shape of dst.
- Implementation notes (A2A3): Uses the single vaddreluconv_* hardware intrinsic (e.g. vaddreluconv_f322f16, vaddreluconv_f162s8, vaddreluconv_s162s8) for the fused add + relu + convert.
- Implementation notes (A5): No single vaddreluconv_* intrinsic; the fusion is expressed in the VF register-compute model: vlds(src0); vlds(src1); -> add -> maxs(0) -> vcvt -> vsts. For int16_t -> int8_t, A5 has no s16->s8 vcvt, so the (already non-negative) sum is clamped to [0, 127] and narrowed via s16->u8 vcvt; for values in [0, 127] the uint8_t and int8_t byte patterns are equal.
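The A5 note on int16_t -> int8_t can be checked on the host with a small scalar model. This is only a sketch under stated assumptions: a5_style_s16_to_s8, saturating_ref, and models_agree_on_grid are hypothetical names, the real path operates on vector registers rather than single elements, and the sum is assumed not to wrap at 16 bits.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Hypothetical one-element model of the A5 sequence:
// add -> max(0) -> clamp to [0, 127] -> narrow s16 to u8 -> reuse the byte as int8_t.
inline std::int8_t a5_style_s16_to_s8(std::int16_t a, std::int16_t b) {
    const std::int32_t sum = std::int32_t{a} + std::int32_t{b};  // assumption: no wraparound
    const std::int32_t relu = std::max<std::int32_t>(0, sum);
    const std::int32_t clamped = std::min<std::int32_t>(relu, 127);
    const std::uint8_t as_u8 = static_cast<std::uint8_t>(clamped);
    std::int8_t as_s8;
    std::memcpy(&as_s8, &as_u8, sizeof(as_s8));  // reinterpret the byte, no value conversion
    return as_s8;
}

// Hypothetical direct reference: saturating convert of max(0, a + b) to int8_t.
inline std::int8_t saturating_ref(std::int16_t a, std::int16_t b) {
    const std::int32_t sum = std::int32_t{a} + std::int32_t{b};
    return static_cast<std::int8_t>(std::clamp<std::int32_t>(sum, 0, 127));
}

// Compares both models over a coarse grid covering the full int16_t range.
inline bool models_agree_on_grid() {
    for (int a = -32768; a <= 32767; a += 257) {
        for (int b = -32768; b <= 32767; b += 257) {
            const auto x = static_cast<std::int16_t>(a);
            const auto y = static_cast<std::int16_t>(b);
            if (a5_style_s16_to_s8(x, y) != saturating_ref(x, y)) {
                return false;
            }
        }
    }
    return true;
}
```

Because every clamped value lies in [0, 127], the top bit of the byte is always zero, so reading it as uint8_t or int8_t gives the same number. That is why the two models agree in this sketch.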
Examples
Auto
#include <pto/pto-inst.hpp>
using namespace pto;
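void example_auto() {
    // Source tiles hold float; the destination tile holds half (float -> half pair).
    using SrcTileT = Tile<TileType::Vec, float, 16, 16>;
    using DstTileT = Tile<TileType::Vec, half, 16, 16>;
    SrcTileT src0, src1;
    DstTileT dst;
    // Auto mode: placement and scheduling are left to the compiler/runtime.
    TADDRELUCONV(dst, src0, src1);
}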
Manual
#include <pto/pto-inst.hpp>
using namespace pto;
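void example_manual() {
    // Source tiles hold float; the destination tile holds half (float -> half pair).
    using SrcTileT = Tile<TileType::Vec, float, 16, 16>;
    using DstTileT = Tile<TileType::Vec, half, 16, 16>;
    SrcTileT src0, src1;
    DstTileT dst;
    // Manual mode: bind each tile explicitly before issuing the instruction.
    TASSIGN(src0, 0x1000);
    TASSIGN(src1, 0x2000);
    TASSIGN(dst, 0x3000);
    TADDRELUCONV(dst, src0, src1);
}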
ASM Form Examples
Auto Mode
# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.taddreluconv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
Manual Mode
# Manual mode: resources must be bound explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.taddreluconv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
PTO Assembly Form
%dst = taddreluconv %src0, %src1 : !pto.tile<...>
# AS Level 2 (DPS)
pto.taddreluconv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)