TCONCAT
Tile Operation Diagram
Basic Form (3 arguments)
Indexed Form (5-6 arguments)
Introduction
Concatenate two source tiles (src0 and src1) horizontally into a destination tile (dst) along the column dimension. Each row of dst contains the concatenation of corresponding rows from src0 and src1.
TCONCAT is used for:
- Concatenating two tiles along the column axis (horizontal concatenation)
- Joining tiles in attention and transformer architectures (e.g., concatenating KV cache entries)
- Combining partial results from split operations
Math Interpretation
For each row i in the valid region:
dsti,j={src0i,jif 0≤j<validCols0src1i,j−validCols0if validCols0≤j<validCols0+validCols1\mathrm{dst}_{i, j} = \begin{cases} \mathrm{src0}_{i, j} & \text{if } 0 \le j < \mathrm{validCols0} \\ \mathrm{src1}_{i, j - \mathrm{validCols0}} & \text{if } \mathrm{validCols0} \le j < \mathrm{validCols0} + \mathrm{validCols1} \end{cases}
Where validCols0 = src0.GetValidCol() and validCols1 = src1.GetValidCol().
Assembly Syntax
AS Level 1 (SSA)
%dst = pto.tconcat %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
AS Level 2 (DPS)
pto.tconcat ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic
Declared in include/pto/npu/a5/TConcat.hpp:
template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST void TCONCAT(TileDst &dst, TileSrc0 &src0, TileSrc1 &src1);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename TileSrc0Idx, typename TileSrc1Idx>
PTO_INST void TCONCAT(TileDst &dst, TileSrc0 &src0, TileSrc1 &src1, TileSrc0Idx &src0Idx, TileSrc1Idx &src1Idx);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename TileDstIdx, typename TileSrc0Idx, typename TileSrc1Idx>
PTO_INST void TCONCAT(TileDst &dst, TileSrc0 &src0, TileSrc1 &src1, TileDstIdx &dstIdx, TileSrc0Idx &src0Idx, TileSrc1Idx &src1Idx);
Constraints
General constraints / checks
TCONCAThas three overload variants:- basic form:
TCONCAT(dst, src0, src1)- concatenates full valid regions - indexed form (5 args):
TCONCAT(dst, src0, src1, src0Idx, src1Idx)- uses per-row index tiles to specify dynamic column counts - indexed form (6 args):
TCONCAT(dst, src0, src1, dstIdx, src0Idx, src1Idx)- also outputs the concatenated column count per row
- basic form:
- All tiles must have
TileType::Vec(vector tiles) - All tiles must use row-major layout (
isRowMajor == true)
Shape constraints
- Basic form:
dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow()dst.GetValidCol() == src0.GetValidCol() + src1.GetValidCol()
- Indexed form:
- Same row count constraints as basic form
- Column counts are determined dynamically from index tiles
dstIdx.GetValidRow() == 1for 6-argument form
Data type constraints
- Supported element types:
int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,half,bfloat16_t,float - Source and destination tiles must have identical element type
- Index tiles must use integer types (
int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t)
A5 implementation checks
- All tiles must be
TileType::Vec - All tiles must be row-major layout
validRowsmust not exceed the physical tile rows for any operand- Index tiles (if provided) must satisfy type compatibility checks
Examples
Auto Mode
#include <pto/pto-inst.hpp>
using namespace pto;
void example_auto() {
using TileT = Tile<TileType::Vec, float, 16, 32>;
TileT src0(16, 16);
TileT src1(16, 16);
TileT dst(16, 32);
TCONCAT(dst, src0, src1);
}
Manual Mode
#include <pto/pto-inst.hpp>
using namespace pto;
void example_manual() {
using TileT = Tile<TileType::Vec, half, 16, 64, BLayout::RowMajor, 16, 64>;
TileT src0, src1, dst;
TASSIGN(src0, 0x1000);
TASSIGN(src1, 0x2000);
TASSIGN(dst, 0x3000);
src0.SetValidRegion(16, 32);
src1.SetValidRegion(16, 32);
TCONCAT(dst, src0, src1);
}
Indexed Form Example
#include <pto/pto-inst.hpp>
using namespace pto;
void example_indexed() {
using TileT = Tile<TileType::Vec, float, 16, 64>;
using IdxTileT = Tile<TileType::Vec, int32_t, 16, 1>;
TileT src0(16, 32);
TileT src1(16, 32);
TileT dst(16, 64);
IdxTileT src0Idx, src1Idx;
TCONCAT(dst, src0, src1, src0Idx, src1Idx);
}
ASM Form Examples
Auto Mode
# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.tconcat %src0, %src1 : (!pto.tile<16x32xf32>, !pto.tile<16x32xf32>) -> !pto.tile<16x64xf32>
Manual Mode
# Manual mode: resources must be bound explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %src0, @tile(0x1000)
# pto.tassign %src1, @tile(0x2000)
# pto.tassign %dst, @tile(0x3000)
%dst = pto.tconcat %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>