Using Ascend C Operator Debugging APIs in a CATLASS Sample Project
Ascend C operator debugging APIs are the debugging capabilities of Ascend C. They can be used to print internal kernel information (printf) and view tensor content (DumpTensor).
Examples
The following uses 00_basic_matmul as an example to demonstrate the test process based on Ascend C operator debugging APIs.
Inserting Debugging Code
Add the debugging API call at the desired level. For example, add the following code to the kernel function in include/catlass/gemm/kernel/basic_matmul.hpp:
// include/catlass/gemm/kernel/basic_matmul.hpp
template <>
CATLASS_DEVICE
void operator()<AscendC::AIC>(Params const ¶ms) {
BlockScheduler matmulBlockScheduler(params.problemShape, MakeCoord(L1TileShape::M, L1TileShape::N));
uint32_t coreLoops = matmulBlockScheduler.GetCoreLoops();
+ AscendC::printf("CoreLoops is %d\n", coreLoops);
Arch::Resource<ArchTag> resource;
BlockMmad blockMmad(resource);
// Represent the full gm
AscendC::GlobalTensor<ElementA> gmA;
gmA.SetGlobalBuffer((__gm__ ElementA *)params.ptrA);
AscendC::GlobalTensor<ElementB> gmB;
gmB.SetGlobalBuffer((__gm__ ElementB *)params.ptrB);
AscendC::GlobalTensor<ElementC> gmC;
gmC.SetGlobalBuffer((__gm__ ElementC *)params.ptrC);
+ AscendC::DumpTensor(gmA, coreLoops, 16);
for (uint32_t loopIdx = AscendC::GetBlockIdx(); loopIdx < coreLoops; loopIdx += AscendC::GetBlockNum()) {
// Compute block location
GemmCoord blockCoord = matmulBlockScheduler.GetBlockCoord(loopIdx);
GemmCoord actualBlockShape = matmulBlockScheduler.GetActualBlockShape(blockCoord);
// Compute initial location in logical coordinates
MatrixCoord offsetA{blockCoord.m() * L1TileShape::M, blockCoord.k() * L1TileShape::K};
MatrixCoord offsetB{blockCoord.k() * L1TileShape::K, blockCoord.n() * L1TileShape::N};
MatrixCoord offsetC{blockCoord.m() * L1TileShape::M, blockCoord.n() * L1TileShape::N};
int64_t gmOffsetA = params.layoutA.GetOffset(offsetA);
int64_t gmOffsetB = params.layoutB.GetOffset(offsetB);
int64_t gmOffsetC = params.layoutC.GetOffset(offsetC);
// Compute block-scoped matrix multiply-add
blockMmad(gmA[gmOffsetA], params.layoutA,
gmB[gmOffsetB], params.layoutB,
gmC[gmOffsetC], params.layoutC,
actualBlockShape);
}
}
Build and Execution
- Build the operator sample by referring to Quick Start. In the current version, no additional build option is required. If the debugging API is called in the code, the compiler is automatically enabled.
bash scripts/build.sh 00_basic_matmul
- Switch to the
output/bindirectory where the executable file is compiled and run the operator sample program.
cd output/bin
# Executable file name | Matrix M-axis | N-axis | K-axis | Device ID (optional)
./00_basic_matmul 256 512 1024 0
- ⚠ Precautions
DumpTensordoes not support printing values onL0A,L0B, andFixPipe. In particular, onAscend 950PR/Ascend 950DT, values onL1cannot be printed.
Output Example (For Reference Only. The Actual Output May Vary Depending on the Hardware and Operator Implementation.)
./00_basic_matmul 256 512 1024 0
opType=device_gemm, DumpHead: AIC-0, CoreType=AIC, block dim=24, total_block_num=24, block_remain_len=1048408, block_initial_space=1048576, rsv=0, magic=5aa5bccd
CoreLoops is 4
DumpTensor: desc=4, addr=c0013000, data_type=float16, position=GM
[3.402344, -1.056641, 2.830078, 2.984375, 4.117188, -3.025391, -1.647461, 2.681641, -2.222656, 0.539551, -0.226074, 1.289062, -1.352539, 0.134033, 4.523438, 4.160156]
... # Each cube core outputs information once.
Compare success.