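Using msDebug in the CATLASS Sample Project
msDebug is a tool for debugging operator programs that run on the NPU. It gives operator developers a way to debug operators on Ascend devices, including reading device memory and registers and pausing and resuming program execution.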
- ⚠️ Note: this feature depends on the community edition of the CANN package, version 8.2.RC1.alpha003.
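Usage Example
The following uses 00_basic_matmul as an example to show how to debug with msDebug.
Enable the Driver's Debug Feature
Following the msDebug tool overview, install the driver in debug mode, or, with a driver installed in full mode, run echo 1 > /proc/debug_switch to open the debug channel.
To avoid security issues, do not enable the debug channel in a production environment!
- If the following error appears, the driver version is too old and the driver must be updated.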
msdebug failed to initialize. please install HDK.
[ERROR] error code: 0x20102
terminate called after throwing an instance of 'MSDEBUG_ERROR_CODE'
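The step of opening the debug channel can be wrapped in a small guard so that a missing switch file or missing write permission is reported clearly instead of failing silently. The sketch below is only an illustration: the helper name enable_debug_channel is hypothetical, the default path is the /proc/debug_switch file named above, and it assumes that writing 1 is all that is required on a driver installed in full mode.

```shell
# Hypothetical helper: open the driver debug channel if the switch file is usable.
# The optional argument overrides the switch path (default taken from the text above).
enable_debug_channel() {
    switch_file="${1:-/proc/debug_switch}"
    if [ ! -e "$switch_file" ]; then
        echo "debug switch not found: $switch_file" >&2
        return 1
    fi
    if [ ! -w "$switch_file" ]; then
        echo "no write permission for $switch_file" >&2
        return 1
    fi
    # Same effect as: echo 1 > /proc/debug_switch
    echo 1 > "$switch_file"
}

# Usage on a development machine only, never in production:
#   enable_debug_channel
```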
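Build and Run
- Following the Quick Start guide, turn on the tool's build switches --debug --msdebug to build the operator sample with debug and msdebug enabled.
bash scripts/build.sh --debug --msdebug 00_basic_matmul
  - --debug controls the debug switch for both host-side and device-side code; --msdebug controls the debug switch for device-side code.
  - If only --debug is added, only host-side debugging is enabled, and only host-side code can be debugged with gdb/lldb.
- Switch to output/bin, the build directory of the executable, and run the operator sample with msdebug.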
cd output/bin
# executable | matrix m axis | n axis | k axis | Device ID (optional)
msdebug ./00_basic_matmul 256 512 1024 0
msdebug ./00_basic_matmul 256 512 1024 0
msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
(msdebug) target create "./00_basic_matmul"
Current executable set to '/home/catlass/output/bin/00_basic_matmul' (aarch64).
(msdebug) settings set -- target.run-args "256" "512" "1024" "0"
(msdebug)
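Command-Line Debugging
Setting Breakpoints and Running the Program
Set two breakpoints with b basic_matmul.cpp:81 and b basic_matmul.cpp:128, then list the existing breakpoints with breakpoint list.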
(msdebug) b basic_matmul.cpp:81
Breakpoint 1: where = 00_basic_matmul`Run(Options const&) + 416 at basic_matmul.cpp:81:18, address = 0x000000000019e8dc
(msdebug) b basic_matmul.cpp:128
Breakpoint 2: where = 00_basic_matmul`Run(Options const&) + 2816 at basic_matmul.cpp:128:39, address = 0x000000000019f23c
(msdebug) breakpoint list
Current breakpoints:
1: file = 'basic_matmul.cpp', line = 81, exact_match = 0, locations = 1
1.1: where = 00_basic_matmul`Run(Options const&) + 416 at basic_matmul.cpp:81:18, address = 00_basic_matmul[0x000000000019e8dc], unresolved, hit count = 0
2: file = 'basic_matmul.cpp', line = 128, exact_match = 0, locations = 1
2.1: where = 00_basic_matmul`Run(Options const&) + 2816 at basic_matmul.cpp:138:39, address = 00_basic_matmul[0x000000000019f23c], unresolved, hit count = 0
(msdebug)
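Run r to start the program and run to the first breakpoint, then run c to continue to the next breakpoint. Note that a multi-core operator program is usually dispatched to several accelerator cores that run concurrently. When one core hits a breakpoint, it interrupts the other cores so that they stop immediately. As a result, the other cores are not guaranteed to stop at the same breakpoint, and the same breakpoint may be hit again by another core. Developers can disable or delete breakpoints to keep cores from hitting the same breakpoint over and over.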
(msdebug) r
Process 813993 launched: '/home/catlass/output/bin/00_basic_matmul' (aarch64)
Process 813993 stopped
* thread #1, name = '00_basic_matmul', stop reason = breakpoint 1.1
frame #0: 0x0000aaaaaac3e8dc 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:81:18
78 ACL_CHECK(aclrtCreateStream(&stream));
79
80 uint32_t m = options.problemShape.m();
-> 81 uint32_t n = options.problemShape.n();
82 uint32_t k = options.problemShape.k();
83
84 size_t lenA = static_cast<size_t>(m) * k;
(msdebug) c
Process 813993 resuming
Process 813993 stopped
* thread #1, name = '00_basic_matmul', stop reason = breakpoint 2.1
frame #0: 0x0000aaaaaac3f23c 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:138:39
135 using MatmulKernel = Gemm::Kernel::BasicMatmul<BlockMmad, BlockEpilogue, BlockScheduler>;
136
137 using MatmulAdapter = Gemm::Device::DeviceGemm<MatmulKernel>;
-> 138 MatmulKernel::Arguments arguments{options.problemShape, deviceA, deviceB, deviceC};
139 MatmulAdapter matmul_op;
140 matmul_op.CanImplement(arguments);
141 size_t sizeWorkspace = matmul_op.GetWorkspaceSize(arguments);
(msdebug) c
Process 813993 resuming
[Launch of Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel11BasicMatmulINS1_5Blo on Device 0]
Compare success.
Process 813993 exited with status = 0 (0x00000000)
(msdebug)
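Inspecting Variables and Memory
To inspect a scalar, use the p command to print the current value of the variable n directly. The breakpoint stops before line 81 executes, so n has not been assigned yet at this point.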
Process 814079 launched: '/home/catlass/output/bin/00_basic_matmul' (aarch64)
Process 814079 stopped
* thread #1, name = '00_basic_matmul', stop reason = breakpoint 1.1
frame #0: 0x0000aaaaaac3e8dc 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:81:18
78 ACL_CHECK(aclrtCreateStream(&stream));
79
80 uint32_t m = options.problemShape.m();
-> 81 uint32_t n = options.problemShape.n();
82 uint32_t k = options.problemShape.k();
83
84 size_t lenA = static_cast<size_t>(m) * k;
(msdebug) p n
(uint32_t) $0 = 0
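To inspect memory, first use the p command to view information about the buffer. The session below was captured on the 09_splitk_matmul sample.
The x -m UB -f float16[] 65536 -c 4 -s 4 command then prints the values stored in outputBuffer[0] (bufferAddr = 65536); a single command prints at most 1024 bytes.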
(msdebug) c
Process 814339 resuming
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = breakpoint 2.1
frame #0: 0x000000000000bf98 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:136:19
133
134 AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
135 AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
-> 136 Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
137 AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
138
139 bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
(msdebug) p outputBuffer
(AscendC::LocalTensor<__fp16>[2]) $2 = {
[0] = {
AscendC::BaseLocalTensor<__fp16> = { # memory and data type
address_ = (dataLen = 131072, bufferAddr = 65536, bufferHandle = "", logicPos = '\v') # start address and data length
}
shapeInfo_ = {
shapeDim = '\x88'
originalShapeDim = '\xf8'
shape = {}
originalShape = {}
dataFormat = ND
}
}
[1] = {
AscendC::BaseLocalTensor<__fp16> = {
address_ = (dataLen = 49152, bufferAddr = 147456, bufferHandle = "", logicPos = '\v')
}
shapeInfo_ = {
shapeDim = '\x88'
originalShapeDim = '\xf8'
shape = {}
originalShape = {}
dataFormat = ND
}
}
}
(msdebug) x -m UB -f float16[] 65536 -c 4 -s 4 # print 4 lines of 4 bytes each as fp16, starting at address 65536 in UB
0x00010000: {355.5 188.75}
0x00010004: {244.125 -364.75}
0x00010008: {-104.875 -156}
0x0001000c: {232 -100.75}
(msdebug) x -m UB -f float16[] 65536 -c 4 -s 8 # print 4 lines of 8 bytes each as fp16, starting at address 65536 in UB
0x00010000: {355.5 188.75 244.125 -364.75}
0x00010008: {-104.875 -156 232 -100.75}
0x00010010: {-47.4062 105.875 -322.5 -265.75}
0x00010018: {260 200.125 -139.25 -190.625}
(msdebug)
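The relationship between -c, -s and the addresses shown in the output above is plain arithmetic: each line covers -s bytes, a float16 element occupies 2 bytes, and the whole request is -c times -s bytes. The sketch below previews that layout before the command is issued. The helper name preview_mem_read is hypothetical and not part of msdebug; the 1024-byte limit is the one stated above.

```shell
# Hypothetical helper: preview the line layout of
#   x -m UB -f float16[] ADDR -c COUNT -s SIZE
# Arguments: start address, line count (-c), bytes per line (-s),
# element size in bytes (defaults to 2 for float16).
preview_mem_read() {
    addr=$1; count=$2; size=$3; elem=${4:-2}
    total=$((count * size))
    if [ "$total" -gt 1024 ]; then
        echo "request of $total bytes exceeds the 1024-byte limit" >&2
        return 1
    fi
    i=0
    while [ "$i" -lt "$count" ]; do
        # Line i starts at addr + i * size and holds size / elem elements.
        printf '0x%08x: %d elements\n' $((addr + i * size)) $((size / elem))
        i=$((i + 1))
    done
}

preview_mem_read 65536 4 4
# prints:
# 0x00010000: 2 elements
# 0x00010004: 2 elements
# 0x00010008: 2 elements
# 0x0001000c: 2 elements
```

The four addresses match the first x command in the session above, where each line shows two fp16 values.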
To step through the code line by line, run n to advance the program to the next line.
(msdebug) n
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = step over
frame #0: 0x000000000000bfe4 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:137:73
134 AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
135 AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
136 Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
-> 137 AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
138
139 bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
140 }
(msdebug) n
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = step over
frame #0: 0x000000000000c000 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:139:28
136 Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
137 AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
138
-> 139 bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
140 }
141
142 AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(inputEventIds[0]);
(msdebug) n
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = step over
frame #0: 0x000000000000c014 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:96:68
93 AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(accumulatorEventIds[1]);
94
95 uint32_t loops = (elementCount + tileLen - 1) / tileLen;
-> 96 for (uint32_t loopIdx = aivId; loopIdx < loops; loopIdx += aivNum) {
97 uint32_t actualTileLen = tileLen;
98 if (loopIdx == loops - 1) {
99 actualTileLen = elementCount - loopIdx * tileLen;
(msdebug)
The var command shows all variables in the current stack frame.
(msdebug) var
(Catlass::Gemm::Kernel::ReduceAdd<Catlass::Arch::AtlasA2, float, __fp16, 8192> *__stack__) this = 0x00000000001cf838
(const AscendC::GlobalTensor<__fp16> &__stack__) dst = 0x00000000001cf930: {
AscendC::BaseGlobalTensor<__fp16> = {
address_ = 0x000012c0c0094000
oriAddress_ = 0x000012c0c0094000
}
bufferSize_ = 1898896
shapeInfo_ = {
shapeDim = 'h'
originalShapeDim = '\xf9'
shape = {}
originalShape = {}
dataFormat = ND
}
cacheMode_ = CACHE_MODE_NORMAL
}
(const AscendC::GlobalTensor<float> &__stack__) src = 0x00000000001cf908: {
AscendC::BaseGlobalTensor<float> = {
address_ = 0x000012c041400000
oriAddress_ = 0x000012c041400000
}
bufferSize_ = 1898904
shapeInfo_ = {
shapeDim = 'H'
originalShapeDim = '\xf9'
shape = {}
originalShape = {}
dataFormat = ND
}
cacheMode_ = CACHE_MODE_NORMAL
}
(uint64_t) elementCount = 131072
(uint32_t) splitkFactor = 2
(const uint32_t) ELE_PER_VECTOR_BLOCK = 64
(uint32_t) aivNum = 48
(uint32_t) aivId = 26
(uint64_t) taskPerAiv = 2752
(uint32_t) tileLen = 2752
(uint32_t) loops = 48
(uint32_t) loopIdx = 26
(msdebug)
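Exiting the Debugger
When debugging is finished, quit msdebug with the q command. If you force an exit with Ctrl+C or similar, the msdebug process does not terminate and keeps running in the background. In that case, find its pid with ps -ef | grep msdebug and kill it with kill -9 <pid>. Multiple msdebug processes cannot be run for debugging at the same time.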
(msdebug) q
Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y
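The cleanup after a forced exit can be sketched as a small filter over ps -ef output. Matching on the command column avoids picking up the grep process itself, which a plain ps -ef | grep msdebug also lists. The helper name stale_msdebug_pids is hypothetical, and the sketch assumes the usual ps -ef layout in which the PID is the second column and the command starts in the eighth.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PIDs of
# processes whose executable is msdebug (bare name or full path).
stale_msdebug_pids() {
    awk '$8 ~ /(^|\/)msdebug$/ { print $2 }'
}

# Usage: list leftover msdebug processes, then kill each one by pid.
#   ps -ef | stale_msdebug_pids
#   kill -9 <pid>
```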
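Common Commands
| Command | Abbreviation | Purpose | Example |
|---|---|---|---|
| breakpoint filename:lineNo | b | Add a breakpoint | b add_custom.cpp:85<br>b my_function |
| run | r | Run (or rerun) the program | r |
| continue | c | Continue running | c |
| print | p | Print a variable | p zLocal |
| frame variable | var | Print all variables in the current frame | var |
| memory read | x | Read memory<br>-m: memory region, one of GM/UB/L0A/L0B/L0C<br>-f: format used to interpret the bytes<br>-s: number of bytes printed per line<br>-c: number of lines printed | x -m GM -f float16[] 1000 -c 2 -s 128 |
| register read | re r | Read register values<br>-a: read all registers<br>$REG_NAME: read the register with the given name | register read -a<br>re r $PC |
| thread step-over | next<br>n | Move to the next executable source line within the same call stack | n |
| ascend info devices | / | Query device information | ascend info devices |
| ascend info cores | / | Query information about the aicores the operator runs on | ascend info cores |
| ascend info tasks | / | Query information about the tasks the operator runs in | ascend info tasks |
| ascend info stream | / | Query information about the stream the operator runs in | ascend info stream |
| ascend info blocks | / | Query information about the blocks the operator runs in<br>Optional: -d/--details shows the code at the current stop location of every block | ascend info blocks |
| ascend aic core | / | Switch the debugger's focus to a cube core | ascend aic 1 |
| ascend aiv core | / | Switch the debugger's focus to a vector core | ascend aiv 5 |
| target modules add kernel.o | image add kernel.o | Import the operator's debug information when the operator is launched by the PyTorch framework (note: if this command is run after the run command, image load must also be run for the debug information to take effect) | image add AddCustom_xxx.o |
| target modules load -f kernel.o -s address | image load -f kernel.o -s address | Make the imported debug information take effect after the program has started | image load -f AddCustom_xxx.o -s 0 |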
Appendix
Data Formats Supported by msdebug
Valid values are:
"default"
'B' or "boolean"
'b' or "binary"
'y' or "bytes"
'Y' or "bytes with ASCII"
'c' or "character"
'C' or "printable character"
'F' or "complex float"
's' or "c-string"
'd' or "decimal"
'E' or "enumeration"
'x' or "hex"
'X' or "uppercase hex"
'f' or "float"
"brain float16"
'o' or "octal"
'O' or "OSType"
'U' or "unicode16"
"unicode32"
'u' or "unsigned decimal"
'p' or "pointer"
"char[]"
"int8_t[]"
"uint8_t[]"
"int16_t[]"
"uint16_t[]"
"int32_t[]"
"uint32_t[]"
"int64_t[]"
"uint64_t[]"
"bfloat16[]"
"float16[]"
"float32[]"
"float64[]"
"uint128_t[]"
'I' or "complex integer"
'a' or "character array"
'A' or "address"
"hex float"
'i' or "instruction"
'v' or "void"
'u' or "unicode8"
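Specifying the NPU Card Used for Debugging
Set the environment variable ASCEND_RT_VISIBLE_DEVICES to the ID of the NPU card to use, for example:
# Make the current process use only the Device whose Device ID is 2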
export ASCEND_RT_VISIBLE_DEVICES=2
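As an alternative to export, the variable can be scoped to a single command so that other programs started from the same shell are unaffected. The sketch below uses sh -c as a stand-in for the real program so that it runs on any machine. The msdebug invocation in the comment is an assumed usage that leaves the optional Device ID argument at its default.

```shell
# Scope the variable to one command instead of exporting it for the whole shell.
# `sh -c` stands in for the real program here; in practice the command would be
# something like: ASCEND_RT_VISIBLE_DEVICES=2 msdebug ./00_basic_matmul 256 512 1024
ASCEND_RT_VISIBLE_DEVICES=2 sh -c 'echo "visible devices: $ASCEND_RT_VISIBLE_DEVICES"'
# prints: visible devices: 2
```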