Host 诊断调优工具
简介
当前 AI 训练、推理业务场景中,Host侧(CPU)的任务下发(如算子调度、内存分配)与Device侧(NPU)的任务执行是异步进行的。当Host侧任务下发耗时超过Device侧任务执行耗时时,Device会因等待新任务而处于空闲状态,形成性能瓶颈,即HostBound问题。
针对以上问题,我们设计了Host侧的诊断调优工具,提供简单、易用的绑核能力,通过将进程和线程分别绑定在不同CPU上执行以减少互相之间的干扰与资源竞争
使用前准备
- 获取仓库中提供的参数解析、绑核脚本:entrance.py,cpu_binder.py
功能介绍
自定义绑核功能
功能说明
提供自定义绑核能力,根据用户输入json中的配置方案完成进程/线程级绑核;缺省输入时以经验最优方案进行绑核:
- 为每张卡的关键线程 acl_thread/release_thread 单独分配一个 CPU 核,dev[i]_sq_task 单独分配一个 CPU 核(宿主机可绑,未查询到时跳过),算子相关中断 sq_send_trigger_irq 和 cq_update_irq 各单独分配一个 CPU 核(需要拥有/proc目录写权限,权限不足时跳过),其余推理线程共同分配到 6 个 CPU 核上,即每张 NPU 卡绑定到 11 个 CPU 核(分配 CPU 时考虑 NPU 亲和及跨 NUMA 内存访问时延)
命令格式
python3 entrance.py bind [-l] [-c <config_path>]
参数说明
| 参数 | 可选/必选 | 说明 |
|---|---|---|
| -l/--log-level | 可选 | 指定打印时的日志等级,类型为int,可选值为[0, 1, 2, 3],默认值为1 |
| -c/--config | 可选 | 指定自定义绑核所需的json配置文件,缺省时按默认策略进行绑核,类型为str,默认值为None |
json配置文件参数说明
| 参数 | 可选/必选 | 说明 |
|---|---|---|
| custom_bind | 必选 | json键值,类型为str,对应value需要为List[Dict],将每一组绑核对象存放在List中 |
| process_name | 可选 | 需要绑定的进程或线程名称,类型为str,由is_thread参数决定是进程或线程,指定非NPU卡对应进程或线程时进程数或线程数应等于cpu_list参数长度 默认值为None |
| pid | 可选 | 需要绑定的PID号,可以是一个或多个PID,该参数长度应等于cpu_list参数长度,类型为List[int] 默认值为[] |
| cpu_list | 必选 | 绑定CPU列表,可以是一个或多个CPU区间,类型为List[str] 默认值为[] |
| mem_bind | 可选 | 选择是否需要在绑定CPU后将内存迁移至对应NUMA节点,类型为bool,取值为: • true:表示绑核后将使用内存迁移至对应NUMA节点 • false:表示绑核后不需要将使用内存迁移至对应NUMA节点 默认值为false |
| is_thread | 可选 | 选择绑定的ID是进程ID还是线程ID,类型为bool,取值为: • true:表示绑定线程ID • false:表示绑定进程ID 默认值为false |
| is_irq | 可选 | 选择绑定的是否为硬件中断,类型为bool,取值为: • true:表示绑定的是硬件中断 • false:表示绑定的不是硬件中断 默认值为false |
| irq_id | 可选 | 需要绑定的硬件中断的中断号,类型为List[int] |
| bind_sub_process | 可选 | 选择绑核时是否需要绑定进程ID或线程ID的所有子线程,类型为bool,取值为: • true:表示绑核时需要同时绑定进程ID或线程ID的所有子线程 • false:表示绑核时不需要绑定进程ID或线程ID的所有子线程 默认值为false |
json配置文件示例
{
"custom_bind": [
{
"process_name": "VLLM::Worker_TP",
"cpu_list": ["4-10","16-22","52-58","64-70"],
"bind_sub_process": true
},
{
"process_name": "acl_thread",
"cpu_list": ["11","23","59","71"],
"mem_bind": true,
"is_thread": true
},
{
"process_name": "release_thread",
"cpu_list": ["12,13","24,25","60,61","72,73"],
"is_thread": true
},
{
"process_name": "VLLM::EngineCore",
"cpu_list": ["44"],
"bind_sub_process": false,
"mem_bind": true
},
{
"pid": [110351, 110352, 110353, 110354],
"cpu_list": ["32-35"]
},
{
"irq_id": [3008, 3009, 3264, 3265, 2496, 2497, 2752, 2753],
"cpu_list": ["15", "16", "27", "28", "64", "65", "76", "77"],
"is_irq": true
}
]
}
使用示例
-
使用示例1
python3 entrance.py bind -c ./bind_design.json按照json配置文件中的方案进行绑核,假设当前仅NPU4,5,6,7在运行,则:
- 以进程名"VLLM::Worker_TP"匹配NPU4,5,6,7中的该进程,并分别绑定到CPU "4-10","16-22","52-58","64-70",绑定时会同时绑定该进程下的所有子线程
- 以线程名"acl_thread"匹配NPU4,5,6,7中的该线程,并分别绑定到CPU "11","23","59","71"
- 以线程名"release_thread"匹配NPU4,5,6,7中的该线程,并分别绑定到CPU "12,13","24,25","60,61","72,73"
- 以进程名"VLLM::EngineCore"匹配环境中的该进程,并绑定到CPU "44"
- 分别绑定pid "110351","110352","110353","110354"到CPU "32","33","34","35"
- 分别绑定中断 "3008","3009","3264","3265","2496","2497","2752","2753"到CPU "15", "16", "27", "28", "64", "65", "76", "77"
-
使用示例2
python3 entrance.py bind- 结合NPU亲和性以及可访问CPU数量,为每张NPU卡分配对应的CPU区间(每张卡分配的CPU区间都会在一个NUMA节点内)
- 绑定每张NPU卡对应的sq_send_trigger_irq和cq_update_irq(算子下发硬件中断)到其CPU区间的前两个CPU核,需要有/proc/irq/<irq_id>/smp_affinity文件的写权限,否则会执行失败
- 绑定每张NPU卡对应的dev[i]_sq_task(NPU驱动进程)到其CPU区间的第三个CPU核,需要在宿主机才能查询到此进程,否则会执行失败
- 绑定每张NPU卡对应的主进程和所有子线程到其CPU区间的第四到第九个CPU核
- 绑定每张NPU卡对应的acl_thread(算子下发线程)到其CPU区间的第十个CPU核
- 绑定每张NPU卡对应的release_thread(资源释放线程)到其CPU区间的第十一个CPU核
输出示例
以使用示例1为例,会输出以下信息展示将绑定结果:
[2026-03-13 10:34:27,406] [INFO]:Start binding core round 1: {"process_name": "VLLM::Worker_TP", "cpu_list": ["4-10","16-22","52-58","64-70"], "bind_sub_process": true}
[2026-03-13 10:34:28,490] [INFO]:Bind the target (pid=1603713) to CPU4,5,6,7,8,9,10
[2026-03-13 10:34:28,498] [INFO]:Bind the target (pid=1603714) to CPU16,17,18,19,20,21,22
[2026-03-13 10:34:28,506] [INFO]:Bind the target (pid=1603715) to CPU52,53,54,55,56,57,58
[2026-03-13 10:34:28,514] [INFO]:Bind the target (pid=1603716) to CPU64,65,66,67,68,69,70
[2026-03-13 10:34:28,522] [INFO]:===== Round 1 of core binding has ended =====
[2026-03-13 10:34:28,533] [INFO]:Start binding core round 2: {"process_name": "acl_thread", "cpu_list": ["11","23","59","71"], "mem_bind": true, "is_thread": true}
[2026-03-13 10:34:29,592] [INFO]:Bind the target (pid=1648512) to CPU11
[2026-03-13 10:34:29,603] [INFO]:Bind the target (pid=1648576) to CPU23
[2026-03-13 10:34:29,613] [INFO]:Bind the target (pid=1648615) to CPU59
[2026-03-13 10:34:29,623] [INFO]:Bind the target (pid=1648667) to CPU71
[2026-03-13 10:34:29,633] [INFO]:===== Round 2 of core binding has ended =====
[2026-03-13 10:34:29,633] [INFO]:Start binding core round 3: {"process_name": "release_thread", "cpu_list": ["12","24","60","72"], "is_thread": true}
[2026-03-13 10:34:30,644] [INFO]:Bind the target (pid=1648513) to CPU12
[2026-03-13 10:34:30,729] [INFO]:Bind the target (pid=1648577) to CPU24
[2026-03-13 10:34:30,736] [INFO]:Bind the target (pid=1648616) to CPU60
[2026-03-13 10:34:30,743] [INFO]:Bind the target (pid=1648668) to CPU72
[2026-03-13 10:34:30,750] [INFO]:===== Round 3 of core binding has ended =====
[2026-03-13 10:34:30,757] [INFO]:Start binding core round 4: {"process_name": "VLLM::EngineCore", "cpu_list": ["44"], "bind_sub_process": false, "mem_bind": true}
[2026-03-13 10:34:31,490] [INFO]:Bind the target (pid=1603113) to CPU44
[2026-03-13 10:34:31,522] [INFO]:===== Round 4 of core binding has ended =====
[2026-03-13 10:34:31,348] [INFO]:Start binding core round 5: {"pid": [110351, 110352, 110353, 110354], "cpu_list": ["32-35"]}
[2026-03-13 10:34:32,356] [INFO]:Bind the target (pid=110351) to CPU32,33,34,35
[2026-03-13 10:34:32,363] [INFO]:Bind the target (pid=110352) to CPU32,33,34,35
[2026-03-13 10:34:32,370] [INFO]:Bind the target (pid=110353) to CPU32,33,34,35
[2026-03-13 10:34:32,378] [INFO]:Bind the target (pid=110354) to CPU32,33,34,35
[2026-03-13 10:34:32,385] [INFO]:===== Round 5 of core binding has ended =====
[2026-03-13 10:34:32,406] [INFO]:Start binding core round 6: {"irq_id": [3008, 3009, 3264, 3265, 2496, 2497, 2752, 2753], "cpu_list": ["15", "16", "27", "28", "64", "65", "76", "77"], "is_irq": true}
[2026-03-13 10:34:33,540] [INFO]:Bind the interrupt of IRQ-3008 to CPU15
[2026-03-13 10:34:33,540] [INFO]:Bind the interrupt of IRQ-3009 to CPU16
[2026-03-13 10:34:33,540] [INFO]:Bind the interrupt of IRQ-3264 to CPU27
[2026-03-13 10:34:33,540] [INFO]:Bind the interrupt of IRQ-3265 to CPU28
[2026-03-13 10:34:33,540] [INFO]:Bind the interrupt of IRQ-2496 to CPU64
[2026-03-13 10:34:33,541] [INFO]:Bind the interrupt of IRQ-2497 to CPU65
[2026-03-13 10:34:33,541] [INFO]:Bind the interrupt of IRQ-2752 to CPU76
[2026-03-13 10:34:33,541] [INFO]:Bind the interrupt of IRQ-2753 to CPU77
[2026-03-13 10:34:33,542] [INFO]:===== Round 6 of core binding has ended =====