| aclgraph: intercept OOM exceptions and convert them to an error code
Co-authored-by: xuchi<xuchicolson@163.com>
# message auto-generated for no-merge-commit merge:
!1024 merge A00256 into dev
aclgraph: intercept OOM exceptions and convert them to an error code
Created-by: martinXuc
Commit-by: xuchi
Merged-by: ascend-robot
Description: <!--
PR description template last updated: 20251225
-->
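# Background
Fix part of #588
Following the way atb-models catches OOM exceptions and converts them to fault codes, this change adds OOM exception interception on the aclgraph code path, so that OOM scenarios print the correct error code (MIE05E000006) and are easier to locate and diagnose.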
# Changes
1. **Added utils/decorators/exception_handler.py**: provides the @exception_handler class decorator, which automatically wraps the target methods, catches torch.OutOfMemoryError, logs the error code, and re-raises the exception as RuntimeError
2. **model_runner_exp.py**: applies the @exception_handler decorator to ModelRunnerExp to intercept OOM in forward() / compile() / load_weights()
3. **error_code.py**: adds ACL_GRAPH_OUT_OF_MEMORY = "MIE05E000006"; re-sorts the ErrorCode enum by code prefix to improve maintainability
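The wrap-catch-log-reraise pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `OutOfMemoryError` is a local stand-in for torch.OutOfMemoryError so the sketch runs without torch, and `DemoRunner`, the `method_names` parameter, and the logger name are hypothetical. Only the error code value and the shape of the `raise ... from e` line are taken from this description.

```python
import functools
import logging

logger = logging.getLogger("oom_sketch")  # hypothetical logger name

# Stand-in for torch.OutOfMemoryError (assumption: keeps the sketch torch-free).
class OutOfMemoryError(RuntimeError):
    pass

ACL_GRAPH_OUT_OF_MEMORY = "MIE05E000006"  # value taken from this PR


def exception_handler(method_names=("forward", "compile", "load_weights"),
                      error_code=ACL_GRAPH_OUT_OF_MEMORY,
                      error_msg="Device out of memory (OOM)"):
    """Class decorator: wrap the named methods and convert OOM to RuntimeError."""

    def wrap_method(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except OutOfMemoryError as e:
                # Log the error code, then re-raise with the original as the cause.
                logger.error("%s. Error_code: %s", error_msg, error_code)
                raise RuntimeError(f"{error_msg}. Error_code: {error_code}") from e
        return wrapper

    def decorate(cls):
        for name in method_names:
            func = cls.__dict__.get(name)  # only methods defined on this class
            if callable(func):
                setattr(cls, name, wrap_method(func))
        return cls

    return decorate


@exception_handler()
class DemoRunner:  # hypothetical class standing in for ModelRunnerExp
    def forward(self, x):
        raise OutOfMemoryError("NPU out of memory")

    def load_weights(self):
        return "ok"
```

Chaining with `from e` keeps the original OOM traceback attached as `__cause__`, which is why the test logs show both the torch.OutOfMemoryError and the RuntimeError carrying the error code.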
# Documentation changes
Not applicable.
# Interface changes
Not applicable.
# Test results
## Test case 01: HCCL OOM exception
### How to trigger
```shell
export HCCL_BUFFSIZE=200000
```
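As a rough sanity check on why this setting triggers an OOM, the arithmetic below compares the configured value with the allocation size reported in the failure log for this test case. It assumes HCCL_BUFFSIZE is expressed in MB (MiB), as the HCCL documentation describes. The split into two buffers plus 1 MiB is only an observation that the numbers happen to match, not something the log states.

```python
MIB = 1024 * 1024

hccl_buffsize_mib = 200000        # value exported above (assumed unit: MiB)
requested_bytes = 419431448576    # "Failed to allocate [size:...]" from the log

two_buffers = 2 * hccl_buffsize_mib * MIB
remainder = requested_bytes - two_buffers

print(two_buffers)                 # bytes for two buffers of HCCL_BUFFSIZE each
print(remainder == MIB)            # whether the leftover is exactly 1 MiB
print(requested_bytes / 1024**3)   # requested size in GiB
```

Under that assumption the request is roughly 390 GiB, far beyond the 61.28 GiB device capacity reported in test case 02.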
### Key exception output
```shell
File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 38, in wrapper
raise RuntimeError(f"{error_msg}. Error_code: {error_code}") from e
RuntimeError: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 26, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
...
torch.OutOfMemoryError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnDispatchFFNCombine.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[ERROR] 2026-05-24-15:45:50 (PID:297309, Device:14, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 297309] 2026-05-24-15:45:50.987.747 Memory_Allocation_Failure(EL0004): Failed to allocate memory requested by HCCL module.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
alloc memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
Failed to allocate [size:419431448576] bytes of NPU memory.
Nnopbase fails to invoke the HcclAllocComResourceByTiling function of the hccl module. ret = 24, comm = 0xfff01a18be90.
Check nnopbase::IndvHcclWrapper::GetInstance().HcclAllocComResourceByTiling(commHandle, stream, (op::internal::PtrCastTo<NnopbaseTilingData>(executor->args->tilingInfo.tilingData))->GetData(), &contextAddr) failed
Check NnopbaseGetHcomResource(executor, stream) failed
Check NnopbaseExecutorGetMc2Num(executor, stream, &argsAddr, &mc2Num) failed
Check NnopbaseExecutorPrepareParamsExt(executor, stream) failed
Check NnopbaseRunWithWorkspace(executor, stream, workspace, workspaceSize) failed
```
## Test case 02: PTA OOM exception
### How to trigger
```shell
vim /usr/local/lib/python3.11/site-packages/mindie_llm/runtime/model_runner/model_runner_exp.py
```
```python
        Returns:
            Logits tensor, or tuple of (logits, hidden_states) if speculative tokens enabled.
        """
        # Temporary: force an OOM for this test
        torch.zeros(1024 * 1024 * 1024 * 4, dtype=torch.float32, device=self.device)
        if self._kv_cache_info.check_diff(kv_caches):
            self._bind_kv_cache(kv_caches)
```
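The size of the temporary tensor can be checked with a little arithmetic. This is a plain-Python sketch that needs no torch; the free-memory figure is copied from the log in this test case.

```python
num_elements = 1024 * 1024 * 1024 * 4   # element count passed to torch.zeros
bytes_per_element = 4                   # torch.float32 is 4 bytes wide

requested_gib = num_elements * bytes_per_element / 1024**3
free_gib = 12.99                        # "12.99 GiB free" from the log

print(requested_gib)                    # requested size in GiB
print(requested_gib > free_gib)         # True means the allocation cannot fit
```

The result, 16 GiB, matches the "Tried to allocate 16.00 GiB" message and exceeds the free memory reported, so the allocation fails with torch.OutOfMemoryError.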
### Key exception log
```shell
[2026-05-24 16:41:28,672] [334601] [281462080598432] [llm] [ERROR] [plugin_manager.py-299] : Error encountered in generate_token (trace_ids=[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]). trigger recovery or terminate inference thread. Error: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 26, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/mindie_llm/runtime/model_runner/model_runner_exp.py", line 298, in forward
torch.zeros(1024 * 1024 * 1024 * 4, dtype=torch.float32, device=self.device)
torch.OutOfMemoryError: NPU out of memory. Tried to allocate 16.00 GiB (NPU 1; 61.28 GiB total capacity; 48.03 GiB already allocated; 48.03 GiB current active; 12.99 GiB free; 48.12 GiB reserved in total by PyTorch).If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
The above exception was the direct cause of the following exception:
...
RuntimeError: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
...
```
# CheckList
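> The PR submitter must self-check every item in the CheckList below and change [ ] to [x] for each item that passes or does not apply.
- [x] Code comments are complete
- [x] Error logs are recorded correctly
- [x] Return values are checked (do not use void to discard the return values of secure functions or in-house functions; consider the interface's exception scenarios; check return values when calling lower-level component interfaces)
- [x] Null pointers are checked
- [x] Any acquired resources are released correctly after use
- [x] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [x] Code is formatted with the clang-format tool according to [the format template provided in the code repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [x] Complies with the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)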
See merge request: Ascend/MindIE-LLM!1024 | 6 days ago |