File | Last commit | Last updated
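layerwise_disaggregated edge-cloud collaboration: refactor the chunk-splitting strategy and fix the bug where non-standard sequence lengths produce unequal policies

Co-authored-by: zxf_00617641<zhangxiaofeng44@huawei.com>

# message auto-generated for no-merge-commit merge:
!889 merge dev into dev

layerwise_disaggregated edge-cloud collaboration: refactor the chunk-splitting strategy and fix the bug where non-standard sequence lengths produce unequal policies

Created-by: zxf_boluochuishui
Commit-by: zxf_00617641
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background

> Describe why the changes in this PR are needed.\
> If applicable, link preceding PRs or other PRs for the same feature or requirement.\
> If this fixes a problem introduced by an earlier PR, link the PR that introduced it.\
> Note: Fixes #ISSUE ID closes the issue automatically. If the problem is only partly solved, do not use Fixes; use Fix part of #ISSUE ID instead.

Fixes [#518](https://gitcode.com/Ascend/MindIE-LLM/issues/518)

The edge-cloud collaboration feature has to split long sequences into chunks. Splitting constrains the ratio between edge and cloud chunks (for example, one cloud chunk equals three edge chunks) and also constrains the minimum unit on each side, and the two minimum units may differ. As a result, the chunk lengths produced on the edge and on the cloud may not correspond, which intermittently causes runtime errors.

# Changes

The core goal of this PR series is to refactor the chunking strategy for long-sequence prefill in edge-cloud collaborative inference so that chunking can be controlled more finely and flexibly. It mainly fixes the problem that the original even-split strategy could not satisfy the constraint that edge-side and cloud-side chunk lengths keep an integer-multiple relationship, and so ensures correct data splitting in a distributed heterogeneous computing environment. It also introduces a unified data structure that encapsulates the chunk policy, and refactors the request routers, the metadata management module, and the related unit tests. This addresses the scattered logic, tight coupling, and insufficient concurrency handling of the original code, and improves maintainability, scheduling efficiency, and robustness.

The RatioInfo data class is introduced to manage the chunk ratio and the minimum alignment unit in one place, and the core chunking algorithm split_long_seq_by_ratio is refactored so that it takes the constraints of both sides into account and ensures that each cloud-side chunk length is an integer multiple of the sum of the corresponding edge-side chunk lengths. At the architecture level, the ChunkPolicyData data structure is introduced: the edge-side policy, the cloud-side policy, and the mapping between them are computed and encapsulated once, early in request handling. This replaces the previous scattered and duplicated calculations and simplifies the data flow in the request routers (RequestRouterCloud and RequestRouterEdge). In addition, input metadata management is refactored from a queue-based approach to a mapping keyed by request_key, which isolates concurrent requests precisely. Finally, the related unit tests are thoroughly refactored and extended to fit the new interfaces and to verify complex long-sequence scenarios.

# Documentation changes

> Confirm whether documentation changes are involved. If so, include them in the PR and briefly describe them. If not, write "Not applicable".

# Interface changes

> Confirm whether cross-repository or customer-visible interface changes are involved. If so, describe the interfaces and the changes in detail and reflect them in the documentation. If not, write "Not applicable".

# Test results

> Describe the test scenarios, test methods, and test results.\
> Test case design should consider hardware, deployment mode, functionality, performance, accuracy, device memory, and similar dimensions.

Distributed edge-cloud collaboration, long and short sequences, qwen and deepseek.

# CheckList

> The PR submitter must self-check every item below. For each item that passes or does not apply, change [ ] to [x].

- [ ] Code comments are complete
- [ ] Error logs are recorded correctly
- [ ] Return values are checked (do not use void to suppress the return values of secure functions or in-house functions; consider abnormal interface scenarios; check return values when calling underlying component interfaces)
- [ ] Null pointers are checked
- [ ] Any acquired resources are correctly released after use
- [ ] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [ ] Code is formatted with clang-format according to [the format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [ ] Code conforms to the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!889

7 days ago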
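The ratio constraint described in PR !889 (one cloud chunk corresponding to a fixed number of edge chunks) can be illustrated with a small Python sketch. This is not the repository's `split_long_seq_by_ratio`; the names `RatioInfoSketch`, `split_by_ratio_sketch`, and both field names are hypothetical, and the approach (derive the cloud chunks by grouping edge chunks, so the two sides cannot disagree) is a simplifying assumption rather than the PR's actual algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioInfoSketch:
    """Hypothetical stand-in for the PR's RatioInfo; field names are assumptions."""
    edge_chunks_per_cloud_chunk: int  # e.g. 3: one cloud chunk covers three edge chunks
    edge_chunk_len: int               # edge chunk length in tokens (assumed already aligned)


def split_by_ratio_sketch(seq_len: int, info: RatioInfoSketch) -> tuple[list[int], list[int]]:
    """Return (edge_chunks, cloud_chunks) as lists of chunk lengths.

    Cloud chunks are built by grouping consecutive edge chunks, so every
    cloud chunk equals the sum of the edge chunks it covers, even when
    seq_len is not a multiple of the chunk length.
    """
    if seq_len <= 0 or info.edge_chunk_len <= 0 or info.edge_chunks_per_cloud_chunk <= 0:
        raise ValueError("seq_len, edge_chunk_len and the ratio must all be positive")

    full, rest = divmod(seq_len, info.edge_chunk_len)
    edge = [info.edge_chunk_len] * full + ([rest] if rest else [])

    k = info.edge_chunks_per_cloud_chunk
    cloud = [sum(edge[i:i + k]) for i in range(0, len(edge), k)]
    return edge, cloud


if __name__ == "__main__":
    edge, cloud = split_by_ratio_sketch(1000, RatioInfoSketch(3, 128))
    print(edge)   # seven chunks of 128 followed by a tail of 104
    print(cloud)  # [384, 384, 232]
```

Computing both policies from one source of truth is the design point: when each side splits independently with its own minimum unit, a non-standard length can leave the two lists out of step, which is the failure the PR describes.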
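[Feat] Structured output supports PD-disaggregated deployment and recomputation

Co-authored-by: earthmanylf<yulinfeng2@huawei.com>
Co-authored-by: stanzzzzz<zonghaoxin@huawei.com>
Co-authored-by: KaiMa<KaiMa_SDU@outlook.com>
Co-authored-by: zhaokerui<zhaokerui@huawei.com>
Co-authored-by: ella07<xiechangxiu3@h-partners.com>
Co-authored-by: czy6<cuizhengyao@huawei.com>
Co-authored-by: LinWei100<linwei100@huawei.com>
Co-authored-by: dongyubin1<dongyubin1@h-partners.com>
Co-authored-by: yangshuai047<yangshuai153@huawei.com>
Co-authored-by: shiqiangA<shiqiang15@huawei.com>
Co-authored-by: taojiovvo<jitao12@huawei.com>
Co-authored-by: zhang980530<zhanghao680@h-partners.com>

# message auto-generated for no-merge-commit merge:
!603 merge cherry-pick-eeaa6db into dev

[Feat] Structured output supports PD-disaggregated deployment and recomputation

Created-by: LinWei100
Commit-by: LinWei100;shiqiangA;ella07;zhaokerui;KaiMa;taojiovvo;stanzzzzz;yangshuai047;zhang980530;dongyubin1;czy6;earthmanylf
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background

> Describe why the changes in this PR are needed.\
> If applicable, link preceding PRs or other PRs for the same feature or requirement.\
> If this fixes a problem introduced by an earlier PR, link the PR that introduced it.\
> Note: Fixes #ISSUE ID closes the issue automatically. If the problem is only partly solved, do not use Fixes; use Fix part of #ISSUE ID instead.

Fixes [#358](https://gitcode.com/Ascend/MindIE-LLM/issues/358)

# Changes

> Describe how the changes are implemented and which components interact; you may list them as 1, 2, 3, ...\
> For requirement or refactoring PRs, add a detailed design document (upstream and downstream component relationships, sequence diagrams, class diagrams, DFX capabilities, and so on).

1. response_format is switched to a context mechanism
2. seqid is replaced by a context handle
3. PD-disaggregated and recomputation scenarios are supported (replay mechanism)
4. A validation check intercepts MTP combined with structured output

# Documentation changes

> Confirm whether documentation changes are involved. If so, include them in the PR and briefly describe them. If not, write "Not applicable".

# Interface changes

> Confirm whether cross-repository or customer-visible interface changes are involved. If so, describe the interfaces and the changes in detail and reflect them in the documentation. If not, write "Not applicable".

# Test results

> Describe the test scenarios, test methods, and test results.\
> Test case design should consider hardware, deployment mode, functionality, performance, accuracy, device memory, and similar dimensions.

![image.png](https://raw.gitcode.com/user-images/assets/8772840/72540674-9c3c-4361-b6a8-1e35bbd609fe/image.png 'image.png')
![image.png](https://raw.gitcode.com/user-images/assets/8772840/fdb221c6-93c1-4706-94d0-0aceb05f4dc5/image.png 'image.png')

# CheckList

> The PR submitter must self-check every item below. For each item that passes or does not apply, change [ ] to [x].

- [ ] Code comments are complete
- [ ] Error logs are recorded correctly
- [ ] Return values are checked (do not use void to suppress the return values of secure functions or in-house functions; consider abnormal interface scenarios; check return values when calling underlying component interfaces)
- [ ] Null pointers are checked
- [ ] Any acquired resources are correctly released after use
- [ ] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [ ] Code is formatted with clang-format according to [the format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [ ] Code conforms to the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!603

2 months ago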
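Remove the NPU dependency from UTs

Co-authored-by: shiqiangA<shiqiang15@huawei.com>

# message auto-generated for no-merge-commit merge:
!877 merge dev_npu0414 into dev

Remove the NPU dependency from UTs

Created-by: shiqiangA
Commit-by: shiqiangA
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background

> Describe why the changes in this PR are needed.\
> If applicable, link preceding PRs or other PRs for the same feature or requirement.\
> If this fixes a problem introduced by an earlier PR, link the PR that introduced it.\
> Note: Fixes #ISSUE ID closes the issue automatically. If the problem is only partly solved, do not use Fixes; use Fix part of #ISSUE ID instead.

[#351](https://gitcode.com/Ascend/MindIE-LLM/issues/351)

# Changes

> Describe how the changes are implemented and which components interact; you may list them as 1, 2, 3, ...\
> For requirement or refactoring PRs, add a detailed design document (upstream and downstream component relationships, sequence diagrams, class diagrams, DFX capabilities, and so on).

# Documentation changes

> Confirm whether documentation changes are involved. If so, include them in the PR and briefly describe them. If not, write "Not applicable".

# Interface changes

> Confirm whether cross-repository or customer-visible interface changes are involved. If so, describe the interfaces and the changes in detail and reflect them in the documentation. If not, write "Not applicable".

# Test results

> Describe the test scenarios, test methods, and test results.\
> Test case design should consider hardware, deployment mode, functionality, performance, accuracy, device memory, and similar dimensions.

# CheckList

> The PR submitter must self-check every item below. For each item that passes or does not apply, change [ ] to [x].

- [ ] Code comments are complete
- [ ] Error logs are recorded correctly
- [ ] Return values are checked (do not use void to suppress the return values of secure functions or in-house functions; consider abnormal interface scenarios; check return values when calling underlying component interfaces)
- [ ] Null pointers are checked
- [ ] Any acquired resources are correctly released after use
- [ ] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [ ] Code is formatted with clang-format according to [the format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [ ] Code conforms to the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!877

1 month ago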
aclgraph: add OOM exception interception and convert it to an error code

Co-authored-by: xuchi<xuchicolson@163.com>

# message auto-generated for no-merge-commit merge:
!1024 merge A00256 into dev

aclgraph: add OOM exception interception and convert it to an error code

Created-by: martinXuc
Commit-by: xuchi
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background

Fix part of #588

Following the way atb-models catches OOM exceptions and converts them to fault codes, this PR adds OOM exception interception on the aclgraph code path so that OOM scenarios print the correct error code (MIE05E000006), which makes the problem easier to locate and diagnose.

# Changes

1. **New utils/decorators/exception_handler.py**: provides the @exception_handler class decorator, which automatically wraps the target methods, catches torch.OutOfMemoryError, logs the error code, and re-raises the exception as RuntimeError
2. **model_runner_exp.py**: applies the @exception_handler decorator to ModelRunnerExp to intercept OOM in forward() / compile() / load_weights()
3. **error_code.py**: adds ACL_GRAPH_OUT_OF_MEMORY = "MIE05E000006"; reorders the ErrorCode enum by code prefix to improve maintainability

# Documentation changes

Not applicable.

# Interface changes

Not applicable.

# Test results

## Test case 01: HCCL OOM exception

### How to trigger

```shell
export HCCL_BUFFSIZE=200000
```

### Key exception console output

```shell
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 38, in wrapper
    raise RuntimeError(f"{error_msg}. Error_code: {error_code}") from e
RuntimeError: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 26, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
...
torch.OutOfMemoryError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnDispatchFFNCombine.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[ERROR] 2026-05-24-15:45:50 (PID:297309, Device:14, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 297309] 2026-05-24-15:45:50.987.747 Memory_Allocation_Failure(EL0004): Failed to allocate memory requested by HCCL module.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        alloc memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        Failed to allocate [size:419431448576] bytes of NPU memory.
        Nnopbase fails to invoke the HcclAllocComResourceByTiling function of the hccl module. ret = 24, comm = 0xfff01a18be90.
        Check nnopbase::IndvHcclWrapper::GetInstance().HcclAllocComResourceByTiling(commHandle, stream, (op::internal::PtrCastTo<NnopbaseTilingData>(executor->args->tilingInfo.tilingData))->GetData(), &contextAddr) failed
        Check NnopbaseGetHcomResource(executor, stream) failed
        Check NnopbaseExecutorGetMc2Num(executor, stream, &argsAddr, &mc2Num) failed
        Check NnopbaseExecutorPrepareParamsExt(executor, stream) failed
        Check NnopbaseRunWithWorkspace(executor, stream, workspace, workspaceSize) failed
```

## Test case 02: PTA OOM exception

### How to trigger

```shell
vim /usr/local/lib/python3.11/site-packages/mindie_llm/runtime/model_runner/model_runner_exp.py
```

```python
        Returns:
            Logits tensor, or tuple of (logits, hidden_states) if speculative tokens enabled.
        """
        # temp trigger OOM
        torch.zeros(1024 * 1024 * 1024 * 4, dtype=torch.float32, device=self.device)
        if self._kv_cache_info.check_diff(kv_caches):
            self._bind_kv_cache(kv_caches)
```

### Key exception log

```shell
[2026-05-24 16:41:28,672] [334601] [281462080598432] [llm] [ERROR] [plugin_manager.py-299] : Error encountered in generate_token (trace_ids=[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]). trigger recovery or terminate inference thread. Error: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 26, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/runtime/model_runner/model_runner_exp.py", line 298, in forward
    torch.zeros(1024 * 1024 * 1024 * 4, dtype=torch.float32, device=self.device)
torch.OutOfMemoryError: NPU out of memory. Tried to allocate 16.00 GiB (NPU 1; 61.28 GiB total capacity; 48.03 GiB already allocated; 48.03 GiB current active; 12.99 GiB free; 48.12 GiB reserved in total by PyTorch).If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

The above exception was the direct cause of the following exception:
...
RuntimeError: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
...
```

# CheckList

> The PR submitter must self-check every item below. For each item that passes or does not apply, change [ ] to [x].

- [x] Code comments are complete
- [x] Error logs are recorded correctly
- [x] Return values are checked (do not use void to suppress the return values of secure functions or in-house functions; consider abnormal interface scenarios; check return values when calling underlying component interfaces)
- [x] Null pointers are checked
- [x] Any acquired resources are correctly released after use
- [x] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [x] Code is formatted with clang-format according to [the format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [x] Code conforms to the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!1024

6 days ago
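The class-decorator pattern that PR !1024 describes (wrap selected methods, catch the OOM exception, log an error code, re-raise as RuntimeError) might look roughly like the Python sketch below. It is not the code in exception_handler.py: the factory name `wrap_oom_sketch`, its parameters, `DemoRunner`, and the shortened message are hypothetical, and the built-in `MemoryError` stands in for `torch.OutOfMemoryError` so the example runs without PyTorch.

```python
import functools
import inspect
import logging

logger = logging.getLogger(__name__)

# Error code string quoted in the PR description.
ACL_GRAPH_OUT_OF_MEMORY = "MIE05E000006"


def _wrap_method(func, exc_type, error_code):
    """Wrap one method so exc_type is logged and re-raised as RuntimeError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except exc_type as e:
            error_msg = "Device out of memory (OOM)"  # shortened, illustrative message
            logger.error("%s. Error_code: %s", error_msg, error_code)
            # "from e" keeps the original exception as __cause__ in the traceback.
            raise RuntimeError(f"{error_msg}. Error_code: {error_code}") from e
    return wrapper


def wrap_oom_sketch(exc_type, error_code, methods):
    """Hypothetical class-decorator factory; wraps the named methods of a class."""
    def decorate(cls):
        for name in methods:
            func = cls.__dict__.get(name)
            if inspect.isfunction(func):  # only plain methods defined on this class
                setattr(cls, name, _wrap_method(func, exc_type, error_code))
        return cls
    return decorate


@wrap_oom_sketch(MemoryError, ACL_GRAPH_OUT_OF_MEMORY, ("forward", "compile", "load_weights"))
class DemoRunner:
    """Toy class; only forward() is defined, the other names are skipped."""

    def forward(self, n):
        if n > 1:
            raise MemoryError("simulated OOM")  # stand-in for torch.OutOfMemoryError
        return n
```

Chaining with `raise ... from e` is what produces the "The above exception was the direct cause of the following exception" line visible in the test case 02 log.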
Replace MIES_INSTALL_PATH

Co-authored-by: xqchen7<chenxueqing7@huawei.com>

# message auto-generated for no-merge-commit merge:
!313 merge dev_127 into dev

Replace MIES_INSTALL_PATH

Created-by: xqchen7
Commit-by: xqchen7
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background

> llm repository\
> MIES_INSTALL_PATH -> MINDIE_LLM_HOME_PATH.\
> 'Fixes #155'

# Changes

> Every occurrence of MIES_INSTALL_PATH in the LLM repository is replaced with MINDIE_LLM_HOME_PATH.

# Documentation changes

> Every occurrence of MIES_INSTALL_PATH in the following files is replaced with MINDIE_LLM_HOME_PATH:\
> docs/zh/user_guide/user_manual/environment_variable.md\
> docs/zh/user_guide/user_manual/prefill_decode_mixed_deployment.md\
> examples/atb_models/examples/models/telechat/README.md

# Interface changes

> "Not applicable".

# Test results

> PD mixed deployment

![image.png](https://raw.gitcode.com/user-images/assets/8772840/f62ab605-5cfb-44f7-b4f2-1b2be1569eb7/image.png 'image.png')

> PD-disaggregated deployment

![image.png](https://raw.gitcode.com/user-images/assets/8772840/902d7cd0-5955-4edd-83c2-f4bcb900a1ae/image.png 'image.png')

# CheckList

> The PR submitter must self-check every item below. For each item that passes or does not apply, change [ ] to [x].

- [x] Code comments are complete
- [x] Error logs are recorded correctly
- [x] Return values are checked (do not use void to suppress the return values of secure functions or in-house functions; consider abnormal interface scenarios; check return values when calling underlying component interfaces)
- [x] Null pointers are checked
- [x] Any acquired resources are correctly released after use
- [x] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [x] Code is formatted with clang-format according to [the format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [x] Code conforms to the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!313

3 months ago
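[fix] Structured output: allow type: "text" in request parameters

Co-authored-by: LinWei100<linwei100@huawei.com>

# message auto-generated for no-merge-commit merge:
!894 merge fix/text into dev

[fix] Structured output: allow type: "text" in request parameters

Created-by: LinWei100
Commit-by: LinWei100
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background

> Describe why the changes in this PR are needed.\
> If applicable, link preceding PRs or other PRs for the same feature or requirement.\
> If this fixes a problem introduced by an earlier PR, link the PR that introduced it.\
> Note: Fixes #ISSUE ID closes the issue automatically. If the problem is only partly solved, do not use Fixes; use Fix part of #ISSUE ID instead.

Fixes #523

# Changes

> Describe how the changes are implemented and which components interact; you may list them as 1, 2, 3, ...\
> For requirement or refactoring PRs, add a detailed design document (upstream and downstream component relationships, sequence diagrams, class diagrams, DFX capabilities, and so on).

1. infer_param.cpp: parameter validation widens the allowed values of the response_format type from json_object, json_schema to json_object, json_schema, **text**. When the type is text, the response_format parameter is not written into the request
2. single_req_infer_interface_base.cpp: in parameter validation, the condition that enables structured output excludes type: "text"

# Documentation changes

> Confirm whether documentation changes are involved. If so, include them in the PR and briefly describe them. If not, write "Not applicable".

Added a description of type: "text"

# Interface changes

> Confirm whether cross-repository or customer-visible interface changes are involved. If so, describe the interfaces and the changes in detail and reflect them in the documentation. If not, write "Not applicable".

# Test results

> Describe the test scenarios, test methods, and test results.\
> Test case design should consider hardware, deployment mode, functionality, performance, accuracy, device memory, and similar dimensions.

# CheckList

> The PR submitter must self-check every item below. For each item that passes or does not apply, change [ ] to [x].

- [ ] Code comments are complete
- [ ] Error logs are recorded correctly
- [ ] Return values are checked (do not use void to suppress the return values of secure functions or in-house functions; consider abnormal interface scenarios; check return values when calling underlying component interfaces)
- [ ] Null pointers are checked
- [ ] Any acquired resources are correctly released after use
- [ ] For multithreaded scenarios, concurrency has been considered and there are no deadlocks
- [ ] Code is formatted with clang-format according to [the format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [ ] Code conforms to the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!894

1 month ago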
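The validation rule described in PR !894 can be sketched in Python as below. The actual change lives in C++ (infer_param.cpp and single_req_infer_interface_base.cpp) and is not shown in this listing, so the function names `normalize_response_format` and `structured_output_enabled`, the dict-based request shape, and the error messages are all assumptions made for illustration.

```python
# Allowed values after the fix: "text" joins the two structured-output types.
ALLOWED_RESPONSE_FORMAT_TYPES = {"json_object", "json_schema", "text"}


def normalize_response_format(params: dict) -> dict:
    """Validate response_format and return a copy of the request parameters.

    For type "text" the response_format entry is dropped, mirroring the
    PR's statement that it is not written into the request in that case.
    """
    out = dict(params)
    fmt = out.get("response_format")
    if fmt is None:
        return out
    if not isinstance(fmt, dict):
        raise ValueError("response_format must be an object")
    fmt_type = fmt.get("type")
    if fmt_type not in ALLOWED_RESPONSE_FORMAT_TYPES:
        raise ValueError(f"unsupported response_format type: {fmt_type!r}")
    if fmt_type == "text":
        del out["response_format"]
    return out


def structured_output_enabled(params: dict) -> bool:
    """Structured output is enabled only for the non-text types."""
    return "response_format" in normalize_response_format(params)
```

With this shape, a request carrying `{"type": "text"}` passes validation but is treated like a request with no response_format at all, which matches both items in the change list.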