File | Last commit | Last updated
[dev] Synchronize code Co-authored-by: forcekeng<gengli8@huawei.com> 5 months ago
aclgraph: intercept OOM exceptions and convert them to an error code
Co-authored-by: xuchi<xuchicolson@163.com>

# message auto-generated for no-merge-commit merge: !1024 merge A00256 into dev

aclgraph: intercept OOM exceptions and convert them to an error code

Created-by: martinXuc
Commit-by: xuchi
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background
Fix part of #588
Following the way atb-models catches OOM exceptions and converts them to fault codes, this PR adds OOM interception on the aclgraph code path, so that OOM scenarios emit the correct error code (MIE05E000006) and are easier to locate and diagnose.

# Changes
1. **Added utils/decorators/exception_handler.py**: provides the @exception_handler class decorator, which automatically wraps the target methods, catches torch.OutOfMemoryError, logs the error code, and re-raises the exception as RuntimeError.
2. **model_runner_exp.py**: applies the @exception_handler decorator to ModelRunnerExp to intercept OOM in forward() / compile() / load_weights().
3. **error_code.py**: adds ACL_GRAPH_OUT_OF_MEMORY = "MIE05E000006" and reorders the ErrorCode enum by code prefix to improve maintainability.

# Documentation changes
Not applicable.

# Interface changes
Not applicable.

# Test results
## Test case 01: HCCL OOM exception
### How to trigger
```shell
export HCCL_BUFFSIZE=200000
```
### Key exception console output
```shell
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 38, in wrapper
    raise RuntimeError(f"{error_msg}. Error_code: {error_code}") from e
RuntimeError: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 26, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
...
torch.OutOfMemoryError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnDispatchFFNCombine.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation.
Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[ERROR] 2026-05-24-15:45:50 (PID:297309, Device:14, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 297309] 2026-05-24-15:45:50.987.747 Memory_Allocation_Failure(EL0004): Failed to allocate memory requested by HCCL module.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        alloc memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        Failed to allocate [size:419431448576] bytes of NPU memory.
        Nnopbase fails to invoke the HcclAllocComResourceByTiling function of the hccl module. ret = 24, comm = 0xfff01a18be90.
        Check nnopbase::IndvHcclWrapper::GetInstance().HcclAllocComResourceByTiling(commHandle, stream, (op::internal::PtrCastTo<NnopbaseTilingData>(executor->args->tilingInfo.tilingData))->GetData(), &contextAddr) failed
        Check NnopbaseGetHcomResource(executor, stream) failed
        Check NnopbaseExecutorGetMc2Num(executor, stream, &argsAddr, &mc2Num) failed
        Check NnopbaseExecutorPrepareParamsExt(executor, stream) failed
        Check NnopbaseRunWithWorkspace(executor, stream, workspace, workspaceSize) failed
```
## Test case 02: PTA OOM exception
### How to trigger
```shell
vim /usr/local/lib/python3.11/site-packages/mindie_llm/runtime/model_runner/model_runner_exp.py
```
```python
        Returns:
            Logits tensor, or tuple of (logits, hidden_states) if speculative tokens enabled.
        """
        # temp trigger OOM
        torch.zeros(1024 * 1024 * 1024 * 4, dtype=torch.float32, device=self.device)
        if self._kv_cache_info.check_diff(kv_caches):
            self._bind_kv_cache(kv_caches)
```
### Key exception log
```shell
[2026-05-24 16:41:28,672] [334601] [281462080598432] [llm] [ERROR] [plugin_manager.py-299] : Error encountered in generate_token (trace_ids=[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]). trigger recovery or terminate inference thread.
Error: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/exception_handler.py", line 26, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/runtime/model_runner/model_runner_exp.py", line 298, in forward
    torch.zeros(1024 * 1024 * 1024 * 4, dtype=torch.float32, device=self.device)
torch.OutOfMemoryError: NPU out of memory. Tried to allocate 16.00 GiB (NPU 1; 61.28 GiB total capacity; 48.03 GiB already allocated; 48.03 GiB current active; 12.99 GiB free; 48.12 GiB reserved in total by PyTorch).If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

The above exception was the direct cause of the following exception:
...
RuntimeError: Device out of memory (OOM) reported by PyTorch, but it can possibly triggered by HCCL. Enable logs: export ASCEND_SLOG_PRINT_TO_STDOUT=1, export ASCEND_GLOBAL_LOG_LEVEL=3 to check if there's HCCL error messages. Error_code: MIE05E000006
Traceback (most recent call last):
...
```

# CheckList
> The PR submitter must self-check every item below; for items that pass or are not applicable, change [ ] to [x].
- [x] Code comments are complete
- [x] Error logs are recorded correctly
- [x] Return values are checked (do not use void to suppress return values of secure functions or in-house functions; consider exceptional interface scenarios; check return values when calling underlying component interfaces)
- [x] Null-pointer checks are performed
- [x] Any resources acquired are correctly released after use
- [x] For multithreaded scenarios, concurrency has been considered and no deadlock exists
- [x] Code is formatted with clang-format according to the [format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [x] Complies with the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!1024
6 days ago
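For readers unfamiliar with the pattern, the @exception_handler class decorator described in the commit above could look roughly like the sketch below. This is not the code from utils/decorators/exception_handler.py: the helper `_wrap_oom`, the tuple `TARGET_METHODS`, the constant `OOM_MESSAGE`, and the class `DemoRunner` are hypothetical names chosen for illustration, the error code is held in a plain module-level constant rather than the ErrorCode enum, and a stand-in exception class is used when `torch.OutOfMemoryError` cannot be imported. Only the `raise ... from e` line is taken from the traceback in the PR.

```python
import functools
import logging

logger = logging.getLogger(__name__)

try:
    # Available in recent PyTorch releases.
    from torch import OutOfMemoryError
except ImportError:
    class OutOfMemoryError(RuntimeError):
        """Stand-in so the sketch runs without PyTorch (assumption)."""

# Error code value quoted in the PR description.
ACL_GRAPH_OUT_OF_MEMORY = "MIE05E000006"
# Method names listed in the PR; the tuple itself is a hypothetical helper.
TARGET_METHODS = ("forward", "compile", "load_weights")
# Shortened message, for illustration only.
OOM_MESSAGE = "Device out of memory (OOM) reported by PyTorch"


def _wrap_oom(func, error_code, error_msg):
    """Wrap one method so an OOM is logged and re-raised as RuntimeError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OutOfMemoryError as e:
            logger.error("%s. Error_code: %s", error_msg, error_code)
            # Chain with `from e` so the original OOM traceback is preserved.
            raise RuntimeError(f"{error_msg}. Error_code: {error_code}") from e
    return wrapper


def exception_handler(cls):
    """Class decorator: wrap the target methods defined directly on cls."""
    for name in TARGET_METHODS:
        method = cls.__dict__.get(name)
        if callable(method):
            setattr(cls, name, _wrap_oom(method, ACL_GRAPH_OUT_OF_MEMORY, OOM_MESSAGE))
    return cls


@exception_handler
class DemoRunner:
    """Hypothetical stand-in for ModelRunnerExp."""

    def forward(self):
        raise OutOfMemoryError("NPU out of memory")

    def compile(self):
        return "compiled"
```

Calling `DemoRunner().forward()` in this sketch raises a `RuntimeError` whose message ends with the error code and whose `__cause__` is the original OOM exception, which matches the "direct cause" chaining visible in the test case 02 log.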
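[RFC]: Modify secure compile options (forbid runpath, etc.)
Co-authored-by: KaiMa<KaiMa_SDU@outlook.com>

# message auto-generated for no-merge-commit merge: !932 merge options into dev

[RFC]: Modify secure compile options (forbid runpath, etc.)

Created-by: KaiMa
Commit-by: KaiMa
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background
> Describe why the changes in this PR are needed.\
> If applicable, link preceding PRs or other PRs under the same feature/requirement.\
> If this fixes a problem introduced by an earlier PR, link the PR that introduced it.\
> Note: `Fixes #ISSUE ID` closes the issue automatically; if the issue is only partly resolved, use `Fix part of #ISSUE ID` instead of `Fixes`.

Fixes #[533](https://gitcode.com/Ascend/MindIE-LLM/issues/533)

# Changes
> Describe the concrete implementation of the changes and which components interact; you may enumerate with 1, 2, 3, ...\
> For requirement or refactoring PRs, add a detailed design document (upstream/downstream component relationships, sequence diagrams, class diagrams, DFX capabilities, etc.).

# Documentation changes
> Confirm whether documentation changes are involved. If so, reflect them in the PR and briefly describe the changes. If not, write "Not applicable".

# Interface changes
> Confirm whether this involves interface changes that span code repositories or are visible to customers. If so, describe the interfaces and the corresponding changes in detail, and reflect them in the documentation. If not, write "Not applicable".

# Test results
> Describe the test scenarios, test methods, and test results.\
> Test case design should consider hardware, deployment mode, functionality, performance, accuracy, device memory, and other dimensions.

# CheckList
> The PR submitter must self-check every item below; for items that pass or are not applicable, change [ ] to [x].
- [ ] Code comments are complete
- [ ] Error logs are recorded correctly
- [ ] Return values are checked (do not use void to suppress return values of secure functions or in-house functions; consider exceptional interface scenarios; check return values when calling underlying component interfaces)
- [ ] Null-pointer checks are performed
- [ ] Any resources acquired are correctly released after use
- [ ] For multithreaded scenarios, concurrency has been considered and no deadlock exists
- [ ] Code is formatted with clang-format according to the [format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [ ] Complies with the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!932
1 month ago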
[op] Update DispatchFFNCombine
Co-authored-by: forcekeng<gengli8@huawei.com>

# message auto-generated for no-merge-commit merge: !1034 merge dev_op_vllmB113 into dev

[op] Update DispatchFFNCombine

Created-by: forcekeng
Commit-by: forcekeng
Merged-by: Katrina-CXY
Description: Update the DispatchFFNCombine operator according to [vllm v0.13.0](https://github.com/vllm-project/vllm-ascend/tree/releases/v0.13.0).

See merge request: Ascend/MindIE-LLM!1034
6 days ago
[dev] Synchronize code Co-authored-by: forcekeng<gengli8@huawei.com> 5 months ago
[dev] Synchronize code Co-authored-by: forcekeng<gengli8@huawei.com> 5 months ago
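Remove the NPU dependency from UT
Co-authored-by: shiqiangA<shiqiang15@huawei.com>

# message auto-generated for no-merge-commit merge: !934 merge dev_npu0414 into dev

Remove the NPU dependency from UT

Created-by: shiqiangA
Commit-by: shiqiangA
Merged-by: ascend-robot
Description: <!-- PR description template updated: 20251225 -->

# Background
> Describe why the changes in this PR are needed.\
> If applicable, link preceding PRs or other PRs under the same feature/requirement.\
> If this fixes a problem introduced by an earlier PR, link the PR that introduced it.\
> Note: `Fixes #ISSUE ID` closes the issue automatically; if the issue is only partly resolved, use `Fix part of #ISSUE ID` instead of `Fixes`.

#[351](https://gitcode.com/Ascend/MindIE-LLM/issues/351)

# Changes
> Describe the concrete implementation of the changes and which components interact; you may enumerate with 1, 2, 3, ...\
> For requirement or refactoring PRs, add a detailed design document (upstream/downstream component relationships, sequence diagrams, class diagrams, DFX capabilities, etc.).

1. Removed the torch_npu dependency from the Python test cases in Mindie_llm by mocking it; with the NPU dependency removed, UT execution time drops by 5 min.

# Documentation changes
> Confirm whether documentation changes are involved. If so, reflect them in the PR and briefly describe the changes. If not, write "Not applicable".

# Interface changes
> Confirm whether this involves interface changes that span code repositories or are visible to customers. If so, describe the interfaces and the corresponding changes in detail, and reflect them in the documentation. If not, write "Not applicable".

# Test results
> Describe the test scenarios, test methods, and test results.\
> Test case design should consider hardware, deployment mode, functionality, performance, accuracy, device memory, and other dimensions.

# CheckList
> The PR submitter must self-check every item below; for items that pass or are not applicable, change [ ] to [x].
- [ ] Code comments are complete
- [ ] Error logs are recorded correctly
- [ ] Return values are checked (do not use void to suppress return values of secure functions or in-house functions; consider exceptional interface scenarios; check return values when calling underlying component interfaces)
- [ ] Null-pointer checks are performed
- [ ] Any resources acquired are correctly released after use
- [ ] For multithreaded scenarios, concurrency has been considered and no deadlock exists
- [ ] Code is formatted with clang-format according to the [format template provided in the repository](https://gitcode.com/Ascend/MindIE-LLM/blob/master/.clang-format)
- [ ] Complies with the Ascend community coding standards. [C++ Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-coding-style-guide.md) | [C++ Secure Programming Guide](https://gitcode.com/Ascend/community/blob/master/docs/contributor/Ascend-cpp-secure-coding-guide.md)

See merge request: Ascend/MindIE-LLM!934
1 month ago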
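The commit above does not show how torch_npu is mocked. One common way to remove a hardware-bound import from unit tests is to place a `MagicMock` in `sys.modules` for the duration of a test, sketched below. The helper name `install_fake_torch_npu` is hypothetical, and the call `torch_npu.npu.set_device(0)` is only a placeholder attribute access on the mock, not a statement about what the MindIE-LLM tests actually call.

```python
import sys
from unittest import mock


def install_fake_torch_npu():
    """Return a context manager that makes `import torch_npu` yield a MagicMock.

    Hypothetical helper: patch.dict restores sys.modules on exit, so the
    fake module does not leak into other tests.
    """
    fake = mock.MagicMock(name="torch_npu")
    return mock.patch.dict(sys.modules, {"torch_npu": fake})


with install_fake_torch_npu():
    # The import system finds the entry in sys.modules and binds the mock.
    import torch_npu

    # Any attribute chain on a MagicMock is itself a mock, so no NPU is needed.
    torch_npu.npu.set_device(0)
    torch_npu.npu.set_device.assert_called_once_with(0)
```

Because every attribute of the mock is auto-created, code under test that only needs the import to succeed runs unchanged; tests that depend on real return values would still have to configure them explicitly on the mock.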