ascend-robot: [Feature] Agent Skill For Host CPU Bind

File	Last commit	Last updated
assets	[Feature] Agent Skill For Host CPU Bind (merge request Ascend/msagent!84)	16 days ago
docs	[Feature] Agent Skill For Host CPU Bind (merge request Ascend/msagent!84)	16 days ago
scripts	[Feature] Agent Skill For Host CPU Bind (merge request Ascend/msagent!84)	16 days ago
templates	[Feature] Agent Skill For Host CPU Bind (merge request Ascend/msagent!84)	16 days ago
README.md	[Feature] Agent Skill For Host CPU Bind (merge request Ascend/msagent!84)	16 days ago
SKILL.md	[Feature] Agent Skill For Host CPU Bind (merge request Ascend/msagent!84)	16 days ago

mindstudio-cpu-binding

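mindstudio-cpu-binding is a general-purpose Agent Skill for Host CPU affinity optimization in NPU + PyTorch / LLM Serving scenarios. It is not tied to a single framework or runtime; it helps any Agent Runtime that can read a Skill directory perform evidence-based analysis of the following questions:

Is the Host CPU affinity reasonable?
Does NUMA locality match the NPU?
Do cgroup / cpuset settings restrict the CPUs that are actually available?
Do PyTorch / serving runtime thread settings conflict with the available CPU resources?
Do multiple ranks / workers / instances have conflicting CPU ranges?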
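
As an illustration of the affinity and NUMA questions above, here is a minimal sketch, not the Skill's actual implementation. It parses the kernel's CPU list notation (the format used by Cpus_allowed_list in /proc/<pid>/status and by /sys/devices/system/node/node<N>/cpulist) and reports which NUMA nodes an allowed CPU set touches. The function names and the two-node host are assumptions made for this example.

```python
from typing import Dict, List, Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a kernel-style CPU list such as "0-3,8-11" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes_spanned(allowed: Set[int], node_cpus: Dict[int, Set[int]]) -> List[int]:
    """Return the sorted NUMA node ids whose CPUs intersect the allowed set."""
    return sorted(node for node, cpus in node_cpus.items() if cpus & allowed)


# Hypothetical two-node host: node 0 owns CPUs 0-3, node 1 owns CPUs 4-7.
nodes = {0: parse_cpu_list("0-3"), 1: parse_cpu_list("4-7")}
spanned = numa_nodes_spanned(parse_cpu_list("2-5"), nodes)
print(spanned)  # more than one entry means the process may run across NUMA nodes
```

A process whose allowed set touches more than one node is only a candidate finding; whether it matters depends on which node the NPU is attached to.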

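Division of roles

SKILL.md: the instruction entry point read by the Agent; it defines the interaction, boundaries, and workflow.
README.md: the overview for human readers; it covers the value, installation, use cases, CLI flow, and development entry points.

Directory conventions

We recommend keeping the following portable layout so that different Agent Runtimes can mount and recognize it directly: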

skills/
├── mindstudio-cpu-binding/
│   ├── SKILL.md
│   ├── README.md
│   ├── scripts/
│   ├── docs/
│   └── templates/
└── tests/

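skills/tests/ is the in-repository test directory that safeguards script correctness; it is not part of the runtime content a user must ship when installing the Skill. skills/mindstudio-cpu-binding/samples/ may exist as a local development and validation data directory for offline example checks; it is not required for diagnosing a real node.

Key features

Focused on NPU workloads: covers PyTorch training, offline/batch inference, and LLM Serving scenarios such as vLLM-Ascend and SGLang.
Evidence-based diagnosis: collect first, then analyze; every conclusion must be traceable to the Snapshot and runtime evidence.
Read-only first: by default, discovery and collection are read-only and make no state changes; PID / worker / NPU mappings are preferably generated as candidates by commands, so the user only needs to confirm them.
Self-contained report: generates report.html when the Snapshot contains enough information, and explicitly marks the gaps when it does not.
Safe execution boundary: any state-changing action must first present its risks, rollback, and verification method, and then wait for explicit confirmation.
Agent-friendly: interaction, commands, artifacts, and report structure are all designed around the Agent workflow; questions are asked one at a time, with options offered first.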

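Use cases

mindstudio-cpu-binding is a good fit when you run into problems like these:

PyTorch training with jittery step time, low samples/s, or imbalance across ranks
PyTorch offline/batch inference with low throughput, high latency, or obvious CPU contention
Abnormal TTFT, TPOT, tokens/s, QPS, or p99 in vLLM-Ascend, SGLang, or any other LLM Serving stack that can provide PID / process / runtime evidence
Overlapping CPU ranges across multiple ranks / workers / instances
Docker / K8s / Slurm / cgroup / cpuset restrictions that make the actually available CPUs differ from what was expected
A need to determine whether processes, workers, engines, schedulers, or tokenizers run across NUMA nodes, or are bound too widely or too narrowly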
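
The CPU range overlap case in the list above can be sketched as a pairwise set intersection. This is a hedged illustration rather than the Skill's diagnosis rule; the function name find_cpu_overlaps and the rank bindings are made up for the example.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_cpu_overlaps(bindings: Dict[str, Set[int]]) -> List[Tuple[str, str, List[int]]]:
    """Return (name_a, name_b, shared_cpus) for every pair that shares CPUs."""
    overlaps = []
    for name_a, name_b in combinations(sorted(bindings), 2):
        shared = sorted(bindings[name_a] & bindings[name_b])
        if shared:
            overlaps.append((name_a, name_b, shared))
    return overlaps


# Hypothetical bindings: rank0 and rank1 both claim CPUs 6 and 7.
bindings = {
    "rank0": set(range(0, 8)),
    "rank1": set(range(6, 14)),
    "rank2": set(range(16, 24)),
}
for name_a, name_b, shared in find_cpu_overlaps(bindings):
    print(f"{name_a} and {name_b} share CPUs {shared}")
```

Comparing every pair is quadratic in the number of ranks, which is acceptable for the handful of ranks or workers found on a single host.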

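Non-goals and safety boundaries

The following are out of scope by default, unless the user explicitly asks to extend the scope:

Deep diagnosis with ftrace / eBPF / perf
IRQ affinity tuning
CPU governor tuning
Kernel boot parameter changes
K8s node-level configuration changes
Automatic service restarts
Running any state-changing command without explicit confirmation

In other words: mindstudio-cpu-binding may give advice, but it must not change the environment on its own.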
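
One way to picture this boundary is a confirmation gate that refuses to act without an explicit "yes". The sketch below is an assumption about how such a gate could look, not code from this repository; ProposedChange and confirm_then_apply are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedChange:
    """A state-changing action plus the context a user needs to approve it."""
    description: str
    risk: str
    rollback: str
    verification: str


def confirm_then_apply(
    change: ProposedChange,
    apply: Callable[[], None],
    ask: Callable[[str], str] = input,
) -> bool:
    """Show risk, rollback, and verification; apply only on an explicit 'yes'."""
    print(f"Proposed change: {change.description}")
    print(f"Risk:            {change.risk}")
    print(f"Rollback:        {change.rollback}")
    print(f"Verification:    {change.verification}")
    answer = ask("Type 'yes' to apply, anything else to skip: ")
    if answer.strip().lower() != "yes":
        print("Skipped: no explicit confirmation.")
        return False
    apply()
    return True
```

The ask parameter defaults to input for interactive use and can be replaced with a stub, for example ask=lambda prompt: "no", when the gate itself needs to be tested.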

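Generic installation

Any Agent Runtime that can read a skill directory can register this directory as mindstudio-cpu-binding.

Runtime requirements

At minimum, the runtime must be able to:

Read SKILL.md
Access scripts/
Run Python
Read /proc, /sys, cgroup, and lscpu output in read-only mode
Read npu-smi output as well, if the node has NPUs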
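
The host-side requirements above can be probed before any collection starts. The following is a rough sketch under stated assumptions: probe_runtime is a hypothetical helper, and /sys/fs/cgroup is assumed to be the cgroup mount point, which is conventional but not guaranteed on every host.

```python
import os
import shutil
from typing import Dict


def probe_runtime() -> Dict[str, bool]:
    """Report which read-only data sources appear to be reachable on this host."""
    return {
        "proc_readable": os.access("/proc", os.R_OK),
        "sys_readable": os.access("/sys", os.R_OK),
        # Assumption: cgroup filesystems are mounted under /sys/fs/cgroup.
        "cgroup_readable": os.access("/sys/fs/cgroup", os.R_OK),
        "lscpu_found": shutil.which("lscpu") is not None,
        # npu-smi is only expected on nodes that actually have NPUs.
        "npu_smi_found": shutil.which("npu-smi") is not None,
    }


for name, ok in probe_runtime().items():
    print(f"{name:16} {'ok' if ok else 'missing'}")
```

A missing npu-smi is not necessarily an error; it may simply mean the node has no NPU.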

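Registration

Mounting the entire skills/mindstudio-cpu-binding directory is the simplest option; the minimal runtime content is SKILL.md, scripts/, templates/, and whichever docs/ files are needed. samples/ is optional and only used for offline example validation; skills/tests/ is the repository test directory and is not installed with the Skill.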
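
Before registering a copy of the Skill, it may help to check that the minimal content named above is present. This is a small sketch, assuming a hypothetical helper missing_entries; docs/ is left out of the required list because only some of its files are needed.

```python
from pathlib import Path
from typing import List, Union

# Minimal runtime content according to the registration notes above.
REQUIRED_ENTRIES = ("SKILL.md", "scripts", "templates")


def missing_entries(skill_dir: Union[str, Path]) -> List[str]:
    """Return the required entries that do not exist under skill_dir."""
    root = Path(skill_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
```

For example, missing_entries("skills/mindstudio-cpu-binding") returns an empty list when the directory is complete.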

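Claude Code example

This is only an example; mindstudio-cpu-binding is not a Claude-only solution.

Project-level mount (run from the repository root; the target is Claude Code's skills discovery directory .claude/skills/mindstudio-cpu-binding, not the package directory inside the repository):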

mkdir -p .claude/skills
ln -s /absolute/path/to/mindstudio-cpu-binding .claude/skills/mindstudio-cpu-binding

User-level mount:

mkdir -p ~/.claude/skills
ln -s /absolute/path/to/mindstudio-cpu-binding ~/.claude/skills/mindstudio-cpu-binding

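Generic trigger examples

The following prompts work with any Agent Runtime that supports Skills:

“Help me analyze the Host CPU binding issues of this NPU training job.”
“Check whether this vLLM-Ascend service has a NUMA locality conflict.”
“Analyze whether the cgroup/cpuset settings and CPU ranges of this multi-process inference job conflict with each other.”
“Read the existing snapshot.json and give me a conservative plan and an advanced plan.”

Practical CLI flow

mindstudio-cpu-binding ships a set of prototype CLIs that can be run directly. The commands below are run from the Skill directory by default: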

cd skills/mindstudio-cpu-binding

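We recommend using them in the order below. The Agent should first explain what each command does and its safety boundary, and then let the user choose whether to run it; users normally do not need to work out the full PID / worker / NPU mapping by hand.

1. Analyze an existing Snapshot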

python scripts/cli.py analyze --snapshot out/snapshot.json --out out

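This generates a diagnosis plan and an HTML report from the existing Snapshot.

2. Discover candidate process and NPU mappings

We recommend starting with read-only process discovery, especially when the user is unsure of the target PID, TP worker, rank, instance, or NPU mapping. This command only reads ps and npu-smi output to generate a candidate list and does not modify system state: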

python scripts/process_discovery.py --out out/processes.json

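The Agent should summarize the candidates from processes.json, for example the API server, scheduler, engine/worker, TP rank, rank/main processes, and their NPU usage, and then ask the user to confirm which candidates go into the next collection step.

3. Read-only collection after candidates are confirmed

Once the user has confirmed the target PID or candidate mapping, run Snapshot collection. This command reads /proc, /sys, cgroup, NPU topology, and runtime information in read-only mode and generates snapshot.json; it does not modify affinity, cgroup, or system configuration: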

python scripts/cli.py collect --pid <pid> --scenario training --framework pytorch --device-type npu --optimization-goal stability --sample-seconds 10 --out out/snapshot.json
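
To make "read-only" concrete, the sketch below reads a single field from /proc/<pid>/status without touching any state. It only illustrates the kind of access involved and is not taken from scripts/collect.py; read_status_field and its proc_root parameter are assumptions for this example.

```python
from pathlib import Path
from typing import Optional, Union


def read_status_field(
    pid: Union[int, str], field: str, proc_root: str = "/proc"
) -> Optional[str]:
    """Return one field of <proc_root>/<pid>/status, or None if it is absent."""
    status_path = Path(proc_root) / str(pid) / "status"
    with status_path.open("r", encoding="utf-8") as handle:
        for line in handle:
            key, _, value = line.partition(":")
            if key == field:
                return value.strip()
    return None
```

On a Linux host, read_status_field("self", "Cpus_allowed_list") returns the CPU list the current process is allowed to run on. The proc_root parameter exists so the function can be pointed at a captured copy of /proc for offline checks.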

4. Multi-rank collection example

python scripts/cli.py collect --pid 12345 --pid 12346 --scenario training --framework pytorch --device-type npu --optimization-goal stability --sample-seconds 10 --rank-map rank0=12345:npu0,rank1=12346:npu1 --out out/snapshot.json
python scripts/cli.py analyze --snapshot out/snapshot.json --out out
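
The --rank-map value in the command above appears to follow a rank=pid:device pattern separated by commas. The sketch below parses and rebuilds that pattern; the grammar is inferred from this single example, so the real parser in scripts/cli.py may accept more or fewer forms. RankBinding, parse_rank_map, and format_rank_map are hypothetical names.

```python
from typing import Dict, NamedTuple


class RankBinding(NamedTuple):
    pid: int
    device: str


def parse_rank_map(text: str) -> Dict[str, RankBinding]:
    """Parse "rank0=12345:npu0,rank1=12346:npu1" into {rank: (pid, device)}."""
    mapping: Dict[str, RankBinding] = {}
    for entry in text.split(","):
        rank, equals, target = entry.strip().partition("=")
        pid_text, colon, device = target.partition(":")
        if not (rank and equals and colon and device and pid_text.isdigit()):
            raise ValueError(f"malformed rank-map entry: {entry!r}")
        mapping[rank] = RankBinding(int(pid_text), device)
    return mapping


def format_rank_map(mapping: Dict[str, RankBinding]) -> str:
    """Rebuild the comma-separated form from a parsed mapping."""
    return ",".join(f"{rank}={b.pid}:{b.device}" for rank, b in mapping.items())


ranks = parse_rank_map("rank0=12345:npu0,rank1=12346:npu1")
print(ranks["rank1"].pid, ranks["rank1"].device)
```

Building the string from a confirmed candidate list, rather than typing it by hand, reduces the chance of pairing a PID with the wrong NPU.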

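Outputs

A complete run typically produces the following files:

snapshot.json: read-only collection result, recording CPU / NUMA / NPU / process / cgroup / runtime information
plan.json: diagnosis plan, recommendations, risks, and rollback notes
report.html: self-contained HTML report that is easy to share and view
raw/: raw intermediate data from collection, useful for troubleshooting parsing and tracing the evidence chain

What report.html contains

report.html focuses on “conclusions + evidence + recommendations”. Its typical structure includes:

Report summary
Current CPU binding state
CPU / NPU / NUMA topology
CPU / NUMA logical CPU grid
Runtime CPU usage and contention
Findings
Recommended binding plan
Recommended PyTorch / Runtime / Serving thread settings
Verification plan
Risks and rollback
Information gaps

The topology section tries to draw the Server -> NUMA -> NPU relationships clearly, so that readers do not have to locate problems from a wall of text.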

Developer reference

Data flow

collect -> diagnose -> planner -> report

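Key file map

scripts/collect.py: collection entry point; assembles the Snapshot data
scripts/diagnose.py: diagnoses problems based on rules
scripts/planner.py: turns diagnosis results into plans and recommendations
scripts/report.py: generates the HTML report
scripts/topology_view.py: renders the topology view and relationship display

Design document index

We recommend reading them in this order: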

docs/agent-workflow.md
docs/question-flow.md
docs/snapshot-schema.md
docs/diagnosis-rules.md
docs/html-report-design.md
docs/binding-rollback-design.md

Tests and checks

Run from the repository root:

pytest skills/tests/test_report.py skills/tests/test_topology_view.py -v
pytest skills/tests -v

Run the README check from the repository root:

pre-commit run --files skills/mindstudio-cpu-binding/README.md

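Notes and boundaries

This project does not guarantee a performance improvement; it provides diagnosis, recommendations, and a verification path.
It does not automatically rewrite Docker, K8s, Slurm, or node-level configuration.
It does not assume Claude only and can be used with any compatible Agent Runtime.
Every state-changing action must be confirmed before it is executed.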