msmodeling/.agents/skills/microbench · Ascend/MindStudio-Modeling - AtomGit

ascend-robot [feat] Add SKILL: configure device via natural language

File	Last commit	Last updated
SKILL.md	[feat] Add SKILL: configure device via natural language

Co-authored-by: wendellX<wenxiaoping6@huawei.com>

# message auto-generated for no-merge-commit merge:
!243 merge skill_device_cfg into develop

[feat] Add SKILL: configure device via natural language

Created-by: wendellX
Commit-by: wendellX
Merged-by: ascend-robot

Description:

# PR Template

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

- [X] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

------

## 📝 Modification

Add a SKILL file: configure device via natural language.

------

## 📐 Associated Test Results

---

User: I want to import a new device topology.

Claude: We don't need to fill in complex parameters yet; I'll walk you through the configuration step by step. For now, just tell me the following 3 things. For anything you don't know, simply write "don't know":

1. What hardware is this? For example, the vendor and model.
2. At what granularity do you want to model this topology: a single card, a single die / chiplet, or some other scheduling unit?
3. Do you have any of the following raw information: memory capacity, memory bandwidth, FP16/BF16/INT8/FP32 compute, and a description of "how many cards / dies there are and how they are connected"? If so, just paste it as-is.

You can reply like this:

- Model: XXX
- Granularity: single card / single die / not sure
- Known specs: 96GB memory, 3.2TB/s, BF16 800T, 8 cards, 400GB/s between cards

Once I have that, I'll first organize it into a "confirmed / still uncertain / can default for now" list, then continue to the next step.

User: Huawei's ATLAS_800_A3_560T_128G_DIE. Single die, memory capacity 64GiB, memory bandwidth 1.6TiB/s.

Claude: Let me first record the facts confirmed so far.

- Confirmed:
  - Vendor: HUAWEI
  - Candidate profile name: `ATLAS_800_A3_560T_128G_DIE`
  - Granularity: single die
  - Memory per profile unit: 64GiB
  - Memory bandwidth per profile unit: 1.6TiB/s
- Note:
  - The existing `ATLAS_800_A3_752T_128G_DIE` in the repository [tensor_cast/device.py:299-320] also follows the style of "128G in the name, but 64GiB of memory per die", so your naming is consistent with it.

Next I'll only ask about the 3 most critical things. If you don't know, just write "don't know" or "default for now":

1. Which kind of compute does `560T` refer to?
   - For example: `FP16 560T`, `BF16 560T`, `INT8 560T`
   - If your material lists other peak values for `FP32 / BF16 / FP16 / INT8`, paste them as-is too.
2. Does your material list a separate "general-purpose / vector compute" figure?
   - This mainly affects estimates for non-matrix operators such as softmax, norm, and activation functions.
   - You can reply:
     - `None, leave it empty for now`
     - or `None, use a provisional estimate for now`
3. How do you want to model the interconnect topology for this single-die profile?
   - You can pick one:
     - `Reuse the existing A3 die topology`
     - `Model only the 8-card intra-node level`
     - `I only know 2 dies/card and 8 cards/node; default the rest`
   - If you know the interconnect bandwidth, also tell me whether it is "unidirectional" or "bidirectional".

Once you answer these 3 items, I can list the fields to be written, the defaults, and the items pending calibration, and then ask you for final confirmation.

User: FP32 75T, BF16 245.8T, FP16 280T, INT8 560T. The communication grid shape is [48, 8, 2].

Claude: OK, here is what I currently have as "confirmed":

- `name`: `ATLAS_800_A3_560T_128G_DIE`
- `vendor`: `HUAWEI`
- Granularity: single die
- Memory per die: 64 GiB
- Memory bandwidth per die: 1.6 TiB/s
- `mma_ops`:
  - FP32 = 75T
  - BF16 = 245.8T
  - FP16 = 280T
  - INT8 = 560T
- `comm_grid` shape: `[48, 8, 2]`

This is essentially a combination of the existing A3 die template [tensor_cast/device.py:169-182] and the compute style of the 280T tier. There are 3 remaining points that go directly into the code; just pick an option for each:

1. How should `gp_ops` be handled?
   - `Leave it empty for now`
   - `Use provisional estimates for now: FP32 8T, BF16/FP16 16T`
   - `I have official values: ...`
2. For the per-level interconnect parameters of `[48, 8, 2]`, should I directly reuse the existing A3 die topology [tensor_cast/device.py:169-182]?
   - Outer level: 196 GB/s, 5.5 us
   - Middle level: 196 GB/s, 0.5 us
   - Inner level: 224 GB/s, 0.2 us
   - `comm_efficiency = 0.7`
   - Type: the current code default, `CLOS`
   - These bandwidths will be written as "unidirectional bandwidth"

   You can simply reply: `Reuse the existing A3 die topology`
3. If no more precise data is available, do you accept reusing the current default efficiencies and static costs for now, marked as `needs calibration`?
   - `compute_efficiency = 0.7`
   - `memory_efficiency = 0.6`
   - `StaticCost(5us, 2us, 10us)`

Once you answer these 3 items, I'll edit [tensor_cast/device.py] directly, then verify import and registration, and give you a usable `--device ATLAS_800_A3_560T_128G_DIE`.

User: 1: use the provisional values; 2: reuse A3; 3: compute_efficiency=0.7, memory_efficiency=0.6.

Claude: One small point that goes into the code remains: for `StaticCost`, do you also want to reuse the existing A3 defaults `mma=5us, gp=2us, comm=10us` for now? If so, just reply "OK" and I'll edit the code and verify.

User: mma_op_cost_s=5e-06, gp_op_cost_s=2e-06, comm_op_cost_s=1e-05. Use these.

Claude: Added at [tensor_cast/device.py:299-320], reusing the existing A3 die topology [tensor_cast/device.py:169-182].

- Registered `DeviceProfile.name`
  - `ATLAS_800_A3_560T_128G_DIE`
- Facts you provided, used directly
  - `vendor="HUAWEI"`
  - Single-die granularity
  - `memory_size_bytes=64 * (1024**3)`
  - `memory_bandwidth_bytes_ps=1.6 * (1024**4)`
  - `mma_ops`
    - FP32 = `75 * 1e12`
    - BF16 = `245.8 * 1e12`
    - FP16 = `280 * 1e12`
    - INT8 = `560 * 1e12`
  - `comm_grid` shape `[48, 8, 2]`
  - `compute_efficiency=0.7`
  - `memory_efficiency=0.6`
  - `StaticCost(mma=5e-06, gp=2e-06, comm=1e-05)`
- Estimates confirmed by the user
  - `gp_ops`
    - FP32 = `8 * 1e12`
    - BF16 = `16 * 1e12`
    - FP16 = `16 * 1e12`
- `needs calibration`
  - `gp_ops` is currently a provisional estimate, not an official vector/general compute figure
  - The interconnect parameters currently reuse the existing A3 die topology:
    - `start_dim=0`: 196 GB/s, 5.5 us, `comm_efficiency=0.7`
    - `start_dim=1`: 196 GB/s, 0.5 us, `comm_efficiency=0.7`
    - `start_dim=2`: 224 GB/s, 0.2 us, `comm_efficiency=0.7`
- Verification command
  - `python - <<'PY'`
    `from tensor_cast.device import DeviceProfile`
    `print(DeviceProfile.all_device_profiles["ATLAS_800_A3_560T_128G_DIE"])`
    `PY`
  - Result: the `DeviceProfile` was printed successfully; registration works
- Ready-to-use command
  - `python -m cli.inference.text_generate Qwen/Qwen3-32B --num-queries 2 --query-length 3500 --device ATLAS_800_A3_560T_128G_DIE`

It is ready to use. If you like, the most worthwhile thing to calibrate next is whether official values exist for `gp_ops`.

------

## 🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

------

## ✅ Checklist

Before PR:

- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [ ] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!243	1 month ago
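
The unit conversions quoted in the transcript (GiB and TiB to bytes, "T" to operations per second, and the size of the `[48, 8, 2]` communication grid) can be checked with a small standalone sketch. `ProfileSketch` below is a hypothetical stand-in written only for this illustration; it is not the real `tensor_cast.device.DeviceProfile`, whose constructor signature is not shown in the source. The field names are taken from the transcript, except `comm_grid_shape`, which is an assumed name for the grid shape.

```python
from dataclasses import dataclass
from math import prod

GIB = 1024 ** 3  # bytes in one GiB
TIB = 1024 ** 4  # bytes in one TiB
TERA = 1e12      # operations per second in one "T"


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical container; NOT tensor_cast.device.DeviceProfile."""
    name: str
    vendor: str
    memory_size_bytes: int
    memory_bandwidth_bytes_ps: float
    mma_ops: dict           # peak matrix ops/s per dtype
    gp_ops: dict            # general-purpose (vector) ops/s per dtype
    comm_grid_shape: tuple  # assumed field name for the comm_grid shape
    compute_efficiency: float = 0.7
    memory_efficiency: float = 0.6


profile = ProfileSketch(
    name="ATLAS_800_A3_560T_128G_DIE",
    vendor="HUAWEI",
    memory_size_bytes=64 * GIB,
    memory_bandwidth_bytes_ps=1.6 * TIB,
    mma_ops={
        "FP32": 75 * TERA,
        "BF16": 245.8 * TERA,
        "FP16": 280 * TERA,
        "INT8": 560 * TERA,
    },
    # Provisional estimates from the transcript, flagged "needs calibration".
    gp_ops={"FP32": 8 * TERA, "BF16": 16 * TERA, "FP16": 16 * TERA},
    comm_grid_shape=(48, 8, 2),
)

# Number of single-die units addressed by the grid: 48 * 8 * 2.
total_dies = prod(profile.comm_grid_shape)

print(profile.memory_size_bytes)  # 68719476736
print(total_dies)                 # 768
```

Keeping the binary prefixes as named constants makes it harder to repeat the `1024*3` versus `1024**3` slip that is easy to introduce when such expressions pass through Markdown rendering.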