**Improve GLM5 shape grid generation and microbench backfill support**

Merged `codex/debug-shape-grid-generation` into `develop` via !252 (no-merge-commit merge; message auto-generated). Created and committed by Secluded_Ocean (co-author: Secluded_Ocean <tangchuxiao0709@qq.com>); merged by ascend-robot.

## Summary

- Improve GLM5 shape grid generation and EP32 DFC replay coverage.
- Normalize shape matching and deduplication behavior.
- Extend the microbench update/replay tooling and related tests.

## Validation

- Not run locally, because pytest is not installed for the available Python interpreter.

See merge request: Ascend/msmodeling!252. Last updated: 1 month ago.
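The !252 summary mentions normalizing shape matching and deduplication but does not spell out the rules. For illustration only, the sketch below assumes rows are dicts with `Type` and `Input Shapes` keys and that shapes are strings such as `"1,4096;4096,4096"` (semicolons between tensors, commas between dimensions). These are assumptions, and `normalize_shape` / `dedupe_rows` are hypothetical helpers, not functions from the repository.

```python
"""Illustrative sketch: normalize shape strings and dedupe rows by (kernel type, shapes).

Hypothetical assumptions (not taken from the repository):
- each row is a dict with "Type" and "Input Shapes" keys;
- shapes look like '"1,4096;4096,4096"', possibly with quotes, spaces,
  or a trailing ';'.
"""


def normalize_shape(raw: str) -> str:
    """Strip quotes, spaces and a trailing ';', then canonicalize each dimension."""
    cleaned = raw.strip().strip('"').replace(" ", "").rstrip(";")
    tensors = []
    for tensor in cleaned.split(";"):
        # int() canonicalizes dims such as "04096" -> "4096";
        # an empty interior slot (absent optional input) stays empty.
        dims = [str(int(d)) for d in tensor.split(",") if d]
        tensors.append(",".join(dims))
    return ";".join(tensors)


def dedupe_rows(rows):
    """Keep the first row seen for each (kernel type, normalized input shapes) key."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["Type"], normalize_shape(row["Input Shapes"]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    rows = [
        {"Type": "MatMulV2", "Input Shapes": '"1,4096;4096,4096"'},
        {"Type": "MatMulV2", "Input Shapes": "1, 4096;4096,4096;"},
        {"Type": "SwiGlu", "Input Shapes": '"1,8192"'},
    ]
    print(len(dedupe_rows(rows)))  # 2
```

Keying on the normalized string rather than the raw one lets cosmetically different records (quotes, spaces, a trailing separator) collapse into one entry, while keeping empty interior slots means a missing optional input still changes the key.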
**feat: profiling data collection toolchain**

Merged `pr/perf-db-b` into `develop` via !124 (no-merge-commit merge; message auto-generated). Created and committed by Horacehxw (co-author: Horacehxw <horacehxw@gmail.com>); merged by ascend-robot.

**PR Type**

- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [x] Test-Cases
- [ ] Other

## 🔍 Motivation

The empirical operator performance estimation system (see [PR-A: feat: profiling-based empirical performance model with CSV data source](https://gitcode.com/Ascend/msmodeling/pull/123)) depends on NPU profiling data: per-kernel CSVs plus HCCL communication benchmarks. This data has to be collected on NPU devices, then parsed and validated before it can be used.

This PR provides a complete **offline data collection toolchain** that covers the whole workflow: parsing raw profiling data, microbenchmarking, shape mutation, communication benchmarking, and computing the M6 end-to-end (E2E) accuracy metric.

> 📌 This PR has **no code dependency** on [PR-A](https://gitcode.com/Ascend/msmodeling/pull/123) (the core feature): tools/ does not import tensor_cast, so it can be reviewed and merged independently.

------

## 📝 Modification

All new files are under `tools/perf_data_collection/` and `tests/tools/`.

### 1. Data parsing and conversion

| Tool | Lines | Description |
|------|:----:|------|
| parse_kernel_details.py | 698 | Parses the NPU kernel_details.csv and splits it into one CSV per kernel type (MatMulV2.csv, SwiGlu.csv, etc.); supports FRACTAL_NZ format conversion and shape normalization |
| build_comm_csv.py | — | Builds the communication CSV from HCCL benchmark results |
| fia_common.py + fill_fia_runtime_metadata.py | 1,544 | Infers and backfills runtime metadata for FusedInferAttentionScore |

### 2. Microbenchmarks

| Tool | Lines | Description |
|------|:----:|------|
| start_microbench.py | 771 | Entry point for automated microbenchmark runs: reads op_mapping.yaml → picks the replay script for each kernel type → collects data with msprof → parses the results and writes them back to the CSV |
| op_replay/*.py (25+ scripts) | 5,258 | One replay script per NPU kernel (MatMulV2, SwiGlu, FusedInferAttentionScore, QuantBatchMatmulV3, DispatchFFNCombine, etc.); each builds its inputs and invokes the kernel through the torch_npu API |
| generate_shape_grid.py | 2,075 | Starts from existing CSV data and generates additional shape combinations through shape mutation (dimension scaling, block-padding variants, quantization variants, etc.) to broaden coverage |

### 3. Communication benchmarks

| Tool | Lines | Description |
|------|:----:|------|
| generate_comm_microbench.py | — | Generates HCCL communication benchmark scripts (allReduce, allGather, allToAll, reduceScatter) |
| validate_comm_alignment.py | 345 | Checks that the HCCL microbenchmark CSV agrees with CommAnalyticModel |
| run_comm_bench.sh | 204 | Runs the HCCL benchmarks |

### 4. Accuracy evaluation

| Tool | Lines | Description |
|------|:----:|------|
| compute_m6.py | — | Offline M6 (E2E Ratio) computation: TC-predicted total time / measured per-forward time, using ArgMaxV2 as the anchor kernel to split the trace into forward passes |

------

## 📐 Associated Test Results

```
$ pytest tests/tools/ -q --ignore=tests/tools/test_fia_parser_backfill.py
63 passed, 2 failed in 0.11s
```

The 2 failures are expected: profile_and_update_db.py is not implemented yet (the tests were written ahead of the code).

### Import isolation check

```python
# Verify that tools/perf_data_collection does not import tensor_cast
import ast
import pathlib
import sys

for f in pathlib.Path('tools/perf_data_collection').rglob('*.py'):
    tree = ast.parse(f.read_text(encoding='utf-8'), filename=str(f))
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b as c`: module names live in node.names
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `from a.b import c`: module is None for `from . import c`
            names = [node.module or '']
        else:
            continue
        if any('tensor_cast' in name for name in names):
            print(f'ERROR: {f}:{node.lineno}')
            sys.exit(1)

print('OK: no tensor_cast imports in tools/')
```

------

## 🌟 Use cases (Optional)

```bash
# 1. Generate per-kernel CSVs from raw NPU profiling data
python tools/perf_data_collection/parse_kernel_details.py \
    --input <kernel_details.csv> --output-dir <data_dir>

# 2. Run the microbenchmarks (requires an NPU device)
python tools/perf_data_collection/start_microbench.py \
    --data-dir <data_dir> --device ATLAS_800_A3_752T_128G_DIE

# 3. Generate a shape-mutation grid to broaden coverage
python tools/perf_data_collection/generate_shape_grid.py \
    --csv-dir <data_dir> --output-dir <output_dir>

# 4. Compute M6 E2E accuracy
python tools/perf_data_collection/compute_m6.py \
    --tc-report results/metrics.json \
    --profiler-output <profiling_trace_dir>

# 5. HCCL communication benchmarks
bash tools/perf_data_collection/run_comm_bench.sh
```

------

## ✅ Checklist

**Before PR**:

- [x] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix potential lint issues.
- [x] Bug fixes are fully covered by unit tests; the case that caused the bug should be added to the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!124. Last updated: 1 month ago.
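parse_kernel_details.py is described as splitting kernel_details.csv into one CSV per kernel type. A minimal sketch of only that grouping step follows. It assumes the kernel type sits in a column named `Type` (an assumption about the profiler export, exposed as a parameter so it can be changed); FRACTAL_NZ conversion and shape normalization are left out, and `split_by_kernel_type` is a hypothetical helper, not the tool's API.

```python
"""Illustrative sketch: split a kernel-details CSV into one CSV per kernel type.

Not parse_kernel_details.py itself. Hypothetical assumptions:
- the input has a header row and a column (default "Type") holding the kernel type;
- FRACTAL_NZ conversion and shape normalization are out of scope.
"""
import csv
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_kernel_type(input_csv, output_dir, type_column="Type"):
    """Write <output_dir>/<KernelType>.csv per kernel type; return row counts per type."""
    groups = defaultdict(list)
    with open(input_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[type_column]].append(row)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    counts = {}
    for kernel_type, rows in groups.items():
        # Keep file names filesystem-safe, e.g. "MatMulV2" -> "MatMulV2.csv"
        safe_name = re.sub(r"[^A-Za-z0-9_.-]", "_", kernel_type) or "unknown"
        with open(out / f"{safe_name}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        counts[kernel_type] = len(rows)
    return counts


if __name__ == "__main__":
    tmp = Path(tempfile.mkdtemp())
    src = tmp / "kernel_details.csv"
    src.write_text(
        "Name,Type,Duration(us)\n"
        "mm_0,MatMulV2,10.5\n"
        "swiglu_0,SwiGlu,3.2\n"
        "mm_1,MatMulV2,11.0\n",
        encoding="utf-8",
    )
    print(split_by_kernel_type(src, tmp / "out"))  # {'MatMulV2': 2, 'SwiGlu': 1}
```

Grouping in memory before writing keeps each output file's header identical to the input's, which is convenient when later tooling (such as a microbench backfill step) reads the per-type CSVs back.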
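compute_m6.py is described only as the ratio of the TC-predicted total time to the measured per-forward time, with ArgMaxV2 marking forward-pass boundaries. The sketch below illustrates that arithmetic under assumptions the PR does not confirm: the trace is a list of `(kernel_type, start_us, duration_us)` tuples sorted by start time, each forward pass ends with one ArgMaxV2 kernel, and per-forward time is the wall-clock span of a pass averaged over all complete passes. `split_forward_passes` and `m6_ratio` are hypothetical names, not the tool's API.

```python
"""Illustrative sketch of an M6-style E2E ratio (not compute_m6.py itself).

Hypothetical assumptions:
- the profiler trace is a list of (kernel_type, start_us, duration_us), sorted by start;
- every forward pass ends with exactly one ArgMaxV2 kernel (the anchor);
- per-forward time is the wall-clock span from the first kernel of a pass
  to the end of its last kernel.
"""
from statistics import mean

ANCHOR = "ArgMaxV2"


def split_forward_passes(timeline):
    """Group kernels into forward passes, closing a pass at each anchor kernel."""
    passes, current = [], []
    for kernel in timeline:
        current.append(kernel)
        if kernel[0] == ANCHOR:
            passes.append(current)
            current = []
    # Kernels after the last anchor form an incomplete pass and are dropped
    return passes


def forward_time_us(kernels):
    """Wall-clock span of one pass: first start to latest end."""
    start = kernels[0][1]
    end = max(s + d for _, s, d in kernels)
    return end - start


def m6_ratio(predicted_total_us, timeline):
    """Predicted time divided by the mean measured per-forward time."""
    passes = split_forward_passes(timeline)
    if not passes:
        raise ValueError(f"no {ANCHOR} anchor found in timeline")
    measured = mean(forward_time_us(p) for p in passes)
    return predicted_total_us / measured


if __name__ == "__main__":
    timeline = [
        ("MatMulV2", 0.0, 40.0), ("SwiGlu", 40.0, 10.0), ("ArgMaxV2", 50.0, 5.0),      # 55 us
        ("MatMulV2", 100.0, 50.0), ("SwiGlu", 150.0, 10.0), ("ArgMaxV2", 160.0, 5.0),  # 65 us
        ("MatMulV2", 200.0, 40.0),  # incomplete pass, ignored
    ]
    print(m6_ratio(66.0, timeline))  # 66 / mean(55, 65) = 1.1
```

Dropping the trailing incomplete pass keeps a partially captured forward from skewing the mean. With this definition a ratio above 1 means the model predicts a longer forward pass than was measured.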