msmodeling/docs/perf_database/examples · Ascend/MindStudio-Modeling - AtomGit

ascend-robot: feat: profiling-based empirical performance model with CSV data source

File	Last commit	Last updated
comm_config_example.yaml	feat: profiling-based empirical performance model with CSV data source (Ascend/msmodeling!123)	1 month ago
op_mapping_example.yaml	feat: profiling-based empirical performance model with CSV data source (Ascend/msmodeling!123)	1 month ago

Commit message (identical for both files):

feat: profiling-based empirical performance model with CSV data source

Co-authored-by: Horacehxw<horacehxw@gmail.com>
message auto-generated for no-merge-commit merge: !123 merge pr/perf-db-a into develop
Created-by: Horacehxw
Commit-by: Horacehxw
Merged-by: ascend-robot

Description:

PR Type
- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [x] Refactor
- [ ] Perf
- [x] Test-Cases
- [ ] Other

## 🔍 Motivation

TensorCast's existing Roofline analytic model (`AnalyticPerformanceModel`) predicts Ascend NPU performance with limited accuracy: fused operators (SwiGlu, AddRmsNorm, DispatchFFNCombine) cannot be modeled, HCCL collective communication deviates significantly from theoretical bandwidth, and hardware characteristics such as the FRACTAL_NZ format cannot be captured by Roofline. This PR implements an empirical operator performance estimation system based on real NPU profiling data and feeds measured kernel latencies into the TensorCast simulation framework.

Relationship to PR#96: PR#96 has already been merged into develop and defined the `DataSourcePerformanceModel` interface skeleton (stub) and the CLI integration. This PR provides the full implementation: a CSV query engine (9 TC-vs-NPU shape matching rules), the op_mapping mapping (60+ operators), interpolation, the M1-M6 metric system, and the DFC/FlashComm compile passes. The interface is fully compatible.

> 📌 The companion offline data collection toolchain will be submitted in a follow-up PR (tools/perf_data_collection/, no code dependency on this PR).

------

## 📝 Modification

### 1. Profiling Data Source core implementation (replaces the PR#96 stub)

| File | Description |
|------|------|
| `profiling_database/profiling_data_source.py` (+1,885) | `ProfilingDataSource`: a CSV query engine driven by op_mapping.yaml that handles 9 kinds of TC-vs-NPU shape differences (batch dim stripping, seq padding, FRACTAL_NZ, ND transpose, SwiGlu concat, RoPE layout/kernel, composite decomposition, flatten batch) |
| `profiling_database/interpolating_data_source.py` (+702) | `InterpolatingDataSource`: nearest-neighbor + linear interpolation wrapper |
| `profiling_database/data_source.py` (modified) | `DataSourcePerformanceModel` ABC extension (adds the `EXTRAPOLATED` enum and the `details` field) |

### 2. EmpiricalPerformanceModel enhancements (+436)

Adds M1-M6 metric tracking on top of PR#96:

- M1-M4: coverage metrics (raw count → fused → compute-only → per-shape)
- M5: latency-weighted coverage
- M6 input: empirical hit total (used for the offline E2E ratio calculation)
- `log_stats()`: structured HIT/MISS log
- `export_hit_miss_report()`: metric export in JSON format

### 3. Compile passes (+875)

| Pass | Description |
|------|------|
| `dispatch_ffn_combine_pass.py` | DispatchFFNCombine super-fusion (init_routing_v2 + GroupedMatmul + unpermute_tokens → single op), supporting 5 quantization variants |
| `flashcomm_v1_pass.py` | FlashComm V1 graph rewrite (matmul_all_reduce → communication hiding), matching vLLM-ascend `ENABLE_FLASHCOMM1=1` |

### 4. op_mapping.yaml (3 versions, ~3,600 lines in total)

| Version | Operators |
|------|:------:|
| `vllm0.13.0_torch2.8.0_cann8.3` | ~45 |
| `vllm0.15.0_torch2.9.0_cann8.5` | ~55 |
| `vllm0.18.0_torch2.9.0_cann8.5` | ~60 |

### 5. CSV profiling data (~250 files, Git LFS)

Data for the ATLAS 800 A3 752T 128G device: HCCL communication benchmarks, kernel data for 3 vLLM versions, and supplementary micro-benchmark data.

### 6. Integration changes

| File | Change |
|------|------|
| `model_runner.py` | Profiling mode integration (`perf_models[]` + `log_stats` + `ProfilingDataSource` creation) |
| `user_config.py` | `--profiling-database` argument |
| `scripts/text_generate.py` | `--export-metrics` CLI + FlashComm configuration |
| `ops/fused_moe.py` | New `dispatch_ffn_combine` op |
| `compile_backend.py` | Registers the DFC + FlashComm passes |

------

## 📐 Associated Test Results

### Unit tests

```
$ pytest tests/perf_database/ -q
266 passed, 3 warnings in 1.94s
$ pytest tests/test_tensor_cast/test_empirical.py tests/test_tensor_cast/test_dfc_pass.py -q
8 passed, 1 skipped in 120.75s
$ lintrunner -a
ok No lint issues.
```

### Functional verification

```bash
# Analytic mode (behavior unchanged)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 2 --query-length 3500 --device TEST_DEVICE
→ [analytic] Execution time: 1.744s, TPS/Device: 4013 token/s ✅

# Profiling mode (new feature)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 1 --query-length 4112 --word-embedding-tp row \
    --device ATLAS_800_A3_752T_128G_DIE --world-size 16 --tp-size 16 \
    --quantize-linear-action DISABLED \
    --performance-model profiling --compile \
    --profiling-database tensor_cast/performance_model/profiling_database/data/ATLAS_800_A3_752T_128G_DIE/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5
→ [empirical] Execution time: 0.156s, TPS/Device: 1651 token/s ✅
```

### M1-M5 metrics

| Scenario | M3 (compute-op HR) | M5 (latency coverage) |
|------|:---------------:|:------------:|
| Qwen3-32B Prefill (BF16) | 61.5% ✅ (>50%) | 89.0% ✅ (>80%) |
| Qwen3-32B Decode (BF16) | 38.5% | 80.1% ✅ (>80%) |
| DeepSeek-V3 Prefill (W8A8) | 52.6% ✅ (>50%) | 68.9% |
| DeepSeek-V3 Decode (W8A8) | 15.8% | 54.3% |

------

## 🌟 Use cases (Optional)

```bash
# 1. Use measured data instead of Roofline estimates
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>

# 2. Export the M1-M5 metrics as JSON (for the offline M6 calculation)
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir> \
    --export-metrics results/metrics.json

# 3. Run analytic and profiling together for comparison
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model analytic --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>
```

------

## ✅ Checklist

Before PR:

- [x] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix the potential lint issues.
- [x] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!123
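
The description says `InterpolatingDataSource` wraps lookups with nearest-neighbor and linear interpolation, but this page does not show how. The sketch below illustrates the general idea only, under stated assumptions: a one-dimensional table of measured latencies keyed by a single size, linear interpolation between the two bracketing measurements, and a nearest-neighbor fallback outside the measured range. The function name `estimate_latency`, the sample values, and the `EXACT`/`INTERPOLATED` labels are hypothetical; only the name `EXTRAPOLATED` appears in the description, and its use for out-of-range queries here is a guess, not the project's confirmed behavior.

```python
from bisect import bisect_left


def estimate_latency(samples, size):
    """Estimate a latency for `size` from measured (size, latency_us) pairs.

    Returns (latency_us, status). The status strings are illustrative labels,
    not the project's actual enum values.
    """
    if not samples:
        raise ValueError("no measured samples")

    points = sorted(samples)
    sizes = [s for s, _ in points]
    i = bisect_left(sizes, size)

    # Exact match: return the measured value unchanged.
    if i < len(sizes) and sizes[i] == size:
        return points[i][1], "EXACT"

    # Outside the measured range: fall back to the nearest neighbor.
    if i == 0:
        return points[0][1], "EXTRAPOLATED"
    if i == len(sizes):
        return points[-1][1], "EXTRAPOLATED"

    # Inside the range: interpolate linearly between the bracketing points.
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    t = (size - x0) / (x1 - x0)
    return y0 + t * (y1 - y0), "INTERPOLATED"


# Hypothetical measurements: (token count, latency in microseconds).
measured = [(1024, 10.0), (2048, 20.0), (4096, 40.0)]
print(estimate_latency(measured, 3072))
```

A real data source would key on full operator shapes and dtypes rather than one scalar, so the bracketing step would need a distance measure over several dimensions.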
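
The metrics table reports M5 well above M3 in every scenario (for example 80.1% against 38.5% for Qwen3-32B decode). A count-based hit rate and a latency-weighted coverage can diverge like this when the few operators that hit the database also dominate the runtime. The sketch below shows that effect with made-up numbers. The definitions used here (M3 as hits divided by compute operators, M5 as the latency of hit operators divided by total latency) are assumptions inferred from the metric names, and `OpRecord`, `compute_hit_rate`, and `latency_weighted_coverage` are hypothetical names, not the project's API.

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    name: str
    latency_us: float
    hit: bool                 # True if a measured latency was found
    is_compute: bool = True   # False for communication operators


def compute_hit_rate(records):
    """Assumed M3-style metric: share of compute operators that hit."""
    compute = [r for r in records if r.is_compute]
    if not compute:
        return 0.0
    return sum(r.hit for r in compute) / len(compute)


def latency_weighted_coverage(records):
    """Assumed M5-style metric: share of total latency covered by hits."""
    total = sum(r.latency_us for r in records)
    if total == 0:
        return 0.0
    return sum(r.latency_us for r in records if r.hit) / total


# Made-up example: one heavy operator hits, two light ones miss.
records = [
    OpRecord("matmul", 900.0, hit=True),
    OpRecord("rms_norm", 50.0, hit=False),
    OpRecord("rope", 50.0, hit=False),
    OpRecord("all_reduce", 200.0, hit=True, is_compute=False),
]

print(f"hit rate: {compute_hit_rate(records):.1%}")
print(f"latency coverage: {latency_weighted_coverage(records):.1%}")
```

In this example only one of three compute operators hits, yet the hits account for most of the total latency, which is the same pattern the table shows.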