Last commit (last updated 1 month ago):

feat: profiling-based empirical performance model with CSV data source

Co-authored-by: Horacehxw<horacehxw@gmail.com>

# message auto-generated for no-merge-commit merge:
!123 merge pr/perf-db-a into develop

Created-by: Horacehxw
Commit-by: Horacehxw
Merged-by: ascend-robot
Description:

**PR Type**

- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [x] Refactor
- [ ] Perf
- [x] Test-Cases
- [ ] Other

## 🔍 Motivation

TensorCast's existing Roofline analytic model (AnalyticPerformanceModel) predicts Ascend NPU performance with limited accuracy: fused operators (SwiGlu, AddRmsNorm, DispatchFFNCombine) cannot be modeled, HCCL collective communication deviates significantly from theoretical bandwidth, and hardware characteristics such as the FRACTAL_NZ format cannot be captured by Roofline.

This PR implements a **measured operator performance estimation system** based on real NPU profiling data, feeding measured kernel latencies into the TensorCast simulation framework.

**Relation to PR#96**: PR#96, already merged into develop, defined the DataSourcePerformanceModel interface skeleton (stub) and the CLI integration. This PR provides the full implementation: the CSV query engine (9 TC-vs-NPU shape matching rules), the op_mapping mapping (60+ operators), interpolation, the M1-M6 metric system, and the DFC/FlashComm compile passes. The interface is fully compatible.

> 📌 The companion offline data collection toolchain will be submitted in a follow-up PR (tools/perf_data_collection/, no code dependency on this PR).

------

## 📝 Modification

### 1. Profiling Data Source core implementation (replaces the PR#96 stub)

| File | Description |
|------|------|
| profiling_database/profiling_data_source.py (+1,885) | ProfilingDataSource: CSV query engine driven by op_mapping.yaml; handles 9 kinds of TC-vs-NPU shape differences (batch dim stripping, seq padding, FRACTAL_NZ, ND transpose, SwiGlu concat, RoPE layout/kernel, composite decomposition, flatten batch) |
| profiling_database/interpolating_data_source.py (+702) | InterpolatingDataSource: nearest-neighbor + linear interpolation wrapper |
| profiling_database/data_source.py (modified) | DataSourcePerformanceModel ABC extension (adds EXTRAPOLATED enum, details field) |

### 2. EmpiricalPerformanceModel enhancements (+436)

Adds **M1-M6 metric tracking** on top of PR#96:

- M1-M4: coverage metrics (raw count → fused → compute-only → per-shape)
- M5: latency-weighted coverage
- M6 input: empirical hit total (used for offline E2E ratio computation)
- log_stats(): structured HIT/MISS logging
- export_hit_miss_report(): metric export in JSON format

### 3. Compile passes (+875)

| Pass | Description |
|------|------|
| dispatch_ffn_combine_pass.py | DispatchFFNCombine super-fusion (init_routing_v2 + GroupedMatmul + unpermute_tokens → single op), supporting 5 quantization variants |
| flashcomm_v1_pass.py | FlashComm V1 graph rewrite (matmul_all_reduce → communication hiding), matching vLLM-ascend ENABLE_FLASHCOMM1=1 |

### 4. op_mapping.yaml (3 versions, ~3,600 lines in total)

| Version | Operator count |
|------|:------:|
| vllm0.13.0_torch2.8.0_cann8.3 | ~45 |
| vllm0.15.0_torch2.9.0_cann8.5 | ~55 |
| vllm0.18.0_torch2.9.0_cann8.5 | ~60 |

### 5. CSV Profiling Data (~250 files, Git LFS)

ATLAS 800 A3 752T 128G device data: HCCL communication benchmarks + kernel data for 3 vLLM versions + supplementary micro-benchmark data.

### 6. Integration changes

| File | Change |
|------|------|
| model_runner.py | profiling mode integration (perf_models[] + log_stats + ProfilingDataSource creation) |
| user_config.py | --profiling-database argument |
| scripts/text_generate.py | --export-metrics CLI + FlashComm configuration |
| ops/fused_moe.py | new dispatch_ffn_combine op |
| compile_backend.py | registers the DFC + FlashComm passes |

------

## 📐 Associated Test Results

### Unit tests

```
$ pytest tests/perf_database/ -q
266 passed, 3 warnings in 1.94s

$ pytest tests/test_tensor_cast/test_empirical.py tests/test_tensor_cast/test_dfc_pass.py -q
8 passed, 1 skipped in 120.75s

$ lintrunner -a
ok No lint issues.
```

### Functional validation

```bash
# Analytic mode (behavior unchanged)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 2 --query-length 3500 --device TEST_DEVICE
→ [analytic] Execution time: 1.744s, TPS/Device: 4013 token/s ✅

# Profiling mode (new feature)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 1 --query-length 4112 --word-embedding-tp row \
    --device ATLAS_800_A3_752T_128G_DIE --world-size 16 --tp-size 16 \
    --quantize-linear-action DISABLED \
    --performance-model profiling --compile \
    --profiling-database tensor_cast/performance_model/profiling_database/data/ATLAS_800_A3_752T_128G_DIE/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5
→ [empirical] Execution time: 0.156s, TPS/Device: 1651 token/s ✅
```

### M1-M5 metrics

| Scenario | M3 (compute-op HR) | M5 (latency coverage) |
|------|:---------------:|:------------:|
| Qwen3-32B Prefill (BF16) | **61.5%** ✅ (>50%) | **89.0%** ✅ (>80%) |
| Qwen3-32B Decode (BF16) | 38.5% | **80.1%** ✅ (>80%) |
| DeepSeek-V3 Prefill (W8A8) | **52.6%** ✅ (>50%) | 68.9% |
| DeepSeek-V3 Decode (W8A8) | 15.8% | 54.3% |

------

## 🌟 Use cases (Optional)

```bash
# 1. Use measured data instead of the Roofline estimate
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>

# 2. Export the M1-M5 metrics as JSON (for offline M6 computation)
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir> \
    --export-metrics results/metrics.json

# 3. Run analytic + profiling together for comparison
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model analytic --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>
```

------

## ✅ Checklist

**Before PR**:

- [x] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix the potential lint issues.
- [x] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!123
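The gap between the M3 and M5 columns in the results of !123 (for example 38.5% versus 80.1% for Qwen3-32B Decode) is what one would expect if the operators with measured entries dominate runtime. The sketch below contrasts a count-based hit rate with a latency-weighted coverage. It is only an illustration under assumed definitions: `OpRecord`, its fields, and both functions are hypothetical names, not the code in EmpiricalPerformanceModel, and the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    """One simulated operator (hypothetical structure, for illustration only)."""
    name: str
    latency_us: float  # latency attributed to this operator in the simulation
    hit: bool          # True if a measured profiling entry was found


def count_hit_rate(ops):
    """Fraction of operators that have a measured entry (count-based)."""
    if not ops:
        return 0.0
    return sum(op.hit for op in ops) / len(ops)


def latency_weighted_coverage(ops):
    """Fraction of total latency contributed by operators with a measured entry."""
    total = sum(op.latency_us for op in ops)
    if total == 0:
        return 0.0
    return sum(op.latency_us for op in ops if op.hit) / total


# Made-up sample: two heavy operators hit, two light operators miss.
ops = [
    OpRecord("matmul", 800.0, True),
    OpRecord("attention", 150.0, True),
    OpRecord("reshape", 20.0, False),
    OpRecord("cast", 30.0, False),
]
print(f"count-based hit rate:      {count_hit_rate(ops):.1%}")
print(f"latency-weighted coverage: {latency_weighted_coverage(ops):.1%}")
```

With this sample, half of the operators hit, yet they account for 950 of the 1000 µs, so the latency-weighted figure is much higher than the count-based one.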
Last commit (last updated 25 days ago):

Update profiling op mapping skill docs

Co-authored-by: Secluded_Ocean<tangchuxiao0709@qq.com>

# message auto-generated for no-merge-commit merge:
!212 merge pr/glm5-op-mapping-skill-docs into develop

Created-by: Secluded_Ocean
Commit-by: Secluded_Ocean
Merged-by: ascend-robot
Description:

**PR Type**

- [ ] Feature
- [ ] Bugfix
- [x] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

This PR updates the profiling database op-mapping skill documentation.

During GLM5 profiling database expansion, several recurring issues were identified:

- Some TensorCast operators do not map to profiling CSV rows by direct tensor-shape matching.
- Semantic operators such as grouped MoE and LightningIndexer require explicit query-mode handling.
- Generated placeholder rows with empty or zero latency must not be treated as valid profiling data.
- Future op-mapping work needs clearer worker/verifier instructions to avoid incorrect mappings.

The goal of this PR is to document these lessons in the op-mapping skill so that future profiling database updates follow a clearer and safer workflow.

------

## 📝 Modification

This PR updates the op-mapping skill documents:

- docs/perf_database/skills/op-mapping/SKILL.md
- docs/perf_database/skills/op-mapping/single-op-worker-prompt.md
- docs/perf_database/skills/op-mapping/verifier-prompt.md

Main changes:

- Clarify when an operator needs a dedicated query_mode.
- Clarify that placeholder latency rows should not be used as measured profiling data.
- Strengthen the worker instructions for checking TensorCast op semantics, NPU kernel names, CSV shapes, and replay feasibility.
- Strengthen the verifier instructions for reviewing operator mapping quality and shape matching assumptions.

------

## 📐 Associated Test Results

This PR only updates documentation/prompt files. No runtime test is required.

Manual check:

```text
Reviewed the updated skill and prompt files for profiling database op-mapping workflow consistency.
```

------

## 🌟 Use cases (Optional)

Future profiling database contributors can use this skill to:

- Add or verify op mappings for new models.
- Decide whether a default compute lookup is enough or whether a dedicated query mode is required.
- Avoid treating shape-generated placeholder rows as real latency data.
- Review replay feasibility before adding generated CSV shapes.

------

## ✅ Checklist

**Before PR**:

- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [x] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!212
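The placeholder-row rule from !212 (rows with empty or zero latency must not count as measured data) could be enforced at load time roughly as follows. This is a minimal sketch, not the repository's loader: the function name, the `latency_us` column, and the sample CSV contents are all assumptions made for illustration.

```python
import csv
import io
import math


def load_measured_rows(csv_text, latency_column="latency_us"):
    """Return only rows whose latency is a finite, positive number.

    The column name is a hypothetical placeholder; adapt it to the real CSV schema.
    """
    measured = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(latency_column) or "").strip()
        try:
            latency = float(raw)
        except ValueError:
            continue  # empty or non-numeric latency: treat as a placeholder row
        if not math.isfinite(latency) or latency <= 0:
            continue  # zero, negative, NaN or inf latency is not a measurement
        measured.append({**row, latency_column: latency})
    return measured


# Made-up sample: one measured row, one empty placeholder, one zero placeholder.
sample = """kernel,shape,latency_us
MatMulV2,"4096,4096",812.5
MatMulV2,"8192,4096",
MatMulV2,"16384,4096",0
"""
rows = load_measured_rows(sample)
print(len(rows), rows[0]["shape"], rows[0]["latency_us"])
```

Filtering at load time keeps placeholder shapes out of both exact lookups and any interpolation built on top of them.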
Last commit (last updated 19 days ago):

[FIX][TEST] Fix broken README/doc links and run the full benchmark suite by default

Co-authored-by: liujiawang<anonymousdev@163.com>

# message auto-generated for no-merge-commit merge:
!331 merge fix into develop

Created-by: AvadaKedavrua
Commit-by: liujiawang
Merged-by: ascend-robot
Description:

## Reason for change

1. The official-account QR code in the community section of README.md points to an old path in the msinsight repository. The resource returns 404, so scanning and preview fail.
2. The relative path to the Op Mapping skill in OP_PLUGIN_MAPPING_TUTORIAL.md is wrong, so the in-document link fails.
3. The benchmark entry point runs only tests/benchmark/ops/ by default. The model regressions in tests/benchmark/models/ are silently skipped, leaving CI/nightly coverage insufficient.
4. With the full benchmark enabled, the qwen3-30b-a3b decode/prefill baselines no longer match the current compile output and need to be refreshed.

---

## Changes

| Category | File | Change |
|------|------|------|
| Doc links | README.md | Replace the official-account image URL with a working user-images resource; complete the TOC anchors for sections such as Contributions / Community |
| Doc links | docs/perf_database/tutorial/OP_PLUGIN_MAPPING_TUTORIAL.md | skill path ../skills/... → ../../../.agents/skills/op-mapping/SKILL.md |
| Benchmark default behavior | scripts/run_benchmark.sh, scripts/helpers/nightly/main.py | Remove the MSMODELING_BENCHMARK_MODELS switch; always run the whole tests/benchmark/ directory |
| Design doc | docs/design/ut_refactor.md | Sync the benchmark phase description |
| baseline | tests/benchmark/models/cases/qwen3-30b-a3b-{decode,prefill}.json | Refresh baseline_time_s and operator top-N |
| lint | experimental/optix/, scripts/, tensor_cast/, tests/, etc. | Add pylint: disable comments for inspect.* false positives |

---

## Self-verification

### README official-account image link

Purpose: confirm that the old link returns 404 and the new link is reachable.

Steps:

1. Check the HTTP status of the old URL
2. Check the HTTP status of the new URL

```bash
curl -sI "https://raw.gitcode.com/Ascend/msinsight/raw/master/docs/zh/user_guide/figures/readme/officialAccount.jpg" | head -1
curl -sI "https://raw.gitcode.com/user-images/assets/8428112/2a22a707-de26-4bb3-b312-4952035e021b/30be980e7fd65b2486d251b48a7999f3.jpg" | head -1
```

Result:

```text
HTTP/1.1 404 Not Found
HTTP/1.1 200 OK
```

### Op Mapping skill document path

Purpose: confirm that the link in the tutorial points to a real file.

Steps:

1. From the repository root, check that the skill file exists

```bash
test -f .agents/skills/op-mapping/SKILL.md && echo OK
```

Result:

```text
OK
```

### Benchmark entry point runs the full suite by default

Purpose: confirm that run_benchmark.sh no longer depends on MSMODELING_BENCHMARK_MODELS and covers the models subdirectory by default.

Steps:

1. Inspect the benchmark target configuration in the script

```bash
grep -n "TESTS_BENCHMARK" scripts/run_benchmark.sh
```

Result:

```text
run_pytest "${TESTS_BENCHMARK}/" \
```

### CI pipeline

Purpose: confirm that the change does not break the existing CI / docs CI.

Steps:

1. Check the CI label status of PR #331

Result: the PR is labeled ci-pipeline-passed and docs-ci-pipeline-success.

See merge request: Ascend/msmodeling!331
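The corrected skill link from !331 can be sanity-checked without a checkout by resolving the relative path against the tutorial's directory. The snippet below is a small illustrative check using only the two paths quoted in the change table; it resolves the path textually and does not verify that the file exists.

```python
import posixpath

# Directory of OP_PLUGIN_MAPPING_TUTORIAL.md, as listed in the change table.
tutorial_dir = "docs/perf_database/tutorial"
# New relative link target introduced by the fix.
link = "../../../.agents/skills/op-mapping/SKILL.md"

# Join and normalize: each ".." removes one directory level.
resolved = posixpath.normpath(posixpath.join(tutorial_dir, link))
print(resolved)  # .agents/skills/op-mapping/SKILL.md
```

Three `..` segments climb out of `docs/perf_database/tutorial`, so the link lands on the repository-root path that the `test -f` step checks.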