| File | Last commit | Last updated |
|---|---|---|
feat: profiling-based empirical performance model with CSV data source

Co-authored-by: Horacehxw<horacehxw@gmail.com>
# message auto-generated for no-merge-commit merge:
!123 merge pr/perf-db-a into develop
Created-by: Horacehxw
Commit-by: Horacehxw
Merged-by: ascend-robot
Description:

**PR Type**

- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [x] Refactor
- [ ] Perf
- [x] Test-Cases
- [ ] Other

## 🔍 Motivation

TensorCast's existing Roofline analytic model (`AnalyticPerformanceModel`) predicts Ascend NPU performance with limited accuracy: fused operators (SwiGlu, AddRmsNorm, DispatchFFNCombine) cannot be modeled, HCCL collective communication deviates significantly from theoretical bandwidth, and hardware characteristics such as the FRACTAL_NZ format cannot be captured by Roofline.

This PR implements a **measured operator performance estimation system** based on real NPU profiling data, feeding measured kernel latencies into the TensorCast simulation framework.

**Relationship to PR#96**: PR#96 is already merged into develop and defined the `DataSourcePerformanceModel` interface skeleton (stub) and the CLI integration. This PR provides the full implementation: a CSV query engine (9 TC-vs-NPU shape matching rules), the op_mapping (60+ operators), interpolation, the M1-M6 metric system, and the DFC/FlashComm compilation passes. The interface is fully compatible.

> 📌 The companion offline data collection toolchain will be submitted in a follow-up PR (`tools/perf_data_collection/`; it has no code dependency on this PR).

------

## 📝 Modification

### 1. Profiling Data Source core implementation (replaces the PR#96 stub)

| File | Description |
|------|------|
| `profiling_database/profiling_data_source.py` (+1,885) | `ProfilingDataSource`: CSV query engine driven by `op_mapping.yaml`; handles 9 kinds of TC-vs-NPU shape differences (batch dim stripping, seq padding, FRACTAL_NZ, ND transpose, SwiGlu concat, RoPE layout/kernel, composite decomposition, flatten batch) |
| `profiling_database/interpolating_data_source.py` (+702) | `InterpolatingDataSource`: nearest-neighbor + linear interpolation wrapper |
| `profiling_database/data_source.py` (modified) | `DataSourcePerformanceModel` ABC extension (adds the `EXTRAPOLATED` enum value and the `details` field) |

### 2. EmpiricalPerformanceModel enhancements (+436)

Adds **M1-M6 metric tracking** on top of PR#96:

- M1-M4: coverage metrics (raw count → fused → compute-only → per-shape)
- M5: latency-weighted coverage
- M6 input: empirical hit total (used for the offline E2E ratio calculation)
- `log_stats()`: structured HIT/MISS log
- `export_hit_miss_report()`: metric export in JSON format

### 3. Compilation passes (+875)

| Pass | Description |
|------|------|
| `dispatch_ffn_combine_pass.py` | DispatchFFNCombine super-fusion (init_routing_v2 + GroupedMatmul + unpermute_tokens → single op); supports 5 quantization variants |
| `flashcomm_v1_pass.py` | FlashComm V1 graph rewrite (matmul_all_reduce → communication hiding), matching vLLM-ascend `ENABLE_FLASHCOMM1=1` |

### 4. op_mapping.yaml (3 versions, ~3,600 lines in total)

| Version | Operators |
|------|:------:|
| vllm0.13.0_torch2.8.0_cann8.3 | ~45 |
| vllm0.15.0_torch2.9.0_cann8.5 | ~55 |
| vllm0.18.0_torch2.9.0_cann8.5 | ~60 |

### 5. CSV profiling data (~250 files, Git LFS)

Data for the ATLAS 800 A3 752T 128G device: HCCL communication benchmarks, kernel data for 3 vLLM versions, and supplementary micro-benchmark data.

### 6. Integration changes

| File | Change |
|------|------|
| `model_runner.py` | Profiling mode integration (`perf_models[]` + `log_stats` + `ProfilingDataSource` creation) |
| `user_config.py` | `--profiling-database` argument |
| `scripts/text_generate.py` | `--export-metrics` CLI option + FlashComm configuration |
| `ops/fused_moe.py` | Adds the `dispatch_ffn_combine` op |
| `compile_backend.py` | Registers the DFC + FlashComm passes |

------

## 📐 Associated Test Results

### Unit tests

```
$ pytest tests/perf_database/ -q
266 passed, 3 warnings in 1.94s

$ pytest tests/test_tensor_cast/test_empirical.py tests/test_tensor_cast/test_dfc_pass.py -q
8 passed, 1 skipped in 120.75s

$ lintrunner -a
ok No lint issues.
```

### Functional verification

```bash
# Analytic mode (behavior unchanged)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 2 --query-length 3500 --device TEST_DEVICE
→ [analytic] Execution time: 1.744s, TPS/Device: 4013 token/s ✅

# Profiling mode (new feature)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 1 --query-length 4112 --word-embedding-tp row \
    --device ATLAS_800_A3_752T_128G_DIE --world-size 16 --tp-size 16 \
    --quantize-linear-action DISABLED \
    --performance-model profiling --compile \
    --profiling-database tensor_cast/performance_model/profiling_database/data/ATLAS_800_A3_752T_128G_DIE/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5
→ [empirical] Execution time: 0.156s, TPS/Device: 1651 token/s ✅
```

### M1-M5 metrics

| Scenario | M3 (compute-op hit rate) | M5 (latency coverage) |
|------|:---------------:|:------------:|
| Qwen3-32B Prefill (BF16) | **61.5%** ✅ (>50%) | **89.0%** ✅ (>80%) |
| Qwen3-32B Decode (BF16) | 38.5% | **80.1%** ✅ (>80%) |
| DeepSeek-V3 Prefill (W8A8) | **52.6%** ✅ (>50%) | 68.9% |
| DeepSeek-V3 Decode (W8A8) | 15.8% | 54.3% |

------

## 🌟 Use cases (Optional)

```bash
# 1. Use measured data instead of the Roofline estimate
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>

# 2. Export the M1-M5 metrics as JSON (for the offline M6 calculation)
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir> \
    --export-metrics results/metrics.json

# 3. Run analytic + profiling side by side for comparison
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model analytic --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>
```

------

## ✅ Checklist

**Before PR**:

- [x] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix the potential lint issues.
- [x] Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!123

Last updated: 1 month ago
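The `InterpolatingDataSource` entry in this PR is described only as a "nearest-neighbor + linear interpolation" wrapper. As a rough illustration of that idea, the sketch below estimates a latency for an unseen size from a few profiled points. The function name `estimate_latency`, the single size key `m`, and the sample values are all hypothetical; the PR's actual lookup keys and fallback rules are not shown in the description.

```python
from bisect import bisect_left


def estimate_latency(samples, m):
    """Estimate a latency for size `m` from profiled (size, latency) pairs.

    `samples` must be sorted by size. Returns (latency, kind), where kind is
    "exact", "interpolated", or "nearest" (used outside the profiled range).
    This is an illustrative assumption, not the PR's implementation.
    """
    if not samples:
        raise ValueError("no profiled samples available")

    sizes = [size for size, _ in samples]
    i = bisect_left(sizes, m)

    # Exact hit on a profiled size.
    if i < len(sizes) and sizes[i] == m:
        return samples[i][1], "exact"

    # Outside the profiled range: fall back to the nearest endpoint.
    if i == 0:
        return samples[0][1], "nearest"
    if i == len(sizes):
        return samples[-1][1], "nearest"

    # Between two profiled sizes: interpolate linearly.
    (m0, t0), (m1, t1) = samples[i - 1], samples[i]
    weight = (m - m0) / (m1 - m0)
    return t0 + weight * (t1 - t0), "interpolated"


if __name__ == "__main__":
    # Hypothetical profiled points: (size, latency in microseconds).
    profiled = [(128, 10.0), (256, 18.0), (512, 40.0)]
    for query in (64, 192, 256, 1024):
        print(query, estimate_latency(profiled, query))
```

Returning the kind alongside the value keeps exact hits distinguishable from estimates, which is in the spirit of the `EXTRAPOLATED` enum value the PR mentions, though the mapping between the two is an assumption here.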
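[FEAT] Unified stderr logo for the MindStudio CLI

Co-authored-by: liujiawang<anonymousdev@163.com>
# message auto-generated for no-merge-commit merge:
!307 merge feat-logo into develop
Created-by: AvadaKedavrua
Commit-by: liujiawang
Merged-by: ascend-robot
Description:

## Reason for change

The Python CLIs of MindStudio Modeling show no unified brand mark at startup, so users switching between the simulation, throughput-optimization, adaptation, and profiling tools cannot easily tell from the first terminal screen which product they are running. After `parse_args` succeeds, this PR prints a fixed four-line MindStudio logo to stderr, with TTY/TERM fallback and Windows colorama console initialization.

---

## Changes

- Adds the shared module `cli/logo.py`: `render_logo` / `print_logo`, a 65-column block + terminal centering + ANSI/plain-text fallback
- Calls `print_logo()` after `parse_args` in 11 Python entry points (`cli/inference/*`, `serving_cast/main.py`, the `tools/perf_data_collection` driver scripts); the `--help` path prints no logo
- Dependencies:
  - **Runtime** `colorama>=0.4.6`: added to `[project] dependencies` and `requirements.txt`; on Windows, `just_fix_windows_console()` is called to enable console VT/ANSI output
  - **CI static checks** `types-colorama>=0.4.15`: added to `[dependency-groups] ci`; see the explanation below
- Detailed design document: `docs/design/mindstudio-brand-logo-design.md` (covers only the Python scope of this repository)
- Tests: `tests/regression/cli/test_logo.py` (14 module unit tests) + `tests/regression/cli/test_logo_cli_hooks.py` (help suppression and entry-point hook regression, in-process `run_module_main`)

### Why types-colorama is needed (CI group, not runtime)

`cli/logo.py` calls `colorama.just_fix_windows_console()` on the Windows path. The colorama package itself does not ship complete inline type annotations, so without a stub, mypy / the repository's type_check reports *Cannot find implementation or library stub for module named "colorama"* or treats the call as untyped.

types-colorama is a community-maintained **PEP 561 stub package** (`.pyi`) used only for type checking in development and CI. It does **not** enter the user's simulation runtime environment with a default `uv sync` (it lives in `dependency-groups.ci`, alongside tools such as pytest-cov).

The dependency is added in order to:

1. Let `uv sync --group ci` + mypy resolve the colorama API correctly and pass this PR's static-check gate, with **no need** for `# type: ignore` in business code to bypass the rules
2. Stay consistent with existing project practice: when a third-party library lacks types, add a `types-*` stub to the ci group rather than loosening the mypy configuration

If only the runtime dependencies are installed (`uv sync` / `pip install -r requirements.txt`), types-colorama is **not needed** and **not installed**; the logo feature depends only on the runtime colorama package.

---

## Self-verification

### Logo four-line block rendering (plain text / centered in 80 columns)

Purpose: confirm the fixed four-line layout, the centered brand line and slogan, and the absence of a leading blank line.

Steps:

1. From the repository root, run:

```bash
uv run python -c "from cli.logo import render_logo; print(render_logo(color=False, terminal_cols=80))"
```

Result: (image omitted)

### Logo module + CLI hook regression tests

Purpose: satisfy the CI gate's coverage requirement for the new `print_logo` path; confirm that `--help` does not leak the logo and that stderr contains the brand block after a normal `parse_args`.

Steps:

1. From the repository root, run:

```bash
uv run pytest tests/regression/cli/test_logo.py tests/regression/cli/test_logo_cli_hooks.py -v --tb=no
```

Result: (image omitted)

### --help prints no logo

Purpose: confirm that argparse exits before `print_logo`, keeping the help path clean.

Steps:

1. Run:

```bash
uv run python -m cli.inference.text_generate --help 2>&1 | head -5
```

Result: (image omitted)

### End to end

(image omitted)

See merge request: Ascend/msmodeling!307

Last updated: 23 days ago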
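The logo PR describes a fixed-width block that is centered in the terminal and falls back to plain text when color is not appropriate. The sketch below illustrates that general approach under stated assumptions: the names `supports_ansi`, `center_block`, `render_banner`, and `print_banner` are hypothetical (they are not the `render_logo` / `print_logo` functions in `cli/logo.py`), and the TTY/TERM check is only one plausible heuristic. The Windows colorama initialization mentioned in the PR is left out.

```python
import os
import shutil
import sys

BLOCK_COLS = 65  # block width quoted in the PR description


def supports_ansi(stream):
    """Heuristic (an assumption): color only on a TTY whose TERM is not 'dumb'."""
    is_tty = hasattr(stream, "isatty") and stream.isatty()
    term = os.environ.get("TERM", "")
    return is_tty and term not in ("", "dumb")


def center_block(lines, terminal_cols, block_cols=BLOCK_COLS):
    """Center each line inside the block, then center the block in the terminal."""
    # A terminal narrower than the block gets no left padding at all.
    pad = max((terminal_cols - block_cols) // 2, 0)
    return [(" " * pad + line.center(block_cols)).rstrip() for line in lines]


def render_banner(lines, color, terminal_cols):
    """Return the banner as one string, wrapped in ANSI codes only if `color`."""
    body = "\n".join(center_block(lines, terminal_cols))
    if color:
        return "\033[1;36m" + body + "\033[0m"  # bold cyan, then reset
    return body


def print_banner(lines, stream=None):
    """Write the banner to stderr by default so stdout stays machine-readable."""
    stream = sys.stderr if stream is None else stream
    cols = shutil.get_terminal_size(fallback=(80, 24)).columns
    print(render_banner(lines, supports_ansi(stream), cols), file=stream)


if __name__ == "__main__":
    print_banner(["Example Product", "an example slogan"])
```

Writing to stderr rather than stdout means a piped or redirected stdout never contains the banner, which matches the PR's choice of stream.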
chore(ci): adopt pre-commit and retire legacy lintrunner adapters

Co-authored-by: liujiawang<anonymousdev@163.com>
# message auto-generated for no-merge-commit merge:
!176 merge pre-commit into develop
Created-by: AvadaKedavrua
Commit-by: liujiawang;AvadaKedavrua
Merged-by: ascend-robot
Description:

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

**PR Type**

- [ ] Feature
- [ ] Bugfix
- [x] Docs
- [x] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

------

## Motivation

Continue the **pre-commit** migration: tighten **Pylint** so only high-signal messages run (`disable=all` + explicit enable list), fix real issues that remained under that profile, and translate hook/config comments to **English**.

------

## Configuration changes (tooling & comments only)

| Path | What changed |
|------|----------------|
| `pre-commit/pyproject.toml` | **Pylint:** `[tool.pylint."messages control"]` with `disable = ["all"]` and a short **allowlist** of message IDs (E0100, E0601–E0611, E0632, E1101, E1120, W0632, W1514). **Ruff:** unchanged behavior; comments translated to English. **Bandit:** comments translated; rule allowlist/skip lists unchanged. |
| `.pre-commit-config.yaml` | Comments translated to English; Bandit hook display name set to **bandit (Python security checks)**. Hook versions and args unchanged except for comment text. |

------

## Source code changes (application code)

| Area | Files | Purpose |
|------|--------|---------|
| serving_cast | `communication.py`, `engine.py`, `instance.py`, `kv_cache_manager.py`, `load_gen.py`, `main.py`, `model_runner.py`, `request.py`, `serving.py`, `utils.py` | Replace `from . import stime` with `import serving_cast.stime as stime` so Pylint resolves imports (fixes **E0611**). |
| serving_cast | `stime.py` | Singleton **salabim** Environment via `_get_sim_env()` so type checkers/Pylint see **sim.Environment** (fixes **E1101** on SimulationEnv). |
| serving_cast/service | `base_throughput_optimizer.py` | `__init__` defaults + `assert runner is not None` before `run_inference` (fixes **E1101** on base class). |
| tensor_cast | `diffusers/diffusers_model.py`, `diffusers/diffusers_utils.py`, `runtime.py` | Add **encoding="utf-8"** to `open()` / trace export (fixes **W1514**). |
| web_ui | `callbacks.py` | **refresh_optimizer_detail:** call `_optimizer_detail_view(rows, None, device)` and unpack five return values (fixes **E1120**). |

------

## Recent commits on pre-commit branch

- ci(pre-commit): fix pylint message selection with disable=all
- fix: resolve pylint findings in serving_cast, tensor_cast, and web_ui
- docs(pre-commit): translate comments to English and add all-files run log

------

## Checklist

- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!176

Last updated: 1 month ago
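The `stime.py` change in this PR is summarized as exposing a single simulation environment through a getter so that static tools see a concrete type. A minimal sketch of that lazy-singleton pattern follows. The `Environment` class here is a stand-in written for this example (it is not the salabim class), and `get_sim_env` / `now` are hypothetical names rather than the module's real API.

```python
from typing import Optional


class Environment:
    """Stand-in for a simulation environment; NOT the salabim class."""

    def __init__(self) -> None:
        self.now = 0.0  # simulated clock


# Module-level cache; stays None until the environment is first requested.
_env: Optional[Environment] = None


def get_sim_env() -> Environment:
    """Create the environment on first use and return the same object afterwards.

    The explicit return annotation gives linters and type checkers a concrete
    type to resolve attributes against, instead of an opaque module attribute.
    """
    global _env
    if _env is None:
        _env = Environment()
    return _env


def now() -> float:
    """Convenience accessor that always goes through the shared environment."""
    return get_sim_env().now


if __name__ == "__main__":
    get_sim_env().now = 5.0
    print(now(), get_sim_env() is get_sim_env())
```

Callers would then import the module by its absolute path (the PR uses `import serving_cast.stime as stime`) and go through the accessor functions rather than reaching for module state directly.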