| File | Last commit | Last updated |
|---|---|---|
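feat(comm-bench): add multi-node inter-pod data collection to comm_bench

Co-authored-by: Hudingyi<305949481@qq.com>

# message auto-generated for no-merge-commit merge:
!223 merge feat/comm-bench-multinode into develop

feat(comm-bench): add multi-node inter-pod data collection to comm_bench

Created-by: Hudingyi
Commit-by: Hudingyi
Merged-by: ascend-robot

Description:

## Background

GLM5 adaptation and the inter-pod simulation scenario need communication reference data for topology_tier=0, but the existing run_comm_bench.sh only supports single-node collection. This PR adds a multi-node (inter-pod) collection path to the same entry script and commits a set of v8.5 inter-pod data collected on two ATLAS_800_A3 nodes (32 dies).

**The single-node path is completely unchanged**: the script is reordered so that the single-node branch comes first and the multi-node branch second, and the two are kept mutually exclusive by the guard [ "${NNODES:-1}" -lt 2 ] and its -ge 2 counterpart. The multi-node branch is entered only when the caller explicitly sets NNODES>=2. It launches a multi-node session through torchrun --nnodes/--node_rank/--master_addr/--master_port and forces --topology-tier 0, so that 32 dies are not misclassified as a tier-1 group under grid_shape=[48,8,2].

## Change list

| File | Change |
|---|---|
| tools/perf_data_collection/comm_bench/run_comm_bench.sh | Single-node block guarded by NNODES<2 and placed first; NNODES>=2 multi-node block appended at the end. The internal single-node logic is untouched |
| tensor_cast/performance_model/profiling_database/data/ATLAS_800_A3_752T_128G_DIE/hccl/v8.5/hcom_{allReduce,allGather,reduceScatter,alltoallv}_.csv | +23 rows per file (topology_tier=0, num_devices=32, dtype=DT_BF16); managed by git-lfs, so the diff shows only pointer changes |
| tests/tools/test_generate_comm_microbench.py | +1 case (TestRunCommBenchShellMultiNode::test_multinode_dispatches_inter_pod_torchrun) that verifies only the multi-node dispatch contract. Existing cases are untouched |

## Usage

### Single-node usage (unchanged)

```bash
bash run_comm_bench.sh [OUTPUT_DIR] # default output: ./hccl_bench_data/
```

### Multi-node usage (new path)

Run the same command on every node; only NODE_RANK differs:

```bash
# Master node:
NNODES=2 NODE_RANK=0 MASTER_ADDR=<master-node-IP> bash run_comm_bench.sh
# Worker node:
NNODES=2 NODE_RANK=1 MASTER_ADDR=<master-node-IP> bash run_comm_bench.sh
```

Optional environment variables: NPROC (default 16), MASTER_PORT (default 29700, +1 per round), QUICK=1 (5-point sanity grid).

## Data collection method

The inter-pod data was collected on 2 ATLAS_800_A3 machines (world_size=32, tier=0) and covers 23 power-of-2 message sizes (128B ~ 512MB). Two rounds were collected independently and then aggregated by taking min(Duration) per (msg_bytes, num_devices, tier), which suppresses the random jitter that small messages show during the HCCL JIT warm-up phase.

## Test conclusions

T1 bash syntax check

- bash -n run_comm_bench.sh ✅ syntax OK

T2 unit tests

- pytest tests/tools/test_generate_comm_microbench.py ✅ 34 passed (1 new case + 33 existing cases; 6 NPU-marked cases deselected)

New case: TestRunCommBenchShellMultiNode

- Verifies that torchrun receives --nnodes=2 / --node_rank=1 / --master_addr when NNODES>=2
- Verifies that the Python invocation includes --topology-tier 0
- Verifies that --num-devices = NNODES * NPROC = 32

T3 manual dry-run of both dispatch paths (torchrun stub)

- Single-node mode (NNODES unset): ✅ 12 torchrun sessions (4 nd × 3 rounds)
- Multi-node mode (NNODES=2): ✅ 3 torchrun sessions (3 rounds) ✅ includes --nnodes=2 / --node_rank=1 / --master_addr ✅ includes --topology-tier 0 / --num-devices 32

Verification on real NPU hardware ✅

```bash
# Master node:
NNODES=2 NODE_RANK=0 MASTER_ADDR=<ip> QUICK=1 bash run_comm_bench.sh ./_pr_smoke
# Worker node:
NNODES=2 NODE_RANK=1 MASTER_ADDR=<ip> QUICK=1 bash run_comm_bench.sh ./_pr_smoke
```

See the attachment for the test data.

See merge request: Ascend/msmodeling!223

Last updated: 17 days ago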
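The aggregation step described for !223 (two independent rounds, then min(Duration) per (msg_bytes, num_devices, tier)) can be sketched roughly as follows. This is only an illustration under assumptions: the dictionary keys, the `duration_us` field name, and the sample values are hypothetical and are not taken from the repository's CSV schema or from measured data.

```python
# A minimal sketch, assuming each round is a list of dict rows.
# Field names and sample durations below are hypothetical.

# 23 power-of-2 message sizes, 128 B up to 512 MiB (2**7 .. 2**29).
MESSAGE_SIZES = [2 ** k for k in range(7, 30)]


def aggregate_min_duration(rounds):
    """Merge several rounds, keeping the smallest duration per key.

    The key is (msg_bytes, num_devices, tier), as in the description.
    """
    best = {}
    for rows in rounds:
        for row in rows:
            key = (row["msg_bytes"], row["num_devices"], row["tier"])
            duration = row["duration_us"]
            if key not in best or duration < best[key]:
                best[key] = duration
    return best


# Illustrative values only, not measured results.
round_1 = [
    {"msg_bytes": 128, "num_devices": 32, "tier": 0, "duration_us": 41.0},
    {"msg_bytes": 256, "num_devices": 32, "tier": 0, "duration_us": 39.5},
]
round_2 = [
    {"msg_bytes": 128, "num_devices": 32, "tier": 0, "duration_us": 37.2},
    {"msg_bytes": 256, "num_devices": 32, "tier": 0, "duration_us": 40.1},
]
merged = aggregate_min_duration([round_1, round_2])
```

Taking the minimum rather than the mean means a single slow warm-up sample in one round cannot inflate the stored reference value, which matches the stated goal of suppressing jitter for small messages.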
feat: profiling-based empirical performance model with CSV data source

Co-authored-by: Horacehxw<horacehxw@gmail.com>

# message auto-generated for no-merge-commit merge:
!123 merge pr/perf-db-a into develop

feat: profiling-based empirical performance model with CSV data source

Created-by: Horacehxw
Commit-by: Horacehxw
Merged-by: ascend-robot

Description:

**PR Type**

- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [x] Refactor
- [ ] Perf
- [x] Test-Cases
- [ ] Other

## 🔍 Motivation

TensorCast's existing Roofline analytic model (AnalyticPerformanceModel) predicts Ascend NPU performance with limited accuracy: fused operators (SwiGlu, AddRmsNorm, DispatchFFNCombine) cannot be modeled, HCCL collective communication deviates significantly from theoretical bandwidth, and hardware characteristics such as the FRACTAL_NZ format cannot be captured by Roofline.

This PR implements a **measured operator performance estimation system** based on real NPU profiling data, feeding measured kernel durations into the TensorCast simulation framework.

**Relationship to PR#96**: PR#96 has already been merged into develop and defined the DataSourcePerformanceModel interface skeleton (stub) and the CLI integration. This PR provides the full implementation: the CSV query engine (9 TC-vs-NPU shape matching rules), the op_mapping mapping (60+ operators), interpolation, the M1-M6 metric system, and the DFC/FlashComm compilation passes. The interface is fully compatible.

> 📌 The accompanying offline data collection toolchain will be submitted in a follow-up PR (tools/perf_data_collection/, with no code dependency on this PR).

------

## 📝 Modification

### 1. Profiling Data Source core implementation (replaces the PR#96 stub)

| File | Description |
|------|------|
| profiling_database/profiling_data_source.py (+1,885) | ProfilingDataSource: CSV query engine driven by op_mapping.yaml; handles 9 kinds of TC-vs-NPU shape differences (batch dim stripping, seq padding, FRACTAL_NZ, ND transpose, SwiGlu concat, RoPE layout/kernel, composite decomposition, flatten batch) |
| profiling_database/interpolating_data_source.py (+702) | InterpolatingDataSource: nearest-neighbor + linear interpolation wrapper |
| profiling_database/data_source.py (modified) | DataSourcePerformanceModel ABC extended (new EXTRAPOLATED enum, details field) |

### 2. EmpiricalPerformanceModel enhancements (+436)

Adds **M1-M6 metric tracking** on top of PR#96:

- M1-M4: coverage metrics (raw count → fused → compute-only → per-shape)
- M5: latency-weighted coverage
- M6 input: empirical hit total (used for the offline E2E ratio computation)
- log_stats(): structured HIT/MISS logging
- export_hit_miss_report(): metric export in JSON format

### 3. Compilation passes (+875)

| Pass | Description |
|------|------|
| dispatch_ffn_combine_pass.py | DispatchFFNCombine super-fusion (init_routing_v2 + GroupedMatmul + unpermute_tokens → single op); supports 5 quantization variants |
| flashcomm_v1_pass.py | FlashComm V1 graph rewrite (matmul_all_reduce → communication hiding), matching vLLM-ascend ENABLE_FLASHCOMM1=1 |

### 4. op_mapping.yaml (3 versions, ~3,600 lines in total)

| Version | Operators |
|------|:------:|
| vllm0.13.0_torch2.8.0_cann8.3 | ~45 |
| vllm0.15.0_torch2.9.0_cann8.5 | ~55 |
| vllm0.18.0_torch2.9.0_cann8.5 | ~60 |

### 5. CSV Profiling Data (~250 files, Git LFS)

Data for the ATLAS 800 A3 752T 128G device: HCCL communication benchmarks + kernel data for 3 vLLM versions + supplementary micro-benchmark data.

### 6. Integration changes

| File | Change |
|------|------|
| model_runner.py | Profiling mode integration (perf_models[] + log_stats + ProfilingDataSource creation) |
| user_config.py | --profiling-database argument |
| scripts/text_generate.py | --export-metrics CLI + FlashComm configuration |
| ops/fused_moe.py | New dispatch_ffn_combine op |
| compile_backend.py | Registers the DFC + FlashComm passes |

------

## 📐 Associated Test Results

### Unit tests

```bash
$ pytest tests/perf_database/ -q
266 passed, 3 warnings in 1.94s
$ pytest tests/test_tensor_cast/test_empirical.py tests/test_tensor_cast/test_dfc_pass.py -q
8 passed, 1 skipped in 120.75s
$ lintrunner -a
ok No lint issues.
```

### Functional verification

```bash
# Analytic mode (behavior unchanged)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 2 --query-length 3500 --device TEST_DEVICE
→ [analytic] Execution time: 1.744s, TPS/Device: 4013 token/s ✅

# Profiling mode (new feature)
$ python -m tensor_cast.scripts.text_generate Qwen/Qwen3-32B \
    --num-queries 1 --query-length 4112 --word-embedding-tp row \
    --device ATLAS_800_A3_752T_128G_DIE --world-size 16 --tp-size 16 \
    --quantize-linear-action DISABLED \
    --performance-model profiling --compile \
    --profiling-database tensor_cast/performance_model/profiling_database/data/ATLAS_800_A3_752T_128G_DIE/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5
→ [empirical] Execution time: 0.156s, TPS/Device: 1651 token/s ✅
```

### M1-M5 metrics

| Scenario | M3 (compute-operator HR) | M5 (latency coverage) |
|------|:---------------:|:------------:|
| Qwen3-32B Prefill (BF16) | **61.5%** ✅ (>50%) | **89.0%** ✅ (>80%) |
| Qwen3-32B Decode (BF16) | 38.5% | **80.1%** ✅ (>80%) |
| DeepSeek-V3 Prefill (W8A8) | **52.6%** ✅ (>50%) | 68.9% |
| DeepSeek-V3 Decode (W8A8) | 15.8% | 54.3% |

------

## 🌟 Use cases (Optional)

```bash
# 1. Use measured data instead of the Roofline estimate
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>

# 2. Export the M1-M5 metrics as JSON (for offline M6 computation)
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model profiling --compile \
    --profiling-database <path_to_data_dir> \
    --export-metrics results/metrics.json

# 3. Run analytic + profiling together for comparison
python -m tensor_cast.scripts.text_generate <model_id> \
    --performance-model analytic --performance-model profiling --compile \
    --profiling-database <path_to_data_dir>
```

------

## ✅ Checklist

**Before PR**:

- [x] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix the potential lint issues.
- [x] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!123

Last updated: 1 month ago
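The description of !123 characterizes InterpolatingDataSource as a "nearest-neighbor + linear interpolation wrapper". The sketch below shows one way such a lookup could behave. It is not the repository's implementation: the function name, the `(size, latency)` sample layout, and the `INTERPOLATED` label are assumptions made for illustration. Only `HIT` and `EXTRAPOLATED` appear as terms in the description itself.

```python
from bisect import bisect_left


def estimate_latency(samples, size):
    """Estimate latency for `size` from measured (size, latency) samples.

    `samples` must be non-empty and sorted by size (hypothetical layout).
    Returns (latency, status):
      - exact match              -> "HIT"
      - between two samples      -> linear interpolation, "INTERPOLATED"
      - outside the sampled span -> nearest neighbor, "EXTRAPOLATED"
    """
    sizes = [s for s, _ in samples]
    i = bisect_left(sizes, size)

    # Exact measured point.
    if i < len(sizes) and sizes[i] == size:
        return samples[i][1], "HIT"

    # Below or above the measured range: fall back to the nearest sample.
    if i == 0:
        return samples[0][1], "EXTRAPOLATED"
    if i == len(sizes):
        return samples[-1][1], "EXTRAPOLATED"

    # Strictly between two neighbors: interpolate linearly.
    (x0, y0), (x1, y1) = samples[i - 1], samples[i]
    t = (size - x0) / (x1 - x0)
    return y0 + t * (y1 - y0), "INTERPOLATED"


# Illustrative samples only, not profiling data.
demo_samples = [(1024, 10.0), (2048, 20.0)]
demo_result = estimate_latency(demo_samples, 1536)
```

Returning a status next to the value lets a caller count exact hits separately from estimated ones, which is the kind of distinction the M1-M6 coverage metrics would need.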
chore(ci): adopt pre-commit and retire legacy lintrunner adapters

Co-authored-by: liujiawang<anonymousdev@163.com>

# message auto-generated for no-merge-commit merge:
!176 merge pre-commit into develop

chore(ci): adopt pre-commit and retire legacy lintrunner adapters

Created-by: AvadaKedavrua
Commit-by: liujiawang;AvadaKedavrua
Merged-by: ascend-robot

Description:

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

**PR Type**

- [ ] Feature
- [ ] Bugfix
- [x] Docs
- [x] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

------

## Motivation

Continue the **pre-commit** migration: tighten **Pylint** so only high-signal messages run (disable=all + explicit enable list), fix real issues that remained under that profile, and translate hook/config comments to **English**.

------

## Configuration changes (tooling & comments only)

| Path | What changed |
|------|----------------|
| pre-commit/pyproject.toml | **Pylint:** [tool.pylint."messages control"] with disable = ["all"] and a short **allowlist** of message IDs (E0100, E0601–E0611, E0632, E1101, E1120, W0632, W1514). **Ruff:** unchanged behavior; comments translated to English. **Bandit:** comments translated; rule allowlist/skip lists unchanged. |
| .pre-commit-config.yaml | Comments translated to English; Bandit hook display name set to **bandit (Python security checks)**. Hook versions and args unchanged except for comment text. |

------

## Source code changes (application code)

| Area | Files | Purpose |
|------|--------|---------|
| serving_cast | communication.py, engine.py, instance.py, kv_cache_manager.py, load_gen.py, main.py, model_runner.py, request.py, serving.py, utils.py | Replace from . import stime with import serving_cast.stime as stime so Pylint resolves imports (fixes **E0611**). |
| serving_cast | stime.py | Singleton **salabim** Environment via _get_sim_env() so type checkers/Pylint see **sim.Environment** (fixes **E1101** on SimulationEnv). |
| serving_cast/service | base_throughput_optimizer.py | __init__ defaults + assert runner is not None before run_inference (fixes **E1101** on base class). |
| tensor_cast | diffusers/diffusers_model.py, diffusers/diffusers_utils.py, runtime.py | Add **encoding="utf-8"** to open() / trace export (fixes **W1514**). |
| web_ui | callbacks.py | **refresh_optimizer_detail:** call _optimizer_detail_view(rows, None, device) and unpack five return values (fixes **E1120**). |

------

## Recent commits on pre-commit branch

- ci(pre-commit): fix pylint message selection with disable=all
- fix: resolve pylint findings in serving_cast, tensor_cast, and web_ui
- docs(pre-commit): translate comments to English and add all-files run log

------

## Checklist

- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!176

Last updated: 1 month ago
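One of the fixes listed for !176 is adding encoding="utf-8" to open() calls to satisfy Pylint's W1514 (unspecified-encoding). A minimal sketch of that pattern follows. The helper names `write_text_utf8` and `read_text_utf8` and the sample text are hypothetical and do not come from the repository.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text with an explicit encoding (what W1514 asks for)."""
    # Without encoding=..., open() falls back to the locale's default
    # encoding, which can differ between machines.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text_utf8(path):
    """Read text back with the same explicit encoding."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Round-trip a string containing non-ASCII characters.
sample_text = "trace: 延迟 12.5 µs"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "trace.txt")
    write_text_utf8(sample_path, sample_text)
    round_trip = read_text_utf8(sample_path)
```

Pinning the encoding at both the write and the read site keeps files with non-ASCII content portable across hosts with different locale settings.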