DrivingSDK/mx_driving/tools/profile_analyse · Ascend/DrivingSDK - AtomGit

ascend-robot feat: [perf-predict] add 910C dual-die compute handling

File	Last commit	Last updated
profile_analyse.py	feat: [perf-predict] add 910C dual-die compute handling
Co-authored-by: captain_jing<jingshenjin@huawei.com>
# message auto-generated for no-merge-commit merge: !2081 merge master into master
Created-by: captain_jing
Commit-by: captain_jing
Merged-by: ascend-robot
Description:
## What this PR does / why we need it?
- Fix the compute prediction logic for a 910C dual-die baseline. When the baseline is 910C, the compute part is first converted from 910C to 910B (equivalent to using 910B as the "compute reference") and then converted from 910B to the target chip. Communication and FreeTime are still estimated with the 910C interconnect bandwidth and CPU coefficient. This avoids the error caused by treating the 910C as a single die and scaling it linearly.
- Performance prediction uses the last step (and aligns the communication matrix). kernel_details.csv and step_trace_time.csv are both selected by the last step, and the step used for the communication_matrix.json statistics is aligned with it, so that mixing several steps no longer inflates the result.
- The source argument accepts suffix-style short names, and the documentation is updated. Short forms such as 910B/910C/950PR/PPU1.5/H200/H20 are supported, which avoids passing full names containing spaces on the CLI. The README examples are updated accordingly.
## Does this PR introduce any user-facing change?
Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.
## How was this patch tested?
Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.
See merge request: Ascend/DrivingSDK!2081	7 days ago
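The three changes described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in profile_analyse.py: the function names, the suffix-matching rule, the `Step` column name, and every numeric factor are assumptions made up for the example. Only the device names and the 910C → 910B → target ordering come from the text above.

```python
# Hypothetical sketch: names and factors are placeholders, not values from the tool.
KNOWN_DEVICES = [
    "Huawei Ascend 910B",
    "Huawei Ascend 910C",
    "Huawei Ascend 950PR",
    "Alibaba PPU1.5",
    "NVIDIA H200",
    "NVIDIA H20",
]

# Assumed factor: compute time on 910B = compute time on 910C * C_TO_B_FACTOR.
C_TO_B_FACTOR = 2.0
# Assumed factors: compute time on target = compute time on 910B * factor.
B_TO_TARGET_FACTOR = {
    "Huawei Ascend 910B": 1.0,
    "NVIDIA H200": 0.25,
}


def resolve_device(name):
    """Map a short name such as '910C' to a full device name (assumed rule:
    case-insensitive match on the full name or on its last word)."""
    key = name.strip().lower()
    matches = [d for d in KNOWN_DEVICES
               if d.lower() == key or d.lower().endswith(" " + key)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous device: {name!r}")
    return matches[0]


def predict_compute_us(baseline, target, baseline_compute_us):
    """Two-stage conversion: fold a 910C baseline back to 910B first,
    then scale from 910B to the target chip."""
    baseline = resolve_device(baseline)
    target = resolve_device(target)
    if baseline == "Huawei Ascend 910C":
        time_on_910b = baseline_compute_us * C_TO_B_FACTOR
    elif baseline == "Huawei Ascend 910B":
        time_on_910b = baseline_compute_us
    else:
        raise NotImplementedError("this sketch only covers 910B/910C baselines")
    return time_on_910b * B_TO_TARGET_FACTOR[target]


def last_step_rows(rows, step_key="Step"):
    """Keep only the rows of the highest step id, so several steps are not
    summed together. The 'Step' column name is an assumption."""
    if not rows:
        return []
    last = max(int(r[step_key]) for r in rows)
    return [r for r in rows if int(r[step_key]) == last]


if __name__ == "__main__":
    print(resolve_device("910C"))
    print(predict_compute_us("910C", "H200", 100.0))
    print(last_step_rows([{"Step": "1"}, {"Step": "2"}, {"Step": "2"}]))
```

Routing the 910C baseline through a separate 910C → 910B factor keeps the dual-die adjustment in one place. The 910B → target table then never has to model the 910C as a single die.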
profiling_common.py	add skills of profile analyse and performance prediction
Co-authored-by: captain_jing<jingshenjin@huawei.com>
# message auto-generated for no-merge-commit merge: !2071 merge master into master
Created-by: captain_jing
Commit-by: captain_jing
Merged-by: ascend-robot
Description:
## What this PR does / why we need it?
- Provide automatic profiling analysis: output a summary of Step Trace time shares and fluctuation, the compute and memory-access shares of the top 20 operators, and whole-network summaries of compute vs. memory access and cube vs. vector.
- Enhance cross-chip performance prediction: convert by compute throughput and bandwidth based on kernel_details, and combine it with Step Trace (Computing / non-overlapped communication / FreeTime) to give an overall performance estimate and a communication breakdown.
- Add the Skills documentation and directory (migrated to .agent/skills), and extract the logic shared by the tool scripts for reuse.
## Does this PR introduce any user-facing change?
Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.
## How was this patch tested?
> Please invoke the ascend-profiling-analysis skill to analyze the data in D:\profile\gr00t-n1.6

## 📊 Section 1: Step Trace analysis
#### Per-card time breakdown (unit: us)
| Card ID | Computing | Comm (Not Overlapped) | FreeTime |
| ------- | ----------- | --------------------- | ----------- |
| 3719121 | 979775.29 | 149468.23 | 217707.86 |
| 3719122 | 979431.24 | 74561.34 | 289013.28 |
| 3719123 | 978215.92 | 174677.31 | 191793.32 |
| 3719124 | 978196.84 | 162114.22 | 198694.20 |
| 3719125 | 979913.30 | 169451.06 | 197771.84 |
| 3719126 | 975832.20 | 174683.25 | 194576.49 |
| 3719127 | 976537.90 | 161350.04 | 211175.86 |
| 3719128 | 979242.05 | 157755.80 | 209752.07 |

#### Data analysis
| Metric | Time share | Fluctuation (min~max) |
| --------------------- | -------- | -------------------- |
| Computing | 72.73% | 72.39% ~ 73.05% |
| Comm (Not Overlapped) | 11.37% | 5.55% ~ 12.99% |
| FreeTime | 15.89% | 14.26% ~ 21.52% |

📊 Section 2: Kernel Details analysis (top 20 operators)
| Operator | Compute type | Calls | Total time (us) | Compute time (us) | Memory-access time (us) | Compute share (%) | Memory-access share (%) |
| ------------------------ | -------- | -------- | ---------- | ----------- | ----------- | ----------- | ----------- |
| MatMulV3 | cube | 715 | 412771.13 | 299131.43 | 334346.26 | 72.47 | 81.00 |
| MatMulV2 | cube | 330 | 79014.07 | 61684.22 | 72430.32 | 78.07 | 91.67 |
| FlashAttentionScore | cube | 75 | 72966.88 | 10795.21 | 54609.67 | 14.79 | 74.84 |
| Transpose | vector | 1424 | 52269.11 | 2495.50 | 21322.84 | 4.77 | 40.79 |
| LayerNormV3 | vector | 122 | 36653.91 | 25690.44 | 4722.75 | 70.09 | 12.88 |
| Add | vector | 411 | 35240.58 | 8151.78 | 26756.06 | 23.13 | 75.92 |
| FlashAttentionScoreGrad | cube | 36 | 29015.22 | 2768.09 | 13555.86 | 9.54 | 46.72 |
| Gelu | vector | 60 | 25053.26 | 23578.78 | 14476.24 | 94.11 | 57.78 |
| UpsampleBilinear2dAA | cube | 160 | 24667.07 | 382.39 | 4563.16 | 1.55 | 18.50 |
| Index | vector | 7 | 19249.84 | 0.06 | 728.44 | 0.00 | 3.78 |
| Mul | vector | 153 | 18957.28 | 3939.50 | 14735.29 | 20.78 | 77.73 |
| ReduceSum | vector | 321 | 11583.47 | 3028.94 | 2584.02 | 26.15 | 22.31 |
| ConcatD | vector | 125 | 11097.76 | 258.39 | 3964.33 | 2.33 | 35.72 |
| RmsNorm | vector | 65 | 9359.13 | 6234.44 | 2621.00 | 66.61 | 28.00 |
| TransData | vector | 42 | 9056.48 | 827.88 | 7130.83 | 9.14 | 78.74 |
| DropOutDoMask | vector | 256 | 8501.12 | 2339.97 | 2804.41 | 27.53 | 32.99 |
| GeluGrad | vector | 32 | 8305.12 | 6544.20 | 2724.75 | 78.80 | 32.81 |
| LayerNormGradV3 | vector | 66 | 7808.67 | 2896.44 | 2994.11 | 37.09 | 38.34 |
| Cast | vector | 1209 | 7459.66 | 530.80 | 3578.90 | 7.12 | 47.98 |
| RotaryPositionEmbedding | vector | 32 | 6566.85 | 3623.23 | 4293.35 | 55.17 | 65.38 |

📊 Section 3: Whole-network compute vs. memory-access ratio
`Total Duration of compute operators: 950,226.88 us
├── Compute vs. memory access
│   ├── Compute time: 482,631.58 us ── 43.20%
│   └── Memory-access time: 634,681.02 us ── 56.80%
│
└── Cube vs. Vector
    ├── Cube time: 616,188.65 us ── 64.85%
    └── Vector time: 334,038.23 us ── 35.15%`

📊 Relative performance by chip (baseline = 1.0x)
| Device | Architecture | Computing(us) | Comm Not Overlapped(us) | FreeTime(us) | Total Time(us) | Speedup | Communication details |
| --------------------- | ---- | ------------- | ---------------------- | ------------ | ------------- | ------- | --------------------------------- |
| NVIDIA H200 | GPU | 307653.41 | 114941.87 | 164469.70 | 587064.98 | 2.29 | comm=29373.60us, wait=85568.27us |
| Huawei Ascend 950DT | NPU | 466010.74 | 103192.43 | 164469.70 | 733672.87 | 1.83 | comm=17624.16us, wait=85568.27us |
| Huawei Ascend 910C | NPU | 489196.55 | 119287.96 | 178175.51 | 786660.02 | 1.71 | comm=33719.69us, wait=85568.27us |
| Alibaba PPU1.5 | GPU | 627551.06 | 123334.32 | 178175.51 | 929060.90 | 1.45 | comm=37766.06us, wait=85568.27us |
| NVIDIA H20 | GPU | 849208.62 | 114941.87 | 178175.51 | 1142326.00 | 1.18 | comm=29373.60us, wait=85568.27us |
| Huawei Ascend 950PR | NPU | 841506.34 | 118613.57 | 178175.51 | 1138295.42 | 1.18 | comm=33045.30us, wait=85568.27us |
| Huawei Ascend 910B (baseline) | NPU | 978393.09 | 153007.66 | 213810.61 | 1345211.36 | 1.00 | comm=67439.39us, wait=85568.27us |

See merge request: Ascend/DrivingSDK!2071	17 days ago
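The Step Trace shares and the Speedup column quoted in this commit message can be reproduced from the per-card figures with a few lines of arithmetic. The sketch below is only a cross-check under stated assumptions. It assumes the overall share is the sum of a metric over all cards divided by the sum of all three metrics, that the fluctuation range is the min and max of the per-card shares, and that Speedup is the baseline Total Time divided by the target Total Time. The function names are hypothetical and are not taken from profiling_common.py.

```python
# Per-card (Computing, Comm Not Overlapped, FreeTime) in us, copied from the
# per-card table in the commit message above.
CARDS = {
    3719121: (979775.29, 149468.23, 217707.86),
    3719122: (979431.24, 74561.34, 289013.28),
    3719123: (978215.92, 174677.31, 191793.32),
    3719124: (978196.84, 162114.22, 198694.20),
    3719125: (979913.30, 169451.06, 197771.84),
    3719126: (975832.20, 174683.25, 194576.49),
    3719127: (976537.90, 161350.04, 211175.86),
    3719128: (979242.05, 157755.80, 209752.07),
}
METRICS = ("Computing", "Comm (Not Overlapped)", "FreeTime")


def time_shares(cards):
    """Return {metric: (overall %, min per-card %, max per-card %)}."""
    grand_total = sum(sum(values) for values in cards.values())
    shares = {}
    for i, name in enumerate(METRICS):
        overall = sum(values[i] for values in cards.values()) / grand_total * 100
        per_card = [values[i] / sum(values) * 100 for values in cards.values()]
        shares[name] = (overall, min(per_card), max(per_card))
    return shares


def mean_total_us(cards):
    """Average per-card total time; this matches the baseline Total Time row."""
    return sum(sum(values) for values in cards.values()) / len(cards)


def speedup(baseline_total_us, target_total_us):
    """Assumed definition: baseline total time divided by target total time."""
    return baseline_total_us / target_total_us


if __name__ == "__main__":
    for name, (overall, low, high) in time_shares(CARDS).items():
        print(f"{name}: {overall:.2f}% ({low:.2f}% ~ {high:.2f}%)")
    # 587064.98 us is the NVIDIA H200 Total Time from the prediction table.
    print(f"H200 speedup: {speedup(mean_total_us(CARDS), 587064.98):.2f}")
```

The same check shows that each Total Time in the prediction table is the sum of its Computing, Comm Not Overlapped, and FreeTime columns, and that Comm Not Overlapped is the sum of the comm and wait values in the last column.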