| File | Last commit | Last updated |
|---|---|---|
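feat: [perf-predict] add 910C dual-die compute handling

Co-authored-by: captain_jing<jingshenjin@huawei.com>

# message auto-generated for no-merge-commit merge:
!2081 merge master into master

feat: [perf-predict] add 910C dual-die compute handling

Created-by: captain_jing
Commit-by: captain_jing
Merged-by: ascend-robot
Description:

## What this PR does / why we need it?

- Corrects the compute prediction logic for a 910C dual-die baseline. When the baseline is 910C, the compute portion is first converted from 910C to 910B (equivalently, 910B serves as the "compute reference") and then scaled from 910B to the target chip. Communication and FreeTime are still estimated from the 910C interconnect bandwidth and CPU coefficient. This avoids the error introduced by treating 910C as a single die and scaling it linearly.
- Performance prediction uses the last step (and aligns the communication matrix). `kernel_details.csv` and `step_trace_time.csv` are both restricted to the last step, and the step used for the `communication_matrix.json` statistics is aligned with it, so that mixing several steps no longer inflates the result.
- The `source` argument accepts short aliases, and the documentation is updated. Short forms such as 910B/910C/950PR/PPU1.5/H200/H20 are supported, so the CLI no longer needs full names that contain spaces. The README examples are updated to match.

## Does this PR introduce any user-facing change?

Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.

## How was this patch tested?

Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.

See merge request: Ascend/DrivingSDK!2081

Last updated: 7 days ago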
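The last-step selection described in this commit could look roughly like the sketch below. It is an illustration only, not the code from the merge request: the helper name `select_last_step`, the `Step Id` column name, and the sample rows are all assumptions made for the example.

```python
import csv
import io


def select_last_step(rows, step_column="Step Id"):
    """Return (rows belonging to the highest step id, that step id).

    If no row carries a numeric step id, all rows are returned unchanged
    together with None, since there is nothing to filter on.
    """
    def step_of(row):
        value = (row.get(step_column) or "").strip()
        return int(value) if value.isdigit() else None

    steps = [s for s in map(step_of, rows) if s is not None]
    if not steps:
        return list(rows), None
    last = max(steps)
    return [row for row in rows if step_of(row) == last], last


# Tiny in-memory stand-in for kernel_details.csv (column names and values
# are hypothetical and only serve the example).
SAMPLE = """Step Id,Name,Duration(us)
3,MatMulV3,10.0
4,MatMulV3,11.0
4,Add,2.5
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
last_rows, last_step = select_last_step(rows)
print(last_step, [r["Name"] for r in last_rows])
```

Applying the same step id to every input file is what keeps the kernel, step-trace, and communication statistics comparable. Summing over all recorded steps would scale the totals by the number of steps.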
add skills of profile analyse and performance prediction

Co-authored-by: captain_jing<jingshenjin@huawei.com>

# message auto-generated for no-merge-commit merge:
!2071 merge master into master

add skills of profile analyse and performance prediction

Created-by: captain_jing
Commit-by: captain_jing
Merged-by: ascend-robot
Description:

## What this PR does / why we need it?

- Provides automatic profiling analysis: outputs a summary of Step Trace time shares and fluctuation, the compute and memory-access shares of the Top 20 operators, and network-wide summaries of compute vs. memory access and cube vs. vector.
- Enhances cross-chip performance prediction: converts compute throughput and bandwidth based on kernel_details, and combines the result with Step Trace (Computing / non-overlapped communication / FreeTime) to produce an overall performance estimate with a communication breakdown.
- Adds the Skills documentation and directory (migrated to `.agent/skills`), and extracts the shared logic of the tool scripts for reuse.

## Does this PR introduce any user-facing change?

Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.

## How was this patch tested?

> Please invoke the ascend-profiling-analysis skill to analyze the data in D:\profile\gr00t-n1.6

## 📊 Section 1: Step Trace analysis

#### Per-card time breakdown (unit: us)

| Card | Computing | Comm (Not Overlapped) | FreeTime |
| ------- | ----------- | --------------------- | ----------- |
| 3719121 | 979775.29 | 149468.23 | 217707.86 |
| 3719122 | 979431.24 | 74561.34 | 289013.28 |
| 3719123 | 978215.92 | 174677.31 | 191793.32 |
| 3719124 | 978196.84 | 162114.22 | 198694.20 |
| 3719125 | 979913.30 | 169451.06 | 197771.84 |
| 3719126 | 975832.20 | 174683.25 | 194576.49 |
| 3719127 | 976537.90 | 161350.04 | 211175.86 |
| 3719128 | 979242.05 | 157755.80 | 209752.07 |

#### Data analysis

| Metric | Share of time | Range (min ~ max) |
| --------------------- | -------- | -------------------- |
| Computing | 72.73% | 72.39% ~ 73.05% |
| Comm (Not Overlapped) | 11.37% | 5.55% ~ 12.99% |
| FreeTime | 15.89% | 14.26% ~ 21.52% |

## 📊 Section 2: Kernel Details analysis (Top 20 operators)

| Operator | Compute type | Calls | Total time (us) | Compute time (us) | Memory-access time (us) | Compute share (%) | Memory-access share (%) |
| ------------------------ | -------- | -------- | ---------- | ----------- | ----------- | ----------- | ----------- |
| MatMulV3 | cube | 715 | 412771.13 | 299131.43 | 334346.26 | 72.47 | 81.00 |
| MatMulV2 | cube | 330 | 79014.07 | 61684.22 | 72430.32 | 78.07 | 91.67 |
| FlashAttentionScore | cube | 75 | 72966.88 | 10795.21 | 54609.67 | 14.79 | 74.84 |
| Transpose | vector | 1424 | 52269.11 | 2495.50 | 21322.84 | 4.77 | 40.79 |
| LayerNormV3 | vector | 122 | 36653.91 | 25690.44 | 4722.75 | 70.09 | 12.88 |
| Add | vector | 411 | 35240.58 | 8151.78 | 26756.06 | 23.13 | 75.92 |
| FlashAttentionScoreGrad | cube | 36 | 29015.22 | 2768.09 | 13555.86 | 9.54 | 46.72 |
| Gelu | vector | 60 | 25053.26 | 23578.78 | 14476.24 | 94.11 | 57.78 |
| UpsampleBilinear2dAA | cube | 160 | 24667.07 | 382.39 | 4563.16 | 1.55 | 18.50 |
| Index | vector | 7 | 19249.84 | 0.06 | 728.44 | 0.00 | 3.78 |
| Mul | vector | 153 | 18957.28 | 3939.50 | 14735.29 | 20.78 | 77.73 |
| ReduceSum | vector | 321 | 11583.47 | 3028.94 | 2584.02 | 26.15 | 22.31 |
| ConcatD | vector | 125 | 11097.76 | 258.39 | 3964.33 | 2.33 | 35.72 |
| RmsNorm | vector | 65 | 9359.13 | 6234.44 | 2621.00 | 66.61 | 28.00 |
| TransData | vector | 42 | 9056.48 | 827.88 | 7130.83 | 9.14 | 78.74 |
| DropOutDoMask | vector | 256 | 8501.12 | 2339.97 | 2804.41 | 27.53 | 32.99 |
| GeluGrad | vector | 32 | 8305.12 | 6544.20 | 2724.75 | 78.80 | 32.81 |
| LayerNormGradV3 | vector | 66 | 7808.67 | 2896.44 | 2994.11 | 37.09 | 38.34 |
| Cast | vector | 1209 | 7459.66 | 530.80 | 3578.90 | 7.12 | 47.98 |
| RotaryPositionEmbedding | vector | 32 | 6566.85 | 3623.23 | 4293.35 | 55.17 | 65.38 |

## 📊 Section 3: Network-wide compute vs. memory-access ratio

Total Duration of compute-type operators: 950,226.88 us

- Compute vs. memory access
  - Compute time: 482,631.58 us (43.20%)
  - Memory-access time: 634,681.02 us (56.80%)
- Cube vs. Vector
  - Cube time: 616,188.65 us (64.85%)
  - Vector time: 334,038.23 us (35.15%)

## 📊 Relative performance per chip (baseline = 1.0x)

| Device | Architecture | Computing (us) | Comm Not Overlapped (us) | FreeTime (us) | Total Time (us) | Speedup | Communication details |
| --------------------- | ---- | ------------- | ---------------------- | ------------ | ------------- | ------- | --------------------------------- |
| NVIDIA H200 | GPU | 307653.41 | 114941.87 | 164469.70 | 587064.98 | 2.29 | comm=29373.60us, wait=85568.27us |
| Huawei Ascend 950DT | NPU | 466010.74 | 103192.43 | 164469.70 | 733672.87 | 1.83 | comm=17624.16us, wait=85568.27us |
| Huawei Ascend 910C | NPU | 489196.55 | 119287.96 | 178175.51 | 786660.02 | 1.71 | comm=33719.69us, wait=85568.27us |
| Alibaba PPU1.5 | GPU | 627551.06 | 123334.32 | 178175.51 | 929060.90 | 1.45 | comm=37766.06us, wait=85568.27us |
| NVIDIA H20 | GPU | 849208.62 | 114941.87 | 178175.51 | 1142326.00 | 1.18 | comm=29373.60us, wait=85568.27us |
| Huawei Ascend 950PR | NPU | 841506.34 | 118613.57 | 178175.51 | 1138295.42 | 1.18 | comm=33045.30us, wait=85568.27us |
| Huawei Ascend 910B (baseline) | NPU | 978393.09 | 153007.66 | 213810.61 | 1345211.36 | 1.00 | comm=67439.39us, wait=85568.27us |

See merge request: Ascend/DrivingSDK!2071

Last updated: 17 days ago
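The figures in the comparison table are mutually consistent in a simple way, which the sketch below reproduces for two rows. In the reported numbers, non-overlapped communication equals the communication part plus the wait part, total time equals Computing plus non-overlapped communication plus FreeTime, and Speedup equals the baseline total divided by the target total. The `Estimate` class and `speedup` function are hypothetical names introduced for illustration. They show how the published values relate, not how the prediction tool computes them.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """One row of the comparison table; all times are in microseconds."""
    computing_us: float
    comm_transfer_us: float   # "comm" part of the communication details
    comm_wait_us: float       # "wait" part of the communication details
    free_us: float

    @property
    def comm_not_overlapped_us(self) -> float:
        return self.comm_transfer_us + self.comm_wait_us

    @property
    def total_us(self) -> float:
        return self.computing_us + self.comm_not_overlapped_us + self.free_us


def speedup(baseline: Estimate, target: Estimate) -> float:
    """Ratio of baseline total time to target total time."""
    return baseline.total_us / target.total_us


# Values copied from the table rows for 910B (baseline) and H200.
baseline_910b = Estimate(978393.09, 67439.39, 85568.27, 213810.61)
h200 = Estimate(307653.41, 29373.60, 85568.27, 164469.70)

print(f"910B total: {baseline_910b.total_us:.2f} us")
print(f"H200 total: {h200.total_us:.2f} us")
print(f"H200 speedup: {speedup(baseline_910b, h200):.2f}x")
```

The wait component is identical across devices in the table (85568.27 us), so only the communication part changes from row to row.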