| File | Last commit | Last updated |
|---|---|---|
feat(minimax_m3) + docs(scan): MiniMax-M3 model + harden gemma4_31b_it / llada2_moe / minimax_m27

Co-authored-by: leedongkun30-arch <lee.dongkun30@gmail.com>

Message auto-generated for no-merge-commit merge: !244 merge scan/docs-hardening-merged into master

Created-by: leedongkun30-arch
Commit-by: leedongkun30-arch
Merged-by: cann-robot

Description:

## What

A combined contribution (merges #239): adds the self-contained **minimax_m3** model integration, and applies the pypto-fused-op-integration scan checklist (cf. #236) to the three already-merged integrations, **gemma4_31b_it**, **llada2_moe**, and **minimax_m27**. The scan hardens docs and structure, adds an opt-in **NPUGraph-capture E2E bench arm**, keeps kernel microbenches local-only, and makes the grouped-GEMM kernels **CANN 9.1.0 / NPUGraph-safe**.

## minimax_m3 (new model)

Self-contained MiniMax-M3 text-backbone (mirrors the MiniMax-M2.7 layout): in-repo model definition (no `trust_remote_code`), PyPTO MoE grouped-GEMM, PyPTO MSA decode kernels, a single-die generate-based E2E bench, and operator tests.

- **Model wiring**: `MiniMaxM3ForCausalLM` / `MiniMaxM3Config` in-repo; the loader handles the VL checkpoint's `language_model.` prefix and FP8 weight-only expert tensors. Ported modeling/configuration files keep the original copyright + Apache-2.0 + a Huawei modification NOTICE (no Huawei copyright claim).
- **MoE grouped-GEMM**: one kernel replaces the per-expert FFN loop (H=6144, I=3072, E=128, top-4).
- **MSA decode**: lightning-indexer block selection + block-sparse decode; the short-context guard matches the in-repo `_msa_decode_block_table` path.

## Scan areas (gemma4_31b_it / llada2_moe / minimax_m27)

1. **Env versions**: each `modeling/transformers/<model>/README.md` carries a full env table: torch_npu 2.10 · transformers 4.57.1 (gemma4: 5.12) · pypto 0.2.1 · pto-isa v9.1.0 · CANN 9.1.0 · Ascend 910B3.
2. **Benchmark reproduce commands**: E2E is the single-die `--graph [--use_pypto]` arm, measured at the **max layers that fit one die** (max-fit).
3. **Folder / archive mapping**: repo-layout maps, the two missing gemma4 per-op kernel READMEs, and llada2 signature/line fixes. ops READMEs trimmed to the skill template (phi-level brevity).

## CANN 9.1.0 / NPUGraph fix (minimax_m3 + minimax_m27 + llada2_moe kernels)

The grouped-GEMM kernels read `expert_cumsum` on the host to bound each expert's token slice. CANN 9.1.0 op-smoke requires it to be declared via `ready_on_host_tensors` (older toolkits auto-inferred it, so the merged models would now abort with the default `block_table` PARAM_CHECK). The declaration is added. Because that host read is capture-hostile, the `--graph` static-route bench builds `expert_cumsum` **once outside the captured region** (a reused persistent tensor); otherwise the replay aborts with aicore 507011. Verified on CANN 9.1.0 / pypto 0.2.1.

## Cleanup: kernel microbenches local-only

Mirrors the repo convention (phi): no committed result JSONs (`bench_baseline.json` / `bench_pypto.json`) and no in-repo kernel-microbench scripts; the `bench_*.sh` still regenerates reports at runtime, and kernel correctness stays in `tests/ops/<model>/`.

## Kernel + E2E (single-die max-fit; averaged, not best-of; with warm-up)

E2E runs real `model.generate()`, counts generated tokens, and averages over runs (mean±std); graph arms capture once and replay per token. Kernel = operator microbench (WARMUP=5 / ITERS=20, best=min(20)). Env: CANN 9.1.0, torch_npu 2.10, pypto 0.2.1, pto-isa v9.1.0, Ascend 910B3. Intra-machine ratios only.

| model | Eager (Kernel) | PyPTO (Kernel) | NPU-friendly (E2E) | PyPTO+graph (E2E) | Kernel× | E2E× |
|---|---:|---:|---:|---:|---:|---:|
| minimax_m3 | 108959 µs | **28384 µs** | 112.2 (vec+graph) | 105.6 (pypto+graph) | **3.84×** | 0.94× |
| minimax_m27 | 58136 µs | **7659 µs** | 31.5 (vec+graph) | 30.9 (pypto+graph) | **7.59×** | 0.98× |
| llada2_moe | 13306 µs | **969 µs** | 212.8 (graph) | 205.8 (pypto+graph) | **13.74×** | 0.97× |
| gemma4_31b_it | 18042 µs | 20527 µs | 15.44 (graph) | 15.5 (pypto+graph) | **0.88×** | 1.00× |

Conditions:

- m3: Kernel E=256 N=2048 H=6144 I=3072; E2E 1-die LAYERS=5 W=32 decode
- m27: Kernel E=256 N=2048 H=3072 I=1536; E2E 1-die LAYERS=32 W=32 decode
- llada2: Kernel E=64 N=128 H=2048 I=512; E2E 1-die LAYERS=20 (full) W128 steps32 decode (denoising-step)
- gemma4: Kernel decode Sq=1 Skv=65536 GQA vs lossless full-KV 16-head; E2E 1-die LAYERS=48 ctx256 gen128 decode

**gemma4 is dense**: there is no MoE grouped_gemm to fuse, so pypto+graph ≈ graph. The PyPTO grouped-GEMM win is concentrated in the sparse-MoE kernels; on full-network E2E the NPU-friendly arm (vectorized FFN + NPUGraph) already removes launch/host overhead, so E2E ≈ 1.0×.

## Bench / skill updates

- **NPUGraph-capture arm (`--graph`)** on each model (single-die capture/replay; static routing made capturable). Capture gotchas: SDPA fused kernel side-stream → `attn_implementation="eager"`; mask/rotary host syncs → prebuilt 0-mask + precomputed cos/sin; MoE argsort → static routing.
- **pypto-convert-model** skill extended from format conversion to also run an HF model E2E on Ascend NPU and measure the three arms (eager / NPUGraph / PyPTO), with capture-safe primitives + the gotcha table.

## Testing

- `TILE_FWK_DEVICE_ID=0 python -m pytest tests/ops/minimax_m3 -q` → grouped-GEMM precision passes; large MSA sparse-decode / HF-attention integration cases are skipped in CI smoke (they need a full free die).
- m27 / llada2 op-smoke pass; m27 pypto+graph replays cleanly.

## Checklist

- [x] Code follows style guide
- [x] Tests added and passed
- [x] Docs updated
- [x] No secrets hardcoded

See merge request: cann/pypto-gym!244

Last updated: 8 days ago
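
The merge request above says the grouped-GEMM kernels use `expert_cumsum` to bound each expert's token slice. The following is a minimal pure-Python sketch of that idea, not the PyPTO kernel itself. It assumes tokens are already sorted by expert and that `expert_cumsum` holds inclusive running token counts (the real tensor's convention is not stated in the description). All function names are hypothetical and chosen for illustration only.

```python
from itertools import accumulate


def build_expert_cumsum(expert_ids, num_experts):
    """Running token counts per expert (assumed inclusive-end convention)."""
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1
    return list(accumulate(counts))


def matmul(a, b):
    """Plain (m, k) x (k, n) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm_reference(tokens, weights, expert_cumsum):
    """One pass over contiguous token slices; slice e is multiplied by weights[e]."""
    out = []
    start = 0
    for expert, end in enumerate(expert_cumsum):
        if end > start:  # experts with no routed tokens are skipped
            out.extend(matmul(tokens[start:end], weights[expert]))
        start = end
    return out


def per_token_reference(tokens, weights, expert_ids):
    """Naive baseline: each token is multiplied by its own expert's weight."""
    return [matmul([tok], weights[e])[0] for tok, e in zip(tokens, expert_ids)]


# Toy problem: 6 tokens, hidden size 2, output size 3, 4 experts (expert 1 is empty).
expert_ids = [0, 0, 2, 2, 2, 3]  # assumed to be sorted by expert already
tokens = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
weights = [[[e + 1, 0, 1], [0, e + 1, 1]] for e in range(4)]

expert_cumsum = build_expert_cumsum(expert_ids, num_experts=4)
grouped = grouped_gemm_reference(tokens, weights, expert_cumsum)
looped = per_token_reference(tokens, weights, expert_ids)
```

In this sketch `expert_cumsum` is the only routing-dependent input to the grouped path, which is why the description's `--graph` arm can build it once outside the captured region when routing is static.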
fix(Operator):Adjust pypto tensor directory

Co-authored-by: huangyuqian <huangyuqian2@huawei.com>

Message auto-generated for no-merge-commit merge: !321 merge master into master

Created-by: huangyuqian
Commit-by: huangyuqian
Merged-by: cann-robot

Description:

## Description

<!-- What does this PR do and why -->

## Change Type

- [ ] Bug Fix
- [ ] New Feature
- [ ] Performance
- [ ] Refactoring
- [ ] Documentation
- [ ] Test
- [ ] Other

## Related Issues

<!-- Closes #000 to auto-close -->

- Closes #
- References #

## Testing

<!-- Brief test description or key results -->

- [ ] UT passed
- [ ] ST passed
- [ ] Manual verified

## Checklist

- [ ] Code follows style guide
- [ ] Tests added and passed
- [ ] Docs updated if needed
- [ ] No secrets hardcoded
- [ ] Commit message follows convention

See merge request: cann/pypto-gym!321

Last updated: 5 days ago