pypto-gym/tests/ops/minimax_m3 · CANN/pypto-gym - AtomGit

cann-robot · fix(Operator): Adjust pypto tensor directory

File	Last commit	Last updated
test_cases.json	feat(minimax_m3) + docs(scan): MiniMax-M3 model + harden gemma4_31b_it / llada2_moe / minimax_m27

Co-authored-by: leedongkun30-arch<lee.dongkun30@gmail.com>
# message auto-generated for no-merge-commit merge: !244 merge scan/docs-hardening-merged into master
Created-by: leedongkun30-arch
Commit-by: leedongkun30-arch
Merged-by: cann-robot

Description:

## What

A combined contribution (merges #239): it adds the self-contained minimax_m3 model integration and applies the `pypto-fused-op-integration` scan checklist (cf. #236) to the three already-merged integrations (gemma4_31b_it, llada2_moe, minimax_m27). The scan hardens docs and structure, adds an opt-in NPUGraph-capture E2E bench arm, keeps kernel microbenches local-only, and makes the grouped-GEMM kernels CANN 9.1.0 / NPUGraph-safe.

## minimax_m3 (new model)

Self-contained MiniMax-M3 text backbone (mirrors the MiniMax-M2.7 layout): in-repo model definition (no `trust_remote_code`), PyPTO MoE grouped-GEMM, PyPTO MSA decode kernels, a single-die generate-based E2E bench, and operator tests.

- Model wiring: `MiniMaxM3ForCausalLM` / `MiniMaxM3Config` in-repo; the loader handles the VL checkpoint's `language_model.` prefix and FP8 weight-only expert tensors. Ported modeling/configuration files keep the original copyright + Apache-2.0 + a Huawei modification NOTICE (no Huawei copyright claim).
- MoE grouped-GEMM: one kernel replaces the per-expert FFN loop (H=6144, I=3072, E=128, top-4).
- MSA decode: lightning-indexer block selection + block-sparse decode; the short-context guard matches the in-repo `_msa_decode_block_table` path.

## Scan areas (gemma4_31b_it / llada2_moe / minimax_m27)

1. Env versions: each `modeling/transformers/<model>/README.md` carries a full env table: torch_npu 2.10 · transformers 4.57.1 (gemma4: 5.12) · pypto 0.2.1 · pto-isa v9.1.0 · CANN 9.1.0 · Ascend 910B3.
2. Benchmark reproduce commands: E2E is the single-die `--graph [--use_pypto]` arm, measured at the maximum number of layers that fit on one die (max-fit).
3. Folder / archive mapping: repo-layout maps, the two missing gemma4 per-op kernel READMEs, and llada2 signature/line fixes. ops READMEs are trimmed to the skill template (phi-level brevity).

## CANN 9.1.0 / NPUGraph fix (minimax_m3 + minimax_m27 + llada2_moe kernels)

The grouped-GEMM kernels read `expert_cumsum` on the host to bound each expert's token slice. CANN 9.1.0 op-smoke requires it to be declared via `ready_on_host_tensors` (older toolkits auto-inferred it, so the merged models would now abort with the default `block_table` PARAM_CHECK). The declaration is added. Because that host read is capture-hostile, the `--graph` static-route bench builds `expert_cumsum` once outside the captured region (as a reused persistent tensor); otherwise the replay aborts with aicore 507011. Verified on CANN 9.1.0 / pypto 0.2.1.

## Cleanup: kernel microbenches local-only

Mirrors the repo convention (phi): no committed result JSONs (`bench_baseline.json` / `bench_pypto.json`) and no in-repo kernel-microbench scripts; the `bench_.sh` still regenerates reports at runtime, and kernel correctness stays in `tests/ops/<model>/`.

## Kernel + E2E (single-die max-fit; averaged, not best-of; with warm-up)

E2E runs real `model.generate()`, counts generated tokens, and averages over runs (mean±std); graph arms capture once and replay per token. Kernel = operator microbench (WARMUP=5 / ITERS=20, best=min(20)). Env: CANN 9.1.0, torch_npu 2.10, pypto 0.2.1, pto-isa v9.1.0, Ascend 910B3. Intra-machine ratios only.
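The measurement protocol described above (kernel microbench with WARMUP=5 / ITERS=20 reporting best=min(20); E2E averaged over runs as mean±std) can be sketched in plain Python. This is a minimal host-side illustration, not the repository's bench script: the function names `microbench`, `summarize_e2e`, and `speedup` are hypothetical, and a real NPU measurement would additionally need to synchronize the device before each clock read, which is omitted here.

```python
import statistics
import time


def microbench(fn, warmup=5, iters=20):
    """Time fn() and return the best (minimum) latency in microseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return min(samples)


def summarize_e2e(throughputs):
    """Average end-to-end throughput over runs as (mean, sample std)."""
    mean = statistics.fmean(throughputs)
    std = statistics.stdev(throughputs) if len(throughputs) > 1 else 0.0
    return mean, std


def speedup(baseline_us, candidate_us):
    """Kernel speedup: baseline latency divided by candidate latency."""
    return baseline_us / candidate_us
```

For example, `speedup(108959, 28384)` evaluates to roughly 3.84, which matches the Kernel× figure reported for minimax_m3 in this merge request.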

| model | Eager (Kernel) | PyPTO (Kernel) | NPU-friendly (E2E) | PyPTO+graph (E2E) | Kernel× | E2E× |
|---|---:|---:|---:|---:|---:|---:|
| minimax_m3 | 108959 µs | 28384 µs* | 112.2 (vec+graph) | 105.6 (pypto+graph) | 3.84× | 0.94× |
| minimax_m27 | 58136 µs | 7659 µs | 31.5 (vec+graph) | 30.9 (pypto+graph) | 7.59× | 0.98× |
| llada2_moe | 13306 µs | 969 µs | 212.8 (graph) | 205.8 (pypto+graph) | 13.74× | 0.97× |
| gemma4_31b_it | 18042 µs | 20527 µs | 15.44 (graph) | 15.5 (pypto+graph) | 0.88× | 1.00× |

Conditions:

- m3: K E=256 N=2048 H=6144 I=3072 / E2E 1-die LAYERS=5 W=32 decode
- m27: K E=256 N=2048 H=3072 I=1536 / E2E 1-die LAYERS=32 W=32 decode
- llada2: K E=64 N=128 H=2048 I=512 / E2E 1-die LAYERS=20(full) W128 steps32 decode (denoising-step)
- gemma4: K decode Sq=1 Skv=65536 GQA vs lossless full-KV 16-head / E2E 1-die LAYERS=48 ctx256 gen128 decode

gemma4 is dense: there is no MoE `grouped_gemm` to fuse, so pypto+graph ≈ graph. The PyPTO grouped-GEMM win is concentrated in the sparse-MoE kernels; on full-network E2E the NPU-friendly arm (vectorized FFN + NPUGraph) already removes launch/host overhead, so E2E ≈ 1.0×.

## Bench / skill updates

- NPUGraph-capture arm (`--graph`) on each model (single-die capture/replay; static routing made capturable). Capture gotchas: SDPA fused kernel side-stream → `attn_implementation="eager"`; mask/rotary host syncs → prebuilt 0-mask + precomputed cos/sin; MoE argsort → static routing.
- `pypto-convert-model` skill extended from format conversion to also run an HF model E2E on Ascend NPU and measure the three arms (eager / NPUGraph / PyPTO), with capture-safe primitives + the gotcha table.

## Testing

- `TILE_FWK_DEVICE_ID=0 python -m pytest tests/ops/minimax_m3 -q` → grouped-GEMM precision passes; large MSA sparse-decode / HF-attention integration cases are skipped in CI smoke (they need a full free die).
- m27 / llada2 op-smoke pass; m27 pypto+graph replays cleanly.

## Checklist

- [x] Code follows style guide
- [x] Tests added and passed
- [x] Docs updated
- [x] No secrets hardcoded

See merge request: cann/pypto-gym!244	8 days ago
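
The grouped-GEMM idea in the description above (one kernel replacing the per-expert FFN loop, with `expert_cumsum` bounding each expert's token slice) can be illustrated with a small pure-Python reference. This is a sketch under assumptions, not the PyPTO kernel: it assumes tokens are already sorted by expert id, that `expert_cumsum` is an inclusive prefix sum of per-expert token counts, and it uses a single weight matrix per expert in place of a full FFN. The names `matmul`, `grouped_gemm`, and `per_token_reference` are hypothetical.

```python
from itertools import accumulate


def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm(tokens, expert_weights, expert_cumsum):
    """Apply expert e's weight to the token slice bounded by expert_cumsum.

    Assumes tokens are pre-sorted by expert id and expert_cumsum[e] is the
    inclusive running total of tokens routed to experts 0..e.
    """
    out = []
    start = 0
    for weight, end in zip(expert_weights, expert_cumsum):
        if end > start:  # experts with no routed tokens are skipped
            out.extend(matmul(tokens[start:end], weight))
        start = end
    return out


def per_token_reference(tokens, expert_ids, expert_weights):
    """Reference path: one small matmul per token with its expert's weight."""
    return [matmul([t], expert_weights[e])[0] for t, e in zip(tokens, expert_ids)]


# Three tokens routed to experts 0, 0 and 2 (expert 1 receives none).
tokens = [[1, 2], [3, 4], [5, 6]]
expert_ids = [0, 0, 2]
expert_weights = [
    [[1, 0], [0, 1]],
    [[2, 0], [0, 2]],
    [[0, 1], [1, 0]],
]
counts = [expert_ids.count(e) for e in range(len(expert_weights))]
expert_cumsum = list(accumulate(counts))
result = grouped_gemm(tokens, expert_weights, expert_cumsum)
```

Comparing `result` with `per_token_reference(tokens, expert_ids, expert_weights)` is the kind of check a precision test for such a kernel might perform; building `expert_cumsum` once up front, as done here, also mirrors the description's point about keeping that host-side step outside the captured region.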
test_minimax_m3_grouped_gemm.py	fix(Operator): Adjust pypto tensor directory Co-authored-by: huangyuqian<huangyuqian2@huawei.com> # message auto-generated for no-merge-commit merge: !321 merge master into master Created-by: huangyuqian Commit-by: huangyuqian Merged-by: cann-robot See merge request: cann/pypto-gym!321	5 days ago
test_minimax_m3_msa_pypto.py	fix(Operator): Adjust pypto tensor directory Co-authored-by: huangyuqian<huangyuqian2@huawei.com> # message auto-generated for no-merge-commit merge: !321 merge master into master Created-by: huangyuqian Commit-by: huangyuqian Merged-by: cann-robot See merge request: cann/pypto-gym!321	5 days ago