msmodeling/tensor_cast/layers · Ascend/MindStudio-Modeling - AtomGit

ascend-robot	fix(tensor_cast): model MTP speculative decode shapes

File	Last commit	Last updated
__init__.py	[Sync] [Non-development code] Sync code from develop to master Co-authored-by: yydyzr<liuyuncong1@huawei.com> Co-authored-by: gcw_61YBRfIt<chuzhenxing@huawei.com> Co-authored-by: 孔炳翔<1120200577@qq.com> Co-authored-by: zhengxinqian<qianzhengxin@huawei.com> Co-authored-by: hw_whx<wanghexiang7@huawei.com> Co-authored-by: jgong5<steven.gong@gmail.com> Co-authored-by: hw_whx<2952154980@qq.com> # message auto-generated for no-merge-commit merge: !330 merge master into master [Sync] [Non-development code] Sync code from develop to master Created-by: AvadaKedavrua Commit-by: liujiawang;ascend-robot;AvadaKedavrua;lutean;Horacehxw;eveyin1;minghang_c;zwt__;tt0cool;elrond-g;jia_ya_nan;zhenyu_zhang;ChenHuiwen;wangshen001;Hudingyi;wendellX;Secluded_Ocean;jhon-117;yaohan404;jiangruitao;zhenghaojie;stormchasingg;panyj1993;cmh1056291129;yuyinkai1;sunguozhong;genius52;liu_jiaxu;HongMaoShuiGuai;zhengxinqian;weixin_43368449;jsez-li-bin;jgong5;wqh17101;w00609794;yydyzr;JieZhang679;sppedforcy;gcw_61YBRfIt;Jiong Gong;hw_whx;gongjiong;孔炳翔 Merged-by: ascend-robot Description: Code is synced from develop to master; development continues on master from here on, with packaging supported. See merge request: Ascend/msmodeling!330	15 days ago
attention.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
deepseek_v4.py	fix(tensor_cast): fix shared expert tensor parallelism mismatch and routing alignment for DeepSeek V4 Co-authored-by: jhon-117<fangkai15@huawei.com> # message auto-generated for no-merge-commit merge: !414 merge bugfix/20260624-text-generate-fix into master fix(tensor_cast): fix shared expert tensor parallelism mismatch and routing alignment for DeepSeek V4 Created-by: jhon-117 Commit-by: jhon-117 Merged-by: ascend-robot Description:

## Description

This PR fixes an execution crash and incorrect computation behavior when running DeepSeek V4 with `--enable-shared-expert-tp`.

### Background & Root Causes

When simulating DeepSeek V4 with shared expert tensor parallelism enabled, the model threw an `expected 1564672 elements, got 12517376` shape mismatch error (an 8x difference matching `tp_size=8`) and caused a TorchDynamo graph break. This was traced back to three interrelated issues:

1. Routing Slice Misalignment (`moe_layer.py`): DeepSeek V4 uses a hash-based gate router requiring `route_after_dp_transform=False`. In the manual routing execution branch for shared expert TP, `route` was correctly invoked before the `hidden_states` DP slice (`_dp_transform_enter`), but the resulting `topk_indices` and `topk_weights` were never sliced afterwards. The token dispatching logic therefore received 1/8th of the tokens alongside an unsliced 8/8 routing matrix, which blew up the combined tensor shape.
2. Shared Experts TP Match Failure (`transformations.py`): The `tp_plan` matching pattern for shared experts explicitly looked for `..mlp.fused_moe.shared_experts.gate_proj`. However, DeepSeek V4 mounts its shared experts directly under `mlp.shared_experts` as a standard MLP block. Because the regex failed to match, the shared experts were executed densely (unsliced) and mistakenly accumulated across the DP/TP domains.
3. Graph Break (`moe_layer.py`): The fallback safety `logger.warning` that checks for the shape mismatch above triggered a `torch._dynamo.exc.Unsupported` graph break, preventing full-graph compilation.

### Proposed Changes

- Fix topk tensors slicing: Invoked `_dp_transform_enter(topk_indices)` and `_dp_transform_enter(topk_weights)` when `route_after_dp` is False, so that token alignment strictly matches the `hidden_states` sequences.
- Broaden shared experts match pattern: Updated the shared experts matching rule in `tp_plan` from `f"{prefix}..mlp.fused_moe.shared_experts.gate_proj"` to `f"{prefix}..shared_expert.gate_proj"`, accommodating both DeepSeek V4 and standard architectural layouts.
- Safeguard Dynamo Compilation: Wrapped the shape mismatch `logger.warning` in an `if not torch.compiler.is_compiling():` block. This eliminates compilation graph breaks while preserving the log for eager mode debugging.

### Testing

- [x] Verified `cli.inference.text_generate "deepseek-ai/DeepSeek-V4-Flash"` with `--enable-shared-expert-tp` successfully compiles and outputs correct TPS metrics.
- [x] Confirmed `torch.compile` finishes without graph breaks related to the `logger.warning` call.

#### Notes on additional test cases

Existing tests need no changes: the fix does not alter the API contract or the expected shape rules of operator outputs; it corrects an internal alignment error under one specific configuration. Whether to add new tests: the very specific combination of `enable_shared_expert_tp=True` and `route_after_dp_transform=False` is already fully covered by the end-to-end (E2E) full simulation flow of `cli.inference.text_generate`. Overall recommendation: no dedicated tests need to be added for now. The fix has passed the complete end-to-end simulation, and the regression tests of the related modules all pass.

![image.png](https://raw.gitcode.com/user-images/assets/8428112/9f0000d9-e6c9-49ab-80f6-a5cca5dbfa09/image.png 'image.png')

See merge request: Ascend/msmodeling!414	2 days ago
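
The routing slice misalignment described in merge request !414 can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors, and `dp_slice` is a hypothetical stand-in for `_dp_transform_enter`, whose real behavior is not shown in this listing. The token count and contiguous chunking are assumptions chosen only to reproduce the 8x ratio.

```python
def dp_slice(rows, dp_rank, dp_size):
    """Hypothetical stand-in for a DP slice: keep one rank's contiguous chunk."""
    chunk = len(rows) // dp_size
    return rows[dp_rank * chunk:(dp_rank + 1) * chunk]


num_tokens, dp_size = 16, 8  # illustrative sizes, not taken from the PR

hidden_states = [[float(t)] for t in range(num_tokens)]
# Routing runs before the DP slice, so it produces one row per global token.
topk_indices = [[t % 4, (t + 1) % 4] for t in range(num_tokens)]

local_hidden = dp_slice(hidden_states, dp_rank=0, dp_size=dp_size)

# Buggy pairing: sliced tokens dispatched with unsliced routing rows.
mismatch_ratio = len(topk_indices) // len(local_hidden)

# Fixed pairing: slice the routing rows the same way as the tokens.
local_topk = dp_slice(topk_indices, dp_rank=0, dp_size=dp_size)
rows_aligned = len(local_topk) == len(local_hidden)

# The element counts in the reported error differ by the same factor.
reported_ratio = 12517376 // 1564672

print(mismatch_ratio, rows_aligned, reported_ratio)
```

With these sizes the unsliced routing table has eight times as many rows as the local token slice, which is the same factor that separates the two element counts in the error message.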
glm5.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
internal.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
mla.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
moe_layer.py	fix(tensor_cast): fix shared expert tensor parallelism mismatch and routing alignment for DeepSeek V4 (same commit as deepseek_v4.py). See merge request: Ascend/msmodeling!414	2 days ago
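
The shared experts match failure described in merge request !414 comes down to a pattern that is tied to one module layout. The sketch below uses Python's `re` module with hypothetical module paths and hypothetical patterns; the actual `tp_plan` pattern syntax in `transformations.py` is not shown in this listing and may differ.

```python
import re

# Hypothetical module paths for the two layouts mentioned in the PR.
fused_layout = "model.layers.3.mlp.fused_moe.shared_experts.gate_proj"
direct_layout = "model.layers.3.mlp.shared_experts.gate_proj"

# Hypothetical patterns: one pinned to the fused_moe layout, one broader.
narrow = re.compile(
    r"model\.layers\.\d+\.mlp\.fused_moe\.shared_experts\.gate_proj"
)
broad = re.compile(r"model\.layers\.\d+\..*shared_experts?\.gate_proj")


def matched(pattern, names):
    """Return the module names that the pattern matches in full."""
    return [name for name in names if pattern.fullmatch(name)]


names = [fused_layout, direct_layout]
narrow_hits = matched(narrow, names)
broad_hits = matched(broad, names)

print(narrow_hits)
print(broad_hits)
```

A module that no pattern matches gets no TP plan entry, which is consistent with the PR's observation that the unmatched shared experts ran unsliced.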
mtp.py	fix(tensor_cast): model MTP speculative decode shapes Co-authored-by: Horacehxw<horacehxw@gmail.com> # message auto-generated for no-merge-commit merge: !362 merge resolve-issue-130 into master fix(tensor_cast): model MTP speculative decode shapes Created-by: Horacehxw Commit-by: Horacehxw Merged-by: ascend-robot Description:

PR Type

- [x] Bugfix
- [x] Refactor
- [x] Test-Cases (test case updates)

## 🔍 Motivation

In the TensorCast simulation of MTP (Multi-Token Prediction) speculative decode, the row selection logic of lm_head and the sampler had the following problems:

1. Target and proposal rows were conflated: during MTP decode, lm_head should process only the target+bonus verification rows inside the spec window, not all packed rows. The old logic used `selected_token_indices` for prefill-level token trimming but could not distinguish target rows from proposal rows, so lm_head computed extra rows that take no part in verification.
2. The sampler did not support the spec decode output format: the old Sampler only took the greedy argmax of the last token and could not return target+bonus tokens of shape `(num_requests, num_speculative_tokens + 1)`.
3. The Kimi K2.5 MTP path ran everything through lm_head: Kimi's monkey-patch sent all hidden states through lm_head (163840 vocab) in the MTP text path, so at prefill the 12×7168×163840 matrix multiplication was inflated by ~3500×.

------

## 📝 Modification

### Core: unified row selection path

- Added a `SpecDecodeMetadata` dataclass that records each batch's `logits_indices`, `num_active_requests`, and `num_speculative_tokens`.
- Added `select_lm_head_hidden_states(hidden_states, sampling_metadata, mode)`:
  - `mode="target"`: selects all rows of the verification window (for lm_head)
  - `mode="proposal"`: selects only the last row of each request (for the MTP predictor)
- `CausalLmWrapper`, `VLModelWrapper`, and `MultiTokenPredictor` all call this function, replacing the `index_select` calls previously scattered across the code.

### Input Generator

- `generate_inputs` / `generate_inputs_varlen` construct `SpecDecodeMetadata` during MTP decode, covering only the spec window rows at the tail of each request.
- Short windows (query_len < num_mtp_tokens + 1) automatically fall back to ordinary decode selection.

### Sampler

- When `spec_decode_metadata` is present, the verification logits are reshaped to `(num_requests, spec_window, vocab)`, greedy argmax is applied to the target and bonus positions separately, and the result has shape `(num_requests, spec_window)`.
- Remains compatible with proposal rows (subsequent MTP layers) and the old `selected_token_indices` prefill path.

### Kimi K2.5

- Split the MTP text path: first run the language model body to obtain the full hidden states (needed by rotary/proposal), then trim to the target rows with `select_lm_head_hidden_states` before lm_head. This avoids the full 163840-vocab projection.
- Removed the old `MultiTokenPredictorLayer` tuple-unpack monkey patch (already fixed upstream in mtp.py).

------

## 📐 Associated Test Results

`pytest: 26 passed, 2 warnings in 3.72s`

Covers:

- target/proposal row selection is not mixed
- the default `selected_token_indices=-1` sentinel is correctly ignored
- wrong logits_indices length → ValueError
- spec-decode sampler returns target+bonus tokens
- CausalLmWrapper target row projection
- MTP wrapper prefill fallback / spec-decode bonus token forward
- fixed/varlen input generator MTP metadata generation
- Kimi text path target rows are trimmed before the internal lm_head
- Kimi default sentinel does not trigger the fast path

------

## ✅ Checklist

- [x] Linting tools used (lintrunner)
- [x] Bug fixes covered by unit tests
- [x] Modification covered by unit tests
- [ ] Documentation updated
- [x] No Chinese comments in code files

See merge request: Ascend/msmodeling!362	1 day ago
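
The target/proposal row selection described in merge request !362 can be sketched as follows. This is not the repository's `select_lm_head_hidden_states`: `SpecDecodeMetadataSketch` and `select_rows_sketch` are hypothetical names, plain lists stand in for tensors, and the layout of `logits_indices` (one contiguous window of `num_speculative_tokens + 1` rows per request) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpecDecodeMetadataSketch:
    """Hypothetical mirror of the three fields named in the PR description."""
    logits_indices: List[int]
    num_active_requests: int
    num_speculative_tokens: int


def select_rows_sketch(hidden_states, meta, mode):
    """Pick packed rows for lm_head ("target") or the MTP predictor ("proposal")."""
    window = meta.num_speculative_tokens + 1
    expected = meta.num_active_requests * window
    if len(meta.logits_indices) != expected:
        raise ValueError(
            f"expected {expected} logits indices, got {len(meta.logits_indices)}"
        )
    if mode == "target":
        picked = meta.logits_indices  # every verification row
    elif mode == "proposal":
        picked = meta.logits_indices[window - 1::window]  # last row per request
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [hidden_states[i] for i in picked]


# Two requests packed as 4 rows each; the spec window is the last 3 rows of each.
hidden_states = [f"h{i}" for i in range(8)]
meta = SpecDecodeMetadataSketch(
    logits_indices=[1, 2, 3, 5, 6, 7],
    num_active_requests=2,
    num_speculative_tokens=2,
)

target_rows = select_rows_sketch(hidden_states, meta, "target")
proposal_rows = select_rows_sketch(hidden_states, meta, "proposal")
print(target_rows, proposal_rows)
```

Here lm_head would see six of the eight packed rows, while the proposal path sees only the final row of each request.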
parallel_embedding.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
parallel_linear.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
quant_linear.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
rotary_embedding.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago
sampler.py	fix(tensor_cast): model MTP speculative decode shapes (same commit as mtp.py). See merge request: Ascend/msmodeling!362	1 day ago
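
The spec-decode sampler shape described in merge request !362 can be sketched with nested lists. `greedy_spec_sample` is a hypothetical helper, not the repository's Sampler. It assumes the verification logits arrive as a flat list of `num_requests * spec_window` rows grouped by request, and it applies the same greedy argmax at every window position.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def greedy_spec_sample(flat_logits, num_requests, spec_window):
    """View flat logits as (num_requests, spec_window, vocab) and argmax each row."""
    if len(flat_logits) != num_requests * spec_window:
        raise ValueError("logits rows do not match num_requests * spec_window")
    return [
        [argmax(flat_logits[r * spec_window + s]) for s in range(spec_window)]
        for r in range(num_requests)
    ]


# 2 requests, spec_window = num_speculative_tokens + 1 = 3, vocab size 4.
flat_logits = [
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.2, 0.7, 0.1],
    [0.5, 0.1, 0.1, 0.3],
    [0.0, 0.0, 0.1, 0.9],
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.2, 0.5, 0.1],
]

tokens = greedy_spec_sample(flat_logits, num_requests=2, spec_window=3)
print(tokens)  # one row of target+bonus token ids per request
```

The result has one row per request and `spec_window` token ids per row, matching the `(num_requests, num_speculative_tokens + 1)` shape named in the description.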
utils.py	[Sync] [Non-development code] Sync code from develop to master (same commit as __init__.py). See merge request: Ascend/msmodeling!330	15 days ago