msmodeling/tensor_cast/layers · Ascend/MindStudio-Modeling - AtomGit

ascend-robot fix(tensor_cast): support GLM5 DSA tuple returns

File	Last commit	Last updated
__init__.py	add o_proj tp and mla tp Co-authored-by: yydyzr<liuyuncong1@huawei.com>	7 months ago
attention.py	chore(ci): adopt pre-commit and retire legacy lintrunner adapters
Co-authored-by: liujiawang<anonymousdev@163.com>
# message auto-generated for no-merge-commit merge: !176 merge pre-commit into develop
Created-by: AvadaKedavrua
Commit-by: liujiawang;AvadaKedavrua
Merged-by: ascend-robot
Description:

PR Type
- [ ] Feature
- [ ] Bugfix
- [x] Docs
- [x] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

------

## Motivation

Continue the pre-commit migration: tighten Pylint so only high-signal messages run (`disable=all` + explicit `enable` list), fix real issues that remained under that profile, and translate hook/config comments to English.

------

## Configuration changes (tooling & comments only)

| Path | What changed |
|------|----------------|
| `pre-commit/pyproject.toml` | Pylint: `[tool.pylint."messages control"]` with `disable = ["all"]` and a short allowlist of message IDs (E0100, E0601–E0611, E0632, E1101, E1120, W0632, W1514). Ruff: unchanged behavior; comments translated to English. Bandit: comments translated; rule allowlist/skip lists unchanged. |
| `.pre-commit-config.yaml` | Comments translated to English; Bandit hook display name set to bandit (Python security checks). Hook versions and args unchanged except for comment text. |

------

## Source code changes (application code)

| Area | Files | Purpose |
|------|--------|---------|
| `serving_cast` | `communication.py`, `engine.py`, `instance.py`, `kv_cache_manager.py`, `load_gen.py`, `main.py`, `model_runner.py`, `request.py`, `serving.py`, `utils.py` | Replace `from . import stime` with `import serving_cast.stime as stime` so Pylint resolves imports (fixes E0611). |
| `serving_cast` | `stime.py` | Singleton salabim `Environment` via `_get_sim_env()` so type checkers/Pylint see `sim.Environment` (fixes E1101 on `SimulationEnv`). |
| `serving_cast/service` | `base_throughput_optimizer.py` | `__init__` defaults + `assert runner is not None` before `run_inference` (fixes E1101 on base class). |
| `tensor_cast` | `diffusers/diffusers_model.py`, `diffusers/diffusers_utils.py`, `runtime.py` | Add `encoding="utf-8"` to `open()` / trace export (fixes W1514). |
| `web_ui` | `callbacks.py` | `refresh_optimizer_detail`: call `_optimizer_detail_view(rows, None, device)` and unpack five return values (fixes E1120). |

------

## Recent commits on `pre-commit` branch

- `ci(pre-commit): fix pylint message selection with disable=all`
- `fix: resolve pylint findings in serving_cast, tensor_cast, and web_ui`
- `docs(pre-commit): translate comments to English and add all-files run log`

------

![](https://raw.atomgit.com/Ascend/msmodeling/attachment/uploads/b22b18aa-4c84-4dc0-85f5-1e7e0715350e/pre-commit-all-files-run.svg)

------

## Checklist

- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!176	1 month ago
deepseek_v4.py	[Bugfix] Fix AttributeError: 'CopyLayerWrapper' object has no attribute 'self_attn'
Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com>
# message auto-generated for no-merge-commit merge: !298 merge bug-fix2 into develop
Created-by: ChenHuiwen
Commit-by: ChenHuiwen
Merged-by: ascend-robot
Description:

# PR Template

PR Type
- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## ✅ Checklist

Before PR:
- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [ ] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!298	20 days ago
glm5.py	fix(tensor_cast): support GLM5 DSA tuple returns
Co-authored-by: minghang_c<chiminghang@h-partners.com>
# message auto-generated for no-merge-commit merge: !332 merge glm5-transformers-fix into develop
Created-by: minghang_c
Commit-by: minghang_c
Merged-by: ascend-robot
Description:

## Background

When running TensorCast inference modeling on GLM-5 (`glm_moe_dsa`) / GLM-5.1, the original failure occurs where the decoder layer return value is unpacked:

```bash
python -m cli.inference.text_generate zai-org/GLM-5 \
  --device ATLAS_800_A3_752T_128G_DIE \
  --num-devices 16 \
  --tp-size 16 \
  --dp-size 1 \
  --ep-size 16 \
  --context-length 0 \
  --query-length 3500 \
  --num-queries 1 \
  --compile \
  --quantize-linear-action W4A8_STATIC \
  --dump-input-shapes
```

The error is a mismatch in the number of tuple return values:

```text
ValueError: not enough values to unpack (expected 3, got 2)
```

Once the attention return contract is fixed, the repetition copy layer path exposes a second mismatch, this time in the decoder layer return value:

```text
ValueError: not enough values to unpack (expected 2, got 1)
```

Enabling MTP on GLM-5.1 exposes two further MTP adaptation problems:

```bash
python -m cli.inference.text_generate zai-org/GLM-5.1 \
  --device ATLAS_800_A3_752T_128G_DIE \
  --num-devices 16 \
  --tp-size 16 \
  --dp-size 1 \
  --ep-size 16 \
  --context-length 0 \
  --query-length 3500 \
  --num-queries 1 \
  --num-mtp-tokens 3 \
  --compile \
  --quantize-linear-action W4A8_STATIC \
  --dump-input-shapes
```

First, a synthetic MTP layer uses `layer_idx >= num_hidden_layers`, so indexing the GLM DSA per-layer config goes out of range:

```text
IndexError: list index out of range
# config.indexer_types[layer_idx]
```

Second, the GLM DSA decoder block returns a tuple, while the generic MTP flow expects to keep working on a tensor:

```text
torch._dynamo.exc.Unsupported: Dynamo does not know how to trace method `index_select` of class `tuple`
```

## Root cause

The HuggingFace decoder layer of GLM-5 / GLM-5.1 carries a cross-layer top-k hand-off protocol for DSA sparse attention:

- the attention return contract is a 3-tuple: `(attn_output, attn_weights, topk_indices)`
- the decoder layer return contract is a 2-tuple: `(hidden_states, topk_indices)`

During model conversion, TensorCast:

1. uses `mla_module_class_type` to replace the HF `GlmMoeDsaAttention` with the TensorCast sparse attention implementation;
2. in the repetition optimization, wraps the representative layer in `RegionMarkerWrapper` and replaces the subsequent repeated layers with `CopyLayerWrapper`;
3. when MTP is enabled, builds synthetic MTP layers from the decoder layer class.

The original generic implementation did not fully preserve the GLM DSA return values and per-layer config contract:

- `DeepseekSparseAttention` returns only `(attn_output, attn_weights)`, but the GLM decoder expects attention to return 3 values;
- `CopyLayerWrapper` builds only `(hidden_states,)` for tuple returns, but the GLM decoder layer expects a repeated layer to return 2 values as well;
- `maybe_enable_mtp()` extends only `layer_types` / `mlp_layer_types`, not the GLM DSA specific `indexer_types`;
- `MultiTokenPredictorLayer` does not handle model families whose MTP block returns a tuple.

The underlying problem is therefore that the TensorCast wrapper, replacement, and synthetic MTP layers did not fully preserve the return contract and per-layer config contract of the HF modules they replace.

## Changes

### 1. Add a GLM-specific sparse attention wrapper

New file `tensor_cast/layers/glm5.py`:

```python
class Glm5SparseAttention(DeepseekSparseAttention):
    def forward(self, *args, **kwargs):
        attn_output, attn_weights = super().forward(*args, **kwargs)
        return attn_output, attn_weights, None
```

The GLM profile in `tensor_cast/transformers/builtin_model/glm5.py` switches its `mla_module_class_type` from the generic `DeepseekSparseAttention` to `Glm5SparseAttention`. The 3-tuple attention return contract is thus handled only in the GLM adapter layer; the generic `DeepseekSparseAttention` is unchanged, so other built-in models are not affected.

`tests/.ci/gate_policy.yaml` is not modified. The `builtin_model` path is omitted in the coverage configuration, so placing the new implementation directly in `builtin_model/glm5.py` would prevent the new tests from producing a test_map. The testable wrapper therefore lives in `tensor_cast/layers/glm5.py`, which lets the CI gate associate it with `tests/regression/tensor_cast/test_glm5.py` through the normal coverage/test_map path.

### 2. Make the repetition copy wrapper keep the representative layer's tuple length

In `tensor_cast/layers/internal.py`:

- `RegionMarkerWrapper` records the actual length of the tuple returned by the representative layer;
- `CopyLayerWrapper` pads with `None` up to that length, so the tuple arity of the copy layer matches the representative layer.

This change contains no GLM-specific field checks; for example, it does not read `prev_topk_indices`. It only guarantees that the generic wrapper returns a structure of the same length as the representative layer. For GLM, a copied decoder layer returns `(hidden_states, None)`; when the next layer receives `prev_topk_indices=None`, it recomputes top-k following the original HF logic, so the semantics are safe.

### 3. Complete the GLM DSA MTP per-layer config

In `tensor_cast/transformers/transformations.py`, when MTP is enabled, `indexer_types` is extended in the same way as `layer_types` / `mlp_layer_types`:

```python
if hasattr(hf_config, "indexer_types") and isinstance(hf_config.indexer_types, list) and hf_config.indexer_types:
    hf_config.indexer_types.extend([hf_config.indexer_types[-1]] * mtp_config.num_mtp_layers)
```

Synthetic MTP layers with `layer_idx=78,79,80` can then read a valid GLM DSA indexer type, which avoids the `IndexError`.

### 4. Make the MTP layer accept tuple block output

In `tensor_cast/layers/mtp.py`, if `mtp_block` returns a tuple, the first element is used as the hidden states for the rest of the flow:

```python
if isinstance(hidden_states, tuple):
    hidden_states = hidden_states[0]
```

This matches the decoder layer tuple contract: the first element is `hidden_states`, and the remaining elements are auxiliary return values specific to the model family.

### 5. Add lightweight regression tests

Add/extend `tests/regression/tensor_cast/test_glm5.py` to cover:

- `Glm5SparseAttention.forward` pads the 2-tuple attention output to the 3-tuple the GLM decoder needs;
- `maybe_enable_mtp()` extends the GLM DSA `indexer_types`;
- `MultiTokenPredictorLayer` takes `hidden_states` from a tuple MTP block output.

## Verification

The GLM adapter / MTP regression tests and the existing repetition wrapper tests pass:

```bash
/home/minghang/workspace/msmodeling-upstream/.venv/bin/python -m pytest \
  tests/regression/tensor_cast/test_glm5.py \
  tests/regression/tensor_cast/test_repetition_wrappers.py -q
```

Result:

```text
4 passed in 0.02s
```

The originally failing GLM-5.1 + MTP command now runs and prints the performance statistics:

```bash
/home/minghang/workspace/msmodeling-upstream/.venv/bin/python -m cli.inference.text_generate zai-org/GLM-5.1 \
  --device ATLAS_800_A3_752T_128G_DIE \
  --num-devices 16 \
  --tp-size 16 \
  --dp-size 1 \
  --ep-size 16 \
  --context-length 0 \
  --query-length 3500 \
  --num-queries 1 \
  --num-mtp-tokens 3 \
  --compile \
  --quantize-linear-action W4A8_STATIC \
  --dump-input-shapes
```

Result summary:

```text
Model compilation and execution time: 8.125 s
Total time for analytic: 283.311ms
[analytic] TPS/Device: 772.1 token/s
```

The symbols in the new layer file are recognized by the CI gate AST logic:

```text
top-level: ['Glm5SparseAttention']
spans: [('Glm5SparseAttention.forward', 5, 7)]
```

## Scope of impact

- The 3-tuple adaptation of the GLM attention return contract is confined to `Glm5SparseAttention` in `tensor_cast/layers/glm5.py`;
- the generic `DeepseekSparseAttention` is not modified, so other MLA/DSA models are not affected;
- the `CopyLayerWrapper` change is generic tuple-arity preservation and introduces no GLM-specific field checks;
- `maybe_enable_mtp()` extends the list only for HF configs that have `indexer_types`, consistent with the existing `layer_types` / `mlp_layer_types` extension logic;
- `MultiTokenPredictorLayer` takes the first element of a tuple block output, which is compatible with the standard decoder layer tuple return contract;
- `tests/.ci/gate_policy.yaml` is not modified, which avoids a configuration change that would make the CI gate run the full suite.

See merge request: Ascend/msmodeling!332	14 days ago
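The 3-tuple attention contract described in merge request !332 can be illustrated without PyTorch. The following is a minimal sketch, not the repository's code: `TwoTupleAttention`, `ThreeTupleAttention`, and `decoder_layer` are hypothetical stand-ins for `DeepseekSparseAttention`, `Glm5SparseAttention`, and a GLM-style decoder layer, and the doubling "computation" is a placeholder.

```python
class TwoTupleAttention:
    """Hypothetical stand-in for a generic attention module returning two values."""

    def forward(self, hidden_states):
        attn_output = [2 * x for x in hidden_states]  # placeholder computation
        attn_weights = None
        return attn_output, attn_weights


class ThreeTupleAttention(TwoTupleAttention):
    """Adapter that pads the parent's result with None in the top-k slot."""

    def forward(self, *args, **kwargs):
        attn_output, attn_weights = super().forward(*args, **kwargs)
        return attn_output, attn_weights, None


def decoder_layer(attn, hidden_states):
    """Hypothetical caller that unpacks three values, like the GLM decoder."""
    attn_output, _attn_weights, topk_indices = attn.forward(hidden_states)
    return attn_output, topk_indices


# The adapter satisfies the caller; the two-value parent does not.
adapted = decoder_layer(ThreeTupleAttention(), [1, 2])
try:
    decoder_layer(TwoTupleAttention(), [1, 2])
    parent_failed = False
except ValueError:
    parent_failed = True
```

Keeping the padding in a subclass, rather than changing the parent, mirrors the design choice stated in the description: callers of the generic class keep seeing two values.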
internal.py	fix(tensor_cast): support GLM5 DSA tuple returns Co-authored-by: minghang_c<chiminghang@h-partners.com> # message auto-generated for no-merge-commit merge: !332 merge glm5-transformers-fix into develop Created-by: minghang_c Commit-by: minghang_c Merged-by: ascend-robot See merge request: Ascend/msmodeling!332	14 days ago
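Merge request !332 states that, in `internal.py`, `RegionMarkerWrapper` records the representative layer's tuple length and `CopyLayerWrapper` pads with `None` to match it. A minimal sketch of that idea follows. The classes `RegionMarker` and `CopyLayer` are hypothetical simplifications, and passing `hidden_states` through unchanged in the copy layer is an assumption made for illustration, not the repository's behavior.

```python
class RegionMarker:
    """Hypothetical wrapper around the representative layer.

    Records how many values the wrapped layer returns when it returns a tuple.
    """

    def __init__(self, layer):
        self.layer = layer
        self.output_arity = None  # stays None if the layer returns a non-tuple

    def __call__(self, hidden_states):
        out = self.layer(hidden_states)
        if isinstance(out, tuple):
            self.output_arity = len(out)
        return out


class CopyLayer:
    """Hypothetical stand-in for a repeated layer replaced by a cheap copy."""

    def __init__(self, marker):
        self.marker = marker

    def __call__(self, hidden_states):
        arity = self.marker.output_arity
        if arity is None:
            return hidden_states
        # Pad with None so the caller can unpack the same number of values.
        return (hidden_states,) + (None,) * (arity - 1)


marker = RegionMarker(lambda h: (h + 1, "topk"))
first = marker(1)          # representative layer runs and arity 2 is recorded
copied = CopyLayer(marker)(5)
```

Because the padding is driven only by the recorded length, the sketch contains no model-specific field names, which is the property the description emphasizes.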
mla.py	[Bugfix] Fix AttributeError: 'CopyLayerWrapper' object has no attribute 'self_attn' Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com> # message auto-generated for no-merge-commit merge: !298 merge bug-fix2 into develop Created-by: ChenHuiwen Commit-by: ChenHuiwen Merged-by: ascend-robot See merge request: Ascend/msmodeling!298	20 days ago
moe_layer.py	feat: simulation modeling support for the DeepSeek-V4 model
Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com>
# message auto-generated for no-merge-commit merge: !166 merge deepseek-v4 into develop
Created-by: ChenHuiwen
Commit-by: ChenHuiwen
Merged-by: ascend-robot
Description:

PR Type
- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

Add end-to-end support for DeepSeek V4 (Flash/Pro) models to msmodeling/tensor_cast, so that its performance-modeling pipeline covers the new structures introduced by V4: sparse attention (multi-layer-type routing across NSA / Window / Compressed / Heavily-Compressed), HC (Head Compression) mixing, the Sinkhorn split, and Hash Routing MoE. The PR also adds the corresponding fake-tensor semantic operators and cost models, so that V4 models run through the existing analytic / multistream tracing flow directly.

------

## 📝 Modification

New files

- tensor_cast/transformers/builtin_model/deepseek_v4.py: the DeepSeek V4 builtin model profile, including DeepseekV4Config / DeepseekV4Model registration, layer-type validation ({0, 4, 128} map to sliding_attention / compressed_sparse_attention / heavily_compressed_attention), and safe registration with transformers AutoConfig / AutoModel.
- tests/test_tensor_cast/test_deepseek_v4.py and tests/test_tensor_cast/data/deepseek_v4/.json: test data and test cases for the V4 model (including valid, invalid, missing, and truncated ratios configurations).

Attention (tensor_cast/layers/mla.py, tensor_cast/ops/mla.py, tensor_cast/ops/rotary_embedding.py)

- Add DeepseekV4SparseAttention and the MultiheadLatentAttentionTensorCast adaptation (including requires_legacy_kv_b_decomposition, the KV-cache window write path, etc.).
- Add the index generation utilities get_window_topk_idxs / get_compress_topk_idxs.
- Add HC-path semantic operators hc_pre_inv_rms and hc_pre_sinkhorn, corresponding to the inverse-RMS scaling and the Sinkhorn-weighted reduction in the reference implementation.
- Add cost models for KV write operators such as scatter_nd_update_mla. Following the reference implementation, they count only source row reads and updated row writes, and exclude slot_mapping and the whole cache tensor.

MoE / Gate (tensor_cast/layers/moe_layer.py, tensor_cast/ops/fused_moe.py)

- MoELayer gains a unified V4 gating path: it detects the is_v4 / hash flags on the gate and emits the matmul + score func + indices + gather/normalize/route_scale operators in the order of the reference Gate.forward, so that each step is costed separately at its real dtype (the gate matmul runs in fp32).
- Add two semantic operators: moe_gating_top_k (V4 non-hash layers, with optional bias) and moe_gating_top_k_hash (hash routing layers based on the tid2eid table).

Performance Model (tensor_cast/performance_model/__init__.py)

- Introduce the _safe_max_int utility: when tensor.max().item() is unavailable on fake / meta / functional tensors, it falls back to None so that the caller uses a shape-based estimate.
- Register PerformanceProperties for the new V4 operators (scatter_nd_update_mla, the HC family, the new MoE gating tail, etc.), aligned with the memory access semantics of the reference implementation.

Misc

- tensor_cast/core/config_resolver.py, input_generator.py, model_runner.py, device.py, transformers/transformations.py, transformers/custom_model_registry.py, layers/utils.py, model_config.py, compilation/passes/multistream_pass.py: wire V4 into config resolution, input construction, runner scheduling, device profile, model transformation, and operator registration.

------

## 📐 Associated Test Results

![image.png](https://raw.gitcode.com/user-images/assets/8428112/4dbd32d5-6f6d-4b84-a840-a06eec62fc40/image.png 'image.png')
![image.png](https://raw.gitcode.com/user-images/assets/8428112/fda50383-9b30-4453-bfd1-391889bebb47/image.png 'image.png')

------

## ✅ Checklist

Before PR:
- [ ] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix the potential lint issues.
- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [ ] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!166	21 days ago
mtp.py	fix(tensor_cast): support GLM5 DSA tuple returns Co-authored-by: minghang_c<chiminghang@h-partners.com> # message auto-generated for no-merge-commit merge: !332 merge glm5-transformers-fix into develop Created-by: minghang_c Commit-by: minghang_c Merged-by: ascend-robot See merge request: Ascend/msmodeling!332	14 days ago
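The two MTP fixes named in merge request !332 (extending `indexer_types` for synthetic layers, and taking the first element of a tuple block output) can be sketched in plain Python. This is an illustration under stated assumptions: the helper names `extend_per_layer_list` and `first_if_tuple` are hypothetical, the config is a `SimpleNamespace` rather than a real HF config, and the values `"type_a"` / `"type_b"` are placeholders, not real GLM indexer types.

```python
from types import SimpleNamespace


def extend_per_layer_list(hf_config, name, num_mtp_layers):
    """Repeat the last per-layer entry once for each synthetic MTP layer.

    Does nothing if the attribute is missing, not a list, or empty.
    """
    values = getattr(hf_config, name, None)
    if isinstance(values, list) and values:
        values.extend([values[-1]] * num_mtp_layers)


def first_if_tuple(block_output):
    """Return hidden states whether the block returned a tensor or a tuple."""
    if isinstance(block_output, tuple):
        return block_output[0]
    return block_output


# Placeholder config with two real layers and three synthetic MTP layers.
config = SimpleNamespace(indexer_types=["type_a", "type_b"])
extend_per_layer_list(config, "indexer_types", 3)

# Index 4 would have raised IndexError before the extension.
last_mtp_type = config.indexer_types[4]

hidden = first_if_tuple(("hidden_states", None))
```

Repeating the last entry is the same policy the description attributes to the existing `layer_types` / `mlp_layer_types` handling, so the synthetic layers inherit the configuration of the final real layer.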
parallel_embedding.py	bugfix: ParallelEmbedding did not adapt padding_idx in pieces. Co-authored-by: Elrond G<elrondgcn@gmail.com> # message auto-generated for no-merge-commit merge: !244 merge bugfix/developer/adaptar_padding_idx into develop Created-by: elrond-g Commit-by: Elrond G Merged-by: ascend-robot Description: PR Type: Bugfix. ## 🔍 Motivation Fix an error in tensor_cast/layers/parallel_embedding.py: ParallelEmbedding did not adapt padding_idx when the embedding weight is sharded. Command: python -m tensor_cast.scripts.text_generate zai-org/GLM-5 \ --num-queries 8 --query-length 1 --context-length 4096 \ --tp-size 8 --dp-size 2 --ep-size 16 \ --quantize-linear-action W8A8_STATIC --word-embedding-tp row \ --device ATLAS_800_A3_752T_128G_DIE --world-size 16 \ --performance-model profiling --compile \ --profiling-database tensor_cast/performance_model/profiling_database/data/ATLAS_800_A3_752T_128G_DIE/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5 \ --enable-shared-expert-tp --enable-dispatch-ffn-combine Trigger chain: 1. With --word-embedding-tp row, the embedding weight is split by row (along the vocab dimension). 2. GLM-5 has vocab_size=154880 and pad_token_id=154820. With 8-way TP, the weight shape on each rank becomes (19360, 6144) (154880 / 8 = 19360). 3. ParallelEmbedding.create_weights() only replaces self._inner.weight with the sharded weight (parallel_embedding.py:37-38) and does not update self._inner.padding_idx, so _inner still holds the original padding_idx=154820. 4. forward() calls self._inner(safe_local_indices) (parallel_embedding.py:50), and PyTorch calls F.embedding(input, weight, padding_idx=154820, ...). 5. With --compile, Dynamo asserts padding_idx < weight.size(0) on the fake tensor path, and 154820 < 19360 fails. The run succeeds without --compile because eager mode usually does not trigger this assertion (in the CUDA/NPU kernel, padding_idx only takes effect when an index hits it); only Dynamo's meta/fake path enforces the check. ## 📝 Modification diff --git a/tensor_cast/layers/parallel_embedding.py b/tensor_cast/layers/parallel_embedding.py index 2b119c6..d41e310 100644 --- a/tensor_cast/layers/parallel_embedding.py +++ b/tensor_cast/layers/parallel_embedding.py @@ -40,6 +40,11 @@ class ParallelEmbedding(ModelWrapperBase): block_size = self._inner.weight.shape[0] self._row_start = self.tp_rank * block_size self._row_end = min(self._row_start + block_size, self._vocab_size) + orig_padding_idx = self._inner.padding_idx + if orig_padding_idx is not None and self._row_start <= orig_padding_idx < self._row_end: + self._inner.padding_idx = orig_padding_idx - self._row_start + else: + self._inner.padding_idx = None def forward(self, x: torch.Tensor) -> torch.Tensor: if self.tp_size == 1: ## 📐 Associated Test Results The command listed under Motivation now runs successfully. See merge request: Ascend/msmodeling!244	20 days ago
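The index arithmetic behind the padding_idx fix in !244 can be sketched without PyTorch. This is a minimal illustration, not the project's code: the helper name `local_padding_idx` is hypothetical, and only the ownership test and the subtraction mirror the diff. The even split of the vocabulary across ranks is an assumption taken from the GLM-5 numbers in the PR description.

```python
from typing import Optional


def local_padding_idx(
    padding_idx: Optional[int], tp_rank: int, block_size: int, vocab_size: int
) -> Optional[int]:
    """Map a global padding_idx to the local index of a row-sharded embedding.

    Returns None when this rank does not own the padding row.
    (Hypothetical helper; mirrors the branch added in the diff.)
    """
    row_start = tp_rank * block_size
    row_end = min(row_start + block_size, vocab_size)
    if padding_idx is not None and row_start <= padding_idx < row_end:
        return padding_idx - row_start
    return None


# Numbers from the PR description: vocab 154880, pad token 154820, 8-way TP.
vocab_size, pad_token_id, tp_size = 154880, 154820, 8
block_size = vocab_size // tp_size  # 19360 rows per rank
per_rank = [
    local_padding_idx(pad_token_id, rank, block_size, vocab_size)
    for rank in range(tp_size)
]
print(per_rank)
```

Only the last rank owns the padding row, so every other rank ends up with `None`, and every non-`None` value stays below the per-rank row count that the Dynamo check compares against.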
parallel_linear.py	feat: simulation modeling support for DeepSeek-V4 model adaptation Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com> # message auto-generated for no-merge-commit merge: !166 merge deepseek-v4 into develop Created-by: ChenHuiwen Commit-by: ChenHuiwen Merged-by: ascend-robot Description: PR Type: Feature. ## 🔍 Motivation Add end-to-end support for the DeepSeek V4 (Flash/Pro) models to msmodeling/tensor_cast, so that its performance-modeling pipeline covers the new structures V4 introduces: sparse attention (NSA / Window / Compressed / Heavily-Compressed routing across multiple layer types), HC (Head Compression) mixing, Sinkhorn splitting, and Hash Routing MoE. Also add the corresponding fake-tensor semantic operators and cost models, so that V4 models run through the existing analytic / multistream tracing flow directly. ------ ## 📝 Modification New files - tensor_cast/transformers/builtin_model/deepseek_v4.py: the DeepSeek V4 builtin model profile, including DeepseekV4Config / DeepseekV4Model registration, layer-type validation ({0, 4, 128} map to sliding_attention / compressed_sparse_attention / heavily_compressed_attention), and safe registration with transformers AutoConfig / AutoModel. - tests/test_tensor_cast/test_deepseek_v4.py and tests/test_tensor_cast/data/deepseek_v4/.json: test data and test cases for the V4 model (covering valid, invalid, missing, and truncated ratios configurations). Attention (tensor_cast/layers/mla.py, tensor_cast/ops/mla.py, tensor_cast/ops/rotary_embedding.py) - Add DeepseekV4SparseAttention and the MultiheadLatentAttentionTensorCast adaptation (including requires_legacy_kv_b_decomposition, the KV-cache window write path, etc.). - Add the index-generation utilities get_window_topk_idxs / get_compress_topk_idxs. - Add the HC-path semantic operators hc_pre_inv_rms and hc_pre_sinkhorn, which correspond to the inverse-RMS scaling and the Sinkhorn-weighted reduction in the reference implementation. - Add cost models for KV write operators such as scatter_nd_update_mla; following the reference implementation, they count only source-row reads and updated-row writes, not slot_mapping or the whole cache tensor. MoE / Gate (tensor_cast/layers/moe_layer.py, tensor_cast/ops/fused_moe.py) - MoELayer gains a unified V4 gating path: it recognizes the is_v4 / hash flags on the gate and emits the matmul + score func + indices + gather/normalize/route_scale operators in the order of the reference Gate.forward, so that each step is costed separately at its real dtype (the gate matmul runs in fp32). - Add two semantic operators: moe_gating_top_k (V4 non-hash layers, with optional bias) and moe_gating_top_k_hash (hash-routing layers based on the tid2eid table). Performance Model (tensor_cast/performance_model/__init__.py) - Introduce the _safe_max_int utility: when tensor.max().item() is unavailable on fake / meta / functional tensors, it falls back to None so that the caller uses shape-based estimation. - Register PerformanceProperties for the new V4 operators (scatter_nd_update_mla, the HC family, the new MoE gating tail, etc.), aligned with the memory-access semantics of the reference implementation. Misc - tensor_cast/core/config_resolver.py, input_generator.py, model_runner.py, device.py, transformers/transformations.py, transformers/custom_model_registry.py, layers/utils.py, model_config.py, compilation/passes/multistream_pass.py: wire V4 into config resolution, input construction, runner scheduling, device profile, model transformation, and operator registration. ------ ## 📐 Associated Test Results ![image.png](https://raw.gitcode.com/user-images/assets/8428112/4dbd32d5-6f6d-4b84-a840-a06eec62fc40/image.png 'image.png') ![image.png](https://raw.gitcode.com/user-images/assets/8428112/fda50383-9b30-4453-bfd1-391889bebb47/image.png 'image.png') ------ See merge request: Ascend/msmodeling!166	21 days ago
quant_linear.py	chore(ci): adopt pre-commit and retire legacy lintrunner adapters Co-authored-by: liujiawang<anonymousdev@163.com> # message auto-generated for no-merge-commit merge: !176 merge pre-commit into develop Created-by: AvadaKedavrua Commit-by: liujiawang;AvadaKedavrua Merged-by: ascend-robot Description: PR Type: Docs, CI/CD. ------ ## Motivation Continue the pre-commit migration: tighten Pylint so only high-signal messages run (`disable=all` + explicit `enable` list), fix real issues that remained under that profile, and translate hook/config comments to English. ------ ## Configuration changes (tooling & comments only) - `pre-commit/pyproject.toml`: Pylint: `[tool.pylint."messages control"]` with `disable = ["all"]` and a short allowlist of message IDs (E0100, E0601–E0611, E0632, E1101, E1120, W0632, W1514). Ruff: unchanged behavior; comments translated to English. Bandit: comments translated; rule allowlist/skip lists unchanged. - `.pre-commit-config.yaml`: Comments translated to English; Bandit hook display name set to bandit (Python security checks). Hook versions and args unchanged except for comment text. ------ ## Source code changes (application code) - `serving_cast` (`communication.py`, `engine.py`, `instance.py`, `kv_cache_manager.py`, `load_gen.py`, `main.py`, `model_runner.py`, `request.py`, `serving.py`, `utils.py`): Replace `from . import stime` with `import serving_cast.stime as stime` so Pylint resolves imports (fixes E0611). - `serving_cast` (`stime.py`): Singleton salabim `Environment` via `_get_sim_env()` so type checkers/Pylint see `sim.Environment` (fixes E1101 on `SimulationEnv`). - `serving_cast/service` (`base_throughput_optimizer.py`): `__init__` defaults + `assert runner is not None` before `run_inference` (fixes E1101 on base class). - `tensor_cast` (`diffusers/diffusers_model.py`, `diffusers/diffusers_utils.py`, `runtime.py`): Add `encoding="utf-8"` to `open()` / trace export (fixes W1514). - `web_ui` (`callbacks.py`): `refresh_optimizer_detail`: call `_optimizer_detail_view(rows, None, device)` and unpack five return values (fixes E1120). ------ ## Recent commits on `pre-commit` branch - `ci(pre-commit): fix pylint message selection with disable=all` - `fix: resolve pylint findings in serving_cast, tensor_cast, and web_ui` - `docs(pre-commit): translate comments to English and add all-files run log` ------ ![](https://raw.atomgit.com/Ascend/msmodeling/attachment/uploads/b22b18aa-4c84-4dc0-85f5-1e7e0715350e/pre-commit-all-files-run.svg) ------ ## Checklist - [x] Please ensure code files contain no Chinese comments. See merge request: Ascend/msmodeling!176	1 month ago
rotary_embedding.py	chore(ci): adopt pre-commit and retire legacy lintrunner adapters Co-authored-by: liujiawang<anonymousdev@163.com> See merge request: Ascend/msmodeling!176	1 month ago
sampler.py	fix(tensor_cast, web_ui): #38 #39 #40 (including README; same root cause as #24) Co-authored-by: welar<welar.ww@gmail.com> # message auto-generated for no-merge-commit merge: !169 merge fix/38-39-40-tensor-ui-readme into develop Created-by: welar Commit-by: welar Merged-by: ascend-robot Description: ## Motivation - #39: `SamplingMetadata` used dataclass mutable default values for its `torch.Tensor` fields, so multiple instances shared the same Tensor, with a risk of cross-instance contamination and of later in-place writes. - #38: in the `int4` branch, after the dequant `gp_ops` were written into the `torch.float32` bucket, a new empty `ComputeOps()` was assigned, which zeroed the statistics; when bias and dequant have the same dtype, using `=` overwrote `gp_ops` once more. - #40 / #24: the documentation still pointed to the nonexistent `cli.inference.web_ui`; the real entry point is `web_ui.web_ui_start`, so launching as the README described always failed. ## Self-verification - `python -c`: the `selected_token_indices` of two `SamplingMetadata()` instances are distinct objects with equal values. - `python -c`: `_static_quant_linear_properties_helper(..., is_int4=True)` yields `compute_ops[torch.float32].gp_ops > 0` when a bias is present. - A full-text search of the README finds no literal `cli.inference.web_ui`. Fixes #38. Fixes #39. Fixes #40. Fixes #24. See merge request: Ascend/msmodeling!169	1 month ago
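The shared-default problem described for #39 can be reproduced without PyTorch. In this sketch the `Buffer` class is a hypothetical stand-in for `torch.Tensor`, and `SharedMeta` / `IsolatedMeta` are illustrative names rather than the project's `SamplingMetadata`. Using `field(default_factory=...)` is one common remedy and may differ from what the actual fix does.

```python
from dataclasses import dataclass, field


class Buffer:
    """Hypothetical stand-in for a mutable tensor-like object."""

    def __init__(self, values):
        self.values = list(values)


@dataclass
class SharedMeta:
    # One Buffer is created once, at class definition time, and every
    # instance that omits the argument reuses that same object.
    indices: Buffer = Buffer([0])


@dataclass
class IsolatedMeta:
    # default_factory runs per instance, so each one gets its own Buffer.
    indices: Buffer = field(default_factory=lambda: Buffer([0]))


a, b = SharedMeta(), SharedMeta()
a.indices.values.append(99)  # in-place write through one instance
print(b.indices.values)      # the write is visible through the other instance

c, d = IsolatedMeta(), IsolatedMeta()
c.indices.values.append(99)
print(d.indices.values)      # unaffected by the write to c
```

The identity check used in the self-verification (`a.indices is b.indices`) is what distinguishes the two variants: the values can be equal in both, but only the factory version gives each instance its own object.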
utils.py	feat: simulation modeling support for DeepSeek-V4 model adaptation Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com> See merge request: Ascend/msmodeling!166	21 days ago