msmodeling/tensor_cast/performance_model/builtin_model · Ascend/MindStudio-Modeling - AtomGit

ascend-robot [Bugfix] Fix incorrect KV cache calculation for the deepseek-v4 model

File	Last commit	Last updated
__init__.py	feat: Add DeepSeek V4 model support to simulation modeling	22 days ago

Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com>
# message auto-generated for no-merge-commit merge:
!166 merge deepseek-v4 into develop

feat: Add DeepSeek V4 model support to simulation modeling

Created-by: ChenHuiwen
Commit-by: ChenHuiwen
Merged-by: ascend-robot
Description:

PR Type
- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

Add end-to-end support for the DeepSeek V4 (Flash/Pro) models to msmodeling/tensor_cast, so that its performance-modeling pipeline covers the new structures introduced by V4: sparse attention (NSA / Window / Compressed / Heavily-Compressed routing across multiple layer types), HC (Head Compression) mixing, the Sinkhorn split, and Hash Routing MoE. The PR also adds the corresponding fake-tensor semantic operators and cost models, so that V4 models can run through the existing analytic / multistream tracing flows directly.

------

## 📝 Modification

New files
- tensor_cast/transformers/builtin_model/deepseek_v4.py: the DeepSeek V4 builtin model profile, including DeepseekV4Config / DeepseekV4Model registration, layer-type validation ({0, 4, 128} map to sliding_attention / compressed_sparse_attention / heavily_compressed_attention), and safe registration with transformers AutoConfig / AutoModel.
- tests/test_tensor_cast/test_deepseek_v4.py and tests/test_tensor_cast/data/deepseek_v4/.json: test data and test cases for the V4 model (covering valid, invalid, missing, and truncated ratios configurations).

Attention (tensor_cast/layers/mla.py, tensor_cast/ops/mla.py, tensor_cast/ops/rotary_embedding.py)
- Add DeepseekV4SparseAttention and adapt MultiheadLatentAttentionTensorCast (including requires_legacy_kv_b_decomposition, the KV-cache window write path, etc.).
- Add the index-generation utilities get_window_topk_idxs / get_compress_topk_idxs.
- Add the HC-path semantic operators hc_pre_inv_rms and hc_pre_sinkhorn, which correspond to the inverse-RMS scaling and the Sinkhorn weighted reduction in the reference implementation.
- Add cost models for KV write operators such as scatter_nd_update_mla. Following the reference implementation, they count only the source-row reads and the updated-row writes, not slot_mapping or the whole cache tensor.

MoE / Gate (tensor_cast/layers/moe_layer.py, tensor_cast/ops/fused_moe.py)
- MoELayer gains a unified V4 gating path: it recognizes the is_v4 / hash flags on the gate and emits the matmul + score func + indices + gather/normalize/route_scale operators in the order of the reference Gate.forward, so that each step is costed separately at its real dtype (the gate matmul runs in fp32).
- Add two semantic operators: moe_gating_top_k (V4 non-hash layers, with an optional bias) and moe_gating_top_k_hash (hash-routing layers based on the tid2eid table).

Performance Model (tensor_cast/performance_model/__init__.py)
- Introduce the _safe_max_int utility: when tensor.max().item() is unavailable on fake / meta / functional tensors, it falls back to None so that the caller uses a shape-based estimate.
- Register PerformanceProperties for the new V4 operators (scatter_nd_update_mla, the HC series, the new MoE gating tail, etc.), aligned with the memory-access semantics of the reference implementation.

Misc
- tensor_cast/core/config_resolver.py, input_generator.py, model_runner.py, device.py, transformers/transformations.py, transformers/custom_model_registry.py, layers/utils.py, model_config.py, compilation/passes/multistream_pass.py: wire V4 into config resolution, input construction, runner scheduling, device profiles, model transformations, and operator registration.

------

## 📐 Associated Test Results

![image.png](https://raw.gitcode.com/user-images/assets/8428112/4dbd32d5-6f6d-4b84-a840-a06eec62fc40/image.png 'image.png')
![image.png](https://raw.gitcode.com/user-images/assets/8428112/fda50383-9b30-4453-bfd1-391889bebb47/image.png 'image.png')

------

## ✅ Checklist

Before PR:
- [ ] [Linting tools](https://gitcode.com/Ascend/msmodeling/blob/develop/tensor_cast/README.md#coding-style) are used to fix the potential lint issues.
- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [ ] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!166

deepseek_v4.py	[Bugfix] Fix incorrect KV cache calculation for the deepseek-v4 model	16 days ago

Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com>
# message auto-generated for no-merge-commit merge:
!321 merge ds-kvcache-fix into develop

[Bugfix] Fix incorrect KV cache calculation for the deepseek-v4 model

Created-by: ChenHuiwen
Commit-by: ChenHuiwen
Merged-by: ascend-robot
Description:

PR Type
- [x] Feature
- [x] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [x] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

This PR fixes inaccurate DeepSeek V4 KV cache sizing and memory estimation in msmodeling. The previous implementation used the full paged KV cache footprint for DeepSeek V4 sparse/compressed attention instead of the compressed-cache semantics. This over-counted KV cache memory and reduced the accuracy of throughput and memory estimates.

------

## 📝 Modification

- Fix DeepSeek V4 main KV cache sizing according to `compress_ratio`, `sliding_window`, batch size, and total KV tokens.
- Keep DeepSeek V4 main KV cache dtype as model dtype, while allowing indexer cache to follow attention quantization dtype.
- Add compressed sizing for DeepSeek V4 indexer cache, gated explicitly by `model_type == "deepseek_v4"` to avoid affecting other MLA/DSA models.
- Update input generation paths to pass batch/token information into KV cache helpers.
- Calibrate multiple DeepSeek V4 analytic performance model operators to better match the reference fused-kernel behavior and avoid double-counted memory traffic.
- Add `--quantize-backbone-linear-action` to support different quantization actions for backbone linear layers and routed MoE experts.

------

## 📐 Associated Test Results

Not run yet in this commit.

![image.png](https://raw.gitcode.com/user-images/assets/8428112/ac052839-697c-40c8-adbd-ac845fc33a5f/image.png 'image.png')

See merge request: Ascend/msmodeling!321
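
The KV cache fix above sizes the DeepSeek V4 main cache from `compress_ratio`, `sliding_window`, batch size, and total KV tokens, but the commit message does not state the formula. The sketch below is only an illustration of why compressed sizing is smaller than the full paged footprint. The function names, the rule that ratio 0 means a pure sliding-window layer, and the "window slots plus one slot per `compress_ratio` tokens" rule are all assumptions made for this example, not the msmodeling implementation.

```python
import math


def estimate_kv_cache_tokens(total_kv_tokens: int, batch_size: int,
                             sliding_window: int, compress_ratio: int) -> int:
    """Hypothetical per-layer token-slot estimate (not the msmodeling formula)."""
    if min(total_kv_tokens, sliding_window, compress_ratio) < 0 or batch_size <= 0:
        raise ValueError("sizes must be non-negative and batch_size positive")
    # Assumption: each sequence keeps at most `sliding_window` recent tokens.
    window_slots = min(total_kv_tokens, batch_size * sliding_window)
    if compress_ratio == 0:
        # Assumption: ratio 0 marks a pure sliding-window layer.
        return window_slots
    # Assumption: compressed layers add one slot per `compress_ratio` tokens.
    compressed_slots = math.ceil(total_kv_tokens / compress_ratio)
    return window_slots + compressed_slots


def estimate_kv_cache_bytes(slots: int, entry_dim: int, dtype_bytes: int) -> int:
    """Bytes for `slots` cache entries of `entry_dim` elements each."""
    return slots * entry_dim * dtype_bytes


if __name__ == "__main__":
    total, batch, window = 8192, 2, 128
    print("full paged footprint:", total, "slots")
    for ratio in (0, 4, 128):
        slots = estimate_kv_cache_tokens(total, batch, window, ratio)
        print(f"compress_ratio={ratio}: {slots} slots")
```

Under these assumptions, 8192 total KV tokens across a batch of 2 with a 128-token window need 256, 2304, and 320 slots for ratios 0, 4, and 128, compared with 8192 slots for the full footprint. This gap is the kind of over-count the PR describes.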