| File | Last commit | Last updated |
|---|---|---|
| | 【Bugfix】Fix incorrect KV cache calculation for the deepseek-v4 model. Co-authored-by: ChenHuiwen<chenhuiwen7@huawei.com> # message auto-generated for no-merge-commit merge: !321 merge ds-kvcache-fix into develop. Created-by: ChenHuiwen Commit-by: ChenHuiwen Merged-by: ascend-robot Description: **PR Type** - [x] Feature - [x] Bugfix - [ ] Docs - [ ] CI/CD - [ ] Refactor - [x] Perf - [ ] Test-Cases - [ ] Other ## 🔍 Motivation This PR fixes inaccurate DeepSeek V4 KV cache sizing and memory estimation in msmodeling. The previous implementation used the full paged KV cache footprint for DeepSeek V4 sparse/compressed attention instead of following V4's compressed-cache semantics, which over-counted KV cache memory and skewed throughput and device memory estimates. ------ ## 📝 Modification - Fix DeepSeek V4 main KV cache sizing according to `compress_ratio`, `sliding_window`, batch size, and total KV tokens. - Keep DeepSeek V4 main KV cache dtype as model dtype, while allowing indexer cache to follow attention quantization dtype. - Add compressed sizing for DeepSeek V4 indexer cache, gated explicitly by `model_type == "deepseek_v4"` to avoid affecting other MLA/DSA models. - Update input generation paths to pass batch/token information into KV cache helpers. - Calibrate multiple DeepSeek V4 analytic performance model operators to better match the reference fused-kernel behavior and avoid double-counted memory traffic. - Add `--quantize-backbone-linear-action` to support different quantization actions for backbone linear layers and routed MoE experts. ------ ## 📐 Associated Test Results Not run yet in this commit. See merge request: Ascend/msmodeling!321 | 17 days ago |
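| | support kimi k2.5. Co-authored-by: wangshen001<wangshen34@h-partners.com> # message auto-generated for no-merge-commit merge: !200 merge support-kimi_k2.5 into develop. Created-by: wangshen001 Commit-by: wangshen001 Merged-by: ascend-robot Description: 1.1 Screenshots of plain-text `text_generate` runs (prefill and decode, each with command parameters and execution results). 1.2 Screenshots of multimodal `text_generate` runs (prefill and decode, each with command parameters and execution results). 2. `throughput_optimizer` run results (command parameters and execution results). 3. So far only measured profiling data for plain-text decode is available (the input parameters match the plain-text run above). The gap between simulation and measurement is (43.187620 - 39.137) / 43.187620 * 100% = 9.38%; the acceptance requirement is decode accuracy within 20%, so the accuracy meets expectations. 4. The command needs one additional parameter, `--enable-shared-expert-tp`, so that the Shared Expert is also split with TP. 5. Why this much code is needed to adapt the Kimi K2.5 model: 5.1: The framework passes visual token boundaries as `image_grid_thw`, but the Kimi K2.5 forward parameter is named `grid_thws`. As a result, key framework-injected parameters such as KV cache and `attention_meta` are dropped during kwargs filtering, and the vision encoder crashes when it receives None. 5.2: In standard MLA the framework injects `position_embeddings`, whereas the Kimi K2.5 decoder passes only `position_ids` and computes RoPE internally. Without cos/sin, the framework's MLA layer degrades to an identity rotation and the simulation results become inaccurate. 5.3: `n_routed_experts` is not at the config root but inside `text_config`. The framework's `patch_moe` reads it from the root and fails with AttributeError. 5.4: `MoonViT3dPatchEmbed` projects with Conv2d, but during simulation the tokens are flattened to 2D while Conv2d expects 4D input, so a manual reshape + linear projection is required. `MoonViT3dEncoder` also lacks the `use_deterministic_attn` attribute and does not register the tensor_cast attention backend. 5.5: transformers v5 removed `is_torch_fx_available`, but the Kimi K2.5 remote code still calls it; Windows has no `signal.SIGALRM`, which crashes the `trust_remote_code` interactive prompt. Neither issue is specific to Kimi K2.5, but because Kimi K2.5 requires `trust_remote_code`, both are fixed in this file. See merge request: Ascend/msmodeling!200 | 1 month ago |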
| | feat(multistream): add compile-time multistream scheduling (core only). Co-authored-by: Kudo__shinichi<liuning119@huawei.com> # message auto-generated for no-merge-commit merge: !117 merge feat/multistream-design into develop. Created-by: Kudo__shinichi Commit-by: Kudo__shinichi Merged-by: ascend-robot Description: **PR Type** - [x] Feature - [ ] Bugfix - [ ] Docs - [ ] CI/CD - [ ] Refactor - [x] Perf - [x] Test-Cases - [ ] Other ## 🔍 Motivation The torch.compile path currently lacks a general multistream scheduling capability. Overlap between communication and compute relies mainly on local modeling of a few existing fused operators, so ordinary compute / collective nodes in the FX graph cannot be scheduled uniformly at compile time. This PR aims to: 1. introduce controllable multistream scheduling in the torch.compile path; 2. shorten the critical path where communication and compute have an overlap window; 3. fall back automatically through a benefit guard when no gain is predicted, leaving the existing single-stream behavior unchanged; 4. keep the implementation simple and reuse existing compile / runtime / performance model capabilities wherever possible; 5. fix distorted activation memory statistics caused by multistream control anchors taking part in memory tracking. ## 📝 Modification This PR contains the following changes: 1. `tensor_cast/config.py` - add multistream configuration options; - support role-based stream mapping; - keep legacy fields for compatibility; - remove pass-local hard-coded bandwidth defaults, so scheduling cost prefers the analytic performance model and device profile information. 2. `tensor_cast/core/model_builder.py` - pass the current device information when building the compile backend; - this lets the multistream pass do cost estimation from the current device profile. 3. `tensor_cast/compilation/compile_backend.py` - hook the multistream pass into the compile rewrite flow; - following the reviewer's suggestion, run the multistream pass before `decompose_auto_functionalized_pass`; - the reason is that the multistream pass calls DCE internally and must run on a pure-functional graph, so that the mutation-style graph produced by defunctionalization cannot affect semantic correctness. 4. `tensor_cast/compilation/passes/multistream_pass.py` - introduce the compile-time multistream schedule pass; - classify nodes by execution resource as COMM_ONLY, HYBRID, or COMPUTE; - model collective nodes such as all_reduce / all_gather / reduce_scatter / all_to_all as communication nodes; - model fused nodes such as matmul_all_reduce / static_quant_linear_all_reduce as hybrid nodes; - perform lowering through `_internal_wait_and_bind` / `_internal_record`; - add a benefit guard that applies the rewrite only when the predicted multistream makespan beats the single-stream baseline; - keep non-OpOverload helper nodes out of analytic cost estimation, so helpers such as `operator.getitem` are not mistakenly modeled as device operators. 5. `tensor_cast/runtime.py` - record stream / dependency tokens in multistream runtime events; - have the memory tracker replay events in a multistream dependency-aware order, which reflects the extended activation lifetime under multistream more accurately; - exclude multistream internal anchor ops from model activation memory statistics, so control anchors do not inflate memory results. 6. tests - add basic coverage for the multistream pass; - add coverage for the runtime critical path and anchor memory; - cover key behaviors including the benefit guard, anchor lowering, helper node handling, and multistream memory accounting. ## 📐 Associated Test Results Single-stream example: python -m tensor_cast.scripts.text_generate deepseek-ai/DeepSeek-V3.1 --device ATLAS_800_A3_560T_128G_DIE --num-queries 64 --query-length 1 --context-length 1024 --world-size 16 --tp-size 8 --dp-size 2 --moe-tp-size 4 --moe-dp-size 1 --ep-size 4 --decode --compile --compile-allow-graph-break --disable-repetition --num-hidden-layers-override 4 --quantize-attention-action INT8 --chrome-trace trace_ds_single_l4_q64_ctx1024.json --log-level info Multistream example: python -m tensor_cast.scripts.text_generate deepseek-ai/DeepSeek-V3.1 --device ATLAS_800_A3_560T_128G_DIE --num-queries 64 --query-length 1 --context-length 1024 --world-size 16 --tp-size 8 --dp-size 2 --moe-tp-size 4 --moe-dp-size 1 --ep-size 4 --decode --compile --compile-allow-graph-break --disable-repetition --num-hidden-layers-override 4 --quantize-attention-action INT8 --chrome-trace trace_ds_multi_l4_q64_ctx1024_current.json --log-level info Key results: single stream (baseline): Total time for analytic 20.729 ms, Execution time 0.020729 s, TPS/Device 193 token/s; multistream (multistream enabled): Total time for analytic 20.687 ms, Execution time 0.019750 s, TPS/Device 202.5 token/s. Performance comparison: - With multistream, Execution time drops from 0.020729 s to 0.019750 s, a latency reduction of about 4.72%. - TPS/Device rises from 193 token/s to 202.5 token/s, an improvement of about 4.92%. ------ ## 🌟 Use cases (Optional) Scenarios suited to validating multistream gains in the current version: 1. decode scenarios with a high communication share; 2. scenarios with many TP/EP collectives and independent compute/comm overlap windows; 3. scenarios that want a conservative scheduling attempt on the compile side and require automatic fallback when there is no gain. Known limits of the current version: 1. in dense / memory-bound scenarios the benefit guard may skip multistream entirely; 2. HYBRID fused operators are still modeled as black-box nodes on the main stream, leaving room for finer modeling later. ------ ## ✅ Checklist **Before PR**: - [x] Linting tools are used to fix the potential lint issues. - [x] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests. - [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness. - [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes. - [x] Please ensure code files contain no Chinese comments. See merge request: Ascend/msmodeling!117 | 1 month ago |
| | 【feat】Add operator bound breakdown reporting to text_generate. Co-authored-by: lutean<lutean1@huawei.com> # message auto-generated for no-merge-commit merge: !246 merge develop into develop. Created-by: lutean Commit-by: lutean Merged-by: ascend-robot Description: **PR Type** - [x] Feature - [ ] Bugfix - [ ] Docs - [ ] CI/CD - [ ] Refactor - [ ] Perf - [ ] Test-Cases - [ ] Other ## 🔍 Motivation When using `text_generate`, users and developers need each operator's bound information, both to locate problems and to judge whether results are reasonable. Currently that information can only be viewed through the `--chrome-trace` output. This PR adds the `--dump-op-bound-results` parameter; when it is enabled, the communication, compute, and memory-access share of each operator is also reported. ------ ## ✅ Checklist **Before PR**: - [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests. - [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness. - [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes. - [ ] Please ensure code files contain no Chinese comments. ------ See merge request: Ascend/msmodeling!246 | 16 days ago |
| | refactor(tensor_cast): unify word embedding tp config. Co-authored-by: Kudo__shinichi<liuning119@huawei.com> # message auto-generated for no-merge-commit merge: !344 merge codex/word-embedding-tp-normalize into develop. Created-by: Kudo__shinichi Commit-by: Kudo__shinichi Merged-by: ascend-robot Description: **PR Type** - [ ] Feature - [ ] Bugfix - [x] Docs - [ ] CI/CD - [x] Refactor - [ ] Perf - [x] Test-Cases - [ ] Other ## 🔍 Motivation `word_embedding_tp` and `word_embedding_tp_mode` represented the same configuration concept in two fields: one field toggled word embedding TP, and the other selected the TP mode. This PR reduces the public and internal configuration shape to a single parameter, so users only need to configure `word_embedding_tp` as disabled, col, or row. ------ ## 📝 Modification - Make `UserInputConfig.word_embedding_tp` the single nullable word embedding TP mode field. - Remove `word_embedding_tp_mode` and `embedding_parallel_mode` from the config model. - Pass the normalized `word_embedding_tp` mode directly into `ParallelConfig.embedding_parallel` and the embedding transformation. - Keep legacy bool input normalization for compatibility: True -> col, False/None -> disabled. - Remove redundant CLI-side bool/mode conversion and update related benchmark cases and user guide docs. - Add regression coverage for single-field config, legacy bool normalization, and invalid `word_embedding_tp` values. ------ ## 📐 Associated Test Results - python -m pytest tests/regression/tensor_cast/test_user_config.py -q: 6 passed - python -m pytest tests/regression/tensor_cast/test_user_config.py tests/regression/web_ui/test_command_builder.py tests/regression/tensor_cast/test_adapter_automation.py -q: 98 passed - python -m pytest tests/regression/tensor_cast/test_text_generate.py -k word_embedding_parallel -q: 2 passed, 113 deselected - python -m pytest tests/regression/tensor_cast/test_sequence_parallel_pass.py -o addopts= -m "nightly and not npu and not network" -q: 2 passed - python -m pytest tests/benchmark/models/test_model_regression.py --collect-only -q: 15 tests collected - python -m ruff check <changed python files>: All checks passed - python -m pre_commit run --from-ref origin/develop --to-ref HEAD: passed - git diff --check HEAD~1 HEAD: passed ------ ## 🌟 Use cases (Optional) - Disable word embedding TP: `word_embedding_tp=None` - Enable column mode: `word_embedding_tp="col"` - Enable row mode: `word_embedding_tp="row"` - CLI usage: `--word-embedding-tp col` or `--word-embedding-tp row` ------ ## ✅ Checklist **Before PR**: - [x] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests. - [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness. - [x] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes. - [x] Please ensure code files contain no Chinese comments. ------ See merge request: Ascend/msmodeling!344 | 16 days ago |
| | chore(ci): adopt pre-commit and retire legacy lintrunner adapters. Co-authored-by: liujiawang<anonymousdev@163.com> # message auto-generated for no-merge-commit merge: !176 merge pre-commit into develop. Created-by: AvadaKedavrua Commit-by: liujiawang;AvadaKedavrua Merged-by: ascend-robot Description: **PR Type** - [ ] Feature - [ ] Bugfix - [x] Docs - [x] CI/CD - [ ] Refactor - [ ] Perf - [ ] Test-Cases - [ ] Other ------ ## Motivation Continue the **pre-commit** migration: tighten **Pylint** so only high-signal messages run (`disable=all` + explicit enable list), fix real issues that remained under that profile, and translate hook/config comments to **English**. ------ ## Configuration changes (tooling & comments only) `pre-commit/pyproject.toml`: **Pylint:** `[tool.pylint."messages control"]` with `disable = ["all"]` and a short **allowlist** of message IDs (E0100, E0601–E0611, E0632, E1101, E1120, W0632, W1514). **Ruff:** unchanged behavior; comments translated to English. **Bandit:** comments translated; rule allowlist/skip lists unchanged. `.pre-commit-config.yaml`: comments translated to English; Bandit hook display name set to **bandit (Python security checks)**. Hook versions and args unchanged except for comment text. ------ ## Source code changes (application code) serving_cast (communication.py, engine.py, instance.py, kv_cache_manager.py, load_gen.py, main.py, model_runner.py, request.py, serving.py, utils.py): replace `from . import stime` with `import serving_cast.stime as stime` so Pylint resolves imports (fixes **E0611**). serving_cast (stime.py): singleton **salabim** Environment via `_get_sim_env()` so type checkers/Pylint see **sim.Environment** (fixes **E1101** on SimulationEnv). serving_cast/service (base_throughput_optimizer.py): `__init__` defaults + `assert runner is not None` before `run_inference` (fixes **E1101** on base class). tensor_cast (diffusers/diffusers_model.py, diffusers/diffusers_utils.py, runtime.py): add **encoding="utf-8"** to `open()` / trace export (fixes **W1514**). web_ui (callbacks.py): **refresh_optimizer_detail:** call `_optimizer_detail_view(rows, None, device)` and unpack five return values (fixes **E1120**). ------ ## Recent commits on pre-commit branch - ci(pre-commit): fix pylint message selection with disable=all - fix: resolve pylint findings in serving_cast, tensor_cast, and web_ui - docs(pre-commit): translate comments to English and add all-files run log ------ ## Checklist - [x] Please ensure code files contain no Chinese comments. See merge request: Ascend/msmodeling!176 | 1 month ago |
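
The legacy normalization rule described in merge request !344 (True -> col, False/None -> disabled) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `normalize_word_embedding_tp` and the constant `VALID_MODES` are hypothetical and are not taken from the msmodeling code base, and "disabled" is represented here as `None`, matching the `word_embedding_tp=None` use case in that description.

```python
from typing import Optional, Union

# Assumption: the two accepted mode strings, as listed in the MR description.
VALID_MODES = ("col", "row")


def normalize_word_embedding_tp(value: Union[bool, str, None]) -> Optional[str]:
    """Hypothetical helper: map user input to "col", "row", or None (disabled)."""
    # Legacy bool input kept for compatibility: True -> "col", False -> disabled.
    if isinstance(value, bool):
        return "col" if value else None
    # None means word embedding TP is disabled.
    if value is None:
        return None
    # Already a valid mode string: pass it through unchanged.
    if isinstance(value, str) and value in VALID_MODES:
        return value
    # Anything else is rejected, mirroring the "invalid values" regression case.
    raise ValueError(f"invalid word_embedding_tp value: {value!r}")
```

Checking `bool` before anything else keeps the legacy path explicit, so an old config that passes `True` ends up in the same place as a new one that passes `"col"`.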
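
The percentage figures quoted in merge requests !200 and !117 are all relative changes against a reference value, and they can be rechecked with a few lines. The function name `relative_change_percent` is an illustrative assumption; the input numbers are the ones given in the two descriptions. With these inputs the three results come out at roughly 9.38%, 4.72%, and 4.92%.

```python
def relative_change_percent(reference: float, value: float) -> float:
    """Return |reference - value| / reference, expressed as a percentage."""
    return abs(reference - value) / reference * 100.0


# Kimi K2.5 decode: simulated vs. measured, using the figures from !200.
kimi_gap = relative_change_percent(43.187620, 39.137)

# Multistream (!117): execution time in seconds, single stream vs. multistream.
latency_drop = relative_change_percent(0.020729, 0.019750)

# Multistream (!117): TPS/Device in token/s, single stream vs. multistream.
tps_gain = relative_change_percent(193.0, 202.5)

print(f"{kimi_gap:.2f}% {latency_drop:.2f}% {tps_gain:.2f}%")
```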