msmodeling/serving_cast/service · Ascend/MindStudio-Modeling - AtomGit

ascend-robot	feat(serving_cast): support chunked prefill modeling

File	Last commit	Last updated
agg_throughput_optimizer.py	feat(serving_cast): support chunked prefill modeling

Co-authored-by: jia_ya_nan<jiayanan3@h-partners.com>

# message auto-generated for no-merge-commit merge:
!250 merge feat/chunked-prefill-impl into develop

feat(serving_cast): support chunked prefill modeling

Created-by: jia_ya_nan
Commit-by: jia_ya_nan
Merged-by: ascend-robot

Description:

# PR Template

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

- [x] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

In co-located (mixed prefill/decode) mode, throughput_optimizer currently uses max_prefill_tokens as the prefill token budget and requires the effective input length not to exceed it. When the effective_input_length of a long-context request is larger than the token budget, the tool fails with an error and cannot model chunked prefill, which is common in real serving. This PR adds chunked prefill modeling to msmodeling: for long prompts or small batch token budgets, the throughput optimizer automatically splits the prefill into multiple chunks for estimation, and models more faithfully how mixed prefill and decode execution affects TTFT, TPOT, and throughput.

------

## 📝 Modification

Please briefly describe what modification is made in this PR.

- Rename the CLI argument --max-prefill-tokens to --max-batched-tokens; it expresses the token budget of a single prefill / mixed step.
- Add prefill chunk plan generation: when effective_input_length > max_batched_tokens, the prefill is automatically split by max_batched_tokens.
- Add the default scheduling policy DecodeFirstWithSlack, which schedules decode first and allows 15% slack so that the tokens taken by decode do not make a prefill chunk unschedulable.
- In aggregated mode, add a lightweight time simulation of chunked prefill. Requests that have finished prefill can enter decode early; decode no longer waits for all requests to finish prefill.
- In PD-disaggregated mode, the prefill phase supports chunked prefill; the decode phase keeps its original logic.
- Refine the latency cache key so that it distinguishes different query_len, seq_len, and concurrency shapes.
- Add effective_input_length, max_batched_tokens, and prefill_num_chunks to the output, to make the effect of the chunked prefill configuration easier to analyze.
- Update Web UI parameter generation, form validation, related documentation, and unit tests.

------

## 📐 Associated Test Results

Please provide the related test results, such as test reports, etc.

The example uses 32 requests of 32k tokens each.

Without chunking, before the change:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device ATLAS_800_A2_280T_64G --quantize-linear-action DISABLED --input-length 32000 --output-length 1024 --tp-sizes 8 --compile --batch-range 32 32 --num-devices 8 --max-prefill-tokens 32000 --log-level info`

![image.png](https://raw.gitcode.com/user-images/assets/8428112/22af2740-aec6-4e0d-993e-cfe5478e6223/image.png 'image.png')

Without chunking, after the change:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device ATLAS_800_A2_280T_64G --quantize-linear-action DISABLED --input-length 32000 --output-length 1024 --tp-sizes 8 --compile --batch-range 32 32 --num-devices 8 --max-batched-tokens 32000 --log-level info`

![image.png](https://raw.gitcode.com/user-images/assets/8428112/756bb229-befb-4520-ad0c-73fc32da7523/image.png 'image.png')

The results are unchanged, so the previous scheduling logic is not affected.

Chunk size 2000:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device ATLAS_800_A2_280T_64G --quantize-linear-action DISABLED --input-length 32000 --output-length 1024 --tp-sizes 8 --compile --batch-range 32 32 --num-devices 8 --max-batched-tokens 2000 --log-level info`

![image.png](https://raw.gitcode.com/user-images/assets/8428112/9a583c72-25f2-4d67-8396-120256866f93/image.png 'image.png')

Chunk size 4000:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device ATLAS_800_A2_280T_64G --quantize-linear-action DISABLED --input-length 32000 --output-length 1024 --tp-sizes 8 --compile --batch-range 32 32 --num-devices 8 --max-batched-tokens 4000 --log-level info`

![image.png](https://raw.gitcode.com/user-images/assets/8428112/4f83e373-51d6-4101-bc27-41e4aee03b2c/image.png 'image.png')

Chunk size 8000:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device ATLAS_800_A2_280T_64G --quantize-linear-action DISABLED --input-length 32000 --output-length 1024 --tp-sizes 8 --compile --batch-range 32 32 --num-devices 8 --max-batched-tokens 8000 --log-level info`

![image.png](https://raw.gitcode.com/user-images/assets/8428112/457ead7f-6867-4ecd-8f5a-f140de558de0/image.png 'image.png')

The trend matches expectations: the smaller the chunk size, the better the TPOT. At the same time, a smaller chunk size means more scheduling steps in the prefill phase, so TTFT increases. In addition, with chunked prefill enabled the prefill is computed several times, so the time taken grows linearly. Running all prefill chunks in parallel in a single pass is an option, but it would consume a great deal of resources; a performance improvement is planned for the next PR.

------

## 🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

------

## ✅ Checklist

Before PR:

- [x] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [x] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!250	22 days ago
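The chunk plan mentioned in merge request !250 (split the prefill by `max_batched_tokens` whenever `effective_input_length` exceeds it) can be illustrated with a small sketch. This is not the repository's code: the function name `build_prefill_chunk_plan` and the choice to put the remainder in the last chunk are assumptions made for illustration.

```python
from typing import List


def build_prefill_chunk_plan(effective_input_length: int,
                             max_batched_tokens: int) -> List[int]:
    """Return the token count of each prefill chunk (hypothetical helper).

    Assumption: every chunk except possibly the last one uses the full
    budget, and the last chunk carries whatever tokens remain.
    """
    if effective_input_length <= 0 or max_batched_tokens <= 0:
        raise ValueError("lengths and budgets must be positive")
    full_chunks, remainder = divmod(effective_input_length, max_batched_tokens)
    plan = [max_batched_tokens] * full_chunks
    if remainder:
        plan.append(remainder)
    return plan


# A prompt that fits in the budget is a single chunk, i.e. no chunking.
single = build_prefill_chunk_plan(32000, 32000)
# The 32000-token prompt from the test commands with a 2000-token budget.
chunked = build_prefill_chunk_plan(32000, 2000)
```

Under these assumptions, the number of chunks (`len(plan)`) corresponds to what the PR description calls `prefill_num_chunks`: 1 for a budget of 32000 and 16 for a budget of 2000.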
base_throughput_optimizer.py	feat(serving_cast): support chunked prefill modeling. See merge request: Ascend/msmodeling!250	22 days ago
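Merge request !250 states that the latency cache key now distinguishes `query_len`, `seq_len`, and the concurrency shape. The sketch below shows why that matters for chunked prefill: later chunks of the same prompt have the same `query_len` but a longer `seq_len`, so they need separate cache entries. The class names `LatencyKey` and `LatencyCache` and the single `concurrency` integer are hypothetical; the real key layout is not shown in the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LatencyKey:
    """Hashable cache key (hypothetical layout)."""
    query_len: int    # tokens computed in this step
    seq_len: int      # total context length visible to attention
    concurrency: int  # number of requests batched together


class LatencyCache:
    """Memoize an expensive latency estimator by LatencyKey."""

    def __init__(self, estimate: Callable[[LatencyKey], float]) -> None:
        self._estimate = estimate
        self._store: Dict[LatencyKey, float] = {}
        self.misses = 0

    def get(self, query_len: int, seq_len: int, concurrency: int) -> float:
        key = LatencyKey(query_len, seq_len, concurrency)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._estimate(key)
        return self._store[key]


def toy_estimate(key: LatencyKey) -> float:
    # Placeholder cost model for the example only, not a real estimator.
    return key.concurrency * key.query_len * (1.0 + key.seq_len / 1000.0)


cache = LatencyCache(toy_estimate)
first_chunk = cache.get(query_len=2000, seq_len=2000, concurrency=1)
second_chunk = cache.get(query_len=2000, seq_len=4000, concurrency=1)
repeat = cache.get(query_len=2000, seq_len=2000, concurrency=1)  # cache hit
```

A key built from `query_len` alone would return the first chunk's latency for the second chunk as well, which is the kind of collision the PR description appears to address.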
disagg_throughput_optimizer.py	feat(serving_cast): support chunked prefill modeling. See merge request: Ascend/msmodeling!250	22 days ago
optimizer_curve_plots.py	【fix】(curve_plots): raise the truncation threshold for parallel-configuration labels from 48 to 80 characters

Co-authored-by: eveyin1<qianyin2022@hotmail.com>

# message auto-generated for no-merge-commit merge:
!239 merge develop into develop

【fix】(curve_plots): raise the truncation threshold for parallel-configuration labels from 48 to 80 characters

Created-by: eveyin1
Commit-by: eveyin1
Merged-by: ascend-robot

Description:

PR Type

- [ ] Feature
- [x] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

## 📐 Associated Test Results

### Before the change

![2222222222.jpg](https://raw.gitcode.com/user-images/assets/8428112/5471b992-3d8f-4dcc-8c01-fae4e08c3aab/2222222222.jpg '2222222222.jpg')

### After the change

![1111111111111111.jpg](https://raw.gitcode.com/user-images/assets/8428112/3eab835a-e255-40a9-81c2-45844a15a8a0/1111111111111111.jpg '1111111111111111.jpg')

------

## ✅ Checklist

Before PR:

- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [ ] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!239	1 month ago
optimizer_factory.py	chore(ci): adopt pre-commit and retire legacy lintrunner adapters

Co-authored-by: liujiawang<anonymousdev@163.com>

# message auto-generated for no-merge-commit merge:
!176 merge pre-commit into develop

chore(ci): adopt pre-commit and retire legacy lintrunner adapters

Created-by: AvadaKedavrua
Commit-by: liujiawang;AvadaKedavrua
Merged-by: ascend-robot

Description:

PR Type

- [ ] Feature
- [ ] Bugfix
- [x] Docs
- [x] CI/CD
- [ ] Refactor
- [ ] Perf
- [ ] Test-Cases
- [ ] Other

------

## Motivation

Continue the pre-commit migration: tighten Pylint so only high-signal messages run (`disable=all` + explicit `enable` list), fix real issues that remained under that profile, and translate hook/config comments to English.

------

## Configuration changes (tooling & comments only)

| Path | What changed |
|------|----------------|
| `pre-commit/pyproject.toml` | Pylint: `[tool.pylint."messages control"]` with `disable = ["all"]` and a short allowlist of message IDs (E0100, E0601–E0611, E0632, E1101, E1120, W0632, W1514). Ruff: unchanged behavior; comments translated to English. Bandit: comments translated; rule allowlist/skip lists unchanged. |
| `.pre-commit-config.yaml` | Comments translated to English; Bandit hook display name set to bandit (Python security checks). Hook versions and args unchanged except for comment text. |

------

## Source code changes (application code)

| Area | Files | Purpose |
|------|--------|---------|
| `serving_cast` | `communication.py`, `engine.py`, `instance.py`, `kv_cache_manager.py`, `load_gen.py`, `main.py`, `model_runner.py`, `request.py`, `serving.py`, `utils.py` | Replace `from . import stime` with `import serving_cast.stime as stime` so Pylint resolves imports (fixes E0611). |
| `serving_cast` | `stime.py` | Singleton salabim `Environment` via `_get_sim_env()` so type checkers/Pylint see `sim.Environment` (fixes E1101 on `SimulationEnv`). |
| `serving_cast/service` | `base_throughput_optimizer.py` | `__init__` defaults + `assert runner is not None` before `run_inference` (fixes E1101 on base class). |
| `tensor_cast` | `diffusers/diffusers_model.py`, `diffusers/diffusers_utils.py`, `runtime.py` | Add `encoding="utf-8"` to `open()` / trace export (fixes W1514). |
| `web_ui` | `callbacks.py` | `refresh_optimizer_detail`: call `_optimizer_detail_view(rows, None, device)` and unpack five return values (fixes E1120). |

------

## Recent commits on `pre-commit` branch

- `ci(pre-commit): fix pylint message selection with disable=all`
- `fix: resolve pylint findings in serving_cast, tensor_cast, and web_ui`
- `docs(pre-commit): translate comments to English and add all-files run log`

------

![](https://raw.atomgit.com/Ascend/msmodeling/attachment/uploads/b22b18aa-4c84-4dc0-85f5-1e7e0715350e/pre-commit-all-files-run.svg)

------

## Checklist

- [x] Please ensure code files contain no Chinese comments.

See merge request: Ascend/msmodeling!176	1 month ago
optimizer_summary.py	Perf(serving_cast): Improve the efficiency of search methods

Co-authored-by: cmh1056291129<chenminghaoscu@163.com>

# message auto-generated for no-merge-commit merge:
!199 merge develop into develop

Perf(serving_cast): Improve the efficiency of search methods

Created-by: cmh1056291129
Commit-by: HMCCMH;cmh1056291129
Merged-by: ascend-robot

Description:

PR Type

- [ ] Feature
- [ ] Bugfix
- [ ] Docs
- [ ] CI/CD
- [ ] Refactor
- [x] Perf
- [ ] Test-Cases
- [ ] Other

## 🔍 Motivation

In base_throughput_optimizer.py, the run function's search over batch_size currently fixes the right boundary to a multiple of 512. The unused headroom in that range increases the number of iterations of the subsequent binary search.

------

## 📝 Modification

1. In the run function of base_throughput_optimizer.py, set the initial right boundary of the binary search from the estimation function _estimate_right_boundary.
2. agg_throughput_optimizer.py and disagg_throughput_optimizer.py add memory_info to the summary.
3. optimizer_summary.py adds a function for setting memory_info.
4. The CLI gains a relaxation factor that loosens the estimated result.

------

## 📐 Associated Test Results

| Command | Method | End-to-end time (s) |
|--|--|--|
| Command 1 | Baseline | 69.49 |
| Command 1 | Proposed | 62.21 (↑10.48%) |
| Command 2 | Baseline | 48.37 |
| Command 2 | Proposed | 43.01 (↑11.08%) |
| Command 3 | Baseline | 46.49 |
| Command 3 | Proposed | 42.54 (↑8.50%) |

CPU: Intel Core i7-8750H

Command 1:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device TEST_DEVICE --num-devices 8 --input-length 3500 --output-length 1500 --compile --quantize-linear-action W8A8_DYNAMIC --quantize-attention-action DISABLED --tpot-limits 50`

Baseline: ![命令1_baseline.png](https://raw.gitcode.com/user-images/assets/8428112/9fa1532a-40dd-41cd-b55f-5c144bd717aa/命令1_baseline.png '命令1_baseline.png')

Proposed: ![命令1_own.png](https://raw.gitcode.com/user-images/assets/8428112/6399de26-74da-4914-976d-408f0d296d2a/命令1_own.png '命令1_own.png')

CPU: Ascend, 32 vCPUs, 64 GiB

Command 2:

`python -m cli.inference.throughput_optimizer deepseek-ai/DeepSeek-V3.2 --device TEST_DEVICE --num-devices 32 --input-length 1024 --output-length 1024 --tpot-limits 100 --quantize-attention-action INT8 --disagg`

Baseline: ![命令2_baseline.png](https://raw.gitcode.com/user-images/assets/8428112/1b67f6cf-3637-4fd7-a983-46716a9ecacd/命令2_baseline.png '命令2_baseline.png')

Proposed: ![命令2_own.png](https://raw.gitcode.com/user-images/assets/8428112/cdf7c4a7-0a07-41be-85ef-a4f86bb785b5/命令2_own.png '命令2_own.png')

Command 3:

`python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B --device TEST_DEVICE --num-devices 8 --input-length 3500 --output-length 1500 --compile --tpot-limits 50 --slo-relax-fator 2`

Baseline: ![命令3_baseline.png](https://raw.gitcode.com/user-images/assets/8428112/1c9f70bc-44d5-4b7f-ae0e-c7f52093c4eb/命令3_baseline.png '命令3_baseline.png')

Proposed: ![命令3_own.png](https://raw.gitcode.com/user-images/assets/8428112/fc254d8c-af41-4b4b-9a2d-53e705827e34/命令3_own.png '命令3_own.png')

------

## ✅ Checklist

Before PR:

- [ ] Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
- [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- [ ] All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.
- [x] Please ensure code files contain no Chinese comments.

------

See merge request: Ascend/msmodeling!199	28 days ago
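The effect described in merge request !199 (a tighter starting right boundary means fewer binary-search probes) can be shown with a toy search. Everything here is a sketch under stated assumptions: `max_feasible_batch` and `relaxed_right_boundary` are hypothetical names, the feasibility predicate stands in for a full latency evaluation, and feasibility is assumed to be monotone in the batch size. The source names `_estimate_right_boundary` but does not show it.

```python
import math
from typing import Callable, Tuple


def max_feasible_batch(is_feasible: Callable[[int], bool],
                       right: int) -> Tuple[int, int]:
    """Largest batch size in [1, right] that is feasible, plus probe count.

    Assumes is_feasible is monotone: once a batch size fails, all larger
    batch sizes fail too. Returns (0, probes) if nothing is feasible.
    """
    lo, hi, best, probes = 1, right, 0, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if is_feasible(mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best, probes


def relaxed_right_boundary(estimate: int, relax_factor: float) -> int:
    """Loosen an estimated bound by a relaxation factor (hypothetical)."""
    return max(1, math.ceil(estimate * relax_factor))


def fits(batch_size: int) -> bool:
    # Toy predicate: pretend batch sizes up to 37 meet the SLO.
    return batch_size <= 37


wide_best, wide_probes = max_feasible_batch(fits, right=512)
tight_right = relaxed_right_boundary(estimate=40, relax_factor=1.2)
tight_best, tight_probes = max_feasible_batch(fits, right=tight_right)
```

Both searches find the same batch size, but the search that starts from the estimated boundary needs fewer probes. The relaxation factor guards against an estimate that is too low: if the true optimum lies above the right boundary, the search can only return the boundary itself.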
pd_ratio_throughput_optimizer.py	feat(serving_cast): support chunked prefill modeling. See merge request: Ascend/msmodeling!250	22 days ago
scheduler.py	feat(serving_cast): support chunked prefill modeling	22 days ago
utils.py	feat(serving_cast): support chunked prefill modeling	22 days ago
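
The merge request above tests a 32000-token input against several `--max-batched-tokens` values. As a rough illustration of why smaller budgets mean more prefill steps, the sketch below splits one prefill into chunks no larger than the token budget. The function name `plan_prefill_chunks` is hypothetical and this is not the code from `scheduler.py`; it also ignores the decode tokens and the 15% slack that the `DecodeFirstWithSlack` policy accounts for.

```python
def plan_prefill_chunks(effective_input_length: int, max_batched_tokens: int) -> list[int]:
    """Split one prefill into chunk lengths that each fit the token budget.

    Hypothetical helper for illustration only: every chunk except possibly
    the last one uses the full budget.
    """
    if effective_input_length <= 0 or max_batched_tokens <= 0:
        raise ValueError("both lengths must be positive")
    # Integer ceiling division: number of chunks needed to cover the input.
    num_chunks = -(-effective_input_length // max_batched_tokens)
    chunks = [max_batched_tokens] * (num_chunks - 1)
    # The last chunk carries whatever remains after the full-size chunks.
    chunks.append(effective_input_length - max_batched_tokens * (num_chunks - 1))
    return chunks


if __name__ == "__main__":
    # Budgets taken from the commands in the merge request description.
    for budget in (32000, 8000, 4000, 2000):
        chunks = plan_prefill_chunks(32000, budget)
        print(f"max_batched_tokens={budget}: {len(chunks)} chunk(s)")
```

Under these assumptions a 32000-token prompt needs 1, 4, 8, and 16 prefill steps for budgets of 32000, 8000, 4000, and 2000 respectively, which is consistent with the reported trend that smaller chunks raise TTFT.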