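cann-robot: Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths

Co-authored-by: xionglan2<xionglan2@huawei.com>
# message auto-generated for no-merge-commit merge: !350 merge feature/dispatch-combine-examples into master
Created-by: xionglan2 Commit-by: xionglan2 Merged-by: cann-robot

Description:

## Description
This PR adds four MoE communication operators to the repository: `dispatch`, `combine`, `dispatch_doubleplane`, and `combine_doubleplane`. It also consolidates and optimizes the classic and DoublePlane paths. The goal is to fill in the basic capabilities, unify scripts and documentation, and improve throughput for large-packet transfers while keeping the existing input/output semantics unchanged. The changes are:
- Add build scripts, run scripts, and documentation for the classic `dispatch` / `combine` and DoublePlane `dispatch_doubleplane` / `combine_doubleplane` operators, with unified entry points for functional verification and performance testing.
- The `dispatch` path distributes input tokens to their target experts according to the routing rules and produces the intermediate results that the subsequent combine needs; the `combine` path returns each expert's output and aggregates it by source token and topK weight to produce the final output.
- `dispatch_doubleplane` / `combine_doubleplane` introduce an adaptive transfer strategy based on segment size that selects the transfer plane, direct MTE or SDMA, per segment. Large segments prefer SDMA and small segments stay on the direct path, which reduces SDMA overhead for small packets and improves throughput for large packets.
- Keep the external interfaces, output formats, and verification methods of the four operators consistent, so that later smoke tests, generalization tests, and performance comparisons can be run uniformly.

## Related Issue
https://gitcode.com/cann/shmem/issues/249

## Testing
Verified along the following dimensions; specific parameters and results can be added based on the actual runs:
- Functional correctness: ran the basic smoke tests for `dispatch`, `combine`, `dispatch_doubleplane`, and `combine_doubleplane` with each directory's `scripts/run.sh`, and checked that `expand_x`, `assist_info_for_combine`, `ep_recv_count`, `expert_token_nums`, and `x_out` match the golden data.
- Second-level smoke tests: covered typical combinations of `[2/4/8]` cards, `[int32_t/float16_t]`, and `[bs/topk/h/expertPerPe]`, confirming that both the classic and DoublePlane paths run stably.
- Operator generalization: covered different combinations of `pes`, `bs`, `h`, `topk`, and `expertPerPe`, verifying that the output semantics match the classic baseline across shapes.
- Performance verification: used `--perf` mode to compare `dispatch` vs `dispatch_doubleplane` and `combine` vs `combine_doubleplane` on `full_op` and `comm_only`, focusing on bandwidth and latency in multi-card scenarios.
- Protocol consistency: verified that on the mixed SDMA/direct path of DoublePlane, the write and visibility ordering of payload, assist, ready, count, and status is correct, with no functional regression.

Verification results:
![91CDE5B2-5437-41ED-A859-9F0C5013E64C.png](https://raw.gitcode.com/user-images/assets/8546182/a1858f90-6331-47a6-9347-46d1131a7550/91CDE5B2-5437-41ED-A859-9F0C5013E64C.png '91CDE5B2-5437-41ED-A859-9F0C5013E64C.png')
![D9BE8A9E-29EE-49FB-8FCD-6C11504295D5.png](https://raw.gitcode.com/user-images/assets/8546182/683ebd6c-df45-4d48-9c08-1360b7bbb0cc/D9BE8A9E-29EE-49FB-8FCD-6C11504295D5.png 'D9BE8A9E-29EE-49FB-8FCD-6C11504295D5.png')
![E19255BB-42E0-4CDE-E609-7E35E761F343.png](https://raw.gitcode.com/user-images/assets/8546182/7557886f-947a-4b17-b62f-107f1e1823b3/E19255BB-42E0-4CDE-E609-7E35E761F343.png 'E19255BB-42E0-4CDE-E609-7E35E761F343.png')

## Documentation updates
- Update/add the READMEs for `dispatch`, `combine`, `dispatch_doubleplane`, and `combine_doubleplane`, describing how to build, run functional tests, and run performance tests.
- Update/add the DoublePlane design document, recording the segment decision, the transfer strategy, and the control-plane protocol.

## Type labels
- [ ] Bug fix
- [ ] New feature
- [ ] Performance optimization
- [ ] Documentation update
- [x] Other, please describe: Add example/perftest

See merge request: cann/shmem!350

File	Last commit	Last updated
scripts	Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths	4 days ago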
CMakeLists.txt	Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths	4 days ago
README.md	Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths	4 days ago
combine_kernel.cpp	Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths	4 days ago
combine_kernel.h	Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths	4 days ago
main.cpp	Add Dispatch and Combine operator examples for the classic and DoublePlane implementation paths	4 days ago

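Classic MoE Combine Example

This example implements the classic combine operator for non-quantized MoE, corresponding to the design document DOC/moe_dispatch_combine_non_quant_architecture.md. Combine consumes the expert outputs and auxiliary metadata produced in the dispatch stage, returns each topK expert's result to the PE that owns the original token, and performs the weighted reduction.

Functionality

The inputs include the dispatch outputs:

expand_x: the token data aggregated per expert after dispatch. The data generation script in this example uses an identity expert computation, so expand_x can be treated directly as the expert output.
assist_info_for_combine: the origin [src_rank_id, src_token_id, src_topk_id] of each expert output row.
ep_recv_count: the cumulative per-segment receive counts generated in the dispatch stage.

Combine also needs its own inputs:

expert_ids: the expert id for each token/topK pair.
expert_scales: the weighting coefficient for each token/topK pair.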
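To make the role of ep_recv_count more concrete, the following is a minimal host-side sketch of how cumulative counts can be decoded into row ranges of expand_x. It rests on two assumptions that the README does not state: the counts are inclusive prefix sums, and segments are ordered with local_expert_id as the major index and src_rank_id as the minor index. The function name segment_ranges is hypothetical and is not part of the example's scripts.

```python
def segment_ranges(ep_recv_count, pe_size):
    """Turn cumulative counts into (local_expert_id, src_rank_id, start, end) tuples.

    Assumes inclusive prefix sums and expert-major, rank-minor segment order.
    """
    ranges = []
    start = 0
    for index, end in enumerate(ep_recv_count):
        local_expert_id, src_rank_id = divmod(index, pe_size)  # assumed ordering
        ranges.append((local_expert_id, src_rank_id, start, end))
        start = end  # the next segment begins where this one ends
    return ranges


# Two local experts, two source ranks: segment sizes are 2, 0, 3, 1.
ranges = segment_ranges([2, 2, 5, 6], pe_size=2)
print(ranges)  # [(0, 0, 0, 2), (0, 1, 2, 2), (1, 0, 2, 5), (1, 1, 5, 6)]
```

An empty segment shows up as start == end, so a consumer can skip it without special handling.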

The output is:

x_out: the final combine result for each token.

Verification formula:

x_out[token] = sum(topk_output[token, topk] * expert_scales[token, topk])
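The formula above can be written out as a small pure-Python reference. This is only an illustrative sketch using nested lists; the name combine_reference is hypothetical, and the example's own golden generation in data_gen.py may be organized differently.

```python
def combine_reference(topk_output, expert_scales):
    """Weighted sum over the topK axis.

    topk_output[token][k] is a length-H list; expert_scales[token][k] is a scalar.
    """
    x_out = []
    for outputs, scales in zip(topk_output, expert_scales):
        hidden = len(outputs[0])
        acc = [0] * hidden
        for vec, scale in zip(outputs, scales):
            for i, value in enumerate(vec):
                acc[i] += value * scale
        x_out.append(acc)
    return x_out


# One token, TopK = 2, H = 3.
topk_output = [[[1, 2, 3], [10, 20, 30]]]
expert_scales = [[2, 1]]
result = combine_reference(topk_output, expert_scales)
print(result)  # [[12, 24, 36]]
```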

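Implementation

The kernel launches pe_size AIV cores. In the send-back stage, the core with core_id == src_rank writes the expert outputs held on this PE back to the corresponding source rank.

Main flow:

Iterate over each (local_expert_id, src_rank_id) segment according to ep_recv_count.
Read the original token_id and topk_id from assist_info_for_combine.
Use MTE put_nbi to write the expert output back to the combine window of the source PE.
Write the status ready flag to notify the source PE that the corresponding topK result is readable.
The local PE waits until all topK status flags of each local token are ready.
Compute the weighted sum using expert_scales and write x_out.
Clear the status flags and synchronize across cores.

In the current classic version, both the data plane and the control plane use MTE.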
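The flow above can be modeled on the host in a single process to see how the pieces fit together. The sketch below is not the kernel: plain list writes stand in for the remote put, a boolean table stands in for the status flags, and it walks expert output rows directly instead of iterating segments through ep_recv_count. The function name simulate_combine and the data layout are assumptions made for illustration.

```python
def simulate_combine(expand_x, assist_info, expert_scales, bs, topk):
    """Single-process model of the two combine stages.

    expand_x[pe] and assist_info[pe] are the rows held by each PE;
    expert_scales[pe][token][k] are the weights on the PE that owns the token.
    """
    pe_size = len(expand_x)
    window = [[[None] * topk for _ in range(bs)] for _ in range(pe_size)]
    status = [[[False] * topk for _ in range(bs)] for _ in range(pe_size)]

    # Stage 1: every PE writes each expert output row back to its source PE.
    for pe in range(pe_size):
        for row, (src_rank, token, k) in zip(expand_x[pe], assist_info[pe]):
            window[src_rank][token][k] = row   # stands in for the remote put
            status[src_rank][token][k] = True  # ready flag for this topK slot

    # Stage 2: every PE reduces its own tokens once all topK slots are ready.
    x_out = []
    for pe in range(pe_size):
        pe_out = []
        for token in range(bs):
            assert all(status[pe][token]), "a topK result never arrived"
            hidden = len(window[pe][token][0])
            acc = [0] * hidden
            for k in range(topk):
                for i in range(hidden):
                    acc[i] += window[pe][token][k][i] * expert_scales[pe][token][k]
            status[pe][token] = [False] * topk  # clear flags for the next run
            pe_out.append(acc)
        x_out.append(pe_out)
    return x_out


# 2 PEs, 1 token per PE, TopK = 2, H = 2, identity experts.
expand_x = [[[1, 2]], [[1, 2], [3, 4], [3, 4]]]
assist_info = [[(0, 0, 0)], [(0, 0, 1), (1, 0, 0), (1, 0, 1)]]
expert_scales = [[[1, 2]], [[3, 1]]]
x_out = simulate_combine(expand_x, assist_info, expert_scales, bs=1, topk=2)
print(x_out)  # [[[3, 6]], [[12, 16]]]
```

In the real operator the two stages run concurrently on different PEs, which is why the ready flags and the final cleanup and synchronization step are needed; the sequential model above hides that ordering problem.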

Default shape

H = 7168
TopK = 8
BS, PEs, and expertPerPe are configurable

The defaults can be overridden with -h, -topk, or the performance sweep parameters.
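As a rough sense of scale, the amount of expert output returned to one PE can be estimated from the shape. The sketch below assumes that each of the bs local tokens receives topk result rows of H elements and ignores status flags, padding, and alignment, so it is not necessarily the value reported in the performance CSV. The helper name combine_payload_bytes is hypothetical.

```python
DTYPE_BYTES = {"int32_t": 4, "float16_t": 2}


def combine_payload_bytes(bs, topk, hidden, dtype):
    """Bytes returned to one PE: bs tokens x topk results x hidden elements."""
    return bs * topk * hidden * DTYPE_BYTES[dtype]


int32_bytes = combine_payload_bytes(bs=8, topk=8, hidden=7168, dtype="int32_t")
fp16_bytes = combine_payload_bytes(bs=8, topk=8, hidden=7168, dtype="float16_t")
print(int32_bytes)  # 1835008
print(fp16_bytes)   # 917504
```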

Build

Run from the repository root:

bash scripts/build.sh -examples

Run

Basic 2-card test:

cd examples/combine/combine_classic
bash scripts/run.sh -pes 2 -bs 8 -expertPerPe 2 -type int32_t

8-card, 64-expert test:

cd examples/combine/combine_classic
bash scripts/run.sh -pes 8 -bs 8 -expertPerPe 8 -type int32_t

float16 correctness test:

bash scripts/run.sh -pes 2 -bs 16 -expertPerPe 2 -type float16_t

Overriding TopK:

bash scripts/run.sh -pes 2 -bs 8 -topk 4 -expertPerPe 2 -type int32_t

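The script automatically generates the combine inputs and the golden output, launches one process per PE, writes the output to output/x_out_<rank>.bin, and runs the shared script ../scripts/check_combine.py in the parent directory to verify the result.

Common parameters

-pes <n>            Number of PEs/NPUs; the single-node example requires -gnpus to equal -pes.
-bs <n>             Number of tokens per PE.
-h <n>              Token hidden size, default 7168.
-topk <n>           Number of experts each token is routed to, default 8.
-expertPerPe <n>    Number of local experts on each PE.
-type <dtype>       Data type; int32_t and float16_t are supported.
-fnpu <id>          Starting NPU id, default 0.
-ipport <url>       SHMEM bootstrap address, default tcp://127.0.0.1:8767.

bfloat16_t is currently not instantiated in the combine example. The CANN 9.0 beta backend does not support the scalar bf16 cast that the combine accumulation path requires, so the script explicitly rejects -type bfloat16_t.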
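The two constraints mentioned above (supported data types, and -gnpus matching -pes on a single node) amount to a small up-front check. The following sketch shows one way such a check could look; it is not taken from scripts/run.sh, and the function name validate_args and its error messages are hypothetical.

```python
SUPPORTED_DTYPES = ("int32_t", "float16_t")


def validate_args(pes, gnpus, dtype):
    """Raise ValueError for combinations the README describes as not accepted."""
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported -type {dtype}; expected one of {SUPPORTED_DTYPES}")
    if gnpus != pes:
        raise ValueError(f"single-node run needs -gnpus ({gnpus}) == -pes ({pes})")


validate_args(pes=2, gnpus=2, dtype="int32_t")  # accepted, returns None

try:
    validate_args(pes=2, gnpus=2, dtype="bfloat16_t")
except ValueError as err:
    rejected = str(err)
    print(rejected)
```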

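Test flow

A single functional test performs the following steps:

Delete the old golden/ and output/ directories.
Use the shared script ../scripts/data_gen.py in the parent directory to generate the inputs, the dispatch metadata, and the golden data.
Launch the combine process for each PE.
Write output/x_out_<rank>.bin.
Use the shared script ../scripts/check_combine.py in the parent directory to compare the outputs of all ranks.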
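The final comparison step can be illustrated with a minimal sketch for the int32_t case. It assumes that the .bin files are headerless arrays of native-endian int32 values and uses a made-up golden file name; neither assumption is confirmed by the README, and this is not the implementation of check_combine.py. For float16_t a tolerance-based comparison would be more appropriate than exact equality.

```python
import array
import os
import tempfile


def read_int32_bin(path):
    """Read a headerless file of native-endian int32 values (assumed layout)."""
    values = array.array("i")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    return values.tolist()


def first_mismatch(output_path, golden_path):
    """Return None if the files match, else (index, got, expected)."""
    got = read_int32_bin(output_path)
    expected = read_int32_bin(golden_path)
    if len(got) != len(expected):
        return (min(len(got), len(expected)), None, None)
    for i, (g, e) in enumerate(zip(got, expected)):
        if g != e:
            return (i, g, e)
    return None


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "x_out_0.bin")
    gold_path = os.path.join(tmp, "golden_x_out_0.bin")  # hypothetical golden name
    for path, values in ((out_path, [3, 6, 9]), (gold_path, [3, 6, 9])):
        with open(path, "wb") as f:
            array.array("i", values).tofile(f)
    match_result = first_mismatch(out_path, gold_path)

    # Corrupt one element of the output to show what a failure looks like.
    with open(out_path, "wb") as f:
        array.array("i", [3, 7, 9]).tofile(f)
    mismatch_result = first_mismatch(out_path, gold_path)

print(match_result, mismatch_result)  # None (1, 7, 6)
```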

Performance testing

Single-shape profiling:

cd examples/combine/combine_classic
bash scripts/run.sh --perf -pes 2 -bs 8 -expertPerPe 2 -type int32_t \
    --warmup 5 --loops 50

BS sweep：

bash scripts/run.sh --perf -pes 2 -type int32_t --bs-list 8,16,32 \
    --warmup 5 --loops 50

Multi-card, multi-shape sweep:

bash scripts/run.sh --perf --pes-list 2,4,8 --bs-list 8,16 \
    --expert-per-pe-list 2,8 -type int32_t --prof-pe all
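If the list parameters are expanded as a Cartesian product, which is an assumption here and not something the README states, the command above would cover 3 x 2 x 2 = 12 cases. The sketch below enumerates them; the function name sweep_cases and the dictionary keys are hypothetical.

```python
from itertools import product


def sweep_cases(pes_list, bs_list, expert_per_pe_list):
    """Enumerate every (pes, bs, expertPerPe) combination (assumed expansion)."""
    return [
        {"pes": pes, "bs": bs, "expertPerPe": epp}
        for pes, bs, epp in product(pes_list, bs_list, expert_per_pe_list)
    ]


cases = sweep_cases([2, 4, 8], [8, 16], [2, 8])
print(len(cases))  # 12
print(cases[0])    # {'pes': 2, 'bs': 8, 'expertPerPe': 2}
```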

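The CSV metrics include:

full_op: the complete combine, including send-back communication, status wait, weighted reduction, cleanup, and synchronization.
comm_only: the Stage 1 send-back communication plus the required completion/status protocol.

CSV files are written to output/perf/: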

combine_perf_rank0.csv
combine_perf_rank1.csv
combine_perf_summary.csv  # generated when --prof-pe all is used

The first six columns are compatible with examples/utils/perf_data_process.py:

DataSize/B,Npus,Blocks,UBsize/KB,Bandwidth/GB/s,CoreMaxTime/us

Additional columns include Metric, GlobalDataSize/B, PerPeBandwidth/GB/s, BS, H, TopK, ExpertPerPe, Dtype, Warmup, Loops, ProfPe, and CaseId.
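A CSV with these columns can be post-processed with the standard library alone. The sketch below reads the Metric and CoreMaxTime/us columns and computes the share of full_op time spent in comm_only. The rows are synthetic placeholder values made up for illustration, not measurements, and the sample includes only a subset of the additional columns.

```python
import csv
import io

# Synthetic placeholder rows, not real measurements.
SAMPLE = """DataSize/B,Npus,Blocks,UBsize/KB,Bandwidth/GB/s,CoreMaxTime/us,Metric
1835008,2,2,16,18.35,100.0,full_op
1835008,2,2,16,30.58,60.0,comm_only
"""


def time_by_metric(csv_text):
    """Map each Metric value to its CoreMaxTime/us as a float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Metric"]: float(row["CoreMaxTime/us"]) for row in reader}


times = time_by_metric(SAMPLE)
comm_share = times["comm_only"] / times["full_op"]
print(times)       # {'full_op': 100.0, 'comm_only': 60.0}
print(comm_share)  # 0.6
```

Looking rows up by column name keeps the script independent of where the additional columns appear after the first six.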

Use --analyse plot or --analyse md to invoke the unified performance report script after the performance test.