# comm_config.yaml: communication configuration table
# Path: tensor_cast/performance_model/perf_database/data/{device}/hccl/{cann_version}/comm_config.yaml
#
# Sources used to validate the mappings:
#   - DSV3 Decode Profiling (32 NPUs): hcom_allReduce_(82x), HcomAllGather(164x),
#     HcomReduceScatter(82x), hcom_alltoallv_(82x), hcom_allGather_(41x), allgatherAicpuKernel(41x)
#   - Qwen3-30B Prefill Profiling (16 NPUs): hcom_allReduce_(276x), hcom_allGather_(8x)
#   - op-plugin: provides only fused MC2 variants (npu_mm_all_reduce_base, npu_all_gather_base_mm, etc.);
#     standalone collectives go directly through HCCL and bypass op-plugin

device: ATLAS_800_A3_752T_128G_DIE
cann_version: "8.1.RC1"
collection_date: "2026-03-04"

# ============================================================================
#                                  Topology
# ============================================================================
# ATLAS 800 A3: grid_shape dimensions run from outermost to innermost:
# [number of pods, nodes per pod, dies per node]

topology:
  grid_shape: [48, 8, 2]
  tiers:
    0: {name: "inter_pod", bandwidth_gbps: 196, latency_us: 5.5, type: "CLOS"}
    1: {name: "intra_pod", bandwidth_gbps: 196, latency_us: 0.5, type: "CLOS"}
    2: {name: "die_level", bandwidth_gbps: 224, latency_us: 0.2, type: "SIO"}
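
# Illustrative reading of grid_shape (a sketch, not part of the measured data):
# [48, 8, 2] describes 48 x 8 x 2 = 768 dies in total. Assuming row-major rank
# numbering (an assumption; this file does not state the ordering), rank r
# would sit at:
#   pod  = r // 16
#   node = (r % 16) // 2
#   die  = r % 2
# For example, rank 37 would map to pod 2, node 2, die 1.
# Under the same assumption, two ranks that differ only in the die coordinate
# would communicate over tier 2 (die_level); ranks in different pods would
# communicate over tier 0 (inter_pod).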


# ============================================================================
#          Communication operator mappings (comm_operator_mappings)
# ============================================================================
# Based on the HCCL kernel types actually observed in profiling.
# kernel_type_variants lists the different NPU kernel names for the same logical operation.

comm_operator_mappings:

  "tensor_cast.all_reduce.default":
    kernel_type: hcom_allReduce_
    notes: >
      [HIGH] HCCL direct; standalone all_reduce.
      The fused MC2 op npu_mm_all_reduce_base does not produce a standalone hcom_allReduce_ kernel.
      Profiling: DSV3(82x), Qwen3(276x); this is the only variant observed.
    hccl_test_op: allreduce

  "tensor_cast.all_gather.default":
    kernel_type: hcom_allGather_
    notes: >
      [MEDIUM] HCCL direct; three variants observed in profiling:
      hcom_allGather_(41x DSV3, 8x Qwen3, standalone PyTorch dispatch),
      HcomAllGather(164x DSV3, graph-compiled torchair, possibly from MC2 pipeline sharding),
      allgatherAicpuKernel(41x DSV3, AICPU-coordinated path, MC2 ai_cpu comm_mode).
      TC standalone all_gather → hcom_allGather_; aggregate queries should include all variants.
    hccl_test_op: allgather

  "tensor_cast.reduce_scatter.default":
    kernel_type: HcomReduceScatter
    notes: >
      [MEDIUM] HCCL direct; DSV3 shows only HcomReduceScatter(82x, CamelCase = graph-compiled torchair);
      hcom_reduceScatter_(snake_case) may appear in non-torchair scenarios.
      Qwen3 Prefill has no reduce_scatter (it uses all_reduce for TP).
    hccl_test_op: reducescatter

  "tensor_cast.all_to_all.default":
    kernel_type: hcom_alltoallv_
    notes: >
      [HIGH] HCCL direct; TC uses variable split sizes → alltoallv (not fixed-size alltoall).
      op-plugin has a fused npu_gmm_alltoallv (aclnnGroupedMatMulAlltoAllv); TC models all_to_all separately.
      Profiling: DSV3 Decode hcom_alltoallv_(82x = 2x41 MoE layers, one dispatch and one combine per layer).
    hccl_test_op: alltoall


# ============================================================================
#          HCCL test tool notes (microbenchmark data collection)
# ============================================================================
# Documentation: https://www.hiascend.com/document/detail/zh/mindstudio/70RC1/mscommandtoolug/mscommandug/auxiliarydevtool_0017.html
#
# Basic usage:
#   hccl_test -b <min_bytes> -e <max_bytes> -f <factor> -p <operation>
#
# Recommended collection matrix:
#   - Device count: [2, 4, 8, 16, 32, 64]
#   - Message size: 1 KB to 1 GB (doubling at each step)
#   - Data types: [BF16, FP16, INT8, FP8]
#   - Topology tiers: die_level (2 dies), intra_pod (8 dies), inter_pod (across pods)
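#
# Example invocation (a sketch that fills in the usage template above; the
# concrete values and the plain byte-count argument format are assumptions,
# so check the linked documentation before running):
#   hccl_test -b 1024 -e 1073741824 -f 2 -p allreduce
# Read against the template, this would sweep allreduce from 1 KB to 1 GB,
# doubling the message size at each step. The operation name comes from the
# hccl_test_op field of the corresponding entry in comm_operator_mappings.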