problem: "Packet Analysis"
description: "An excessive number of small communication packets may cause host delivery bottlenecks.\n"
sdma_problem: "In SDMA communication, {abnormal_ratio} of the communication data volume is less than {min_size} MB, with a total time of {abnormal_time} ms.\n"
rdma_problem: "In RDMA communication, {abnormal_ratio} of the communication data volume is less than {min_size} MB, with a total time of {abnormal_time} ms."
min_sdma_size: 16
min_rdma_size: 1
min_sdma_ratio: 0.2
min_rdma_ratio: 0.2
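# Reading the thresholds above (a sketch of the assumed semantics; the checker
# that consumes this file is not shown here, so treat this as an assumption):
#   - min_sdma_size / min_rdma_size: size threshold in MB, matching the
#     "{min_size} MB" placeholder in the problem messages.
#   - min_sdma_ratio / min_rdma_ratio: share of communication below that size
#     beyond which the corresponding problem message is reported.
# Hypothetical example: if 30 of 100 SDMA transfers are smaller than 16 MB,
# the share is 30 / 100 = 0.3, which exceeds 0.2, so sdma_problem would apply.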
solutions:
  - data parallelism suggestion:
      desc: "If the abnormal communication is concentrated in the data parallelism domain, try: 1. increasing the batch size; 2. increasing the number of gradient accumulation steps."
  - check the memory optimization policy:
      desc: "If the memory optimization policy is ZeRO-3, consider switching to ZeRO-2 or ZeRO-1 when memory allows."
  - adopt fusion operators of affinity optimizers:
      desc: "Using affinity optimizers or fusion operators may reduce the number of communication operators."