Dynamic EPLB Acceleration

DyEPLB

  • Background

    As visual generation models evolve toward DiT architectures, introducing MoE mechanisms to extend the scaling law has become a common industry direction. However, the large parameter scale of DiT-MoE models forces the use of expert parallelism (EP). Unlike LLM workloads, visual data has strong spatial locality, which can easily overload specific experts and cause severe load imbalance. In addition, expert activation distributions vary dynamically across denoising timesteps, which means traditional static load-balancing strategies fail under combined spatial and temporal heterogeneity. DyEPLB addresses this DiT-MoE scenario with dynamic expert load balancing to improve cluster utilization and inference performance.

  • Principle

    Expert weights are adjusted dynamically across ranks according to load information so that expert load is balanced and model inference is accelerated.
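As a rough illustration of this principle, the sketch below rebalances experts with a greedy heaviest-first heuristic. It is not the FPA algorithm used by DyEPLB, whose details are not described here; the function name `greedy_rebalance` and the load numbers are hypothetical.

```python
import heapq


def greedy_rebalance(expert_loads, world_size):
    """Place experts on ranks, heaviest first, always onto the least-loaded rank.

    Illustrative heuristic only; every rank receives the same number of experts.
    """
    if len(expert_loads) % world_size != 0:
        raise ValueError("expert count must be divisible by world_size")
    experts_per_rank = len(expert_loads) // world_size

    # Min-heap of (accumulated_load, rank): the lightest rank is popped first.
    heap = [(0, rank) for rank in range(world_size)]
    heapq.heapify(heap)
    layout = [[] for _ in range(world_size)]

    # Visit experts from the heaviest to the lightest.
    order = sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e])
    for expert in order:
        load, rank = heapq.heappop(heap)
        layout[rank].append(expert)
        # A rank that has reached its capacity is not pushed back.
        if len(layout[rank]) < experts_per_rank:
            heapq.heappush(heap, (load + expert_loads[expert], rank))
    return [sorted(experts) for experts in layout]


loads = [90, 80, 20, 10]  # hypothetical token counts per expert
layout = greedy_rebalance(loads, world_size=2)
rank_loads = [sum(loads[e] for e in experts) for experts in layout]
print(layout)      # [[0, 3], [1, 2]]
print(rank_loads)  # [100, 100]
```

With a contiguous static layout ([0, 1] on rank 0 and [2, 3] on rank 1), the same loads would give 170 and 30 tokens per rank.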

  • Notes

    • DyEPLB is designed to be minimally intrusive, so global synchronization checks and weight-update locations can be selected according to the model implementation and the FPA algorithm scenario.
    • In all-gather full-EP mode, the global synchronization check can be performed earlier to accommodate custom operators such as torch_npu.npu_moe_init_routing_v2, which helps preserve token continuity when weights are replaced.
    • It is recommended to place the weight-update module between two Matmul operations to maximize the gain from the FPA algorithm.
    • The workflow includes host-to-device data transfer, so DyEPLB may contend for bandwidth when it is used together with offload. Adjust the scheduling of the two mechanisms so that they do not block each other.
    • Weight updates temporarily consume extra memory for expert weights, which can raise peak memory usage. The FPA algorithm provides an EX mode that reduces expert-layout changes and mitigates this issue.

  • Integration flow

    Note

    To minimize the impact on the main inference path, the algorithm logic and expert-weight stitching are handled in additional threads and processes.

    For the input and output details of the involved interfaces, see Class initialization and interface reference.

    1. Start the EPLB scheduler process:

      root@node134:/home# python -m mindiesd.eplb.eplb_scheduler --world_size 2 --host localhost --port 50001 --mode A2A
      

      Common launch parameters include:

      • world_size: number of EP ranks
      • expert_num: total number of global experts
      • block_num: number of MoE layers
      • max_move: maximum number of moved experts in EX mode
      • redundant: number of redundant experts
      • mode: A2A for all-to-all EP, AG for all-gather EP, EX for controlled mode, which limits expert moves (see max_move)
      • auth_key: reads the EPLB_AUTH_KEY environment variable by default; falls back to secret_key
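Because the scheduler process and every worker must agree on the key, it can help to resolve it in one place. The following is a minimal sketch, assuming the fallback value is supplied by the caller; `resolve_auth_key` is a hypothetical helper, not part of mindiesd, and the package's own lookup logic may differ.

```python
import os


def resolve_auth_key(fallback):
    """Return EPLB_AUTH_KEY if it is set and non-empty, otherwise the fallback.

    Hypothetical helper for illustration; it only mirrors the documented
    "environment variable first, fallback second" behavior.
    """
    return os.environ.get("EPLB_AUTH_KEY") or fallback
```

The resolved value could then be passed both to the scheduler launch and to construct_expert_info_transfer_pool so that the two sides stay consistent.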
    2. Import the load collector and dispatcher:

      from mindiesd.eplb.dispatcher import DynamicDispatcher
      from mindiesd.eplb.collector import ExpertLoadCollector
      
    3. Before inference, start the worker thread that handles data through a task queue. After model initialization, initialize the DyEPLB load collector and dispatcher at the MoE-layer granularity:

      # model initialization
      model.init()
      
      # load collector
      model.moe_module.block.expert_load_collector = ExpertLoadCollector(expert_num, lb_interval)
      # dispatcher, holding the complete expert weights on the host side
      model.moe_module.block.dispatcher = DynamicDispatcher(expert_num, weight1, weight2, rank_in_group, ep_size)
      # start worker thread
      if eplb_enabled:
          from mindiesd.eplb.task_manager import construct_expert_info_transfer_pool
          # multiprocessing communication; auth_key must match the EPLB scheduler process
          construct_expert_info_transfer_pool(module=model, rank_in_group=rank_in_group, device=device, ip=host, port=port, auth_key=auth_key)
      
      # inference flow
      model.forward()
      
    4. In all-gather full-EP mode, multiply the expert scores by the transform matrix with one extra matmul, so that tokens, indices, and related variables do not have to be reordered manually later:

      if EP_AG and self.dispatcher.update_flag:
          # transformation matrix generated from the expert ordering, shape (global_expert_num, global_expert_num)
          expert_trans_tensor = self.dispatcher.get_expert_trans_tensor()
          trans_scores = torch.matmul(scores, expert_trans_tensor)
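
The effect of this matmul can be illustrated with a permutation matrix: multiplying the scores by it reorders the expert columns in one step. The sketch below uses plain Python lists instead of tensors. It assumes the transform matrix is a permutation built from the new expert order; the helper names and the orientation of the matrix are hypothetical and may differ from what get_expert_trans_tensor returns.

```python
def build_expert_trans_matrix(expert_order):
    """Build P so that column `new` of (scores @ P) is column expert_order[new] of scores."""
    n = len(expert_order)
    matrix = [[0.0] * n for _ in range(n)]
    for new_col, old_col in enumerate(expert_order):
        matrix[old_col][new_col] = 1.0
    return matrix


def matmul(a, b):
    """Plain-Python matrix product, standing in for torch.matmul."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Two tokens, four global experts (hypothetical scores).
scores = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8]]
expert_order = [0, 3, 1, 2]  # hypothetical new ordering of global experts

trans_scores = matmul(scores, build_expert_trans_matrix(expert_order))
print(trans_scores)  # [[0.1, 0.4, 0.2, 0.3], [0.5, 0.8, 0.6, 0.7]]
```

The result equals indexing each score row by expert_order, which is why no separate reordering of tokens or indices is needed afterwards.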
      
    5. Recommended call order inside the MoE layer: init_routing -> collect_load -> global_sync_check -> weight_replace -> GMM (grouped matmul).

      expanded_tokens, expanded_row_idx, expanded_indices = torch_npu.npu_moe_init_routing(tokens, row_idx, indices, tokens.shape[0])
      
      # collect expert load
      self.expert_load_collector.collect_expert_load(expanded_indices)
      # global synchronization check
      self.dispatcher.check_consistency()
      # apply the new weights when the synchronization check signals an update
      if self.dispatcher.update_flag:
          (weight1, weight2, local_expert_num, device_indices_map,
           local_expert_indices_map, local_expert_list) = self.dispatcher.update_module_weight_and_map()
          self.weight1 = weight1
          self.weight2 = weight2
          self.local_expert_num = local_expert_num
      
      # grouped matmul (GMM); arguments are omitted here
      tokens = torch_npu.npu_grouped_matmul_finalize_routing()
      

Class initialization and interface reference

  • ExpertLoadCollector

    Parameters:

    • expert_num: total number of global experts
    • lb_interval: EPLB interval in steps; the default value 1 means every step participates in EPLB

    Return value: none

  • DynamicDispatcher

    Parameters:

    • expert_num: total number of global experts
    • weight1: UP weights
    • weight2: DOWN weights
    • rank_in_group: rank index within the EP communication group
    • ep_size: EP size

    Return value: none

  • construct_expert_info_transfer_pool

    Parameters:

    • module: initialized model
    • rank_in_group: rank index within the EP communication group
    • device: device index bound to the rank
    • ip: IP address of the EPLB scheduler process; must match the scheduler's --host setting
    • port: port of the EPLB scheduler process; must match the scheduler's --port setting
    • auth_key: multiprocessing secret; reads EPLB_AUTH_KEY by default and falls back to secret_key

    Return value: none

  • get_expert_trans_tensor

    Used in all-gather EP scenarios to obtain the transform matrix.

  • collect_expert_load

    Parameters:

    • expanded_indices: token-cumsum values for each expert; the output of npu_moe_init_routing can be passed directly

    Return value: none
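
Taking the description of expanded_indices at face value (an inclusive cumulative token count per expert), the per-expert load can be recovered by differencing adjacent entries. The sketch below shows that arithmetic only; `counts_from_cumsum` is a hypothetical helper, and whether the collector performs this step internally is not stated here.

```python
def counts_from_cumsum(token_cumsum):
    """Convert inclusive cumulative token counts into per-expert token counts."""
    counts = []
    previous = 0
    for value in token_cumsum:
        counts.append(value - previous)
        previous = value
    return counts


# Hypothetical cumsum for four experts: expert 1 receives no tokens.
per_expert = counts_from_cumsum([3, 3, 7, 12])
print(per_expert)  # [3, 0, 4, 5]
```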

  • check_consistency

    Performs an extra all-gather communication internally to verify synchronization status across ranks.

  • update_module_weight_and_map

    Parameters: none

    Return values:

    • weight1: UP weights
    • weight2: DOWN weights
    • local_expert_num: number of local experts, including redundant experts
    • device_indices_map: the rank that holds each global expert index; for example, [0, 1, 1, 0] places experts 0 and 3 on rank 0 and experts 1 and 2 on rank 1
    • local_expert_indices_map: the position of each global expert index in the local expert-weight tensor, with -1 for experts not held locally; for example, [0, -1, -1, 1] on rank 0 and [-1, 0, 1, -1] on rank 1
    • local_expert_list: the global expert indices held by the local rank, in local order; for example, [0, 3] on rank 0 and [1, 2] on rank 1
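
The relationship between the three maps can be checked with a short sketch that derives the two local structures from device_indices_map. `build_local_maps` is a hypothetical helper written for illustration. It assumes each expert is held by exactly one rank, so it does not cover redundant experts.

```python
def build_local_maps(device_indices_map, rank):
    """Derive local_expert_indices_map and local_expert_list for one rank.

    Assumes one owning rank per global expert (no redundant experts).
    """
    # Global expert indices held by this rank, in ascending order.
    local_expert_list = [
        expert for expert, owner in enumerate(device_indices_map) if owner == rank
    ]
    # Position of each global expert in the local weight tensor, -1 if absent.
    local_expert_indices_map = [-1] * len(device_indices_map)
    for local_pos, expert in enumerate(local_expert_list):
        local_expert_indices_map[expert] = local_pos
    return local_expert_indices_map, local_expert_list


device_indices_map = [0, 1, 1, 0]
print(build_local_maps(device_indices_map, rank=0))  # ([0, -1, -1, 1], [0, 3])
print(build_local_maps(device_indices_map, rank=1))  # ([-1, 0, 1, -1], [1, 2])
```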