| [Feature] model fine-tuning supports time division multiplexing of DP on PP0
Co-authored-by: f00620112<fangminghao@huawei.com>
# message auto-generated for no-merge-commit merge:
!2527 merge 20260430_iter6 into master
[Feature] model fine-tuning supports time division multiplexing of DP on PP0
Created-by: fangminghao
Commit-by: f00620112
Merged-by: ascend-robot
Description: https://gitcode.com/Ascend/MindSpeed-MM/issues/176
## What this PR does / why we need it?
[Feature] model fine-tuning supports time division multiplexing of DP on PP0
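- Feature description: when the edge side does not have enough nodes, the edge-side DP size may be smaller than the cloud-side DP size.
- Asymmetric DP logic: in the symmetric DP scenario, cards are multiplexed spatially (card-division multiplexing), so the nodes or cards of each DP domain process only that domain's data. In the asymmetric DP scenario, by contrast, the edge side processes the data of multiple DP domains through time-division multiplexing and communicates with the cloud side separately for each domain.
- Change overview: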
| Area | Change | Modified path | Original path |
| ----------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ---------------------------------------------------- |
| ViT model | Intrusive change: ViT layer-count summation adapted to VPP | mindspeed_mm/models/vision/vision_encoders/vision_transformer_block.py | / |
| Patch registration | / | mindspeed_mm/patchs/patch_manager.py | / |
| Training pipeline | Modify pipeline scheduling to support time-division multiplexing of multiple DP domains | mindspeed_mm/patchs/layerwise_disaggregated_training/schedules_patch.py | megatron/core/pipeline_parallel/schedules.py |
| Communication operators | Modify communication operators (recv_forward, recv_backward, etc.) to support time-division multiplexing of multiple DP domains | mindspeed_mm/patchs/layerwise_disaggregated_training/p2p_communication_patch.py | megatron/core/pipeline_parallel/p2p_communication.py |
| Training initialization | Modify communication-group initialization and parallel initialization to support time-division multiplexing of multiple DP domains | mindspeed_mm/patchs/layerwise_disaggregated_training/parallel_state_patch.py | megatron/core/parallel_state.py |
| Model initialization | Modify model initialization and ckpt loading logic to support co-deployment of the first and last stages | mindspeed_mm/patchs/layerwise_disaggregated_training/vlm_model_patch.py | mindspeed_mm/models/vlm_model.py |
| Training post-processing | Post-processing communication optimization | mindspeed_mm/patchs/layerwise_disaggregated_training/utils_patch.py mindspeed_mm/patchs/layerwise_disaggregated_training/distributed_data_parallel_patch.py | megatron/core/utils.py |
| Model splitting | ckpt splitting method hf_to_mm_ldt, supporting co-deployment of the first and last stages | checkpoint/vlm_model/hf_to_mm_ldt.py | / |
| Validation pre-/post-processing | Pre-process and post-process args so that they pass argument validation | mindspeed_mm/patchs/validate_args_patch.py | / |
| Deleted files | / | mindspeed_mm/patchs/layerwise_disaggregated_training/training_patch.py mindspeed_mm/patchs/layerwise_disaggregated_training/utils.py | / |
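
To illustrate the asymmetric DP idea, here is a minimal sketch of how one edge-side DP rank might serve several cloud-side DP domains in successive time slots. The function name `edge_time_slots`, the strided assignment, and the requirement that the cloud DP size be a multiple of the edge DP size are all assumptions made for illustration; the actual mapping is implemented in the patched schedules and communication code and may differ.

```python
def edge_time_slots(edge_dp: int, cloud_dp: int) -> list[list[int]]:
    """For each edge DP rank, list the cloud DP domains it serves, in time order.

    Hypothetical helper: a strided assignment is assumed here and is not
    taken from the PR itself.
    """
    if edge_dp <= 0 or cloud_dp <= 0:
        raise ValueError("DP sizes must be positive")
    if cloud_dp % edge_dp != 0:
        raise ValueError("this sketch assumes cloud_dp is a multiple of edge_dp")
    # Edge rank r handles cloud domains r, r + edge_dp, r + 2 * edge_dp, ...
    return [list(range(rank, cloud_dp, edge_dp)) for rank in range(edge_dp)]


# Symmetric DP: every edge rank serves exactly one DP domain.
symmetric = edge_time_slots(edge_dp=4, cloud_dp=4)

# Asymmetric DP: two edge ranks take turns serving four cloud DP domains.
asymmetric = edge_time_slots(edge_dp=2, cloud_dp=4)
```

In the symmetric case each inner list has one entry, which corresponds to card-division multiplexing; in the asymmetric case each edge rank iterates over its list, processing and communicating one DP domain per time slot.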
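- Notes on the intrusive change:
Path of the intrusive change: mindspeed_mm/models/vision/vision_encoders/vision_transformer_block.py:298-301
Original code: sums the number of ViT layers on all pp_ranks that precede the current PP stage.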
```python
previous_layer = sum(self.config.pipeline_num_layers[:pp_rank])
```
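New code: the edge-cloud feature enables VPP, in which case self.config.pipeline_num_layers is a two-dimensional array and can no longer be summed directly with sum.
Change: a check was added for whether self.config.pipeline_num_layers is a two-dimensional array. When the edge-cloud feature is enabled, self.config.pipeline_num_layers is two-dimensional, so the sum is taken over self.config.pipeline_num_layers[0] instead. When the edge-cloud feature is disabled, self.config.pipeline_num_layers is one-dimensional and the code takes the else branch with the original logic. The change therefore does not affect the existing code path.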
```python
if isinstance(self.config.pipeline_num_layers[0], list):
    previous_layer = sum(self.config.pipeline_num_layers[0][:pp_rank])
else:
    previous_layer = sum(self.config.pipeline_num_layers[:pp_rank])
```
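
The behavior of this check can be exercised in isolation. The sketch below wraps the same branch in a standalone function; the name `previous_layer_count` and the layer counts are hypothetical values chosen for illustration and are not taken from any real configuration.

```python
def previous_layer_count(pipeline_num_layers, pp_rank):
    """Number of ViT layers on the pipeline stages before pp_rank."""
    if isinstance(pipeline_num_layers[0], list):
        # 2-D layout (VPP enabled): sum over the first row only.
        return sum(pipeline_num_layers[0][:pp_rank])
    # 1-D layout: original logic.
    return sum(pipeline_num_layers[:pp_rank])


layers_1d = [8, 8, 8, 8]                  # hypothetical, VPP disabled
layers_2d = [[4, 4, 4, 4], [4, 4, 4, 4]]  # hypothetical, VPP enabled

offset_1d = previous_layer_count(layers_1d, pp_rank=2)
offset_2d = previous_layer_count(layers_2d, pp_rank=2)

# Without the check, the original expression fails on the 2-D layout,
# because sum() starts from the integer 0 and cannot add a list to it.
try:
    sum(layers_2d[:1])
    plain_sum_failed = False
except TypeError:
    plain_sum_failed = True
```

With these inputs the 1-D case still follows the original path, while the 2-D case avoids the `TypeError` that the unmodified `sum` would raise.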
## Does this PR introduce any user-facing change?
Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path.
## How was this patch tested?
Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations.
See merge request: Ascend/MindSpeed-MM!2527 | 1 day ago |