| [Feature] Add dynamic DPCP switching to opensoraplan1.3
Co-authored-by: qusongyun1<qusongyun1@noreply.gitcode.com>
# message auto-generated for no-merge-commit merge:
!1677 merge dynamicDPCP into master
[Feature] Add dynamic DPCP switching to opensoraplan1.3
Created-by: qusongyun1
Commit-by: qusongyun1
Merged-by: ascend-robot
Description: ## Motivation
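The current static DPCP scheme cannot fully use the available compute under dynamic workloads. For example, with many short sequences and a few long ones, a large CP size must be set to avoid OOM, but running short sequences with a large CP size degrades performance. This feature adds dynamic DPCP, which switches the DP/CP parallel strategy in each training iteration based on the characteristics of the data.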
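As a rough illustration of the trade-off described above, the sketch below picks the smallest CP size that keeps the longest sequence in a batch within a per-rank token budget and gives the remaining ranks to DP. The function name `choose_dp_cp` and the `max_tokens_per_rank` budget are hypothetical and are not taken from `dpcp_utils.py`; the actual selection rule in this PR may differ.

```python
def choose_dp_cp(seq_lens, world_size, max_tokens_per_rank):
    """Return (dp, cp) for one iteration.

    Hypothetical helper: picks the smallest CP size that divides
    world_size and keeps the longest sequence's per-rank shard
    within max_tokens_per_rank. DP takes the remaining ranks.
    """
    longest = max(seq_lens)
    for cp in range(1, world_size + 1):
        if world_size % cp != 0:
            continue  # CP size must evenly divide the number of ranks
        tokens_per_rank = -(-longest // cp)  # ceiling division
        if tokens_per_rank <= max_tokens_per_rank:
            return world_size // cp, cp
    raise ValueError("longest sequence does not fit even with cp == world_size")


# Short sequences only: no context parallelism is needed.
print(choose_dp_cp([1000, 2000], world_size=8, max_tokens_per_rank=4096))
# One long sequence in the batch: a larger CP size is selected.
print(choose_dp_cp([1000, 30000], world_size=8, max_tokens_per_rank=4096))
```

Under these assumptions, a batch of short sequences runs as pure data parallelism, and only batches containing a long sequence pay the communication cost of a large CP group.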
## Modification
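pretrain_sora.py: if dynamic DPCP is enabled, fetch data from the cache first
training.py: initialize the DPCP parallel groups during initialization; after a switch, broadcast the data within the CP group and put it into the cache
Added dpcp_utils.py under MindSpeed-MM/mindspeed_mm/utils; all functions related to this feature are implemented in that file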
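The cache-first read path described for pretrain_sora.py can be sketched as follows. This is a minimal illustration only: `BatchCache`, `put`, and `get_or_fetch` are hypothetical names, and the broadcast within the CP group is left out. It is not the code in `dpcp_utils.py`.

```python
from collections import deque


class BatchCache:
    """Hypothetical FIFO cache for batches placed there after a DP/CP switch."""

    def __init__(self):
        self._queue = deque()

    def put(self, batch):
        # Called after a switch, once the batch is available on this rank.
        self._queue.append(batch)

    def get_or_fetch(self, data_iterator):
        # Prefer cached data; fall back to the regular data iterator.
        if self._queue:
            return self._queue.popleft()
        return next(data_iterator)


cache = BatchCache()
cache.put({"id": "cached"})
loader = iter([{"id": "from_loader"}])
first = cache.get_or_fetch(loader)   # served from the cache
second = cache.get_or_fetch(loader)  # cache is empty, so the iterator is used
```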
## Self-test (Optional)
If modifications to this PR may cause/fix function/accuracy/performance DTSs/issues, a self-inspection record needs to be attached.
## BC-breaking (Optional)
If there are compatibility issues, such as dependencies on cann/torch_npu versions, they need to be explained in the PR.
## Checklist
**Before PR**:
- [x] The new code needs to comply with the Clean Code specification.
- [x] The PR content is self-checked; the wording is clear and the writing is standardized.
**After PR**:
- [x] CLA has been signed and all committers have signed the CLA in this PR.
- [x] The ci-pipeline is passed, Code Check is passed.
See merge request: Ascend/MindSpeed-MM!1677 | 6 months ago |