MindSpeed-LLM/mindspeed_llm/tasks/posttrain · Ascend/MindSpeed-LLM - AtomGit

ascend-robot: feat(pytorch): support different DP or DP&TP configuration on edge and cloud for layerwise_disaggregated_training

Commit 78212c65, created 13 days ago

File	Last commit	Last updated
base	feat(torch): add GLM-4.5 scripts. Co-authored-by: cjy840282 <chenjingyi9@huawei.com>. Merge request !4369 (GLM-4.5-new into master; created and committed by cjy840282, merged by ascend-robot). What this PR does: adds GLM-4.5 scripts. User-facing change: supports GLM-4.5 LoRA fine-tuning. How it was tested: vLLM inference is normal. See merge request: Ascend/MindSpeed-LLM!4369	1 month ago
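dpo	fix: fix multiple potential bugs to improve code robustness. Co-authored-by: 王姜奔 <wangjiangben@huawei.com>. Merge request !4362 (master into master; created by wangjiangben, committed by 王姜奔, merged by ascend-robot). Description: this PR fixes several potential bugs found in the repository to improve robustness and stability. (1) Bare except statements: a bare `except:` catches every exception, including system exceptions, which can make problems hard to debug. Fix: use `except Exception:` so that only standard exceptions are caught. Affected files: mindspeed_llm/tasks/checkpoint/loader_hf.py, mindspeed_llm/tasks/checkpoint/loader_mg.py. (2) Division-by-zero check logic: the `check_divisible_by_zero` function had a logic error in which the original condition let non-integer divisors go straight to the division. Fix: simplify the check to `if divisor != 0:`, which handles all numeric types correctly. Affected file: mindspeed_llm/tasks/utils/error_utils.py. (3) Division-by-zero risk in the DPO trainer: `chosen_log_probs / chosen_length` divides by zero when `chosen_length` is 0. Fix: use `torch.clamp(chosen_length, min=1)` to make the division safe. Affected file: mindspeed_llm/tasks/posttrain/dpo/dpo_trainer.py. (4) Division-by-zero risk in BBH evaluation: `loss_values.sum(-1).cpu().numpy() / token_ids.size(1)` divides by zero when the token sequence is empty. Fix: use `max(token_ids.size(1), 1)`. Affected file: mindspeed_llm/tasks/evaluation/eval_impl/bbh_eval.py. Test plan: [x] code changes completed; [x] changes committed to the local repository; [x] changes pushed to the remote repository; [x] waiting for CI to pass; [ ] waiting for code review. Scope of impact: exception handling, numerical safety, and edge-case handling. All changes are defensive programming and do not alter behavior in normal cases. See merge request: Ascend/MindSpeed-LLM!4362	1 month ago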
ldt_sft	feat(pytorch): support different DP or DP&TP configuration on edge and cloud for layerwise_disaggregated_training. Co-authored-by: yanzhenghao <yanzhenghao2@huawei.com>, xuguoliang3 <xuguoliang3@huawei.com>, fangminghao <fangminghao@huawei.com>. Merge request !4437 (20260409_vdp into master; created by xuguoliang3, committed by xuguoliang3, yanzhenghao, and fangminghao, merged by ascend-robot). What this PR does: supports different DP or DP&TP configuration on edge and cloud for layerwise_disaggregated_training. User-facing change: Please describe whether the PR will result in any user-facing usage changes. If there is related documentation, please specify its path. How it was tested: Please explain how to verify the correctness and effectiveness of this feature, as well as its usage constraints and limitations. See merge request: Ascend/MindSpeed-LLM!4437	13 days ago
lora	fix: fix the bug when QLoRA does not enable the overlap feature. Co-authored-by: yanzhixiao <yanzhixiao@h-partners.com>. Merge request !4315 (fix-qlora into master; created by yanzhixiao23, committed by yanzhixiao, merged by ascend-robot). What this PR does: fixes the bug where QLoRA weights cannot be dequantized when overlap is disabled. User-facing change: none, bugfix only. How it was tested: bug fixed. See merge request: Ascend/MindSpeed-LLM!4315	2 months ago
lu_lora	!3043 [pytorch][feature] lulora: localy updated localized learning. Merge pull request !3043 from artyomtugaryov/master	9 months ago
sft	feat(pytorch): add DeepSeek V4 fine-tuning trainer. Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>. Merge request !4452 (dsv4 into master; created by HANHU1CHEN, committed by HanhuiChen, merged by ascend-robot). What this PR does: because of the unique structure of the dsv4 model, a new trainer is needed to load the model. User-facing change: none; users can still start training with posttrain_gpt.py. How it was tested: long-term training has been run on the dataset, and the training loss converges normally. See merge request: Ascend/MindSpeed-LLM!4452	22 days ago
__init__.py	!1958 Restructure the repository file layout. Merge pull request !1958 from DONGHAORAN/master	1 year ago
launcher.py	feat(pytorch): add DeepSeek V4 fine-tuning trainer. Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>. Merge request !4452 (dsv4 into master; created by HANHU1CHEN, committed by HanhuiChen, merged by ascend-robot). What this PR does: because of the unique structure of the dsv4 model, a new trainer is needed to load the model. User-facing change: none; users can still start training with posttrain_gpt.py. How it was tested: long-term training has been run on the dataset, and the training loss converges normally. See merge request: Ascend/MindSpeed-LLM!4452	22 days ago
utils.py	[pytorch][bugfix] fix bug in TND fine-tuning to enable mbs. Co-authored-by: yanzhixiao <yanzhixiao@h-partners.com>. Merge request !4226 (bugfix-tune-tnd-mbs into master; created by yanzhixiao23, committed by yanzhixiao, merged by ascend-robot). Description: [pytorch][bugfix] fix the bug of fine-tuning the TND to enable mbs. See merge request: Ascend/MindSpeed-LLM!4226	3 months ago
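
The dpo entry above describes two defensive patterns: guarding a division against a zero-length denominator, and narrowing a bare `except:` to `except Exception:`. The following is a minimal plain-Python sketch of those two ideas only. It is not the repository's code: the function names `safe_mean_log_prob` and `parse_int_or_default` are hypothetical, and `max(length, 1)` stands in for the `torch.clamp(chosen_length, min=1)` call mentioned in the commit so that the example runs without PyTorch.

```python
def safe_mean_log_prob(log_prob_sum: float, length: int) -> float:
    """Average a summed log-probability over `length` tokens.

    Hypothetical helper: an empty sequence is treated as length 1,
    so the division can never hit a zero denominator.
    """
    return log_prob_sum / max(length, 1)


def parse_int_or_default(text: str, default: int = 0) -> int:
    """Parse an integer, falling back to `default` on ordinary errors.

    Hypothetical helper: `except Exception:` still lets
    KeyboardInterrupt and SystemExit propagate, which a bare
    `except:` would swallow.
    """
    try:
        return int(text)
    except Exception:
        return default


if __name__ == "__main__":
    print(safe_mean_log_prob(-6.0, 3))   # normal case
    print(safe_mean_log_prob(0.0, 0))    # empty sequence, no ZeroDivisionError
    print(parse_int_or_default("abc", -1))
```

Clamping the denominator to 1 leaves results unchanged whenever the length is positive, which matches the commit's statement that the fixes do not alter behavior in normal cases.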