File | Last commit | Last updated
revert "transformer from 4 upgrade to 5" Co-authored-by: wanggangguo <wanggangguo@huawei.com> # message auto-generated for no-merge-commit merge: !4518 merge upgrade into master revert "transformer from 4 upgrade to 5" Created-by: isfrapples Commit-by: wanggangguo Merged-by: ascend-robot See merge request: Ascend/MindSpeed-LLM!4518 5 days ago
!2569 Refact: embedding and rotary embedding Merge pull request !2569 from shenjiarun/master 1 year ago
!2710 Compatibility fixes Merge pull request !2710 from shengjy/bf523 11 months ago
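fix: fix multiple potential bugs to improve code robustness Co-authored-by: 王姜奔 <wangjiangben@huawei.com> # message auto-generated for no-merge-commit merge: !4362 merge master into master fix: fix multiple potential bugs to improve code robustness Created-by: wangjiangben Commit-by: 王姜奔 Merged-by: ascend-robot Description: ## Fixes This PR fixes several potential bugs found in the repository to improve the robustness and stability of the code. ### Fix details #### 1. Fix bare except statements **Problem**: A bare except: catches every exception, including system exceptions, which can make problems hard to debug. **Fix**: Changed to except Exception:, which catches only standard exceptions. **Affected files**: - mindspeed_llm/tasks/checkpoint/loader_hf.py - mindspeed_llm/tasks/checkpoint/loader_mg.py #### 2. Fix the logic error in the division-by-zero check **Problem**: The check_divisible_by_zero function had a logic error: the original condition let a non-integer divisor go straight to the division. **Fix**: Simplified to if divisor != 0:, which handles all numeric types correctly. **Affected files**: - mindspeed_llm/tasks/utils/error_utils.py #### 3. Fix the division-by-zero risk in the DPO trainer **Problem**: chosen_log_probs / chosen_length raises a division-by-zero exception when chosen_length is 0. **Fix**: Use torch.clamp(chosen_length, min=1) to make the division safe. **Affected files**: - mindspeed_llm/tasks/posttrain/dpo/dpo_trainer.py #### 4. Fix the division-by-zero risk in the BBH evaluation **Problem**: loss_values.sum(-1).cpu().numpy() / token_ids.size(1) divides by zero when the token sequence is empty. **Fix**: Use max(token_ids.size(1), 1) to prevent division by zero. **Affected files**: - mindspeed_llm/tasks/evaluation/eval_impl/bbh_eval.py ## Test plan - [x] Code changes completed - [x] Changes committed to the local repository - [x] Changes pushed to the remote repository - [x] Waiting for CI tests to pass - [ ] Waiting for code review ## Scope of impact These fixes mainly affect: - the exception handling mechanism - numerical computation safety - edge-case handling All changes are defensive programming and do not change behavior under normal conditions. See merge request: Ascend/MindSpeed-LLM!4362 1 month ago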
[pytorch][mindio][feature]Ensure that the ACP Level 1 asynchronous save feature is compatible with TFT online recovery. Co-authored-by: z30027952 <zengyihang2@h-partners.com> # message auto-generated for no-merge-commit merge: !4103 merge acp_tft_compatibility into master [pytorch][mindio][feature]Ensure that the ACP Level 1 asynchronous save feature is compatible with TFT online recovery. Created-by: zengyihang Commit-by: z30027952 Merged-by: ascend-robot Description: [pytorch][mindio][feature] High availability now supports ACP and TFT together, so that ACP Level 1 asynchronous save and TFT online recovery both take effect during training. See merge request: Ascend/MindSpeed-LLM!4103 3 months ago
revert "transformer from 4 upgrade to 5" Co-authored-by: wanggangguo <wanggangguo@huawei.com> # message auto-generated for no-merge-commit merge: !4518 merge upgrade into master revert "transformer from 4 upgrade to 5" Created-by: isfrapples Commit-by: wanggangguo Merged-by: ascend-robot See merge request: Ascend/MindSpeed-LLM!4518 5 days ago
[pytorch][bugfix]fix lora_target_modules in ckpt save_lora_to_hf Co-authored-by: qyzqyz <quyueze@h-partners.com> # message auto-generated for no-merge-commit merge: !4010 merge master into master [pytorch][bugfix]fix lora_target_modules in ckpt save_lora_to_hf Created-by: qyzqyz Commit-by: qyzqyz Merged-by: ascend-robot Description: fix lora_target_modules in ckpt save_lora_to_hf See merge request: Ascend/MindSpeed-LLM!4010 4 months ago
feat: Optimize deepseekV4's rmsnorm operator performance Co-authored-by: LinShua <707894133@qq.com> # message auto-generated for no-merge-commit merge: !4553 merge master_rmsnorm_ascendC into master feat: Optimize deepseekV4's rmsnorm operator performance Created-by: LinShua Commit-by: LinShua Merged-by: ascend-robot Description: ## What this PR does / why we need it? Optimizes deepseekV4's rmsnorm performance by calling a fused operator. ## Does this PR introduce any user-facing change? NA ## How was this patch tested? NA See merge request: Ascend/MindSpeed-LLM!4553 1 day ago
feat(pytorch): support different DP or DP&TP configuration on edge and cloud for layerwise_disaggregated_training Co-authored-by: yanzhenghao <yanzhenghao2@huawei.com> Co-authored-by: xuguoliang3 <xuguoliang3@huawei.com> Co-authored-by: fangminghao <fangminghao@huawei.com> # message auto-generated for no-merge-commit merge: !4437 merge 20260409_vdp into master feat(pytorch): support different DP or DP&TP configuration on edge and cloud for layerwise_disaggregated_training Created-by: xuguoliang3 Commit-by: xuguoliang3;yanzhenghao;fangminghao Merged-by: ascend-robot Description: ## What this PR does / why we need it? support different DP or DP&TP configuration on edge and cloud for layerwise_disaggregated_training See merge request: Ascend/MindSpeed-LLM!4437 13 days ago
fix(pytorch): fix get_dataset_list bug Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com> # message auto-generated for no-merge-commit merge: !4498 merge bugfix into master fix(pytorch): fix get_dataset_list bug Created-by: HANHU1CHEN Commit-by: HanhuiChen Merged-by: ascend-robot Description: ## What this PR does / why we need it? Fix get_dataset_list bug. ## Does this PR introduce any user-facing change? No. ## How was this patch tested? Use LlamaFactoryInstructionHandler to test. See merge request: Ascend/MindSpeed-LLM!4498 8 days ago
revert "transformer from 4 upgrade to 5" Co-authored-by: wanggangguo <wanggangguo@huawei.com> # message auto-generated for no-merge-commit merge: !4518 merge upgrade into master revert "transformer from 4 upgrade to 5" Created-by: isfrapples Commit-by: wanggangguo Merged-by: ascend-robot See merge request: Ascend/MindSpeed-LLM!4518 5 days ago
!1998 rename: repo package name from modellink to mindspeed_llm Merge pull request !1998 from MeiFei/master-package-rename 1 year ago
fix(python): fix protobuf resource conflict bug Co-authored-by: yanzhixiao <yanzhixiao@h-partners.com> # message auto-generated for no-merge-commit merge: !4359 merge bugfix-0330 into master fix(python): fix protobuf resource conflict bug Created-by: yanzhixiao23 Commit-by: yanzhixiao Merged-by: ascend-robot Description: ## What this PR does / why we need it? Fix core dump caused by protobuf loading conflict between PyTorch and TensorFlow ## Does this PR introduce any user-facing change? NA, only bugfix. ## How was this patch tested? The bug fixed. See merge request: Ascend/MindSpeed-LLM!4359 2 months ago
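
The description of merge request !4362 in the listing above names three defensive patterns: a non-zero divisor check, clamping a length to at least 1 before dividing, and replacing a bare `except:` with `except Exception:`. The sketch below illustrates those patterns in plain Python. It is not the repository's code: the helper names `safe_divide`, `mean_per_token`, and `parse_int`, and the `default` fallback values, are hypothetical and chosen only for illustration.

```python
def safe_divide(numerator: float, divisor: float, default: float = 0.0) -> float:
    """Divide only when the divisor is non-zero (hypothetical helper).

    The `divisor != 0` test works for ints and floats alike, which is the
    property the merge request attributes to its simplified check.
    """
    if divisor != 0:
        return numerator / divisor
    return default  # assumed fallback; the real function's behavior is not shown in the source


def mean_per_token(total_loss: float, num_tokens: int) -> float:
    """Average a summed loss over tokens, treating an empty sequence as length 1."""
    return total_loss / max(num_tokens, 1)


def parse_int(text: str, default: int = 0) -> int:
    """Parse an integer while catching only standard exceptions."""
    try:
        return int(text)
    except Exception:
        # Unlike a bare `except:`, this lets KeyboardInterrupt and SystemExit propagate.
        return default
```

For tensors, the merge request describes the same clamping idea with `torch.clamp(chosen_length, min=1)`; the `max(..., 1)` form here is its scalar counterpart.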