File | Last commit | Last updated
[pytorch][feature]add checkpoint manager for fsdp2 (4 months ago)
Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>
Merge: !4045, lite into master (message auto-generated for no-merge-commit merge)
Created-by: HANHU1CHEN. Commit-by: HanhuiChen. Merged-by: ascend-robot.
Description:
- Distributed checkpoint saving/loading: built on PyTorch DCP, supporting full save and restore of the model, optimizer, and training state.
- HuggingFace format export: converts DCP checkpoints into the standard HuggingFace weight format (safetensors).
- Resume from checkpoint: fully restores global_step, lr_scheduler, random number states, and more.
See merge request: Ascend/MindSpeed-LLM!4045
[pytorch][feature]add ckpt manager for ep (3 months ago)
Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>
Merge: !4201, ckpt into master (message auto-generated for no-merge-commit merge)
Created-by: HANHU1CHEN. Commit-by: HanhuiChen. Merged-by: ascend-robot.
Description: add ckpt manager for ep
See merge request: Ascend/MindSpeed-LLM!4201
[pytorch][feature]add checkpoint save/load in training loop (3 months ago)
Co-authored-by: HanhuiChen <chenhanhui1@h-partners.com>
Merge: !4111, ckpt into master (message auto-generated for no-merge-commit merge)
Created-by: HANHU1CHEN. Commit-by: HANHU1CHEN; HanhuiChen. Merged-by: ascend-robot.
Description: Adds checkpoint cleanup after saving to limit disk usage, and supports resuming training from a checkpoint by skipping already trained batches.
See merge request: Ascend/MindSpeed-LLM!4111
feature(pytorch): FSDP2 support hardware-adaptive execution (2 months ago)
Co-authored-by: zhyebin01 <zhangyebin@h-partners.com>
Merge: !4343, fsdp2_gpu into master (message auto-generated for no-merge-commit merge)
Created-by: zhyebin01. Commit-by: zhyebin01. Merged-by: ascend-robot.
Description:
- What this PR does / why we need it: FSDP2 support hardware-adaptive execution
- Does this PR introduce any user-facing change: No
- How was this patch tested: pipeline test passed
See merge request: Ascend/MindSpeed-LLM!4343
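
The commit for !4111 mentions two behaviours: cleaning up old checkpoints after each save, and resuming by skipping batches that were already trained. The sketch below illustrates both ideas using only the Python standard library. It is not taken from MindSpeed-LLM: the `step_<N>` directory naming, the `keep_last` parameter, and the function names `prune_checkpoints` and `resume_batches` are all hypothetical, and a distributed run would presumably restrict deletion to a single rank.

```python
import re
import shutil
from itertools import islice
from pathlib import Path

# Hypothetical naming scheme: one directory per saved step, e.g. "step_300".
STEP_DIR = re.compile(r"^step_(\d+)$")


def prune_checkpoints(root: Path, keep_last: int) -> list[Path]:
    """Delete all but the newest `keep_last` step directories under `root`.

    Returns the paths that were removed.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    found = []
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort numerically by step so that step_1000 counts as newer than step_300.
    found.sort()
    stale = [path for _, path in found[:-keep_last]]
    for path in stale:
        shutil.rmtree(path)
    return stale


def resume_batches(batches, consumed: int):
    """Return an iterator over `batches` that skips the first `consumed` items."""
    return islice(batches, consumed, None)
```

A training loop following this sketch would call `prune_checkpoints(save_dir, keep_last=k)` right after each successful save, and wrap its data iterator in `resume_batches(loader, consumed)` once, where `consumed` is the number of batches recorded in the restored training state. Skipping by iteration only reproduces the original data order if the loader's shuffling is deterministic for the restored random state.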