| [pytorch][mindio][feature] Support online recovery after a precision error, based on a specified number of checkpoint steps. Requires the latest mindio_ttp to be installed.
Co-authored-by: wangguoyan<wangguoyan6@h-partners.com>
# message auto-generated for no-merge-commit merge:
!4071 merge master into master
[pytorch][mindio][feature] Support online recovery after a precision error, based on a specified number of checkpoint steps. Requires the latest mindio_ttp to be installed.
Created-by: guoywang
Commit-by: wangguoyan
Merged-by: ascend-robot
Description: [pytorch][mindio][feature] High availability: support online recovery after a precision anomaly, based on a specified number of checkpoint steps
See merge request: Ascend/MindSpeed-LLM!4071 | 4 months ago |