| [Feature]Added performance optimization for the Self-Forcing
Co-authored-by: qq_31502577<liuqiyuan6@huawei.com>
# message auto-generated for no-merge-commit merge:
!1832 merge master into master
[Feature]Added performance optimization for the Self-Forcing
Created-by: qq_31502577
Commit-by: qq_31502577
Merged-by: ascend-robot
Description: ## Motivation
Add performance optimizations for the Self-Forcing model.
## Modification
1. Enable the fused rms_norm operator.
2. Change the default precision of rope_param from float64 to float32.
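
For reviewers who want to see what the two changes listed above amount to numerically, here is a minimal pure-Python sketch. It is not the code from this PR: `rms_norm` is an unfused reference for the computation a fused RMSNorm kernel is expected to perform, and `rope_inv_freq`, `to_float32`, `theta`, and `eps` are hypothetical names and defaults chosen for illustration. The actual `rope_param` signature and the fused operator call are not shown in this description.

```python
import math
import struct


def rms_norm(x, weight, eps=1e-6):
    """Unfused reference: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.

    A fused operator would produce the same result in a single kernel
    instead of separate square, mean, rsqrt, and multiply steps.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def rope_inv_freq(dim, theta=10000.0, single_precision=True):
    """Rotary inverse frequencies theta^(-2i/dim) for i in [0, dim // 2).

    Assumption: the values are computed in float64 and then rounded, which
    only approximates a table built in float32 from the start.
    """
    freqs = [1.0 / (theta ** (2 * i / dim)) for i in range(dim // 2)]
    if single_precision:
        return [to_float32(f) for f in freqs]
    return freqs


# Usage: normalize a small vector, then measure the float32 rounding error.
y = rms_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
f64 = rope_inv_freq(128, single_precision=False)
f32 = rope_inv_freq(128, single_precision=True)
max_rel_err = max(abs(a - b) / a for a, b in zip(f64, f32))
print(y, max_rel_err)
```

The relative rounding error of a float32 frequency table is bounded by float32 machine precision, which is the reason a float32 default can be acceptable when the downstream computation does not run at higher precision anyway. Whether that holds for a given model still has to be checked against the project's accuracy standard.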
## Self-test (Optional)
1. Verified: performance improves by 25% after the change.
2. After the optimization, accuracy still meets the accuracy standard.

## BC-breaking (Optional)
If there are compatibility issues, such as dependencies on cann/torch_npu versions, they need to be explained in the PR.
## Checklist
**Before PR**:
- [ ] The new code needs to comply with the Clean Code specification.
- [ ] The PR content has been self-checked; the description is clear and the writing follows the standard.
**After PR**:
- [ ] CLA has been signed and all committers have signed the CLA in this PR.
- [ ] The ci-pipeline is passed, Code Check is passed.
See merge request: Ascend/MindSpeed-MM!1832 | 5 months ago |