| [Change description][ClusterD] Refactor Ascend950 process-level rescheduling: report faulty processes to TaskD by identifying the scheduled pods
Co-authored-by: east-yuan<yuanzhendong2@h-partners.com>
| 3 months ago |
| [clusterd] Improve accuracy of manual chip isolation: application-layer part for already-isolated faults
Co-authored-by: whr_666<772343610@qq.com>
| 3 months ago |
| [clusterd] Do not configure default filtered fault codes for mindie jobs
Co-authored-by: zhoupan39<zhoupan39@huawei.com>
| 1 month ago |
| [DP & ClusterD]DPU cm refactor
Co-authored-by: x00953810<xujingru5@huawei.com>
| 4 months ago |
| [DP & ClusterD]DPU cm refactor
Co-authored-by: x00953810<xujingru5@huawei.com>
| 4 months ago |
| Support configuring server-decoupled scheduling policies via schedule-policy
Co-authored-by: lirui2381<2396601465@qq.com>
| 5 months ago |
| [MindCluster] Adapt Atlas 350 standard card to the product form and chip renaming
Co-authored-by: q00951730<quyitong@huawei.com>
Co-authored-by: cqchou<thekonka@proton.me>
Co-authored-by: leon_xun<xunzeliang@h-partners.com>
| 3 months ago |
| [clusterd] Do not configure default filtered fault codes for mindie jobs
Co-authored-by: zhoupan39<zhoupan39@huawei.com>
| 1 month ago |
| <fix>[clusterD] Correctly retrieve exception information when the pg is deleted after a job failure
Co-authored-by: shepherd-cheung<1220798123@qq.com>
| 2 months ago |
| [MindCluster] Batch rename files
Co-authored-by: q00951730<quyitong@huawei.com>
| 2 months ago |
| !922 [Change description][ClusterD] Slow network clusterd pr part1
Merge pull request !922 from tiankaijin/slowpr-clusterpr1
| 1 year ago |
| [clusterD] Optimize fault job info cm resource updates
Co-authored-by: zhoupan39<zhoupan39@huawei.com>
| 2 months ago |
| [clusterd] Filter out invalid jobs when initializing job statistics
Co-authored-by: lijinghan<lijinghan1@huawei.com>
| 2 months ago |
| [Change description][clusterd][taskd] Add a cm file mount check to ModifyTrainingDataTraceSwitch; taskd adds a worker-count check that triggers periodic pullMsg execution
Co-authored-by: higher_speeder<wangjun940510@qq.com>
| 5 months ago |
| [Modification] clusterd unifies pre-isolation fault handling; Lingqu (灵衢) sub-health faults are not handled as pre-isolation faults
Co-authored-by: wangjun<wangjun940510@qq.com>
Co-authored-by: wangjun<374719709@qq.com>
| 6 months ago |
| [clusterd] Filter out invalid jobs when initializing job statistics
Co-authored-by: lijinghan<lijinghan1@huawei.com>
| 2 months ago |
| [MindCluster] Batch rename files
Co-authored-by: q00951730<quyitong@huawei.com>
| 2 months ago |
| [DP & ClusterD]DPU cm refactor
Co-authored-by: x00953810<xujingru5@huawei.com>
| 4 months ago |