
ACT Training Example on Ascend Atlas A2

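This directory provides an ACT training example that trains a model for the ALOHA transfer_cube task and is structured so that it can be extended to other tasks.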

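The example follows these principles:

- The cann-recipes-embodied-ai repository stores only the training example directory, configs, scripts, docs, and patches.
- lerobot is an external dependency and is cloned separately.
- setup.sh pins the lerobot commit id and runs git apply for the validated, general-purpose Ascend training patch.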

1. Applicable Scenarios

- Hardware: Ascend Atlas A2
- CANN: 8.3.0 or later
- Task: AlohaTransferCube-v0
- Dataset: lerobot/aloha_sim_transfer_cube_human
- External training framework: huggingface/lerobot

2. External Dependencies and Pinned Version

This example does not embed the lerobot source code. By default it uses the following commit:

58f70b6bd370864139a3795ac3497a9eae8c42d5
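
Because the example depends on one exact lerobot commit, it can be useful to confirm that a local checkout really sits on it before training. The following is a minimal sketch, not part of setup.sh; the helper names `matches_pin` and `current_head` are hypothetical, and only the standard `git rev-parse HEAD` command is assumed.

```python
import subprocess
from pathlib import Path

# Commit pinned by this example (copied from the text above).
PINNED_COMMIT = "58f70b6bd370864139a3795ac3497a9eae8c42d5"


def matches_pin(head: str, pinned: str = PINNED_COMMIT) -> bool:
    """Return True if `head` is the pinned SHA or a prefix of it (7+ chars)."""
    head = head.strip().lower()
    return len(head) >= 7 and pinned.startswith(head)


def current_head(repo: Path) -> str:
    """Ask git for the commit currently checked out in `repo`."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

With the sibling-directory layout this example recommends, a check could look like `matches_pin(current_head(Path("../lerobot")))`; the path is an assumption about where the checkout lives.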

3. Directory Layout

manipulation/act/train/
├── README.md
├── doc/
│   └── README.md
└── src/
    ├── configs/
    │   ├── act_aloha.yaml
    │   └── act_aloha_smoke.yaml
    ├── patches/
    │   └── lerobot_ascend_train_common.patch
    └── scripts/
        ├── run_eval.sh
        ├── run_train.sh
        └── setup.sh

4. Environment Setup

4.1 Clone the code

git clone https://gitcode.com/cann/cann-recipes-embodied-ai.git
cd cann-recipes-embodied-ai

4.2 Prepare `lerobot`

Run:

chmod +x manipulation/act/train/src/scripts/setup.sh
./manipulation/act/train/src/scripts/setup.sh

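The script:

- prepares the lerobot repository in a directory alongside cann-recipes-embodied-ai;
- checks out the pinned commit 58f70b6bd370864139a3795ac3497a9eae8c42d5;
- applies the currently validated Ascend training patch (including the video-decoding tolerance fix that ACT needs when using torchcodec);
- installs the general LeRobot Python dependencies required by ACT, plus gym-aloha;
- reuses the torch / torch_npu from the currently activated environment by default;
- can optionally create a new conda environment via command-line options and install the platform-specific torch / torchvision / torch_npu from local wheels.

Common usage: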

# Show the script help
./manipulation/act/train/src/scripts/setup.sh --help

# Use the already prepared Ascend environment
./manipulation/act/train/src/scripts/setup.sh

# Create a new conda environment and install the platform stack from local wheels
./manipulation/act/train/src/scripts/setup.sh \
  --create-conda \
  --env-name lerobot-act \
  --python-version 3.10 \
  --torch-wheel /path/to/torch.whl \
  --torchvision-wheel /path/to/torchvision.whl \
  --torch-npu-wheel /path/to/torch_npu.whl

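Notes:

- The script does not hard-code a torch_npu download link because the valid wheel combination depends on the host architecture, the CANN version, and the Ascend software stack.
- These platform dependencies should either be reused from an existing training environment or supplied by the user as local wheels.
- If the platform stack is already known to work, append --skip-torch-check to skip the import check at the end of the script.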
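
The import check that --skip-torch-check bypasses is not reproduced in this README. As a rough illustration of what such a check might cover, the sketch below reports which platform packages cannot be located in the active environment. The function name `missing_modules` is hypothetical, and locating a module is weaker than importing it: a package can be found and still fail at import time if the CANN runtime is misconfigured.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["torch", "torchvision", "torch_npu"])` in the training environment should return an empty list when the platform stack is installed.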

4.3 Dataset path

The configs default to a workspace-relative path:

../dataset/lerobot/aloha_sim_transfer_cube_human

To change it, edit:

src/configs/act_aloha.yaml
src/configs/act_aloha_smoke.yaml

These relative paths are resolved against the lerobot root directory by default. The recommended workspace layout is:

<workspace>/
├── cann-recipes-embodied-ai/
├── lerobot/
├── dataset/
│   └── lerobot/
│       └── aloha_sim_transfer_cube_human/
└── ckpt/

Requirement: root must point directly at the dataset root directory that contains data/ and meta/.
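
To make the path rule concrete, the sketch below resolves the default relative path against a lerobot root and reports which of the two required subdirectories are absent. It is an illustration only; the helper names are hypothetical and the training scripts may validate the dataset differently.

```python
from pathlib import Path

# Default relative dataset path from the configs (see above).
DEFAULT_RELATIVE_ROOT = "../dataset/lerobot/aloha_sim_transfer_cube_human"


def resolve_dataset_root(lerobot_root, relative_root=DEFAULT_RELATIVE_ROOT):
    """Resolve the config's relative path against the lerobot root directory."""
    return (Path(lerobot_root) / relative_root).resolve()


def missing_dataset_dirs(root):
    """Return which of data/ and meta/ are missing directly under `root`."""
    root = Path(root)
    return [name for name in ("data", "meta") if not (root / name).is_dir()]
```

For a workspace at `/ws`, `resolve_dataset_root("/ws/lerobot")` yields `/ws/dataset/lerobot/aloha_sim_transfer_cube_human`; an empty list from `missing_dataset_dirs` on that path means the layout requirement is met.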

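4.4 ResNet18 weight cache

ACT uses the following by default:

pretrained_backbone_weights: ResNet18_Weights.IMAGENET1K_V1

On the first training or evaluation run, PyTorch may try to download resnet18-f37072fd.pth. In an environment without internet access, place the file in the current user's PyTorch weight cache directory in advance, for example: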

~/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth

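On a machine with internet access, download it from the official PyTorch URL:

wget -O resnet18-f37072fd.pth https://download.pytorch.org/models/resnet18-f37072fd.pth

Alternatively, use:

curl -L https://download.pytorch.org/models/resnet18-f37072fd.pth -o resnet18-f37072fd.pth

After downloading, copy the file into the PyTorch weight cache directory on the target machine: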

mkdir -p ~/.cache/torch/hub/checkpoints
cp resnet18-f37072fd.pth ~/.cache/torch/hub/checkpoints/

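If TORCH_HOME is set, the actual cache directory is $TORCH_HOME/hub/checkpoints/. The following command prints the hub directory of the current environment; the weights live in its checkpoints/ subdirectory:

python -c "import torch; print(torch.hub.get_dir())"

If the server has no network access and the weights were not cached in advance, ACT fails during model construction.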
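
A pre-flight check can catch a missing weight file before model construction fails. The sketch below recomputes the checkpoint directory without importing torch, following what is, to the best of my knowledge, torch.hub's default lookup: TORCH_HOME if set, otherwise `$XDG_CACHE_HOME/torch`, otherwise `~/.cache/torch`. The XDG fallback is not mentioned in this README, the function names are hypothetical, and `torch.hub.get_dir()` remains the authoritative answer.

```python
import os
from pathlib import Path

WEIGHT_FILE = "resnet18-f37072fd.pth"


def checkpoint_dir(environ=None):
    """Compute the expected torch hub checkpoint directory from env variables."""
    env = os.environ if environ is None else environ
    cache_home = env.get("XDG_CACHE_HOME", "~/.cache")
    torch_home = env.get("TORCH_HOME", os.path.join(cache_home, "torch"))
    return Path(os.path.expanduser(torch_home)) / "hub" / "checkpoints"


def weight_is_cached(environ=None):
    """Return True if the ResNet18 weight file is present in the cache."""
    return (checkpoint_dir(environ) / WEIGHT_FILE).is_file()
```

Running `weight_is_cached()` on the training machine before launching gives an early warning when the file still needs to be copied over.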

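5. Training Configuration

5.1 Smoke configuration

Config file: src/configs/act_aloha_smoke.yaml
Purpose: quickly verify the environment, data, dependencies, and multi-card training pipeline
Key parameters:
- steps: 20
- wandb.enable: false

Launch: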

./manipulation/act/train/src/scripts/run_train.sh act_aloha_smoke --port 29510

5.2 Long-run training configuration

Config file: src/configs/act_aloha.yaml
Key parameters:
- dataset.video_backend: torchcodec
- steps: 100000
- batch_size: 8
- num_workers: 4
- wandb.enable: true

Launch:

./manipulation/act/train/src/scripts/run_train.sh act_aloha --port 29510
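
This README does not say what --port controls. Assuming it is the local rendezvous port for the multi-card launch, a launch can fail if another job already holds that port. The sketch below, with the hypothetical helper `port_is_free`, tests whether a TCP port can be bound on the local host.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True
```

If `port_is_free(29510)` returns False, passing a different value to --port is the simplest workaround under the same assumption.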

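6. Evaluation

run_eval.sh is a thin wrapper around lerobot-eval; arguments are passed through unchanged.

For online evaluation, run the MuJoCo simulation and rendering on the CPU and keep policy inference on the NPU.
See doc/README.md for the rationale.

Example: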

export MUJOCO_GL=osmesa
./manipulation/act/train/src/scripts/run_eval.sh \
  --policy.path=/path/to/pretrained_model \
  --policy.device=npu \
  --env.type=aloha \
  --env.task=AlohaTransferCube-v0 \
  --eval.n_episodes=100 \
  --eval.batch_size=20 \
  --output_dir=/path/to/eval_out

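Notes:

- MUJOCO_GL=osmesa makes MuJoCo use CPU software rendering.
- --policy.device=npu keeps the model's forward inference on the NPU.
- Together these give the "simulation on CPU, inference on NPU" setup.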
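
When evaluation is driven from a Python harness rather than a shell, the same split can be expressed by building the argument list and environment explicitly. This is a sketch using only the flags shown in the example above; the function name `build_eval_command` and its parameters are hypothetical.

```python
import os


def build_eval_command(
    policy_path,
    output_dir,
    n_episodes=100,
    batch_size=20,
    script="./manipulation/act/train/src/scripts/run_eval.sh",
):
    """Return (argv, env) for a CPU-rendered, NPU-inference evaluation run."""
    argv = [
        script,
        f"--policy.path={policy_path}",
        "--policy.device=npu",  # policy forward pass stays on the NPU
        "--env.type=aloha",
        "--env.task=AlohaTransferCube-v0",
        f"--eval.n_episodes={n_episodes}",
        f"--eval.batch_size={batch_size}",
        f"--output_dir={output_dir}",
    ]
    # MuJoCo renders in software on the CPU.
    env = dict(os.environ, MUJOCO_GL="osmesa")
    return argv, env
```

The pair can be handed to `subprocess.run(argv, env=env, check=True)` from the repository root.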

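7. Summary of Validated Results

The example now decodes video with torchcodec by default. One validated set of reference results:

- Training task: ACT on lerobot/aloha_sim_transfer_cube_human
- Task environment: AlohaTransferCube-v0
- Dataset size: 50 episodes, 20000 frames
- Training hardware: Ascend Atlas A2, 8 cards
- Training steps: 100000
- Training batch configuration: batch_size: 8 per card, global batch size 64
- Statistics window: W&B train/steps = 5000 ~ 20000
- Evaluation protocol: 5 x 100 episodes
- Overall evaluation success rate: 68.0%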
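
The batch figures above fit together as simple arithmetic. The sketch below spells it out and, assuming each training sample counts as one dataset frame, estimates how many passes over the dataset the full run makes. The variable names are illustrative only.

```python
NUM_CARDS = 8            # Atlas A2 cards used for training
PER_CARD_BATCH = 8       # batch_size in act_aloha.yaml
TRAIN_STEPS = 100_000
DATASET_FRAMES = 20_000

# 8 cards x 8 samples per card = 64 samples per optimizer step.
global_batch = NUM_CARDS * PER_CARD_BATCH

# Total samples consumed over the whole run.
samples_seen = global_batch * TRAIN_STEPS

# Assumption: one sample corresponds to one frame of the dataset.
passes_over_dataset = samples_seen / DATASET_FRAMES
```

Under that assumption the run consumes 6,400,000 samples, or 320 passes over the 20000 frames.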

A quick 100-step throughput check has also been completed. It can serve as the reference best result for the current configuration:

| Scenario | Step range | mean_updt_s | mean_data_s | end-to-end samples/s |
| --- | --- | --- | --- | --- |
| `8 cards x bs64 x torchcodec` | `step 10~100` | `0.3191` | `0.3544` | `760.24` |
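
The table does not state how the samples/s figure was derived. One reading that reproduces it to within rounding is sketched below. It rests on two assumptions that the source does not confirm: "bs64" means a per-card batch size of 64 (unlike the batch_size: 8 of the long-run config), and the end-to-end step time is the sum of the mean data time and the mean update time.

```python
def end_to_end_samples_per_s(mean_updt_s, mean_data_s, num_cards, per_card_batch):
    """Estimate throughput, assuming data wait and update do not overlap."""
    step_time_s = mean_updt_s + mean_data_s
    samples_per_step = num_cards * per_card_batch
    return samples_per_step / step_time_s


# Values from the table; per_card_batch=64 is an assumed reading of "bs64".
estimate = end_to_end_samples_per_s(0.3191, 0.3544, num_cards=8, per_card_batch=64)
```

The estimate comes out at roughly 760.2 samples/s, close to the reported 760.24. If "bs64" instead denoted the global batch size, the same formula would give a value about eight times smaller.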

For more details on the environment, logs, checkpoint paths, and evaluation, see:

doc/README.md

Note:

video_backend is explicitly set to torchcodec in the default configs.

8. W&B Records (placeholder)

9. Common Commands

View the training log

cd ../lerobot
tail -f ../ckpt/logs/train_act_aloha_*.log
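
The glob in the tail command can match several runs. To pick only the most recent log programmatically, a small helper along these lines may be handy. The name `latest_log` is hypothetical, and the default directory and pattern are taken from the command above.

```python
import glob
import os


def latest_log(log_dir="../ckpt/logs", pattern="train_act_aloha_*.log"):
    """Return the most recently modified matching log path, or None if none exist."""
    candidates = glob.glob(os.path.join(log_dir, pattern))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

Like the tail command, the default path assumes the current directory is the lerobot root.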

Resume training

./manipulation/act/train/src/scripts/run_train.sh act_aloha --resume --port 29510

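10. Related Notes

- This example directory does not contain the lerobot source code.
- To extend to other ALOHA datasets later, add a new YAML config.
- This example is modeled on https://gitcode.com/cann/cann-recipes-embodied-ai/blob/master/manipulation/pi05/train/README.md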