MolmoAct-7B-D-LIBERO-Spatial-0812: An Open-Source Vision-Language-Action Model for Reasoning in Robot Manipulation Tasks

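An open-source action reasoning model for robotic manipulation, built on Qwen2.5-7B with a SigLip2 vision backbone. It reports leading results on multiple benchmarks and supports spatial reasoning and safety auditing of robot behavior. (This summary was AI-generated.)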


base_model:
  • Qwen/Qwen2.5-7B
  • google/siglip2-so400m-patch14-384
language:
  • en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
tags:
  • molmoact
  • molmo
  • olmo
  • reasoning
  • vla
  • robotics
  • manipulation
paper: 2508.07917


MolmoAct 7B-D LIBERO-Spatial

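MolmoAct is a fully open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI. It is trained on a subset of the OXE dataset and on the MolmoAct Dataset. The latter contains 10k high-quality trajectories of a single-arm Franka robot performing 93 unique manipulation tasks in household and tabletop environments. While being fully open-source, MolmoAct achieves state-of-the-art performance among vision-language-action models on multiple benchmarks. You can find all models in the MolmoAct family here.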

Learn more about MolmoAct in our announcement blog post or the paper (arXiv:2508.07917).

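MolmoAct 7B-D LIBERO-Spatial is built on Qwen2.5-7B and uses SigLip2 as its vision backbone, initialized with Molmo's pre-training approach. The model is first pre-trained on MolmoAct's Pre-training Mixture, then mid-trained on the MolmoAct Dataset, and finally post-trained on LIBERO-Spatial. This checkpoint is intended for reproducing our experimental results on LIBERO-Spatial.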

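This checkpoint is a preview of the MolmoAct release. All artifacts used to create MolmoAct (data, training code, evaluations, intermediate checkpoints, and so on) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.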

Quick Start

To run MolmoAct, first install the dependencies:

pip install einops torchvision accelerate
pip install transformers==4.52
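
Because the install command pins transformers to 4.52, it can help to confirm the installed version before loading the checkpoint's remote code. The following is a minimal sketch, not part of the MolmoAct release: the helper names `matches_pin` and `installed_transformers_version` are hypothetical, and treating patch releases such as 4.52.x as acceptable is an assumption that this README does not confirm.

```python
from importlib import metadata
from typing import Optional

# Version pinned by the install command in this README.
REQUIRED_PREFIX = "4.52"


def matches_pin(installed: str, required_prefix: str = REQUIRED_PREFIX) -> bool:
    """Return True if `installed` is the pinned version or a patch release of it.

    Accepting patch releases (for example 4.52.x) is an assumption.
    """
    return installed == required_prefix or installed.startswith(required_prefix + ".")


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed")
elif not matches_pin(installed):
    print(f"transformers {installed} found, but this README pins {REQUIRED_PREFIX}")
else:
    print(f"transformers {installed} matches the pin")
```

The prefix comparison appends a dot before matching so that a version like 4.520.0 is not mistaken for the pinned release.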

Then run the following example:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image
import requests
from io import BytesIO

ckpt = "allenai/MolmoAct-7B-D-LIBERO-Spatial-0812"

# load the processor
processor = AutoProcessor.from_pretrained(
    ckpt,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
    padding_side="left",
)

# load the model
model = AutoModelForImageTextToText.from_pretrained(
    ckpt,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

# task instruction
instruction = "pick up the black bowl on the stove and place it on the plate"

# strictly follow this reasoning prompt
prompt = (
    f"The task is {instruction}. "
    "What is the action that the robot should take. "
    f"To figure out the action that the robot should take to {instruction}, "
    "let's think through it step by step. "
    "First, what is the depth map for the first image? "
    "Second, what is the trajectory of the end effector in the first image? "
    "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
    "along with other images from different camera views as additional information, "
    "what is the action that the robot should take?"
)

# apply chat template
text = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [dict(type="text", text=prompt)]
        }
    ], 
    tokenize=False, 
    add_generation_prompt=True,
)

# image observation (side + wrist)
url1 = "https://huggingface.co/allenai/MolmoAct-7B-D-LIBERO-Spatial/resolve/main/example_1.png"
url2 = "https://huggingface.co/allenai/MolmoAct-7B-D-LIBERO-Spatial/resolve/main/example_2.png"
r1 = requests.get(url1, headers={"User-Agent": "python-requests"}, timeout=30)
r1.raise_for_status()
r2 = requests.get(url2, headers={"User-Agent": "python-requests"}, timeout=30)
r2.raise_for_status()
img1 = Image.open(BytesIO(r1.content)).convert("RGB")
img2 = Image.open(BytesIO(r2.content)).convert("RGB")
imgs = [img1, img2]

# process the image and text
inputs = processor(
    images=[imgs],
    text=text,
    padding=True,
    return_tensors="pt",
)

# move inputs to the correct device
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# generate output
with torch.inference_mode():
    with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
        generated_ids = model.generate(**inputs, max_new_tokens=512)

# only get generated tokens; decode them to text
generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]
generated_text = processor.batch_decode(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# print the generated text
print(f"generated text: {generated_text}")

# >>>  The depth map of the first image is ... The trajectory of the end effector in the first image is ...
#      Based on these information, along with other images from different camera views as additional information,
#      the action that the robot should take is ...

# parse out all depth perception tokens
depth = model.parse_depth(generated_text)
print(f"generated depth perception tokens: {depth}")

# >>>  [ "<DEPTH_START><DEPTH_1><DEPTH_2>...<DEPTH_END>" ]

# parse out all visual reasoning traces
trace = model.parse_trace(generated_text)
print(f"generated visual reasoning trace: {trace}")

# >>>  [ [[242, 115], [140, 77], [94, 58], [140, 44], [153, 26]] ]

# parse out all actions, unnormalizing with key "libero_spatial_no_noops_modified"
action = model.parse_action(generated_text, unnorm_key="libero_spatial_no_noops_modified")
print(f"generated action: {action}")

# >>>  [ [0.0732076061122558, 0.08228153779226191, -0.027760173818644346, 
#         0.15932856272248652, -0.09686601126895233, 0.043916773912953344, 
#         0.996078431372549], ... ]
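
The parsed action in the example output is a list of 7-value vectors. A downstream controller usually wants those values named rather than indexed. The sketch below shows one way to do that under a loudly stated assumption: the layout `[dx, dy, dz, droll, dpitch, dyaw, gripper]` is a common convention for end-effector control, but this README does not confirm it for MolmoAct. `EndEffectorCommand` and `split_action` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class EndEffectorCommand:
    """One action step. The field layout is an assumption, not confirmed by this README."""
    translation: Tuple[float, float, float]  # assumed (dx, dy, dz)
    rotation: Tuple[float, float, float]     # assumed (droll, dpitch, dyaw)
    gripper: float                           # assumed gripper command


def split_action(action: Sequence[float]) -> EndEffectorCommand:
    """Split one 7-value action vector into named parts, rejecting other lengths."""
    if len(action) != 7:
        raise ValueError(f"expected 7 values per action, got {len(action)}")
    values = [float(v) for v in action]
    return EndEffectorCommand(
        translation=(values[0], values[1], values[2]),
        rotation=(values[3], values[4], values[5]),
        gripper=values[6],
    )


def split_actions(actions: Sequence[Sequence[float]]) -> List[EndEffectorCommand]:
    """Apply split_action to every step returned by the action parser."""
    return [split_action(a) for a in actions]


# First action vector from the example output above.
sample = [
    0.0732076061122558, 0.08228153779226191, -0.027760173818644346,
    0.15932856272248652, -0.09686601126895233, 0.043916773912953344,
    0.996078431372549,
]
command = split_action(sample)
print(command)
```

Failing fast on an unexpected length keeps a malformed or truncated generation from reaching the robot as a partial command.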

License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use. For more information, please see our Responsible Use Guidelines.

Model and Hardware Safety

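MolmoAct lets users inspect a visual trace of its intended actions in space before those actions occur. This allows users to keep behavior safe by proactively auditing and adjusting the actions of any hardware that runs under the model's instructions. MolmoAct's action space is bounded within the range of the data provided, and compliance is built into the model to prevent excessive force when resistance is detected. When using this model with a robot, follow the hardware manufacturer's guidelines and perform all operations in a safely configured environment.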
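
The audit step described above can be partly automated before any command reaches the hardware. The sketch below checks a single trace, given as a list of `[x, y]` waypoints, against two simple rules: every waypoint lies inside the frame, and no single step is longer than a limit. It is an illustration only. `audit_trace` is a hypothetical helper, the 256x256 frame and the step limit of 120 are placeholder values, and the coordinate frame of the trace is an assumption that should be checked against your own setup. Note that the quick-start example's trace parser returns a list of traces, so each one would be audited separately.

```python
import math
from typing import List, Sequence


def audit_trace(
    trace: Sequence[Sequence[float]],
    width: float,
    height: float,
    max_step: float,
) -> List[str]:
    """Return human-readable problems found in one trace (empty list means none).

    `width`, `height`, and `max_step` use the same units as the trace
    coordinates; which units those are is an assumption left to the caller.
    """
    problems: List[str] = []
    # Rule 1: every waypoint must fall inside the frame.
    for i, (x, y) in enumerate(trace):
        if not (0 <= x < width and 0 <= y < height):
            problems.append(
                f"waypoint {i} at ({x}, {y}) is outside the {width}x{height} frame"
            )
    # Rule 2: consecutive waypoints must not be further apart than max_step.
    for i in range(1, len(trace)):
        (x0, y0), (x1, y1) = trace[i - 1], trace[i]
        step = math.hypot(x1 - x0, y1 - y0)
        if step > max_step:
            problems.append(
                f"step {i - 1}->{i} spans {step:.1f} units, above the limit of {max_step}"
            )
    return problems


# Example trace taken from the quick-start output; frame size and limit are placeholders.
example_trace = [[242, 115], [140, 77], [94, 58], [140, 44], [153, 26]]
problems = audit_trace(example_trace, width=256, height=256, max_step=120)
print(problems if problems else "trace passed")
```

A check like this complements, and does not replace, visual inspection by a person and the safety measures recommended by the hardware manufacturer.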

Citation

@misc{molmoact2025,
      title={MolmoAct: Action Reasoning Models that can Reason in Space}, 
      author={Jason Lee and Jiafei Duan and Haoquan Fang and Yuquan Deng and Shuo Liu and Boyang Li and Bohan Fang and Jieyu Zhang and Yi Ru Wang and Sangho Lee and Winson Han and Wilbert Pumacay and Angelica Wu and Rose Hendrix and Karen Farley and Eli VanderBilt and Ali Farhadi and Dieter Fox and Ranjay Krishna},
      year={2025},
      eprint={2508.07917},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.07917}
}
