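An open-source action reasoning model for robotic manipulation, built on Qwen2.5-7B with a SigLip2 vision backbone. It reports state-of-the-art results on multiple benchmarks and supports spatial reasoning and safety auditing of robot behavior.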

base_model:
- Qwen/Qwen2.5-7B
- google/siglip2-so400m-patch14-384
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
tags:
- molmoact
- molmo
- olmo
- reasoning
- vla
- robotics
- manipulation
paper: 2508.07917


MolmoAct 7B-D LIBERO-Spatial

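MolmoAct is a fully open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI. It is trained on a subset of OXE and on the MolmoAct Dataset, which contains 10k high-quality trajectories of a single-arm Franka robot performing 93 unique manipulation tasks in home and tabletop environments. While being fully open-source, MolmoAct achieves state-of-the-art performance among vision-language-action models on multiple benchmarks. All models in the MolmoAct family are publicly available.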

Learn more about MolmoAct in our announcement blog post or the paper.

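MolmoAct 7B-D LIBERO-Spatial is based on Qwen2.5-7B and uses SigLip2 as its vision backbone, initialized with Molmo's pre-training approach. The model is first pre-trained on MolmoAct's Pre-training Mixture, then mid-trained on the MolmoAct Dataset, and finally post-trained on LIBERO-Spatial. It is intended for replicating our experimental results on LIBERO-Spatial.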

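This checkpoint is a preview of the MolmoAct release. All artifacts used to create MolmoAct (data, training code, evaluations, intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.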

Quick links:

📂 All Models
📂 All Data
📄 Paper
💻 Code
🎥 Blog Post
🎥 Video

Quick Start

To run MolmoAct, first install the dependencies:

pip install einops torchvision accelerate
pip install transformers==4.52
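
Because the install command pins `transformers==4.52`, it can help to confirm which version is actually importable before loading the checkpoint. The snippet below is a minimal sketch and not part of the official instructions: the helper names `matches_pin` and `check_transformers` are hypothetical, and the comparison only looks at the major and minor version components.

```python
from importlib import metadata


def matches_pin(installed: str, pinned: str = "4.52") -> bool:
    """Return True when `installed` has the same major.minor as `pinned`."""
    return installed.split(".")[:2] == pinned.split(".")[:2]


def check_transformers(pinned: str = "4.52") -> str:
    """Describe whether the installed transformers matches the pinned version."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    status = "matches" if matches_pin(installed, pinned) else "differs from"
    return f"transformers {installed} {status} the pinned {pinned}"


if __name__ == "__main__":
    print(check_transformers())
```

A mismatch does not necessarily mean the model will fail to load, but the remote code shipped with the checkpoint was written against the pinned version.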

Then, follow these steps:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image
import requests
from io import BytesIO

ckpt = "allenai/MolmoAct-7B-D-LIBERO-Spatial-0812"

# load the processor
processor = AutoProcessor.from_pretrained(
    ckpt,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
    padding_side="left",
)

# load the model
model = AutoModelForImageTextToText.from_pretrained(
    ckpt,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

# task instruction
instruction = "pick up the black bowl on the stove and place it on the plate"

# strictly follow this reasoning prompt
prompt = (
    f"The task is {instruction}. "
    "What is the action that the robot should take. "
    f"To figure out the action that the robot should take to {instruction}, "
    "let's think through it step by step. "
    "First, what is the depth map for the first image? "
    "Second, what is the trajectory of the end effector in the first image? "
    "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
    "along with other images from different camera views as additional information, "
    "what is the action that the robot should take?"
)

# apply chat template
text = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [dict(type="text", text=prompt)]
        }
    ], 
    tokenize=False, 
    add_generation_prompt=True,
)

# image observation (side + wrist)
url1 = "https://huggingface.co/allenai/MolmoAct-7B-D-LIBERO-Spatial/resolve/main/example_1.png"
url2 = "https://huggingface.co/allenai/MolmoAct-7B-D-LIBERO-Spatial/resolve/main/example_2.png"
r1 = requests.get(url1, headers={"User-Agent": "python-requests"}, timeout=30)
r1.raise_for_status()
r2 = requests.get(url2, headers={"User-Agent": "python-requests"}, timeout=30)
r2.raise_for_status()
img1 = Image.open(BytesIO(r1.content)).convert("RGB")
img2 = Image.open(BytesIO(r2.content)).convert("RGB")
imgs = [img1, img2]

# process the image and text
inputs = processor(
    images=[imgs],
    text=text,
    padding=True,
    return_tensors="pt",
)

# move inputs to the correct device
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# generate output
with torch.inference_mode():
    with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
        generated_ids = model.generate(**inputs, max_new_tokens=512)

# only get generated tokens; decode them to text
generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]
generated_text = processor.batch_decode(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# print the generated text
print(f"generated text: {generated_text}")

# >>>  The depth map of the first image is ... The trajectory of the end effector in the first image is ...
#      Based on these information, along with other images from different camera views as additional information,
#      the action that the robot should take is ...

# parse out all depth perception tokens
depth = model.parse_depth(generated_text)
print(f"generated depth perception tokens: {depth}")

# >>>  [ "<DEPTH_START><DEPTH_1><DEPTH_2>...<DEPTH_END>" ]

# parse out all visual reasoning traces
trace = model.parse_trace(generated_text)
print(f"generated visual reasoning trace: {trace}")

# >>>  [ [[242, 115], [140, 77], [94, 58], [140, 44], [153, 26]] ]

# parse out all actions, unnormalizing with key "libero_spatial_no_noops_modified"
action = model.parse_action(generated_text, unnorm_key="libero_spatial_no_noops_modified")
print(f"generated action: {action}")

# >>>  [ [0.0732076061122558, 0.08228153779226191, -0.027760173818644346, 
#         0.15932856272248652, -0.09686601126895233, 0.043916773912953344, 
#         0.996078431372549], ... ]
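
The example asks that the reasoning prompt be followed strictly, so when running several tasks it may be convenient to wrap the template in a function instead of editing the string by hand. The following is a sketch under that assumption: `build_reasoning_prompt` is a hypothetical helper name, the template text is copied from the example above, and trimming whitespace and a trailing period from the instruction is an added convenience rather than documented behavior.

```python
def build_reasoning_prompt(instruction: str) -> str:
    """Fill a task instruction into the fixed reasoning prompt from the example."""
    # Assumption: normalize the instruction so the template's own punctuation
    # is not doubled (e.g. "open the drawer." -> "open the drawer").
    instruction = instruction.strip().rstrip(".").strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    return (
        f"The task is {instruction}. "
        "What is the action that the robot should take. "
        f"To figure out the action that the robot should take to {instruction}, "
        "let's think through it step by step. "
        "First, what is the depth map for the first image? "
        "Second, what is the trajectory of the end effector in the first image? "
        "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
        "along with other images from different camera views as additional information, "
        "what is the action that the robot should take?"
    )


if __name__ == "__main__":
    print(build_reasoning_prompt("pick up the black bowl on the stove and place it on the plate"))
```

The returned string can be passed as the `text` field of the user message given to `processor.apply_chat_template`, exactly as `prompt` is in the example.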

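License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use. For more information, please see our Responsible Use Guidelines.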

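Model and Hardware Safety

MolmoAct lets users inspect a visual trace of its intended actions in space before those actions are executed. Users can therefore audit and adjust the actions of any hardware operating under the model's instructions in advance, which helps ensure safe behavior. MolmoAct's action space is bounded within the data provided, and compliance is built into the model to prevent excessive force when resistance is detected. When using this model with a robot, follow the hardware manufacturer's guidelines and perform all operations in a safely configured environment.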
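
One way to inspect the visual trace before executing an action is to draw the points returned by `model.parse_trace` onto the camera image. The sketch below is illustrative only: `scale_trace` and `draw_trace` are hypothetical helpers, and it assumes the trace points are expressed on a fixed square grid (the `grid` parameter, 256 by default here) that has to be rescaled to the image size. Check the model's processing code for the actual coordinate convention before relying on it.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def scale_trace(
    trace: Sequence[Sequence[float]], width: int, height: int, grid: int = 256
) -> List[Point]:
    """Map trace points from an assumed grid x grid frame to pixel coordinates."""
    return [
        (round(x * width / grid), round(y * height / grid))
        for x, y in trace
    ]


def draw_trace(image, trace: Sequence[Sequence[float]], grid: int = 256):
    """Return a copy of a PIL image with the trace drawn as a polyline."""
    # Pillow is imported lazily so scale_trace stays dependency-free.
    from PIL import ImageDraw

    canvas = image.copy()
    points = scale_trace(trace, canvas.width, canvas.height, grid)
    draw = ImageDraw.Draw(canvas)
    if len(points) > 1:
        draw.line(points, fill=(255, 0, 0), width=3)
    for x, y in points:
        # Mark each waypoint so the order of motion is easier to follow.
        draw.ellipse((x - 4, y - 4, x + 4, y + 4), outline=(255, 255, 0))
    return canvas
```

With the variables from the Quick Start example, `draw_trace(img1, trace[0]).save("trace_preview.png")` would write an annotated copy of the first camera view for review; the output file name is arbitrary.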

Citation

@misc{molmoact2025,
      title={MolmoAct: Action Reasoning Models that can Reason in Space}, 
      author={Jason Lee and Jiafei Duan and Haoquan Fang and Yuquan Deng and Shuo Liu and Boyang Li and Bohan Fang and Jieyu Zhang and Yi Ru Wang and Sangho Lee and Winson Han and Wilbert Pumacay and Angelica Wu and Rose Hendrix and Karen Farley and Eli VanderBilt and Ali Farhadi and Dieter Fox and Ranjay Krishna},
      year={2025},
      eprint={2508.07917},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.07917}
}