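Extends Qwen2.5-Omni-3B to generate ColBERT-style multi-vector representations, supports dynamic image resolution, enables efficient document retrieval, and offers zero-shot audio retrieval. [This summary was AI-generated]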

61e78ef2 created on July 17, 2025 · 12 commits


base_model: vidore/colqwen2.5omni-base
license: mit
library_name: colpali
language:
- en
tags:
- colpali
- vidore
- vidore-experimental
pipeline_tag: visual-document-retrieval

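ColQwen2.5-Omni: Visual+Audio Retriever based on Qwen2.5-Omni-3B-Instruct with ColBERT strategy

Check out the release blog post for an in-depth walkthrough and usage tutorials!

ColQwen-Omni is a model built on a novel architecture and training strategy that uses an omnimodal language model to efficiently index documents from their visual features. It extends Qwen2.5-Omni-3B to generate ColBERT-style multi-vector representations of text and images. The approach was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.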
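As an illustration of how ColBERT-style multi-vector representations are typically compared, here is a minimal late-interaction (MaxSim) sketch in plain Python. The function names and toy vectors are invented for this example; it is not the colpali-engine implementation, which works on batched tensors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best
    (maximum) dot product against any document vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy data: 2 query vectors, and two documents with 3 and 1 vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

scores = [maxsim_score(query, doc) for doc in (doc_a, doc_b)]
print(scores)  # doc_a matches both query vectors exactly, so it ranks first
```

Because each query vector is matched independently, documents may have different numbers of vectors, which is what lets pages, images, and audio clips of varying size share one scoring rule.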

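Version specificity

This model accepts dynamic input image resolutions and, unlike ColPali, does not resize images in a way that changes their aspect ratio. The maximum resolution is set so that at most 1024 image patches are created. Experiments show clear improvements with larger numbers of image patches, at the cost of higher memory requirements.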
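To make the patch budget concrete, the sketch below picks a target size that keeps the aspect ratio roughly intact while staying under a cap of 1024 patches. The 28-pixel patch side and the function name `fit_to_patch_budget` are assumptions made for illustration; the actual preprocessing is governed by the processor configuration shipped with the model.

```python
import math

PATCH = 28          # assumed pixel side covered by one image patch (illustrative)
MAX_PATCHES = 1024  # patch cap stated in the text


def fit_to_patch_budget(width, height, patch=PATCH, max_patches=MAX_PATCHES):
    """Return (new_width, new_height), both multiples of `patch`, with
    approximately the original aspect ratio and at most `max_patches` patches."""
    cols = max(1, round(width / patch))
    rows = max(1, round(height / patch))
    if cols * rows > max_patches:
        # Shrink both sides by the same factor so the grid fits the budget.
        scale = math.sqrt(max_patches / (cols * rows))
        rows = min(max_patches, max(1, math.floor(rows * scale)))
        cols = max(1, min(math.floor(cols * scale), max_patches // rows))
    return cols * patch, rows * patch


small = fit_to_patch_budget(280, 560)    # already within budget
large = fit_to_patch_budget(4000, 3000)  # must be scaled down
print(small, large)
```

Small images keep their patch grid unchanged, while large ones are scaled down uniformly, which is the trade-off between retrieval quality and memory that the text describes.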

This version is trained with colpali-engine==0.3.11.

The training data is the same as the ColPali data described in the paper.

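Model Training

Dataset

The audio retrieval capabilities are acquired zero-shot, since the entire training data consists purely of image-text matching pairs. The audio and vision towers are frozen during training.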
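Freezing a sub-network in PyTorch usually amounts to turning off gradients for its parameters. The sketch below shows the idea on a toy module; the class `ToyOmniModel`, its attribute names, and the helper `freeze_by_prefix` are hypothetical and do not reflect the real module names or the actual training script.

```python
import torch.nn as nn


class ToyOmniModel(nn.Module):
    """Stand-in model with hypothetical submodule names."""

    def __init__(self):
        super().__init__()
        self.audio_tower = nn.Linear(8, 8)
        self.visual = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)


def freeze_by_prefix(model, prefixes):
    """Disable gradients for every parameter whose name starts with a prefix."""
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False
            frozen.append(name)
    return frozen


model = ToyOmniModel()
frozen = freeze_by_prefix(model, ["audio_tower", "visual"])
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print("frozen:", frozen)
print("trainable:", trainable)
```

With the towers frozen, only the remaining parameters receive gradient updates, so the audio encoder is used as-is at inference time.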

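Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset (37%) made up of pages from web-crawled PDF documents, augmented with pseudo-questions generated by a VLM (Claude-3 Sonnet). The training set is fully English by design, which lets us study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used in both ViDoRe and the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.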
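For a rough sense of scale, the proportions above can be turned into approximate counts, and a 2% hold-out can be drawn with a seeded shuffle. This is only a sketch: the percentages in the text are rounded, the seed is arbitrary, and it assumes the validation samples are taken from the 127,460 pairs, which the text does not state explicitly.

```python
import random

TOTAL_PAIRS = 127_460  # from the text

# Approximate counts; the published percentages are rounded.
academic = round(TOTAL_PAIRS * 0.63)
synthetic = TOTAL_PAIRS - academic
val_size = round(TOTAL_PAIRS * 0.02)

# Seeded shuffle so the hold-out is reproducible (seed value is arbitrary).
rng = random.Random(0)
indices = list(range(TOTAL_PAIRS))
rng.shuffle(indices)
val_idx, train_idx = indices[:val_size], indices[val_size:]

print(academic, synthetic, val_size, len(train_idx))
```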

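Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training as well.

Usage

Make sure colpali-engine is installed from source or at a version newer than 0.3.11.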

pip install git+https://github.com/illuin-tech/colpali
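If it is unclear which version is installed, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper names are made up, and the parser only looks at the leading numeric release segment, so `packaging.version` would be a more robust choice for unusual version strings.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_EXCLUSIVE = (0, 3, 11)  # the text asks for a version newer than 0.3.11


def release_tuple(version_string):
    """Parse the leading numeric release, e.g. '0.3.13.dev5' -> (0, 3, 13)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_newer_than_min(version_string):
    """True if the release is strictly newer than MIN_EXCLUSIVE."""
    return release_tuple(version_string) > MIN_EXCLUSIVE


try:
    installed = version("colpali-engine")
    status = "OK" if is_newer_than_min(installed) else "too old"
    print(f"colpali-engine {installed}: {status}")
except PackageNotFoundError:
    print("colpali-engine is not installed")
```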


import torch
from datasets import load_dataset
from IPython.display import Audio, display
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5Omni, ColQwen2_5OmniProcessor

model = ColQwen2_5Omni.from_pretrained(
    "vidore/colqwen-omni-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2_5OmniProcessor.from_pretrained("vidore/colqwen-omni-v0.1")

dataset = load_dataset("eustlb/dailytalk-conversations-grouped", split="train[:500]")
audios = [x["array"] for x in dataset["audio"]]


dataloader = DataLoader(
    dataset=audios,
    batch_size=2,
    shuffle=False,
    collate_fn=lambda x: processor.process_audios(x),
)

# Embed every audio clip; ds collects one multi-vector embedding per clip
ds = []
for batch_doc in tqdm(dataloader):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings_doc = model(**batch_doc)
    ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

def get_results(query: str, k=10):
    batch_queries = processor.process_queries([query]).to(model.device)

    # Forward pass
    with torch.no_grad():
        query_embeddings = model(**batch_queries)

    scores = processor.score_multi_vector(query_embeddings, ds)
    # Indices of the k highest-scoring documents
    return scores[0].topk(k).indices.tolist()

res = get_results("A person looking for a taxi")

# In a notebook (e.g. Colab), play the best-matching clip
display(Audio(dataset[res[0]]["audio"]["array"], autoplay=True, rate=dataset[res[0]]["audio"]["sampling_rate"]))

Contact

Manuel Faysse: manuel.faysse@illuin.tech
Antonio Loison: antonio.loison@illuin.tech

Citation

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}