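Voxtral-Mini-3B-2507-ONNX: multilingual audio-to-text with transcription, translation, and understanding, 32k context

An enhanced version of Ministral 3B with strong audio understanding. Supports automatic multilingual language detection and transcription, handles 30 minutes of audio within a 32k context, integrates question answering and summarization, and can trigger function calls directly from voice. (This summary was AI-generated.)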


language:
  • en
  • fr
  • de
  • es
  • it
  • pt
  • nl
  • hi
license: apache-2.0
library_name: transformers.js
base_model:
  • mistralai/Voxtral-Mini-3B-2507
pipeline_tag: audio-text-to-text

Voxtral Mini 1.0 (3B) - 2507

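Voxtral Mini is an enhanced version of Ministral 3B that adds state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation, and audio understanding.

This repository contains ONNX weights for the original model, mistralai/Voxtral-Mini-3B-2507.

Learn more about Voxtral in Mistral's blog post.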

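Key Features

Voxtral builds on Ministral-3B with powerful audio understanding capabilities.

  • Dedicated transcription mode: Voxtral can run in a pure speech transcription mode to maximize performance. By default, it automatically predicts the source audio language and transcribes the text accordingly
  • Long-form context: with a 32k-token context length, Voxtral handles up to 30 minutes of audio for transcription, or 40 minutes for audio understanding
  • Built-in Q&A and summarization: ask questions directly through audio. Voxtral analyzes audio and generates structured summaries without separate speech recognition (ASR) and language models
  • Natively multilingual: automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
  • Function calling straight from voice: directly triggers backend functions, workflows, or API calls based on the user's spoken intent
  • Strong text capabilities: retains the text understanding of its language model backbone, Ministral-3B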

Benchmark Results

Audio

Average word error rate (WER) on the FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks:

[Figure: audio benchmark results]

Text

[Figure: text benchmark results]

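Usage

Notes

  • Use temperature=0.2 and top_p=0.95 for chat completion (e.g. audio understanding), and temperature=0.0 for transcription
  • Multiple audios per message and multiple user turns with audio are supported
  • System prompts are not yet supported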
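
The notes above can be captured in two small helpers. This is a minimal sketch, not part of the library: the names `generationOptions` and `assertNoSystemPrompt` are hypothetical, and mapping temperature=0.0 to greedy decoding via `do_sample: false` is an assumption about how to express that setting with generation options.

```javascript
// Hypothetical helper: returns sampling settings matching the notes above.
function generationOptions(task) {
    if (task === "transcription") {
        // Assumption: temperature=0.0 is expressed as greedy decoding (no sampling).
        return { do_sample: false };
    }
    if (task === "chat") {
        // Chat completion, e.g. audio understanding.
        return { do_sample: true, temperature: 0.2, top_p: 0.95 };
    }
    throw new Error(`Unknown task: ${task}`);
}

// Hypothetical helper: system prompts are not supported yet,
// so reject them before applying the chat template.
function assertNoSystemPrompt(conversation) {
    if (conversation.some((message) => message.role === "system")) {
        throw new Error("System prompts are not supported by this model yet.");
    }
    return conversation;
}
```

The returned object could then be spread into the generation call, for example `model.generate({ ...inputs, ...generationOptions("chat"), max_new_tokens: 256 })`.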

Transformers.js

Online Demo

Try it out with our online demo.

Code Snippets

If you haven't already, you can install the Transformers.js JavaScript library from NPM with:

npm i @huggingface/transformers

Example: Transcription

import { VoxtralForConditionalGeneration, VoxtralProcessor, TextStreamer, read_audio } from "@huggingface/transformers";

// Load the processor and model
const model_id = "onnx-community/Voxtral-Mini-3B-2507-ONNX";
const processor = await VoxtralProcessor.from_pretrained(model_id);
const model = await VoxtralForConditionalGeneration.from_pretrained(
    model_id,
    {
        dtype: {
            embed_tokens: "fp16", // "fp32", "fp16", "q8", "q4"
            audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
            decoder_model_merged: "q4", // "q4", "q4f16"
        },
        device: "webgpu",
    },
);

// Prepare the conversation
const conversation = [
    {
        "role": "user",
        "content": [
            { "type": "audio" },
            { "type": "text", "text": "lang:en [TRANSCRIBE]" },
        ],
    }
];
const text = processor.apply_chat_template(conversation, { tokenize: false });
const audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
const inputs = await processor(text, audio);

// Generate the response
const generated_ids = await model.generate({
    ...inputs,
    max_new_tokens: 256,
    streamer: new TextStreamer(processor.tokenizer, { skip_special_tokens: true, skip_prompt: true }),
});

// Decode the generated tokens
const new_tokens = generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]);
const generated_texts = processor.batch_decode(
    new_tokens,
    { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// I have a dream that one day this nation will rise up and live out the true meaning of its creed.
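
The transcription prompt above hard-codes English. A minimal sketch of a helper that builds the same conversation for another language follows. The name `transcriptionConversation` is hypothetical, and it assumes the `lang:<code> [TRANSCRIBE]` pattern carries over to the other language codes in this card's metadata, which the example above only demonstrates for `en`.

```javascript
// Language codes listed in this model card's metadata.
const SUPPORTED_LANGUAGES = ["en", "fr", "de", "es", "it", "pt", "nl", "hi"];

// Hypothetical helper: builds a single-turn transcription conversation.
// Assumption: the "lang:<code> [TRANSCRIBE]" prompt shown for English
// generalizes to the other listed language codes.
function transcriptionConversation(lang) {
    if (!SUPPORTED_LANGUAGES.includes(lang)) {
        throw new Error(`Unsupported language code: ${lang}`);
    }
    return [
        {
            role: "user",
            content: [
                { type: "audio" },
                { type: "text", text: `lang:${lang} [TRANSCRIBE]` },
            ],
        },
    ];
}
```

The result can be passed to `processor.apply_chat_template(...)` in place of the literal `conversation` array used above.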

Example: Audio Understanding

import { VoxtralForConditionalGeneration, VoxtralProcessor, TextStreamer, read_audio } from "@huggingface/transformers";

// Load the processor and model
const model_id = "onnx-community/Voxtral-Mini-3B-2507-ONNX";
const processor = await VoxtralProcessor.from_pretrained(model_id);
const model = await VoxtralForConditionalGeneration.from_pretrained(
    model_id,
    {
        dtype: {
            embed_tokens: "fp16", // "fp32", "fp16", "q8", "q4"
            audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
            decoder_model_merged: "q4", // "q4", "q4f16"
        },
        device: "webgpu",
    },
);

// Prepare the conversation
const conversation = [
    {
        "role": "user",
        "content": [
            { "type": "audio" },
            { "type": "audio" },
            { "type": "text", "text": "Describe these two audio clips in detail." },
        ],
    }
];
const text = processor.apply_chat_template(conversation, { tokenize: false });
const audio = await Promise.all([
    read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000),
    read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000),
]);
const inputs = await processor(text, audio);

// Generate the response
const generated_ids = await model.generate({
    ...inputs,
    max_new_tokens: 256,
    streamer: new TextStreamer(processor.tokenizer, { skip_special_tokens: true, skip_prompt: true }),
});

// Decode the generated tokens
const new_tokens = generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]);
const generated_texts = processor.batch_decode(
    new_tokens,
    { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// The first audio clip is a speech by a leader, likely a politician or a public figure, addressing a large audience. The speaker begins by encouraging the listeners to ask not what their country can do for them, but what they can do for their country. This is a call to action and a reminder of the individual's responsibility to contribute to the nation's well-being. The second audio clip is a passionate speech by a different leader, possibly a civil rights activist or a community organizer. This speaker expresses a dream of a nation that will rise up and live out the true meaning of its creed, suggesting a vision of a more just and equitable society.
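
The usage notes mention that multiple user turns with audio are supported. The sketch below shows what such a conversation might look like, plus a hypothetical `countAudioPlaceholders` helper for checking that the number of `{ type: "audio" }` entries matches the number of clips you intend to pass to the processor. The assistant turn's plain-string content and the prompt texts are illustrative assumptions, not taken from the original card.

```javascript
// Hypothetical helper: counts audio placeholders across all turns.
function countAudioPlaceholders(conversation) {
    let count = 0;
    for (const message of conversation) {
        // Plain-string content carries no audio placeholders.
        if (!Array.isArray(message.content)) continue;
        for (const part of message.content) {
            if (part.type === "audio") count += 1;
        }
    }
    return count;
}

// Illustrative multi-turn conversation with one audio clip per user turn.
const multiTurn = [
    {
        role: "user",
        content: [
            { type: "audio" },
            { type: "text", text: "Summarize this clip." },
        ],
    },
    // Assumption: a previous model reply is represented as plain text.
    { role: "assistant", content: "(previous model reply)" },
    {
        role: "user",
        content: [
            { type: "audio" },
            { type: "text", text: "How does this second clip compare?" },
        ],
    },
];

console.log(countAudioPlaceholders(multiTurn));
```

Comparing this count against the length of the audio array before calling the processor is a cheap way to catch a mismatch early.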
