Kalosm Sound

Kalosm Sound is a collection of audio models and utilities for the Kalosm framework. It supports several voice activity detection models, and provides utilities for transcribing audio into text.

Sound Streams

Models in Kalosm Sound work with any [AsyncSource]. You can use [MicInput::stream] to stream audio from the microphone, or use any synchronous audio source that implements [rodio::Source], such as an MP3 or WAV file.

You can transform the audio streams with:

  • [VoiceActivityDetectorExt::voice_activity_stream]: Detect voice activity in the audio data
  • [DenoisedExt::denoise_and_detect_voice_activity]: Denoise the audio data and detect voice activity
  • [AsyncSourceTranscribeExt::transcribe]: Chunk an audio stream based on voice activity and then transcribe the chunked audio data
  • [VoiceActivityStreamExt::rechunk_voice_activity]: Chunk an audio stream based on voice activity
  • [VoiceActivityStreamExt::filter_voice_activity]: Filter chunks of audio data based on voice activity
  • [TranscribeChunkedAudioStreamExt::transcribe]: Transcribe a chunked audio stream
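
To make the idea behind filtering on voice activity concrete, here is a minimal, standard-library-only sketch. It does not use Kalosm's types or reflect its implementation: `ScoredChunk`, `filter_by_voice_activity`, and the `0.5` threshold are hypothetical names and values chosen only for illustration.

```rust
/// Hypothetical stand-in for a VAD output: audio samples plus a speech probability.
/// This is NOT a Kalosm type.
#[derive(Debug, Clone, PartialEq)]
struct ScoredChunk {
    samples: Vec<f32>,
    probability: f32,
}

/// Keep only the chunks whose speech probability is at least `threshold`.
fn filter_by_voice_activity(chunks: Vec<ScoredChunk>, threshold: f32) -> Vec<ScoredChunk> {
    chunks
        .into_iter()
        .filter(|chunk| chunk.probability >= threshold)
        .collect()
}

fn main() {
    let chunks = vec![
        ScoredChunk { samples: vec![0.0; 4], probability: 0.05 },
        ScoredChunk { samples: vec![0.3; 4], probability: 0.92 },
        ScoredChunk { samples: vec![0.1; 4], probability: 0.40 },
    ];

    // The threshold is an assumed value; a real application would tune it.
    for chunk in filter_by_voice_activity(chunks, 0.5) {
        println!(
            "kept chunk with {} samples (p = {})",
            chunk.samples.len(),
            chunk.probability
        );
    }
}
```

In Kalosm itself the same step operates on an asynchronous stream rather than a `Vec`, so treat this only as a picture of the data flow.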

Voice Activity Detection

Voice activity detection (VAD) models detect when someone is speaking in an audio stream. The simplest way to use a VAD model is to create an audio stream and call [VoiceActivityDetectorExt::voice_activity_stream], which yields audio chunks along with the probability that each one contains speech:

use kalosm::sound::*;
#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream();
    // Detect voice activity in the audio stream
    let mut vad = stream.voice_activity_stream();
    while let Some(input) = vad.next().await {
        println!("Probability: {}", input.probability);
    }
}

Kalosm also provides [VoiceActivityStreamExt::rechunk_voice_activity] to collect chunks of consecutive audio samples with a high VAD probability. This is useful for applications like speech recognition, where context between consecutive audio samples matters.

use kalosm::sound::*;
use rodio::Source;
#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream();
    // Chunk the audio into chunks of speech
    let vad = stream.voice_activity_stream();
    let mut audio_chunks = vad.rechunk_voice_activity();
    // Print the chunks as they are streamed in
    while let Some(input) = audio_chunks.next().await {
        println!("New voice activity chunk with duration {:?}", input.total_duration());
    }
}
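
The grouping that rechunking performs can be pictured with a small, standard-library-only sketch. This is a simplified illustration under stated assumptions, not Kalosm's actual algorithm (which may smooth probabilities or apply time-based thresholds): `Frame`, `rechunk`, and the `0.5` threshold are hypothetical.

```rust
/// Hypothetical frame: a short run of samples with a VAD speech probability.
/// This is NOT a Kalosm type.
struct Frame {
    samples: Vec<f32>,
    probability: f32,
}

/// Merge each run of consecutive frames with `probability >= threshold`
/// into a single chunk of samples. Frames below the threshold end the run.
fn rechunk(frames: &[Frame], threshold: f32) -> Vec<Vec<f32>> {
    let mut chunks = Vec::new();
    let mut current: Vec<f32> = Vec::new();

    for frame in frames {
        if frame.probability >= threshold {
            // Still inside a speech run: keep accumulating samples.
            current.extend_from_slice(&frame.samples);
        } else if !current.is_empty() {
            // Speech run just ended: emit the accumulated chunk.
            chunks.push(std::mem::take(&mut current));
        }
    }
    // Emit a trailing chunk if the input ended mid-speech.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Assumed probabilities for five two-sample frames.
    let probabilities = [0.1, 0.9, 0.8, 0.2, 0.7];
    let frames: Vec<Frame> = probabilities
        .iter()
        .map(|&p| Frame { samples: vec![p; 2], probability: p })
        .collect();

    for (i, chunk) in rechunk(&frames, 0.5).iter().enumerate() {
        println!("chunk {i}: {} samples", chunk.len());
    }
}
```

Merging adjacent speech frames before transcription means the recognizer sees whole utterances rather than isolated fragments, which is the context benefit described above.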

Transcription

You can use the [Whisper] model to transcribe audio into text. Kalosm can transcribe any [AsyncSource] into a transcription stream with the [AsyncSourceTranscribeExt::transcribe] method:

use kalosm::sound::*;
#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream();
    // Transcribe the audio into text with the default Whisper model
    let mut transcribe = stream.transcribe(Whisper::new().await.unwrap());
    // Print the text as it is streamed in
    transcribe.to_std_out().await.unwrap();
}