Kalosm Language

Language processing utilities for the Kalosm framework.

The language part of Kalosm has a few core parts:

Models: Text generation and embedding models
Context: Document collection, format support, search and chunking
Integrations: SurrealDB, Serper, and other integrations

Text Generation Models

Model and ModelExt are the core traits for text generation models. Any model that implements these traits can be used with Kalosm.

The simplest way to use a model is to create a Llama model and call complete on it:

use kalosm::language::*;
#[tokio::main]
async fn main() {
    let mut llm = Llama::new().await.unwrap();
    let prompt = "The following is a 300 word essay about why the capital of France is Paris:";
    print!("{prompt}");
    // Any model that implements the [`TextCompletionModel`] trait can be used to stream text
    let mut stream = llm.complete(prompt);
    // You can then use the stream however you need. to_std_out will print the text to the console as it is generated
    stream.to_std_out().await.unwrap();
}

Tasks

You can define a Task with a description, then run it with an input. The task caches the description to make repeated calls faster. Tasks work with chat models.

use kalosm::language::*;
#[tokio::main]
async fn main() {
    // Create a new model
    let model = Llama::new_chat().await.unwrap();
    // Create a new task that summarizes text
    let task = model.task("You take a long description and summarize it into a single short sentence");
    let mut output = task(&"You can define a Task with a description, then run it with an input. The task caches the description to make repeated calls faster. Tasks work with chat models.");
    // Then stream the output to the console
    output.to_std_out().await.unwrap();
}

Structured Generation

Structured generation gives you more control over the output of the text generation. You can derive a parser for your data to easily get structured data out of an LLM:

use kalosm::language::*;
#[derive(Parse, Clone)]
struct Pet {
    name: String,
    age: u32,
    description: String,
}

Then you can generate text that works with the parser in a Task:

use kalosm::language::*;
use std::sync::Arc;
#[derive(Parse, Debug, Clone)]
struct Pet {
    name: String,
    age: u32,
    description: String,
}

#[tokio::main]
async fn main() {
    // First create a model. Chat models tend to work best with structured generation
    let model = Llama::new_chat().await.unwrap();
    // Then create a parser for your data. Any type that implements the `Parse` trait has the `new_parser` method
    let parser = Arc::new(Pet::new_parser());
    // Then create a task with the parser as constraints
    let task = model.task("You generate realistic JSON placeholders")
        .with_constraints(parser);
    // Finally, run the task
    let pet: Pet = task(&"Generate a pet in the form {\"name\": \"Pet name\", \"age\": 0, \"description\": \"Pet description\"}").await.unwrap();
    println!("{pet:?}");
}

Embedding Models

Embedder and EmbedderExt are the core traits for text embedding models. Any model that implements these traits can be used with Kalosm.

The simplest way to use an embedding model is to create a Bert model and call embed on it. The Embedding you get back represents the meaning of the text as a vector of numbers:

use kalosm::language::*;
#[tokio::main]
async fn main() {
    // First create a model. Bert::new() is a good default embedding model for general tasks
    let model = Bert::new().await.unwrap();
    // Then embed some text into the vector space
    let first_embedding = model.embed("Kalosm is a library for building AI applications").await.unwrap();
    // And some more text
    let second_embedding = model.embed(prompt_input("Text: ").unwrap()).await.unwrap();
    // You can compare the cosine similarity of the two embeddings to see how similar they are
    println!("cosine similarity: {}", first_embedding.cosine_similarity(&second_embedding));
}
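
For intuition about what that comparison measures, here is a minimal sketch of cosine similarity over plain f32 slices. It is not Kalosm's implementation: the free function cosine_similarity and the three-dimensional vectors are assumptions made for illustration, and real embeddings have far more dimensions.

```rust
/// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
/// Returns None if the lengths differ or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Hypothetical tiny "embeddings" standing in for real model output
    let same_direction = cosine_similarity(&[1.0, 2.0, 3.0], &[2.0, 4.0, 6.0]);
    let orthogonal = cosine_similarity(&[1.0, 0.0, 0.0], &[0.0, 1.0, 0.0]);
    println!("same direction: {same_direction:?}");
    println!("orthogonal: {orthogonal:?}");
}
```

Vectors that point the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score close to 0, which is why the measure suits comparing the meaning of texts of different sizes.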

Context

Gathering context is a key part of building LLM applications. Providing the right context to the model makes the output more relevant and useful. It can also help to prevent hallucinations.

Kalosm provides tools to generate, gather, and process context from a variety of sources.

Gathering context

Kalosm provides utilities for collecting context from a variety of sources:

Local files (.txt, .md, .html, .docx, .pdf)
RSS feeds
Websites
Search engines
Microphone input and audio input through Whisper transcriptions

Each of these sources implements either IntoDocument or IntoDocuments to convert the data into a Document with the contents and metadata about the document.

use kalosm::language::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Try to extract an article from a URL
    let page = Url::parse("https://www.nytimes.com/live/2023/09/21/world/zelensky-russia-ukraine-news")?;
    let document = page.into_document().await?;
    println!("Title: {}", document.title());
    println!("Body: {}", document.body());

    Ok(())
}

Chunking context

After you have gathered context, it is often useful to chunk it into smaller pieces for search. Kalosm provides utilities for chunking context into documents, sentences, paragraphs, or semantic chunks. Kalosm will embed each chunk as it splits the document into smaller pieces. One of the most powerful chunkers is the semantic chunker, which lets you split documents into semantically similar chunks without explicitly setting the size of the chunks:

use kalosm::language::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // First, create an embedding model for semantic chunking
    let model = Bert::new().await?;
    // Then create a document folder with some documents
    let documents = DocumentFolder::new("./documents")?.into_documents().await?;
    // Then chunk the documents into semantically similar chunks
    let chunked = SemanticChunker::new().chunk_batch(&documents, &model).await?;
    println!("{:?}", chunked);
    Ok(())
}
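
To make the simpler fixed strategies concrete, here is a minimal sketch of paragraph and sentence chunking using only the standard library. The functions paragraphs and sentences are hypothetical helpers written for this illustration, not Kalosm's chunkers, and the sentence rule is deliberately naive (it would split after an abbreviation such as "Dr.").

```rust
/// Split text into paragraphs on blank lines, dropping empty pieces.
fn paragraphs(text: &str) -> Vec<&str> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}

/// Naive sentence splitter: cuts after every '.', '!' or '?'.
fn sentences(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    let text = "Kalosm is a library. It runs locally!\n\nDoes it chunk?";
    // Each paragraph is further split into sentence-sized chunks
    for paragraph in paragraphs(text) {
        println!("paragraph: {paragraph:?}");
        for sentence in sentences(paragraph) {
            println!("  sentence: {sentence:?}");
        }
    }
}
```

Fixed rules like these are cheap but ignore meaning. A semantic chunker instead uses embeddings to decide where one topic ends and the next begins.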

Embedding-powered search

After you have chunked your context, you can use the embeddings for search or retrieval-augmented generation. Embedding-based search lets you find documents that are semantically similar to a word or phrase even when none of the words match exactly:

use kalosm::language::*;
use surrealdb::{engine::local::SurrealKv, Surreal};

#[tokio::main]
async fn main() {
    // Create database connection
    let db = Surreal::new::<SurrealKv>(std::env::temp_dir().join("temp.db")).await.unwrap();

    // Select a specific namespace / database
    db.use_ns("search").use_db("documents").await.unwrap();

    // Create a table in the surreal database to store the embeddings
    let document_table = db
        .document_table_builder("documents")
        .build::<Document>()
        .await
        .unwrap();

    // Add documents to the database
    document_table.add_context(DocumentFolder::new("./documents").unwrap()).await.unwrap();

    loop {
        // Get the user's question
        let user_question = prompt_input("Query: ").unwrap();

        let nearest_5 = document_table
            .search(user_question)
            .with_results(5)
            .await
            .unwrap();

        println!("{:?}", nearest_5);
    }
}
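
Conceptually, the search above ranks stored chunks by how close their embeddings are to the embedding of the query. The following is a minimal brute-force sketch of that idea in plain Rust. It makes no claim about how SurrealDB or Kalosm index vectors internally; the functions cosine and nearest and the two-dimensional embeddings are assumptions made for illustration.

```rust
/// Cosine similarity; returns 0.0 when either vector has zero magnitude.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the `k` entries most similar to `query`, best match first.
fn nearest<'a>(query: &[f32], entries: &[(&'a str, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = entries
        .iter()
        .map(|(text, embedding)| (*text, cosine(query, embedding)))
        .collect();
    // Sort by descending similarity; total_cmp gives a total order over f32
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    // Hypothetical 2-dimensional embeddings standing in for real model output
    let entries: Vec<(&str, Vec<f32>)> = vec![
        ("cats", vec![1.0, 0.0]),
        ("kittens", vec![0.9, 0.1]),
        ("tax law", vec![0.0, 1.0]),
    ];
    for (text, score) in nearest(&[1.0, 0.05], &entries, 2) {
        println!("{text}: {score:.3}");
    }
}
```

Scanning every entry is fine for a handful of documents. Larger collections usually rely on a vector index so that a query does not need to be compared against every stored embedding.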