58bc60c3 created on December 23, 2025 | Commit history
File | Last commit | Last updated
Llama fusor v2 (#413) * add dim trait * add repeat * expand index * llama runs with fusor * more assertions * failing rope test * failing cat test * 5d index * fix visit_tiled with large z dims * fix assertion for zero sized tensors * fix sgemv batch * fix cache test and attention layer dequant * 3x faster * bench rope and attention * fused rope kernel * flash attention * better benchmarks * more benches * fix formatting * remove useless shared arrays * use shape bindings instead of separate shape inputs * remove bound check * pull out re-used exp * optimized flash attention * unroll reduction * fix clippy * vectorized * bench one seq len * handle non-contiguous inputs * failing flash attention test * fix flash attention * add optional mask support * integrate fused attention into llama * add fuzzing test * reformatting * block based loading * fix flash attention when the query and kv lengths don't match * used fused rope * normal fused rope * optimize repeat_kv * optimize mask cache * integrate mqa into flash attention * remove block on * fix rope freq rank * 16 t/s * no tiling * more f16 fixes * use f16 activations * more f16 stability fixes * fix q4k type * slightly faster * fix formatting * remove old examples * fix kalosm * restore vision adapter * use conv3d * fix loading the 3d conv * fix clippy * more clippy fixes * nary kernel * clean up some of the warnings * simplify nary optimization * fix clippy * use graph rewrites instead of visiting * worklist * simpler optimizations * clean up some unused code * fuse nary into reduce/matmul/index/etc * remove session serialization * fix formatting * allow changing the activation type * fix the default type * fix clippy and formatting * fix doc tests * reuse the same device for all tests * fix gemv on cuda * start merging into nary * fuse index select * fix infer example * fix matmul fusion * fix formatting * fix clippy | 4 months ago
Remote chat, remote structured generation models, and single file gguf chat model loading (#319) * add chat template support and remove the VectorSpace trait * move sampling and chat templates to kalosm llama * update kalosm-llama unstructured generation to the new interface * restore structured generation module * Restore llama implementation of structured generation * clean up kalosm-llama clippy lints * restore llama chat and structured chat implementation * improve infer chat example * add support for remote chat models * support constraints for openai remote models * load the tokenizer from the gguf file if a huggingface tokenizer is not present * Fix tokenizer conversion * restore chat struct * Fix chat implementation with llama * remove tokio from language model * Create chat and text completion extension traits * add task helper to the chat extension trait * update kalosm-language to new task interface * make llama callable * add with_constraints method to task * fix task example * update examples to new chat and task api * set tools to none to fix llama chat template * Add helpers for the default parser for a specific type and model combo * simplify constrained rust type example * restore prompt annealing * fix structured example * document text completion model * document new chat api * update task documentation * Fix tokenizer gguf * fix custom llama source example * fix remaining tests * add logging to remote examples * Clippy fixes * More clippy fixes * use function call in docs more constantly * fix remaining doc tests | 1 year ago
Add support for Qwen 2.5 Vision (#382) * implement qwen vision embed and patch merger * implement qwen vision block * calculate the rope index of images and videos * add get_window_index * fix get window index * unwrap less * Create media source api * integrate the new media support into the language model trait * Create QwenVisionTransformer * implement QwenVisionTransformer::forward * fix formatting * fix loading qwen 2.5 vl * fix rot_pos_emb * add image preprocessing utilities * fix vision rope * fix mask * Fix feed forward * qwen vision forward working * unwrap less * clean up * create tensor tools cli * fix cli * fix fuse tokenizer * move parse into its own module * Use llama.cpp compatible tensor names * add preset * load qwen vision metadata from the gguf file * fix loading the vision encoder * test process image * forward eps and add more tests * fix image processing * implement image chat templating * full pipeline running * fix formatting * use 3d rope index * fix dimension_sections decoding * qwen vl rope working * remove logs * fix rope tests * fix rope size * fix rope index to tensor conversion * Fix rope updates * normalize image input * match image resize behavior * fix fullatt_block calculation * vision model works * remove logs * add more qwen vl presets * fix some clippy lints * fix clippy * Fix ToChatMessage * expose image processing hints * remove unwraps * fix unwraps in tests * fix more examples | 11 months ago