| Llama fusor v2 (#413)
* add dim trait
* add repeat
* expand index
* llama runs with fusor
* more assertions
* failing rope test
* failing cat test
* 5d index
* fix visit_tiled with large z dims
* fix assertion for zero sized tensors
* fix sgemv batch
* fix cache test and attention layer dequant
* 3x faster
* bench rope and attention
* fused rope kernel
* flash attention
* better benchmarks
* more benches
* fix formatting
* remove useless shared arrays
* use shape bindings instead of separate shape inputs
* remove bound check
* pull out re-used exp
* optimized flash attention
* unroll reduction
* fix clippy
* vectorized
* bench one seq len
* handle non-contiguous inputs
* failing flash attention test
* fix flash attention
* add optional mask support
* integrate fused attention into llama
* add fuzzing test
* reformatting
* block based loading
* fix flash attention when the query and kv lengths don't match
* used fused rope
* normal fused rope
* optimize repeat_kv
* optimize mask cache
* integrate mqa into flash attention
* remove block on
* fix rope freq rank
* 16 t/s
* no tiling
* more f16 fixes
* use f16 activations
* more f16 stability fixes
* fix q4k type
* slightly faster
* fix formatting
* remove old examples
* fix kalosm
* restore vision adapter
* use conv3d
* fix loading the 3d conv
* fix clippy
* more clippy fixes
* nary kernel
* clean up some of the warnings
* simplify nary optimization
* fix clippy
* use graph rewrites instead of visiting
* worklist
* simpler optimizations
* clean up some unused code
* fuse nary into reduce/matmul/index/etc
* remove session serialization
* fix formatting
* allow changing the activation type
* fix the default type
* fix clippy and formatting
* fix doc tests
* reuse the same device for all tests
* fix gemv on cuda
* start merging into nary
* fuse index select
* fix infer example
* fix matmul fusion
* fix formatting
* fix clippy | 4 months ago |
| Remote chat, remote structured generation models, and single file gguf chat model loading (#319)
* add chat template support and remove the VectorSpace trait
* move sampling and chat templates to kalosm llama
* update kalosm-llama unstructured generation to the new interface
* restore structured generation module
* Restore llama implementation of structured generation
* clean up kalosm-llama clippy lints
* restore llama chat and structured chat implementation
* improve infer chat example
* add support for remote chat models
* support constraints for openai remote models
* load the tokenizer from the gguf file if a huggingface tokenizer is not present
* Fix tokenizer conversion
* restore chat struct
* Fix chat implementation with llama
* remove tokio from language model
* Create chat and text completion extension traits
* add task helper to the chat extension trait
* update kalosm-language to new task interface
* make llama callable
* add with_constraints method to task
* fix task example
* update examples to new chat and task api
* set tools to none to fix llama chat template
* Add helpers for the default parser for a specific type and model combo
* simplify constrained rust type example
* restore prompt annealing
* fix structured example
* document text completion model
* document new chat api
* update task documentation
* Fix tokenizer gguf
* fix custom llama source example
* fix remaining tests
* add logging to remote examples
* Clippy fixes
* More clippy fixes
* use function call in docs more constantly
* fix remaining doc tests | 1 year ago |
| Add support for Qwen 2.5 Vision (#382)
* implement qwen vision embed and patch merger
* implement qwen vision block
* calculate the rope index of images and videos
* add get_window_index
* fix get window index
* unwrap less
* Create media source api
* integrate the new media support into the language model trait
* Create QwenVisionTransformer
* implement QwenVisionTransformer::forward
* fix formatting
* fix loading qwen 2.5 vl
* fix rot_pos_emb
* add image preprocessing utilities
* fix vision rope
* fix mask
* Fix feed forward
* qwen vision forward working
* unwrap less
* clean up
* create tensor tools cli
* fix cli
* fix fuse tokenizer
* move parse into its own module
* Use llama.cpp compatible tensor names
* add preset
* load qwen vision metadata from the gguf file
* fix loading the vision encoder
* test process image
* forward eps and add more tests
* fix image processing
* implement image chat templating
* full pipeline running
* fix formatting
* use 3d rope index
* fix dimension_sections decoding
* qwen vl rope working
* remove logs
* fix rope tests
* fix rope size
* fix rope index to tensor conversion
* Fix rope updates
* normalize image input
* match image resize behavior
* fix fullatt_block calculation
* vision model works
* remove logs
* add more qwen vl presets
* fix some clippy lints
* fix clippy
* Fix ToChatMessage
* expose image processing hints
* remove unwraps
* fix unwraps in tests
* fix more examples | 11 months ago |