| Optimize fusor (#393)
* rename fusor core
* create fusor error type
* try into for gguf value
* remove candle
* create index select kernel
* refactor quantized implementation
* add dequantize kernel template
* add where cond
* fuse dequantize and visit tiled ops
* fix hello example
* Automatically spawn polling thread
* llama port compiles
* matmul almost working
* fuzzing small matrices passes
* fix fuzz matmul test
* rope test passing
* fix metadata loading
* Fix shape calculation for index select
* fix qmatmul shape
* Fix shape calculation for mat mul
* remove some logging
* fix compute graph deadlock
* batched matrix multiplication
* fix cache
* fix attention mask
* building the compute graph works
* fix graphviz
* remove recursion from resolve
* handle > 3 dimensions in map tiled
* make timing info optional
* stable wgpu
* fx hash
* Fix passes after nodes are garbage collected
* remove log
* add an extra_assertions flag
* add a sleep to the device poll loop
* Fix more merging bugs
* fix cycles
* model runs without panicking
* fix qmatmul output buffer size
* add type assertions
* Fix rectangular qmatmul
* tokens generating
* fix softmax test
* fix index select on large arrays
* fix rms norm
* matmul failing
* use fused multiply add in the matmul kernel
* add strides to matmul
* handle non-contiguous tensors in qmatmul
* Fix attention mechanism. It works!!!
* fix timing queries
* more graphviz fixes
* fuse matmul kernels
* clippy fix
* Fix kernel fusion
* refactor mid level representation
* just remove queries
* more benchmarks
* more candle benchmarks
* create Operation trait
* fix tests
* add device to the workgroup size constraints function
* implement operation for reduce
* fix some lints
* implement operation for resize
* remove check_bounds_contiguous
* add index select to MIR
* clean up formatting
* remove log
* remove more logs
* bench larger inputs
* add dequantize to MIR
* clean up
* MIR for qmatmul
* add matmul to MIR
* fix formatting
* queue all operations before running anything
* fix tests
* fix cached tensors
* simplify dependencies
* fuse multiple unrelated kernels
* rename values
* fix merging
* linearize size for reduce
* remove logs
* move output into a separate method
* tests passing
* merge adjacent non-related kernels without synchronization
* remove visit
* disable merging and bump wgpu
* use pipeline cache
* cache most compilation steps
* Fix bench dependencies
* move the caches to the device
* fix tensor PartialEq
* memory coalescing in visit_tiled
* remove log
* More consistent performance
* re-enable non-conflicting merges
* add kernel name for debugging
* cargo update
* faster builds for infer example
* double tokens per second
* only materialize every other layer
* more detailed pairwise name
* fix pairwise bench
* add a many dimensional pairwise benchmark
* skip empty dimensions in tiled map
* scale tile size down as the rank scales up
* materialize every layer
* add support for custom operations
* better round up method
* faster reduce kernel
* unroll reduction in softmax kernel
* vectorized softmax load
* add a separate case for large softmax
* implement the same special case for reduce
* custom operations are sync
* don't repeat dequantize
* label everything
* cache dequantize rms norm
* where cond custom opt
* bench larger qmatmul
* split out sgemv variant
* initial attempt at sgemv
* match braces
* fix dispatching
* tests passing
* use subgroupAdd function
* clean up imports
* faster sgemv
* slightly faster sgemv
* simd sgemv
* add unrolled dequantize variants
* 70% faster sgemv
* split chunk size and vector size
* implement vectorized sgemm
* add more qmatmul benchmarks
* skip second sum pass if this is a single subgroup
* pull out dequantize_vec4_block
* test and fix vec4 dequant
* more optimized q6k dequantize
* use the same pattern for unrolled
* specialized vec4 q6k dequant
* add more qmatmul fuzzing tests
* longer fuzzing
* fix dequantize q6k
* Fix fuzz_de_quantize vec4 test
* specialized dequantize_vec4_block q4k implementation
* restore multi-operation fusion
* more flexible sgemv kernel
* interleaved blocks
* fix sgemv
* ignore tokenizer.json
* slightly cleaner q6k dequantize
* specialized q6k sgemv kernel
* remove log
* add a link to the llama.cpp kernel
* make q6k work with multiple rows at once
* make preloading optional in q6k gemv
* specialized q4k gemv implementation
* first value correct
* simplify scale calculation
* fix q4k
* remove log
* slightly faster
* specialized q_n gemv
* add q5_0
* fix llama.cpp link
* cache downloads for tests
* cache qmatmul bench file
* faster q_n kernels
* add specialized q8_0 gemv kernel
* double dispatch size
* bump dependencies and move closer to wasm compat
* fix compilation
* disable zero initialization
* same configuration for tests
* slightly faster Q4k
* explicit vectorization
* unroll loops
* fix kalosm llama
* fix dispatch size
* faster q8_0 gemv
* refactor matmul impl
* vectorized sgemm multiply
* revert changes to kalosm-llama
* undo kalosm-language cargo.toml changes
* restore ocr changes
* fix formatting
* fix tokenizers
* clippy fix
* fix dependencies
* fix clippy and formatting
* fix formatting
* fix clippy
* fix tests | 9 months ago |