58bc60c3 created on December 23, 2025 | Commit history
File | Last commit message | Last updated
Optimize fusor (#393)

* rename fusor core
* create fusor error type
* try into for gguf value
* remove candle
* create index select kernel
* refactor quantized implementation
* add dequanitze kernel template
* add where cond
* fuse dequantize and visit tiled ops
* fix hello example
* Automatically spawn polling thread
* llama port compiles
* matmul almost working
* fuzzing small matrixes passes
* fix fuzz matmul test
* rope test passing
* fix metadata loading
* Fix shape calculation for index select
* fix qmatmul shape
* Fix shape calculation for mat mul
* remove some logging
* fix compute graph deadlock
* batched matrix multiplication
* fix cache
* fix attention mask
* building the compute graph works
* fix graphvis
* remove recursion from resolve
* handle > 3 dimensions in map tiled
* make timing info optional
* stable wgpu
* fx hash
* Fix passes after nodes are garbage collected
* remove log
* add a extra_assertions flag
* add a sleep to the device poll loop
* Fix more merging bugs
* fix cycles
* model runs without panicking
* fix qmatmul output buffer size
* add type assertions
* Fix rectangular qmatmul
* tokens generating
* fix softmax test
* fix index select on large arrays
* fix rms norm
* matmul failing
* use fused multiply add in the matmul kernel
* add strides to matmul
* handle non-contiguous tensors in qmatmul
* Fix attention mechanism. It works!!!
* fix timing queries
* more graphviz fixes
* fuse matmul kernels
* clippy fix
* Fix kernel fusion
* refactor mid level representation
* just remove queries
* more benchmarks
* more candle benchmarks
* create Operation trait
* fix tests
* add device to the workgroup size constraints function
* implement operation for reduce
* fix some lints
* implement operation for resize
* remove check_bounds_contiguous
* add index select to MIR
* clean up formatting
* remove log
* remove more logs
* bench larger inputs
* add dequantize to MIR
* clean up
* MIR for qmatmul
* add matmul to MIR
* fix formatting
* queue all operations before running anything
* fix tests
* fix cached tensors
* simplify dependencies
* fuse multiple unrelated kernels
* rename values
* fix merging
* linearize size for reduce
* remove logs
* move output into a separate method
* tests passing
* merge adjacent non-related kernels without synchronization
* remove visit
* disable merging and bump wgpu
* use pipeline cache
* cache most compilation steps
* Fix bench dependencies
* move the caches to the device
* fix tensor partialeq
* memory coalescing in visit_tiled
* remove log
* More consistent performance
* re-enable non-conflicting merges
* add kernel name for debugging
* cargo update
* faster builds for infer example
* double tokens per second
* only materialize every other layer
* more detailed pair wise name
* fix pairwise bench
* add a many dimensional pairwise benchmark
* skip empty dimensions in tiled map
* scale tile size down as the rank scales up
* materialize every layer
* add support for custom operations
* better round up method
* faster reduce kernel
* unroll reduction in softmax kernel
* vectorized softmax load
* add a separate case for large softmax
* implement the same special case for reduce
* custom operations are sync
* don't repeat dequantize
* label everything
* cache dequantize rms norm
* where cond custom opt
* bench larger qmatmul
* split out sgemv variant
* initial attempt at sgemv
* match braces
* fix dispatching
* tests passing
* use subgroupAdd function
* clean up imports
* faster sgemv
* slightly faster sgemv
* simd sgemv
* add unrolled dequantize variants
* 70% faster sgemv
* split chunk size and vector size
* implement vectorized sgemm
* add more qmatmul benchmarks
* skip second sum pass if this is a single subgroup
* pull out dequantize_vec4_block
* test and fix vec4 dequant
* more optimized q6k dequantize
* use the same pattern for unrolled
* specialized vec4 q6k dequant
* add more qmatmul fuzzing tests
* longer fuzzing
* fix dequantize q6k
* Fix fuzz_de_quantize vec4 test
* specialized dequantize_vec4_block q4k implementation
* restore multi-operation fusion
* more flexable sgemv kernel
* interleaved blocks
* fix sgemv
* ignore tokenizer.json
* slightly cleaner q6k dequantize
* specialized q6k sgemv kernel
* remove log
* add a link to the llama.cpp kernel
* make q6k work with multiple rows at once
* make preloading optional in q6k gemv
* specialized q4k gemv implementation
* first value correct
* simplify scale calculation
* fix q4k
* remove log
* slightly faster
* specialized q_n gemv
* add q5_0
* fix llama.cpp link
* cache downloads for tests
* cache qmatmul bench file
* faster q_n kernels
* add specialized q8_0 gemv kernel
* double dispatch size
* bump dependencies and move closer to wasm compat
* fix compilation
* disable zero initialization
* same configuration for tests
* slightly faster Q4k
* explicit vectorization
* unroll loops
* fix kalosm llama
* fix dispatch size
* faster q8_0 gmv
* refactor matmul impl
* vectorized sgemm multiply
* revert changes to kalosm-llama
* undo kalosm-language cargo.toml changes
* restore ocr changes
* fix formatting
* fix tokenizers
* clippy fix
* fix dependencies
* fix clippy and formatting
* fix formatting
* fix clippy
* fix tests

9 months ago