kalosm/fusor-ml/core/benches (branch: main)
Latest commit: Unified conformance test library (#434), eed48f54, committed by GitHub, created 28 days ago
File
Last commit
Last updated
elementwise.rs
Optimize fusor (#393) * rename fusor core * create fusor error type * try into for gguf value * remove candle * create index select kernel * refactor quantized implementation * add dequanitze kernel template * add where cond * fuse dequantize and visit tiled ops * fix hello example * Automatically spawn polling thread * llama port compiles * matmul almost working * fuzzing small matrixes passes * fix fuzz matmul test * rope test passing * fix metadata loading * Fix shape calculation for index select * fix qmatmul shape * Fix shape calculation for mat mul * remove some logging * fix compute graph deadlock * batched matrix multiplication * fix cache * fix attention mask * building the compute graph works * fix graphvis * remove recursion from resolve * handle > 3 dimensions in map tiled * make timing info optional * stable wgpu * fx hash * Fix passes after nodes are garbage collected * remove log * add a extra_assertions flag * add a sleep to the device poll loop * Fix more merging bugs * fix cycles * model runs without panicking * fix qmatmul output buffer size * add type assertions * Fix rectangular qmatmul * tokens generating * fix softmax test * fix index select on large arrays * fix rms norm * matmul failing * use fused multiply add in the matmul kernel * add strides to matmul * handle non-contiguous tensors in qmatmul * Fix attention mechanism. It works!!! 
* fix timing queries * more graphviz fixes * fuse matmul kernels * clippy fix * Fix kernel fusion * refactor mid level representation * just remove queries * more benchmarks * more candle benchmarks * create Operation trait * fix tests * add device to the workgroup size constraints function * implement operation for reduce * fix some lints * implement operation for resize * remove check_bounds_contiguous * add index select to MIR * clean up formatting * remove log * remove more logs * bench larger inputs * add dequantize to MIR * clean up * MIR for qmatmul * add matmul to MIR * fix formatting * queue all operations before running anything * fix tests * fix cached tensors * simplify dependencies * fuse multiple unrelated kernels * rename values * fix merging * linearize size for reduce * remove logs * move output into a separate method * tests passing * merge adjacent non-related kernels without synchronization * remove visit * disable merging and bump wgpu * use pipeline cache * cache most compilation steps * Fix bench dependencies * move the caches to the device * fix tensor partialeq * memory coalescing in visit_tiled * remove log * More consistent performance * re-enable non-conflicting merges * add kernel name for debugging * cargo update * faster builds for infer example * double tokens per second * only materialize every other layer * more detailed pair wise name * fix pairwise bench * add a many dimensional pairwise benchmark * skip empty dimensions in tiled map * scale tile size down as the rank scales up * materialize every layer * add support for custom operations * better round up method * faster reduce kernel * unroll reduction in softmax kernel * vectorized softmax load * add a separate case for large softmax * implement the same special case for reduce * custom operations are sync * don't repeat dequantize * label everything * cache dequantize rms norm * where cond custom opt * bench larger qmatmul * split out sgemv variant * initial attempt at sgemv 
* match braces * fix dispatching * tests passing * use subgroupAdd function * clean up imports * faster sgemv * slightly faster sgemv * simd sgemv * add unrolled dequantize variants * 70% faster sgemv * split chunk size and vector size * implement vectorized sgemm * add more qmatmul benchmarks * skip second sum pass if this is a single subgroup * pull out dequantize_vec4_block * test and fix vec4 dequant * more optimized q6k dequantize * use the same pattern for unrolled * specialized vec4 q6k dequant * add more qmatmul fuzzing tests * longer fuzzing * fix dequantize q6k * Fix fuzz_de_quantize vec4 test * specialized dequantize_vec4_block q4k implementation * restore multi-operation fusion * more flexable sgemv kernel * interleaved blocks * fix sgemv * ignore tokenizer.json * slightly cleaner q6k dequantize * specialized q6k sgemv kernel * remove log * add a link to the llama.cpp kernel * make q6k work with multiple rows at once * make preloading optional in q6k gemv * specialized q4k gemv implementation * first value correct * simplify scale calculation * fix q4k * remove log * slightly faster * specialized q_n gemv * add q5_0 * fix llama.cpp link * cache downloads for tests * cache qmatmul bench file * faster q_n kernels * add specialized q8_0 gemv kernel * double dispatch size * bump dependencies and move closer to wasm compat * fix compilation * disable zero initialization * same configuration for tests * slightly faster Q4k * explicit vectorization * unroll loops * fix kalosm llama * fix dispatch size * faster q8_0 gmv * refactor matmul impl * vectorized sgemm multiply * revert changes to kalosm-llama * undo kalosm-language cargo.toml changes * restore ocr changes * fix formatting * fix tokenizers * clippy fix * fix dependencies * fix clippy and formatting * fix formatting * fix clippy * fix tests
9 months ago
fused.rs
Optimize fusor (#393)
9 months ago
matmul.rs
Implement whisper with fusor (#405) * sliding window view * implement conv and pool * add zeros function * more layers * faster cache * remove candle dependency from whisper * Port loading logic * start conversion * more progress converting * add casting from u32 tensors * add variance and mean functions * Initial whisper implementation working * use rustfft instead of manually computing the fft * re-use some allocations * fix bert * clean up some warnings * fix max seq len * delete floneum * clean up remains of unquantized variant * fix whisper * faster final layer whisper * use sgemv for small n values * more matmul bench shapes * larger n limit * smarter resize lowering * limit the buffer re-use cache * allow larger allocations * fix embedding tests * add a bunch more sources * fix formatting * fix clippy * fix cargo check * restore transcribe file * fix doc examples
6 months ago
pairwise.rs
Optimize fusor (#393)
9 months ago
qmatmul.rs
Benchmark fusor changes (#395) * benchmark fusor changes * fix branchname * bench false workspace * only benchmark core * bench only qmatmul * add cache * all benchmarks * better names * fix clippy * more clippy fixes
7 months ago
reduce.rs
Implement Bert with fusor (#399) * add squeeze op * add dims variants of squeeze and unsqueeze * more broadcast ops * fix unsqueeze * add more flatten methods * fix formatting and clippy * use the same traits for reduce kernels * fix tests * add varbuilder * add relu and gelu * add reduce_keepdim * start porting bert * more progress porting bert * generalize layer norm * fix weight names * fix batched matmul * compute graph builds * model runs * add batch to the qmatmul fuzzing test * partially fix qmatmul batch * fix test_fuzz_q_mat_mul_q8_0 batch test * fix batch with sgemv * fix test_fuzz_q_mat_mul_q8_0 * fix workgroup solver * limit number of inputs by the device limits * model running * fix workgroup test * fix gelu * remove some logs * remove more logs * fix bert encoder * add more dims to the batched matmul fuzzing test * partially working * disable batches * remove some assertions * add more sparse tensor op tests * add failing matmul test * fix test structure * fix transposed matmul test * remove logs * fix broadcast semantics * fix dequantize if the datatypes already match * remove materialize * almost wasm compatable * wasm compatable * fix native compilation * fix tests * running in chrome! 
* add quantized snowflake source * add dispatch visit * remove dispatch * reduce allocations * load larger chunks at a time for qk6 * cooprative loading q6k gemm * Fix q6k gemm on web * Fix compilation * re-use allocations * fix gc for custom nodes * add more docs * Fix clippy for claude model * fix clippy * fix type inference * more mat4x4 dequantized implementations * use block quantized for all implemented quantizaiton types * better q4k dequantize_4x4_block * more q4k, q5, and q8 improvements * better q4_0 * faster q6k * simplify dequantize_vec4_block * generalize qmatmul params * simplify tests and checks * allow batches in sgemv kernel * Fix transposed sgemv * fix lint * Fix formatting * Fix lint * more formatting fixes * fix docs * lower resize operations * add compute graph hash * Update .gitignore * tensor allocation cache * fix most buffer re-use issues * fix quantized test * fix writing to cached tensors * fix bert in chrome * faster chunked qsgemm * transpose workgroup cache * better memory access patern for non-quantized tensor * more flexible qsgemm kernel * simplify pair_index_row * fix non 8x8 outputs * fix dispatch * clean up write_acc_back * maybe subgroup tiling * fix subgroup indexing * fix indexing into second b chunk * don't transpose a blocks * tighter bounds * fix cache_b loading * single subgroup tile working * same indexing loops * first thread works with subgroup swizzle * slightly looser bounds * pull out thread indexes * fix lint * cast after tile matmul * move subgroup shuffle loading to a flag * tweak default config * more quantized bert sources * unroll just the thread loop * add linear benchmark * wip: bench components of bert * use sgemv for some matrix matrix multlpicaitons * exclude kalosm-learning on linux * exclude rbert on linux * temporarily disable windows tests
6 months ago
softmax.rs
Unified conformance test library (#434) * Extract fusor conformance library * Include mismatch position in conformance errors * Remove remaining crate-local fusor tests * Add missing conformance coverage * Broaden conformance test coverage * Use variable-size conformance fuzz shapes * fix formatting * fix tests * larger fuzz ranges * more tests * better cpu parity * fix clippy * fix formatting * fix clippy * refactor: replace clippy #[allow] suppressions with real fixes Bundle args into structs (FlushBatch, MmaParams, TileALoadCtx, TileBLoadCtx, AttentionInputs, BertShape, QMatMulFuzz), introduce CompareFut type alias for the conformance comparator return type, and rewrite needless_range_loop sites in tests/common/mod.rs to use iter_mut().enumerate(). * chore: ignore .claude scheduler artifacts * fix(conformance): skip f16 tests on GPU adapters without SHADER_F16 Linux CI's lavapipe adapter doesn't expose wgpu::Features::SHADER_F16, so the f16 shader fails to validate. Filter the device list per test rather than removing GPU coverage entirely — Mac Metal still runs the f16 path. * fix(conformance): bump inverse-trig tolerance to 1e-4 for lavapipe parity Linux CI's lavapipe adapter diverges from libm by ~6e-5 on asin/acos/atanh/acosh near their domain edges. 1e-4 covers the observed gap without masking real regressions; the macOS Metal adapter still passes comfortably. * fix(conformance): bump inverse-trig tolerance to 1e-3 for lavapipe drift First CI run showed asin diverging 5.6e-5; second run hit 2.1e-4 on a different fuzz seed. lavapipe asin precision is limited near the asymptotes where the derivative blows up. 1e-3 covers observed drift; algorithmic regressions would diverge by orders of magnitude. 
* fix transcribe example * fix(conformance): stabilize CI edge cases * fix(ci): avoid brittle benchmark formatter * fix(conformance): cover Windows WARP tanh drift * fix tanh * more ci fixes * more software gpu backend fixes * looser bounds for trig * relative tolerance * more relative comparisons * passing on warp
28 days ago