| Floneum qwen embed and rbert cpu support (#425)
* qwen embed working
* Add 3×4 outer-loop unrolled quantized matmul for m≥2
Process 3 LHS rows simultaneously so each weight block is loaded
once and reused across all 3 rows, reducing memory traffic by ~3×.
2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged.
* Deduplicate activation quantization in parallel m=1 path
Each thread now quantizes the activation row once instead of
re-quantizing in every 32-column chunk. 14-23% speedup for m=1.
* Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot
Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x))
for signed i8×i8 dot products. Runtime AVX2 detection with scalar
fallback for older x86_64 CPUs.
* Add x86 AVX2 SIMD for BlockQ8_0 activation quantization
Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction,
scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8.
Runtime AVX2 detection with scalar fallback. On aarch64, the compiler
auto-vectorizes the scalar code better than explicit NEON intrinsics
(confirmed by benchmarks showing 7-11% regression with explicit NEON).
* Remove unused process_row_integer_range function
* cpu support for rbert
* better parallelization
* remove uninit unchecked
* start refactoring
* reduce unsafe
* fix formatting
* clean up conditional
* more refactoring
* a bit more cleanup
* more formatting + clippy
* fix clippy in fusion bench
* fix flash attention
* fix flash
* fix tests
* fix clippy
* make device weak | 2 months ago |
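The quantization, `vec_dot`, and row-reuse items in the entry above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `Q8Block`, `quantize_row`, `dot_i8`, and `qmatmul_rows3` are hypothetical names, the block scale is stored as `f32` to stay within the standard library, only the 3-row reuse is shown (not the 4-wide part of the 3×4 unrolling), and the scalar path rounds with `f32::round` rather than `_mm256_cvtps_epi32`.

```rust
const BLOCK: usize = 32;

/// Simplified Q8_0-style block (hypothetical layout): one f32 scale plus 32 signed bytes.
#[derive(Clone, Debug)]
struct Q8Block {
    scale: f32,
    qs: [i8; BLOCK],
}

/// Scalar quantization: max-abs reduction, scale, round, saturate to [-127, 127].
fn quantize_row(row: &[f32]) -> Vec<Q8Block> {
    assert!(row.len() % BLOCK == 0, "row length must be a multiple of 32");
    row.chunks_exact(BLOCK)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut qs = [0i8; BLOCK];
            for (q, v) in qs.iter_mut().zip(chunk) {
                *q = (v * inv).round().clamp(-127.0, 127.0) as i8;
            }
            Q8Block { scale, qs }
        })
        .collect()
}

/// Scalar fallback for the signed i8 x i8 dot product.
fn dot_i8_scalar(x: &[i8; BLOCK], y: &[i8; BLOCK]) -> i32 {
    x.iter().zip(y).map(|(&a, &b)| a as i32 * b as i32).sum()
}

/// AVX2 path using the sign trick. Assumes inputs lie in [-127, 127],
/// so the i16 pair sums produced by maddubs cannot saturate.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8_avx2(x: &[i8; BLOCK], y: &[i8; BLOCK]) -> i32 {
    use std::arch::x86_64::*;
    unsafe {
        let vx = _mm256_loadu_si256(x.as_ptr() as *const __m256i);
        let vy = _mm256_loadu_si256(y.as_ptr() as *const __m256i);
        // maddubs multiplies unsigned bytes by signed bytes, so pass abs(x)
        // as the unsigned operand and y carrying x's sign as the signed one.
        let abs_x = _mm256_sign_epi8(vx, vx);
        let signed_y = _mm256_sign_epi8(vy, vx);
        let pairs_i16 = _mm256_maddubs_epi16(abs_x, signed_y);
        // Widen to i32 by multiplying with 1 and adding adjacent lanes.
        let sums_i32 = _mm256_madd_epi16(pairs_i16, _mm256_set1_epi16(1));
        let mut lanes = [0i32; 8];
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, sums_i32);
        lanes.iter().sum()
    }
}

/// Runtime dispatch: AVX2 when the CPU reports it, scalar otherwise.
fn dot_i8(x: &[i8; BLOCK], y: &[i8; BLOCK]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed on the line above.
            return unsafe { dot_i8_avx2(x, y) };
        }
    }
    dot_i8_scalar(x, y)
}

/// Integer dot product of two blocks, rescaled by both block scales.
fn block_dot(a: &Q8Block, b: &Q8Block) -> f32 {
    dot_i8(&a.qs, &b.qs) as f32 * a.scale * b.scale
}

/// out[r][c] = lhs[r] . weights[c]. Three LHS rows are processed per pass,
/// so each weight block is fetched once and reused for all three rows.
fn qmatmul_rows3(lhs: &[Vec<Q8Block>], weights: &[Vec<Q8Block>]) -> Vec<Vec<f32>> {
    let mut out = vec![vec![0.0f32; weights.len()]; lhs.len()];
    let mut r = 0;
    while r + 3 <= lhs.len() {
        for (c, w_row) in weights.iter().enumerate() {
            let (mut s0, mut s1, mut s2) = (0.0f32, 0.0f32, 0.0f32);
            for (k, w) in w_row.iter().enumerate() {
                s0 += block_dot(&lhs[r][k], w);
                s1 += block_dot(&lhs[r + 1][k], w);
                s2 += block_dot(&lhs[r + 2][k], w);
            }
            out[r][c] = s0;
            out[r + 1][c] = s1;
            out[r + 2][c] = s2;
        }
        r += 3;
    }
    // Leftover rows (fewer than three) take the plain one-row path.
    for rr in r..lhs.len() {
        for (c, w_row) in weights.iter().enumerate() {
            out[rr][c] = lhs[rr].iter().zip(w_row).map(|(a, w)| block_dot(a, w)).sum();
        }
    }
    out
}
```

Clamping to ±127 keeps −128 out of the quantized data, which is what makes the sign trick safe here: negating −128 wraps in 8 bits, and two adjacent 128×128 products would saturate the 16-bit pair sum.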
| Bump wgpu and fix wasm support for llama (#416)
* bump wgpu and fix wasm support for llama
* use git version of wgpu
* fix checks and require tokio for ocr
* fix formatting
* fix the quantized test
* fix large dispatches
* fix doctests
* fix clippy
* match wgpu version in ci and cache windows
* fix formatting
* setup vulkan
* set vulkan version
* cargo update
* failing tests on all platforms
* fix f16 tensors on gpus that don't support f16
* restore f16 detection
* more resilient caching
* pull out lock current logic
* use dxc
* use std file locks
* try using a smaller batch limit
* install warp | 4 months ago |
| Unified conformance test library (#434)
* Extract fusor conformance library
* Include mismatch position in conformance errors
* Remove remaining crate-local fusor tests
* Add missing conformance coverage
* Broaden conformance test coverage
* Use variable-size conformance fuzz shapes
* fix formatting
* fix tests
* larger fuzz ranges
* more tests
* better cpu parity
* fix clippy
* fix formatting
* fix clippy
* refactor: replace clippy #[allow] suppressions with real fixes
Bundle args into structs (FlushBatch, MmaParams, TileALoadCtx, TileBLoadCtx,
AttentionInputs, BertShape, QMatMulFuzz), introduce CompareFut type alias for
the conformance comparator return type, and rewrite needless_range_loop sites
in tests/common/mod.rs to use iter_mut().enumerate().
* chore: ignore .claude scheduler artifacts
* fix(conformance): skip f16 tests on GPU adapters without SHADER_F16
Linux CI's lavapipe adapter doesn't expose wgpu::Features::SHADER_F16, so
the f16 shader fails to validate. Filter the device list per test rather
than removing GPU coverage entirely — Mac Metal still runs the f16 path.
* fix(conformance): bump inverse-trig tolerance to 1e-4 for lavapipe parity
Linux CI's lavapipe adapter diverges from libm by ~6e-5 on asin/acos/atanh/acosh
near their domain edges. 1e-4 covers the observed gap without masking real
regressions; the macOS Metal adapter still passes comfortably.
* fix(conformance): bump inverse-trig tolerance to 1e-3 for lavapipe drift
First CI run showed asin diverging 5.6e-5; second run hit 2.1e-4 on a
different fuzz seed. lavapipe asin precision is limited near the asymptotes
where the derivative blows up. 1e-3 covers observed drift; algorithmic
regressions would diverge by orders of magnitude.
* fix transcribe example
* fix(conformance): stabilize CI edge cases
* fix(ci): avoid brittle benchmark formatter
* fix(conformance): cover Windows WARP tanh drift
* fix tanh
* more ci fixes
* more software gpu backend fixes
* looser bounds for trig
* relative tolerance
* more relative comparisons
* passing on warp | 28 days ago |
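The tolerance items in the entry above (fixed bounds of 1e-4 and then 1e-3, followed by relative comparisons, plus reporting the mismatch position) can be illustrated with a small comparator. This is a sketch under assumptions, not the conformance library's API: the function name `compare_slices`, its parameters, and the combined `abs_tol + rel_tol * |expected|` bound are all hypothetical choices.

```rust
/// Compare two slices element-wise with a combined absolute and relative tolerance.
/// On failure, the error message names the first mismatching index.
/// Hypothetical helper: the name, signature, and bound are illustrative only.
fn compare_slices(
    actual: &[f32],
    expected: &[f32],
    abs_tol: f32,
    rel_tol: f32,
) -> Result<(), String> {
    if actual.len() != expected.len() {
        return Err(format!(
            "length mismatch: got {}, expected {}",
            actual.len(),
            expected.len()
        ));
    }
    for (i, (&a, &e)) in actual.iter().zip(expected).enumerate() {
        // Exact matches (including equal infinities) and NaN-vs-NaN pass.
        if a == e || (a.is_nan() && e.is_nan()) {
            continue;
        }
        // The allowed error grows with the magnitude of the expected value.
        let limit = abs_tol + rel_tol * e.abs();
        let diff = (a - e).abs();
        // Written as a negation so that a NaN difference counts as a failure.
        if !(diff <= limit) {
            return Err(format!(
                "mismatch at index {i}: got {a}, expected {e} (|diff| = {diff}, limit = {limit})"
            ));
        }
    }
    Ok(())
}
```

A relative term lets one bound cover both small and large outputs, which a single absolute tolerance cannot do when values grow steeply near a domain edge.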
| Get whisper running in wasm and enable support for gpus without subgroups or f16 (#406)
* make rwhisper wasm compatible
* move around traits to minimize whisper dependencies
* fix transcription task
* fix import FutureWasmNotSend
* more wasm fixes
* fix sgemv
* reduce with or without subgroups
* fix subgroup reduction
* softmax without subgroups
* tiled map without subgroups
* quantized without subgroups
* all tests passing without subgroups
* fix sgemv dispatch
* restore subgroupless CI workflows
* only require f16 support for quantization support
* don't require f16 support for qmatmul
* fix kalosm-model-types dev dependencies
* fix kalosm-model-types tests
* fix softmax bounds without subgroups
* better required limits
* fix clippy
* more clippy fixes
* shorter odyssey test
* more clippy fixes
* fix solve when workgroups are not supported
* exclude fusor core - these tests pass locally, but are too slow to run in CI
* exclude inference tests on windows
* exclude kalosm-learning on windows which doesn't have f16 support | 6 months ago |
| Optimize fusor (#393)
* rename fusor core
* create fusor error type
* try into for gguf value
* remove candle
* create index select kernel
* refactor quantized implementation
* add dequantize kernel template
* add where cond
* fuse dequantize and visit tiled ops
* fix hello example
* Automatically spawn polling thread
* llama port compiles
* matmul almost working
* fuzzing small matrices passes
* fix fuzz matmul test
* rope test passing
* fix metadata loading
* Fix shape calculation for index select
* fix qmatmul shape
* Fix shape calculation for mat mul
* remove some logging
* fix compute graph deadlock
* batched matrix multiplication
* fix cache
* fix attention mask
* building the compute graph works
* fix graphviz
* remove recursion from resolve
* handle > 3 dimensions in map tiled
* make timing info optional
* stable wgpu
* fx hash
* Fix passes after nodes are garbage collected
* remove log
* add an extra_assertions flag
* add a sleep to the device poll loop
* Fix more merging bugs
* fix cycles
* model runs without panicking
* fix qmatmul output buffer size
* add type assertions
* Fix rectangular qmatmul
* tokens generating
* fix softmax test
* fix index select on large arrays
* fix rms norm
* matmul failing
* use fused multiply add in the matmul kernel
* add strides to matmul
* handle non-contiguous tensors in qmatmul
* Fix attention mechanism. It works!!!
* fix timing queries
* more graphviz fixes
* fuse matmul kernels
* clippy fix
* Fix kernel fusion
* refactor mid level representation
* just remove queries
* more benchmarks
* more candle benchmarks
* create Operation trait
* fix tests
* add device to the workgroup size constraints function
* implement operation for reduce
* fix some lints
* implement operation for resize
* remove check_bounds_contiguous
* add index select to MIR
* clean up formatting
* remove log
* remove more logs
* bench larger inputs
* add dequantize to MIR
* clean up
* MIR for qmatmul
* add matmul to MIR
* fix formatting
* queue all operations before running anything
* fix tests
* fix cached tensors
* simplify dependencies
* fuse multiple unrelated kernels
* rename values
* fix merging
* linearize size for reduce
* remove logs
* move output into a separate method
* tests passing
* merge adjacent non-related kernels without synchronization
* remove visit
* disable merging and bump wgpu
* use pipeline cache
* cache most compilation steps
* Fix bench dependencies
* move the caches to the device
* fix tensor partialeq
* memory coalescing in visit_tiled
* remove log
* More consistent performance
* re-enable non-conflicting merges
* add kernel name for debugging
* cargo update
* faster builds for infer example
* double tokens per second
* only materialize every other layer
* more detailed pair wise name
* fix pairwise bench
* add a many dimensional pairwise benchmark
* skip empty dimensions in tiled map
* scale tile size down as the rank scales up
* materialize every layer
* add support for custom operations
* better round up method
* faster reduce kernel
* unroll reduction in softmax kernel
* vectorized softmax load
* add a separate case for large softmax
* implement the same special case for reduce
* custom operations are sync
* don't repeat dequantize
* label everything
* cache dequantize rms norm
* where cond custom opt
* bench larger qmatmul
* split out sgemv variant
* initial attempt at sgemv
* match braces
* fix dispatching
* tests passing
* use subgroupAdd function
* clean up imports
* faster sgemv
* slightly faster sgemv
* simd sgemv
* add unrolled dequantize variants
* 70% faster sgemv
* split chunk size and vector size
* implement vectorized sgemm
* add more qmatmul benchmarks
* skip second sum pass if this is a single subgroup
* pull out dequantize_vec4_block
* test and fix vec4 dequant
* more optimized q6k dequantize
* use the same pattern for unrolled
* specialized vec4 q6k dequant
* add more qmatmul fuzzing tests
* longer fuzzing
* fix dequantize q6k
* Fix fuzz_de_quantize vec4 test
* specialized dequantize_vec4_block q4k implementation
* restore multi-operation fusion
* more flexible sgemv kernel
* interleaved blocks
* fix sgemv
* ignore tokenizer.json
* slightly cleaner q6k dequantize
* specialized q6k sgemv kernel
* remove log
* add a link to the llama.cpp kernel
* make q6k work with multiple rows at once
* make preloading optional in q6k gemv
* specialized q4k gemv implementation
* first value correct
* simplify scale calculation
* fix q4k
* remove log
* slightly faster
* specialized q_n gemv
* add q5_0
* fix llama.cpp link
* cache downloads for tests
* cache qmatmul bench file
* faster q_n kernels
* add specialized q8_0 gemv kernel
* double dispatch size
* bump dependencies and move closer to wasm compat
* fix compilation
* disable zero initialization
* same configuration for tests
* slightly faster Q4k
* explicit vectorization
* unroll loops
* fix kalosm llama
* fix dispatch size
* faster q8_0 gemv
* refactor matmul impl
* vectorized sgemm multiply
* revert changes to kalosm-llama
* undo kalosm-language cargo.toml changes
* restore ocr changes
* fix formatting
* fix tokenizers
* clippy fix
* fix dependencies
* fix clippy and formatting
* fix formatting
* fix clippy
* fix tests | 9 months ago |