| Bump wgpu and fix wasm support for llama (#416)
* bump wgpu and fix wasm support for llama
* use git version of wgpu
* fix checks and require tokio for ocr
* fix formatting
* fix the quantized test
* fix large dispatches
* fix doctests
* fix clippy
* match wgpu version in ci and cache windows
* fix formatting
* setup vulkan
* set vulkan version
* cargo update
* failing tests on all platforms
* fix f16 tensors on gpus that don't support f16
* restore f16 detection
* more resilient caching
* pull out lock current logic
* use dxc
* use std file locks
* try using a smaller batch limit
* install warp | 4 months ago |
| Floneum qwen embed and rbert cpu support (#425)
* qwen embed working
* Add 3×4 outer-loop unrolled quantized matmul for m≥2
Process 3 LHS rows simultaneously so each weight block is loaded
once and reused across all 3 rows, reducing memory traffic by ~3×.
2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged.
* Deduplicate activation quantization in parallel m=1 path
Each thread now quantizes the activation row once instead of
re-quantizing in every 32-column chunk. 14-23% speedup for m=1.
* Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot
Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x))
for signed i8×i8 dot products. Runtime AVX2 detection with scalar
fallback for older x86_64 CPUs.
* Add x86 AVX2 SIMD for BlockQ8_0 activation quantization
Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction,
scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8.
Runtime AVX2 detection with scalar fallback. On aarch64, the compiler
auto-vectorizes the scalar code better than explicit NEON intrinsics
(confirmed by benchmarks showing 7-11% regression with explicit NEON).
* Remove unused process_row_integer_range function
* cpu support for rbert
* better parallelization
* remove uninit unchecked
* start refactoring
* reduce unsafe
* fix formatting
* clean up conditional
* more refactoring
* a bit more cleanup
* more formatting + clippy
* fix clippy in fusion bench
* fix flash attention
* fix flash
* fix tests
* fix clippy
* make device weak | 2 months ago |
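
The AVX2 entries in the commit above describe two steps: quantizing activations into BlockQ8_0 (max-abs reduction, scale, round, saturate) and computing signed i8×i8 products through the sign trick, because `_mm256_maddubs_epi16` multiplies an unsigned operand by a signed one. The following is a minimal scalar sketch of both steps, not the code from the PR. The names `BlockQ8`, `quantize_block`, `sign_trick_product`, and `dot_block` are hypothetical, the scale is stored as `f32` here for simplicity, and `round()` rounds halves away from zero, whereas `_mm256_cvtps_epi32` follows the current rounding mode.

```rust
/// Values per quantization block (assumed to be 32, as in the Q8_0 layout).
const BLOCK: usize = 32;

/// Hypothetical Q8_0-style block: one scale plus 32 signed bytes.
/// Assumption: the scale is kept as f32 here to keep the sketch dependency-free.
struct BlockQ8 {
    scale: f32,
    qs: [i8; BLOCK],
}

/// Quantize 32 floats: max-abs reduction, then scale, round, and saturate.
fn quantize_block(xs: &[f32; BLOCK]) -> BlockQ8 {
    let max_abs = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = max_abs / 127.0;
    // An all-zero block would divide by zero, so map it to all-zero quants.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut qs = [0i8; BLOCK];
    for (q, x) in qs.iter_mut().zip(xs) {
        // Clamping to [-127, 127] keeps -128 out of the quantized data.
        *q = (x * inv).round().clamp(-127.0, 127.0) as i8;
    }
    BlockQ8 { scale, qs }
}

/// Scalar model of the sign trick: abs(x) * sign(y, x).
/// abs(x) is the unsigned operand; sign(y, x) negates y where x is negative
/// and zeroes it where x is zero, which mirrors `_mm256_sign_epi8`.
fn sign_trick_product(x: i8, y: i8) -> i32 {
    let abs_x = x.unsigned_abs() as i32;
    let signed_y = match x.signum() {
        0 => 0,
        1 => y,
        // Wrapping negation models the SIMD behaviour: -(-128) stays -128.
        _ => y.wrapping_neg(),
    };
    abs_x * signed_y as i32
}

/// Dot product of two quantized blocks: integer sum, then one float rescale.
fn dot_block(a: &BlockQ8, b: &BlockQ8) -> f32 {
    let sum: i32 = a
        .qs
        .iter()
        .zip(&b.qs)
        .map(|(&x, &y)| sign_trick_product(x, y))
        .sum();
    a.scale * b.scale * sum as f32
}
```

The trick matches the true product for every pair in [-127, 127]. It breaks only when `y` is -128 and `x` is negative, which is why the sketch clamps quantized values to ±127.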
| Get whisper running in wasm and enable support for gpus without subgroups or f16 (#406)
* make rwhisper wasm compatible
* move around traits to minimize whisper dependencies
* fix transcription task
* fix import FutureWasmNotSend
* more wasm fixes
* fix sgemv
* reduce with or without subgroups
* fix subgroup reduction
* softmax without subgroups
* tiled map without subgroups
* quantized without subgroups
* all tests passing without subgroups
* fix sgemv dispatch
* restore subgroupless CI workflows
* only require f16 support for quantization support
* don't require f16 support for qmatmul
* fix kalosm model-types dev dependencies
* fix kalosm-model-types tests
* fix softmax bounds without subgroups
* better required limits
* fix clippy
* more clippy fixes
* shorter odyssey test
* more clippy fixes
* fix solve when workgroups are not supported
* exclude fusor core - these tests pass locally, but are too slow to run
in CI
* exclude inference tests on windows
* exclude kalosm-learning on windows which doesn't have f16 support | 6 months ago |
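
The commit above makes reductions work on GPUs without subgroup operations. The log does not say how the fallback is implemented. One common approach is a tree reduction over workgroup shared memory, with a barrier between halving steps. The sketch below simulates that pattern on the CPU under that assumption; `WORKGROUP_SIZE` and `workgroup_tree_sum` are hypothetical names, and the sequential inner loop stands in for invocations running in parallel between barriers.

```rust
/// Hypothetical workgroup size; must be a power of two for the halving loop.
const WORKGROUP_SIZE: usize = 64;

/// CPU simulation of a shared-memory tree reduction (no subgroup operations).
fn workgroup_tree_sum(input: &[f32]) -> f32 {
    // Step 1: each simulated invocation accumulates a strided slice of the
    // input into its own shared-memory slot.
    let mut shared = [0.0f32; WORKGROUP_SIZE];
    for (i, x) in input.iter().enumerate() {
        shared[i % WORKGROUP_SIZE] += x;
    }

    // Step 2: halve the number of active invocations on every pass.
    // On a GPU a workgroup barrier would separate the passes.
    let mut stride = WORKGROUP_SIZE / 2;
    while stride > 0 {
        for lane in 0..stride {
            // Reads come from the upper half and writes go to the lower
            // half, so lanes within one pass do not conflict.
            shared[lane] += shared[lane + stride];
        }
        stride /= 2;
    }

    // Slot 0 now holds the total for the whole workgroup.
    shared[0]
}
```

A reduction of this shape takes log2(`WORKGROUP_SIZE`) passes and needs only shared memory and barriers, so it does not depend on subgroup support.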
| Feature: Add Anthropic Structured Outputs (#428)
* refactor
* add e2e tests for anthropic structured gen and more models
* pull out shared logic
* fix clippy
---------
Co-authored-by: Evan Almloff <evanalmloff@gmail.com> | 2 months ago |