File | Last commit message | Last updated
Bump wgpu and fix wasm support for llama (#416)
* bump wgpu and fix wasm support for llama
* use git version of wgpu
* fix checks and require tokio for ocr
* fix formatting
* fix the quantized test
* fix large dispatches
* fix doctests
* fix clippy
* match wgpu version in ci and cache windows
* fix formatting
* setup vulkan
* set vulkan version
* cargo update
* failing tests on all platforms
* fix f16 tensors on gpus that don't support f16
* restore f16 detection
* more resilient caching
* pull out lock current logic
* use dxc
* use std file locks
* try using a smaller batch limit
* install warp
4 months ago
Floneum qwen embed and rbert cpu support (#425)
* qwen embed working
* Add 3×4 outer-loop unrolled quantized matmul for m≥2
  Process 3 LHS rows simultaneously so each weight block is loaded once and reused across all 3 rows, reducing memory traffic by ~3×. 2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged.
* Deduplicate activation quantization in parallel m=1 path
  Each thread now quantizes the activation row once instead of re-quantizing in every 32-column chunk. 14-23% speedup for m=1.
* Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot
  Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x)) for signed i8×i8 dot products. Runtime AVX2 detection with scalar fallback for older x86_64 CPUs.
* Add x86 AVX2 SIMD for BlockQ8_0 activation quantization
  Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction, scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8. Runtime AVX2 detection with scalar fallback. On aarch64, the compiler auto-vectorizes the scalar code better than explicit NEON intrinsics (confirmed by benchmarks showing 7-11% regression with explicit NEON).
* Remove unused process_row_integer_range function
* cpu support for rbert
* better parallelization
* remove uninit unchecked
* start refactoring
* reduce unsafe
* fix formatting
* clean up conditional
* more refactoring
* a bit more cleanup
* more formatting + clippy
* fix clippy in fusion bench
* fix flash attention
* fix flash
* fix tests
* fix clippy
* make device week
2 months ago
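
The sign trick mentioned in the commit above can be illustrated with a minimal sketch. `_mm256_maddubs_epi16` multiplies unsigned bytes by signed bytes, so a signed i8×i8 product is obtained by feeding it `abs(x)` and `y` carrying the sign of `x`. The function names (`dot_i8_scalar`, `dot_i8_avx2`, `dot_i8`) and the fixed 32-element block are assumptions made for illustration, not the repository's actual code, and the sketch assumes quantized values stay in [-127, 127] so that neither the sign flip nor the 16-bit pair sums overflow.

```rust
/// Scalar reference and fallback: signed i8 dot product accumulated in i32.
fn dot_i8_scalar(x: &[i8], y: &[i8]) -> i32 {
    x.iter().zip(y).map(|(&a, &b)| a as i32 * b as i32).sum()
}

/// AVX2 path for one 32-element block (hypothetical helper).
/// Assumes all values are in [-127, 127].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8_avx2(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    use std::arch::x86_64::*;
    unsafe {
        let vx = _mm256_loadu_si256(x.as_ptr() as *const __m256i);
        let vy = _mm256_loadu_si256(y.as_ptr() as *const __m256i);
        // abs(x): applying x's own sign to x makes every lane non-negative.
        let ax = _mm256_sign_epi8(vx, vx);
        // sign(y, x): y negated where x < 0, zeroed where x == 0.
        let sy = _mm256_sign_epi8(vy, vx);
        // unsigned * signed bytes, adjacent pairs summed into 16 x i16.
        let pairs = _mm256_maddubs_epi16(ax, sy);
        // Widen to 8 x i32 by multiplying with 1 and summing adjacent pairs.
        let quads = _mm256_madd_epi16(pairs, _mm256_set1_epi16(1));
        // Horizontal sum of the 8 lanes.
        let mut lanes = [0i32; 8];
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, quads);
        lanes.iter().sum()
    }
}

/// Runtime dispatch: use AVX2 when the CPU reports it, else the scalar path.
fn dot_i8(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // Safe to call: the feature check above guarantees AVX2 support.
            return unsafe { dot_i8_avx2(x, y) };
        }
    }
    dot_i8_scalar(x, y)
}
```

Calling `dot_i8(&x, &y)` on two 32-element blocks should give the same result as `dot_i8_scalar(&x, &y)` on any target; on non-x86_64 targets only the scalar path is compiled.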
Get whisper running in wasm and enable support for gpus without subgroups or f16 (#406)
* make rwhisper wasm compatable
* move around traits to minimize whisper dependencies
* fix transcription task
* fix import FutureWasmNotSend
* more wasm fixes
* fix sgemv
* reduce with or without subgroups
* fix subgroup reduction
* softmax without subgroups
* tiled map without subgroups
* quantized without subgroups
* all tests passing without subgroups
* fix sgemv dispatch
* restore subgroupless CI workflows
* only require f16 support for quantization support
* don't require f16 support for qmatmul
* fix kalosm model-types dev dependancies
* fix kalosm-model-types tests
* fix softmax bounds without subgroups
* better required limits
* fix clippy
* more clippy fixes
* shorter odyssey test
* more clippy fixes
* fix solve when workgroups are not supported
* exclude fusor core - these tests pass locally, but are too slow to run in CI
* exclude inference tests on windows
* exclude kalosm-learning on windows which doesn't have f16 support
6 months ago
Feature: Add Anthropic Structured Outputs (#428)
* refactor
* add e2e tests for anthropic structured gen and more models
* pull out shared logic
* fix clippy
---------
Co-authored-by: Evan Almloff <evanalmloff@gmail.com>
2 months ago