| Floneum qwen embed and rbert cpu support (#425)
* qwen embed working
* Add 3×4 outer-loop unrolled quantized matmul for m≥2
Process 3 LHS rows simultaneously so each weight block is loaded
once and reused across all 3 rows, reducing memory traffic by ~3×.
2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged.
* Deduplicate activation quantization in parallel m=1 path
Each thread now quantizes the activation row once instead of
re-quantizing in every 32-column chunk. 14-23% speedup for m=1.
* Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot
Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x))
for signed i8×i8 dot products. Runtime AVX2 detection with scalar
fallback for older x86_64 CPUs.
* Add x86 AVX2 SIMD for BlockQ8_0 activation quantization
Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction,
scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8.
Runtime AVX2 detection with scalar fallback. On aarch64, the compiler
auto-vectorizes the scalar code better than explicit NEON intrinsics
(confirmed by benchmarks showing 7-11% regression with explicit NEON).
* Remove unused process_row_integer_range function
* cpu support for rbert
* better parallelization
* remove uninit unchecked
* start refactoring
* reduce unsafe
* fix formatting
* clean up conditional
* more refactoring
* a bit more cleanup
* more formatting + clippy
* fix clippy in fusion bench
* fix flash attention
* fix flash
* fix tests
* fix clippy
* make device week | 2 months ago |
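
The "sign trick" and the runtime AVX2 detection named in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names (`dot_i8x32`, `dot_i8x32_avx2`, `dot_i8x32_scalar`) are hypothetical, the block scales are left out, and inputs are assumed to lie in [-127, 127], as symmetric 8-bit quantization usually produces. `_mm256_maddubs_epi16` multiplies unsigned bytes by signed bytes, so the sketch feeds it `|x|` and `y` with the sign of `x` applied.

```rust
/// Scalar reference and fallback: plain signed i8 x i8 dot product over 32 values.
pub fn dot_i8x32_scalar(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    x.iter().zip(y.iter()).map(|(&a, &b)| a as i32 * b as i32).sum()
}

/// AVX2 version using the sign trick. Assumes every input is in [-127, 127];
/// a value of -128 in `y` would overflow when its sign is flipped.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8x32_avx2(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    use std::arch::x86_64::*;

    let vx = _mm256_loadu_si256(x.as_ptr() as *const __m256i);
    let vy = _mm256_loadu_si256(y.as_ptr() as *const __m256i);

    // |x|, reinterpreted as unsigned bytes by maddubs.
    let abs_x = _mm256_sign_epi8(vx, vx);
    // y with the sign of x applied, so abs_x * signed_y == x * y per lane.
    let signed_y = _mm256_sign_epi8(vy, vx);

    // u8 x i8 products, adjacent pairs summed into 16 i16 lanes.
    let pairs16 = _mm256_maddubs_epi16(abs_x, signed_y);
    // Widen to 8 i32 lanes by multiplying with 1 and summing adjacent pairs.
    let sums32 = _mm256_madd_epi16(pairs16, _mm256_set1_epi16(1));

    // Horizontal sum of the 8 i32 lanes.
    let lo = _mm256_castsi256_si128(sums32);
    let hi = _mm256_extracti128_si256::<1>(sums32);
    let s = _mm_add_epi32(lo, hi);
    let s = _mm_add_epi32(s, _mm_shuffle_epi32::<0b01_00_11_10>(s));
    let s = _mm_add_epi32(s, _mm_shuffle_epi32::<0b00_00_00_01>(s));
    _mm_cvtsi128_si32(s)
}

/// Dispatch: use AVX2 when the CPU reports it at runtime, otherwise fall back
/// to the scalar loop (older x86_64 CPUs and every other architecture).
pub fn dot_i8x32(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { dot_i8x32_avx2(x, y) };
        }
    }
    dot_i8x32_scalar(x, y)
}
```

A real BlockQ8_0 or BlockQ4_0 kernel would additionally multiply each block's integer sum by the product of the two block scales and, for 4-bit weights, unpack the nibbles first; those steps are omitted here.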