File    Last commit message    Last updated
Llama fusor v2 (#413)
* add dim trait
* add repeat
* expand index
* llama runs with fusor
* more assertions
* failing rope test
* failing cat test
* 5d index
* fix visit_tiled with large z dims
* fix assertion for zero sized tensors
* fix sgemv batch
* fix cache test and attention layer dequant
* 3x faster
* bench rope and attention
* fused rope kernel
* flash attention
* better benchmarks
* more benches
* fix formatting
* remove useless shared arrays
* use shape bindings instead of separate shape inputs
* remove bound check
* pull out re-used exp
* optimized flash attention
* unroll reduction
* fix clippy
* vectorized
* bench one seq len
* handle non-contiguous inputs
* failing flash attention test
* fix flash attention
* add optional mask support
* integrate fused attention into llama
* add fuzzing test
* reformatting
* block based loading
* fix flash attention when the query and kv lengths don't match
* used fused rope
* normal fused rope
* optimize repeat_kv
* optimize mask cache
* integrate mqa into flash attention
* remove block on
* fix rope freq rank
* 16 t/s
* no tiling
* more f16 fixes
* use f16 activations
* more f16 stability fixes
* fix q4k type
* slightly faster
* fix formatting
* remove old examples
* fix kalosm
* restore vision adapter
* use conv3d
* fix loading the 3d conv
* fix clippy
* more clippy fixes
* nary kernel
* clean up some of the warnings
* simplify nary optimization
* fix clippy
* use graph rewrites instead of visiting
* worklist
* simpler optimizations
* clean up some unused code
* fuse nary into reduce/matmul/index/etc
* remove session serialization
* fix formatting
* allow changing the activation type
* fix the default type
* fix clippy and formatting
* fix doc tests
* reuse the same device for all tests
* fix gemv on cuda
* start merging into nary
* fuse index select
* fix infer example
* fix matmul fusion
* fix formatting
* fix clippy
Last updated: 4 months ago
Floneum qwen embed and rbert cpu support (#425)
* qwen embed working
* Add 3×4 outer-loop unrolled quantized matmul for m≥2
  Process 3 LHS rows simultaneously so each weight block is loaded once and reused across all 3 rows, reducing memory traffic by ~3×. 2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged.
* Deduplicate activation quantization in parallel m=1 path
  Each thread now quantizes the activation row once instead of re-quantizing in every 32-column chunk. 14-23% speedup for m=1.
* Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot
  Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x)) for signed i8×i8 dot products. Runtime AVX2 detection with scalar fallback for older x86_64 CPUs.
* Add x86 AVX2 SIMD for BlockQ8_0 activation quantization
  Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction, scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8. Runtime AVX2 detection with scalar fallback. On aarch64, the compiler auto-vectorizes the scalar code better than explicit NEON intrinsics (confirmed by benchmarks showing 7-11% regression with explicit NEON).
* Remove unused process_row_integer_range function
* cpu support for rbert
* better parallelization
* remove uninit unchecked
* start refactoring
* reduce unsafe
* fix formatting
* clean up conditional
* more refactoring
* a bit more cleanup
* more formatting + clippy
* fix clippy in fusion bench
* fix flash attention
* fix flash
* fix tests
* fix clippy
* make device week
Last updated: 2 months ago
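The AVX2 `vec_dot` entry in #425 mentions computing signed i8×i8 dot products with `_mm256_maddubs_epi16` and the sign trick, behind runtime AVX2 detection with a scalar fallback. The following is a minimal sketch of that idea, not the repository's code: the function names (`dot_i8`, `dot_i8_avx2`, `dot_i8_scalar`) are hypothetical, the block is assumed to hold 32 quantized values, per-block scales are left out, and the inputs are assumed to lie in -127..=127 (the trick breaks for -128, and the i16 pair sums could saturate).

```rust
/// Scalar reference: widen to i32 and accumulate.
fn dot_i8_scalar(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    x.iter().zip(y.iter()).map(|(&a, &b)| a as i32 * b as i32).sum()
}

/// AVX2 path (hypothetical helper). Assumes all inputs are in -127..=127.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8_avx2(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    use std::arch::x86_64::*;
    unsafe {
        let vx = _mm256_loadu_si256(x.as_ptr() as *const __m256i);
        let vy = _mm256_loadu_si256(y.as_ptr() as *const __m256i);
        // maddubs multiplies unsigned bytes by signed bytes, so rewrite
        // x * y as |x| * (y * sign(x)).
        let abs_x = _mm256_sign_epi8(vx, vx);
        let signed_y = _mm256_sign_epi8(vy, vx);
        // 32 byte products, summed pairwise into 16 i16 lanes.
        let pairs = _mm256_maddubs_epi16(abs_x, signed_y);
        // Multiply by 1 and add adjacent lanes: 16 i16 -> 8 i32.
        let quads = _mm256_madd_epi16(pairs, _mm256_set1_epi16(1));
        let mut lanes = [0i32; 8];
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, quads);
        lanes.iter().sum()
    }
}

/// Dispatch: use AVX2 when the CPU reports it, otherwise the scalar loop.
fn dot_i8(x: &[i8; 32], y: &[i8; 32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: the feature check above guarantees AVX2 support.
            return unsafe { dot_i8_avx2(x, y) };
        }
    }
    dot_i8_scalar(x, y)
}
```

On non-x86_64 targets the `cfg` block compiles away and `dot_i8` is just the scalar loop, which matches the commit's note that the aarch64 scalar code auto-vectorizes well.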