kalosm/interfaces · fl/kalosm - AtomGit

GitHub: Floneum qwen embed and rbert cpu support (#425)

File	Last commit	Last updated
kalosm-common	Add caching for wasm file downloads and limit the hypothetical chunker length (#417) * cache downloads in wasm * more type coercion * add a length limit to the hypothetical chunker * a bit of cleanup * fix formatting * remove logs * more resilient progress updates * fix caching progress * fix clippy	4 months ago
kalosm-language	Generalize task chunker and add more logging (#423) * generalize task chunker and add more logging * Fix misleading variable name in chunking example (#429) * Initial plan * Fix variable name: rename hypothetical to summarizer in chunking.rs example Co-authored-by: ealmloff <66571940+ealmloff@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: ealmloff <66571940+ealmloff@users.noreply.github.com> * fix formatting --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>	2 months ago
kalosm-learning-macro	Add crate metadata	1 year ago
kalosm-learning	use total cmp	4 months ago
kalosm-model-types	Add caching for wasm file downloads and limit the hypothetical chunker length (#417) * cache downloads in wasm * more type coercion * add a length limit to the hypothetical chunker * a bit of cleanup * fix formatting * remove logs * more resilient progress updates * fix caching progress * fix clippy	4 months ago
kalosm-parse-macro	Fix clippy lints (#381)	11 months ago
kalosm-sample	Add caching for wasm file downloads and limit the hypothetical chunker length (#417) * cache downloads in wasm * more type coercion * add a length limit to the hypothetical chunker * a bit of cleanup * fix formatting * remove logs * more resilient progress updates * fix caching progress * fix clippy	4 months ago
kalosm-sound	fix denoise in debug	6 months ago
kalosm-streams	Add crate metadata	1 year ago
kalosm-vision	Optimize fusor (#393) * rename fusor core * create fusor error type * try into for gguf value * remove candle * create index select kernel * refactor quantized implementation * add dequanitze kernel template * add where cond * fuse dequantize and visit tiled ops * fix hello example * Automatically spawn polling thread * llama port compiles * matmul almost working * fuzzing small matrixes passes * fix fuzz matmul test * rope test passing * fix metadata loading * Fix shape calculation for index select * fix qmatmul shape * Fix shape calculation for mat mul * remove some logging * fix compute graph deadlock * batched matrix multiplication * fix cache * fix attention mask * building the compute graph works * fix graphvis * remove recursion from resolve * handle > 3 dimensions in map tiled * make timing info optional * stable wgpu * fx hash * Fix passes after nodes are garbage collected * remove log * add a extra_assertions flag * add a sleep to the device poll loop * Fix more merging bugs * fix cycles * model runs without panicking * fix qmatmul output buffer size * add type assertions * Fix rectangular qmatmul * tokens generating * fix softmax test * fix index select on large arrays * fix rms norm * matmul failing * use fused multiply add in the matmul kernel * add strides to matmul * handle non-contiguous tensors in qmatmul * Fix attention mechanism. It works!!! 
* fix timing queries * more graphviz fixes * fuse matmul kernels * clippy fix * Fix kernel fusion * refactor mid level representation * just remove queries * more benchmarks * more candle benchmarks * create Operation trait * fix tests * add device to the workgroup size constraints function * implement operation for reduce * fix some lints * implement operation for resize * remove check_bounds_contiguous * add index select to MIR * clean up formatting * remove log * remove more logs * bench larger inputs * add dequantize to MIR * clean up * MIR for qmatmul * add matmul to MIR * fix formatting * queue all operations before running anything * fix tests * fix cached tensors * simplify dependencies * fuse multiple unrelated kernels * rename values * fix merging * linearize size for reduce * remove logs * move output into a separate method * tests passing * merge adjacent non-related kernels without synchronization * remove visit * disable merging and bump wgpu * use pipeline cache * cache most compilation steps * Fix bench dependencies * move the caches to the device * fix tensor partialeq * memory coalescing in visit_tiled * remove log * More consistent performance * re-enable non-conflicting merges * add kernel name for debugging * cargo update * faster builds for infer example * double tokens per second * only materialize every other layer * more detailed pair wise name * fix pairwise bench * add a many dimensional pairwise benchmark * skip empty dimensions in tiled map * scale tile size down as the rank scales up * materialize every layer * add support for custom operations * better round up method * faster reduce kernel * unroll reduction in softmax kernel * vectorized softmax load * add a separate case for large softmax * implement the same special case for reduce * custom operations are sync * don't repeat dequantize * label everything * cache dequantize rms norm * where cond custom opt * bench larger qmatmul * split out sgemv variant * initial attempt at sgemv 
* match braces * fix dispatching * tests passing * use subgroupAdd function * clean up imports * faster sgemv * slightly faster sgemv * simd sgemv * add unrolled dequantize variants * 70% faster sgemv * split chunk size and vector size * implement vectorized sgemm * add more qmatmul benchmarks * skip second sum pass if this is a single subgroup * pull out dequantize_vec4_block * test and fix vec4 dequant * more optimized q6k dequantize * use the same pattern for unrolled * specialized vec4 q6k dequant * add more qmatmul fuzzing tests * longer fuzzing * fix dequantize q6k * Fix fuzz_de_quantize vec4 test * specialized dequantize_vec4_block q4k implementation * restore multi-operation fusion * more flexable sgemv kernel * interleaved blocks * fix sgemv * ignore tokenizer.json * slightly cleaner q6k dequantize * specialized q6k sgemv kernel * remove log * add a link to the llama.cpp kernel * make q6k work with multiple rows at once * make preloading optional in q6k gemv * specialized q4k gemv implementation * first value correct * simplify scale calculation * fix q4k * remove log * slightly faster * specialized q_n gemv * add q5_0 * fix llama.cpp link * cache downloads for tests * cache qmatmul bench file * faster q_n kernels * add specialized q8_0 gemv kernel * double dispatch size * bump dependencies and move closer to wasm compat * fix compilation * disable zero initialization * same configuration for tests * slightly faster Q4k * explicit vectorization * unroll loops * fix kalosm llama * fix dispatch size * faster q8_0 gmv * refactor matmul impl * vectorized sgemm multiply * revert changes to kalosm-llama * undo kalosm-language cargo.toml changes * restore ocr changes * fix formatting * fix tokenizers * clippy fix * fix dependencies * fix clippy and formatting * fix formatting * fix clippy * fix tests	9 months ago
kalosm	Floneum qwen embed and rbert cpu support (#425) * qwen embed working * Add 3×4 outer-loop unrolled quantized matmul for m≥2 Process 3 LHS rows simultaneously so each weight block is loaded once and reused across all 3 rows, reducing memory traffic by ~3×. 2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged. * Deduplicate activation quantization in parallel m=1 path Each thread now quantizes the activation row once instead of re-quantizing in every 32-column chunk. 14-23% speedup for m=1. * Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x)) for signed i8×i8 dot products. Runtime AVX2 detection with scalar fallback for older x86_64 CPUs. * Add x86 AVX2 SIMD for BlockQ8_0 activation quantization Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction, scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8. Runtime AVX2 detection with scalar fallback. On aarch64, the compiler auto-vectorizes the scalar code better than explicit NEON intrinsics (confirmed by benchmarks showing 7-11% regression with explicit NEON). * Remove unused process_row_integer_range function * cpu support for rbert * better parallelization * remove uninit unchecked * start refactoring * reduce unsafe * fix formatting * clean up conditional * more refactoring * a bit more cleanup * more formatting + clippy * fix clippy in fusion bench * fix flash attention * fix flash * fix tests * fix clippy * make device week	2 months ago
language-model	Floneum qwen embed and rbert cpu support (#425) * qwen embed working * Add 3×4 outer-loop unrolled quantized matmul for m≥2 Process 3 LHS rows simultaneously so each weight block is loaded once and reused across all 3 rows, reducing memory traffic by ~3×. 2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged. * Deduplicate activation quantization in parallel m=1 path Each thread now quantizes the activation row once instead of re-quantizing in every 32-column chunk. 14-23% speedup for m=1. * Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x)) for signed i8×i8 dot products. Runtime AVX2 detection with scalar fallback for older x86_64 CPUs. * Add x86 AVX2 SIMD for BlockQ8_0 activation quantization Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction, scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8. Runtime AVX2 detection with scalar fallback. On aarch64, the compiler auto-vectorizes the scalar code better than explicit NEON intrinsics (confirmed by benchmarks showing 7-11% regression with explicit NEON). * Remove unused process_row_integer_range function * cpu support for rbert * better parallelization * remove uninit unchecked * start refactoring * reduce unsafe * fix formatting * clean up conditional * more refactoring * a bit more cleanup * more formatting + clippy * fix clippy in fusion bench * fix flash attention * fix flash * fix tests * fix clippy * make device week	2 months ago
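
The #425 commit message in the last two rows describes computing signed i8×i8 dot products with `_mm256_maddubs_epi16` and a sign trick, guarded by runtime AVX2 detection with a scalar fallback. The following is a minimal Rust sketch of that idea, not the code from the repository. The function names `dot_i8`, `dot_i8_avx2`, and `dot_i8_scalar` are hypothetical, and the sketch assumes every input value lies in [-127, 127], because -128 would overflow when the sign step negates it.

```rust
/// Scalar reference: dot product of two i8 slices, accumulated in i32.
fn dot_i8_scalar(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// AVX2 path (hypothetical helper). Assumes values in [-127, 127].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8_avx2(a: &[i8], b: &[i8]) -> i32 {
    use std::arch::x86_64::*;

    let chunks = a.len() / 32; // 32 i8 lanes per 256-bit register
    let ones = _mm256_set1_epi16(1);
    let mut acc = _mm256_setzero_si256();

    for i in 0..chunks {
        let x = _mm256_loadu_si256(a.as_ptr().add(i * 32) as *const __m256i);
        let y = _mm256_loadu_si256(b.as_ptr().add(i * 32) as *const __m256i);

        // maddubs multiplies unsigned bytes by signed bytes, so move the
        // sign of x onto y: x * y == |x| * (y * sign(x)).
        let abs_x = _mm256_sign_epi8(x, x);
        let signed_y = _mm256_sign_epi8(y, x);

        // Adjacent byte products are summed into i16 lanes, then widened
        // and summed pairwise into i32 lanes.
        let prod16 = _mm256_maddubs_epi16(abs_x, signed_y);
        let prod32 = _mm256_madd_epi16(prod16, ones);
        acc = _mm256_add_epi32(acc, prod32);
    }

    // Horizontal sum of the eight i32 lanes.
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    let vector_sum: i32 = lanes.iter().sum();

    // Elements that do not fill a whole register go through the scalar path.
    let tail = chunks * 32;
    vector_sum + dot_i8_scalar(&a[tail..], &b[tail..])
}

/// Picks the AVX2 path when the CPU supports it, otherwise the scalar one.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            return unsafe { dot_i8_avx2(a, b) };
        }
    }
    dot_i8_scalar(a, b)
}
```

Calling `dot_i8(&a, &b)` on two equal-length `i8` slices should return the same value as `dot_i8_scalar(&a, &b)` on any CPU. Keeping the scalar version as both the fallback and the reference makes the SIMD path easy to check in a test.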