
Fusor ML

Fusor is a WGPU-based ML runtime with kernel fusion, aimed at making custom operations both ergonomic and fast. The hope is that it will serve as the web and AMD runtime for Kalosm once it is stable enough.

Status

Basic operations are working and simple kernel fusion is implemented, but this is not production ready yet.

Features:

  • Elementwise ops
  • Fuse Elementwise ops together
  • MatMul
  • Reduce ops
  • Fuse Elementwise ops into Reduce ops
  • PairWise ops
  • Fuse Elementwise ops into PairWise ops
  • Analyze buffer usage for in-place ops
  • Memory move/cat/etc ops
  • Cast ops
  • Fuse PairWise ops together?
  • Fuse parallel Reduce ops?
  • Fuse PairWise ops with two of the same input into an elementwise op
  • Dynamically apply fusion based on runtime throughput data
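To make the fusion items above concrete, here is a minimal CPU-side sketch in plain Rust. It is not Fusor's API and does not use WGPU; the function names (`unfused`, `fused`, `fused_map_sum`) are hypothetical and exist only for illustration. It shows the general idea behind fusing elementwise ops together and fusing an elementwise op into a reduce: do the same arithmetic in one pass instead of writing an intermediate buffer between ops.

```rust
/// Unfused: each elementwise op makes its own pass over the data
/// and materializes an intermediate buffer.
/// (Hypothetical helper for illustration, not part of Fusor.)
pub fn unfused(input: &[f32]) -> Vec<f32> {
    let doubled: Vec<f32> = input.iter().map(|x| x * 2.0).collect();
    doubled.iter().map(|x| x + 1.0).collect()
}

/// Fused elementwise ops: the two ops are composed and applied in a
/// single pass, so no intermediate buffer is written.
pub fn fused(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0 + 1.0).collect()
}

/// Elementwise op fused into a reduce: the op is applied while summing,
/// so the mapped values are never stored at all.
pub fn fused_map_sum(input: &[f32]) -> f32 {
    input.iter().map(|x| x * 2.0 + 1.0).sum()
}
```

Calling `fused(&[1.0, 2.0, 3.0])` gives the same values as `unfused` on the same input. On a GPU the equivalent saving is fewer kernel dispatches and fewer round trips through buffer memory.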

Operations required for a Llama implementation:

  • RmsNorm
  • Matmul
  • Rope
  • Unsqueeze
  • Cat
  • Reshape
  • Transpose
  • Softmax
  • narrow
  • silu
  • arange
  • sin
  • cos
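For readers unfamiliar with some of the operations listed above, the following plain-Rust sketch gives CPU reference definitions of three of them: RmsNorm, silu, and Softmax. These are the standard textbook formulas, not Fusor's kernels. The function names, the one-dimensional slice signatures, and the `eps` parameter are assumptions made for illustration.

```rust
/// RmsNorm over one vector: x / sqrt(mean(x^2) + eps) * weight.
/// `eps` is a small stabilizing constant (an assumed parameter).
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Numerically stable softmax over one vector: subtract the maximum
/// before exponentiating so that exp() cannot overflow.
pub fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Reference functions like these can serve as a CPU baseline when checking GPU results within a tolerance.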

Resources