| Unified conformance test library (#434)
* Extract fusor conformance library
* Include mismatch position in conformance errors
* Remove remaining crate-local fusor tests
* Add missing conformance coverage
* Broaden conformance test coverage
* Use variable-size conformance fuzz shapes
* fix formatting
* fix tests
* larger fuzz ranges
* more tests
* better cpu parity
* fix clippy
* fix formatting
* fix clippy
* refactor: replace clippy #[allow] suppressions with real fixes
Bundle args into structs (FlushBatch, MmaParams, TileALoadCtx, TileBLoadCtx,
AttentionInputs, BertShape, QMatMulFuzz), introduce CompareFut type alias for
the conformance comparator return type, and rewrite needless_range_loop sites
in tests/common/mod.rs to use iter_mut().enumerate().
* chore: ignore .claude scheduler artifacts
* fix(conformance): skip f16 tests on GPU adapters without SHADER_F16
Linux CI's lavapipe adapter doesn't expose wgpu::Features::SHADER_F16, so
the f16 shader fails to validate. Filter the device list per test rather
than removing GPU coverage entirely — Mac Metal still runs the f16 path.
* fix(conformance): bump inverse-trig tolerance to 1e-4 for lavapipe parity
Linux CI's lavapipe adapter diverges from libm by ~6e-5 on asin/acos/atanh/acosh
near their domain edges. 1e-4 covers the observed gap without masking real
regressions; the macOS Metal adapter still passes comfortably.
* fix(conformance): bump inverse-trig tolerance to 1e-3 for lavapipe drift
First CI run showed asin diverging 5.6e-5; second run hit 2.1e-4 on a
different fuzz seed. lavapipe asin precision is limited near the asymptotes
where the derivative blows up. 1e-3 covers observed drift; algorithmic
regressions would diverge by orders of magnitude.
* fix transcribe example
* fix(conformance): stabilize CI edge cases
* fix(ci): avoid brittle benchmark formatter
* fix(conformance): cover Windows WARP tanh drift
* fix tanh
* more ci fixes
* more software gpu backend fixes
* looser bounds for trig
* relative tolerance
* more relative comparisons
* passing on warp | 28 days ago |
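The tolerance commits above move from a fixed absolute bound to relative comparisons and report where a mismatch occurred. A minimal sketch of that idea follows. The names `Mismatch` and `first_mismatch`, and the combined `abs_tol + rel_tol * max(|e|, |a|)` bound, are assumptions for illustration and are not taken from the fusor conformance library.

```rust
/// Hypothetical record of the first element that fell outside the tolerance.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Mismatch {
    pub index: usize,
    pub expected: f32,
    pub actual: f32,
}

/// Returns the first position where `actual` differs from `expected` by more
/// than `abs_tol + rel_tol * max(|expected|, |actual|)`, or `None` if all
/// elements agree. This is an illustrative comparator, not the library's API.
pub fn first_mismatch(
    expected: &[f32],
    actual: &[f32],
    abs_tol: f32,
    rel_tol: f32,
) -> Option<Mismatch> {
    assert_eq!(expected.len(), actual.len(), "length mismatch");
    expected
        .iter()
        .zip(actual)
        .enumerate()
        .find_map(|(index, (&e, &a))| {
            // Exact equality covers matching infinities; NaN is treated as
            // matching NaN so both backends may agree on an undefined result.
            if e == a || (e.is_nan() && a.is_nan()) {
                return None;
            }
            let bound = abs_tol + rel_tol * e.abs().max(a.abs());
            // Non-finite values that are not equal always count as a mismatch.
            let within = e.is_finite() && a.is_finite() && (e - a).abs() <= bound;
            if within {
                None
            } else {
                Some(Mismatch { index, expected: e, actual: a })
            }
        })
}
```

A relative bound scales with the magnitude of the values, so large outputs near an asymptote are not held to the same absolute gap as values near zero, while the absolute term keeps the check meaningful when both values are close to zero.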
| Floneum qwen embed and rbert cpu support (#425)
* qwen embed working
* Add 3×4 outer-loop unrolled quantized matmul for m≥2
Process 3 LHS rows simultaneously so each weight block is loaded
once and reused across all 3 rows, reducing memory traffic by ~3×.
2-4.4× speedup for prompt processing (m≥2); m=1 path unchanged.
* Deduplicate activation quantization in parallel m=1 path
Each thread now quantizes the activation row once instead of
re-quantizing in every 32-column chunk. 14-23% speedup for m=1.
* Add x86 AVX2 SIMD for BlockQ4_0 and BlockQ8_0 vec_dot
Use _mm256_maddubs_epi16 with the sign trick (abs(x) * sign(y,x))
for signed i8×i8 dot products. Runtime AVX2 detection with scalar
fallback for older x86_64 CPUs.
* Add x86 AVX2 SIMD for BlockQ8_0 activation quantization
Use AVX2 for the full quantize pipeline on x86_64: max-abs reduction,
scale+round via _mm256_cvtps_epi32, and saturating pack i32→i16→i8.
Runtime AVX2 detection with scalar fallback. On aarch64, the compiler
auto-vectorizes the scalar code better than explicit NEON intrinsics
(confirmed by benchmarks showing 7-11% regression with explicit NEON).
* Remove unused process_row_integer_range function
* cpu support for rbert
* better parallelization
* remove uninit unchecked
* start refactoring
* reduce unsafe
* fix formatting
* clean up conditional
* more refactoring
* a bit more cleanup
* more formatting + clippy
* fix clippy in fusion bench
* fix flash attention
* fix flash
* fix tests
* fix clippy
* make device week | 2 months ago |
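The sign trick mentioned in the AVX2 `vec_dot` commit above can be checked with a scalar model. `_mm256_maddubs_epi16` multiplies unsigned bytes by signed bytes, so a signed-by-signed product is obtained by feeding it `abs(x)` and `sign(y, x)`. The sketch below emulates one lane of that arithmetic in plain Rust. The function names are hypothetical, and this is not the repository's kernel.

```rust
use std::cmp::Ordering;

/// Per-lane behaviour of `_mm256_sign_epi8(y, x)`: negate `y` when `x` is
/// negative, zero it when `x` is zero, and pass it through otherwise.
fn sign_i8(y: i8, x: i8) -> i8 {
    match x.cmp(&0) {
        Ordering::Less => y.wrapping_neg(),
        Ordering::Equal => 0,
        Ordering::Greater => y,
    }
}

/// Dot product computed as `abs(x) * sign(y, x)`, i.e. unsigned times signed,
/// which is the operand form `_mm256_maddubs_epi16` expects.
fn dot_sign_trick(xs: &[i8], ys: &[i8]) -> i32 {
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| i32::from(x.unsigned_abs()) * i32::from(sign_i8(y, x)))
        .sum()
}

/// Straightforward signed dot product used as the reference.
fn dot_reference(xs: &[i8], ys: &[i8]) -> i32 {
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| i32::from(x) * i32::from(y))
        .sum()
}
```

The two functions agree for every pair of values in -127..=127. They differ when both operands are -128, because negating -128 wraps back to -128 in 8 bits. Quantized values produced by scaling with `max_abs / 127` stay inside the safe range, which is the assumption this sketch relies on.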