File | Last commit message | Last updated
[AMD] Move address sanitizer tests into amd backend (#5524) 1 year ago
[AMD][LLVM] Scalarize packed fops in the same mfma/wmma block (#6656) This PR adds an _LLVM Pass_ that scalarizes vector fmuls and fadds in basic blocks that contain MFMAs/WMMAs. The purpose is that these instructions otherwise get codegened to "packed" ops (v_pk_mul_f32/v_pk_add_f32), which cannot be co-issued with mfma and therefore carry a performance cost. Concretely, this eliminates v_pk_mul_f32/v_pk_add_f32 operations in the final asm in basic blocks with MFMAs. Note that these "scalar" floating point ops will still get lowered to vector instructions like v_mul_f32_e32 and v_add_f32_e32, just not the "packed" variants. Note also that these packed fops aren't actually emitted by triton per se; they are inserted by the VectorCombine::foldPermuteOfBinops pattern during the optimize_module pipeline (which is why this LLVM pass needs to follow that pipeline). 1 year ago
[Tests] Use tmp_path pytest fixture instead of tempfile.NamedTemporaryFile (#7735) tmp_path is the main style in tests. In addition, it does not cause problems when running these tests on Windows. --------- Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com> 9 months ago
[AMD] Turn buffer ops support on by default (#5960) This commit turns buffer ops on by default. Under the hood, this means we emit buffer load/store intrinsics instead of global load/store ones when possible. This aims to improve performance through better out-of-bounds access handling, better register usage, etc. Buffer ops have limitations, though: to get the full benefit, some kernels need tl.assume annotations to guide the analysis so that it kicks in. 1 year ago
Reland "byte permutes in intra-warp layout conversion" (#7933) Reland https://github.com/triton-lang/triton/pull/7809, https://github.com/triton-lang/triton/pull/7825, https://github.com/triton-lang/triton/pull/7861. Add a workaround for a ptxas bug and add a regression test. 9 months ago
[Tests] Use tmp_path pytest fixture instead of tempfile.NamedTemporaryFile (#7735) tmp_path is the main style in tests. In addition, it does not cause problems when running these tests on Windows. --------- Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com> 9 months ago
[AMD][LLVM] Scalarize packed fops in the same mfma/wmma block (#6656) This PR adds an _LLVM Pass_ that scalarizes vector fmuls and fadds in basic blocks that contain MFMAs/WMMAs. The purpose is that these instructions otherwise get codegened to "packed" ops (v_pk_mul_f32/v_pk_add_f32), which cannot be co-issued with mfma and therefore carry a performance cost. Concretely, this eliminates v_pk_mul_f32/v_pk_add_f32 operations in the final asm in basic blocks with MFMAs. Note that these "scalar" floating point ops will still get lowered to vector instructions like v_mul_f32_e32 and v_add_f32_e32, just not the "packed" variants. Note also that these packed fops aren't actually emitted by triton per se; they are inserted by the VectorCombine::foldPermuteOfBinops pattern during the optimize_module pipeline (which is why this LLVM pass needs to follow that pipeline). 1 year ago