[AMD][LLVM] Scalarize packed fops in the same mfma/wmma block (#6656)
This PR adds an _LLVM Pass_ that scalarizes vector fmuls and fadds
in basic blocks that contain MFMAs/WMMAs.
The motivation is that these instructions are otherwise codegened to
"packed" ops (v_pk_mul_f32/v_pk_add_f32), which cannot be co-issued
with MFMAs and therefore cost performance.
Concretely, this eliminates v_pk_mul_f32/v_pk_add_f32 operations from
the final asm in basic blocks that contain MFMAs.
Note that these "scalar" floating-point ops are still lowered to vector
instructions such as v_mul_f32_e32 and v_add_f32_e32, just not to the
"packed" variants.
Note also that these packed fops aren't emitted by Triton itself; they
are introduced by the VectorCombine::foldPermuteOfBinops pattern during
the optimize_module pipeline, which is why this LLVM pass needs to run
after that pipeline.
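
The rule described above can be illustrated with a toy model in plain Python. This is only a sketch of the decision logic, not the actual LLVM pass: `Inst`, `MATRIX_OPS`, `PACKABLE_OPS`, and `scalarize_block` are hypothetical names invented for illustration, and the real pass operates on LLVM IR (extracting and re-inserting vector elements) rather than on a list of records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    """Hypothetical stand-in for an IR instruction."""
    op: str          # e.g. "fmul", "fadd", "mfma", "wmma", "load"
    width: int = 1   # number of vector lanes; 1 means scalar


# Assumed op names for this toy model only.
MATRIX_OPS = {"mfma", "wmma"}
PACKABLE_OPS = {"fmul", "fadd"}


def scalarize_block(block):
    """Split vector fmul/fadd into scalar ops, but only when the
    basic block also contains a matrix (MFMA/WMMA) instruction."""
    if not any(inst.op in MATRIX_OPS for inst in block):
        # No matrix op to co-issue with, so packed ops are left alone.
        return list(block)

    result = []
    for inst in block:
        if inst.op in PACKABLE_OPS and inst.width > 1:
            # One scalar op per lane replaces the vector op.
            result.extend(Inst(inst.op, 1) for _ in range(inst.width))
        else:
            result.append(inst)
    return result


with_mfma = [Inst("fmul", 2), Inst("mfma"), Inst("fadd", 2)]
without_mfma = [Inst("fmul", 2), Inst("fadd", 2)]

scalarized = scalarize_block(with_mfma)
untouched = scalarize_block(without_mfma)
```

In this model the block containing an `mfma` has each two-lane fmul/fadd replaced by two scalar ops, while the block without a matrix op is returned unchanged, mirroring the "same mfma/wmma block" restriction in the title.
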
[AMD] Turn buffer ops support on by default (#5960)
This commit turns buffer ops on by default.
Under the hood, this means we emit buffer load/store intrinsics
instead of global load/store ones when possible. The aim is to
improve performance through better out-of-bounds access handling,
better register usage, etc.
Buffer ops have limitations, though. To get the full benefit,
some kernels need tl.assume annotations to guide the analysis
so that it can kick in.
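
As a rough illustration of what such annotations can look like, here is a minimal kernel sketch. The commit message does not spell out which conditions the analysis checks, so the specific assumptions below (non-negative program id, positive stride) are illustrative guesses; the kernel name `strided_copy_kernel` and its parameters are hypothetical.

```python
import triton
import triton.language as tl


@triton.jit
def strided_copy_kernel(x_ptr, y_ptr, n_elements, stride, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)

    # Hints for the compiler's analysis (assumed to be the kind of facts
    # it needs): the values feeding the offset computation are not negative.
    tl.assume(pid >= 0)
    tl.assume(stride > 0)

    idx = pid * BLOCK + tl.arange(0, BLOCK)
    mask = idx < n_elements
    offs = idx * stride  # offsets relative to a scalar base pointer

    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x, mask=mask)
```

The kernel would be launched as usual, for example `strided_copy_kernel[(triton.cdiv(n, 1024),)](x, y, n, x.stride(0), BLOCK=1024)` on tensors resident on the GPU. Whether buffer intrinsics are actually emitted for a given kernel depends on the backend's analysis.
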