File | Last commit | Last updated
[AMD] Introduce amdgpu.buffer_load and amdgpu.buffer_store (#4903)
This PR introduces support for two new AMDGPU-specific operations:
- amdgpu.buffer_load: loads from global memory via a pointer and a tensor offset
- amdgpu.buffer_store: stores a value in global memory via a pointer and a tensor offset
I am also adding conversion patterns in LoadStoreOpToLLVM.cpp. These are similar to the ones for tt.load and tt.store, but different enough to deserve a specific rewrite. I tried to hoist common functionality shared by the 4 patterns to reduce duplication.
Last updated: 1 year ago
[AMD] Add fast_expf to libdevice (#4937)
This PR adds a fast_expf operator under libdevice for AMD hardware. In line with the other operators in the exp family, the handling of denorm inputs is controlled by __HIP_FTZ, which is currently fixed to True.
- If __HIP_FTZ = 1, the operator uses llvm.amdgcn.exp2.f32, which flushes denorms in inputs and outputs;
- If __HIP_FTZ = 0, the operator uses llvm.exp2.f32, which does not flush denorms.
Ref: https://github.com/ROCm/llvm-project/blob/amd-staging/amd/device-libs/cuda2gcn/src/precision.cl
Fixes https://github.com/ROCm/triton-internal/issues/314
Last updated: 1 year ago
[AMD] Enable shared->MFMA dot operand conversion through LinearLayout (#4983)
This PR:
- Introduces a fallback from the normal TTG->LLVM converter in case it does not support a given local_load.
- Enables conversion of the MFMA dot layout to Linear Layout in the local_load pattern.
Last updated: 1 year ago
[AMD] Refactor unsupported conversion decomposition pass (#4262)
- Use common code, simplify pass code
- Support 3D tensors in mfma -> dot conversion (supported in the common code from the item above)
- More tests for the decompose-unsupported-amd-conversions pass
Last updated: 1 year ago
[Backend] Bypass conversion for suitable blocked to dotOperand layout (#4538)
This PR extends the shared memory bypass for blocked->dotOperand conversions and adds a bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication. This commit is a preparation step towards improving CodeGen and efficiency of skinny dot cases.
Last updated: 1 year ago
[AMD] Fix an issue with deduplication with mfma layout (#3870)
This PR fixes the following two issues:
1. getContigPerThread() for sliceLayout should erase the dim after getting the vector from its parent
2. maybeDeduplicate() should use getContigPerThread() instead of getSizePerThread()
Note that sizePerThread and contigPerThread are the same for blocked layout and for sliceLayout with blocked layout as parent, but they differ for mfma layout. Using sizePerThread in maybeDeduplicate causes an issue when doing a masked store with mfma layout: because the constancy is not adjusted properly, mask values that are not supposed to be duplicated are duplicated, which leads to out-of-bounds memory access.
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Last updated: 1 year ago
[AMD] Fix tt.fp_to_fp bf16 to fp32 conversion (#4238)
Added the missing handling of tt.fp_to_fp in FpToFpOpConversion::createDestOps when the src operand type is bf16 and the dst operand type is fp32. With such a tt.fp_to_fp, the following error message was produced:
error: failed to legalize operation 'tt.fp_to_fp' that was explicitly marked illegal
BF16 to FP16 (FP32) was handled neither in FpToFpOpConversion::createDestOps nor in getConversionFunc, so the conversion is added directly in FpToFpOpConversion::createDestOps, in the same way as the case where the source and destination operands have the same type.
Last updated: 1 year ago
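For readers unfamiliar with why a bf16 to fp32 conversion is cheap, here is a minimal sketch of the bit-level relationship between the two formats: bf16 is the upper 16 bits of an IEEE fp32 value, so widening places those bits in the high half of a 32-bit word. This is an illustration only, not the lowering used in FpToFpOpConversion::createDestOps; the function names are hypothetical.

```python
import struct


def bf16_bits_to_fp32(bits: int) -> float:
    """Widen a raw 16-bit bfloat16 pattern to a float32 value.

    Hypothetical helper for illustration: bf16 shares the sign and
    exponent layout of fp32, so widening is a 16-bit left shift.
    """
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("bits must fit in 16 bits")
    # Place the bf16 pattern in the high half of a 32-bit word and
    # reinterpret that word as an IEEE-754 single-precision float.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def fp32_to_bf16_bits_truncate(value: float) -> int:
    """Narrow a float32 to bf16 by dropping the low 16 mantissa bits.

    This truncates instead of rounding to nearest even, which keeps
    the sketch short; real conversions usually round.
    """
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


print(bf16_bits_to_fp32(0x3F80))                  # 1.0
print(hex(fp32_to_bf16_bits_truncate(3.140625)))  # 0x4049
```

Widening is exact because every bf16 value is representable in fp32; only the narrowing direction loses information.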
[AMD] Enable masked load and pointer canonicalization pass (#4638)
This PR does two things:
- We are using the new llvm.masked.{load,store} intrinsics. This means that the backend takes responsibility for lowering the loads and stores.
- We are enabling the pointer canonicalization pass on the Triton IR. I ran extensive tests and corrected a couple of minor issues still present in the implementation.
I am enabling both at the same time because I saw a minor regression with llvm.masked.{load,store}, which seems to go away when pointer canonicalization is used. This combination also seems to reduce the number of VGPRs used (at least for GEMM kernels).
Last updated: 1 year ago
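The semantics of a masked load and store can be sketched in plain Python, assuming a flat list stands in for memory. Masked-off lanes never touch memory, and a masked load fills them with a pass-through value instead. The function names and the list-based memory model are assumptions made for illustration; they are not the Triton or LLVM API.

```python
from typing import List, Sequence


def masked_load(memory: Sequence[int], offsets: Sequence[int],
                mask: Sequence[bool], passthru: int = 0) -> List[int]:
    """Load memory[offset] for active lanes; inactive lanes get passthru.

    Inactive lanes are never dereferenced, so an out-of-bounds offset
    is harmless as long as its mask bit is False.
    """
    return [memory[o] if m else passthru for o, m in zip(offsets, mask)]


def masked_store(memory: List[int], offsets: Sequence[int],
                 values: Sequence[int], mask: Sequence[bool]) -> None:
    """Write values[i] to memory[offsets[i]] for active lanes only."""
    for o, v, m in zip(offsets, values, mask):
        if m:
            memory[o] = v


memory = [10, 20, 30]
offsets = [0, 1, 2, 3]                       # lane 3 is out of bounds
mask = [o < len(memory) for o in offsets]    # so lane 3 is masked off

print(masked_load(memory, offsets, mask, passthru=-1))  # [10, 20, 30, -1]
masked_store(memory, offsets, [1, 2, 3, 4], mask)
print(memory)                                           # [1, 2, 3]
```

This also shows why a wrongly duplicated mask value is dangerous: a lane that should be inactive would then read or write out of bounds.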
[AMD] Handle denorms properly for exp2 and exp (#3816)
This PR enables denorm flushing for tl.math.exp2 and preserves denorms for tl.math.exp, which matches their behavior on the Nvidia backend. More specifically:
- Denorm flushing for tl.math.exp2 with f32 inputs is controlled by __CUDA_FTZ or __HIP_FTZ, and the default is to flush denorms. These flags can be set by developers, but are not exposed as kernel arguments.

tl.math.exp2(f32) | NV | NV | AMD | AMD
--- | --- | --- | --- | ---
control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1 (default) | __HIP_FTZ=0
device lib | __nv_exp2f | __nv_exp2f | |
llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f | llvm.amdgcn.exp2.f32 | llvm.exp2.f32
ptx | ex2.approx.ftz.f32 | ex2.approx.f32 | |
sass/amdgcn | MUFU.EX2 | MUFU.EX2 and instructions to check and adjust for denorms | v_exp_f32 | v_exp_f32 and instructions to check and adjust for denorms

- Denorms are preserved for tl.math.exp2 with f64 inputs.

tl.math.exp2(f64) | NV | AMD
--- | --- | ---
device lib | __nv_exp2 | __ocml_exp2_f64

- Denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on the NV path is lowered with inline ptx directly, without the .ftz flag.

tl.math.exp(f32) | NV | AMD
--- | --- | ---
llvm intrinsics | | llvm.exp2.f32
ptx | ex2.approx.f32 |

tl.math.exp(f64) | NV | AMD
--- | --- | ---
device lib | __nv_exp | __ocml_exp_f64

Last updated: 1 year ago
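To make the difference between the flushing and non-flushing paths concrete, here is a small host-side model of f32 flush-to-zero applied to exp2. It only models the semantics (denormal inputs and outputs become signed zero when flushing is on); it does not reproduce the accuracy or rounding of llvm.amdgcn.exp2.f32 or the other intrinsics listed above, and the helper names are hypothetical.

```python
import math
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_denorm(x: float) -> float:
    """Replace a float32 denormal with a zero of the same sign."""
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def exp2_f32(x: float, ftz: bool) -> float:
    """Model of a float32 exp2 with optional flush-to-zero (assumed semantics)."""
    x = to_f32(x)
    if ftz:
        x = flush_denorm(x)       # flush denormal inputs
    if x >= 128.0:
        return math.inf           # 2**128 overflows float32
    y = to_f32(2.0 ** x)          # result rounded to float32
    if ftz:
        y = flush_denorm(y)       # flush denormal outputs
    return y


# 2**-130 is below the smallest normal float32, so it is a denormal.
print(exp2_f32(-130.0, ftz=False) == 2.0 ** -130)  # True (denormal preserved)
print(exp2_f32(-130.0, ftz=True))                  # 0.0 (flushed)
```

For normal-range results the two modes agree, so the flag only matters for results with magnitude below 2**-126.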
[AMD] Move MFMA shortcut check to not compute scratch buffer shape if it is not needed (#3790)
This PR:
- moves the shortcut check earlier, so that the scratch buffer shape is not computed when it is not needed
- raises the priority of AMD-specific conversions over common conversions, to remove ambiguity about which pattern is applied
- adds a regression test for the MFMA to Dot Op shortcut
Last updated: 1 year ago
[AMD] Enable shared->MFMA dot operand conversion through LinearLayout (#4983)
This PR:
- Introduces a fallback from the normal TTG->LLVM converter in case it does not support a given local_load.
- Enables conversion of the MFMA dot layout to Linear Layout in the local_load pattern.
Last updated: 1 year ago
[AMD] Support WMMAv2 dot instruction generation (#4491)
- Added intrinsic generation according to the operand types, and cached the intrinsics to avoid repeated calculations
- Fixed version-dependent parameters in the main logic of the WMMA operation generator
- Added a lit test to verify the number of LLVM instructions
Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
Last updated: 1 year ago