File, last commit, last updated
[AMD] Enable shared->MFMA dot operand conversion through LinearLayout (#4983) This PR: - Introduces a fallback from the normal TTG->LLVM converter in case it does not support a given local_load. - Enables conversion of the MFMA dot layout to Linear Layout in the local_load pattern. 1 year ago
Re-enable linear layouts. (#3883) Revert "Disable linear layouts. (#3867)" This reverts commit 987dd045c0af345d26f98dcd451189ef47646030, with a fix for the breakage (which was caused by the fact that the order of argument evaluation in C++ is undefined). 2 years ago
[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes triton_gpu.compute-capability into triton_gpu.target to accommodate backends other than cuda. triton_gpu.target now accepts a string attribute of the form <backend>:<arch>, for example cuda:80 or hip:gfx942. This commit also removes the direct compute capability reference in the OptimizeDotOperandsPass; it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter. 2 years ago
[BE] Accumulator init optimization (#4680) Adds a transformation pass that skips filling the accumulator with zeros if the HW supports an accumulator scale or init flag. In that case a flag value is created and maintained, and passed to the MMA op to indicate whether the accumulator should be taken into account when calculating the dot product. The pass is intended to be generic enough to be reusable across different HW platforms, so it is not placed in the Nvidia-specific folder, even though it supports only Hopper MMA for the moment. 1 year ago
[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes triton_gpu.compute-capability into triton_gpu.target to accommodate backends other than cuda. triton_gpu.target now accepts a string attribute of the form <backend>:<arch>, for example cuda:80 or hip:gfx942. This commit also removes the direct compute capability reference in the OptimizeDotOperandsPass; it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter. 2 years ago
Revert "[BACKEND] Optimize code generation for load with other arg (#4582)" This reverts commit 78af5c9fd5dfa89c3479473f97d3c4a31aef2601. 1 year ago
[Backend] Bypass conversion for suitable blocked to dotOperand layout (#4538) This PR extends the shared memory bypass for blocked->dotOperand conversions and adds a bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication. This commit is a preparation step towards improving CodeGen and the efficiency of skinny dot cases. 1 year ago
[BACKEND][NVIDIA] pass ptx-version to ttgir->llir conversion pass and use it for vectorized atomics (#4982) Vectorized atomics on NVIDIA (https://github.com/triton-lang/triton/pull/4971) are only available on Hopper (>=sm90) and PTX >= 8.1. It's possible to be running with PTX 8.0 on a Hopper machine. This PR passes ptx-version to the ttgir->llir conversion pass for NVIDIA, and uses the PTX version to determine whether vectorized atomics should be used. 1 year ago
[BACKEND][NVIDIA] pass ptx-version to ttgir->llir conversion pass and use it for vectorized atomics (#4982) Vectorized atomics on NVIDIA (https://github.com/triton-lang/triton/pull/4971) are only available on Hopper (>=sm90) and PTX >= 8.1. It's possible to be running with PTX 8.0 on a Hopper machine. This PR passes ptx-version to the ttgir->llir conversion pass for NVIDIA, and uses the PTX version to determine whether vectorized atomics should be used. 1 year ago
[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes triton_gpu.compute-capability into triton_gpu.target to accommodate backends other than cuda. triton_gpu.target now accepts a string attribute of the form <backend>:<arch>, for example cuda:80 or hip:gfx942. This commit also removes the direct compute capability reference in the OptimizeDotOperandsPass; it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter. 2 years ago
[Frontend][Backend] Add device-side tma descriptor update API (#4633) This adds two new triton IR operators: 1. ExperimentalTensormapCreateOp, which creates a descriptor and stores it in global memory 2. ExperimentalTensormapFenceproxyAcquireOp, which produces the fence required to use the updated descriptor. I then use these to expose 3 new functions in tl.extra.cuda: 1. experimental_device_tensormap_create1d 2. experimental_device_tensormap_create2d 3. experimental_tensormap_fenceproxy_acquire, which match up with the existing host-side tensormap creation API. 1 year ago