Re-enable linear layouts. (#3883)
Revert "Disable linear layouts. (#3867)"
This reverts commit 987dd045c0af345d26f98dcd451189ef47646030, with a fix
for the breakage (which was caused by the fact that the order of argument
evaluation in C++ is unspecified).
[backend] Set target in the passes and IR backend agnostically (#3741)
This commit changes the triton_gpu.compute-capability attribute into
triton_gpu.target to accommodate backends other than CUDA.
triton_gpu.target now accepts a string attribute of the form
<backend>:<arch>, for example cuda:80 or hip:gfx942.
This commit also removes the direct compute capability reference in
OptimizeDotOperandsPass; it only needs to know whether to
inject an additional pattern, which can be controlled by a bool
parameter.
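As a rough illustration of the <backend>:<arch> format described above, the
following Python sketch splits such a string into its two parts. The names
Target, parse_target and cuda_compute_capability are hypothetical helpers
invented for this example; they are not part of Triton, and the real passes
may handle the attribute differently.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    backend: str  # e.g. "cuda" or "hip"
    arch: str     # e.g. "80" or "gfx942"


def parse_target(target: str) -> Target:
    """Split a "<backend>:<arch>" string (hypothetical helper, not Triton API)."""
    backend, sep, arch = target.partition(":")
    # Reject strings with no colon or with an empty backend/arch part.
    if not sep or not backend or not arch:
        raise ValueError(f"expected '<backend>:<arch>', got {target!r}")
    return Target(backend, arch)


def cuda_compute_capability(target: str) -> Optional[int]:
    """Return the numeric compute capability for CUDA targets, else None.

    Assumption: for the cuda backend the arch part is a plain integer.
    """
    parsed = parse_target(target)
    return int(parsed.arch) if parsed.backend == "cuda" else None
```

With this sketch, parse_target("hip:gfx942") yields the backend "hip" and the
arch "gfx942", while a string without a colon raises ValueError.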
[BE] Accumulator init optimization (#4680)
Add a transformation pass that skips filling the accumulator with
zeros if the HW supports an accumulator scale or init flag. In that
case a flag value is created and maintained, and passed to the MMA op
to indicate whether the accumulator should be taken into account when
calculating the dot product.
The pass is intended to be generic enough to be reusable across
different HW platforms, so it is not placed in the Nvidia-specific
folder, even though it currently supports only Hopper MMA.
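The idea behind the init flag can be sketched with a toy scalar model in
Python. This is only an illustration under stated assumptions: mma,
dot_zero_init and dot_with_init_flag are hypothetical names, the "MMA op" is
a plain multiply-add, and the sketch assumes the loop runs at least once. It
does not reflect how the actual pass rewrites the IR.

```python
def mma(a, b, acc, use_acc):
    """Toy stand-in for an MMA op: a*b, plus acc only when use_acc is set."""
    return a * b + (acc if use_acc else 0.0)


def dot_zero_init(a_tiles, b_tiles):
    """Baseline: fill the accumulator with zero, then always accumulate."""
    acc = 0.0
    for a, b in zip(a_tiles, b_tiles):
        acc = mma(a, b, acc, use_acc=True)
    return acc


def dot_with_init_flag(a_tiles, b_tiles):
    """No zero fill: a flag tells the op to ignore acc on the first step."""
    acc = float("nan")  # stands in for uninitialized accumulator contents
    use_acc = False     # accumulator is not yet valid
    for a, b in zip(a_tiles, b_tiles):
        acc = mma(a, b, acc, use_acc)
        use_acc = True  # from now on the accumulator holds a partial sum
    return acc
```

Both functions return the same result for non-empty inputs; the second one
never reads the initial accumulator value, which is what makes the zero fill
unnecessary.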
[Backend] Bypass conversion for suitable blocked to dotOperand layout (#4538)
This PR extends the shared memory bypass for blocked->dotOperand
conversions and adds a bypass check in DecomposeUnsupportedConversions
and ReduceDataDuplication.
This commit is a preparation step towards improving CodeGen and
efficiency of skinny dot cases.
[Frontend][Backend] Add device-side tma descriptor update API (#4633)
This adds two new Triton IR operators:
1. ExperimentalTensormapCreateOp which creates a descriptor and stores it in global memory
2. ExperimentalTensormapFenceproxyAcquireOp which produces the required fence to use the updated descriptor
I then use these to expose 3 new functions in tl.extra.cuda:
1. experimental_device_tensormap_create1d
2. experimental_device_tensormap_create2d
3. experimental_tensormap_fenceproxy_acquire
which match up with the existing host-side tensormap creation API.