[backend] Set target in the passes and IR backend agnostically (#3741)
This commit changes the triton_gpu.compute-capability into
triton_gpu.target to accomodate backends other than cuda.
triton_gpu.target now accepts a string attribute in the form of
<backend>:<arch> as values, for example, cuda:80, hip:gfx942.
This commit also removes the direct computat capability reference in the
OptimizeDotOperandsPass--it only needs to know whether we want to
inject an additional pattern, which can be controlled by a bool
parameter.
[Triton] Use UnitAttr in tt.reshape definition (#4947)
Make allow_reorder and efficient_layoutUnitAttr for a cleaner
interface.
This way, the operation exposes a bool getEfficientLayout() member to
check for that attribute and a constructor receiving bool arguments
for both of these attributes (defaulted to false).
The core Triton is a small number of people, and we receive many PRs
(thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ]
with
[x] to indicate you have done them.
- [X] I am not making a trivial change, such as fixing a typo in a
comment.
- [X] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [X] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [X] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
Signed-off-by: victor-eds <victor.perez@codeplay.com>
[BACKEND] Fix memory side effects of tt.dot (#4033)
1. Replaced triton_nvidia_gpu.async_dot with
triton_nvidia_gpu.group_dot which has a isAsync attribute. Maybe
warp_group_dot is a better name?
2. Removed memdesc from tt.dot because tt.dot should be pure,
without any side effects
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the DotLike trait for tt.dot and
triton_nvidia_gpu.group_dot.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the ConvertDotConvert pattern
[BACKEND] Fix memory side effects of tt.dot (#4033)
1. Replaced triton_nvidia_gpu.async_dot with
triton_nvidia_gpu.group_dot which has a isAsync attribute. Maybe
warp_group_dot is a better name?
2. Removed memdesc from tt.dot because tt.dot should be pure,
without any side effects
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the DotLike trait for tt.dot and
triton_nvidia_gpu.group_dot.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the ConvertDotConvert pattern
[AMD] Support WMMAv2 in AccelerateAMDMatmulPass (#4452)
- Specify kWidth parameter according to the version
- For the first iteration fp8 operands are unsupported, no new operand
configuration are added for now
- Added lit tests
Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
Verify the tt.dot operation thru the dot verification interface of the dialect which defines the C layout. (#4630)
The Triton tt.dot operation maybe materialized by MMA encoding
defined by the out-of-tree dialect.
The verifier of the tt.dot operation dispatches the verification to
the dialect of the A and B operands which are mostly DotOp layout
defined in-tree TritonGPU dialect.
The TritonGPU dialect only verifies the tt.dot operation for in-tree
backend.
To use the dialect of the C operand layout verifying the tt.dot
operation for out-of-tree backends. (And also works for NV and AMD's
MMA which are defined in-tree TritonGPU dialect.)
- [X] I am not making a trivial change, such as fixing a typo in a
comment.
- [X] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [X] I have added tests.
- /test for lit tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [X] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
[AMD] Rewrite transpose ops in pipeliner to mutable memory (#4969)
add_optimize_dot_operands may introduce a immutable shared buffer for
transposed dot operands. Our stream-pipeliner then replaces the
immutable buffer with a mutable buffer to be able to reuse it across
iterations (pre-fetching). This will then produce incorrect transOps
because the input is mutable but the result is immutable.
This PR rewrites those transOps to output a mutable layout.
[SWP] When num_stages = 2, do not pipeline indirect loads (#4721)
For indirect loads, we try to assign them to later stages
```
unsigned stagesBetweenLoads =
ceil<unsigned>(numStages - 2, maxIndirectionLevel + 1);
int stage = (maxIndirectionLevel - indLevel) * stagesBetweenLoads;
schedule.insert(loadOp, stage, loadsClusters[indLevel]);
```
If numStages is 2, there is no later stage to assign the indirect loads
to. The fix is to not pipeline the indirect loads.
We also generalize to not pipeline an indirect load if the indirection
level >= numStages - 1
[Triton] Use UnitAttr in tt.reshape definition (#4947)
Make allow_reorder and efficient_layoutUnitAttr for a cleaner
interface.
This way, the operation exposes a bool getEfficientLayout() member to
check for that attribute and a constructor receiving bool arguments
for both of these attributes (defaulted to false).
The core Triton is a small number of people, and we receive many PRs
(thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ]
with
[x] to indicate you have done them.
- [X] I am not making a trivial change, such as fixing a typo in a
comment.
- [X] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [X] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [X] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
Signed-off-by: victor-eds <victor.perez@codeplay.com>
[backend] Set target in the passes and IR backend agnostically (#3741)
This commit changes the triton_gpu.compute-capability into
triton_gpu.target to accomodate backends other than cuda.
triton_gpu.target now accepts a string attribute in the form of
<backend>:<arch> as values, for example, cuda:80, hip:gfx942.
This commit also removes the direct computat capability reference in the
OptimizeDotOperandsPass--it only needs to know whether we want to
inject an additional pattern, which can be controlled by a bool
parameter.
[AMD] Support WMMAv2 in AccelerateAMDMatmulPass (#4452)
- Specify kWidth parameter according to the version
- For the first iteration fp8 operands are unsupported, no new operand
configuration are added for now
- Added lit tests
Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
[Triton] Use UnitAttr in tt.reshape definition (#4947)
Make allow_reorder and efficient_layoutUnitAttr for a cleaner
interface.
This way, the operation exposes a bool getEfficientLayout() member to
check for that attribute and a constructor receiving bool arguments
for both of these attributes (defaulted to false).
The core Triton is a small number of people, and we receive many PRs
(thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ]
with
[x] to indicate you have done them.
- [X] I am not making a trivial change, such as fixing a typo in a
comment.
- [X] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [X] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [X] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
Signed-off-by: victor-eds <victor.perez@codeplay.com>
[IR] Rework textual IR for load, store, dot, and other ops (#3587)
The initial motivation for this change was that tt.dot's inputPrecision
value was printed in textual IR not as a string (e.g. "tf32") but as an
opaque number. The trick to fixing this is to explicitly list the attr
in the assemblyFormat, instead of relying on attr-dict. attr-dict prints
the attr as an integer, but if we list it explicitly, it will be printed
as a string.
But when I fixed this, I realized that many other ops, such as load and
store, also had enums which were printed as numbers rather than strings.
So I had to fix those, too. I also added reasonable defaults for most of
the arguments to make the textual IR more concise.
I then noticed that load and store had complicated asm parsers to handle
the fact usually the pointer type was not specified, but sometimes it
was necessary. This is because given e.g. load : tensor<128xf32> the
pointer may be either tensor<128x!tt.ptr<f32>> (the default) or
!tt.ptr<tensor<128xf32>> (for block pointers).
So I fixed this too. Now load and store specify the *pointer* type
instead of the value type. This lets us use the built-in asm parser and
delete a bunch of code.
I also noticed that we had a bunch of dead ttgir tests (actually it
seems like these were never enabled); rather than attempt to update
them, without any way to test it, I just deleted them.
Thanks to @joker-eph for suggesting the solution here -- I never would
have figured this out on my own.
[BACKEND] Fix memory side effects of tt.dot (#4033)
1. Replaced triton_nvidia_gpu.async_dot with
triton_nvidia_gpu.group_dot which has a isAsync attribute. Maybe
warp_group_dot is a better name?
2. Removed memdesc from tt.dot because tt.dot should be pure,
without any side effects
3. Removed hacks in Membar analysis.
4. Unified wgmma code generation in the backend.
5. Introduced the DotLike trait for tt.dot and
triton_nvidia_gpu.group_dot.
6. Updated comments in matmul loop pipeline (maybe incomplete).
7. Removed the ConvertDotConvert pattern
[AMD] Support mfma layout in the prefetch pass (#4771)
We've investigated tritongpu-prefetch on the amdgpu and it shows
positive result in some cases.
This change allows the prefetch pass to rewrite the loop with the
tt.dot using amd_mfma with a fix in the sharedmem offset.
Current change doesn't insert the pass to the compilation pipeline
yet. The pass is supposed to be placed just after the pipelining pass.
[Backend] Bypass conversion for suitable blocked to dotOperand layout (#4538)
This PR extends shared memory bypass for blocked->dotOperand
conversions and adds bypass check in DecomposeUnsupportedConversions
and ReduceDataDuplication.
This commit is a preparation step towards improving CodeGen and
efficiency of skinny dot cases.
[backend] Set target in the passes and IR backend agnostically (#3741)
This commit changes the triton_gpu.compute-capability into
triton_gpu.target to accomodate backends other than cuda.
triton_gpu.target now accepts a string attribute in the form of
<backend>:<arch> as values, for example, cuda:80, hip:gfx942.
This commit also removes the direct computat capability reference in the
OptimizeDotOperandsPass--it only needs to know whether we want to
inject an additional pattern, which can be controlled by a bool
parameter.
[IR] Rework textual IR for load, store, dot, and other ops (#3587)
The initial motivation for this change was that tt.dot's inputPrecision
value was printed in textual IR not as a string (e.g. "tf32") but as an
opaque number. The trick to fixing this is to explicitly list the attr
in the assemblyFormat, instead of relying on attr-dict. attr-dict prints
the attr as an integer, but if we list it explicitly, it will be printed
as a string.
But when I fixed this, I realized that many other ops, such as load and
store, also had enums which were printed as numbers rather than strings.
So I had to fix those, too. I also added reasonable defaults for most of
the arguments to make the textual IR more concise.
I then noticed that load and store had complicated asm parsers to handle
the fact usually the pointer type was not specified, but sometimes it
was necessary. This is because given e.g. load : tensor<128xf32> the
pointer may be either tensor<128x!tt.ptr<f32>> (the default) or
!tt.ptr<tensor<128xf32>> (for block pointers).
So I fixed this too. Now load and store specify the *pointer* type
instead of the value type. This lets us use the built-in asm parser and
delete a bunch of code.
I also noticed that we had a bunch of dead ttgir tests (actually it
seems like these were never enabled); rather than attempt to update
them, without any way to test it, I just deleted them.
Thanks to @joker-eph for suggesting the solution here -- I never would
have figured this out on my own.