[AMD] Enable dot_scaled on gfx11 (#7954)
I saw some occasional test failures, but after disabling True16, tests
seem to pass fine.
---------
Co-authored-by: Paul Trojahn <paul.trojahn@amd.com>
[AMD] Enable Pingpong by default on gfx950 arch (#7697)
Enabling conditions:
- FP16/BF16 GEMM with tile size M,N > 64 when num_stages=3 and
num_warps=8
- GEMM using dot_scaled with tile size M=N=256 when num_stages=2 and
num_warps=8
- FA with num_stages=4
Only when using async_copy.
[AMD] Use single LDS for both transposed and non-transposed access (#7813)
This commit introduces a pass that detects a pair of tt.dot ops that
both use the same tt.load result, one directly and one via tt.trans, and
gives both uses the same shared memory allocation. This allows the
pipeliner to pick a single LDS layout and enables pipelining of the
loads.
[AMD] Fix pointer canonicalizer when propagating discardable attrs (#7242)
Propagating divisibility and other discardable attrs needs to make sure
that the ranks of the source and destination instructions match. When
the ranks don't match, it isn't possible to trivially propagate those
attributes.
Allow propagation of attributes between values of the same rank, and
also between rank 1 and scalar.
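
The rank rule above can be sketched as a small predicate. This is only
an illustration: `can_propagate_attrs` is a hypothetical helper, not a
function from the canonicalizer, and a scalar is modelled as rank 0.

```python
def can_propagate_attrs(src_rank: int, dst_rank: int) -> bool:
    """Return True when discardable attrs may be copied from src to dst.

    Hypothetical helper for illustration; rank 0 stands for a scalar.
    """
    if src_rank == dst_rank:
        return True
    # The only mixed-rank case allowed is rank-1 tensor <-> scalar.
    return {src_rank, dst_rank} == {0, 1}
```

For example, a rank-2 source and a rank-1 destination are rejected,
while a rank-1 source and a scalar destination are accepted.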
[AMD] Do not pipeline via AsyncCopy for unsupported vec sizes (#7676)
This PR makes canBeConvertedToAsyncLoad both more general and stricter:
it only allows cases where there is a supported direct-to-lds vecSize
smaller than or equal to the vecSize based on contiguity.
This catches cases where we load less than 32 bits, which were already
rejected before this PR. Additionally, it catches cases where we cannot
lower the vecSize to a supported size. In such cases we also cannot use
ttg.async_copy_global_to_local, since we cannot split contiguous
elements owned by a thread into multiple load instructions. For example,
fp64 with vecSize==1 does not work with ttg.async_copy_global_to_local
on GFX9.
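
A minimal sketch of the check described above, under stated assumptions:
the function name and the `supported_load_bits` parameter are
hypothetical, and the set of load widths a target supports must come
from the target description (the `(32,)` used in the usage note is only
an example value, not a statement about any specific GPU).

```python
def pick_direct_to_lds_vec_size(elem_bits, contig_vec_size, supported_load_bits):
    """Return the largest usable vector size in elements, or None.

    elem_bits: bit width of one element (e.g. 16 for fp16, 64 for fp64).
    contig_vec_size: vector size (in elements) allowed by contiguity.
    supported_load_bits: direct-to-LDS load widths in bits (assumed input).
    """
    candidates = []
    for bits in supported_load_bits:
        if bits % elem_bits != 0:
            # The load is narrower than (or misaligned with) one element,
            # so it would have to split an element across instructions.
            continue
        vec = bits // elem_bits
        if vec <= contig_vec_size:
            candidates.append(vec)
    return max(candidates) if candidates else None
```

With `supported_load_bits=(32,)`, an fp16 load with contiguity 1 is
rejected (it would load less than 32 bits), an fp64 load with
contiguity 1 is rejected (a 32-bit load would split the element), and
an fp16 load with contiguity 8 is lowered to a vector size of 2.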
[AMD] Support 4x64 and 64x4 MFMA layout for dot (#7576)
Support 4x64 and 64x4 MFMA layout for dot, where
`D(4x64) = A(4x64) * B(64x64) + C(4x64)` or
`D(64x4) = A(64x64) * B(64x4) + C(64x4)`. This allows GEMMs with a
small M or N size to run with better performance. Under the hood it uses
a group of 16 4x4 mfma instructions.
Other related changes:
- Refactor mfma input and output linear layout implementation.
- Remove unsupported mfma instructions: 1) do not support the 4x4 tile,
as it requires a manual reduction at the end, which performs poorly; 2)
do not support the 4x16 and 16x4 tiles for fp64, since mfma fp64 does
not support broadcast for operand A.
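
The shape arithmetic of the 4x64 case can be checked with a small
pure-Python sketch. It only shows that a 4x64 result can be assembled
from 16 independent 4x4 output tiles that all share operand A; it says
nothing about how the mfma instructions map onto tiles, K steps, lanes,
or registers, which is not described here. All helper names are
hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product, used as the reference."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def dot_4x64_by_tiles(a, b, c):
    """Compute D(4x64) = A(4x64) * B(64x64) + C(4x64) one 4x4 tile at a time."""
    d = [[0] * 64 for _ in range(4)]
    for tile in range(16):
        cols = range(4 * tile, 4 * tile + 4)
        # 64x4 column slice of B that feeds this output tile.
        b_tile = [[b[k][j] for j in cols] for k in range(64)]
        d_tile = matmul(a, b_tile)  # 4x4 partial result, A is reused as-is
        for i in range(4):
            for jj, j in enumerate(cols):
                d[i][j] = d_tile[i][jj] + c[i][j]
    return d
```

The 64x4 case is the transpose of the same picture, with operand B
shared across the tiles instead of A.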
[AMD] Count vmcnt instructions for AsyncWait (#6426)
Adds UpdateAsyncWaitCountPass to adjust the wait counts of AsyncWait
ops so that they reflect the number of interleaved direct-to-lds
assembly instructions. The LLVM backend cannot infer the dependency
between the AsyncCopies and the local_reads, so we emit it from Triton,
where the dependency information is available by tracing the
AsyncToken.
The pass ignores global/buffer loads because the actual number of
assembly instructions is determined by the LLVM backend. Note that an
underestimation only affects performance, not correctness.
findMinPathCountInDefChain is in a separate file because we might reuse
it for combining AsyncWaits in the StreamPipeliner.
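
The counting idea can be sketched for a straight-line op sequence. This
is a simplified model, not the pass itself: the classes and
`wait_counts` are hypothetical, each copy is assumed to know how many
direct-to-lds instructions it lowers to, and control flow (where the
pass takes the minimum over paths in the def chain) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class AsyncCopy:
    token: str
    num_instructions: int  # direct-to-lds instructions emitted (assumed known)


@dataclass
class AsyncWait:
    token: str


def wait_counts(ops):
    """For each AsyncWait, count the direct-to-lds instructions issued
    after the awaited copy; that many may stay in flight at the wait."""
    counts = []
    for i, op in enumerate(ops):
        if not isinstance(op, AsyncWait):
            continue
        issued_after = None  # None until the awaited copy is seen
        for prev in ops[:i]:
            if not isinstance(prev, AsyncCopy):
                continue
            if prev.token == op.token:
                issued_after = 0
            elif issued_after is not None:
                issued_after += prev.num_instructions
        counts.append(issued_after)
    return counts
```

With three copies of two instructions each (tokens `t0`, `t1`, `t2`),
a wait on `t0` may leave four instructions outstanding and a following
wait on `t1` may leave two. A smaller count than necessary makes the
wait more conservative, which matches the note that underestimation
costs performance but not correctness.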
[AMD][gfx12] WMMA AMD16x16x32 support for i4 operands (#7012)
Previous WMMA support was hardcoded for 16x16x16 tiles, which matched
older AMD GPU capabilities. Starting with gfx1200, AMD supports 16x32
input for matrix A and 32x16 for matrix B (for i4 types).
To support this, we introduce a mapping from the dot operation's
configuration (i.e., shape and element type information) to the
corresponding WMMA instruction. This abstraction lets the backend
determine the key instruction parameters, kDim and kWidth, dynamically.
These two parameters are what is needed to support varying K dimensions
in WMMA instructions.
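
A minimal sketch of such a mapping, assuming a simple table keyed by
operand element type. The names `WmmaParams`, `WMMA_TABLE`, and
`select_wmma` are hypothetical; the kDim values follow the tile shapes
mentioned above (16 for the older 16x16x16 tiles, 32 for i4), while the
kWidth values are illustrative placeholders rather than the backend's
actual numbers.

```python
from typing import NamedTuple, Optional


class WmmaParams(NamedTuple):
    k_dim: int    # K extent covered by one WMMA instruction
    k_width: int  # placeholder value, for illustration only


# Hypothetical table: operand element type -> instruction parameters.
WMMA_TABLE = {
    "f16": WmmaParams(k_dim=16, k_width=8),
    "i4": WmmaParams(k_dim=32, k_width=16),
}


def select_wmma(elem_type: str, k: int) -> Optional[WmmaParams]:
    """Return parameters for a dot with the given operand type and K size,
    or None when no listed instruction fits."""
    params = WMMA_TABLE.get(elem_type)
    if params is None or k % params.k_dim != 0:
        return None
    return params
```

Keeping the parameters in a table means a new K dimension becomes a new
entry rather than a change to hardcoded 16x16x16 logic.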