[WS] reorder partition-loops and lower-aref (#7927)
* Reorder partition-loops and lower-aref passes
* Split lower-aref to lower-aref and assign-stage-phase passes
* The split separates concerns: assign-stage-phase is the more complex
pass, so testing and debugging can focus on the correctness of stage/phase
assignment without the added complexity of aref->mbarrier lowering.
* assign-stage-phase uses the enterOp token to assign to each exitOp the
same stage variable that its enterOp uses, instead of keeping a separate
stage for enterOps and exitOps as before.
* lower-aref testing verifies correctness of aref->mbarrier
lowerings
* In load-mma-specialization, don't place the final waitOp inside the
ws-region; revert to the original behavior before
https://github.com/triton-lang/triton/pull/7757, as that change causes a
perf regression with this PR. ws.tag is kept to differentiate partitions
in different loops, as aref-tmem-insertion (WIP) will rely on it.
This is a prep PR needed for the aref-tmem-insertion WIP (which will be
submitted after this one).
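
To illustrate the stage/phase bookkeeping described above, here is a minimal Python model, not the pass itself. It assumes a ring of `num_stages` buffers where the stage is the iteration count modulo `num_stages` and the phase is a parity bit that flips each time the ring wraps; that convention is typical of mbarrier-style pipelines but is an assumption here. `Token`, `StageCounter`, `exit_stage`, and `simulate` are hypothetical names. The point is that the exit side reads the stage from the token produced by the matching enter, rather than keeping its own counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the token produced by an enterOp."""
    stage: int
    phase: int


class StageCounter:
    """Assumed model: a single counter shared by enter and exit."""

    def __init__(self, num_stages: int) -> None:
        if num_stages <= 0:
            raise ValueError("num_stages must be positive")
        self.num_stages = num_stages
        self.count = 0

    def enter(self) -> Token:
        # Stage selects the buffer slot; phase flips on every wrap-around.
        stage = self.count % self.num_stages
        phase = (self.count // self.num_stages) & 1
        self.count += 1
        return Token(stage, phase)


def exit_stage(token: Token) -> int:
    # The exit reuses the stage carried by the enter's token
    # instead of tracking a separate stage variable.
    return token.stage


def simulate(num_stages: int, iterations: int) -> list:
    """Return (enter_stage, phase, exit_stage) for each iteration."""
    counter = StageCounter(num_stages)
    trace = []
    for _ in range(iterations):
        tok = counter.enter()
        trace.append((tok.stage, tok.phase, exit_stage(tok)))
    return trace
```

In this model `simulate(2, 5)` walks the two slots in order and flips the phase after every second iteration; the enter and exit stages agree by construction because both come from the same token.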
# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
  - [x] I have added tests.
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - [ ] I have not added any lit tests.
  - [x] The lit tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
[LAYOUTS] Implement generalized swizzling for convert_layout (#7565)
We generalize the swizzling algorithm to consider the instructions
ldmatrix/stmatrix and their transpose versions.
To do this, we now require a dedicated allocator for NVIDIA, as the
shmem required for a convert_layout now depends on the instructions we
can emit.
After cleaning up the stmatrix path from the common convert_layout
lowering, it became clear that we always take the swizzling path. I
changed the allocator to reflect this, and had to update many tests that
used it and no longer require padding.
We also implement an improved lowering for the indexing of
ldmatrix/stmatrix, following the optimisations from ld.shared/st.shared.
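
For readers unfamiliar with why swizzling removes the need for padding, the sketch below shows the classic XOR swizzle on a 32x32 tile of 4-byte elements. This is not the generalized algorithm implemented in this PR, which also accounts for ldmatrix/stmatrix; it only illustrates the underlying idea. The tile shape and the helper names `bank_of` and `column_conflict` are assumptions made for the example; the 32 four-byte banks match NVIDIA shared memory.

```python
from collections import Counter

NUM_BANKS = 32   # shared memory banks, each 4 bytes wide
ROW_WIDTH = 32   # assumed tile width in 4-byte elements


def bank_of(row: int, col: int, swizzle: bool) -> int:
    """Bank hit by element (row, col) of a row-major tile."""
    # XOR swizzle: permute the column within each row by the row index.
    phys_col = col ^ (row % ROW_WIDTH) if swizzle else col
    return (row * ROW_WIDTH + phys_col) % NUM_BANKS


def column_conflict(col: int, swizzle: bool, rows: int = 32) -> int:
    """Worst-case number of threads hitting one bank when 32 threads
    each read a different row of the same column."""
    banks = Counter(bank_of(r, col, swizzle) for r in range(rows))
    return max(banks.values())
```

Without swizzling, every row of a given column maps to the same bank, so a column read is a 32-way conflict. With the XOR permutation each row lands in a distinct bank, and no padding columns are needed.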
[backend][nvidia] move NVGPU dialect to third_party/nvidia/ (#3773)
This commit moves the NVGPU dialect to the third_party/nvidia
directory. To make this possible, we need to remove some unnecessary
dependencies on the NVGPU dialect in the AMD backend and move
some utilities into the third_party/nvidia directory as well. Overall,
this gives us a better structure and a stricter boundary between
core and backends.