[WS] reorder partition-loops and lower-aref (#7927)
* Reorder partition-loops and lower-aref passes
* Split lower-aref to lower-aref and assign-stage-phase passes
* The split separates concerns: assign-stage-phase is the more complex
pass, so testing and debugging can focus on the correctness of stage/phase
assignment without the added complexity of aref->mbarrier lowering.
* assign-stage-phase uses the enterOp token to give the exitOp the same
stage variable that the enterOp uses, instead of keeping separate stage
variables for enterOps and exitOps as before.
* lower-aref tests verify the correctness of the aref->mbarrier
lowerings.
* In load-mma-specialization, don't place the final waitOp inside the
ws-region. This reverts to the behavior before
https://github.com/triton-lang/triton/pull/7757, as that change causes
a perf regression with this PR. The ws.tag is kept to differentiate
partitions in different loops, as aref-tmem-insertion (WIP) will rely
on it.
This is a prep PR needed for the aref-tmem-insertion WIP (which will be
submitted after this one).
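
The token-based stage sharing described for assign-stage-phase can be
pictured with a toy model. This is only a sketch in plain Python, not the
MLIR pass itself: `Token`, `StageCounter`, `exit_stage` and `simulate` are
hypothetical names, and the wrap-around rule (stage cycles through the
buffer depth, phase flips on each wrap) is an assumed convention that the
description above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the token produced by an enter op."""
    stage: int
    phase: int


class StageCounter:
    """Single stage/phase counter, advanced only on enter (assumed rule)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.stage = 0
        self.phase = 0

    def enter(self) -> Token:
        # Capture the current stage/phase in the token, then advance.
        tok = Token(self.stage, self.phase)
        self.stage += 1
        if self.stage == self.depth:
            # Wrapped around all buffers: reset stage, flip phase.
            self.stage = 0
            self.phase ^= 1
        return tok


def exit_stage(tok: Token) -> int:
    # The exit reuses the stage recorded in the enter's token instead of
    # maintaining a second, independent counter.
    return tok.stage


def simulate(depth: int, iters: int) -> list:
    counter = StageCounter(depth)
    trace = []
    for _ in range(iters):
        tok = counter.enter()
        trace.append((tok.stage, tok.phase, exit_stage(tok)))
    return trace
```

In this model the enter and exit stages agree by construction, which is
the property a separate exit-side counter would have to maintain on its
own.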
# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
- [x] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [x] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code and using the instructions it generates is not minimal.)
[backend][nvidia] move NVGPU dialect to third_party/nvidia/ (#3773)
This commit moves the NVGPU dialect to the third_party/nvidia
directory. To make this possible, we remove some unnecessary
dependencies on the NVGPU dialect from the AMD backend and move
some utilities into the third_party/nvidia directory as well. Overall,
this gives us a better structure and a stricter boundary between
core and backends.