[WS] reorder partition-loops and lower-aref (#7927)
* Reorder partition-loops and lower-aref passes
* Split lower-aref into lower-aref and assign-stage-phase passes
* The split separates concerns: assign-stage-phase is the more complex
pass, so testing and debugging can focus on the correctness of
stage/phase assignment without the added complexity of aref->mbarrier
lowering.
* assign-stage-phase uses the enterOp token to give each exitOp the same
stage variable as the enterOp that produced the token, instead of keeping
separate stage variables for enterOps and exitOps as before.
* lower-aref testing verifies the correctness of the aref->mbarrier
lowering
* In load-mma-specialization, don't place the final waitOp inside the
ws-region. This reverts to the behavior before
https://github.com/triton-lang/triton/pull/7757, as that change causes a
perf regression with this PR. ws.tag is kept to differentiate partitions
in different loops, as aref-tmem-insertion (WIP) will rely on it.
This is a prep PR needed for the aref-tmem-insertion WIP (to be submitted
after this one).
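
To illustrate the stage-sharing idea described above, here is a minimal toy
model in Python. It is not the pass itself and uses none of the real Triton
APIs: `Token`, `StagePhaseCounter`, `exit_stage`, and `simulate` are
hypothetical names, and the rule that the phase bit flips each time the
stage index wraps around the buffer depth is an assumption borrowed from
common multi-buffered mbarrier schemes, not something stated in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the token produced by an enter op."""
    stage: int
    phase: int


class StagePhaseCounter:
    """Toy stage/phase counter for a ring of `depth` buffers (assumed model)."""

    def __init__(self, depth: int) -> None:
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self.stage = 0
        self.phase = 0

    def enter(self) -> Token:
        # Record the stage/phase this enter uses, then advance the counter.
        token = Token(self.stage, self.phase)
        self.stage += 1
        if self.stage == self.depth:
            # Assumption: the phase bit flips when the ring wraps around.
            self.stage = 0
            self.phase ^= 1
        return token


def exit_stage(token: Token) -> int:
    # The exit reuses the stage carried by the enter token,
    # rather than tracking a separate stage counter of its own.
    return token.stage


def simulate(depth: int, iterations: int) -> list[tuple[int, int, int]]:
    """Return (enter stage, enter phase, exit stage) per iteration."""
    counter = StagePhaseCounter(depth)
    trace = []
    for _ in range(iterations):
        token = counter.enter()
        trace.append((token.stage, token.phase, exit_stage(token)))
    return trace
```

Because the exit stage is read from the token, it cannot drift from the
enter stage, which is the property the single shared stage variable is
meant to guarantee in this sketch.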
# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
  - [x] I have added tests.
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - [ ] I have not added any lit tests.
  - [x] The lit tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
Add an Nvidia Warp Specialization Dialect with an Async Reference Type and Operations (#6288)
Per conversations between OpenAI and Nvidia, this is a first step in
integrating some of the internally developed Warp Specialization
abstractions with ongoing backend work.
The Aref abstraction was developed by @vinodgro, @3gx, @acollins3,
@BinFan, and @chhzh123, with design input from @masahi and myself.
This PR adds the dialect for warp specialization analysis and
abstractions, along with IR definitions for Aref. Future work will focus
on lowering to standard ttg/ttng types, along with higher-level passes
for warp specialization analysis.
Thanks to @Mogball for helping me get over the last couple of humps of
the dialect implementation; I'd forgotten how much boilerplate is
involved :sweat_smile:
This also adds a new lit target for third_party/nvidia/test; I'm not
sure whether anything else needs to happen to get that target to run in
CI.