File / Last commit message / Last updated
[WS] reorder partition-loops and lower-aref (#7927)
* Reorder the partition-loops and lower-aref passes.
* Split lower-aref into the lower-aref and assign-stage-phase passes.
  * The split separates concerns: assign-stage-phase is the more complex pass, so testing and debugging can focus on the correctness of stage/phase assignment without the added complexity of the aref->mbarrier lowering.
  * assign-stage-phase uses the enterOp token to assign to exitOp the same stage variable that enterOp uses, instead of the previous separate stages for enterOps and exitOps.
  * lower-aref testing verifies the correctness of the aref->mbarrier lowerings.
* In load-mma-specialization, don't place the final waitOp inside the ws-region. This reverts to the behavior before https://github.com/triton-lang/triton/pull/7757, because that change causes a perf regression with this PR. ws.tag is kept to differentiate partitions in different loops, as aref-tmem-insertion (WIP) will rely on it.

This is a prep PR needed for aref-tmem-insertion (WIP), which will be submitted after this one.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
  - [x] I have added tests.
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - [ ] I have not added any lit tests.
  - [x] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

(9 months ago)
[NFC] Remove remaining uses of deprecated GEN_PASS_CLASSES (#6922). Continuation of #6898, #6785, #3971. Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com> (1 year ago)
[LAYOUTS] Implement generalized swizzling for convert_layout (#7565). We generalize the swizzling algorithm to consider the ldmatrix/stmatrix instructions and their transposed versions. To do this, we now require a dedicated allocator for NVIDIA, because the shared memory required for a convert_layout now depends on the instructions we can emit. After cleaning up the stmatrix path from the common convert_layout lowering, it became clear that we always take the swizzling path. I changed the allocator to reflect this, and I had to change a ton of tests that used it and now don't require padding. We also implement an improved lowering for the indexing of ldmatrix/stmatrix, following the optimisations from ld.shared/st.shared. (9 months ago)
[backend][nvidia] move NVGPU dialect to third_party/nvidia/ (#3773). This commit moves the NVGPU dialect to the third_party/nvidia directory. To make that possible, we need to remove some unnecessary dependencies on the NVGPU dialect in the AMD backend and move some utilities into the third_party/nvidia directory too. Overall, this gives us a better structure and creates a stricter boundary between core and backends. (2 years ago)
[BENCH] Various fixes on bench_mlp.py (#7926) (9 months ago)
Make cublas_types.h standalone (#6383). The header relies on these libraries. Include them so it can be compiled on its own for cleanliness. (1 year ago)