triton-ascend/third_party/nvidia/lib/Dialect · Ascend/triton-ascend - AtomGit

GitHub: [WS] reorder partition-loops and lower-aref (#7927)

ba3ec665, created on August 22, 2025

File	Last commit	Last updated
NVGPU	Refactor shared load/store utilities. (#4141) (This commit message is written about loads, but everything also applies to stores.) Prior to this PR we had two ways of loading from shared memory within the same CTA. 1. LLVM::LoadOp. This supported vector loads, but not predication. 2. TargetInfo::loadShared. This supported predication, but not vector loads. Loads from shared memory in different CTAs were accessible only through an nvidia-specific header. These did not support predication, and although they supported vector loads, they worked slightly differently than LLVM::LoadOp (namely, you had to know you were loading a vector and unwrap the type before passing it to the function). This PR reworks all this. Now 1. TargetInfo::loadShared and TargetInfo::loadDShared have the same API. 2. They both support predication and vectors, and the vectors work like LLVM::LoadOp. 3. They share code; they both just emit PTX. 4. Because we're emitting PTX directly from loadDShared, we can delete the NVIDIA::LoadDSmem op. In general I think a logical operation should have either A. A function createFoo() that emits one or more MLIR operations, or B. An MLIR op FooOp that lowers to one or more MLIR operations. But for distributed shmem loads, we had both (A) and (B). This was a redundant layer of indirection. This is used in a future LLs patch.	1 year ago
NVWS	[WS] reorder partition-loops and lower-aref (#7927) * Reorder the `partition-loops` and `lower-aref` passes. * Split `lower-aref` into the `lower-aref` and `assign-stage-phase` passes. * The split separates concerns: `assign-stage-phase` is the more complex pass, so testing and debugging can focus on the correctness of stage/phase assignment without the added complexity of the `aref->mbarrier` lowering. * `assign-stage-phase` uses the enterOp token to assign to the exitOp the same stage variable that the enterOp uses, instead of keeping separate stages for enterOps and exitOps as before. * `lower-aref` testing verifies the correctness of the `aref->mbarrier` lowerings. * In `load-mma-specialization`, don't place the final `waitOp` inside the ws-region; revert to the original behavior before https://github.com/triton-lang/triton/pull/7757, as that change causes a perf regression with this PR. ws.tag is kept to differentiate partitions in different loops, as `aref-tmem-insertion` (WIP) will rely on it. This is a prep PR needed for `aref-tmem-insertion` (WIP, to be submitted after this one).	9 months ago
CMakeLists.txt	Add an Nvidia Warp Specialization Dialect with an Async Reference Type and Operations (#6288) Per conversations between OpenAI and Nvidia, this represents a first step in integrating some of the internally developed Warp Specialization abstractions with ongoing backend work. The Aref abstraction was developed by @vinodgro, @3gx, @acollins3, @BinFan, and @chhzh123, with design input from @masahi and myself. This PR adds the dialect for warp specialization analysis and abstractions and IR definitions for Aref. Future work will focus on lowering to standard ttg/ttng types along with higher level passes for warp specialization analysis. Thanks to @Mogball for helping me get over the last couple of humps of dialect implementation, I'd forgotten how much boilerplate is involved :sweat_smile: This also adds a new lit target for third_party/nvidia/test, I'm not sure if anything else needs to happen to get that target to run in CI.	1 year ago
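
The NVWS commit message above mentions assigning a stage and a phase to each aref enter/exit pair. As a rough illustration only, the sketch below models the common ring-buffer convention for multi-buffered pipelines synchronized with barriers: the stage is the buffer index and the phase is a parity bit that flips each time the index wraps. It is not the `assign-stage-phase` pass itself, which operates on MLIR; the names `Slot`, `slot_for_iteration`, and `enter_exit_pairs` are hypothetical, and the modulo/parity formula is an assumption about the general technique rather than something taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    stage: int  # index of the buffer in the ring
    phase: int  # parity bit; flips every time the ring wraps around


def slot_for_iteration(i: int, depth: int) -> Slot:
    """Map loop iteration i to a (stage, phase) pair for a ring of `depth` buffers."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if i < 0:
        raise ValueError("iteration index must be non-negative")
    return Slot(stage=i % depth, phase=(i // depth) & 1)


def enter_exit_pairs(num_iters: int, depth: int) -> list[tuple[Slot, Slot]]:
    """Give each iteration's exit the same slot as its enter.

    This mirrors the idea in the commit message that the exit reuses the
    stage chosen for the matching enter instead of tracking its own.
    """
    pairs = []
    for i in range(num_iters):
        slot = slot_for_iteration(i, depth)
        pairs.append((slot, slot))  # enter and exit share one stage/phase
    return pairs


if __name__ == "__main__":
    for enter, exit_ in enter_exit_pairs(num_iters=5, depth=2):
        print(enter, exit_)
```

Deriving the exit's slot from the enter's, rather than keeping a second counter, means the two cannot drift apart, which is presumably part of why the commit couples them through the enter token.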