triton-ascend/third_party/nvidia/include/TritonNVIDIAGPUToLLVM · Ascend/triton-ascend - AtomGit

GGitHub[LAYOUTS] Implement generalized swizzling for convert_layout (#7565 )

b3b9931c创建于 2025年7月30日历史提交

文件	最后提交记录	最后更新时间
CMakeLists.txt	[BACKEND] Make backends more uniform and extract some backend agnostic utilities/helpers. (#3125) Making backends more uniform, extracting backend agnostic utilities/helpers, while the backend-specific ones are moved to `third_party/`. 1. Split the top-level `TritonGPUToLLVM/Utility.h` between LLVM and PTX helpers. Keep the LLVM helpers in top-level, whereas the PTX helpers are moved to `third_party/nvidia`. 2. Move TritonGPUToLLVM passes to `third_party/nvidia`, along with the `PTXAsmFormat` 3. Delete any code in `third_party/amd` that depends on `PTXAsmFormat` 4. Delete all LLVM helpers from `third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h` and use the top-level `Utility.h`. 5. Link `unittest/PTXAsmFormatTest.cpp` to `third_party/nvidia` library. 6. Move `NVGPUToLLVM` passes to `third_party/nvidia` Final directory structure: ``` ├── include/ │ └── triton/ │ └── Conversion/ │ └── TritonGPUToLLVM/ │ └── AsmFormat.h ├── lib/ │ └── Conversion/ │ └── TritonGPUToLLVM/ │ ├── TypeConverter.h │ ├── TypeConverter.cpp │ ├── Utility.h │ └── Utility.cpp (LLVM-only helpers) └── third_party/ └── amd/ └── nvidia/ ├── include/ │ ├── NVGPUToLLVM/ │ └── TritonNVIDIAGPUToLLVM/ └── lib/ ├── NVGPUToLLVM/ └── TritonNVIDIAGPUToLLVM/ ```	2 年前
PTXAsmFormat.h	[BACKEND] Rework load-store redundant data masking (#5432) This splits `getRedundantDataMask` into two functions, `getFreeVariableMasks` and `emitRedundantThreadPredicate`. The returned predicate doesn't include the register index, and instead you use the free variable mask to de-duplicate the registers while looping over them (i.e. we don't emit the instruction at all). This also allows us to fix predication for `AsyncCopyGlobalToLocal`, as we can explicitly zero out the block dim mask before calling `emitRedundantThreadPredicate`. I also return null values if the predicate is always true, which allows us to omit the predicate entirely if there is no redundant data.	1 年前
Passes.h	[LAYOUTS] Implement generalized swizzling for convert_layout (#7565) We generalize the swizzling algorithm to consider the instructions `ldmatrix/stmatrix` and their transpose versions. To do this, we now require having a dedicated allocator for nvidia, as the required shmem for a convert_layout will now depend on the instructions we can emit. After cleaning up the stmatrix path from the common `convert_layout` lowering, it became clear that we always take the swizzling path. I changed the allocator to reflect this, and I had to change a ton of tests that used it and now don't require padding. We also implement an improved lowering for the indexing of `ldmatrix/stmatrix` following the optimisations from `ld.shared/st.shared`.	9 个月前
Passes.td	[LAYOUTS] Implement generalized swizzling for convert_layout (#7565) We generalize the swizzling algorithm to consider the instructions `ldmatrix/stmatrix` and their transpose versions. To do this, we now require having a dedicated allocator for nvidia, as the required shmem for a convert_layout will now depend on the instructions we can emit. After cleaning up the stmatrix path from the common `convert_layout` lowering, it became clear that we always take the swizzling path. I changed the allocator to reflect this, and I had to change a ton of tests that used it and now don't require padding. We also implement an improved lowering for the indexing of `ldmatrix/stmatrix` following the optimisations from `ld.shared/st.shared`.	9 个月前
Utility.h	[BACKEND] Allow backend to specify special rules for membar insertion (#4675) With block level kind of operations like TMA it is possible that some ops access the shared memory but don't require barriers. This adds a lambda that backends can pass to explicitly skip barriers in between some ops.	1 年前