文件最后提交记录最后更新时间
[BACKEND] Make backends more uniform and extract some backend agnostic utilities/helpers. (#3125) Making backends more uniform, extracting backend agnostic utilities/helpers, while the backend-specific ones are moved to third_party/. 1. Split the top-level TritonGPUToLLVM/Utility.h between LLVM and PTX helpers. Keep the LLVM helpers in top-level, whereas the PTX helpers are moved to third_party/nvidia. 2. Move TritonGPUToLLVM passes to third_party/nvidia, along with the PTXAsmFormat 3. Delete any code in third_party/amd that depends on PTXAsmFormat 4. Delete all LLVM helpers from third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h and use the top-level Utility.h. 5. Link unittest/PTXAsmFormatTest.cpp to third_party/nvidia library. 6. Move NVGPUToLLVM passes to third_party/nvidia Final directory structure: ``` ├── include/ │ └── triton/ │ └── Conversion/ │ └── TritonGPUToLLVM/ │ └── AsmFormat.h ├── lib/ │ └── Conversion/ │ └── TritonGPUToLLVM/ │ ├── TypeConverter.h │ ├── TypeConverter.cpp │ ├── Utility.h │ └── Utility.cpp (LLVM-only helpers) └── third_party/ └── amd/ └── nvidia/ ├── include/ │ ├── NVGPUToLLVM/ │ └── TritonNVIDIAGPUToLLVM/ └── lib/ ├── NVGPUToLLVM/ └── TritonNVIDIAGPUToLLVM/ ```2 年前
[BACKEND] Rework load-store redundant data masking (#5432) This splits getRedundantDataMask into two functions, getFreeVariableMasks and emitRedundantThreadPredicate. The returned predicate doesn't include the register index, and instead you use the free variable mask to de-duplicate the registers while looping over them (i.e. we don't emit the instruction at all). This also allows us to fix predication for AsyncCopyGlobalToLocal, as we can explicitly zero out the block dim mask before calling emitRedundantThreadPredicate. I also return null values if the predicate is always true, which allows us to omit the predicate entirely if there is no redundant data.1 年前
[LAYOUTS] Implement generalized swizzling for convert_layout (#7565) We generalize the swizzling algorithm to consider the instructions ldmatrix/stmatrix and their transpose versions. To do this, we now require having a dedicated allocator for nvidia, as the required shmem for a convert_layout will now depend on the instructions we can emit. After cleaning up the stmatrix path from the common convert_layout lowering, it became clear that we always take the swizzling path. I changed the allocator to reflect this, and I had to change a ton of tests that used it and now don't require padding. We also implement an improved lowering for the indexing of ldmatrix/stmatrix following the optimisations from ld.shared/st.shared.9 个月前
[LAYOUTS] Implement generalized swizzling for convert_layout (#7565) We generalize the swizzling algorithm to consider the instructions ldmatrix/stmatrix and their transpose versions. To do this, we now require having a dedicated allocator for nvidia, as the required shmem for a convert_layout will now depend on the instructions we can emit. After cleaning up the stmatrix path from the common convert_layout lowering, it became clear that we always take the swizzling path. I changed the allocator to reflect this, and I had to change a ton of tests that used it and now don't require padding. We also implement an improved lowering for the indexing of ldmatrix/stmatrix following the optimisations from ld.shared/st.shared.9 个月前
[BACKEND] Allow backend to specify special rules for membar insertion (#4675) With block level kind of operations like TMA it is possible that some ops access the shared memory but don't require barriers. This adds a lambda that backends can pass to explicitly skip barriers in between some ops.1 年前