triton-ascend/third_party/amd/include/TritonAMDGPUTransforms · Ascend/triton-ascend - AtomGit

GitHub: [AMD] Use single LDS for both transposed and non-transposed access (#7813)

Commit 84808eba, created on August 15, 2025

File	Last commit	Last updated
CMakeLists.txt	[AMD] move AMD backend in tree (#2967) This PR moves the AMD backend submodule into this repo as a folder.	2 years ago
MfmaGroup.h	[AMD] Emulate Float8E4M3FN with Float16 on CDNA3 and below (#7186) The fact that gfx942 has its own FP8 variants, not the OCP ones, is a common pitfall. Also, starting with gfx950, we switch to the OCP FP8 variants, so we have a one-generation special case here. This commit enables emulating Float8E4M3FN with FP16, as we already do for Float8E5M2, for better portability, with a performance remark.	11 months ago
Passes.h	[AMD] Use single LDS for both transposed and non-transposed access (#7813) This commit introduces a pass that detects a pair of tt.dot ops that both use the same tt.load result, one directly and one via tt.trans, and creates the same shared memory allocation for both. This allows the pipeliner to pick a single LDS layout and enables pipelining of the loads.	9 months ago
Passes.td	[AMD] Use single LDS for both transposed and non-transposed access (#7813) This commit introduces a pass that detects a pair of tt.dot ops that both use the same tt.load result, one directly and one via tt.trans, and creates the same shared memory allocation for both. This allows the pipeliner to pick a single LDS layout and enables pipelining of the loads.	9 months ago
TritonGPUConversion.h	[AMD] NFC: Unified header guard in third_party/amd (#5244) This commit unifies the names of header guards in third_party/amd.	1 year ago
WmmaGroup.h	[AMD][gfx12] WMMA AMD16x16x32 support for i4 operands (#7012) Previous WMMA support was hardcoded for 16x16x16 tiles, which matched older AMD GPU capabilities. Starting with gfx1200, AMD supports 16x32 input for matrix A and 32x16 for matrix B (for i4 types). To support this, we introduce a mapping from the dot operation's configuration (i.e., shape and element type information) to the corresponding WMMA instruction. This abstraction allows the backend to dynamically determine the key instruction parameters, kDim and kWidth, which are exactly what is needed to enable support for varying K dimensions in WMMA instructions.	11 months ago