triton-ascend/third_party/amd/lib/TritonAMDGPUTransforms · Ascend/triton-ascend - AtomGit

GitHub: [AMD] Enable dot_scaled on gfx11 (#7954)

49f77ab3, created on August 27, 2025

File	Last commit	Last updated
AccelerateAMDMatmul.cpp	[AMD] Enable dot_scaled on gfx11 (#7954) I saw some occasional test failures, but after disabling True16, tests seem to pass fine. --------- Co-authored-by: Paul Trojahn <paul.trojahn@amd.com>	9 months ago
BlockPingpong.cpp	[AMD] Enable Pingpong by default on gfx950 arch (#7697) List of enabling conditions - FP/BF16 GEMM with M,N>64 tile size when num_stages=3 and num_warps=8 - GEMM using `dot_scaled` with M=N=256 tile size when num_stages=2 and num_warps=8 - FA with num_stages=4 Only when using async_copy.	9 months ago
CMakeLists.txt	[AMD] Use single LDS for both transposed and non-transposed access (#7813) This commit introduces a pass that detects a pair of tt.dot ops that both use the same tt.load result, one directly and one via tt.trans, and creates the same shared memory allocation for both. This allows the pipeliner to pick a single LDS layout and enables pipelining of the loads.	9 months ago
CanonicalizePointers.cpp	[AMD] Fix pointer canonicalizer when propagating discardable attrs (#7242) Propagating divisibility and other discardable attrs needs to make sure that the ranks of the source and destination instructions match. When the ranks don't match, it isn't possible to trivially propagate those attributes. Allow propagation of attributes for same rank and also between rank 1 and scalar.	11 months ago
CoalesceAsyncCopy.cpp	[AMD] Do not pipeline via AsyncCopy for unsupported vec sizes (#7676) This PR makes `canBeConvertedToAsyncLoad` more general and strict, allowing only cases where there is a supported direct-to-LDS `vecSize` smaller than or equal to the `vecSize` based on contiguity. This catches cases where we load less than 32 bits, which were already rejected before this PR. Additionally, it catches the case where we cannot lower the vecSize to a supported size. In such cases we also cannot use `ttg.async_copy_global_to_local`, since we cannot split contiguous elements owned by a thread into multiple load instructions. E.g. fp64 with vecSize==1 does not work with `ttg.async_copy_global_to_local` on GFX9.	10 months ago
ConvertToBufferOps.cpp	[AMD] Fix buffer op mask operand removal (#7963) Buffer operations don't have to take a mask operand iff the mask is an all-true-predicate vector.	9 months ago
FoldTrueCmpIOp.cpp	[NFC] Remove remaining uses of deprecated `GEN_PASS_CLASSES` (#6922) Continuation of #6898, #6785, #3971 --------- Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>	1 year ago
HoistLayoutConversions.cpp	[NFC] Remove remaining uses of deprecated `GEN_PASS_CLASSES` (#6922) Continuation of #6898, #6785, #3971 --------- Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>	1 year ago
InThreadTranspose.cpp	[TritonGPU] Split MemDescSubview into MemDescIndex and MemDescSubslice (#7622) The first one will be used just for pipelining and is equivalent to `x[i]`; the second one takes a full slice of constant shape, `x[:i1, :i2]` for example.	10 months ago
MfmaGroup.cpp	[AMD] Support 4x64 and 64x4 MFMA layout for dot (#7576) Support 4x64 and 64x4 MFMA layout for dot, where `D(4x64) = A(4x64) * B(64x64) + C(4x64)` or `D(64x4) = A(64x64) * B(64x4) + C(64x4)`. This allows small M/N size GEMMs to run with better performance. Under the hood it uses a group of 16 4x4 mfma instructions. Other related changes: - Refactor mfma input and output linear layout implementation. - Remove unsupported mfma instructions: 1) do not support the 4x4 tile, as it requires a manual reduction at the end, which performs poorly; 2) do not support 4x16 and 16x4 tiles for fp64, since mfma fp64 does not support broadcast for operand A.	10 months ago
OptimizeDotOperands.cpp	[AMD] Use single LDS for both transposed and non-transposed access (#7813) This commit introduces a pass that detects a pair of tt.dot ops that both use the same tt.load result, one directly and one via tt.trans, and creates the same shared memory allocation for both. This allows the pipeliner to pick a single LDS layout and enables pipelining of the loads.	9 months ago
OptimizeEpilogue.cpp	[NFC] Use RankedTensorType's clone and cloneWithEncoding member functions (#7464) Credit to Jeff for pointing out these exist.	10 months ago
ReorderInstructions.cpp	[Backend] Bump to llvm/llvm-project@8957e64a20fc (#7138) This picks up a bug fix for AMDGPU scaled mfma: * https://github.com/llvm/llvm-project/pull/142493 Along the way we need to fix API changes: * `getBackwardSlice` now returns `LogicalResult`; adding `(void)` to ignore for now. * `LLVM::InlineAsmOp` now has a TailCallKind parameter. * A few conversion ops in AMD backend changed return types and parameters.	11 months ago
StreamPipeline.cpp	[NFC] Just take one shape in `toLinearLayout` (#7778) We don't need to pass both allocationShape and shape after https://github.com/triton-lang/triton/pull/7515	9 months ago
UpdateAsyncWaitCount.cpp	[AMD] NFC: simplify pass/pattern constructor declaration (#7665)	10 months ago
Utility.cpp	[AMD] Count vmcnt instructions for AsyncWait (#6426) Adds `UpdateAsyncWaitCountPass` to adjust the wait counts of `AsyncWait` ops to reflect the number of interleaved direct-to-LDS assembly instructions. The LLVM backend cannot infer the dependency between the `AsyncCopies` and the `local_reads`, so we emit it from Triton, as we have the dependency information via tracing the `AsyncToken`. The pass ignores global/buffer loads because the actual number of assembly instructions is determined by the LLVM backend. Note that an underestimation only affects performance, not correctness. `findMinPathCountInDefChain` is in a separate file because we might reuse it for combining `AsyncWaits` in the `StreamPipeliner`.	1 year ago
Utility.h	[AMD] Count vmcnt instructions for AsyncWait (#6426) Adds `UpdateAsyncWaitCountPass` to adjust the wait counts of `AsyncWait` ops to reflect the number of interleaved direct-to-LDS assembly instructions. The LLVM backend cannot infer the dependency between the `AsyncCopies` and the `local_reads`, so we emit it from Triton, as we have the dependency information via tracing the `AsyncToken`. The pass ignores global/buffer loads because the actual number of assembly instructions is determined by the LLVM backend. Note that an underestimation only affects performance, not correctness. `findMinPathCountInDefChain` is in a separate file because we might reuse it for combining `AsyncWaits` in the `StreamPipeliner`.	1 year ago
WmmaGroup.cpp	[AMD][gfx12] WMMA AMD16x16x32 support for i4 operands (#7012) Previous WMMA support was hardcoded for 16x16x16 tiles, which matched older AMD GPU capabilities. Starting with gfx1200, AMD supports 16x32 input for matrix A and 32x16 for matrix B (for i4 types). To support this, we introduce a mapping from the dot operation's configuration (i.e., shape and element type information) to the corresponding WMMA instruction. This abstraction allows the backend to dynamically determine the key instruction parameters, kDim and kWidth, which are exactly what's needed to enable support for varying K dimensions in WMMA instructions.	11 months ago
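
The CoalesceAsyncCopy.cpp entry above describes a rule for choosing a direct-to-LDS vector size: there must be a supported size no larger than the contiguity-based `vecSize`, and a thread's contiguous elements cannot be split across multiple load instructions. The following is a minimal Python sketch of that rule, not the code in the pass. The function name `pick_direct_to_lds_vec_size` and the default `supported_bits=(32,)` are assumptions made for illustration; the listing does not say which widths each architecture supports.

```python
from typing import Optional, Sequence


def pick_direct_to_lds_vec_size(
    elem_bits: int,
    contiguity_vec_size: int,
    supported_bits: Sequence[int] = (32,),  # ASSUMPTION: illustrative widths only
) -> Optional[int]:
    """Return the largest usable vector size in elements, or None if none fits."""
    # Bits covered by the elements one thread owns contiguously.
    max_bits = elem_bits * contiguity_vec_size
    for bits in sorted(supported_bits, reverse=True):
        # The access must fit in the contiguous run and must hold a whole
        # number of elements, since one element cannot be split across loads.
        if bits <= max_bits and bits % elem_bits == 0:
            return bits // elem_bits
    return None


# fp16 with 4 contiguous elements: lowered to a 32-bit access of 2 elements.
fp16_ok = pick_direct_to_lds_vec_size(elem_bits=16, contiguity_vec_size=4)
# fp16 with a single element: fewer than 32 bits, so rejected.
fp16_small = pick_direct_to_lds_vec_size(elem_bits=16, contiguity_vec_size=1)
# fp64 with vecSize == 1: a 64-bit element cannot be lowered to 32 bits.
fp64_single = pick_direct_to_lds_vec_size(elem_bits=64, contiguity_vec_size=1)
```

Under the assumed 32-bit-only width, the fp64 case is rejected for the reason the commit message gives: the element would have to be split across two loads.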
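
The MfmaGroup.cpp entry above gives the shape `D(4x64) = A(4x64) * B(64x64) + C(4x64)`. As a purely arithmetic illustration, the pure-Python sketch below computes that result one 4x4 output tile at a time and checks it against a plain reference; a 4x64 result contains 64 / 4 = 16 such tiles. The helper names are hypothetical. The listing does not say how the sixteen 4x4 mfma instructions map onto output tiles or K steps, so this should not be read as the hardware schedule or the lane layout.

```python
def matmul_acc(a, b, c):
    """Reference D = A * B + C on nested lists of numbers."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def dot_by_output_tiles(a, b, c, tile=4):
    """Compute D = A * B + C tile by tile along N; return (D, tile count)."""
    n = len(b[0])
    d = [row[:] for row in c]
    tiles = 0
    for j0 in range(0, n, tile):
        b_cols = [row[j0:j0 + tile] for row in b]  # K x tile slice of B
        c_tile = [row[j0:j0 + tile] for row in c]  # M x tile slice of C
        d_tile = matmul_acc(a, b_cols, c_tile)
        for i in range(len(a)):
            d[i][j0:j0 + tile] = d_tile[i]
        tiles += 1
    return d, tiles


# Small deterministic integer inputs so the comparison is exact.
M, K, N = 4, 64, 64
A = [[(i + k) % 5 for k in range(K)] for i in range(M)]
B = [[(k * j) % 7 for j in range(N)] for k in range(K)]
C = [[i - j for j in range(N)] for i in range(M)]

D_tiled, num_tiles = dot_by_output_tiles(A, B, C)
D_ref = matmul_acc(A, B, C)
```

The 64x4 variant follows the same idea with the tiling applied along M instead of N.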