文件最后提交记录最后更新时间
[release/3.2.x] [CHERRY PICK] [AMD] Fix issue with rank=1 in tryFitCvtIntoLDS (#5084) (#5453) Co-authored-by: Sam Ginzburg <ginzburg@fb.com> (cherry picked from commit f9cdf58276a42b31151debf219c2de1471ed4422) Co-authored-by: Samuel Ginzburg <ginzburg@meta.com>1 年前
[BACKEND] Fix and document logic for creating warp shapes in MMAv3 (#5441) Cherry pick of #5277 for 3.2 release Co-authored-by: Keren Zhou <kerenzhou@openai.com>1 年前
[cherrypick release/3.2][BACKEND] Fix accumulator init optimization for integer matmuls (#5192) (#5624) This fixes https://github.com/pytorch/pytorch/issues/144705 Co-authored-by: peterbell10 <peterbell10@openai.com>1 年前
[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes the triton_gpu.compute-capability into triton_gpu.target to accomodate backends other than cuda. triton_gpu.target now accepts a string attribute in the form of <backend>:<arch> as values, for example, cuda:80, hip:gfx942. This commit also removes the direct computat capability reference in the OptimizeDotOperandsPass--it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter.2 年前
[Triton] Use UnitAttr in tt.reshape definition (#4947) Make allow_reorder and efficient_layout UnitAttr for a cleaner interface. This way, the operation exposes a bool getEfficientLayout() member to check for that attribute and a constructor receiving bool arguments for both of these attributes (defaulted to false). The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - Select one of the following. - [X] I have added tests. - /test for lit tests - /unittest for C++ tests - /python/test for end-to-end tests - [ ] This PR does not need a test because FILL THIS IN. - Select one of the following. - [ ] I have not added any lit tests. - [X] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>1 年前
[BACKEND] Fix ProgramPoint passing in AxisInfoAnalysis (#5181) Fixes #5122. The ProgramPoint [here](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1087) is created on the stack. Then its address is [passed](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1088-L1089) to the MLIR SparseAnalysis code, where it is [added as a dependency](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L311) and later [dereferenced](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L90). By the time the ProramPoint is dereferenced in the AbstractSparseForwardDataFlowAnalysis::visit, the AxisInfoAnalysis::visitForOpInductionVar will have finished and the ProgramPoint stack variable destroyed. This leads to a segfault (which can be reproed on the base rev with the lit test added in this PR). The code modified in this PR was originally added in #4927, in conjunction with updating the llvm-project hash to b5cc222d7429. However, as noted in https://github.com/llvm/llvm-project/pull/110344 (the llvm-project PR that has made the refactoring prompting the AxisInfo.cpp change in #4927): > For dense forward data-flow analysis and other analysis (except dense backward data-flow analysis), the program point corresponding to the original operation can be obtained by getProgramPointAfter(op) As the AxisInfoAnalysis (in Triton) inherits from SparseForwardDataFlowAnalysis (in MLIR), in this PR we follow the above which resolves the segfault issue (as the ProgramPoint is now stored in the instance-level state of the pass). P.S. The lit test added in this PR is not exactly minimal. However, I did my best to minimize it starting from the 400-line repro TTGIR in #5122. Further minimization does not seem to expose the segfault.1 年前
[BACKEND] Fix crash in combine select and if op (#3990) When multiple selects gets merged to the same ifop we need to make sure we don't use a stale IfOp during transformation.1 年前
[AMD] Enable shared->MFMA dot operand conversion through LinearLayout (#4983) This PR: - Introduces fallback from normal TTG->LLVM converter in case it does not support given local_load. - Enables conversion of MFMA dot layout to Linear Layout in local_load pattern.1 年前
[BACKEND] Fix memory side effects of tt.dot (#4033) 1. Replaced triton_nvidia_gpu.async_dot with triton_nvidia_gpu.group_dot which has a isAsync attribute. Maybe warp_group_dot is a better name? 2. Removed memdesc from tt.dot because tt.dot should be pure, without any side effects 3. Removed hacks in Membar analysis. 4. Unified wgmma code generation in the backend. 5. Introduced the DotLike trait for tt.dot and triton_nvidia_gpu.group_dot. 6. Updated comments in matmul loop pipeline (maybe incomplete). 7. Removed the ConvertDotConvert pattern1 年前
[BACKEND] Fix memory side effects of tt.dot (#4033) 1. Replaced triton_nvidia_gpu.async_dot with triton_nvidia_gpu.group_dot which has a isAsync attribute. Maybe warp_group_dot is a better name? 2. Removed memdesc from tt.dot because tt.dot should be pure, without any side effects 3. Removed hacks in Membar analysis. 4. Unified wgmma code generation in the backend. 5. Introduced the DotLike trait for tt.dot and triton_nvidia_gpu.group_dot. 6. Updated comments in matmul loop pipeline (maybe incomplete). 7. Removed the ConvertDotConvert pattern1 年前
[AMD] Support WMMAv2 in AccelerateAMDMatmulPass (#4452) - Specify kWidth parameter according to the version - For the first iteration fp8 operands are unsupported, no new operand configuration are added for now - Added lit tests Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>1 年前
Verify the tt.dot operation thru the dot verification interface of the dialect which defines the C layout. (#4630) The Triton tt.dot operation maybe materialized by MMA encoding defined by the out-of-tree dialect. The verifier of the tt.dot operation dispatches the verification to the dialect of the A and B operands which are mostly DotOp layout defined in-tree TritonGPU dialect. The TritonGPU dialect only verifies the tt.dot operation for in-tree backend. To use the dialect of the C operand layout verifying the tt.dot operation for out-of-tree backends. (And also works for NV and AMD's MMA which are defined in-tree TritonGPU dialect.) - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - Select one of the following. - [X] I have added tests. - /test for lit tests - [ ] This PR does not need a test because FILL THIS IN. - Select one of the following. - [ ] I have not added any lit tests. - [X] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)1 年前
[test] NFC: split loop pipeline test file to prepare sharing (#4399) 1 年前
[AMD] Rewrite transpose ops in pipeliner to mutable memory (#4969) add_optimize_dot_operands may introduce a immutable shared buffer for transposed dot operands. Our stream-pipeliner then replaces the immutable buffer with a mutable buffer to be able to reuse it across iterations (pre-fetching). This will then produce incorrect transOps because the input is mutable but the result is immutable. This PR rewrites those transOps to output a mutable layout.1 年前
[AMD] Enable dynamic peeling for stream pipeliner (#4650) * Added predication for tt.store for tests * Moved ttng.warp_group_dot for tests1 年前
[SWP] When num_stages = 2, do not pipeline indirect loads (#4721) For indirect loads, we try to assign them to later stages ``` unsigned stagesBetweenLoads = ceil<unsigned>(numStages - 2, maxIndirectionLevel + 1); int stage = (maxIndirectionLevel - indLevel) * stagesBetweenLoads; schedule.insert(loadOp, stage, loadsClusters[indLevel]); ``` If numStages is 2, there is no later stage to assign the indirect loads to. The fix is to not pipeline the indirect loads. We also generalize to not pipeline an indirect load if the indirection level >= numStages - 11 年前
[Triton] Use UnitAttr in tt.reshape definition (#4947) Make allow_reorder and efficient_layout UnitAttr for a cleaner interface. This way, the operation exposes a bool getEfficientLayout() member to check for that attribute and a constructor receiving bool arguments for both of these attributes (defaulted to false). The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - Select one of the following. - [X] I have added tests. - /test for lit tests - /unittest for C++ tests - /python/test for end-to-end tests - [ ] This PR does not need a test because FILL THIS IN. - Select one of the following. - [ ] I have not added any lit tests. - [X] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>1 年前
[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes the triton_gpu.compute-capability into triton_gpu.target to accomodate backends other than cuda. triton_gpu.target now accepts a string attribute in the form of <backend>:<arch> as values, for example, cuda:80, hip:gfx942. This commit also removes the direct computat capability reference in the OptimizeDotOperandsPass--it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter.2 年前
[AMD] Support WMMAv2 in AccelerateAMDMatmulPass (#4452) - Specify kWidth parameter according to the version - For the first iteration fp8 operands are unsupported, no new operand configuration are added for now - Added lit tests Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>1 年前
[Triton] Use UnitAttr in tt.reshape definition (#4947) Make allow_reorder and efficient_layout UnitAttr for a cleaner interface. This way, the operation exposes a bool getEfficientLayout() member to check for that attribute and a constructor receiving bool arguments for both of these attributes (defaulted to false). The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - Select one of the following. - [X] I have added tests. - /test for lit tests - /unittest for C++ tests - /python/test for end-to-end tests - [ ] This PR does not need a test because FILL THIS IN. - Select one of the following. - [ ] I have not added any lit tests. - [X] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>1 年前
[IR] Rework textual IR for load, store, dot, and other ops (#3587) The initial motivation for this change was that tt.dot's inputPrecision value was printed in textual IR not as a string (e.g. "tf32") but as an opaque number. The trick to fixing this is to explicitly list the attr in the assemblyFormat, instead of relying on attr-dict. attr-dict prints the attr as an integer, but if we list it explicitly, it will be printed as a string. But when I fixed this, I realized that many other ops, such as load and store, also had enums which were printed as numbers rather than strings. So I had to fix those, too. I also added reasonable defaults for most of the arguments to make the textual IR more concise. I then noticed that load and store had complicated asm parsers to handle the fact usually the pointer type was not specified, but sometimes it was necessary. This is because given e.g. load : tensor<128xf32> the pointer may be either tensor<128x!tt.ptr<f32>> (the default) or !tt.ptr<tensor<128xf32>> (for block pointers). So I fixed this too. Now load and store specify the *pointer* type instead of the value type. This lets us use the built-in asm parser and delete a bunch of code. I also noticed that we had a bunch of dead ttgir tests (actually it seems like these were never enabled); rather than attempt to update them, without any way to test it, I just deleted them. Thanks to @joker-eph for suggesting the solution here -- I never would have figured this out on my own.2 年前
[BACKEND] Fix memory side effects of tt.dot (#4033) 1. Replaced triton_nvidia_gpu.async_dot with triton_nvidia_gpu.group_dot which has a isAsync attribute. Maybe warp_group_dot is a better name? 2. Removed memdesc from tt.dot because tt.dot should be pure, without any side effects 3. Removed hacks in Membar analysis. 4. Unified wgmma code generation in the backend. 5. Introduced the DotLike trait for tt.dot and triton_nvidia_gpu.group_dot. 6. Updated comments in matmul loop pipeline (maybe incomplete). 7. Removed the ConvertDotConvert pattern1 年前
[AMD] Support mfma layout in the prefetch pass (#4771) We've investigated tritongpu-prefetch on the amdgpu and it shows positive result in some cases. This change allows the prefetch pass to rewrite the loop with the tt.dot using amd_mfma with a fix in the sharedmem offset. Current change doesn't insert the pass to the compilation pipeline yet. The pass is supposed to be placed just after the pipelining pass.1 年前
[Backend] Bypass conversion for suitable blocked to dotOperand layout (#4538) This PR extends shared memory bypass for blocked->dotOperand conversions and adds bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication. This commit is a preparation step towards improving CodeGen and efficiency of skinny dot cases.1 年前
[BACKEND] Add verifiers for LocalAlloc and other memdesc ops (#4305) Add verifiers for LocalAlloc, LocalStore and CopyToLocal to make sure the mutable attribute is set consistently --------- Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>1 年前
[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes the triton_gpu.compute-capability into triton_gpu.target to accomodate backends other than cuda. triton_gpu.target now accepts a string attribute in the form of <backend>:<arch> as values, for example, cuda:80, hip:gfx942. This commit also removes the direct computat capability reference in the OptimizeDotOperandsPass--it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter.2 年前
[IR] Rework textual IR for load, store, dot, and other ops (#3587) The initial motivation for this change was that tt.dot's inputPrecision value was printed in textual IR not as a string (e.g. "tf32") but as an opaque number. The trick to fixing this is to explicitly list the attr in the assemblyFormat, instead of relying on attr-dict. attr-dict prints the attr as an integer, but if we list it explicitly, it will be printed as a string. But when I fixed this, I realized that many other ops, such as load and store, also had enums which were printed as numbers rather than strings. So I had to fix those, too. I also added reasonable defaults for most of the arguments to make the textual IR more concise. I then noticed that load and store had complicated asm parsers to handle the fact usually the pointer type was not specified, but sometimes it was necessary. This is because given e.g. load : tensor<128xf32> the pointer may be either tensor<128x!tt.ptr<f32>> (the default) or !tt.ptr<tensor<128xf32>> (for block pointers). So I fixed this too. Now load and store specify the *pointer* type instead of the value type. This lets us use the built-in asm parser and delete a bunch of code. I also noticed that we had a bunch of dead ttgir tests (actually it seems like these were never enabled); rather than attempt to update them, without any way to test it, I just deleted them. Thanks to @joker-eph for suggesting the solution here -- I never would have figured this out on my own.2 年前