triton-ascend/test/TritonGPU · yangchenhua/triton-ascend - AtomGit

GGitHub[cherrypick release/3.2][BACKEND] Fix accumulator init optimization for integer matmuls (#5192 ) (#5624 )

6771065c创建于 2025年1月16日历史提交

文件	最后提交记录	最后更新时间
amd	[release/3.2.x] [CHERRY PICK] [AMD] Fix issue with rank=1 in tryFitCvtIntoLDS (#5084) (#5453) Co-authored-by: Sam Ginzburg <ginzburg@fb.com> (cherry picked from commit f9cdf58276a42b31151debf219c2de1471ed4422) Co-authored-by: Samuel Ginzburg <ginzburg@meta.com>	1 年前
accelerate-matmul.mlir	[BACKEND] Fix and document logic for creating warp shapes in MMAv3 (#5441) Cherry pick of #5277 for 3.2 release Co-authored-by: Keren Zhou <kerenzhou@openai.com>	1 年前
accumulator-init.mlir	[cherrypick release/3.2][BACKEND] Fix accumulator init optimization for integer matmuls (#5192) (#5624) This fixes https://github.com/pytorch/pytorch/issues/144705 Co-authored-by: peterbell10 <peterbell10@openai.com>	1 年前
atomic-cas.mlir	[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes the `triton_gpu.compute-capability` into `triton_gpu.target` to accomodate backends other than cuda. `triton_gpu.target` now accepts a string attribute in the form of `<backend>:<arch>` as values, for example, `cuda:80`, `hip:gfx942`. This commit also removes the direct computat capability reference in the `OptimizeDotOperandsPass`--it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter.	2 年前
canonicalize.mlir	[Triton] Use `UnitAttr` in `tt.reshape` definition (#4947) Make `allow_reorder` and `efficient_layout` `UnitAttr` for a cleaner interface. This way, the operation exposes a `bool getEfficientLayout()` member to check for that attribute and a constructor receiving `bool` arguments for both of these attributes (defaulted to `false`). The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description. Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [X] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [X] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>	1 年前
coalesce.mlir	[BACKEND] Fix ProgramPoint passing in AxisInfoAnalysis (#5181) Fixes #5122. The `ProgramPoint` [here](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1087) is created on the stack. Then its address is [passed](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1088-L1089) to the MLIR `SparseAnalysis` code, where it is [added as a dependency](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L311) and later [dereferenced](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L90). By the time the `ProramPoint` is dereferenced in the `AbstractSparseForwardDataFlowAnalysis::visit`, the `AxisInfoAnalysis::visitForOpInductionVar` will have finished and the `ProgramPoint` stack variable destroyed. This leads to a segfault (which can be reproed on the base rev with the lit test added in this PR). The code modified in this PR was originally added in #4927, in conjunction with updating the `llvm-project` hash to `b5cc222d7429`. However, as noted in https://github.com/llvm/llvm-project/pull/110344 (the `llvm-project` PR that has made the refactoring prompting the `AxisInfo.cpp` change in #4927): > For dense forward data-flow analysis and other analysis (except dense backward data-flow analysis), the program point corresponding to the original operation can be obtained by `getProgramPointAfter(op)` As the `AxisInfoAnalysis` (in Triton) inherits from `SparseForwardDataFlowAnalysis` (in MLIR), in this PR we follow the above which resolves the segfault issue (as the `ProgramPoint` is now stored in the instance-level state of the pass). P.S. The lit test added in this PR is not exactly minimal. However, I did my best to minimize it starting from the 400-line repro TTGIR in #5122. Further minimization does not seem to expose the segfault.	1 年前
combine-select-if.mlir	[BACKEND] Fix crash in combine select and if op (#3990) When multiple selects gets merged to the same ifop we need to make sure we don't use a stale IfOp during transformation.	1 年前
combine.mlir	[AMD] Enable shared->MFMA dot operand conversion through LinearLayout (#4983) This PR: - Introduces fallback from normal TTG->LLVM converter in case it does not support given local_load. - Enables conversion of MFMA dot layout to Linear Layout in local_load pattern.	1 年前
dot-operands.mlir	[BACKEND] Fix memory side effects of `tt.dot` (#4033) 1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe `warp_group_dot` is a better name? 2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure, without any side effects 3. Removed hacks in Membar analysis. 4. Unified wgmma code generation in the backend. 5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`. 6. Updated comments in matmul loop pipeline (maybe incomplete). 7. Removed the `ConvertDotConvert` pattern	1 年前
fence-inserstion.mlir	[BACKEND] Fix memory side effects of `tt.dot` (#4033) 1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe `warp_group_dot` is a better name? 2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure, without any side effects 3. Removed hacks in Membar analysis. 4. Unified wgmma code generation in the backend. 5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`. 6. Updated comments in matmul loop pipeline (maybe incomplete). 7. Removed the `ConvertDotConvert` pattern	1 年前
invalid-attributes.mlir	[AMD] Support WMMAv2 in AccelerateAMDMatmulPass (#4452) - Specify kWidth parameter according to the version - For the first iteration fp8 operands are unsupported, no new operand configuration are added for now - Added lit tests Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>	1 年前
invalid.mlir	Verify the `tt.dot` operation thru the dot verification interface of the dialect which defines the C layout. (#4630) The Triton `tt.dot` operation maybe materialized by `MMA` encoding defined by the out-of-tree dialect. The verifier of the `tt.dot` operation dispatches the verification to the dialect of the A and B operands which are mostly `DotOp` layout defined in-tree TritonGPU dialect. The TritonGPU dialect only verifies the `tt.dot` operation for in-tree backend. To use the dialect of the C operand layout verifying the `tt.dot` operation for out-of-tree backends. (And also works for NV and AMD's `MMA` which are defined in-tree TritonGPU dialect.) - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [X] I have added tests. - `/test` for `lit` tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [X] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)	1 年前
loop-pipeline-cuda.mlir	[test] NFC: split loop pipeline test file to prepare sharing (#4399)	1 年前
loop-pipeline-hip.mlir	[AMD] Rewrite transpose ops in pipeliner to mutable memory (#4969) `add_optimize_dot_operands` may introduce a immutable shared buffer for transposed dot operands. Our stream-pipeliner then replaces the immutable buffer with a mutable buffer to be able to reuse it across iterations (pre-fetching). This will then produce incorrect transOps because the input is mutable but the result is immutable. This PR rewrites those transOps to output a mutable layout.	1 年前
loop-pipeline-hopper.mlir	[AMD] Enable dynamic peeling for stream pipeliner (#4650) * Added predication for tt.store for tests * Moved ttng.warp_group_dot for tests	1 年前
loop-pipeline-indirect-load.mlir	[SWP] When num_stages = 2, do not pipeline indirect loads (#4721) For indirect loads, we try to assign them to later stages ``` unsigned stagesBetweenLoads = ceil<unsigned>(numStages - 2, maxIndirectionLevel + 1); int stage = (maxIndirectionLevel - indLevel) * stagesBetweenLoads; schedule.insert(loadOp, stage, loadsClusters[indLevel]); ``` If numStages is 2, there is no later stage to assign the indirect loads to. The fix is to not pipeline the indirect loads. We also generalize to not pipeline an indirect load if the indirection level >= numStages - 1	1 年前
loop-pipeline.mlir	[Triton] Use `UnitAttr` in `tt.reshape` definition (#4947) Make `allow_reorder` and `efficient_layout` `UnitAttr` for a cleaner interface. This way, the operation exposes a `bool getEfficientLayout()` member to check for that attribute and a constructor receiving `bool` arguments for both of these attributes (defaulted to `false`). The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description. Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [X] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [X] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>	1 年前
matmul.mlir	[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes the `triton_gpu.compute-capability` into `triton_gpu.target` to accomodate backends other than cuda. `triton_gpu.target` now accepts a string attribute in the form of `<backend>:<arch>` as values, for example, `cuda:80`, `hip:gfx942`. This commit also removes the direct computat capability reference in the `OptimizeDotOperandsPass`--it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter.	2 年前
ops.mlir	[AMD] Support WMMAv2 in AccelerateAMDMatmulPass (#4452) - Specify kWidth parameter according to the version - For the first iteration fp8 operands are unsupported, no new operand configuration are added for now - Added lit tests Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>	1 年前
optimize-locality.mlir	[Triton] Use `UnitAttr` in `tt.reshape` definition (#4947) Make `allow_reorder` and `efficient_layout` `UnitAttr` for a cleaner interface. This way, the operation exposes a `bool getEfficientLayout()` member to check for that attribute and a constructor receiving `bool` arguments for both of these attributes (defaulted to `false`). The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description. Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [X] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [X] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>	1 年前
optimize_epilogue.mlir	[IR] Rework textual IR for load, store, dot, and other ops (#3587) The initial motivation for this change was that tt.dot's inputPrecision value was printed in textual IR not as a string (e.g. "tf32") but as an opaque number. The trick to fixing this is to explicitly list the attr in the assemblyFormat, instead of relying on attr-dict. attr-dict prints the attr as an integer, but if we list it explicitly, it will be printed as a string. But when I fixed this, I realized that many other ops, such as load and store, also had enums which were printed as numbers rather than strings. So I had to fix those, too. I also added reasonable defaults for most of the arguments to make the textual IR more concise. I then noticed that load and store had complicated asm parsers to handle the fact usually the pointer type was not specified, but sometimes it was necessary. This is because given e.g. `load : tensor<128xf32>` the pointer may be either `tensor<128x!tt.ptr<f32>>` (the default) or `!tt.ptr<tensor<128xf32>>` (for block pointers). So I fixed this too. Now load and store specify the pointer type instead of the value type. This lets us use the built-in asm parser and delete a bunch of code. I also noticed that we had a bunch of dead ttgir tests (actually it seems like these were never enabled); rather than attempt to update them, without any way to test it, I just deleted them. Thanks to @joker-eph for suggesting the solution here -- I never would have figured this out on my own.	2 年前
pipeline-hopper-remove-wait.mlir	[BACKEND] Fix memory side effects of `tt.dot` (#4033) 1. Replaced `triton_nvidia_gpu.async_dot` with `triton_nvidia_gpu.group_dot` which has a `isAsync` attribute. Maybe `warp_group_dot` is a better name? 2. Removed `memdesc` from `tt.dot` because `tt.dot` should be pure, without any side effects 3. Removed hacks in Membar analysis. 4. Unified wgmma code generation in the backend. 5. Introduced the `DotLike` trait for `tt.dot` and `triton_nvidia_gpu.group_dot`. 6. Updated comments in matmul loop pipeline (maybe incomplete). 7. Removed the `ConvertDotConvert` pattern	1 年前
prefetch.mlir	[AMD] Support mfma layout in the prefetch pass (#4771) We've investigated tritongpu-prefetch on the amdgpu and it shows positive result in some cases. This change allows the prefetch pass to rewrite the loop with the `tt.dot` using `amd_mfma` with a fix in the sharedmem offset. Current change doesn't insert the pass to the compilation pipeline yet. The pass is supposed to be placed just after the pipelining pass.	1 年前
reduce-data-duplication.mlir	[Backend] Bypass conversion for suitable blocked to dotOperand layout (#4538) This PR extends shared memory bypass for blocked->dotOperand conversions and adds bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication. This commit is a preparation step towards improving CodeGen and efficiency of skinny dot cases.	1 年前
reorder-instructions.mlir	[BACKEND] Add verifiers for LocalAlloc and other memdesc ops (#4305) Add verifiers for LocalAlloc, LocalStore and CopyToLocal to make sure the mutable attribute is set consistently --------- Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>	1 年前
tritongpu_ops.mlir	[backend] Set target in the passes and IR backend agnostically (#3741) This commit changes the `triton_gpu.compute-capability` into `triton_gpu.target` to accomodate backends other than cuda. `triton_gpu.target` now accepts a string attribute in the form of `<backend>:<arch>` as values, for example, `cuda:80`, `hip:gfx942`. This commit also removes the direct computat capability reference in the `OptimizeDotOperandsPass`--it only needs to know whether we want to inject an additional pattern, which can be controlled by a bool parameter.	2 年前
verify-blocked-layout.mlir	[IR] Rework textual IR for load, store, dot, and other ops (#3587) The initial motivation for this change was that tt.dot's inputPrecision value was printed in textual IR not as a string (e.g. "tf32") but as an opaque number. The trick to fixing this is to explicitly list the attr in the assemblyFormat, instead of relying on attr-dict. attr-dict prints the attr as an integer, but if we list it explicitly, it will be printed as a string. But when I fixed this, I realized that many other ops, such as load and store, also had enums which were printed as numbers rather than strings. So I had to fix those, too. I also added reasonable defaults for most of the arguments to make the textual IR more concise. I then noticed that load and store had complicated asm parsers to handle the fact usually the pointer type was not specified, but sometimes it was necessary. This is because given e.g. `load : tensor<128xf32>` the pointer may be either `tensor<128x!tt.ptr<f32>>` (the default) or `!tt.ptr<tensor<128xf32>>` (for block pointers). So I fixed this too. Now load and store specify the pointer type instead of the value type. This lets us use the built-in asm parser and delete a bunch of code. I also noticed that we had a bunch of dead ttgir tests (actually it seems like these were never enabled); rather than attempt to update them, without any way to test it, I just deleted them. Thanks to @joker-eph for suggesting the solution here -- I never would have figured this out on my own.	2 年前