triton-ascend/third_party/ascend/lib/TritonToStructured · Ascend/triton-ascend - AtomGit

ascend-robot: feat(Cannonicalizer): add if converter to extract add expr from if body; support constant dense select in linalg.parseSelect

cbaaa1af, created on March 25

File	Last commit	Last updated
CMakeLists.txt	fix(MaskAnalysis): support int1 cmp analysis
Co-authored-by: luobaiqing<luobaiqing1@huawei.com>
# message auto-generated for no-merge-commit merge: !1207 merge fixCmpInt1 into main
Created-by: luobaiqing
Commit-by: luobaiqing
Merged-by: ascend-robot
Description: Previously, mask analysis treated an int1 value as an intScalar. This lost mask information and caused precision problems or out-of-bounds memory accesses, which made the diagonal_backward operator fail. In that operator, a cmp first produces an i1, which is then splatted into a tensor. parseIntScalar records only the scalar and no stateInfo, so the splat concludes that the tensor carries no useful information, and that dimension is dropped when the mask is generated.
```
%49 = arith.cmpi slt, %42, %2 : i64 loc(#loc30)
%54 = tt.splat %49 : i1 -> tensor<1x32x1xi1> loc(#loc32)
%55 = arith.andi %54, %53 : tensor<1x32x1xi1> loc(#loc32)
```
This PR fixes that case: an int1 value no longer takes the intScalar path. Instead, its definingOp is fetched and parsed, and the restriction that cmp cannot be analyzed on scalars is lifted. Currently only a cmpOp producing int1 is analyzed; other ops that may produce int1 fall back to the previous handling.
With mask analysis fixed, a second problem appeared. In the same scenario, when the mask has one dimension, mask generation is optimized into the following form:
```
%splat_lhs = tt.splat %val1 : i32 -> tensor<128xi32>
%splat_rhs = tt.splat %val2 : i32 -> tensor<128xi32>
%cmp = arith.cmpi slt, %splat_lhs, %splat_rhs : tensor<128xi32>
```
The mask analysis in tolinalg does not support scalar cmp either. Adding scalar cmp support to the tolinalg mask analysis was considered, but after discussion it was judged better to add a new converter for this case. A converter is therefore added after the linearization pass to hoist binary operations. After conversion:
```
%cmp_scalar = arith.cmpi slt, %val1, %val2 : i32
%splat_cmp = tt.splat %cmp_scalar : i1 -> tensor<128xi1>
```
For now the converter hoists only cmpi, but in principle any binary operator could be added to optimize the splat-then-compute pattern, which would also reduce UB usage.
See merge request: Ascend/triton-ascend!1207	4 months ago
CannonicalizerConverter.cpp	feat(Cannonicalizer): add if converter to extract add expr from if body; support constant dense select in linalg.parseSelect
Co-authored-by: luobaiqing<luobaiqing1@huawei.com>
# message auto-generated for no-merge-commit merge: !1397 merge ifConverter into main
Created-by: luobaiqing
Commit-by: luobaiqing
Merged-by: ascend-robot
Description:
1. Add a converter that optimizes ifOp by moving the iterative update of an argument outside the ifOp, which simplifies the ifOp.
```
%arg = xxx
%if = scf.if %cond {
  %thenYield = arith.addi %arg, %other
  scf.yield %thenYield, %xxx, ...
} else {
  scf.yield %arg, %xxx, ...
}
------->
%arg = xxx
%cst0 = arith.constant dense <0>
%if = scf.if %cond {
  ...
  scf.yield %other, %xxx, ...
} else {
  scf.yield %cst0, %xxx, ...
}
%newIfRes = arith.addi %arg, %if#0
----> In the special case handled here, %other is another constant dense defined outside the if, so the result is actually optimized to:
%arg = xxx
%cst0 = arith.constant dense <0>
%updateVal = arith.select %cond, %other, %cst0
%if = scf.if %cond {
  ...
  scf.yield %xxx, ...
} else {
  scf.yield %xxx, ...
}
%newIfRes = arith.addi %arg, %updateVal
```
2. Change parseIf/select in the discrete memory access analysis: if the results of all branches are scalarlike, the state is set to scalarlike; otherwise it is treated as unstructure. (Scalarlike results mean the access is still contiguous and only the offset may change.) Note: relaxing this exposed a special case. When the shape's dimensions are 1, the op is inside a for loop, and the access is indirect, the result is also scalarlike, but rewriteTerminator fails during rewriteloop. This case is worked around for now by not relaxing it: when all dimensions of the select result's shape are 1, it is still handled as discrete memory access. The restriction will be lifted once the cause is understood.
3. Change parseSelect in the linalg pass to support the case where both options of the selectOp are constant dense, by creating a scalar select that serves as the scalar offset.
See merge request: Ascend/triton-ascend!1397	2 months ago
MaskAnalysis.cpp	fix(MaskAnalysis): support int1 cmp analysis. See merge request: Ascend/triton-ascend!1207	4 months ago
MemOpConverter.cpp	feat(refactor): unify implicit permute logic
Co-authored-by: candyhong<1102229410@qq.com>
Co-authored-by: LH_123L<liuhuan261@huawei.com>
Co-authored-by: KanuaK<zhouyihan1@huawei.com>
# message auto-generated for no-merge-commit merge: !1245 merge main/feat-implicit-permute-candy-test into main
Created-by: candyhong
Commit-by: candyhong;KanuaK;LH_123L
Merged-by: ascend-robot
Description:
## Background
### Related ISSUE: [#305](https://gitcode.com/Ascend/triton-ascend/issues/305)
The current implicit permute logic in Triton-Ascend (TA) has flaws that impact maintainability, hardware utilization, and scenario coverage:

| Issue Category | Key Symptoms | Impact |
| --- | --- | --- |
| Dispersed Logic | No unified entry for adaptation logic | High maintenance cost, risk of logic conflicts |
| Missing HW Adaptation | No differentiated processing logic for hardware types | Failed to leverage hardware capabilities (NDDMA) |
| Incomplete Capability | No support for variable stride analysis/For-loop pointer analysis | Poor coverage of scenarios |

## Optimization Goals
1. Unify Logic: Converge all implicit permute logic to a single entry, define unified rules for hardware/scenarios.
2. Complete Capabilities: Support constant/variable stride analysis, For-loop pointer analysis, and unify ptr/mask analysis logic.
3. HW Adaptation: Implement hardware-specific/software fallback implicit permute solutions.
## Verification

| Test Suite | Test Count | Main Branch Result | PR Result | Status |
| --- | --- | --- | --- | --- |
| generalization(main) | 91373 | [pre-smoke-108](https://devcloud.cn-north-4.huaweicloud.com/codeci/project/1e20b309fcb34b00a0043a87e461c95a/codeci/detail/workspace/15834d5b21a14cd99ef2ea8651c54f7d/111)<br><br>81773 pass | [per-smoke-111](https://devcloud.cn-north-4.huaweicloud.com/cicd/project/1e20b309fcb34b00a0043a87e461c95a/pipeline/detail/945b705f6f1d4087a1a08d834fbc2c51/cdbf3a4922e241e8a0b1a6dadb3d5a82?v=1)<br><br>81773 pass | Pass |

## Checklist
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

See merge request: Ascend/triton-ascend!1245	3 months ago
PtrAnalysis.cpp	feat(refactor): unify implicit permute logic. See merge request: Ascend/triton-ascend!1245	3 months ago
TritonToStructuredPass.cpp	Merge Triton-Ascend 425236de into release/3.5.x	2 months ago
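The compare-hoisting rewrite described in merge request !1207 (a compare of two splats becomes a splat of one scalar compare) can be illustrated with a toy model. The sketch below is plain Python, not MLIR and not the triton-ascend converter: the classes `Scalar`, `Splat`, `Cmp` and the functions `hoist_cmp_over_splat` and `evaluate` are hypothetical names introduced only for illustration, and only the `slt` predicate is modelled. It assumes both splats have the same shape, as in the `tensor<128xi32>` case from the commit message.

```python
import operator
from dataclasses import dataclass
from math import prod
from typing import Tuple, Union


@dataclass(frozen=True)
class Scalar:
    name: str  # stands in for an SSA scalar such as %val1


@dataclass(frozen=True)
class Splat:
    src: "Expr"             # value being broadcast
    shape: Tuple[int, ...]  # result tensor shape, e.g. (128,)


@dataclass(frozen=True)
class Cmp:
    pred: str  # comparison predicate; only "slt" is modelled here
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Scalar, Splat, Cmp]
PREDICATES = {"slt": operator.lt}


def hoist_cmp_over_splat(expr):
    """Rewrite cmp(splat(a), splat(b)) into splat(cmp(a, b)).

    Any expression that does not match the pattern is returned unchanged.
    """
    if (
        isinstance(expr, Cmp)
        and isinstance(expr.lhs, Splat)
        and isinstance(expr.rhs, Splat)
        and expr.lhs.shape == expr.rhs.shape
    ):
        scalar_cmp = Cmp(expr.pred, expr.lhs.src, expr.rhs.src)
        return Splat(scalar_cmp, expr.lhs.shape)
    return expr


def evaluate(expr, env):
    """Evaluate an expression; tensors are modelled as flat Python lists."""
    if isinstance(expr, Scalar):
        return env[expr.name]
    if isinstance(expr, Splat):
        return [evaluate(expr.src, env)] * prod(expr.shape)
    lhs, rhs = evaluate(expr.lhs, env), evaluate(expr.rhs, env)
    op = PREDICATES[expr.pred]
    if isinstance(lhs, list):
        return [op(a, b) for a, b in zip(lhs, rhs)]
    return op(lhs, rhs)


# The pattern from the commit message: compare two splatted scalars.
before = Cmp("slt", Splat(Scalar("val1"), (128,)), Splat(Scalar("val2"), (128,)))
after = hoist_cmp_over_splat(before)
env = {"val1": 3, "val2": 5}
same_result = evaluate(before, env) == evaluate(after, env)
```

In this model the rewritten form performs one scalar comparison and one broadcast instead of two broadcasts and an elementwise comparison, which is the saving the commit message attributes to the hoisting.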