fix(MaskAnalysis): support int1 cmp analysis
Co-authored-by: luobaiqing<luobaiqing1@huawei.com>
# message auto-generated for no-merge-commit merge:
!1207 merge fixCmpInt1 into main
fix(MaskAnalysis): support int1 cmp analysis
Created-by: luobaiqing
Commit-by: luobaiqing
Merged-by: ascend-robot
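Description: Previously, mask analysis treated an int1 value as an intScalar. This dropped the mask information and caused precision errors or out-of-bounds memory accesses, which made the diagonal_backward operator fail. In that operator, a cmp first produces an i1, which is then splatted into a tensor. parseIntScalar records only the scalar and no stateInfo, so the splat is treated as a tensor that carries no useful information, and that dimension is lost when the mask is generated: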
```
%49 = arith.cmpi slt, %42, %2 : i64 loc(#loc30)
%54 = tt.splat %49 : i1 -> tensor<1x32x1xi1> loc(#loc32)
%55 = arith.andi %54, %53 : tensor<1x32x1xi1> loc(#loc32)
```
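This PR fixes that case: an int1 value no longer takes the intScalar path. Instead, its definingOp is retrieved and parsed, and the restriction that cmp analysis does not support scalars is lifted. For now, only cmpOp is analyzed for int1; other ops that may produce an int1 result are handled by a fallback.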
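The effect of losing the splatted i1 can be illustrated with a toy Python model. This is only a sketch and not the actual MaskAnalysis code: `splat`, `effective_mask`, and the sample values are hypothetical names chosen for illustration. It compares the mask obtained when the scalar comparison is honored with the mask obtained when the splat is treated as carrying no information.

```python
def splat(value, n):
    """Broadcast a scalar to a list of length n (toy stand-in for tt.splat)."""
    return [value] * n


def effective_mask(scalar_cond, elem_mask, keep_scalar_cond=True):
    """AND a splatted i1 condition with a per-element mask.

    keep_scalar_cond=False mimics the old behaviour described above, where
    the splat of an i1 was treated as carrying no mask information.
    """
    if not keep_scalar_cond:
        return list(elem_mask)
    cond = splat(scalar_cond, len(elem_mask))
    return [c and m for c, m in zip(cond, elem_mask)]


# Hypothetical values: the scalar comparison is false, so no lane may be accessed.
row, limit = 5, 3
elem_mask = [True, True, False, False]

correct = effective_mask(row < limit, elem_mask)
lossy = effective_mask(row < limit, elem_mask, keep_scalar_cond=False)

print(correct)  # every lane is masked off
print(lossy)    # two lanes are wrongly left enabled
```

In the lossy case the enabled lanes would still be loaded or stored even though the scalar condition is false, which matches the out-of-bounds symptom described above.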
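With mask analysis fixed, a second problem surfaced. In the same scenario, when the mask dimension is 1, mask generation is optimized into the following: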
```
%splat_lhs = tt.splat %val1 : i32 -> tensor<128xi32>
%splat_rhs = tt.splat %val2 : i32 -> tensor<128xi32>
%cmp = arith.cmpi slt, %splat_lhs, %splat_rhs : tensor<128xi32>
```
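The mask analysis in tolinalg likewise does not support scalar cmp. Rather than adding scalar cmp handling to the tolinalg mask analysis, we decided after discussion to add a new converter for this case. The converter runs after the linearization pass and hoists binary operations above the splat. After conversion: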
```
%cmp_scalar = arith.cmpi slt, %val1, %val2 : i32
%splat_cmp = tt.splat %cmp_scalar : i1 -> tensor<128xi1>
```
Currently the converter hoists only cmpi, but any binary operator could be added to optimize the splat-then-compute pattern, which would also reduce UB usage.
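The rewrite itself can be sketched on a toy IR in Python. This is an illustrative model under stated assumptions, not the converter added by this PR: the `Scalar`, `Splat`, and `Cmp` classes and the `hoist_cmp_over_splat` and `evaluate` helpers are hypothetical. It rewrites `cmp(splat(a), splat(b))` into `splat(cmp(a, b))` and checks that both forms evaluate to the same result.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    value: int


@dataclass(frozen=True)
class Splat:
    src: object  # scalar expression being broadcast
    size: int    # number of elements in the resulting 1-D tensor


@dataclass(frozen=True)
class Cmp:
    pred: str    # comparison predicate, e.g. "slt"
    lhs: object
    rhs: object


PREDICATES = {"slt": operator.lt, "sle": operator.le, "eq": operator.eq}


def hoist_cmp_over_splat(op):
    """Rewrite cmp(splat(a), splat(b)) to splat(cmp(a, b)); otherwise return op unchanged."""
    if (
        isinstance(op, Cmp)
        and isinstance(op.lhs, Splat)
        and isinstance(op.rhs, Splat)
        and op.lhs.size == op.rhs.size
    ):
        return Splat(Cmp(op.pred, op.lhs.src, op.rhs.src), op.lhs.size)
    return op


def evaluate(op):
    """Evaluate a toy expression: scalars yield a value, tensors yield a list."""
    if isinstance(op, Scalar):
        return op.value
    if isinstance(op, Splat):
        return [evaluate(op.src)] * op.size
    lhs, rhs = evaluate(op.lhs), evaluate(op.rhs)
    fn = PREDICATES[op.pred]
    if isinstance(lhs, list):
        return [fn(a, b) for a, b in zip(lhs, rhs)]
    return fn(lhs, rhs)


before = Cmp("slt", Splat(Scalar(3), 128), Splat(Scalar(7), 128))
after = hoist_cmp_over_splat(before)
assert evaluate(before) == evaluate(after)
```

After the rewrite only one scalar comparison and one splat remain, instead of two splats and an element-wise comparison, which is where the reduced buffer usage would come from.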
See merge request: Ascend/triton-ascend!1207
feat(refactor): unify implicit permute logic
Co-authored-by: candyhong<1102229410@qq.com>
Co-authored-by: LH_123L<liuhuan261@huawei.com>
Co-authored-by: KanuaK<zhouyihan1@huawei.com>
# message auto-generated for no-merge-commit merge:
!1245 merge main/feat-implicit-permute-candy-test into main
feat(refactor): unify implicit permute logic
Created-by: candyhong
Commit-by: candyhong;KanuaK;LH_123L
Merged-by: ascend-robot
Description:
## Background
### Related issue: [#305](https://gitcode.com/Ascend/triton-ascend/issues/305)
The current implicit permute logic in Triton-Ascend (TA) has flaws that impact maintainability, hardware utilization, and scenario coverage:
| Issue Category | Key Symptoms | Impact |
| --------------------- | ----------------------------------------------------------------- | ----------------------------------------------- |
| Dispersed Logic | No unified entry for adaptation logic | High maintenance cost, risk of logic conflicts |
| Missing HW Adaptation | No differentiated processing logic for hardware types | Fails to leverage hardware capabilities (NDDMA) |
| Incomplete Capability | No support for variable-stride analysis or for-loop pointer analysis | Poor coverage of scenarios |
## Optimization Goals
1. **Unify Logic**: Converge all implicit permute logic into a single entry point and define unified rules across hardware types and scenarios.
2. **Complete Capabilities**: Support constant and variable stride analysis and for-loop pointer analysis, and unify the ptr and mask analysis logic.
3. **HW Adaptation**: Implement hardware-specific implicit permute solutions with a software fallback.
## Verification
| Test Suite | Test Count | Main Branch Result | PR Result | Status |
| -------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| generalization(main) | 91373 | [pre-smoke-108](https://devcloud.cn-north-4.huaweicloud.com/codeci/project/1e20b309fcb34b00a0043a87e461c95a/codeci/detail/workspace/15834d5b21a14cd99ef2ea8651c54f7d/111)<br><br>81773 pass | [pre-smoke-111](https://devcloud.cn-north-4.huaweicloud.com/cicd/project/1e20b309fcb34b00a0043a87e461c95a/pipeline/detail/945b705f6f1d4087a1a08d834fbc2c51/cdbf3a4922e241e8a0b1a6dadb3d5a82?v=1)<br><br>81773 pass | Pass |
## Checklist
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - [x] I have not added any lit tests.
  - [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)
See merge request: Ascend/triton-ascend!1245