fix:Fix the error caused by missing cache_dir attribute & prioritize using the torch_npu getCurrentRawStreamNoWait API.
Co-authored-by: wangzhanpeng5<wangzhanpeng5@huawei.com>
# message auto-generated for no-merge-commit merge:
!1476 merge 0328_main_FileCache_getCurrentRawStreamNoWait into main
fix:Fix the error caused by missing cache_dir attribute & prioritize using the torch_npu getCurrentRawStreamNoWait API.
Created-by: wangzhanpeng5
Commit-by: wangzhanpeng5
Merged-by: ascend-robot
Description: ## Fix issue:
[https://gitcode.com/Ascend/triton-ascend/issues/365](https://gitcode.com/Ascend/triton-ascend/issues/365)
### Issue Description
When compilation fails, the current exception handler attempts to access fn_cache_manager.cache_dir to include the cache path in the error message. However, if fn_cache_manager is not an instance of FileCacheManager (e.g., another cache implementation), the object lacks the cache_dir attribute, causing an AttributeError. This masks the original compilation error and prevents proper error reporting.
### Fix
Before adding cache information, check the object type using isinstance(fn_cache_manager, FileCacheManager):
- If it is a FileCacheManager, use the cache_dir attribute to display the cache directory.
- Otherwise, fall back to showing {file_name}.{ext} to ensure at least the filename is provided.
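The type check described above can be sketched as follows. This is a minimal illustration, not the actual patch: both classes are stand-ins (the real FileCacheManager lives in Triton), and RemoteCacheManager and describe_cache_location are hypothetical names introduced only for this example.

```python
class FileCacheManager:
    """Stand-in for Triton's file-backed cache manager, which exposes cache_dir."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir


class RemoteCacheManager:
    """Hypothetical cache manager that has no cache_dir attribute."""


def describe_cache_location(fn_cache_manager, file_name, ext):
    # Only a FileCacheManager is known to carry cache_dir, so check the type
    # instead of touching the attribute blindly inside the error handler.
    if isinstance(fn_cache_manager, FileCacheManager):
        return f"cache dir: {fn_cache_manager.cache_dir}"
    # Fallback: report at least the artifact name so the original
    # compilation error is not masked by an AttributeError.
    return f"{file_name}.{ext}"
```

With this shape, the exception handler can always build its message, whichever cache implementation is configured.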
## Fix issue:
[https://gitcode.com/Ascend/triton-ascend/issues/369](https://gitcode.com/Ascend/triton-ascend/issues/369)
### Issue Description
Calling get_current_stream can hang when asynchronous communication is combined with Triton kernel launches. Replacing it with getCurrentRawStreamNoWait resolves these hangs.
### Fix
Check for the existence of the new interface using hasattr(torch_npu._C, "_npu_getCurrentRawStreamNoWait").
If available, import and call _npu_getCurrentRawStreamNoWait; otherwise, fall back to the original _npu_getCurrentRawStream.
Maintain consistent interface behavior to ensure backward compatibility.
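A rough sketch of the feature detection described above. To keep it runnable without torch_npu installed, the module is passed in as a parameter and emulated with SimpleNamespace objects. The helper name select_stream_getter, the single device argument, and the tuple return values are assumptions made for illustration only.

```python
from types import SimpleNamespace


def select_stream_getter(c_module):
    """Return the preferred raw-stream getter exposed by c_module."""
    # Prefer the no-wait variant when the installed torch_npu provides it.
    if hasattr(c_module, "_npu_getCurrentRawStreamNoWait"):
        return c_module._npu_getCurrentRawStreamNoWait
    # Older releases only expose the original getter.
    return c_module._npu_getCurrentRawStream


# Stand-ins for torch_npu._C in an older and a newer release (hypothetical).
old_c = SimpleNamespace(_npu_getCurrentRawStream=lambda device: ("raw", device))
new_c = SimpleNamespace(
    _npu_getCurrentRawStream=lambda device: ("raw", device),
    _npu_getCurrentRawStreamNoWait=lambda device: ("raw_nowait", device),
)
```

In real code the argument would be torch_npu._C, and callers keep using a single getter regardless of which API was selected.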
See merge request: Ascend/triton-ascend!1476
feat: Optimized the CV ssbuf feature
Co-authored-by: wuzw_05<wuzhiwei37@huawei.com>
# message auto-generated for no-merge-commit merge:
!1415 merge ssbuf into main
feat: Optimized the CV ssbuf feature
Created-by: wuzw_05
Commit-by: wuzw_05
Merged-by: ascend-robot
Description: # Background
The ssbuf scalar state stores the various states of the CV pipeline. Based on the data transmission state, a greedy algorithm minimizes the idle gaps (pipeline bubbles) in the Cube computation.
**Core goals**
1. Construct a "first-in, first-out" buffer queue to ensure orderly memory allocation.
2. Control is based on buffernum to prevent precision issues caused by control flow.
# CheckList
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [ ] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [x] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [x] I have not added any lit tests.
- [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)
See merge request: Ascend/triton-ascend!1415
feat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh
Co-authored-by: jeshd<chengmaofan@huawei.com>
# message auto-generated for no-merge-commit merge:
!1486 merge remove_tanh into main
feat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh
Created-by: jeshd
Commit-by: jeshd
Merged-by: ascend-robot
Description: ### Summary
This PR removes tanh from triton.language.math to keep the fork aligned with upstream, where math does not provide a tanh op. It moves tanh usage to triton.language.extra.cann.libdevice.tanh and adds explicit bf16 handling in the libdevice implementation.
To preserve existing user-facing behavior, tanh is now provided by triton.language.extra.cann.libdevice, while the call pattern remains unchanged. In other words, users can still write:
```
from triton.language.math import tanh
```
but the actual implementation is resolved to libdevice.tanh.
### Motivation
Upstream Triton does not define tanh under triton.language.math. Keeping a fork-specific math.tanh introduces unnecessary divergence and increases maintenance cost.
This change removes that divergence and keeps the frontend behavior compatible for existing callers. At the same time, bf16 tanh still needs a workable lowering path. This change makes the call sites explicit and adds a bf16 fallback that computes tanh in fp32 and casts the result back to bf16.
### What Changed
- Removed the tanh op from triton.language.math
- Removed the related builder-side path that depended on the old math.tanh implementation
- Switched the underlying implementation to triton.language.extra.cann.libdevice.tanh
- Kept the import and call style unchanged for users
- Extended libdevice.tanh to support bf16 by casting bf16 inputs to fp32, calling the existing fp32 tanh extern path, and casting the result back to bf16
- Added tanh test coverage for fp16 and bf16 in addition to fp32
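The bf16 fallback (cast up to fp32, compute, cast back) can be illustrated numerically in plain Python. This is only an emulation under stated assumptions: it is not the libdevice code, the helper names are hypothetical, bf16 is modeled as the top 16 bits of an fp32 value with round-to-nearest-even, NaN inputs are not handled, and math.tanh (double precision) stands in for the fp32 extern call.

```python
import math
import struct


def round_to_fp32(x):
    """Round a Python float (double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x):
    """Round a Python float to the nearest bf16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the upper 16 bits of the fp32 pattern; add a bias so that
    # dropping the lower 16 bits rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def tanh_bf16(x):
    """Emulate the fallback: bf16 input -> fp32 tanh -> bf16 result."""
    x_fp32 = round_to_fp32(round_to_bf16(x))   # widening bf16 to fp32 is exact
    y_fp32 = round_to_fp32(math.tanh(x_fp32))  # stand-in for the fp32 tanh path
    return round_to_bf16(y_fp32)               # cast the result back to bf16
```

The result is always a value representable in bf16, and its error relative to the exact tanh is bounded by bf16's 8 bits of precision.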
### CheckList
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [ ] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [x] This PR does not need a test because corresponding test cases already exist.
- Select one of the following.
- [x] I have not added any lit tests.
- [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)
See merge request: Ascend/triton-ascend!1486
fix(jit): fix incompatibility problem caused by premature introduction
Co-authored-by: luobaiqing<luobaiqing1@huawei.com>
# message auto-generated for no-merge-commit merge:
!1395 merge fix_bug_main into main
fix(jit): fix incompatibility problem caused by premature introduction
Created-by: luobaiqing
Commit-by: luobaiqing
Merged-by: ascend-robot
Description: preload introduced some methods from 3.5.1 ahead of time, but their adaptation was faulty. This PR fixes that.
See merge request: Ascend/triton-ascend!1395
fix(bug): allow using npu-smi info to get the device info
Co-authored-by: 刘风昇<liufengsheng2@huawei.com>
# message auto-generated for no-merge-commit merge:
!1338 merge acl2 into main
fix(bug): allow using npu-smi info to get the device info
Created-by: meloliu12327
Commit-by: 刘风昇
Merged-by: ascend-robot
Description: Add a way to obtain the hardware model information using npu-smi info.
See merge request: Ascend/triton-ascend!1338
feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning
Co-authored-by: Xuan Peng<pengxuan9@huawei.com>
# message auto-generated for no-merge-commit merge:
!1268 merge feat/async-compile-0210 into main
feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning
Created-by: HinPeng
Commit-by: Xuan Peng
Merged-by: ascend-robot
Description:
## PR description
1. Introduce async compile mode from triton v3.5.1 (with minor modifications for compatibility with the current branch and torch-2.7.1)
2. Refactor autotuner to compile triton kernel in parallel
## Notice
1. Introduce the MLIR_DISABLE_MULTITHREADING environment variable from triton v3.5.1 ahead of time
2. Add TRITON_AUTOTUNE_PARALLEL_COMPILE to control whether the autotuner compiles kernels in parallel; it defaults to '1'
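A minimal sketch of how such a switch might gate parallel compilation. It does not reproduce AsyncCompileMode or the autotuner. A standard-library thread pool is swapped in as a stand-in, the helper names and max_workers parameter are hypothetical, and treating only the exact value '1' as "enabled" is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_compile_enabled():
    # Per the notice above, the switch defaults to '1' (enabled).
    return os.environ.get("TRITON_AUTOTUNE_PARALLEL_COMPILE", "1") == "1"


def compile_configs(configs, compile_fn, max_workers=4):
    """Compile every config and return the results in config order."""
    if not parallel_compile_enabled():
        # Serial path: compile one config after another.
        return [compile_fn(config) for config in configs]
    # Parallel path: submit all configs to a pool; map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compile_fn, configs))
```

Both paths return results in the same order, so the benchmarking step that follows compilation does not need to know which mode was used.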
See merge request: Ascend/triton-ascend!1268
[Frontend] [BC breaking] Implement PyTorch/JAX/NumPy 2.0 typecast semantics for scalars (#4613)
The idea here is that if you have a tensor t of dtype uint8 and you want
to do t << 2, the result should be of dtype uint8, not int32!
We do this for all dunder ops that don't output booleans.
This follows roughly the semantics of PyTorch, JAX and NumPy 2.0.
I would like to document this behaviour, but it's not clear to me where
is the best place to say so.
The PR has much more churn than I would like, as I had to move the
to_tensor method to semantic (which is where it belongs anyway).
For reviewers, the only two relevant changes are in
computation_type_impl and in bitwise_op_type_checking_impl, where we say
that we do perform casting for bitwise ops.
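The scalar rule can be illustrated with a deliberately simplified sketch. This is not computation_type_impl: the function name and dtype strings are hypothetical, and only the same-kind cases that PyTorch, JAX and NumPy 2.0 agree on are modeled (a Python int combined with an integer or float tensor, a Python float combined with a float tensor).

```python
INT_DTYPES = {"int8", "int16", "int32", "int64",
              "uint8", "uint16", "uint32", "uint64"}
FLOAT_DTYPES = {"float16", "bfloat16", "float32", "float64"}


def result_dtype(tensor_dtype, scalar):
    """Result dtype of `tensor <op> scalar` under weak-scalar semantics."""
    if isinstance(scalar, bool):
        raise NotImplementedError("bool scalars are outside this sketch")
    if isinstance(scalar, int) and tensor_dtype in INT_DTYPES | FLOAT_DTYPES:
        # A Python int never widens the tensor: uint8 << 2 stays uint8.
        return tensor_dtype
    if isinstance(scalar, float) and tensor_dtype in FLOAT_DTYPES:
        # A Python float keeps the tensor's floating-point dtype.
        return tensor_dtype
    # Cross-kind cases (e.g. int tensor with a float scalar) differ between
    # frameworks and are intentionally left out.
    raise NotImplementedError(f"{tensor_dtype} with {type(scalar).__name__}")
```

Under the previous behaviour described above, the first case would have produced int32 for a uint8 tensor; with weak scalars the tensor's dtype decides.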