CustomOp indexing_map and extra_buffers
Co-authored-by: c00866834<chen.pei.chi@huawei.com>
# message auto-generated for no-merge-commit merge:
!1471 merge py-affine-map into main
CustomOp indexing_map and extra_buffers
Created-by: Zackc
Commit-by: c00866834
Merged-by: ascend-robot
Description: The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.
- [ ] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [ ] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [ ] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)
https://gitcode.com/Ascend/AscendNPU-IR/issues/183
See merge request: Ascend/triton-ascend!1471
feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning
Co-authored-by: Xuan Peng<pengxuan9@huawei.com>
# message auto-generated for no-merge-commit merge:
!1268 merge feat/async-compile-0210 into main
feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning
Created-by: HinPeng
Commit-by: Xuan Peng
Merged-by: ascend-robot
Description:
## PR description
1. Introduce the async compile mode from triton v3.5.1 (with minor modifications for compatibility with the current branch and torch-2.7.1)
2. Refactor the autotuner to compile Triton kernels in parallel
## Notice
1. Backport the MLIR_DISABLE_MULTITHREADING environment variable from triton v3.5.1
2. Add TRITON_AUTOTUNE_PARALLEL_COMPILE to control whether the autotuner compiles kernels in parallel; it defaults to '1'
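A minimal sketch of the gating described above, not the autotuner code from this PR. The helper names `parallel_compile_enabled` and `compile_configs`, the `compile_one` callback, and the rule that any value other than '1' disables parallel compilation are all assumptions made for illustration; only the variable name and its default come from the notice.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_compile_enabled(env=os.environ) -> bool:
    # The default of '1' comes from the notice above. Treating every other
    # value as "disabled" is an assumption of this sketch.
    return env.get("TRITON_AUTOTUNE_PARALLEL_COMPILE", "1") == "1"


def compile_configs(configs, compile_one, env=os.environ, max_workers=4):
    """Compile every config, in parallel when the switch is on.

    `compile_one` is a hypothetical stand-in for the per-config compile step.
    """
    if not parallel_compile_enabled(env):
        # Sequential fallback: compile the configs one after another.
        return [compile_one(cfg) for cfg in configs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `configs`.
        return list(pool.map(compile_one, configs))
```

Both paths return results in config order, so the benchmarking step that follows compilation does not need to know which path was taken.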
See merge request: Ascend/triton-ascend!1268
feat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh
Co-authored-by: jeshd<chengmaofan@huawei.com>
# message auto-generated for no-merge-commit merge:
!1486 merge remove_tanh into main
feat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh
Created-by: jeshd
Commit-by: jeshd
Merged-by: ascend-robot
Description: ### Summary
This PR removes tanh from triton.language.math to keep the fork aligned with upstream, where math does not provide a tanh op. It moves tanh usage to triton.language.extra.cann.libdevice.tanh and adds explicit bf16 handling in the libdevice implementation.
To preserve existing user-facing behavior, tanh is now provided by triton.language.extra.cann.libdevice, while the call pattern remains unchanged. In other words, users can still write:
```
from triton.language.math import tanh
```
but the actual implementation is resolved to libdevice.tanh.
### Motivation
Upstream Triton does not define tanh under triton.language.math. Keeping a fork-specific math.tanh introduces unnecessary divergence and increases maintenance cost.
This change removes that divergence and keeps the frontend behavior compatible for existing callers. At the same time, bf16 tanh still needs a workable lowering path. This change makes the call sites explicit and adds a bf16 fallback that computes tanh in fp32 and casts the result back to bf16.
### What Changed
- Removed the tanh op from triton.language.math
- Removed the related builder-side path that depended on the old math.tanh implementation
- Switched the underlying implementation to triton.language.extra.cann.libdevice.tanh
- Kept the import and call style unchanged for users
- Extended libdevice.tanh to support bf16 by casting bf16 inputs to fp32, calling the existing fp32 tanh extern path, and casting the result back to bf16
- Added tanh test coverage for fp16 and bf16 in addition to fp32
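The bf16 fallback (upcast to fp32, compute tanh, cast back) can be illustrated in plain Python. This is a host-side emulation under stated assumptions, not the libdevice code: bf16 values are represented by their 16 raw bits, `math.tanh` stands in for the fp32 extern path, the cast back uses round-to-nearest-even, and the helper names are hypothetical. NaN and infinity are not handled specially.

```python
import math
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round a float32-representable value to bf16 and return its 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round to nearest even: add 0x7FFF plus the lowest bit that is kept.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Upcast bf16 bits to float32; exact, since bf16 is the top half of fp32."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def tanh_bf16(b: int) -> int:
    x32 = bf16_bits_to_f32(b)      # bf16 -> fp32
    y32 = math.tanh(x32)           # stand-in for the fp32 tanh extern call
    return f32_to_bf16_bits(y32)   # fp32 -> bf16
```

Because the upcast is exact, the only rounding error in this scheme comes from the final cast back to bf16.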
### CheckList
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [ ] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [x] This PR does not need a test because corresponding test cases already exist.
- Select one of the following.
- [x] I have not added any lit tests.
- [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)
See merge request: Ascend/triton-ascend!1486
Automatic Warp Specialization Optimization (#5622)
Warp specialization enhances kernel performance by utilizing an
asynchronous execution model, where different parts of the kernel are
handled by separate hardware units. The data communication between these
units, via shared memory on the H100, operates with high efficiency.
With this in mind, we’ve developed an automatic warp specialization
optimization that partitions a user kernel into asynchronous tasks
(which map to warp groups on NVIDIA GPU), which naturally execute
concurrently, leveraging the hardware’s multitasking warp scheduler.
To enable warp specialization, users only need to specify two
autotune flags: num_consumer_groups and num_buffers_warp_spec.
For example, a warp-specialized GEMM implementation might look like
the following. You can find a complete example in 09-persistent-matmul.py.
```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config(
            # Keys must match the kernel's tl.constexpr parameter names.
            {"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64},
            num_stages=2,
            num_warps=4,
            num_consumer_groups=2,
            num_buffers_warp_spec=3,
        ),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def matmul_persistent_ws_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Boundary masks are omitted for brevity: M, N and K are assumed to be
    # multiples of the corresponding block sizes.
    pid = tl.program_id(axis=0)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    # Row-major tile order: each row of output tiles holds num_pid_n tiles.
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Accumulate over K in fp32, advancing both tile pointers each step.
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c = acc.to(tl.float16)
    c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
    tl.store(c_ptrs, c)
```
[BUILD] Use build-system requires to install pybind11 (#4450)
pybind11 is now a build-system requirement. The package is no longer
downloaded in setup.py. Instead the build system automatically
installs the build dependency pybind11 before it runs setup.py. This
simplifies offline builds and makes the build requirement visible in
standard packaging tools.
I kept support for PYBIND11_SYSPATH. The code can be
simplified even more if the feature is no longer needed:
```python
import pybind11

cmake_args = [
    # ... (other arguments elided)
    f"-DPYBIND11_INCLUDE_DIR={pybind11.get_include()}",
    # ... (other arguments elided)
]
```
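For contrast, a sketch of what keeping the PYBIND11_SYSPATH override could look like. This is not the setup.py code from this PR: the helper names are hypothetical, and the assumption that the override points at an install prefix whose headers live under `include/` is mine. `get_include` is passed in as a callable (for example `pybind11.get_include`) so the sketch runs without pybind11 installed.

```python
import os


def resolve_pybind11_include(get_include, env=os.environ):
    """Pick the pybind11 include directory for the CMake invocation."""
    override = env.get("PYBIND11_SYSPATH")
    if override:
        # Assumption: the override is a prefix with headers under include/.
        return os.path.join(override, "include")
    # Default: the build dependency installed by the build system.
    return get_include()


def pybind11_cmake_arg(get_include, env=os.environ):
    include_dir = resolve_pybind11_include(get_include, env)
    return f"-DPYBIND11_INCLUDE_DIR={include_dir}"
```

Dropping the override would reduce this to the single f-string shown above.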
See: #4414
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [ ] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [x] This PR does not need a test because it modifies the build system in a backwards compatible way.
- Select one of the following.
- [x] I have not added any lit tests.
- [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)
Signed-off-by: Christian Heimes <christian@python.org>
fix(ci): update the commit of the submodule
Co-authored-by: wuzw_05<wuzhiwei37@huawei.com>
# message auto-generated for no-merge-commit merge:
!1418 merge ci into main
fix(ci): update the commit of the submodule
Created-by: wuzw_05
Commit-by: wuzw_05
Merged-by: ascend-robot
Description: # Background
This PR fixes compatibility issues between Triton Ascend and the Ascend NPU IR LLVM by updating the submodule commit.
# CheckList
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [ ] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [x] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [x] I have not added any lit tests.
- [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)
See merge request: Ascend/triton-ascend!1418