feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning
Co-authored-by: Xuan Peng<pengxuan9@huawei.com>
# message auto-generated for no-merge-commit merge:
!1268 merge feat/async-compile-0210 into main
feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning
Created-by: HinPeng
Commit-by: Xuan Peng
Merged-by: ascend-robot
Description:
## PR description
1. Introduce async compile mode from triton v3.5.1 (with minor modifications for compatibility with the current branch and torch-2.7.1)
2. Refactor autotuner to compile triton kernel in parallel
## Notice
1. Introduce the MLIR_DISABLE_MULTITHREADING environment variable from triton v3.5.1 ahead of the full upgrade
2. Add TRITON_AUTOTUNE_PARALLEL_COMPILE to control whether the autotuner compiles kernels in parallel; defaults to '1'
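A minimal sketch of how an environment switch like the one above could gate parallel compilation. This is not the autotuner code from this merge request: `compile_configs`, `compile_fn`, and `max_workers` are hypothetical names, and treating only the string '1' as "enabled" is an assumption about how the variable is parsed.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_compile_enabled() -> bool:
    # Default '1' follows the notice above; the exact parsing is assumed.
    return os.environ.get("TRITON_AUTOTUNE_PARALLEL_COMPILE", "1") == "1"


def compile_configs(configs, compile_fn, max_workers=4):
    """Compile every config, in parallel when the switch is on.

    compile_fn is a hypothetical callable that compiles one config and
    returns the compiled kernel. Results keep the order of `configs`.
    """
    configs = list(configs)
    if not parallel_compile_enabled() or len(configs) <= 1:
        # Serial fallback: same results, one config at a time.
        return [compile_fn(cfg) for cfg in configs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order.
        return list(pool.map(compile_fn, configs))
```

Setting `TRITON_AUTOTUNE_PARALLEL_COMPILE=0` in this sketch falls back to the serial loop, which is convenient when debugging a single failing config.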
See merge request: Ascend/triton-ascend!1268
fix interpret bug of cast op rtz mode in overflow case
Co-authored-by: zhuxuejie<zhuxuejie8@huawei.com>
# message auto-generated for no-merge-commit merge:
!1336 merge int_1 into main
fix interpret bug of cast op rtz mode in overflow case
Created-by: zhuxuejie
Commit-by: zhuxuejie
Merged-by: ascend-robot
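Description: 1. Fix a bug in interpreter mode for the cast op in rtz mode on overflow: the result is now converted to inf rather than nan
2. Add support for the scope op, which is handled as a direct pass (no-op)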
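A minimal sketch of the overflow behavior described in item 1, not the interpreter's actual code. It assumes a float32 to float16 cast (the description does not name the dtypes) and emulates round-toward-zero with NumPy; `cast_f32_to_f16_rtz` is a hypothetical helper name.

```python
import numpy as np


def cast_f32_to_f16_rtz(x):
    """Emulate a round-toward-zero float32 -> float16 cast (illustrative only)."""
    x = np.asarray(x, dtype=np.float32)
    with np.errstate(over="ignore"):
        # NumPy rounds to nearest-even; out-of-range values become +/-inf.
        y = x.astype(np.float16)
    # Where nearest-even rounding moved a finite result away from zero,
    # step one representable value back toward zero.
    moved_away = np.isfinite(y) & (np.abs(y.astype(np.float32)) > np.abs(x))
    stepped = np.nextafter(y, np.float16(0))
    # Overflowed entries are not finite, so they stay +/-inf instead of nan.
    return np.where(moved_away, stepped, y)
```

In this sketch an out-of-range input such as `1e6` maps to `inf`, while in-range values are truncated toward zero.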
The core Triton team is small, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (fewer than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.
- [ ] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for lit tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - [ ] I have not added any lit tests.
  - [ ] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)
See merge request: Ascend/triton-ascend!1336
feat(autotune): add autotune function
Co-authored-by: hua_yc<huayanchun@h-partners.com>
# message auto-generated for no-merge-commit merge:
!1329 merge 0303 into main
feat(autotune): add autotune function
Created-by: hua_yc
Commit-by: hua_yc
Merged-by: ascend-robot
Description: add autotune function
See merge request: Ascend/triton-ascend!1329
[nvidia] Support passing TMA descriptors by-value (#4498)
## Motivation
Currently, Triton passes TMA descriptors by-ref through global memory.
This has a number of problems:
* Significant launch overhead (5-10us) for the host-to-device memcpy
* Users must insert fences for TMA descriptor cache flush (see
https://github.com/triton-lang/triton/pull/4342). When users don't
insert these fences correctly, they run into very strange bugs:
https://github.com/triton-lang/triton/issues/4332
* The memcpy makes it nearly impossible to use cudagraphs
There are two possible solutions:
* [Pass the descriptor
by-value](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays)
* [Create the descriptor
on-device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device)
Because of the tricky memory model for TMA descriptors on H100, creating
a descriptor on-device requires moving data back and forth from L2
cache. This is relatively expensive (100s of cycles at least) and
requires the user or compiler to correctly insert release/acquire
fences.
In some cases, there is no way to avoid creating the descriptor
on-device. But for many use-cases, it's perfectly fine to set up the
descriptor on the host and pass by-value, avoiding both performance and
correctness issues. This PR implements the by-value functionality.
## User-level API
Whenever the user provides a kernel param which implements the method
tma_desc_cpu_ptr(), Triton will lower that argument to a
__grid_constant__ by-value param. The existing helper methods
create_[1d/2d]_tma_descriptor were modified to return such a type, so
existing code does not need any changes to take advantage of the new
feature.
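A minimal sketch of the duck-typing rule described above, written without importing Triton. `HostTmaDescriptor` and `classify_kernel_arg` are hypothetical names used only for illustration; the one detail taken from the PR text is that a parameter is treated specially when it implements `tma_desc_cpu_ptr()`. The 128-byte size is the size of a CUDA `CUtensorMap`.

```python
import ctypes

TMA_DESC_SIZE = 128  # size in bytes of a CUDA CUtensorMap


class HostTmaDescriptor:
    """Hypothetical host-side holder for an encoded TMA descriptor."""

    def __init__(self):
        # Zero-filled placeholder; a real descriptor would be encoded
        # on the host by the driver before launch.
        self._buf = ctypes.create_string_buffer(TMA_DESC_SIZE)

    def tma_desc_cpu_ptr(self) -> int:
        # Address of the host copy that a launcher would pass by value.
        return ctypes.addressof(self._buf)


def classify_kernel_arg(arg) -> str:
    """Return how an argument would be passed under the rule above."""
    if hasattr(arg, "tma_desc_cpu_ptr"):
        return "by_value_tma_desc"
    return "ordinary"
```

Because the check is by method name rather than by type, any object exposing `tma_desc_cpu_ptr()` qualifies in this sketch, which is consistent with the note that existing helper-based code needs no changes.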
## Implementation details
When a kernel param with tma_desc_cpu_ptr() is detected, we attach an
attribute to that param at the TTIR level. The attribute is passed
through to TTGIR. When lowering TTGIR to LLIR, we use code ported from
Mosaic (https://github.com/google/jax/pull/22175) to set up the correct
LLVM attributes. The runtime is also modified to pass by-value TMA
descriptors properly.
## Limitations
This feature is currently broken when compiling an IRSource directly
(which is useful for editing IR and re-compiling). That would require
updating some
[regexes](https://github.com/triton-lang/triton/blob/edcc2bcb8dd2e9224c94b689df9cbb7d2986ebea/python/triton/compiler/compiler.py#L52-L53)
which infer the function signature from the IR. IRSource compilation
still works fine for kernels which do not use the new feature.
Once the approach I'm taking here is reviewed, I plan to fix that
limitation, either in this PR or in a follow-up PR.
[CACHE] Use base64 for shorter cache directories (#4553)
Not sure whether this is worth merging. I was annoyed by the long sha256-based
cache directory names, which are mostly 64 chars, so I quickly added shorter
base64-based cache directory names.
Instead of fixing a dozen places that use hashlib.sha256, I patched
the cache manager. 64-char names are mostly reduced to 43-44 chars.
A comparison:
```
> % ls -l $TRITON_CACHE_DIR
total 0
drwxr-xr-x 1 minjang users 40 Aug 21 19:02 44ae4aee7ef0ee0dd54e860cf44627e3b6cedabe87a228ac75988301b8a6bf60
drwxr-xr-x 1 minjang users 26 Aug 21 19:02 82dc2c9a5508bf07c72e02353c1e751dc54aae85666f139b2867b0a1e95e0e7b
drwxr-xr-x 1 minjang users 226 Aug 21 19:02 b8e240968a85711ba57b17bf8450f1ffbc85a8de8cd1f47aa87b241b53f9bf60
drwxr-xr-x 1 minjang users 26 Aug 21 19:03 gtwsmlUIvwfHLgI1PB51HcVKroVmbxObKGewoeleDns
drwxr-xr-x 1 minjang users 40 Aug 21 19:03 RK5K7n7w7g3VToYM9EYn47bO2r6HoiisdZiDAbimv2A
drwxr-xr-x 1 minjang users 226 Aug 21 19:03 uOJAloqFcRulexe_hFDx_7yFqN6M0fR6qHskG1P5v2A
```
test_core.py runs without any errors, and the cache directory has all
base64-based shorter names.
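A minimal sketch of the shortening step, assuming the cache key arrives as a sha256 hex string. `shorten_hex_key` is a hypothetical name, not the function added to the cache manager; the URL-safe alphabet and the stripped `=` padding are inferred from the directory names in the listing above.

```python
import base64


def shorten_hex_key(hex_key: str) -> str:
    """Re-encode a hex digest as URL-safe base64 without padding."""
    raw = bytes.fromhex(hex_key)  # 64 hex chars -> 32 raw bytes
    encoded = base64.urlsafe_b64encode(raw)  # 44 chars including one '='
    return encoded.rstrip(b"=").decode("ascii")  # 43 chars


# Second sha256 name from the listing above.
key = "82dc2c9a5508bf07c72e02353c1e751dc54aae85666f139b2867b0a1e95e0e7b"
short = shorten_hex_key(key)  # "gtwsmlUIvwfHLgI1PB51HcVKroVmbxObKGewoeleDns"
```

The URL-safe alphabet matters here because standard base64 can emit `/`, which is not valid inside a directory name.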
fix(jit): fix incompatibility problem caused by premature introduction
Co-authored-by: luobaiqing<luobaiqing1@huawei.com>
# message auto-generated for no-merge-commit merge:
!1395 merge fix_bug_main into main
fix(jit): fix incompatibility problem caused by premature introduction
Created-by: luobaiqing
Commit-by: luobaiqing
Merged-by: ascend-robot
Description: preload introduced some methods from 3.5.1 ahead of time, but their adaptation was faulty. This PR fixes that.
See merge request: Ascend/triton-ascend!1395