triton-ascend/python · wxlong_ustc/triton-ascend - AtomGit

ascend-robotfeat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh

文件	最后提交记录	最后更新时间
examples	Fix example based on #2701. (#3340)	2 年前
src	CustomOp indexing_map and extra_buffers Co-authored-by: c00866834<chen.pei.chi@huawei.com> # message auto-generated for no-merge-commit merge: !1471 merge py-affine-map into main CustomOp indexing_map and extra_buffers Created-by: Zackc Commit-by: c00866834 Merged-by: ascend-robot Description: The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description. Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [ ] I am not making a trivial change, such as fixing a typo in a comment. - [ ] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) https://gitcode.com/Ascend/AscendNPU-IR/issues/183 See merge request: Ascend/triton-ascend!1471	2 个月前
test	feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning Co-authored-by: Xuan Peng<pengxuan9@huawei.com> # message auto-generated for no-merge-commit merge: !1268 merge feat/async-compile-0210 into main feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning Created-by: HinPeng Commit-by: Xuan Peng Merged-by: ascend-robot Description: ## PR description 1. Introduce async compile mode from triton v3.5.1 (with little modification to be compatible with current branch and torch-2.7.1) 2. Refactor autotuner to compile triton kernel in parallel ## Notice 1. Introduce `MLIR_DISABLE_MULTITHREADING` environment variable ahead from triton v3.5.1 2. Add TRITON_AUTOTUNE_PARALLEL_COMPILE to control whether compiling kernels in parallel in autotuner, default to '1' See merge request: Ascend/triton-ascend!1268	3 个月前
triton	feat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh Co-authored-by: jeshd<chengmaofan@huawei.com> # message auto-generated for no-merge-commit merge: !1486 merge remove_tanh into main feat: Remove tanh op to align with the upstream, and add bf16 fallback to libdevice.tanh Created-by: jeshd Commit-by: jeshd Merged-by: ascend-robot Description: ### Summary This PR removes tanh from triton.language.math to keep the fork aligned with upstream, where math does not provide a tanh op, moves tanh usage to triton.language.extra.cann.libdevice.tanh and adds explicit bf16 handling in the libdevice implementation. To preserve existing user-facing behavior, tanh is now provided by triton.language.extra.cann.libdevice, while the call pattern remains unchanged. In other words, users can still write: ``` from triton.language.math import tanh ``` but the actual implementation is resolved to libdevice.tanh. ### Motivation Upstream Triton does not define tanh under triton.language.math. Keeping a fork-specific math.tanh introduces unnecessary divergence and increases maintenance cost. This change removes that divergence and keeps the frontend behavior compatible for existing callers. At the same time, bf16 tanh still needs a workable lowering path. This change makes the call sites explicit and adds a bf16 fallback that computes tanh in fp32 and casts the result back to bf16. ### What Changed Removed the tanh op from triton.language.math Removed the related builder-side path that depended on the old math.tanh implementation Switched the underlying implementation to triton.language.extra.cann.libdevice.tanh Kept the import and call style unchanged for users Extended libdevice.tanh to support bf16 by casting bf16 inputs to fp32, calling the existing fp32 tanh extern path, and casting the result back to bf16 Added tanh test coverage for fp16 and bf16 in addition to fp32 ### CheckList - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because `corresponding test cases already exist`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) See merge request: Ascend/triton-ascend!1486	2 个月前
tutorials	Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```	1 年前
MANIFEST.in	[FRONTEND] use standard plugin interface for CUDA (#2887)	2 年前
pyproject.toml	[BUILD] Use build-system requires to install pybind11 (#4450) `pybind11` is now a build-system requirement. The package is no longer downloaded in `setup.py`. Instead the build system automatically installs the build dependency `pybind11` before it runs `setup.py`. This simplifies offline builds and makes the build requirement visible in standard packaging tools. I kept support support for `PYBIND11_SYSPATH`. The code can be simplified even more if the feature is no longer needed: ```python cmake_args = [ ... f"-DPYBIND11_INCLUDE_DIR={pybind11.get_include()}", ... ] ``` See: #4414 The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description. Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because it modifies the build system in a backwards compatible way. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: Christian Heimes <christian@python.org>	1 年前
setup.py	fix(ci): update the commit of the submodule Co-authored-by: wuzw_05<wuzhiwei37@huawei.com> # message auto-generated for no-merge-commit merge: !1418 merge ci into main fix(ci): update the commit of the submodule Created-by: wuzw_05 Commit-by: wuzw_05 Merged-by: ascend-robot Description: # BackGround This PR is to fix compatibility issues between Triton Ascend and Ascend NPU IR LLVM # CheckList - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) See merge request: Ascend/triton-ascend!1418	2 个月前