triton-ascend/python/src · wxlong_ustc/triton-ascend - AtomGit

ascend-robot: CustomOp indexing_map and extra_buffers

File	Last commit	Last updated
.clang-format	fix(triton): Support buffer language. Co-authored-by: liheng1234 <liheng134@huawei.com>. Merge request Ascend/triton-ascend!1049 (buffer_language into main); created and committed by liheng1234, merged by ascend-robot. Checklist: [x] static check in local IDE; [x] code formatted in local IDE; [x] null-pointer checks; [x] return-value checks; [x] memory allocated with new/malloc released correctly; [x] interface error scenarios considered; [ ] error logs recorded correctly.	4 months ago
interpreter.cc	[INTERPRETER] Support atomic and other miscellaneous operations (#3519)	2 years ago
ir.cc	CustomOp indexing_map and extra_buffers. Co-authored-by: c00866834 <chen.pei.chi@huawei.com>. Merge request Ascend/triton-ascend!1471 (py-affine-map into main); created by Zackc, committed by c00866834, merged by ascend-robot. See https://gitcode.com/Ascend/AscendNPU-IR/issues/183	2 months ago
ir.h	New TritonOpBuilder for SIMD/SIMT mode to build Triton IR and verify ops. Co-authored-by: chongweizhi <chongweizhi@huawei.com>. Merge request Ascend/triton-ascend!1192 (sync into main); created and committed by chongweizhi, merged by ascend-robot.	4 months ago
llvm.cc	feat(runtime/autotune): add AsyncCompileMode to support parallel compilation in autotuning. Co-authored-by: Xuan Peng <pengxuan9@huawei.com>. Merge request Ascend/triton-ascend!1268 (feat/async-compile-0210 into main); created by HinPeng, committed by Xuan Peng, merged by ascend-robot. PR description: (1) introduce the async compile mode from Triton v3.5.1, with minor modifications for compatibility with the current branch and torch-2.7.1; (2) refactor the autotuner to compile Triton kernels in parallel. Notice: (1) the `MLIR_DISABLE_MULTITHREADING` environment variable is introduced early, from Triton v3.5.1; (2) `TRITON_AUTOTUNE_PARALLEL_COMPILE` controls whether the autotuner compiles kernels in parallel, defaulting to '1'.	3 months ago
main.cc	fix(triton): Support buffer language. Co-authored-by: liheng1234 <liheng134@huawei.com>. Merge request Ascend/triton-ascend!1049 (buffer_language into main); created and committed by liheng1234, merged by ascend-robot. Checklist: [x] static check in local IDE; [x] code formatted in local IDE; [x] null-pointer checks; [x] return-value checks; [x] memory allocated with new/malloc released correctly; [x] interface error scenarios considered; [ ] error logs recorded correctly.	4 months ago
passes.cc	Automatic Warp Specialization Optimization (#5622). Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPUs), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, the user just needs to specify certain autotune flags, i.e., `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like the kernel below. You can find a complete example in 09-persistent-matmul.py.	1 year ago

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config(
            {
                # Keys must match the kernel's tl.constexpr parameter names.
                "BLOCK_M": 128,
                "BLOCK_N": 256,
                "BLOCK_K": 64,
            },
            num_stages=2,
            num_warps=4,
            # Autotune flags that enable warp specialization.
            num_consumer_groups=2,
            num_buffers_warp_spec=3,
        ),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def matmul_persistent_ws_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    # Row-major tile order: each row of output tiles holds num_pid_n programs.
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # No masks: assumes M, N and K are multiples of the block sizes.
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c = acc.to(tl.float16)
    c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
    tl.store(c_ptrs, c)
```
passes.h	[NFC] Remove uses of deprecated GEN_PASS_CLASSES (#3971). The `GEN_PASS_CLASSES` macro was deprecated in MLIR. Remove uses of this macro in the TritonGPU dialect transform passes. With the new macro, explicit declaration and definition of constructor functions is not necessary anymore. Also moves the passes to the `mlir::triton::gpu` namespace and uses base constructors with pass options. Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>	1 year ago
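
The llvm.cc commit above names two environment variables, `TRITON_AUTOTUNE_PARALLEL_COMPILE` (default '1') and `MLIR_DISABLE_MULTITHREADING`. A minimal sketch of setting them from Python follows. The helper `configure_autotune_env` is hypothetical and not part of triton-ascend. It also assumes, without confirmation from the commit message, that '0' turns parallel compilation off, that '1' turns the MLIR switch on, and that the variables must be set before any kernel is compiled.

```python
import os


def configure_autotune_env(parallel_compile: bool = True,
                           disable_mlir_threads: bool = False) -> dict:
    """Hypothetical helper: export the two variables and return what was set."""
    # The commit message states the default is '1'; '0' meaning "off" is assumed.
    settings = {"TRITON_AUTOTUNE_PARALLEL_COMPILE": "1" if parallel_compile else "0"}
    if disable_mlir_threads:
        # Assumed convention: '1' enables the switch.
        settings["MLIR_DISABLE_MULTITHREADING"] = "1"
    os.environ.update(settings)
    return settings


# Assumed usage: call this before importing triton or compiling any kernel.
applied = configure_autotune_env(parallel_compile=False, disable_mlir_threads=True)
print(applied)
```

Setting the variables in the launching shell should work equally well; the helper only keeps the choice next to the code that triggers autotuning.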