09-persistent-matmul.py: Automatic Warp Specialization Optimization (#5622)

Warp specialization improves kernel performance through an asynchronous execution model in which different parts of the kernel are handled by separate hardware units. On the H100, these units exchange data through shared memory, which is highly efficient. Building on this, the automatic warp specialization optimization partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPUs). These tasks naturally execute concurrently, leveraging the hardware's multitasking warp scheduler.

To enable warp specialization, the user only needs to set two autotune flags, ``num_consumer_groups`` and ``num_buffers_warp_spec``. For example, a warp-specialized GEMM implementation might look like the following. A complete example is in ``09-persistent-matmul.py``.

.. code-block:: python

    import triton
    import triton.language as tl


    @triton.autotune(
        configs=[
            triton.Config(
                # Keys must match the kernel's constexpr parameter names.
                {
                    "BLOCK_M": 128,
                    "BLOCK_N": 256,
                    "BLOCK_K": 64,
                },
                num_stages=2,
                num_warps=4,
                num_consumer_groups=2,
                num_buffers_warp_spec=3,
            ),
        ],
        key=["M", "N", "K"],
    )
    @triton.jit
    def matmul_persistent_ws_kernel(
        a_ptr, b_ptr, c_ptr, M, N, K,
        stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    ):
        # Map the 1D program id to a (pid_m, pid_n) output tile, row-major.
        pid = tl.program_id(axis=0)
        num_pid_n = tl.cdiv(N, BLOCK_N)
        pid_m = pid // num_pid_n
        pid_n = pid % num_pid_n

        # Loads and stores are unmasked, so M, N and K are assumed to be
        # multiples of BLOCK_M, BLOCK_N and BLOCK_K respectively.
        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        offs_k = tl.arange(0, BLOCK_K)
        a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
        b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)

        # Accumulate in float32 while stepping along the K dimension.
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, tl.cdiv(K, BLOCK_K)):
            a = tl.load(a_ptrs)
            b = tl.load(b_ptrs)
            acc += tl.dot(a, b)
            a_ptrs += BLOCK_K * stride_ak
            b_ptrs += BLOCK_K * stride_bk

        # Convert to float16 and write the output tile.
        c = acc.to(tl.float16)
        c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
        tl.store(c_ptrs, c)
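
The split into asynchronous tasks that communicate through a small number of buffers can be pictured with a loose CPU-side analogy. The sketch below is not how Triton implements warp specialization. It only illustrates one producer ("load") task feeding several consumer ("compute") tasks through a bounded queue. The names ``run_pipeline``, ``NUM_BUFFERS`` and ``NUM_CONSUMERS`` are hypothetical, and their correspondence to ``num_buffers_warp_spec`` and ``num_consumer_groups`` is an assumption made for illustration.

.. code-block:: python

    import queue
    import threading

    NUM_BUFFERS = 3    # assumed analogue of num_buffers_warp_spec
    NUM_CONSUMERS = 2  # assumed analogue of num_consumer_groups
    _DONE = object()   # sentinel telling a consumer to stop


    def run_pipeline(tiles):
        """Sum a * b over (a, b) pairs using one producer and several consumers."""
        # Bounded queue: the producer blocks once NUM_BUFFERS items are in flight.
        buffers = queue.Queue(maxsize=NUM_BUFFERS)
        partials = [0] * NUM_CONSUMERS  # one slot per consumer, so no shared writes

        def producer():
            for tile in tiles:
                buffers.put(tile)       # "load" stage
            for _ in range(NUM_CONSUMERS):
                buffers.put(_DONE)      # one stop signal per consumer

        def consumer(idx):
            while True:
                tile = buffers.get()
                if tile is _DONE:
                    break
                a, b = tile
                partials[idx] += a * b  # "compute" stage

        threads = [threading.Thread(target=producer)]
        threads += [
            threading.Thread(target=consumer, args=(i,)) for i in range(NUM_CONSUMERS)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(partials)

The bounded queue is the point of the analogy: the producer can run ahead of the consumers, but only by as many items as there are buffers.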

Tutorials
=========

Below is a gallery of tutorials for writing various basic operations with Triton. It is recommended that you read through the tutorials in order, starting with the simplest one.

To install the dependencies for the tutorials:

.. code-block:: bash

    cd triton
    pip install -e './python[tutorials]'
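
After installing, a quick check that the expected packages are importable can save a failed tutorial run. The following is a minimal sketch. The helper ``missing_packages`` is hypothetical, and the package list in the usage comment is an assumption about what the tutorials need, not something taken from the project's packaging metadata.

.. code-block:: python

    import importlib.util


    def missing_packages(names):
        """Return the top-level package names that cannot be found for import."""
        return [name for name in names if importlib.util.find_spec(name) is None]


    # Example usage (the package list is an assumption; adjust it as needed):
    # print(missing_packages(["triton", "torch", "matplotlib", "pandas", "tabulate"]))

An empty list means every listed package was found.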