triton-ascend/lib/Analysis · wxlong_ustc/triton-ascend - AtomGit

GGitHubAutomatic Warp Specialization Optimization (#5622 )

ebffad5a创建于 2025年1月16日历史提交

文件	最后提交记录	最后更新时间
Alias.cpp	[BACKEND] Update llvm to llvm/llvm-project@56152fa377 (#4625) This bumps to llvm/llvm-project@56152fa377 to further include the following fix * https://github.com/llvm/llvm-project/pull/105871 Which was fixing issues in a previous LLVM bump * https://github.com/triton-lang/triton/pull/4624	1 年前
Allocation.cpp	Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```	1 年前
AxisInfo.cpp	[BACKEND] Fix ProgramPoint passing in AxisInfoAnalysis (#5181) Fixes #5122. The `ProgramPoint` [here](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1087) is created on the stack. Then its address is [passed](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1088-L1089) to the MLIR `SparseAnalysis` code, where it is [added as a dependency](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L311) and later [dereferenced](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L90). By the time the `ProramPoint` is dereferenced in the `AbstractSparseForwardDataFlowAnalysis::visit`, the `AxisInfoAnalysis::visitForOpInductionVar` will have finished and the `ProgramPoint` stack variable destroyed. This leads to a segfault (which can be reproed on the base rev with the lit test added in this PR). The code modified in this PR was originally added in #4927, in conjunction with updating the `llvm-project` hash to `b5cc222d7429`. However, as noted in https://github.com/llvm/llvm-project/pull/110344 (the `llvm-project` PR that has made the refactoring prompting the `AxisInfo.cpp` change in #4927): > For dense forward data-flow analysis and other analysis (except dense backward data-flow analysis), the program point corresponding to the original operation can be obtained by `getProgramPointAfter(op)` As the `AxisInfoAnalysis` (in Triton) inherits from `SparseForwardDataFlowAnalysis` (in MLIR), in this PR we follow the above which resolves the segfault issue (as the `ProgramPoint` is now stored in the instance-level state of the pass). P.S. The lit test added in this PR is not exactly minimal. However, I did my best to minimize it starting from the 400-line repro TTGIR in #5122. Further minimization does not seem to expose the segfault.	1 年前
CMakeLists.txt	[BUILD][FRONTEND] working 3P backend (#2896) AMD is enabled by default, but not ripe for usage (not tested). Lots of work will be necessary to make everything robust and maintainable.	2 年前
Membar.cpp	Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```	1 年前
Utility.cpp	Add LL::quotient and remove uses of divideRight and sublayoutIsIdentity (#4968) We add a new abstraction `LL::quotient` that abstracts the idea of "a linear layout does not permute certain dimensions". Doing so, allows us to remove `divideRight` and subsume them into this higher-level abstraction. We also fix a bug in `isCrossCTAConversion`. We also remove some code duplication from `transferWithinThreads` and `cvtReorderRegisters` in favour of a more generic approach. We fix a bug in `sublayout` that meant that `sublayout` would reorder `outDims` at will by using a set instead of a vector. I am missing adding tests for LL::quotient, will do in a minute.	1 年前