triton-ascend/third_party/amd · wxlong_ustc/triton-ascend - AtomGit

GitHub: Automatic Warp Specialization Optimization (#5622)

Commit ebffad5a, created on January 16, 2025

File	Last commit	Last updated
backend	Automatic Warp Specialization Optimization (#5622)	1 year ago

Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. Data communication between these units goes through shared memory, which is highly efficient on the H100. With this in mind, we've developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPUs). These tasks naturally execute concurrently, leveraging the hardware's multitasking warp scheduler.

To enable warp specialization, the user only needs to specify two autotune flags, `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like the following. A complete example is in 09-persistent-matmul.py.

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config(
            # Meta-parameter keys must match the kernel's tl.constexpr arguments.
            {
                "BLOCK_M": 128,
                "BLOCK_N": 256,
                "BLOCK_K": 64,
            },
            num_stages=2,
            num_warps=4,
            num_consumer_groups=2,    # warp specialization flag
            num_buffers_warp_spec=3,  # warp specialization flag
        ),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def matmul_persistent_ws_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Map the 1D program id onto a (pid_m, pid_n) output tile, row-major.
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n

    # Pointers to the first A and B tiles of this program.
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)

    # Accumulate along K; loads are unmasked, so M, N and K are assumed
    # to be multiples of the block sizes.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Write the output tile back as float16.
    c = acc.to(tl.float16)
    c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
    tl.store(c_ptrs, c)
```
include	[AMD] Add pass to convert tt.load/tt.store to buffer operations (#4966) This PR introduces only a ttgir pass that converts `tt.load`/`tt.store` to `amdgpu.buffer_load`/`amdgpu.buffer_store` _when this is possible_, which means checking three conditions: 1. The pointer arithmetic has been canonicalized (`scalarPtr->splat->addptr->load/store`). 2. The offsets are 32-bit. 3. The offsets are non-negative. We use a mix of analysis and assumptions to verify this condition. For now the functionality is gated behind the `AMDGCN_USE_BUFFER_OPS` flag, which also covers the pointer canonicalization pass, since that pass mostly exists to enable this conversion.	1 year ago
language	[AMD] Add fast_expf to libdevice (#4937) This PR adds a `fast_expf` operator to libdevice for AMD hardware. In line with the other operators in the exp family, the handling of denormal inputs is controlled by `__HIP_FTZ`, which is currently fixed to True. - If `__HIP_FTZ = 1`, the operator uses `llvm.amdgcn.exp2.f32`, which flushes denormals in inputs and outputs; - If `__HIP_FTZ = 0`, the operator uses `llvm.exp2.f32`, which does not flush denormals. Ref: https://github.com/ROCm/llvm-project/blob/amd-staging/amd/device-libs/cuda2gcn/src/precision.cl Fixes https://github.com/ROCm/triton-internal/issues/314	1 year ago
lib	Automatic Warp Specialization Optimization (#5622) (same commit as `backend`)	1 year ago
python	[AMD] Add pass to convert tt.load/tt.store to buffer operations (#4966) (same commit as `include`)	1 year ago
unittest	[AMD] Introduce an OptimizeLDSUsage pass (#3730) This PR introduces the OptimizeLDSUsage pass, which generalizes the LDS optimization that was previously part of the DecomposeUnsupportedLayouts pass. Overall, it tries to reduce the LDS usage of a convert op by adding an intermediate layout to the conversion. --------- Co-authored-by: Lei Zhang <antiagainst@gmail.com>	1 year ago
CMakeLists.txt	[AMD] Add TritonAMDGPU dialect scaffolding (#4685) This PR adds a TritonAMDGPU dialect to host future AMD-specific ops that help with AMD backend CodeGen. --------- Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com> Co-authored-by: Lei Zhang <antiagainst@gmail.com>	1 year ago
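
The producer/consumer model described in the warp specialization commit (#5622) can be illustrated on the host with ordinary threads. The sketch below is only an analogy and not Triton code: `run_pipeline`, its parameters, and the tile format are hypothetical. A bounded queue stands in for the shared-memory buffers (the role of `num_buffers_warp_spec`), and worker threads stand in for consumer warp groups (the role of `num_consumer_groups`).

```python
import queue
import threading


def run_pipeline(tiles, num_consumer_groups=2, num_buffers=3):
    """Host-side analogy: one producer thread feeds tiles to consumer threads."""
    # The bounded queue models a fixed pool of buffers: the producer blocks
    # once `num_buffers` tiles are waiting to be consumed.
    buffers = queue.Queue(maxsize=num_buffers)
    # Each consumer writes only to its own slot, so no lock is needed.
    partial_sums = [0] * num_consumer_groups

    def producer():
        for tile in tiles:
            buffers.put(tile)   # "load" a tile into a free buffer
        for _ in range(num_consumer_groups):
            buffers.put(None)   # one stop marker per consumer

    def consumer(group_id):
        while True:
            tile = buffers.get()  # wait until a buffer has been filled
            if tile is None:
                break
            a, b = tile
            # "compute": dot product of the two vectors in the tile
            partial_sums[group_id] += sum(x * y for x, y in zip(a, b))

    threads = [threading.Thread(target=producer)]
    threads += [
        threading.Thread(target=consumer, args=(g,))
        for g in range(num_consumer_groups)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial_sums)


tiles = [([1, 2], [3, 4]), ([5, 6], [7, 8]), ([1, 1], [1, 1]), ([2, 0], [0, 2])]
total = run_pipeline(tiles)
```

Which consumer handles which tile varies from run to run, but the total does not, because every tile is consumed exactly once. Increasing `num_buffers` lets the producer run further ahead of the consumers, which is the same trade-off the buffer count expresses in the kernel configuration.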
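
The three conditions listed in the buffer-operations commit (#4966) amount to a simple predicate. This is a minimal sketch, assuming a hypothetical `OffsetInfo` record that summarizes what the compiler's analysis knows about one load or store. The real pass works on MLIR values rather than Python objects, and none of these names come from the Triton code base.

```python
from dataclasses import dataclass


@dataclass
class OffsetInfo:
    """Hypothetical summary of the facts known about one load/store."""
    canonical_ptr: bool  # pointer has the scalarPtr->splat->addptr form
    bit_width: int       # bit width of the offset element type
    min_value: int       # smallest offset value proven or assumed


def can_use_buffer_ops(info: OffsetInfo) -> bool:
    """Return True only when all three conditions from the commit hold."""
    return (
        info.canonical_ptr          # 1. canonicalized pointer arithmetic
        and info.bit_width == 32    # 2. 32-bit offsets
        and info.min_value >= 0     # 3. non-negative offsets
    )


ok = can_use_buffer_ops(OffsetInfo(True, 32, 0))
wide = can_use_buffer_ops(OffsetInfo(True, 64, 0))
negative = can_use_buffer_ops(OffsetInfo(True, 32, -4))
```

When any condition fails, the operation would simply stay an ordinary `tt.load` or `tt.store`.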
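
The effect of `__HIP_FTZ` described in the `fast_expf` commit (#4937) can be emulated numerically on the host. The sketch below assumes the operator computes exp(x) as exp2(x · log2 e), which the use of an exp2 intrinsic suggests but the commit message does not spell out. The helper names are hypothetical, and the code models only the flush-to-zero behavior, not the accuracy of the hardware instruction.

```python
import math
import struct

FLT_MIN = 2.0 ** -126  # smallest positive normal float32 value


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def flush_denorm(x):
    """Replace a float32 denormal with a zero of the same sign."""
    return math.copysign(0.0, x) if 0.0 < abs(x) < FLT_MIN else x


def fast_expf(x, ftz=True):
    """Emulate exp(x) via exp2, optionally flushing denormals (assumed model)."""
    x = to_f32(x)
    if ftz:
        x = flush_denorm(x)  # flush denormal inputs
    y = to_f32(2.0 ** to_f32(x * math.log2(math.e)))
    return flush_denorm(y) if ftz else y  # flush denormal outputs


flushed = fast_expf(-100.0, ftz=True)
kept = fast_expf(-100.0, ftz=False)
```

exp(-100) is about 3.7e-44, which lies in the float32 denormal range, so the result is zero with flushing enabled and a small positive denormal without it.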