triton-ascend/lib/Conversion · wxlong_ustc/triton-ascend - AtomGit

GGitHubAutomatic Warp Specialization Optimization (#5622 )

ebffad5a创建于 2025年1月16日历史提交

文件	最后提交记录	最后更新时间
TritonGPUToLLVM	Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```	1 年前
TritonToTritonGPU	[release/3.2.x] [CHERRY PICK] Add gfx950 target definition (#5452) This PR brings in required LLVM bumps and additional targets for gfx950 support. - https://github.com/triton-lang/triton/pull/5040 - https://github.com/triton-lang/triton/pull/5064 - https://github.com/triton-lang/triton/pull/5180 - https://github.com/triton-lang/triton/pull/5242 - https://github.com/triton-lang/triton/pull/5392 Reverts: - #5347 - #5191	1 年前
CMakeLists.txt	[BACKEND] Make backends more uniform and extract some backend agnostic utilities/helpers. (#3125) Making backends more uniform, extracting backend agnostic utilities/helpers, while the backend-specific ones are moved to `third_party/`. 1. Split the top-level `TritonGPUToLLVM/Utility.h` between LLVM and PTX helpers. Keep the LLVM helpers in top-level, whereas the PTX helpers are moved to `third_party/nvidia`. 2. Move TritonGPUToLLVM passes to `third_party/nvidia`, along with the `PTXAsmFormat` 3. Delete any code in `third_party/amd` that depends on `PTXAsmFormat` 4. Delete all LLVM helpers from `third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h` and use the top-level `Utility.h`. 5. Link `unittest/PTXAsmFormatTest.cpp` to `third_party/nvidia` library. 6. Move `NVGPUToLLVM` passes to `third_party/nvidia` Final directory structure: ``` ├── include/ │ └── triton/ │ └── Conversion/ │ └── TritonGPUToLLVM/ │ └── AsmFormat.h ├── lib/ │ └── Conversion/ │ └── TritonGPUToLLVM/ │ ├── TypeConverter.h │ ├── TypeConverter.cpp │ ├── Utility.h │ └── Utility.cpp (LLVM-only helpers) └── third_party/ └── amd/ └── nvidia/ ├── include/ │ ├── NVGPUToLLVM/ │ └── TritonNVIDIAGPUToLLVM/ └── lib/ ├── NVGPUToLLVM/ └── TritonNVIDIAGPUToLLVM/ ```	2 年前