文件最后提交记录最后更新时间
Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., num_consumer_groups and num_buffers_warp_spec. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```1 年前
Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., num_consumer_groups and num_buffers_warp_spec. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```1 年前
change triton-ascend to fork mode Co-authored-by: zhang-chunli01<zhangchunli19@huawei.com> Co-authored-by: Xuan Peng<pengxuan9@huawei.com> Co-authored-by: luobaiqing<luobaiqing1@huawei.com> Co-authored-by: zhuxuejie<zhuxuejie8@huawei.com> Co-authored-by: candyhong<1102229410@qq.com> Co-authored-by: wutianyao<wutianyao1@huawei.com> # message auto-generated for no-merge-commit merge: !1000 merge out-of-tree-refactor into main change triton-ascend to fork mode Created-by: zhang-chunli01 Commit-by: candyhong;luobaiqing;zhang-chunli01;zhuxuejie;wutianyao;Xuan Peng Merged-by: ascend-robot Description: ![image.png](https://raw.gitcode.com/user-images/assets/7623225/d78ef132-1952-4d6e-8d0e-420bd62d0332/image.png 'image.png') See merge request: Ascend/triton-ascend!10005 个月前
[BACKEND] Update to llvm/llvm-project@b5cc222d7429 (#4927) Upgrading LLVM to pick up the following changes for AMD backend: * https://github.com/llvm/llvm-project/pull/112237 Changes made: - changed the signature of visit method in ConstantAnalysis - i.e., accepts an instance of the ProgramPoint. - updated calls to getLatticeElementFor -i.e., accepts an instance of the ProgramPoint. - added the required last parameter to LLVM::DISubprogramAttr::get - i.e., an empty annotations.1 年前
Add LL::quotient and remove uses of divideRight and sublayoutIsIdentity (#4968) We add a new abstraction LL::quotient that abstracts the idea of "a linear layout does not permute certain dimensions". Doing so, allows us to remove divideRight and subsume them into this higher-level abstraction. We also fix a bug in isCrossCTAConversion. We also remove some code duplication from transferWithinThreads and cvtReorderRegisters in favour of a more generic approach. We fix a bug in sublayout that meant that sublayout would reorder outDims at will by using a set instead of a vector. I am missing adding tests for LL::quotient, will do in a minute.1 年前
fix(copyright):Remove the Huawei copyright notices from the extension, runtime, libentry files and OpInterface.h. Co-authored-by: jeshd<chengmaofan@huawei.com> # message auto-generated for no-merge-commit merge: !1346 merge recover-community-copyright into main fix(copyright):Remove the Huawei copyright notices from the extension, runtime, libentry files and OpInterface.h. Created-by: jeshd Commit-by: jeshd Merged-by: ascend-robot Description: 描述 移除extension,runtime和libentry里的Huawei copyright,移除OpInterface.h里的Huawei copyright 修改原因 extension,runtime和libentry中的代码文件为TA新添加的文件,基于开源代码片段的修改,OpInterface.h从triton 3.4.0版本引入,移除对应的Huawei copyright See merge request: Ascend/triton-ascend!13462 个月前
Linear layouts (#3794) Today we have many different layout objects, representing e.g. MMAv2 operands in registers, MMAv2 results in registers (different thing!), AMD tensor core operands in registers, shared memory swizzled Just Right for Hopper MMAv3, and so on. In CUTLASS v2, they used to have the same problem. In v3, they introduced the notion of a CuTe layout, which unifies all of these special cases into one programmatic thing. I want to do the same thing for Triton, because 1. we have a bunch of [known bugs](https://github.com/openai/triton/blob/0b46687895f0bc7c4d5216150d8d5cfeb5b4e254/python/test/unit/language/test_core.py#L4771) around layout conversions that have been very hard to fix, 2. there are certain operations (like some reshape + transpose + reshape combinations) that cannot be represented efficiently with today's layouts, and 3. the code for handling layouts is already very complex, and I'm concerned that Blackwell is going to make the problem worse. One approach I considered is using CuTe inside Triton. But I concluded it's not a great fit for various reasons. As an alternative, @apgoucher proposed this idea of "linear layouts" that seems to work really well, and is a lot simpler. This PR is currently a first pass of linear layouts. It Appears To Work (tm). The way this PR uses linear layouts is that before we generate the indices for a Triton BlockedLayout, we convert it to a linear layout and use that to generate indices instead. The implementation plan is to do the same thing for the other Triton layouts (i.e. make codegen only use linear layouts). Once that is working, we can start using linear layouts in the Triton middle-end. Eventually the goal is to replace all layouts with just this one. There are a few questions still outstanding which need to be resolved before we can land this. 1. Are linear layouts actually flexible enough to represent all the layouts we care about? 2. What will the textual IR look like for linear layouts? Can we make it as easy to read as the current IR?2 年前