triton-ascend/lib/Dialect · wxlong_ustc/triton-ascend - AtomGit

ascend-robotchange triton-ascend to fork mode

313dccec创建于 2025年12月29日历史提交

文件	最后提交记录	最后更新时间
Triton	change triton-ascend to fork mode Co-authored-by: zhang-chunli01<zhangchunli19@huawei.com> Co-authored-by: Xuan Peng<pengxuan9@huawei.com> Co-authored-by: luobaiqing<luobaiqing1@huawei.com> Co-authored-by: zhuxuejie<zhuxuejie8@huawei.com> Co-authored-by: candyhong<1102229410@qq.com> Co-authored-by: wutianyao<wutianyao1@huawei.com> # message auto-generated for no-merge-commit merge: !1000 merge out-of-tree-refactor into main change triton-ascend to fork mode Created-by: zhang-chunli01 Commit-by: candyhong;luobaiqing;zhang-chunli01;zhuxuejie;wutianyao;Xuan Peng Merged-by: ascend-robot Description: ![image.png](https://raw.gitcode.com/user-images/assets/7623225/d78ef132-1952-4d6e-8d0e-420bd62d0332/image.png 'image.png') See merge request: Ascend/triton-ascend!1000	5 个月前
TritonGPU	[cherrypick release/3.2][BACKEND] Fix accumulator init optimization for integer matmuls (#5192) (#5624) This fixes https://github.com/pytorch/pytorch/issues/144705 Co-authored-by: peterbell10 <peterbell10@openai.com>	1 年前
TritonNvidiaGPU	Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., `num_consumer_groups` and `num_buffers_warp_spec`. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```	1 年前
CMakeLists.txt	[backend][nvidia] move NVGPU dialect to third_party/nvidia/ (#3773) This commit moves the `NVGPU` dialect to the `third_party/nvidia` directory. To make it happen we need to clear some unnecessary dependency on `NVGPU` dialect in the AMD backend and move some utilities into the `third_party/nvidia` directory too. Overall this gives us a better structure and creates a stricter boundary between core and backends.	2 年前