文件最后提交记录最后更新时间
[Backend] Implement scaled_dot(mxfp4, fp8) (#4904) This PR includes https://github.com/triton-lang/triton/pull/4891 and https://github.com/triton-lang/triton/pull/4895. I will rebase once those have landed. It includes a number of hacks to work around bugs in DotOperandEncodingAttr. All these are marked as FIXME [Dot LL] to be easy to grep for. @Jokeren is working on a comprehensive revamp of DotOperandEncodingAttr which will get rid of all these. https://github.com/triton-lang/triton/pull/4895 is the first step in this direction.1 年前
Revert "[BACKEND] Optimize code generation for load with other arg (#4582)" This reverts commit 78af5c9fd5dfa89c3479473f97d3c4a31aef2601. 1 年前
[BACKEND] Add LLVM pre-processing pass to break struct types (#2285) Add infrastructure to be able to add and test custom LLVM passes in the backend. This will allow use to apply some low level optimizations and cleanup on LLVM IR. Add a first pass that breaks up phi of struct created by lowering to LLVM. Those can often pessimise the optimizer as it would block optimizations going through phi nodes.2 年前
[triton][tool] A CLI Tool for Tensor Layout Printing (#4486) A CLI tool to print the layout of a tensor. Currently, only triton_gpu's DistributedEncoding (no SharedEncoding) tensor layout print is supported via the exposed getLayoutStr API from the dialect library. In the future, we could also add more tensor layout print from other backend HW targets (e.g., CPU). Example usage: ``` triton-tensor-layout -l "#triton_gpu.nvidia_mma<{versionMajor = 3, versionMinor = 0, warpsPerCTA = [8, 1], CTAsPerCGA = [1, 1], CTASplitNum = [1, 1], CTAOrder = [1, 0], instrShape = [16, 256, 32]}>" -t "tensor<128x256xf16>" triton-tensor-layout -i input.mlir -t "tensor<1x128x128xf16>" -o output.txt triton-tensor-layout -i input.mlir -t "tensor<1x128x128xf16>" -o output.txt -alias-names="blocked,mma" -use-hw-view ``` An input file usually looks like: ``` #mma = #triton_gpu.amd_mfma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [1, 1, 8], instrShape = [32, 32], isTransposed = false}> #blocked = #triton_gpu.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}> ``` The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them. - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - Select one of the following. - [x] I have added tests. - /test for lit tests - /unittest for C++ tests - /python/test for end-to-end tests - [] This PR does not need a test because FILL THIS IN. - Select one of the following. - [] I have not added any lit tests. - [x] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) --------- Co-authored-by: Yuanwei Fang <fywkevin@fb.com>1 年前
[Triton] Verify all tt.reduce operands have the same shape (#4957) Add SameOperandsShape to tt.reduce to verify all operands have the same shape. This matches triton.language.reduce (and similar) semantics. This change may enable further optimizations and even may help simplify the code dealing with this operation. Followup PRs will tackle this. The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them. - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - Select one of the following. - [X] I have added tests. - /test for lit tests - /unittest for C++ tests - /python/test for end-to-end tests - [ ] This PR does not need a test because FILL THIS IN. - Select one of the following. - [ ] I have not added any lit tests. - [X] The lit tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Signed-off-by: victor-eds <victor.perez@codeplay.com>1 年前
[cherrypick release/3.2][BACKEND] Fix accumulator init optimization for integer matmuls (#5192) (#5624) This fixes https://github.com/pytorch/pytorch/issues/144705 Co-authored-by: peterbell10 <peterbell10@openai.com>1 年前
Automatic Warp Specialization Optimization (#5622) Warp specialization enhances kernel performance by utilizing an asynchronous execution model, where different parts of the kernel are handled by separate hardware units. The data communication between these units, via shared memory on the H100, operates with high efficiency. With this in mind, we’ve developed an automatic warp specialization optimization that partitions a user kernel into asynchronous tasks (which map to warp groups on NVIDIA GPU), which naturally execute concurrently, leveraging the hardware’s multitasking warp scheduler. To enable warp specialization, user just needs to specify certain autotune flags, i.e., num_consumer_groups and num_buffers_warp_spec. For example, a warp-specialized GEMM implementation might look like below. You can find a complete example in 09-persistent-matmul.py. ```python @triton.autotune( configs=[ triton.Config( { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, }, num_stages=2, num_warps=4, num_consumer_groups=2, num_buffers_warp_spec=3, ), ], key=["M", "N", "K"], ) @triton.jit def matmul_persistent_ws_kernel( a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, ): pid = tl.program_id(axis=0) num_pid_m = tl.cdiv(M, BLOCK_M) num_pid_n = tl.cdiv(N, BLOCK_N) pid_m = pid // num_pid_m pid_n = pid % num_pid_n offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, tl.cdiv(K, BLOCK_K)): a = tl.load(a_ptrs) b = tl.load(b_ptrs) acc += tl.dot(a, b) a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c = acc.to(tl.float16) c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] tl.store(c_ptrs, c) ```1 年前
[release/3.2.x] [CHERRY PICK] Add gfx950 target definition (#5452) This PR brings in required LLVM bumps and additional targets for gfx950 support. - https://github.com/triton-lang/triton/pull/5040 - https://github.com/triton-lang/triton/pull/5064 - https://github.com/triton-lang/triton/pull/5180 - https://github.com/triton-lang/triton/pull/5242 - https://github.com/triton-lang/triton/pull/5392 Reverts: - #5347 - #51911 年前
Allow kernel instrumentation passes to be added to pipeline (#3953) This PR adds the ability to insert passes into the Triton LLVM pass pipeline through the LLVM Pass Plugin capability. This is technically an architecture agnostic patch but as of current AMDGPU is the only backend that would make use of it.1 年前
Allow kernel instrumentation passes to be added to pipeline (#3953) This PR adds the ability to insert passes into the Triton LLVM pass pipeline through the LLVM Pass Plugin capability. This is technically an architecture agnostic patch but as of current AMDGPU is the only backend that would make use of it.1 年前
Allow kernel instrumentation passes to be added to pipeline (#3953) This PR adds the ability to insert passes into the Triton LLVM pass pipeline through the LLVM Pass Plugin capability. This is technically an architecture agnostic patch but as of current AMDGPU is the only backend that would make use of it.1 年前