[BACKEND] Add LLVM pre-processing pass to break struct types (#2285)
Add infrastructure to be able to add and test custom LLVM passes in the
backend. This will allow use to apply some low level optimizations and
cleanup on LLVM IR.
Add a first pass that breaks up phi of struct created by lowering to
LLVM. Those can often pessimise the optimizer as it would block
optimizations going through phi nodes.
[triton][tool] A CLI Tool for Tensor Layout Printing (#4486)
A CLI tool to print the layout of a tensor. Currently, only triton_gpu's
DistributedEncoding (no SharedEncoding) tensor layout print is
supported via the exposed getLayoutStr API from the dialect library.
In the future, we could also add more tensor layout print from other
backend HW targets (e.g., CPU).
Example usage:
```
triton-tensor-layout -l "#triton_gpu.nvidia_mma<{versionMajor = 3, versionMinor = 0, warpsPerCTA = [8, 1], CTAsPerCGA = [1, 1], CTASplitNum = [1, 1], CTAOrder = [1, 0], instrShape = [16, 256, 32]}>" -t "tensor<128x256xf16>"
triton-tensor-layout -i input.mlir -t "tensor<1x128x128xf16>" -o output.txt
triton-tensor-layout -i input.mlir -t "tensor<1x128x128xf16>" -o output.txt -alias-names="blocked,mma" -use-hw-view
```
An input file usually looks like:
```
#mma = #triton_gpu.amd_mfma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [1, 1, 8], instrShape = [32, 32], isTransposed = false}>
#blocked = #triton_gpu.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}>
```
The core Triton is a small number of people, and we receive many PRs
(thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ]
with
[x] to indicate you have done them.
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [x] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [] I have not added any lit tests.
- [x] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
---------
Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
[Triton] Verify all tt.reduce operands have the same shape (#4957)
Add SameOperandsShape to tt.reduce to verify all operands have the
same shape.
This matches triton.language.reduce (and similar) semantics.
This change may enable further optimizations and even may help simplify
the code dealing with this operation. Followup PRs will tackle this.
The core Triton is a small number of people, and we receive many PRs
(thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**
Complete the following tasks before sending your PR, and replace [ ]
with
[x] to indicate you have done them.
- [X] I am not making a trivial change, such as fixing a typo in a
comment.
- [X] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [X] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
- [X] I have added tests.
- /test for lit tests
- /unittest for C++ tests
- /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.
- Select one of the following.
- [ ] I have not added any lit tests.
- [X] The lit tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
Signed-off-by: victor-eds <victor.perez@codeplay.com>
Automatic Warp Specialization Optimization (#5622)
Warp specialization enhances kernel performance by utilizing an
asynchronous execution model, where different parts of the kernel are
handled by separate hardware units. The data communication between these
units, via shared memory on the H100, operates with high efficiency.
With this in mind, we’ve developed an automatic warp specialization
optimization that partitions a user kernel into asynchronous tasks
(which map to warp groups on NVIDIA GPU), which naturally execute
concurrently, leveraging the hardware’s multitasking warp scheduler.
To enable warp specialization, user just needs to specify certain
autotune flags, i.e., num_consumer_groups and num_buffers_warp_spec.
For example, a warp-specialized GEMM implementation might look like
below. You can find a complete example in 09-persistent-matmul.py.
```python
@triton.autotune(
configs=[
triton.Config(
{
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 8,
},
num_stages=2,
num_warps=4,
num_consumer_groups=2,
num_buffers_warp_spec=3,
),
],
key=["M", "N", "K"],
)
@triton.jit
def matmul_persistent_ws_kernel(
a_ptr, b_ptr, c_ptr, M, N, K,
stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
pid = tl.program_id(axis=0)
num_pid_m = tl.cdiv(M, BLOCK_M)
num_pid_n = tl.cdiv(N, BLOCK_N)
pid_m = pid // num_pid_m
pid_n = pid % num_pid_n
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
offs_k = tl.arange(0, BLOCK_K)
a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_K)):
a = tl.load(a_ptrs)
b = tl.load(b_ptrs)
acc += tl.dot(a, b)
a_ptrs += BLOCK_K * stride_ak
b_ptrs += BLOCK_K * stride_bk
c = acc.to(tl.float16)
c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
tl.store(c_ptrs, c)
```
Allow kernel instrumentation passes to be added to pipeline (#3953)
This PR adds the ability to insert passes into the Triton LLVM pass
pipeline through the LLVM Pass Plugin capability. This is technically an
architecture agnostic patch but as of current AMDGPU is the only backend
that would make use of it.
Allow kernel instrumentation passes to be added to pipeline (#3953)
This PR adds the ability to insert passes into the Triton LLVM pass
pipeline through the LLVM Pass Plugin capability. This is technically an
architecture agnostic patch but as of current AMDGPU is the only backend
that would make use of it.
Allow kernel instrumentation passes to be added to pipeline (#3953)
This PR adds the ability to insert passes into the Triton LLVM pass
pipeline through the LLVM Pass Plugin capability. This is technically an
architecture agnostic patch but as of current AMDGPU is the only backend
that would make use of it.