[TOOLS] Add support for autotuning AOT kernel (#2123)
This PR makes the following changes to AOT kernels:
- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of a constexpr set and
meta-parameters is referred to as an "algo". Within an algo, the client
can still give different hints about integer arguments.
- Add an API, int ${kernel_name}_get_num_algos(), that returns the total
number of algos.
- Add an algo_id parameter that lets the client of the generated kernel
select the algo.
- Remove gX, gY and gZ from the kernel parameter list. The launch grid
usually differs between algos, and the client should not need to know
how to compute it for each one. Instead, the client passes the
expressions for computing gX, gY and gZ to compile.py (when the AOT
kernels are generated). The expressions can only use kernel parameters
or constant values.
- Change the testing flow. We now first build the kernels into a
shared library, libkernel.so; the client code test.c is then built and
linked against libkernel.so. This is closer to a typical AOT kernel
usage flow.
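The algo selection and grid-expression ideas above can be illustrated with a small, self-contained Python sketch. It does not call the generated C API or compile.py. The ALGOS table and the names eval_grid_expr, launch_grid and pick_best_algo are hypothetical and exist only for this illustration, and the expression syntax accepted here (names, integer constants, + - * /) is an assumption rather than the syntax compile.py accepts.

```python
import ast
import operator

# Hypothetical stand-in for a generated AOT library. Each "algo" is one
# combination of constexprs/meta-parameters plus the grid expressions that
# would have been supplied when the kernels were generated.
ALGOS = [
    {"BLOCK_M": 16, "BLOCK_N": 16, "grid": ("M/16", "N/16", "1")},
    {"BLOCK_M": 64, "BLOCK_N": 32, "grid": ("M/64", "N/32", "1")},
]

# "/" is mapped to floor division, which matches C integer division for the
# non-negative sizes used here.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.floordiv,
}


def get_num_algos():
    """Mimics the role of ${kernel_name}_get_num_algos()."""
    return len(ALGOS)


def eval_grid_expr(expr, params):
    """Evaluate an expression that may use only kernel parameters and integer constants."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in params:
                raise ValueError(f"unknown name in grid expression: {node.id}")
            return params[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported grid expression: {expr}")

    return walk(ast.parse(expr, mode="eval"))


def launch_grid(algo_id, params):
    """Return (gX, gY, gZ) for the chosen algo; the caller never computes it."""
    if not 0 <= algo_id < get_num_algos():
        raise IndexError(f"algo_id out of range: {algo_id}")
    return tuple(eval_grid_expr(e, params) for e in ALGOS[algo_id]["grid"])


def pick_best_algo(params, measure):
    """Autotune by trying every algo_id and keeping the one with the lowest cost.

    `measure` is a caller-supplied callable (algo_id, params) -> cost,
    for example a measured kernel run time.
    """
    return min(range(get_num_algos()), key=lambda algo_id: measure(algo_id, params))
```

With params = {"M": 128, "N": 64}, launch_grid(0, params) gives (8, 4, 1) and launch_grid(1, params) gives (2, 2, 1). The client only supplies the kernel parameters and an algo_id, and the grid follows from the expressions fixed at generation time.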
Refactor compiler specializations to consider backend (#4734)
In this PR I refactor the specializations that we apply to the
signature of a given function in Triton.
Given a kernel, some argument properties can help compilation, e.g.
divisibility by 16 or the fact that an integer is equal to 1.
In a previous PR, https://github.com/triton-lang/triton/pull/4716, I
needed other specializations to add buffer support in the AMD backend
(and to get back some performance when we were using unaligned masked
loads).
This is my attempt to redesign the specialization support to introduce
per-backend specializations. AttrsDescriptor is now the class that
analyzes the parameters and adds the specializations. It also has a
function table where more specializations can be added per backend.
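The function-table idea can be sketched in a few lines of Python. This is not Triton's AttrsDescriptor: the class SketchAttrsDescriptor, the two tables, the backend name "example_backend" and the property "fits_in_int32" are all hypothetical and only illustrate how a shared set of checks might be extended per backend.

```python
# Properties that apply to every backend (taken from the description above).
BASE_SPECIALIZATIONS = {
    "divisible_by_16": lambda value: value % 16 == 0,
    "equal_to_1": lambda value: value == 1,
}

# Hypothetical per-backend additions; a real backend would register its own.
BACKEND_SPECIALIZATIONS = {
    "example_backend": {
        "fits_in_int32": lambda value: -(2**31) <= value < 2**31,
    },
}


class SketchAttrsDescriptor:
    """Analyzes integer arguments and records which properties each one satisfies."""

    def __init__(self, args, backend=None):
        # Start from the shared table, then add the backend's entries, if any.
        self.function_table = dict(BASE_SPECIALIZATIONS)
        self.function_table.update(BACKEND_SPECIALIZATIONS.get(backend, {}))

        # Map each property name to the arguments that satisfy it.
        # Non-integer arguments (and bools) are skipped in this sketch.
        self.properties = {
            prop: [
                name
                for name, value in args.items()
                if isinstance(value, int)
                and not isinstance(value, bool)
                and check(value)
            ]
            for prop, check in self.function_table.items()
        }
```

For args = {"n_elements": 1024, "stride": 1, "offset": 7}, the base descriptor reports n_elements as divisible by 16 and stride as equal to 1. Passing backend="example_backend" adds the extra property without changing the shared checks.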
[Frontend][Backend] Add device-side tma descriptor update API (#4633)
This adds two new Triton IR operators:
1. ExperimentalTensormapCreateOp which creates a descriptor and stores it in global memory
2. ExperimentalTensormapFenceproxyAcquireOp which produces the required fence to use the updated descriptor
I then use these to expose 3 new functions in tl.extra.cuda.
1. experimental_device_tensormap_create1d
2. experimental_device_tensormap_create2d
3. experimental_tensormap_fenceproxy_acquire
which match up with the existing host-side tensormap creation API.
fix(bug): allow using npu-smi info to get the device info
Co-authored-by: 刘风昇<liufengsheng2@huawei.com>
# message auto-generated for no-merge-commit merge:
!1338 merge acl2 into main
fix(bug): allow using npu-smi info to get the device info
Created-by: meloliu12327
Commit-by: 刘风昇
Merged-by: ascend-robot
Description: Add a way to obtain hardware model information using npu-smi info
See merge request: Ascend/triton-ascend!1338