[PROTON] Intra kernel profiling (#7258)
### Instrumentation & Runtime
- Introduce a dedicated instrumentation mode
- proton.start(..., mode="instrumentation", ...)
- Introduce both high- and low-level scope APIs
- For Gluon DSL: pl.scope, pl.enter_scope, and pl.exit_scope.
The profiling API for the Triton DSL is disabled by default.
- For TTGIR: proton.record start and proton.record end
- Inject profiling buffers into each Triton kernel at codegen time and
pass them to the Proton runtime so kernels can push data directly from
the device to the host
### Proton Dialect & Lowering
- Add Proton → ProtonGPU → LLVM pipelines, including passes for
shared-memory allocation, profile scratch allocation, and a few
optimizations for reduced overhead or improved accuracy.
### Tracing
- proton.start(..., data="trace", ...) is supported for both fine- and
coarse-grained events.
---------
Co-authored-by: Yuanwei Fang <fywkevin@gmail.com>
Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>
Co-authored-by: peterbell10 <peterbell10@openai.com>
Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com>
Co-authored-by: Corbin Robeck <robeck@meta.com>
Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com>
Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com>
Co-authored-by: Shawn Zhong <github@shawnzhong.com>
Co-authored-by: Shawn Zhong <shawnzhong@fb.com>
Co-authored-by: 鐘天楽 <a844379248@icloud.com>
RFC [python] Rename config.py > knobs.py (#6641)
In https://github.com/triton-lang/triton/pull/6467 I didn't realize that
triton.Config exists. Having both triton.config and triton.Config
is confusing. On the one hand we could rename triton.Config to
triton.AutotunerConfig so that it's more descriptive, but that'd come
at the cost of a non-trivial API name change.
Instead, since triton.config is so new, I think it's more reasonable
to rename that module. Qualitatively I've denoted the different
variables as 'knobs' so renaming the module to knobs seems reasonable.
Of note: this is an RFC, so if this seems silly / pedantic feel free to
shut this down (and close the PR).
fix(bug): allow using npu-smi info to get the device info
Co-authored-by: 刘风昇 <liufengsheng2@huawei.com>
# message auto-generated for no-merge-commit merge:
!1338 merge acl2 into main
fix(bug): allow using npu-smi info to get the device info
Created-by: meloliu12327
Commit-by: 刘风昇
Merged-by: ascend-robot
Description: Add a way to obtain the hardware model information using npu-smi info
See merge request: Ascend/triton-ascend!1338
[Blackwell][TUTORIALS] Add tutorial 10-block-scaled-matmul.py (#5813)
This tutorial demonstrates Triton's support for block-scaled matrix
multiplication on Blackwell's 5th-generation tensor cores with the
low-precision FP4 and FP8 datatypes. Planned follow-ups include optimized
TMA loads for block scale factors and mixed-precision support.
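
As a rough illustration of what "block scaled" means here, the following host-side sketch multiplies two small matrices whose elements share one scale factor per group of consecutive values along K. It is not the tutorial's kernel: the function name `block_scaled_matmul`, the N x K layout of `b`, and the tiny group size `block=2` are assumptions chosen for readability, and plain Python floats stand in for the FP4/FP8 values.

```python
def block_scaled_matmul(a, a_scale, b, b_scale, block=2):
    """Reference C = (A * sA) @ (B * sB)^T with one scale per K-group.

    a is M x K, b is N x K (assumed layout). a_scale is M x (K // block)
    and b_scale is N x (K // block). Hypothetical helper for illustration.
    """
    m, k = len(a), len(a[0])
    n = len(b)
    if k % block != 0:
        raise ValueError("K must be a multiple of the block size")
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kk in range(k):
                g = kk // block  # index of the scale group covering column kk
                acc += (a[i][kk] * a_scale[i][g]) * (b[j][kk] * b_scale[j][g])
            out[i][j] = acc
    return out


a = [[1.0, 2.0, 3.0, 4.0]]
a_scale = [[2.0, 0.5]]  # scales for columns 0-1 and columns 2-3
b = [[1.0, 1.0, 1.0, 1.0]]
b_scale = [[1.0, 2.0]]
c = block_scaled_matmul(a, a_scale, b, b_scale)
```

A hardware implementation applies the scales inside the tensor-core instruction instead of materializing the scaled operands, which this sketch does not attempt to model.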
Additional changes
* Moves MX dtype helper classes to triton/tools/mxfp.py for use in
tutorials as well as test code.
@ThomasRaoux @pawelszczerbuk @masahi @mbrookhart @binarybana
Add support for padding option to TMA loads (#7993)
Closes #7364
Builds on top of #7364 from @jhapradip and addresses the remaining comments,
and also implements the padding option in the fallback
RewriteTensorDescriptorToPointer path.
- Support passing padding = "nan" on TMA descriptor creation for
both host and device TMAs
- Forward this argument down to TMA descriptor creation
- Implement the NaN `other` value in the TMA fallback path
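
The intended effect of the padding option can be sketched on the host as follows. This is only an emulation of the semantics, not the TMA or fallback implementation: `load_block` is a hypothetical helper, and the `"zero"` mode name is an assumption for the non-NaN behavior. Out-of-bounds elements of the requested block are filled with the selected value.

```python
import math


def load_block(matrix, row0, col0, block_rows, block_cols, padding="zero"):
    """Emulate a bounds-checked 2D block load (hypothetical helper).

    Elements outside the matrix are filled with 0.0 for padding="zero"
    (assumed name) or NaN for padding="nan".
    """
    if padding not in ("zero", "nan"):
        raise ValueError(f"unsupported padding: {padding!r}")
    fill = math.nan if padding == "nan" else 0.0
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    block = []
    for r in range(row0, row0 + block_rows):
        row = []
        for c in range(col0, col0 + block_cols):
            in_bounds = 0 <= r < n_rows and 0 <= c < n_cols
            row.append(matrix[r][c] if in_bounds else fill)
        block.append(row)
    return block


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# A 2x2 block starting at the last element reaches past both edges.
nan_tile = load_block(data, 1, 2, 2, 2, padding="nan")
zero_tile = load_block(data, 1, 2, 2, 2, padding="zero")
```

NaN padding makes accidental use of out-of-bounds elements visible, because any reduction that touches them produces NaN instead of silently absorbing a zero.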
---------
Co-authored-by: Pradip Jha <pradipjha@hotmail.com>