triton-ascend/third_party · Ascend/triton-ascend - AtomGit

File	Last commit	Last updated
amd	Add support for padding option to TMA loads (#7993) Closes #7364. Builds on top of #7364 from @jhapradip and addresses the remaining comments, as well as implementing the padding option in the fallback RewriteTensorDescriptorToPointer path. - support for passing padding = "nan" on TMA descriptor creation for both host and device TMAs - forwards this argument down to TMA descriptor creation - implements the NaN other value in the TMA fallback path --------- Co-authored-by: Pradip Jha <pradipjha@hotmail.com>	8 months ago
ascend	!1790 merge sn into main fix: Enhance offset analysis and change the tensor of the all-one type in the select op to continuous memory access Created-by: wutianyao Commit-by: wutianyao Merged-by: ascend-robot Description: bugfix1: Pointer analysis no longer forcibly treats all-ones tensor types (<1x1x1x...>) in the select op as non-contiguous. Under the original restriction, tensor<1x> types were expanded into discrete memory accesses. When MLIR's built-in for-loop optimization (SimplifyTrivialLoops) recognized that the loop count was 1, it removed the loop, but it did not remove the discrete memory access label ({DiscreteMemAccess}) that the discrete memory access optimization had applied to the load op. When the linalg pass later transforms the load and detects the discrete memory access label, it reads the init args of the for loop enclosing the load. Because SimplifyTrivialLoops has already removed the discrete memory access for loop, this can lead to a series of errors (for example, if the discrete memory access for loop has been erased but the kernel itself still contains a for loop, reading empty init args can cause a core dump). bugfix2: When rewriteforloop recognizes an indextensor among the for-loop init args, it converts it to offset + stride. In some cases, however, an arg is recognized as an indextensor but blockptranalysis then analyzes it as a scalar, and rewriteTerminator requires the stride to equal one, so an assertion fails. bugfix3: When the mask is composed of splats, it is identified as a continuous mask in the discrete mask analysis pass. However, in the Triton to Linalg pass, if all dimensions are 1, the splat op is converted into an insert op. Mask analysis cannot identify the insert op, so the mask is analyzed as discontinuous, which causes the problem. bugfix4: In the use analysis, when there is indirect memory access, the 1st load op is initially marked as a meta use. The post-process handles this by identifying and marking the instruction chain related to the indirect memory access and then re-marking it as mixuse, so that it is not eliminated in subsequent conversion stages. If an op appears in a computation chain involving indirect memory accesses, such as load(1st) -> computeOp -> load(2nd), and the op is also used by assert or print, it is marked as Mixuse. In that case the op is cloned into a mixuse op and a metause op before the post-process phase: the mixuse op is used for the assert, and the metause op is used for the 2nd load. However, since the op was initially marked as Mixuse, the 1st load op is also marked as Mixuse and therefore skips the post-process. Because the split-off metause op feeds the 2nd load, its elimination can cause a series of issues. The current temporary fix for this situation is to disable cloning of select ops. See merge request: Ascend/triton-ascend!1790	19 days ago
f2reduce	Fix doctools problem	1 month ago
nvidia	[IMP][Launch Latency] reduce launch overhead for tensor descriptors (#7987) Currently (numbers below from current main), the overhead of launching kernels with `TensorDescriptor` inputs is much larger than that of the corresponding kernel with `Tensor` inputs. Launch overhead per CI-Node (tensor / descriptor): A100 (22.7us / 35.0us), H100 (16.1us / 37.9us), B200 (15.5us / 36.8us). To some extent this is expected, as `TensorDescriptor` will eventually be decomposed, meaning that the underlying CUDA kernel is launched with potentially more inputs, although the difference should not be that large. An inspection of where this difference comes from points to - larger overheads in argument specialization (#7771 will help here, too) - repeated creation of `TMADescKernelParam` objects, incl. using `torch.empty(TMA_DESC_SIZE)` to get 128B containers for a `CUtensorMap` and pointer extraction for `fill_tma_descriptor` The latter can be addressed by exposing `CUtensorMap` as a custom Python type. `fill_tma_descriptor` can then simply create a new object of the right size and return this wrapper around `CUtensorMap`, which later also simplifies extracting `CUtensorMap` in the launch code. With these changes, overhead for the benchmark with `TensorDescriptor` inputs should go down by roughly 12us. --------- Co-authored-by: Peter Bell <peterbell10@openai.com>	8 months ago
proton	[proton] Add missing #include. (#8008)	8 months ago