File | Last commit | Last updated
[BACKEND] Restrict MMAV5/MMV3 usage for N < 16 and fix no-swizzle case (#7981) This prevents miscompiles in cases where the N dimension is too small. Technically we could support N=8, but it would require padding the shared memory. | 9 months ago
Fix histograms for complex replicated layouts (#7938) The current histogram code assumes that replication across a warp is laid out so that the first n threads hold unique data. This is not a valid assumption; in fact the function it calls to get this layout, getThreadsPerWarp, describes one such layout and how it's returned, so the histogram code actually discards that information. To fix this, we remove the uniquing code that masks out threads holding duplicate data. Instead, every thread participates, and we correct the resulting overcount by computing the "replication factor". This is much easier than computing the correct mask, which is nontrivial in the general case. | 9 months ago
fix(unlimit loop): delete hasFolder attribute of AddPtrOp to avoid ulimit loop in compile Co-authored-by: zhang-chunli01<zhangchunli19@huawei.com> # message auto-generated for no-merge-commit merge: !43 merge release/3.3.x-upgrade-zcl-dev into release/3.5.x-upgrade Created-by: zhang-chunli01 Commit-by: zhang-chunli01 Merged-by: zhang-chunli01 Description: ## Description Remove the folder from AddPtrOp in the community 3.5 version, because it automatically folds away the `addptr ptr, 0` case, while LoadStoreCanonicalizer adds `addptr ptr, 0` in the scalar case to ensure that pointer analysis always starts from an addptr. The two then keep adding and removing the op, which causes an infinite loop in applyPatternsGreedily. ## checklist <!-- [x] means checked --> - [ ] Ran static checks on the code in a local IDE - [ ] Formatted the code in a local IDE - [ ] Added null-pointer checks - [ ] Added return-value checks - [ ] Correctly freed memory allocated with new/malloc - [ ] Fully considered the exceptional cases of the interface - [ ] Logged errors correctly See merge request: jeshd/triton-ascend!43 | 2 months ago
Audit file-private C++ functions to ensure correct linkage (#6237) ### Summary I’ve gone through the C++ sources and updated functions that were file-local but not declared static or placed in anonymous namespaces. These can lead to symbol visibility issues when linking multiple translation units together. This patch: * Marks helper functions in bin/triton-tensor-layout.cpp as static. * Marks internal variables and helpers in lib/Instrumentation/PrintLoadStoreMemSpaces.cpp as static. * Adds static to a number of internal free functions and templates in third_party/f2reduce/f2reduce.cpp. * Corrects linkage for a handful of AMD GPU transforms and stream utilities that were file-local. * Makes file-local helpers in third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp static. * Ensures initProton in third_party/proton/csrc/Proton.cpp is static. * Tidies up similar file-local helpers in test code. I initially made some functions static that were already declared in headers; that’s been reverted so things compile cleanly. All C++ unit tests and lit tests pass, and `pre-commit run --from-ref origin/main --to-ref HEAD` shows no outstanding issues. ### New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD. - [x] This PR does not need a test because this is a refactor that only adjusts linkage of file-local functions. - [x] I have not added any lit tests. All tests (make test-nogpu) pass: ``` Testing Time: 1.07s Total Discovered Tests: 131 Passed: 131 (100.00%) 100% tests passed, 0 tests failed out of 207 Total Test time (real) = 1.48 sec ``` | 1 year ago
[NVIDIA] Replace some NVGPU ops with equivalent NVVM ops (part 2) (#7471) This change replaces some NVGPU ops with the corresponding NVVM ops. It aligns with previous discussions in PR #7420. For some ops, such as NVGPU::FenceAsyncSharedOp, there is no corresponding intrinsic, and LLVM will also generate PTX. However, in the long run, I think it is better to hand over the responsibility for generating code to LLVM instead of hard-coding PTX at the NVGPU layer. The ConvertNVVMToLLVMPass has been added to the pipeline and build system so that NVVM ops are correctly lowered to LLVM IR. | 10 months ago
[BACKEND] Always lower tcgen05.ld/st via generic LinearLayout code (#7862) We now support TensorMemoryScalesEncoding in toLinearLayout and reuse the generic lowering to lower it. | 9 months ago
fix(copyright): Remove the Huawei copyright notices from the extension, runtime, libentry files and OpInterface.h. Co-authored-by: jeshd<chengmaofan@huawei.com> # message auto-generated for no-merge-commit merge: !1346 merge recover-community-copyright into main Created-by: jeshd Commit-by: jeshd Merged-by: ascend-robot Description: Description: Remove the Huawei copyright from extension, runtime, and libentry, and from OpInterface.h. Reason for change: The code files in extension, runtime, and libentry are files newly added by TA and are modifications of open-source code snippets; OpInterface.h was introduced from triton 3.4.0. The corresponding Huawei copyright is removed. See merge request: Ascend/triton-ascend!1346 | 2 months ago
[Instrumentation][Proton] Add MLIR/LLVM level compiler instrumentation pass support in Proton (#5067) Basic functionality to print load/store address spaces chosen by the compiler. Usage/example with matmul Proton tutorial: ``` $ proton --instrument=print-mem-spaces matmul.py 0 matmul_kernel matmul.py:180:20 SHARED STORE 1 matmul_kernel matmul.py:181:20 SHARED STORE 2 matmul_kernel matmul.py:180:20 SHARED LOAD 3 matmul_kernel matmul.py:181:20 SHARED LOAD matmul-performance: M N K cuBLAS Triton 0 256.0 256.0 256.0 2.231013 1.691252 1 384.0 384.0 384.0 5.947805 4.626071 2 512.0 512.0 512.0 12.336188 8.924051 3 640.0 640.0 640.0 26.006348 14.628980 4 768.0 768.0 768.0 36.065672 20.972006 5 896.0 896.0 896.0 51.974214 29.480457 6 1024.0 1024.0 1024.0 63.913206 27.560463 7 1152.0 1152.0 1152.0 52.790876 34.125533 ``` | 1 year ago
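
Note on the histogram fix (#7938) listed above: the "replication factor" adjustment it describes can be illustrated with a small host-side sketch. This is not the Triton implementation. The function name `histogram_with_replication`, the thread-to-element mapping, and the assumption that every element is replicated the same number of times are all hypothetical, chosen only to show why dividing by the replication factor undoes the overcount when every thread participates.

```python
def histogram_with_replication(thread_values, replication_factor, num_bins):
    """Histogram where every thread contributes, then the overcount is removed.

    Assumes (for illustration only) that each unique element is held by
    exactly `replication_factor` threads.
    """
    counts = [0] * num_bins
    for value in thread_values:
        counts[value] += 1  # no masking: duplicate holders also count
    # Each unique element was counted replication_factor times.
    return [c // replication_factor for c in counts]


# Four unique elements, each held by two adjacent threads (8 threads total).
# Here the first four threads do NOT hold the four unique elements, so a
# "first n threads are unique" mask would give the wrong answer.
unique_elements = [0, 1, 1, 3]
thread_values = [unique_elements[t // 2] for t in range(8)]

result = histogram_with_replication(thread_values, replication_factor=2, num_bins=4)
print(result)
```

The design point is that the correction needs only a single scalar per layout, whereas a mask selecting exactly one holder of each element depends on how the layout assigns elements to threads.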