triton-ascend/third_party/nvidia · Ascend/triton-ascend - AtomGit

candyhong fix: update third_party nv file with upstream

File	Last commit	Last updated
backend	[IMP][Launch Latency] reduce launch overhead for tensor descriptors (#7987) Currently (numbers below from current main), the overhead of launching kernels with `TensorDescriptor` inputs is much larger than the corresponding kernel with `Tensor` inputs. \| CI-Node \| tensor \| descriptor \| \|---------\|--------\|------------\| \| A100 \| 22.7us \| 35.0us \| \| H100 \| 16.1us \| 37.9us \| \| B200 \| 15.5us \| 36.8us \| To some extent, this is expected, as `TensorDescriptor` will eventually be decomposed, meaning the underlying CUDA kernel is launched with potentially more inputs. The difference should not be that large, though. An inspection of where this difference is coming from leads to - larger overheads in argument specialization (#7771 will help here, too) - repeated creation of `TMADescKernelParam` objects, incl. using `torch.empty(TMA_DESC_SIZE)` to get 128B containers for a `CUtensorMap` and pointer extraction for `fill_tma_descriptor` The latter can be addressed by exposing `CUtensorMap` as a custom Python type. `fill_tma_descriptor` can then simply create a new object of the right size and return this wrapper around `CUtensorMap`, which later also simplifies extracting `CUtensorMap` in the launch code. With these changes, overhead for the benchmark with `TensorDescriptor` inputs should go down by roughly 12us. --------- Co-authored-by: Peter Bell <peterbell10@openai.com>	9 months ago
hopper	[Hopper][WS] Fix UB in channel sorting (#7828) In ~50% of test runs I was getting failures in `test/Hopper/WarpSpecialization/ws_code_partition.mlir`'s `_matmul_layernorm_persistent_one_producer_one_consumer_one_epilog` test, with the following error: > assertion failed at third_party/llvm/llvm-project/llvm/include/llvm/ADT/DenseMap.h:1290 in pointer llvm::DenseMapIterator<mlir::Channel , llvm::SmallVector<mlir::Channel >, llvm::DenseMapInfo<mlir::Channel >, llvm::detail::DenseMapPair<mlir::Channel , llvm::SmallVector<mlir::Channel >>, true>::operator->() const [KeyT = mlir::Channel , ValueT = llvm::SmallVector<mlir::Channel >, KeyInfoT = llvm::DenseMapInfo<mlir::Channel >, Bucket = llvm::detail::DenseMapPair<mlir::Channel , llvm::SmallVector<mlir::Channel >>, IsConst = true]: Ptr != End && "dereferencing end() iterator" It turns out the channel order in the groups was nondeterministic between runs, which caused inconsistent / unexpected group keys. I tracked this down to a UB issue in sorting the channels, where the `llvm::ilist_iterator` returned by `allOps.begin() / end()` was only a bidirectional iterator, and not a random access iterator. Using `std::distance(a, b)` on non-random-access iterators [is UB if `a` comes after `b`](https://en.cppreference.com/w/cpp/iterator/distance.html). The shortest fix for this would have been to change the return condition to `std::distance(allOps.begin(), itrA) < std::distance(allOps.begin(), itrB)`, but I found the entire implementation of that sort quite difficult to reason about due to nested lambdas and algorithms. Instead, I rewrote the sort so it's quite a bit shorter and has less nesting, and it should also have lower complexity (`O(n log n + block_size)` vs `O(n log n * block_size)` previously).	9 months ago
include	[BENCH] Various fixes on `bench_mlp.py` (#7926)	9 months ago
language	[FRONTEND] Fix PDL issue (#7379)	10 months ago
lib	[BACKEND] Add bar.sync before deallocating tmem (#7994) Without a barrier, a warp may deallocate tmem while it is still in use, allowing another block to overwrite it.	9 months ago
tools	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low-level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 months ago
unittest	[BUILD] Speedup unittests linking (#6796) Comparison of the sizes of build artifacts: Old: ```bash ls build/cmake.linux-x86_64-cpython-3.10/unittest/Dialect/TritonGPU/ -lh total 1.8G drwxr-sr-x 6 jovyan users 4.0K May 11 13:29 CMakeFiles -rw-r--r-- 1 jovyan users 3.3K May 12 21:16 CTestTestfile.cmake -rwxr-xr-x 1 jovyan users 440M May 12 21:18 Dialect -rw-r--r-- 1 jovyan users 387 May 12 21:16 'Dialect[1]_include.cmake' -rw-r--r-- 1 jovyan users 18K May 12 21:18 'Dialect[1]_tests.cmake' -rwxr-xr-x 1 jovyan users 438M May 12 21:18 DumpLayoutTest -rw-r--r-- 1 jovyan users 415 May 12 21:16 'DumpLayoutTest[1]_include.cmake' -rw-r--r-- 1 jovyan users 2.3K May 12 21:18 'DumpLayoutTest[1]_tests.cmake' -rwxr-xr-x 1 jovyan users 446M May 12 21:18 LinearLayoutConversions -rw-r--r-- 1 jovyan users 451 May 12 21:16 'LinearLayoutConversions[1]_include.cmake' -rw-r--r-- 1 jovyan users 60K May 12 21:18 'LinearLayoutConversions[1]_tests.cmake' -rwxr-xr-x 1 jovyan users 439M May 12 21:18 TestSwizzling -rw-r--r-- 1 jovyan users 411 May 12 21:16 'TestSwizzling[1]_include.cmake' -rw-r--r-- 1 jovyan users 8.6K May 12 21:18 'TestSwizzling[1]_tests.cmake' -rw-r--r-- 1 jovyan users 1.7K May 12 21:16 cmake_install.cmake ``` New: ```bash ls build/cmake.linux-x86_64-cpython-3.10/unittest/Dialect/TritonGPU/ -lh total 853M drwxr-sr-x 6 jovyan users 4.0K May 11 13:29 CMakeFiles -rw-r--r-- 1 jovyan users 3.3K May 12 21:39 CTestTestfile.cmake -rwxr-xr-x 1 jovyan users 217M May 12 21:39 Dialect -rw-r--r-- 1 jovyan users 387 May 12 21:39 'Dialect[1]_include.cmake' -rw-r--r-- 1 jovyan users 18K May 12 21:39 'Dialect[1]_tests.cmake' -rwxr-xr-x 1 jovyan users 210M May 12 21:39 DumpLayoutTest -rw-r--r-- 1 jovyan users 415 May 12 21:39 'DumpLayoutTest[1]_include.cmake' -rw-r--r-- 1 jovyan users 2.3K May 12 21:39 'DumpLayoutTest[1]_tests.cmake' -rwxr-xr-x 1 jovyan users 217M May 12 21:39 LinearLayoutConversions -rw-r--r-- 1 jovyan users 451 May 12 21:39 'LinearLayoutConversions[1]_include.cmake' -rw-r--r-- 1 jovyan users 60K May 12 21:39 'LinearLayoutConversions[1]_tests.cmake' -rwxr-xr-x 1 jovyan users 210M May 12 21:39 TestSwizzling -rw-r--r-- 1 jovyan users 411 May 12 21:39 'TestSwizzling[1]_include.cmake' -rw-r--r-- 1 jovyan users 8.6K May 12 21:39 'TestSwizzling[1]_tests.cmake' -rw-r--r-- 1 jovyan users 1.7K May 12 21:39 cmake_install.cmake ``` Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>	1 year ago
CMakeLists.txt	[Backend][Hopper] Add a skeleton third-party warp specialization pass (#6624) Adding a skeleton third-party warp specialization pass. It currently doesn't do anything but serves as a placeholder for upcoming changes.	1 year ago
triton_nvidia.cc	[BENCH] Various fixes on `bench_mlp.py` (#7926)	9 months ago
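
The `hopper` commit message above describes sorting channels by the position of their op in a block without measuring iterator distances inside the comparator. The following is a minimal Python sketch of that general idea, not the actual C++ change: `Channel`, `sort_channels_by_op_order`, and the op names are hypothetical and exist only for illustration. One linear pass records each op's index, and the sort then compares plain integers, which is where the `O(n log n + block_size)` shape comes from.

```python
from collections import namedtuple

# Hypothetical stand-in for a channel: it only records the op it is attached to.
Channel = namedtuple("Channel", ["name", "op"])


def sort_channels_by_op_order(all_ops, channels):
    """Return channels ordered by where their op appears in all_ops."""
    # One linear pass over the block records each op's position.
    position = {op: index for index, op in enumerate(all_ops)}
    # sorted() is stable, so channels attached to the same op keep their input order.
    return sorted(channels, key=lambda ch: position[ch.op])


# Illustrative block of ops and a few channels listed out of order.
all_ops = ["load_a", "load_b", "dot", "store"]
channels = [Channel("c2", "store"), Channel("c0", "load_a"), Channel("c1", "dot")]

ordered = sort_channels_by_op_order(all_ops, channels)
print([ch.name for ch in ordered])
```

Because the key is a precomputed integer, the result does not depend on the order in which the comparator happens to visit pairs, so repeated runs give the same channel order.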