File | Last commit message | Last updated
[IMP][Launch Latency] reduce launch overhead for tensor descriptors (#7987)

Currently (numbers below from current main), the overhead of launching kernels with TensorDescriptor inputs is much larger than that of the corresponding kernel with Tensor inputs.

| CI-Node | tensor | descriptor |
|---------|--------|------------|
| A100 | 22.7us | 35.0us |
| H100 | 16.1us | 37.9us |
| B200 | 15.5us | 36.8us |

To some extent, this is expected, as TensorDescriptor will eventually be decomposed, meaning the underlying CUDA kernel is launched with potentially more inputs. However, the difference should not be that large. An inspection of where this difference comes from points to

- larger overheads in argument specialization (#7771 will help here, too)
- repeated creation of TMADescKernelParam objects, incl. using torch.empty(TMA_DESC_SIZE) to get 128B containers for a CUtensorMap and pointer extraction for fill_tma_descriptor

The latter can be addressed by exposing CUtensorMap as a custom Python type. fill_tma_descriptor can then simply create a new object of the right size and return this wrapper around CUtensorMap, which later also simplifies extracting the CUtensorMap in the launch code.

With these changes, overhead for the benchmark with TensorDescriptor inputs should go down by roughly 12us.

---------

Co-authored-by: Peter Bell <peterbell10@openai.com>

9 months ago
[Hopper][WS] Fix UB in channel sorting (#7828)

In ~50% of test runs I was getting failures in test/Hopper/WarpSpecialization/ws_code_partition.mlir's _matmul_layernorm_persistent_one_producer_one_consumer_one_epilog test, with the following error:

> assertion failed at third_party/llvm/llvm-project/llvm/include/llvm/ADT/DenseMap.h:1290 in pointer llvm::DenseMapIterator<mlir::Channel *, llvm::SmallVector<mlir::Channel *>, llvm::DenseMapInfo<mlir::Channel *>, llvm::detail::DenseMapPair<mlir::Channel *, llvm::SmallVector<mlir::Channel *>>, true>::operator->() const [KeyT = mlir::Channel *, ValueT = llvm::SmallVector<mlir::Channel *>, KeyInfoT = llvm::DenseMapInfo<mlir::Channel *>, Bucket = llvm::detail::DenseMapPair<mlir::Channel *, llvm::SmallVector<mlir::Channel *>>, IsConst = true]: Ptr != End && "dereferencing end() iterator"

It turns out the channel order in the groups was nondeterministic between runs, which caused inconsistent / unexpected group keys. I tracked this down to a UB issue in sorting the channels, where the llvm::ilist_iterator returned by allOps.begin() / end() was only a bidirectional iterator, and not a random access iterator. Using std::distance(a, b) on non-random-access iterators [is UB if a comes after b](https://en.cppreference.com/w/cpp/iterator/distance.html).

The shortest fix for this would have been to change the return condition to `std::distance(allOps.begin(), itrA) < std::distance(allOps.begin(), itrB)`, but I found the entire implementation of that sort quite difficult to reason about due to nested lambdas and algorithms. Instead, I rewrote the sort so it's quite a bit shorter and has less nesting, and it should also have lower complexity (`O(n log n + block_size)` vs. `O(n log n * block_size)` previously).

9 months ago
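The idea behind that rewrite can be illustrated with a minimal sketch. This is not the Triton code: `Op`, `sortByListOrder`, and the use of `std::list` as a stand-in for a bidirectional-only container such as `llvm::ilist` are all assumptions made for illustration. Instead of calling `std::distance` inside the comparator, the sketch walks the list once to record each element's position, then sorts with constant-time lookups, so the comparator never asks for a distance between iterators in the wrong order.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an operation stored in a bidirectional-only list.
struct Op {
  int id;
};

// Returns `subset` sorted by the position of each element in `allOps`.
// Every pointer in `subset` is assumed to point at an element of `allOps`.
std::vector<const Op *> sortByListOrder(const std::list<Op> &allOps,
                                        std::vector<const Op *> subset) {
  // One linear pass over the list: element address -> position.
  std::unordered_map<const Op *, std::size_t> position;
  position.reserve(allOps.size());
  std::size_t index = 0;
  for (const Op &op : allOps)
    position[&op] = index++;

  // The comparator only compares precomputed positions, so it is a valid
  // strict weak ordering and never calls std::distance on list iterators.
  std::sort(subset.begin(), subset.end(), [&](const Op *a, const Op *b) {
    return position.at(a) < position.at(b);
  });
  return subset;
}
```

With n elements in the subset and a list of length m, this costs one O(m) pass plus an O(n log n) sort, rather than an O(m) walk per comparison.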
[BENCH] Various fixes on bench_mlp.py (#7926) 9 months ago
[FRONTEND] Fix PDL issue (#7379) 10 months ago
[BACKEND] Add bar.sync before deallocating tmem (#7994)

Without a barrier, some warp may deallocate tmem while it is still in use, allowing some other block to overwrite it.

9 months ago
[PROTON] Intra kernel profiling (#7258)

### Instrumentation & Runtime

- Introduce a dedicated instrumentation mode
  - proton.start(..., mode="instrumentation", ...)
- Introduce both high- and low-level scope APIs
  - For Gluon DSL: pl.scope, pl.enter_scope, and pl.exit_scope. Profiling API for Triton DSL is disabled by default.
  - For TTGIR: proton.record start and proton.record end
- Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host

### Proton Dialect & Lowering

- Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy.

### Tracing

- proton.start(..., data="trace", ...) is supported for both fine- and coarse-grained events.

---------

Co-authored-by: Yuanwei Fang <fywkevin@gmail.com>
Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>
Co-authored-by: peterbell10 <peterbell10@openai.com>
Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com>
Co-authored-by: Corbin Robeck <robeck@meta.com>
Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com>
Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com>
Co-authored-by: Shawn Zhong <github@shawnzhong.com>
Co-authored-by: Shawn Zhong <shawnzhong@fb.com>
Co-authored-by: 鐘天楽 <a844379248@icloud.com>

9 months ago
[BUILD] Speedup unittests linking (#6796)

Comparison of the sizes of build artifacts:

Old:

```bash
ls build/cmake.linux-x86_64-cpython-3.10/unittest/Dialect/TritonGPU/ -lh
total 1.8G
drwxr-sr-x 6 jovyan users 4.0K May 11 13:29 CMakeFiles
-rw-r--r-- 1 jovyan users 3.3K May 12 21:16 CTestTestfile.cmake
-rwxr-xr-x 1 jovyan users 440M May 12 21:18 Dialect
-rw-r--r-- 1 jovyan users  387 May 12 21:16 'Dialect[1]_include.cmake'
-rw-r--r-- 1 jovyan users  18K May 12 21:18 'Dialect[1]_tests.cmake'
-rwxr-xr-x 1 jovyan users 438M May 12 21:18 DumpLayoutTest
-rw-r--r-- 1 jovyan users  415 May 12 21:16 'DumpLayoutTest[1]_include.cmake'
-rw-r--r-- 1 jovyan users 2.3K May 12 21:18 'DumpLayoutTest[1]_tests.cmake'
-rwxr-xr-x 1 jovyan users 446M May 12 21:18 LinearLayoutConversions
-rw-r--r-- 1 jovyan users  451 May 12 21:16 'LinearLayoutConversions[1]_include.cmake'
-rw-r--r-- 1 jovyan users  60K May 12 21:18 'LinearLayoutConversions[1]_tests.cmake'
-rwxr-xr-x 1 jovyan users 439M May 12 21:18 TestSwizzling
-rw-r--r-- 1 jovyan users  411 May 12 21:16 'TestSwizzling[1]_include.cmake'
-rw-r--r-- 1 jovyan users 8.6K May 12 21:18 'TestSwizzling[1]_tests.cmake'
-rw-r--r-- 1 jovyan users 1.7K May 12 21:16 cmake_install.cmake
```

New:

```bash
ls build/cmake.linux-x86_64-cpython-3.10/unittest/Dialect/TritonGPU/ -lh
total 853M
drwxr-sr-x 6 jovyan users 4.0K May 11 13:29 CMakeFiles
-rw-r--r-- 1 jovyan users 3.3K May 12 21:39 CTestTestfile.cmake
-rwxr-xr-x 1 jovyan users 217M May 12 21:39 Dialect
-rw-r--r-- 1 jovyan users  387 May 12 21:39 'Dialect[1]_include.cmake'
-rw-r--r-- 1 jovyan users  18K May 12 21:39 'Dialect[1]_tests.cmake'
-rwxr-xr-x 1 jovyan users 210M May 12 21:39 DumpLayoutTest
-rw-r--r-- 1 jovyan users  415 May 12 21:39 'DumpLayoutTest[1]_include.cmake'
-rw-r--r-- 1 jovyan users 2.3K May 12 21:39 'DumpLayoutTest[1]_tests.cmake'
-rwxr-xr-x 1 jovyan users 217M May 12 21:39 LinearLayoutConversions
-rw-r--r-- 1 jovyan users  451 May 12 21:39 'LinearLayoutConversions[1]_include.cmake'
-rw-r--r-- 1 jovyan users  60K May 12 21:39 'LinearLayoutConversions[1]_tests.cmake'
-rwxr-xr-x 1 jovyan users 210M May 12 21:39 TestSwizzling
-rw-r--r-- 1 jovyan users  411 May 12 21:39 'TestSwizzling[1]_include.cmake'
-rw-r--r-- 1 jovyan users 8.6K May 12 21:39 'TestSwizzling[1]_tests.cmake'
-rw-r--r-- 1 jovyan users 1.7K May 12 21:39 cmake_install.cmake
```

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

1 year ago
[Backend][Hopper] Add a skeleton third-party warp specialization pass (#6624)

Adding a skeleton third-party warp specialization pass. It currently doesn't do anything but serves as a placeholder for upcoming changes.

1 year ago
[BENCH] Various fixes on bench_mlp.py (#7926) 9 months ago