[IMP][Launch Latency] reduce launch overhead for tensor descriptors (#7987)
Currently (numbers below from current main), the overhead of launching
kernels with TensorDescriptor inputs is much larger than the
corresponding kernel with Tensor inputs.
| CI-Node | tensor | descriptor |
|---------|--------|------------|
| A100 | 22.7us | 35.0us |
| H100 | 16.1us | 37.9us |
| B200 | 15.5us | 36.8us |
To some extent this is expected: a TensorDescriptor is eventually
decomposed, so the underlying CUDA kernel is launched with potentially
more inputs. The difference should not be that large, though.
Inspecting where this difference comes from points to
- larger overheads in argument specialization (#7771 will help here,
too)
- repeated creation of TMADescKernelParam objects, incl. using
torch.empty(TMA_DESC_SIZE) to get 128B containers for a CUtensorMap
and pointer extraction for fill_tma_descriptor
The latter can be addressed by exposing CUtensorMap as a custom Python
type. fill_tma_descriptor can then simply create a new object of the
right size and return this wrapper around CUtensorMap, which later also
simplifies extracting the CUtensorMap in the launch code. With these
changes, overhead for the benchmark with TensorDescriptor inputs
should go down by roughly 12us.
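As a rough illustration of the wrapper idea (not the code from this change), the sketch below keeps the 128B descriptor inline in a small object that the fill function returns by value. `TensorMapBlob`, `TensorMapObject` and `fillDescriptor` are hypothetical names, the 64-byte alignment is an assumption, and the driver's real encode call is replaced by a placeholder that only stores the base address.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for CUtensorMap: an opaque 128-byte blob.
// The 64-byte alignment is an assumption of this sketch.
struct alignas(64) TensorMapBlob {
  std::uint64_t opaque[16]; // 16 * 8 bytes = 128 bytes
};
static_assert(sizeof(TensorMapBlob) == 128, "descriptor must be 128 bytes");

// Hypothetical wrapper that owns the descriptor bytes inline, so the launch
// path needs no separate buffer allocation and no pointer extraction.
struct TensorMapObject {
  TensorMapBlob map;
};

// Stand-in for fill_tma_descriptor: builds the wrapper and returns it by
// value. A real implementation would call the driver's encode function here;
// this placeholder only records the base address in the first bytes.
TensorMapObject fillDescriptor(const void *globalAddress) {
  assert(globalAddress != nullptr);
  TensorMapObject obj{}; // zero-initialized
  std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(globalAddress);
  std::memcpy(obj.map.opaque, &addr, sizeof(addr));
  return obj;
}
```

The launch code can then pass `obj.map` directly instead of allocating a tensor and extracting its data pointer on every launch.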
---------
Co-authored-by: Peter Bell <peterbell10@openai.com>
[Hopper][WS] Fix UB in channel sorting (#7828)
In ~50% of test runs I was getting failures in
test/Hopper/WarpSpecialization/ws_code_partition.mlir's
_matmul_layernorm_persistent_one_producer_one_consumer_one_epilog
test, with the following error:
> assertion failed at
third_party/llvm/llvm-project/llvm/include/llvm/ADT/DenseMap.h:1290 in
pointer llvm::DenseMapIterator<mlir::Channel *,
llvm::SmallVector<mlir::Channel *>, llvm::DenseMapInfo<mlir::Channel *>,
llvm::detail::DenseMapPair<mlir::Channel *,
llvm::SmallVector<mlir::Channel *>>, true>::operator->() const [KeyT =
mlir::Channel *, ValueT = llvm::SmallVector<mlir::Channel *>, KeyInfoT =
llvm::DenseMapInfo<mlir::Channel *>, Bucket =
llvm::detail::DenseMapPair<mlir::Channel *,
llvm::SmallVector<mlir::Channel *>>, IsConst = true]: Ptr != End &&
"dereferencing end() iterator"
It turns out the channel order in the groups was nondeterministic between
runs, which caused inconsistent / unexpected group keys. I tracked this
down to a UB issue in sorting the channels, where the
llvm::ilist_iterator returned by allOps.begin() / end() was only a
bidirectional iterator, and not a random access iterator.
Using std::distance(a, b) on non-random-access iterators
[is UB if a comes after b](https://en.cppreference.com/w/cpp/iterator/distance.html).
The shortest fix for this would have been to change the return condition to
`std::distance(allOps.begin(), itrA) < std::distance(allOps.begin(), itrB)`,
but I found the entire implementation of that sort quite difficult to
reason about due to nested lambdas and algorithms.
Instead, I rewrote the sort so that it is quite a bit shorter and less
nested, and it should also have lower complexity (`O(n log n + block_size)`
vs. `O(n log n * block_size)` previously).
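One way to reach that complexity, sketched here under assumptions and not necessarily how the rewrite is implemented, is to compute each op's position in a single pass over the block and then sort by the cached positions. `Op`, `Channel` and `sortChannelsByOpOrder` are hypothetical stand-ins, and `std::list` takes the place of `llvm::ilist` because its iterators are also only bidirectional.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins: Op models an operation in a block, Channel models
// a channel that refers to one op. Neither mirrors the real Triton classes.
struct Op { int id; };
struct Channel { const Op *op; };

// Sorts channels by the position of their op in the block.
// Positions are computed in one linear pass, so the comparator never calls
// std::distance on bidirectional iterators.
void sortChannelsByOpOrder(const std::list<Op> &allOps,
                           std::vector<Channel *> &channels) {
  std::unordered_map<const Op *, std::size_t> position;
  std::size_t index = 0;
  for (const Op &op : allOps)
    position[&op] = index++;

  // Every channel must refer to an op of this block.
  for (const Channel *c : channels)
    assert(position.count(c->op) == 1);

  // stable_sort keeps channels on the same op in their original order,
  // which makes the result deterministic between runs.
  std::stable_sort(channels.begin(), channels.end(),
                   [&](const Channel *a, const Channel *b) {
                     return position.at(a->op) < position.at(b->op);
                   });
}
```

The single pass costs `O(block_size)` and the sort costs `O(n log n)` comparisons, each an expected constant-time hash lookup.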
[PROTON] Intra kernel profiling (#7258)
### Instrumentation & Runtime
- Introduce a dedicated instrumentation mode
  - proton.start(..., mode="instrumentation", ...)
- Introduce both high- and low-level scope APIs
  - For Gluon DSL: pl.scope, pl.enter_scope, and pl.exit_scope.
    Profiling API for Triton DSL is disabled by default.
  - For TTGIR: proton.record start and proton.record end
- Inject profiling buffers for each Triton kernel at codegen time and
  pass them to the Proton runtime so kernels can push data directly from
  the device to the host
### Proton Dialect & Lowering
- Add Proton → ProtonGPU → LLVM pipelines, including passes for
shared-memory allocation, profile scratch allocation, and a few
optimizations for reduced overhead or improved accuracy.
### Tracing
- proton.start(..., data="trace", ...) is supported for both fine- and
coarse-grained events.
---------
Co-authored-by: Yuanwei Fang <fywkevin@gmail.com>
Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>
Co-authored-by: peterbell10 <peterbell10@openai.com>
Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com>
Co-authored-by: Corbin Robeck <robeck@meta.com>
Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com>
Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com>
Co-authored-by: Shawn Zhong <github@shawnzhong.com>
Co-authored-by: Shawn Zhong <shawnzhong@fb.com>
Co-authored-by: 鐘天楽 <a844379248@icloud.com>