
Tutorials
=========

Below is a gallery of tutorials on writing basic operations with Triton. We recommend reading them in order, starting with the simplest one.

To install the dependencies for the tutorials:

.. code-block:: bash

    cd triton
    pip install -e './python[tutorials]'
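
After installing, a quick import check can confirm that the environment is ready before running a tutorial. The snippet below is a minimal sketch, not part of the Triton repository: the helper ``missing_packages`` is hypothetical, and the package list is an assumption about what the tutorials import, so adjust it to match your checkout.

.. code-block:: python

    import importlib.util

    # Assumed package list, for illustration only; the real set of
    # tutorial dependencies is defined by the repository's packaging files.
    ASSUMED_PACKAGES = ("triton", "torch", "matplotlib", "pandas", "tabulate")


    def missing_packages(names):
        """Return the top-level package names that cannot be located."""
        return [name for name in names if importlib.util.find_spec(name) is None]


    if __name__ == "__main__":
        missing = missing_packages(ASSUMED_PACKAGES)
        if missing:
            print("Missing packages:", ", ".join(missing))
        else:
            print("All assumed tutorial dependencies are importable.")

The check only locates each package without importing it, so it stays fast and does not require a GPU to be present.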