Tutorials
=========

Below is a gallery of tutorials for writing basic operations with Triton. We recommend reading them in order, starting with the simplest one.

To install the dependencies for the tutorials:

.. code-block:: bash

    cd triton
    pip install -e './python[tutorials]'
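
After installing, a quick sanity check can confirm that the packages the tutorials rely on are importable. The following is a minimal sketch, not part of Triton itself: the helper ``missing_packages`` is hypothetical, and the package list is an assumption about what the ``tutorials`` extra and the tutorial scripts need, so adjust it to match your checkout.

```python
import importlib.util

# Assumed package list (not taken from the README); edit to match your setup.
ASSUMED_PACKAGES = ["triton", "torch", "matplotlib", "pandas", "tabulate"]


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All assumed tutorial dependencies are importable.")
```

The check only looks packages up without importing them, so it runs quickly and does not require a GPU.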