GGitHub[proton] Add missing #include. (#8008 )

35695e67创建于 2025年8月29日历史提交

文件	最后提交记录	最后更新时间
Dialect	[Backend][NFC] Switch some inline PTX to NVVM ops/intrinsics (#7725)	9 个月前
common	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low- level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 个月前
csrc	[proton] Add missing #include. (#8008)	8 个月前
proton	[PROTON] Actively initialize main thread context (#7884) Initialize mainContextStack in the ShadowContextSource constructor and mark the thread context as the main for the current session. This guarantees a single context even when the actual main thread doesn’t encounter any scoped regions. Previously, if the main thread never entered a scope, the first worker thread to encounter a scope could “win” the race to create the main context stack. That led to an incorrect context tree and unstable ordering. Before (incorrect): threadA_0 - threadB_0 - threadB_1 threadA_0 threadA_1 After (correct): threadB_0 threadB_1 threadA_0 threadA_1	9 个月前
scripts	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low- level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 个月前
test	[PROTON] Actively initialize main thread context (#7884) Initialize mainContextStack in the ShadowContextSource constructor and mark the thread context as the main for the current session. This guarantees a single context even when the actual main thread doesn’t encounter any scoped regions. Previously, if the main thread never entered a scope, the first worker thread to encounter a scope could “win” the race to create the main context stack. That led to an incorrect context tree and unstable ordering. Before (incorrect): threadA_0 - threadB_0 - threadB_1 threadA_0 threadA_1 After (correct): threadB_0 threadB_1 threadA_0 threadA_1	9 个月前
tutorials	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low- level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 个月前
.gitignore	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low- level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 个月前
CMakeLists.txt	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low- level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 个月前
README.md	[PROTON] Intra kernel profiling (#7258) ### Instrumentation & Runtime - Introduce a dedicated instrumentation mode - `proton.start(..., mode="instrumentation", ...)` - Introduce both high- and low- level scope APIs - For Gluon DSL: `pl.scope`, `pl.enter_scope`, and `pl.exit_scope`. Profiling API for Triton DSL is disabled by default. - For TTGIR: `proton.record start` and `proton.record end` - Inject profiling buffers for each triton kernel at codegen time and pass them to the proton runtime so kernels can push data directly from the device to the host ### Proton Dialect & Lowering - Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy. ### Tracing - `proton.start(..., data="trace", ...)` is supported for both fine- and coarse-grained events. --------- Co-authored-by: Yuanwei Fang <fywkevin@gmail.com> Co-authored-by: Yuanwei Fang <fywkevin@fb.com> Co-authored-by: Corbin Robeck <corbin.robeck@amd.com> Co-authored-by: peterbell10 <peterbell10@openai.com> Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com> Co-authored-by: Corbin Robeck <robeck@meta.com> Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com> Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Shawn Zhong <shawnzhong@fb.com> Co-authored-by: 鐘天楽 <a844379248@icloud.com>	9 个月前

Proton - A Profiler for Triton

Introduction

Proton is a lightweight profiler for Triton, designed to be used for code written in Python and to invoke underlying GPU kernels. Proton provides insightful information about the program context, metadata, and hardware performance metrics of the GPU kernels invoked.

Installation

The following command installs the latest version of Proton.

git clone https://github.com/triton-lang/triton
cd triton/python
pip install .

To not build Proton, you can set the TRITON_BUILD_PROTON environment variable to OFF:

TRITON_BUILD_PROTON=OFF pip install .

Usage

Basic usage

More examples can be found in the tutorials directory.

Proton can be used to profile functions and regions in Python code.

The following examples demonstrate how to use Proton to profile a simple Python function.

import triton.profiler as proton

# name: The path to the profile data
# context: The method used to annotate the context of each GPU kernel. Currently, "shadow" and "python" are supported.
session_id = proton.profile(func, name="profile_name", context="python")(args)

The following examples demonstrate how to use Proton to profile a region in Python code.

session_id = proton.start(name="profile_name", context="python")
...
# Skip a region
proton.deactivate(session_id)
...
# Restart profiling
proton.activate(session_id)
...
# Write out the profile data and finalize the profiler
proton.finalize()

Scope

Unlike the python context that provide users with files, functions, and lines where the GPU kernels are invoked, the shadow context provides users with the annotated regions in the code. The following example demonstrates how to use the shadow context.

import triton.profiler as proton


session_id = proton.start(name="profile_name", context="shadow")

with proton.scope("test0"):
    with proton.scope("test1"):
        foo[1,](x, y)
with proton.scope("test2"):
    foo[1,](x, y)

...
proton.finalize()

The scope utility also accepts flexible metrics, provided with a dictionary that maps from a string (metric name) to a value (int or float). Proton will aggregate the metrics for each scope and write them to the profile data. It is useful for users to understand the performance of the model at a high level.

with proton.scope("test0", {"bytes": 1000}):
    with proton.scope("test1", {"bytes": 2000}):
        foo[1,](x, y)
with proton.scope("test2", {"bytes": 3000}):
    foo[1,](x, y)

Backend and mode

Proton supports three profiling backends: cupti, roctracer, and instrumentation.

cupti: Used for NVIDIA GPUs. It supports both the default profiling mode and pcsampling (instruction sampling).
roctracer: Used for AMD GPUs. It supports only the default profiling mode.
instrumentation: Available on both NVIDIA and AMD GPUs, this backend enables collection of custom metrics and advanced instrumentation.

By default, Proton automatically selects either cupti or roctracer as the backend based on your GPU driver. The instrumentation backend offers a wide range of mode options for fine-grained profiling, as detailed in the mode.py file.

Instruction Sampling

Proton supports instruction sampling on NVIDIA GPUs. You may experience ~20x end-to-end overhead when using instruction sampling, although the overhead for each individual GPU kernel is negligible. The overhead is mostly caused by data transfer and processing on the CPU. Additionally, the proton-viewer options -i <regex> -d <depth> -t <threshold> can be helpful for filtering out GPU kernels that are not of interest. The following example demonstrates how to use instruction sampling:

import triton.profiler as proton

proton.start(name="profile_name", context="shadow", backend="cupti_pcsampling")

Instrumentation

The instrumentation backend allows for detailed, fine-grained profiling of intra-kernel behavior, generating trace or tree views similar to those produced by coarse-grained profiling. By default, if no mode is specified, Proton profiles kernel cycles, which may require shared memory. If there is insufficient shared memory, profiling will abort and a warning will be displayed. Future releases will introduce additional instrumentation modes.

Host-side usage:

import triton.profiler as proton

proton.start(
    name="profile_name",
    backend="instrumentation",
    mode="<mode0>=<option0>:<mode1>=<option1>:..."
)

Kernel-side usage:

Caution: For DSL level instrumentation, only Gluon semantic is enabled by default. Instrumenting kernels written in Triton DSL is disable because Triton's higher-level IR undergoes aggressive compiler rewrites (loop pipelining, instruction re-ordering, IR duplication, etc.). These transformations can invalidate naïve instrumentation and lead to misleading results.

from triton.experimental import gluon
from triton.experimental.gluon import language as gl

import triton.profiler.language as pl

@gluon.jit
def kernel(...):
    pl.enter_scope("scope0")
    for i in range(iters):
        gl.load(...)
    pl.exit_scope("scope0")
    with pl.scope("scope1"):
        for i in range(iters):
            gl.load(...)

Advanced users can instrument either the ttir or ttgir intermediate representations for even finer-grained measurement. The relevant IR instructions are proton.record start and proton.record end. This can be combined with the environment variable TRITON_KERNEL_OVERRIDE=1 for custom kernel overrides. For detailed steps, refer to the Triton documentation under the Kernel Override Steps section. We have also assembled a tutorial that demonstrates how to use the IR-based instrumentation.

Merging profiles for postmortem analysis

We could use concurrent sessions to profile the same code region using different backends, and then merge the profiles using hatchet for postmortem analysis. In the following example, the cupti backend obtains different metrics than the instrumentation backend, and thus it makes sense to merge them using GraphFrame.add directly. Otherwise, if there are duplicate metrics, we could customize the merge logic or manipulate the dataframes.


import triton.profiler as proton

proton.start(name="profile_name0", context="shadow", backend="cupti")
proton.start(name="profile_name1", context="shadow", backend="instrumentation")

...

proton.finalize()

Hook

import triton.profiler as proton
from typing import NamedTuple

# hook: When hook="triton", it enables proton to invoke launch_metadata function before launching the GPU kernel
proton.start("profile_name", hook="triton")

def metadata_fn(
    grid: tuple,
    metadata: NamedTuple,
    args: dict
):
    return {"name": "<kernel_name>", "flops8": 1.0}

@triton.jit(launch_metadata=metadata_fn)
def foo(x, y):
    tl.store(y, tl.load(x))

The metadata_fn function is called before launching the GPU kernel to provide metadata for the GPU kernel, which returns a dictionary that maps from a string (metadata name) to a value (int or float).

Currently, only the launch hook is supported. In the dictionary returned by the metadata_fn function, we can supply the following keys:

name: str  # The name of the kernel
flops8: float  # The number of 8-bit floating-point operations
flops16: float  # The number of 16-bit floating-point operations
flops32: float  # The number of 32-bit floating-point operations
flops64: float  # The number of 64-bit floating-point operations
bytes: int  # The number of bytes expected to be transferred

Command line

Proton can be used as a command-line tool to profile Python scripts and Pytest tests. The following examples demonstrate how to use Proton command-line.

proton [options] script.py [script_args] [script_options]
proton [options] pytest [pytest_args] [script_options]
python -m triton.profiler.proton [options] script.py [script_args] [script_options]
proton --instrument=[instrumentation pass] script.py

When profiling in the command line mode, the proton.start and proton.finalize functions are automatically called before and after the script execution. Any proton.start and proton.finalize functions in the script are ignored. Also, in the command line mode, only a single session is supported. Therefore, proton.deactivate(session_id=1) is invalid, while proton.deactivate(session_id=0) is valid.

Visualizing the profile data

By default, proton profiles are in the json format and can be read by Hatchet. The following command visualizes the profile data on terminal.

pip install llnl-hatchet
proton-viewer -m time/s <profile.hatchet>

NOTE: pip install hatchet does not work because the API is slightly different.

If you want to dump the entire trace but not just the aggregated data, you should set the data option to trace when starting the profiler.

import triton.profiler as proton

proton.start(name="profile_name", data="trace")

The dumped trace will be in the chrome trace format and can be visualized using the chrome://tracing tool in Chrome or the perfetto tool.

Visualizing sorted profile data

In addition visualizing the profile data on terminal through Hatchet. A sorted list of the kernels by the first metric can be done using the --print-sorted flag with proton-viewer

proton-viewer -m time/ns,time/% <profile.hatchet> --print-sorted

prints the sorted kernels by the time/ns since it is the first listed.

More options can be found by running the following command.

proton-viewer -h

Advanced features and knowledge

Thread management

We guarantee that any call to libproton.so, such as enter_scope, is synchronized using explicit locks. For operations that do not trigger calls to libproton.so—including callbacks to CUDA/HIP APIs—we use separated locks to protect data structures that may be accessed concurrently by multiple threads. For example, the enter_op method in OpInterface can be invoked by the main thread that involves triton operators, as well as by helper threads that invoke torch operators.

`cpu_timed_scope`

cpu_timed_scope is a utility that wraps scope to measure the CPU time of a scope along with other metrics. The following example demonstrates how to use cpu_timed_scope:

import triton.profiler as proton

with proton.cpu_timed_scope("test"):
    foo[1,](x, y)

The cpu_timed_scope output metric is referred to as cpu_time, while time represents accelerator (e.g., GPU) time. The key distinction between cpu_time and time lies in their inclusivity: cpu_time is exclusive, whereas time is inclusive. This difference arises because the time spent on individual kernels represents the smallest measurable time granularity, and each kernel is mutually exclusive. This exclusivity allows time to be accurately accumulated across parent scopes for time. In contrast, cpu_time measures the time within a specific scope. Since a parent scope encompasses the time spent in its child scopes, summing cpu_time from child scope into parent scope would result in double counting. To visualize both the CPU and GPU time, we can use the following command:

proton-viewer -m time/ns,cpu_time/ns <proton.hatchet>

Metrics naming

Custom metrics should follow this format: metric_name (unit) (type). We prefer no space within the metric name. unit and type are optional fields.

There are three types of metrics in proton: inclusive, exclusive, and property metrics. By default, a metric is inclusive. The metric types are distinguished by the suffix of their names. The following table shows the suffix for each type and its meaning:

Suffix	Name	Meaning
(inc) or ""	Inclusive metric	The metric is accumulated at a scope and can be propagated to the parent scope.
(exc)	Exclusive metric	The metric is accumulated at a scope and cannot be propagated to the parent scope.
(pty)	Property metric	The metric is a property of the scope and cannot be accumulated or propagated.

State annotation

In addition to proton.scope, we can also customize the call path of each GPU operation using proton.state.

state is different from scope in several ways:

State is not recursive; each operation can have only a single state. Inner most state will overwrite the outer most state.
A states is a suffix, meaning that the original call path will append a state above the name of each kernel.
State is compatible with both Python and shadow contexts.

The following example demonstrates a basic use of state:

with proton.scope("test"):
    with proton.state("state0"):
        with proton.scope("test0"):
            foo0[1,](x, y)
        with proton.scope("test1"):
            foo1[1,](x, y)

The call path of foo1 will be test->test1->state0.

Proton vs nsys

Runtime overhead (up to 1.5x)

Proton has a lower profiling overhead than nsys. Even for workload with a large number of small GPU kernels, proton triggers less than ~1.5x overhead.

For GPU-bound workload, both proton and nsys has similar overhead, with little impact on the workload.

The lower overhead of proton is due to its less profiling metrics and callbacks compared to nsys.

Profile size (significantly smaller than nsys)

nsys traces and records every GPU kernel, while proton aggregates the metrics of GPU kernels under the same calling context.

As a result, proton's profile size can be up to thousands of times smaller than nsys's profile size, depending on the running time.

Portability (support different GPUs)

Proton is designed to be portable and can be used on AMD GPUs. nsys only supports NVIDIA GPUs.

Insights (more insightful than nsys on triton kernels)

Proton can register hooks to analyze the metadata of triton kernels, while nsys cannot. Note that the hooks do add additional overhead to proton.

Proton vs ncu

Similar to the comparison between Proton and Nsight Systems (Nsys), Proton has a lower profiling overhead than Nsight Compute (NCU). We also plan to support instruction sampling on AMD GPUs. However, Nsight Compute supports the collection of more detailed metrics than Proton, such as memory access patterns, memory transactions, and other instruction-level metrics. In contrast, Proton only supports instruction sampling and is designed to be lightweight and portable.

Known issues

CUDA graph

hooks cannot be used to accurately accumulate the number of FLOPs in CUDA graph mode profiling because kernels are captured and launched separately; metrics are not accumulated when kernels are launched in graph mode. This issue can be circumvented by using scope to supply FLOPs.

If profiling is initiated after CUDA graph capturing, there may be minor memory leak issues. This is because the number of kernels in a graph instance (i.e., cuGraphExec) is unknown, preventing the deletion of mappings between the kernel ID and the graph ID.

Instruction sampling

If you encounter permission related problems when using instruction sampling, you can lookup this page for help.

The overhead of instruction sampling on NVIDIA GPUs is about 20x using Proton because we haven't enabled continuous sampling yet. Continuous sampling can allow for more runtime optimizations, but it makes it more challenging to attribute performance data back to the GPU kernels because: (1) it enables profiling of concurrent kernels, (2) it doesn't allow profiling of time and instruction samples simultaneously, and (3) it works best if we have a separate thread dedicated to attributing instruction samples to the GPU kernels

Visible devices on AMD GPUs

Environment variables such as HIP_VISIBLE_DEVICES, and CUDA_VISIBLE_DEVICES are not supported on AMD GPUs. Once it's set, we cannot find a valid mapping between the device ID returned by RocTracer and the physical device ID. Instead, ROCR_VISIBLE_DEVICES is recommended to be used.