[Proton][AMD] Fix peak TB/s and support gfx950 specs (#7175)
The formula 2 * bus_width * memory_clock_rate * 1e3 / 8
does not yield the correct peak TB/s on AMD devices, where
the calculation is more involved.
For now, hardcode the peak TB/s values to get correct
results and unblock gfx950 support.
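The approach can be sketched as a per-architecture override table that
takes precedence over the generic formula. This is a minimal
illustration, not the code from the commit: the function names, the
architecture key, and the override value are hypothetical, and the units
(bus width in bits, memory clock in kHz) are an assumption.

```python
def formula_peak_bytes_per_s(bus_width_bits, memory_clock_khz):
    """Generic estimate: 2 * bus_width * memory_clock_rate * 1e3 / 8.

    Assumes bus width in bits and memory clock in kHz, so the result
    is in bytes per second.
    """
    return 2 * bus_width_bits * memory_clock_khz * 1e3 / 8


def peak_tbytes_per_s(arch, bus_width_bits, memory_clock_khz, overrides):
    """Return peak bandwidth in TB/s, preferring a hardcoded override."""
    if arch in overrides:
        # Hardcoded value for architectures where the formula is wrong.
        return overrides[arch]
    return formula_peak_bytes_per_s(bus_width_bits, memory_clock_khz) / 1e12


# Hypothetical architecture name and value, for illustration only.
example_overrides = {"example_arch": 1.5}
overridden = peak_tbytes_per_s("example_arch", 8, 1000, example_overrides)
computed = peak_tbytes_per_s("other_arch", 8, 1000, example_overrides)
```

Keeping the overrides in a table means a new architecture only needs a
new entry, while devices without an entry still fall back to the formula.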
[PROTON] Intra kernel profiling (#7258)
### Instrumentation & Runtime
- Introduce a dedicated instrumentation mode
- proton.start(..., mode="instrumentation", ...)
- Introduce both high- and low-level scope APIs
- For Gluon DSL: pl.scope, pl.enter_scope, and pl.exit_scope.
Profiling API for Triton DSL is disabled by default.
- For TTGIR: proton.record start and proton.record end
- Inject profiling buffers for each Triton kernel at codegen time and
pass them to the Proton runtime so kernels can push data directly from
the device to the host
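As a rough mental model of the paired start/end records described above,
the following toy host-side sketch appends records to a buffer and then
derives per-scope durations. It does not use the Proton API: the
`record` and `scope_durations` helpers, the tuple layout, and the cycle
values are all hypothetical.

```python
def record(buffer, name, is_start, cycle):
    """Append one (scope name, start/end flag, cycle count) record."""
    buffer.append((name, is_start, cycle))


def scope_durations(buffer):
    """Pair each end record with its start and sum cycles per scope."""
    open_scopes = {}
    durations = {}
    for name, is_start, cycle in buffer:
        if is_start:
            open_scopes[name] = cycle
        else:
            start = open_scopes.pop(name)
            durations[name] = durations.get(name, 0) + (cycle - start)
    return durations


# Made-up cycle counts standing in for a device clock.
profile_buffer = []
record(profile_buffer, "load", True, 10)
record(profile_buffer, "load", False, 25)
record(profile_buffer, "mma", True, 30)
record(profile_buffer, "mma", False, 70)
durations = scope_durations(profile_buffer)
```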
### Proton Dialect & Lowering
- Add Proton → ProtonGPU → LLVM pipelines, including passes for
shared-memory allocation, profile scratch allocation, and a few
optimizations for reduced overhead or improved accuracy.
### Tracing
- proton.start(..., data="trace", ...) is supported for both fine- and
coarse-grained events.
---------
Co-authored-by: Yuanwei Fang <fywkevin@gmail.com>
Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>
Co-authored-by: peterbell10 <peterbell10@openai.com>
Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com>
Co-authored-by: Corbin Robeck <robeck@meta.com>
Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com>
Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com>
Co-authored-by: Shawn Zhong <github@shawnzhong.com>
Co-authored-by: Shawn Zhong <shawnzhong@fb.com>
Co-authored-by: 鐘天楽 <a844379248@icloud.com>
[PROTON] Actively initialize main thread context (#7884)
Initialize mainContextStack in the ShadowContextSource constructor and
mark the thread context as the main for the current session. This
guarantees a single context even when the actual main thread doesn’t
encounter any scoped regions.
Previously, if the main thread never entered a scope, the first worker
thread to encounter a scope could “win” the race to create the main
context stack. That led to an incorrect context tree and unstable
ordering.
Before (incorrect):
threadA_0
- threadB_0
- threadB_1
threadA_0
threadA_1
After (correct):
threadB_0
threadB_1
threadA_0
threadA_1
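The race and its fix can be illustrated with a small Python analogy.
This is a sketch of the idea only, not the C++ implementation: the
`LazySource` and `EagerSource` classes and their `enter_scope` method are
hypothetical stand-ins for the behavior before and after the change.

```python
import threading


class LazySource:
    """Before: the first thread to enter a scope becomes the main context."""

    def __init__(self):
        self.main_thread_id = None
        self._lock = threading.Lock()

    def enter_scope(self):
        with self._lock:
            if self.main_thread_id is None:
                self.main_thread_id = threading.get_ident()


class EagerSource:
    """After: the constructing thread is marked as main up front."""

    def __init__(self):
        self.main_thread_id = threading.get_ident()

    def enter_scope(self):
        pass  # The main context is already fixed.


lazy, eager = LazySource(), EagerSource()


def worker():
    # Only the worker enters a scope; the main thread never does.
    lazy.enter_scope()
    eager.enter_scope()


t = threading.Thread(target=worker)
t.start()
t.join()

lazy_is_main = lazy.main_thread_id == threading.get_ident()
eager_is_main = eager.main_thread_id == threading.get_ident()
```

With lazy initialization the worker thread ends up owning the main
context, while eager initialization keeps ownership with the thread that
constructed the source regardless of which thread enters a scope first.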