File / Last commit / Last updated
[Proton][AMD] Fix peak TB/s and support gfx950 specs (#7175)

The formula 2 * bus_width * memory_clock_rate * 1e3 / 8 cannot derive the correct peak TB/s on AMD devices, where the calculation is more involved. For now, the TB/s value is hardcoded to give the correct result and unblock gfx950 support.

11 months ago
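The formula quoted in this commit message can be written out as a small helper to make the arithmetic concrete. This is only a sketch under stated assumptions: the commit does not give units, so the code assumes `bus_width` is in bits and `memory_clock_rate` is in kHz, with the factor of 2 standing for double data rate. The function name and the sample values are hypothetical and are not taken from Proton.

```python
def peak_bandwidth_tbps(bus_width_bits: int, memory_clock_khz: float) -> float:
    """Estimate peak memory bandwidth in TB/s.

    Assumed units (not stated in the commit message):
      bus_width_bits   -- memory bus width in bits
      memory_clock_khz -- memory clock rate in kHz
    """
    # 2 transfers per clock, kHz -> Hz via 1e3, bits -> bytes via / 8.
    bytes_per_second = 2 * bus_width_bits * memory_clock_khz * 1e3 / 8
    return bytes_per_second / 1e12  # bytes/s -> TB/s


# Illustrative values only: a 5120-bit bus clocked at 2,619,000 kHz.
example = peak_bandwidth_tbps(bus_width_bits=5120, memory_clock_khz=2_619_000)
print(f"{example:.3f} TB/s")  # prints "3.352 TB/s"
```

As the commit notes, this estimate does not hold on AMD devices, which is why the value is hardcoded there instead.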
[PROTON] Intra kernel profiling (#7258)

### Instrumentation & Runtime
- Introduce a dedicated instrumentation mode
  - proton.start(..., mode="instrumentation", ...)
- Introduce both high- and low-level scope APIs
  - For Gluon DSL: pl.scope, pl.enter_scope, and pl.exit_scope. Profiling API for Triton DSL is disabled by default.
  - For TTGIR: proton.record start and proton.record end
- Inject profiling buffers for each Triton kernel at codegen time and pass them to the Proton runtime so kernels can push data directly from the device to the host

### Proton Dialect & Lowering
- Add Proton → ProtonGPU → LLVM pipelines, including passes for shared-memory allocation, profile scratch allocation, and a few optimizations for reduced overhead or improved accuracy.

### Tracing
- proton.start(..., data="trace", ...) is supported for both fine- and coarse-grained events.

---------

Co-authored-by: Yuanwei Fang <fywkevin@gmail.com>
Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>
Co-authored-by: peterbell10 <peterbell10@openai.com>
Co-authored-by: Corbin Robeck <corbin.robeck@gmail.com>
Co-authored-by: Corbin Robeck <robeck@meta.com>
Co-authored-by: robeck <robeck@devgpu284.prn2.facebook.com>
Co-authored-by: Srivatsan Ramesh <srivatsan-ramesh@users.noreply.github.com>
Co-authored-by: Shawn Zhong <github@shawnzhong.com>
Co-authored-by: Shawn Zhong <shawnzhong@fb.com>
Co-authored-by: 鐘天楽 <a844379248@icloud.com>

9 months ago
[PROTON] Use context vars for hook name and id (#7858)

9 months ago
[PROTON] Actively initialize main thread context (#7884)

Initialize mainContextStack in the ShadowContextSource constructor and mark the thread context as the main one for the current session. This guarantees a single context even when the actual main thread doesn't encounter any scoped regions.

Previously, if the main thread never entered a scope, the first worker thread to encounter a scope could "win" the race to create the main context stack. That led to an incorrect context tree and unstable ordering.

Before (incorrect):
threadA_0
- threadB_0
- threadB_1
threadA_0
threadA_1

After (correct):
threadB_0
threadB_1
threadA_0
threadA_1

9 months ago
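The difference between lazy and eager initialization described in this commit can be illustrated with a small Python sketch. It is not Proton's implementation (ShadowContextSource is part of Proton's native runtime); the class names `LazyContextSource` and `EagerContextSource` and the `enter_scope` method are hypothetical stand-ins that only model which thread ends up owning the "main" context.

```python
import threading


class LazyContextSource:
    """Hypothetical model: the first thread to enter a scope becomes 'main'."""

    def __init__(self):
        self._lock = threading.Lock()
        self.main_thread_id = None

    def enter_scope(self, name):
        with self._lock:
            # Whoever gets here first claims the main context.
            if self.main_thread_id is None:
                self.main_thread_id = threading.get_ident()


class EagerContextSource(LazyContextSource):
    """Hypothetical model: the constructing thread is marked 'main' up front."""

    def __init__(self):
        super().__init__()
        self.main_thread_id = threading.get_ident()


def run_worker(source):
    # The creating thread never enters a scope; only a worker thread does.
    worker = threading.Thread(target=source.enter_scope, args=("threadB_0",))
    worker.start()
    worker.join()
    return source.main_thread_id


creator_id = threading.get_ident()
lazy_main = run_worker(LazyContextSource())
eager_main = run_worker(EagerContextSource())

print("lazy: main is creator?", lazy_main == creator_id)
print("eager: main is creator?", eager_main == creator_id)
```

In the lazy model the worker claims the main context because the creating thread never enters a scope; in the eager model the creating thread is already registered, so the worker's scope cannot displace it.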