Fix histograms for complex replicated layouts (#7938)
The current histogram code assumes that replication across a warp is
laid out so that the first n threads hold the unique data. This is not
a valid assumption. In fact, the function it calls to get this layout,
getThreadsPerWarp, documents one such layout and how it is returned, so
the histogram code discards that information. To fix this, we remove
the uniquing code that masks out threads holding duplicate data.
Instead, every thread participates, and we correct the resulting
overcounting by computing the "replication factor" and dividing by it.
This is much easier than computing the correct mask, which is
nontrivial in the general case.
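As a rough illustration of the idea (not the actual lowering code), the
sketch below models one warp on the host. The function name
`histogramWithReplication`, the per-thread vector, and the assumption that
every element is replicated the same number of times are all hypothetical
choices made for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of one warp: perThreadValue[t] is the bin
// index held by thread t. Assumption: every logical element appears exactly
// `replication` times across the warp, but not necessarily in the first
// threads, so a "first n threads" mask would be wrong.
std::vector<int> histogramWithReplication(const std::vector<int> &perThreadValue,
                                          int numBins, int replication) {
  assert(replication > 0 && "replication factor must be positive");
  std::vector<int> bins(numBins, 0);

  // Every thread participates, so each element is counted `replication` times.
  for (int value : perThreadValue) {
    assert(value >= 0 && value < numBins && "bin index out of range");
    ++bins[value];
  }

  // Undo the overcounting instead of computing a mask of unique threads.
  for (int &count : bins) {
    assert(count % replication == 0 && "uneven replication");
    count /= replication;
  }
  return bins;
}
```

For example, with 8 threads holding `{0, 0, 1, 1, 1, 1, 3, 3}` (four
elements, each replicated twice, with thread t holding element t / 2),
masking to the first 4 threads would count only `{0, 0, 1, 1}`, whereas
counting all threads and dividing by 2 recovers the histogram of the four
unique elements.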
Audit file-private C++ functions to ensure correct linkage (#6237)
### Summary
I’ve gone through the C++ sources and updated functions that were
file-local but not declared static or placed in anonymous namespaces.
These can lead to symbol visibility issues when linking multiple
translation units together. This patch:
* Marks helper functions in bin/triton-tensor-layout.cpp as static.
* Marks internal variables and helpers in
lib/Instrumentation/PrintLoadStoreMemSpaces.cpp as static.
* Adds static to a number of internal free functions and templates in
third_party/f2reduce/f2reduce.cpp.
* Corrects linkage for a handful of AMD GPU transforms and stream
utilities that were file-local.
* Makes file-local helpers in
third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp static.
* Ensures initProton in third_party/proton/csrc/Proton.cpp is
static.
* Tidies up similar file-local helpers in test code.
I initially made some functions static that were already declared in
headers; that’s been reverted so things compile cleanly.
All C++ unit tests and lit tests pass, and `pre-commit run --from-ref
origin/main --to-ref HEAD` shows no outstanding issues.
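
For readers unfamiliar with the distinction, here is a minimal sketch of the
two ways to give a file-local helper internal linkage. All names
(`clampToByte`, `Formatter`, `describeByte`) are hypothetical and do not
come from the files touched by this patch.

```cpp
#include <string>

// Internal linkage via `static`: this helper is invisible to other
// translation units, so it cannot clash with a same-named function elsewhere.
static int clampToByte(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

namespace {
// Internal linkage via an anonymous namespace; unlike `static`, this also
// works for types.
struct Formatter {
  std::string wrap(const std::string &s) const { return "[" + s + "]"; }
};
} // namespace

// External linkage: the only symbol this translation unit is meant to export.
std::string describeByte(int v) {
  return Formatter{}.wrap(std::to_string(clampToByte(v)));
}
```

A function that is declared in a header and used from other files must keep
external linkage, like `describeByte` here; marking it static would break
those callers.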
### New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because this is a refactor that only
adjusts linkage of file-local functions.
- [x] I have not added any lit tests.
All tests (make test-nogpu) pass:
```
Testing Time: 1.07s
Total Discovered Tests: 131
Passed: 131 (100.00%)
100% tests passed, 0 tests failed out of 207
Total Test time (real) = 1.48 sec
```
[NVIDIA] Replace some NVGPU ops with equivalent NVVM ops (part 2) (#7471)
This change replaces some NVGPU ops with the corresponding NVVM ops. It
aligns with previous discussions in PR #7420.
For some ops, such as NVGPU::FenceAsyncSharedOp, there is no
corresponding intrinsic, so LLVM will also generate PTX for them.
However, in the long run, I think it is better to hand responsibility
for code generation over to LLVM instead of hard-coding PTX at the
NVGPU layer.
The ConvertNVVMToLLVMPass has been added to the pipeline and build
system so that NVVM ops are correctly lowered to LLVM IR.