
f2reduce: an MIT-licenced library for Gaussian elimination over GF(2)

This is a very lightweight implementation for converting a binary matrix to reduced row echelon form. It incorporates the following optimisations:

  • Kronrod's algorithm ('method of four Russians');
  • Designed to properly autovectorise in both GCC and LLVM;
  • Attempts to ensure that memory loads/stores are cache-aligned;
  • Designed to achieve high instruction-level parallelism;
  • Able to use AVX512's vpternlogq instruction if present;
  • Minimal memory overhead (a few megabytes).

There are no architecture-specific intrinsics or assembly, so this should work well on any architecture where the compiler can autovectorise.

For simplicity, we do not use Strassen's algorithm, so our performance is overtaken by M4RI whenever the matrices are large and have full column rank.

For all other cases, we have several advantages over M4RI:

  • Substantially better performance on small, wide, or low-rank matrices;
  • MIT-licenced rather than GPL-licenced;
  • No assumptions about the processor architecture;
  • No configuration required (-O3 -march=native is enough).

We expose a single function with the following signature:

void inplace_rref_strided(uint64_t *matrix, uint64_t rows, uint64_t cols, uint64_t stride);

The matrix should be in row-major format and is overwritten in-place. The stride parameter specifies the offset between adjacent rows in 64-bit words, not bytes. The mapping between matrix entries and memory is as follows:

the (j+64*k)th entry of the ith row is (matrix[i * stride + k] >> j) & 1
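
The mapping above can be wrapped in a pair of small accessors. This is only a sketch: get_entry and set_entry are hypothetical helper names introduced here for illustration and are not part of f2reduce. They assume the column index is split as col = j + 64*k with 0 <= j < 64.

```cpp
#include <cstdint>

// Hypothetical helpers (not part of f2reduce) that follow the layout
// described above: column `col` of row `row` lives in bit (col % 64)
// of the word matrix[row * stride + col / 64].

// Read the entry at (row, col) as 0 or 1.
inline uint64_t get_entry(const uint64_t *matrix, uint64_t stride,
                          uint64_t row, uint64_t col) {
    return (matrix[row * stride + (col >> 6)] >> (col & 63)) & 1;
}

// Write the entry at (row, col); any nonzero value is stored as 1.
inline void set_entry(uint64_t *matrix, uint64_t stride,
                      uint64_t row, uint64_t col, uint64_t value) {
    uint64_t &word = matrix[row * stride + (col >> 6)];
    const uint64_t mask = uint64_t(1) << (col & 63);
    word = value ? (word | mask) : (word & ~mask);
}
```

For example, with a stride of 2 words, entry 70 of row 1 is bit 6 of matrix[3].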

Since the performance can depend on the stride and how it interacts with processor caches, we expose another function to return a recommended stride:

uint64_t get_recommended_stride(uint64_t cols);

Although f2reduce is compiled as C++11, the resulting static library has C linkage, so it can be called from any C or C++ code.
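
To make the calling convention concrete, here is a naive reference with the same parameters as inplace_rref_strided. It is a sketch for cross-checking small inputs, not the algorithm f2reduce uses: it is plain Gauss-Jordan elimination with none of the optimisations listed earlier. The name naive_rref_strided is hypothetical, and the choice to leave any padding words or bits beyond cols untouched is an assumption made here, not documented library behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Naive reference (NOT f2reduce's algorithm): Gauss-Jordan elimination
// over GF(2) on a bit-packed, row-major matrix whose rows are `stride`
// 64-bit words apart. Hypothetical name, for illustration only.
void naive_rref_strided(uint64_t *matrix, uint64_t rows, uint64_t cols,
                        uint64_t stride) {
    const uint64_t words = (cols + 63) / 64;  // words that hold real columns
    assert(stride >= words);                  // each row must fit in its stride

    uint64_t pivot_row = 0;
    for (uint64_t col = 0; col < cols && pivot_row < rows; ++col) {
        const uint64_t k = col >> 6;
        const uint64_t mask = uint64_t(1) << (col & 63);

        // Find a row at or below pivot_row with a 1 in this column.
        uint64_t r = pivot_row;
        while (r < rows && !(matrix[r * stride + k] & mask)) ++r;
        if (r == rows) continue;  // no pivot in this column

        // Move that row into the pivot position.
        if (r != pivot_row) {
            for (uint64_t w = 0; w < words; ++w)
                std::swap(matrix[r * stride + w],
                          matrix[pivot_row * stride + w]);
        }

        // Addition over GF(2) is XOR: clear this column in every other row.
        for (uint64_t i = 0; i < rows; ++i) {
            if (i != pivot_row && (matrix[i * stride + k] & mask)) {
                for (uint64_t w = 0; w < words; ++w)
                    matrix[i * stride + w] ^= matrix[pivot_row * stride + w];
            }
        }
        ++pivot_row;
    }
}
```

On the 3x3 matrix with rows 110, 101, 011 (entry 0 written first), this leaves rows 101, 011, 000. Because the reduced row echelon form of a matrix is unique, a result from the library on the same input should agree on the first cols entries of each row.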

Dependencies

f2reduce has no dependencies; just compile f2reduce.cpp with the -O3 -march=native flags to produce a static library and include the header file f2reduce.h in your project.

The automated test suite has dependencies on M4RI (for benchmarking timings against M4RI and checking that implementations agree), GoogleTest (for unit testing), and cpads (for high-quality pseudo-random number generation). Downloading of the dependencies and building of the test suite is automated by CMake.

To build the test suite, you need to manually append add_subdirectory(test) to the end of the CMakeLists.txt file. This is so that f2reduce does not have any build dependencies by default.