| [LAYOUTS] Implement generalized swizzling for convert_layout (#7565)
We generalize the swizzling algorithm to consider the instructions
ldmatrix/stmatrix and their transpose versions.
To do this, we now require a dedicated allocator for NVIDIA, as the
shared memory required for a convert_layout now depends on the
instructions we can emit.
After cleaning up the stmatrix path from the common convert_layout
lowering, it became clear that we always take the swizzling path. I
changed the allocator to reflect this, and I had to update many tests
that used it and no longer require padding.
We also implement an improved lowering for the indexing of
ldmatrix/stmatrix, following the optimisations from
ld.shared/st.shared. | 9 months ago |