File | Last commit | Last updated
[AMD] Rewrite extract_slice op implementation (#7128)

This PR refactors the extract_slice operation to support two major improvements:

1) Relaxed Layout Constraints
   The operation now allows more flexible source and destination layouts, aligning better with linear layouts.
2) Support for Arbitrary Tensor Ranks
   extract_slice is no longer limited to 2D tensors and can now handle tensors of any rank.

The "extract_slice" operation enables extracting a slice of a tensor in registers. It supports the following arguments:

* source: the base tensor on which to create a view tensor
* offsets: offsets into the base tensor at which to create the view

In distributed layouts, tensors are divided into CTA tiles. A CTA tile represents the smallest contiguous portion of a tensor that is distributed across all threads and warps within a workgroup. The ExtractSlice operation extracts a portion of the tensor that aligns with CTA tile boundaries.

This op is designed to work on logical tensors directly, avoiding the need for complex layout reinterpretation or reshaping. For example, the tt.split operation only supports splitting along the innermost dimension, and requires that the resulting innermost dimension provide 2 elements per thread, distributed across registers. In contrast, the extract_slice op imposes no constraints on the extraction dimension or the size of dimensions.

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>

11 months ago
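The CTA-tile alignment described in the extract_slice commit message above can be illustrated with a small, rank-agnostic sketch. This is not the Triton implementation: the helper name `selected_cta_tiles`, its parameters, and the exact alignment rule (offsets and result extents must be whole multiples of the CTA tile shape) are assumptions made for illustration only.

```python
from itertools import product


def selected_cta_tiles(src_shape, tile_shape, offsets, dst_shape):
    """Return the coordinates of the CTA tiles covered by a slice.

    Hypothetical helper: models only the statement that the extracted
    portion must align with CTA tile boundaries. Works for any rank.
    """
    rank = len(src_shape)
    if not (len(tile_shape) == len(offsets) == len(dst_shape) == rank):
        raise ValueError("all arguments must have the same rank")

    ranges = []
    for d in range(rank):
        # Assumed rule: offset and extent are whole multiples of the tile size.
        if offsets[d] % tile_shape[d] or dst_shape[d] % tile_shape[d]:
            raise ValueError(f"dimension {d} is not aligned to the CTA tile")
        if offsets[d] + dst_shape[d] > src_shape[d]:
            raise ValueError(f"dimension {d} slice exceeds the source tensor")
        first = offsets[d] // tile_shape[d]
        count = dst_shape[d] // tile_shape[d]
        ranges.append(range(first, first + count))

    # Cartesian product of the per-dimension tile indices.
    return list(product(*ranges))


# A 128x128 tensor with an assumed 32x64 CTA tile; slice 64x64 at offset (64, 0).
print(selected_cta_tiles((128, 128), (32, 64), (64, 0), (64, 64)))
```

Because the helper loops over dimensions generically, the same check applies to a 3D or higher-rank tensor, which mirrors the "arbitrary rank" point of the commit without claiming anything about how the lowering actually selects registers.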
[BACKEND] Localize the use and definition of getShapePerCTATile in the AMD backend and aim for elimination (#7740) 9 months ago
[BACKEND] Localize the use and definition of getShapePerCTATile in the AMD backend and aim for elimination (#7740) 9 months ago
[AMD] NFC: simplify pass/pattern constructor declaration (#7665) 10 months ago
[AMD] Add a Concat op to AMDGPU dialect (#6590)

The "concat" operation combines a list of source n-dimensional tensors into a single larger destination tensor.

All source tensors must have the same shape, element type, and encoding. The concatenation dimension is inferred from the source and destination shapes provided by the user. For example, two tensors of shape 64x128 can produce a destination shape of 128x128, indicating concatenation along dimension 0; or 64x256, indicating concatenation along dimension 1.

Generally, source tensors passed as op arguments can be arranged into the resulting shape in multiple ways. For example, given four tensors of shape 64x64:

    concat s0<64x64>, s1<64x64>, s2<64x64>, s3<64x64> -> <128x128>

They can be laid out in different configurations within the result tensor:

    1) s0 s1     2) s0 s2
       s2 s3        s1 s3

From a logical tensor perspective, the source tensors are treated as elements of a tensor of tensors. In other words, the 1-D array of input tensors is conceptually reshaped into an n-D grid. The semantics of this op assume a row-major order (or its n-D generalization), meaning the fastest-varying dimension is filled first, and the slowest-varying dimension is filled last. In the example above, this corresponds to layout 1).

The source and destination tensors must have identical linear layouts at the CTA tile level. That is, all base vectors for input dimensions must match, except for the register input dimension. The register basis must align on the subset that defines the logical tensor shape of a single CTA tile.

This ensures that the concatenation is a no-op, meaning no data rearrangement among threads is required to assemble the destination tensor with the given shape and layout. However, the order of CTA tiles within the layout does not need to match between source and destination layouts. It is the responsibility of the op's lowering logic to handle this correctly.

This op is designed to work on logical tensors directly, avoiding the need for complex layout reinterpretation or reshaping. For example, the tt.join operation only supports concatenation along the innermost dimension, and requires that the resulting innermost dimension provide 2 elements per thread, distributed across registers. In contrast, this concat op imposes no constraints on the concatenation dimension or the size of dimensions.

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>

1 year ago
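The shape inference and row-major placement described in the concat commit message above can be sketched in a few lines. This is a minimal model of the logical semantics only, not the op's verifier or lowering: the function name `concat_placement` and its return format are hypothetical, and layouts, encodings, and CTA tiles are deliberately ignored.

```python
from itertools import product


def concat_placement(src_shape, dst_shape, num_sources):
    """Infer the grid of sources and each source's offset in the result.

    Hypothetical helper: sources are treated as a 1-D list reshaped into an
    n-D grid in row-major order (last dimension varies fastest).
    """
    if len(src_shape) != len(dst_shape):
        raise ValueError("source and destination ranks differ")

    grid = []
    for s, d in zip(src_shape, dst_shape):
        if d % s:
            raise ValueError("destination extent is not a multiple of the source extent")
        grid.append(d // s)

    total = 1
    for g in grid:
        total *= g
    if total != num_sources:
        raise ValueError("number of sources does not fill the destination")

    # itertools.product varies the last index fastest, i.e. row-major order.
    offsets = [
        tuple(c * s for c, s in zip(coord, src_shape))
        for coord in product(*(range(g) for g in grid))
    ]
    return tuple(grid), offsets


# Four 64x64 sources into 128x128: s0 and s1 fill the first row of the grid.
print(concat_placement((64, 64), (128, 128), 4))
# Two 64x128 sources: the destination shape decides the concatenation dimension.
print(concat_placement((64, 128), (128, 128), 2))
print(concat_placement((64, 128), (64, 256), 2))
```

In the four-source case the sketch places s1 at offset (0, 64) and s2 at (64, 0), which is configuration 1) from the commit message. A grid entry greater than 1 marks a dimension along which sources are concatenated.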
[BACKEND] Localize the use and definition of getShapePerCTATile in the AMD backend and aim for elimination (#7740) 9 months ago
[AMD] Rewrite extract_slice op implementation (#7128) 11 months ago