[mlir] Fix block merging (#97697)

With this PR I am trying to address: https://github.com/llvm/llvm-project/issues/63230.

What changed:
- While merging identical blocks, don't add a block argument if it is "identical" to another block argument, i.e., if the two block arguments refer to the same Value. The operands of the operations in the block will point to the argument we already inserted. This needs to happen for all the arguments we pass to the different successors of the parent block.
- After merging the blocks, get rid of "unnecessary" arguments, i.e., if all the predecessors pass the same block argument, there is no need to pass it as an argument.
- This last simplification clashed with BufferDeallocationSimplification. The reason, I think, is that BufferDeallocationSimplification contains an analysis based on the block structure. If we simplify the block structure (by merging and/or dropping block arguments), that analysis becomes invalid. The solution I found is to do a more prudent simplification when running that pass.

**Note**: this is a rework of #96871. I ran all the integration tests (-DMLIR_INCLUDE_INTEGRATION_TESTS=ON) and they passed.
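The "unnecessary argument" rule from the block-merging change can be illustrated with a small model. This is only a sketch in Python, not the MLIR implementation: blocks are reduced to argument names plus one list of forwarded values per predecessor, the function name `drop_redundant_args` is hypothetical, and dominance of the forwarded value over the block is assumed rather than checked.

```python
def drop_redundant_args(arg_names, incoming):
    """Split block arguments into those that must stay and those that can go.

    arg_names: names of the block's arguments.
    incoming:  one list of forwarded values per predecessor, aligned with
               arg_names.
    Returns (kept_args, replacements), where replacements maps a dropped
    argument to the single value that every predecessor forwards for it.
    """
    kept, replacements = [], {}
    for i, name in enumerate(arg_names):
        passed = {values[i] for values in incoming}
        if len(passed) == 1:
            # Every predecessor forwards the same value, so uses of the
            # argument can refer to that value directly.
            replacements[name] = passed.pop()
        else:
            kept.append(name)
    return kept, replacements


# Two predecessors: both forward %x for %a, but different values for %b.
kept, replacements = drop_redundant_args(
    ["%a", "%b"],
    [["%x", "%y"], ["%x", "%z"]],
)
```

In this model `%a` is dropped and its uses are rewired to `%x`, while `%b` has to remain a block argument.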
[mlir][bufferization] Add BufferOriginAnalysis (#86461)

This commit adds the BufferOriginAnalysis, which can be queried to check if two buffer SSA values originate from the same allocation. This new analysis is used in the buffer deallocation pass to fold away or simplify bufferization.dealloc ops more aggressively.

The BufferOriginAnalysis is based on the BufferViewFlowAnalysis, which collects buffer SSA value "same buffer" dependencies. E.g., given IR such as:

    %0 = memref.alloc()
    %1 = memref.subview %0
    %2 = memref.subview %1

The BufferViewFlowAnalysis will report the following "reverse" dependencies (resolveReverse) for %2: {%2, %1, %0}. I.e., all buffer SSA values in the reverse use-def chain that originate from the same allocation as %2.

The BufferOriginAnalysis is built on top of that. It handles only simple cases at the moment and may conservatively return "unknown" around certain IR with branches, memref globals and function arguments.

This analysis enables additional simplifications during -buffer-deallocation-simplification. In particular, "regular" scf.for loop nests, that yield buffers (or reallocations thereof) in the same order as they appear in the iter_args, are now handled much more efficiently. Such IR patterns are generated by the sparse compiler.
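The reverse-dependency query and the "same origin" check built on top of it can be sketched as follows. This is a toy Python model, not the C++ analysis: the names `resolve_reverse` and `same_origin`, the `view_of` map, and the rule "answer unknown unless each value traces back to exactly one known allocation" are all assumptions made for illustration.

```python
def resolve_reverse(value, view_of):
    """Collect every value on the reverse use-def chain of `value`.

    view_of maps a buffer value to the values it is a view of.
    """
    seen, worklist = set(), [value]
    while worklist:
        v = worklist.pop()
        if v in seen:
            continue
        seen.add(v)
        worklist.extend(view_of.get(v, ()))
    return seen


def same_origin(a, b, view_of, allocs):
    """Return True, False, or None (unknown) for 'same allocation?'."""
    origins_a = resolve_reverse(a, view_of) & allocs
    origins_b = resolve_reverse(b, view_of) & allocs
    if len(origins_a) != 1 or len(origins_b) != 1:
        # E.g. a function argument: no known allocation, stay conservative.
        return None
    return origins_a == origins_b


# %0 = alloc, %1 = subview %0, %2 = subview %1, %3 = another alloc.
view_of = {"%1": ["%0"], "%2": ["%1"]}
allocs = {"%0", "%3"}
deps = resolve_reverse("%2", view_of)
```

Here `deps` is `{%2, %1, %0}`, matching the reverse dependencies listed in the commit message.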
Avoid buffer hoisting from parallel loops (#90735)

This change corrects an invalid behavior in pass --buffer-loop-hoisting. The pass is in charge of extracting buffer allocations (e.g., memref.alloca) from loop regions (e.g., scf.for) when possible. This works OK for loops with sequential execution semantics. However, a buffer allocated in the body of a parallel loop may be accessed concurrently by multiple threads, each storing its own local data in it. Extracting such a buffer from the loop causes all threads to wrongly share the same memory region.

In the following example, dimension 1 of the input tensor is reversed. Dimension 0 is traversed with a parallel loop.

    func.func @f(%input: memref<2x3xf32>) -> memref<2x3xf32> {
      %c0 = index.constant 0
      %c1 = index.constant 1
      %c2 = index.constant 2
      %c3 = index.constant 3
      %output = memref.alloc() : memref<2x3xf32>
      scf.parallel (%index) = (%c0) to (%c2) step (%c1) {
        // Create subviews for working input and output slices
        %input_slice = memref.subview %input[%index, 2][1, 3][1, -1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, -1], offset: ?>>
        %output_slice = memref.subview %output[%index, 0][1, 3][1, 1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>

        // Copy the input slice into this temporary buffer. This intermediate
        // copy is unnecessary, but is used for illustration purposes.
        %temp = memref.alloc() : memref<1x3xf32>
        memref.copy %input_slice, %temp : memref<1x3xf32, strided<[3, -1], offset: ?>> to memref<1x3xf32>

        // Copy temporary buffer into output slice
        memref.copy %temp, %output_slice : memref<1x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>
        scf.reduce
      }
      return %output : memref<2x3xf32>
    }

The patch submitted here prevents %temp = memref.alloc() : memref<1x3xf32> from being hoisted when the containing op is scf.parallel or scf.forall. A new op trait called HasParallelRegion is introduced and assigned to these two ops to indicate that their regions have parallel execution semantics.
@joker-eph @ftynse @nicolasvasilache @sabauma
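The effect of the HasParallelRegion trait on hoisting can be sketched with a toy op tree. This is a Python illustration, not the pass itself: the `Op` class, the `LOOP_OPS` set, and the `hoisted_parent` helper are hypothetical, and other hoisting constraints (such as allocation sizes that depend on loop-local values) are ignored.

```python
from dataclasses import dataclass
from typing import Optional

PARALLEL = "HasParallelRegion"
LOOP_OPS = {"scf.for", "scf.while", "scf.parallel", "scf.forall"}


@dataclass
class Op:
    name: str
    parent: Optional["Op"] = None
    traits: frozenset = frozenset()


def hoisted_parent(alloc):
    """Return the op that would contain `alloc` after hoisting.

    Walk outwards through enclosing loops, but stop at the first loop whose
    region has parallel semantics: the buffer must stay private to each
    iteration of that loop.
    """
    parent = alloc.parent
    while (
        parent is not None
        and parent.name in LOOP_OPS
        and PARALLEL not in parent.traits
    ):
        parent = parent.parent
    return parent


func = Op("func.func")
par = Op("scf.parallel", parent=func, traits=frozenset({PARALLEL}))
seq = Op("scf.for", parent=par)
temp = Op("memref.alloc", parent=seq)
target = hoisted_parent(temp)
```

In this model the allocation leaves the inner `scf.for` but stays inside the `scf.parallel` body, so each parallel iteration keeps its own buffer.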
[mlir][bufferization] Add "bottom-up from terminators" analysis heuristic (#83964)

One-Shot Bufferize currently does not support loops where a yielded value bufferizes to a buffer that is different from the buffer of the region iter_arg. In such a case, the bufferization fails with an error such as:

    Yield operand #0 is not equivalent to the corresponding iter bbArg
    scf.yield %0 : tensor<5xf32>

One common reason for non-equivalent buffers is that an op on the path from the region iter_arg to the terminator bufferizes out-of-place. Ops that are analyzed earlier are more likely to bufferize in-place.

This commit adds a new heuristic that gives preference to ops that are reachable on the reverse SSA use-def chain from a region terminator and are within the parent region of the terminator. This is expected to work better than the existing heuristics for loops where an iter_arg is written to multiple times within a loop, but only one write is fed into the terminator.

Current users of One-Shot Bufferize are not affected by this change. "Bottom-up" is still the default heuristic. Users can switch to the new heuristic manually.

This commit also turns the "fuzzer" pass option into a heuristic, cleaning up the code a bit.
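The ordering idea behind the new heuristic can be sketched on a toy loop body. This is a Python illustration under simplifying assumptions, not the One-Shot Bufferize code: ops are plain names, `producers` is a hypothetical map from an op to the ops defining its operands, and "preference" is modeled simply as being analyzed first.

```python
def analysis_order(ops, producers, terminator):
    """Order ops for analysis, preferring those that feed the terminator.

    ops:        op names of one region in program order (terminator excluded).
    producers:  op name -> names of the ops that define its operands.
    terminator: name of the region terminator.
    """
    preferred, worklist = set(), list(producers.get(terminator, ()))
    while worklist:
        op = worklist.pop()
        if op in preferred:
            continue
        preferred.add(op)
        worklist.extend(producers.get(op, ()))

    bottom_up = list(reversed(ops))
    # Ops outside this region never appear in `ops`, so they are filtered out.
    first = [op for op in bottom_up if op in preferred]
    rest = [op for op in bottom_up if op not in preferred]
    return first + rest


# The iter_arg is written twice, but only write_a is yielded.
ops = ["write_a", "write_b"]
producers = {"write_a": [], "write_b": [], "yield": ["write_a"]}
order = analysis_order(ops, producers, "yield")
```

A plain bottom-up walk would visit `write_b` first. In this sketch `write_a`, the write that reaches the terminator, is analyzed first and is therefore more likely to end up in-place.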
[mlir][bufferization] MaterializeInDestinationOp: Support memref destinations (#68074)

Extend bufferization.materialize_in_destination to support memref destinations. This op can now be used to indicate that a tensor computation should materialize in a given buffer (that may have been allocated by another component/runtime). The op still participates in "empty tensor elimination".

Example:

    func.func @test(%out: memref<10xf32>) {
      %t = tensor.empty() : tensor<10xf32>
      %c = linalg.generic ... outs(%t: tensor<10xf32>) -> tensor<10xf32>
      bufferization.materialize_in_destination %c in restrict writable %out : (tensor<10xf32>, memref<10xf32>) -> ()
      return
    }

After "empty tensor elimination", the above IR can bufferize without an allocation:

    func.func @test(%out: memref<10xf32>) {
      linalg.generic ... outs(%out: memref<10xf32>)
      return
    }

This change also clarifies the meaning of the restrict unit attribute on bufferization.to_tensor ops.
[mlir][bufferization] Remove allow-return-allocs and create-deallocs pass options, remove bufferization.escape attribute (#66619)

This commit removes the deallocation capabilities of one-shot-bufferization. One-shot-bufferization should never deallocate any memrefs as this should be entirely handled by the ownership-based-buffer-deallocation pass going forward. In effect, one-shot-bufferization now always behaves as if allow-return-allocs were true and create-deallocs were false; both options, as well as the escape attribute indicating whether a memref escapes the current region, are removed.

A new allow-return-allocs-from-loops option is added as a temporary workaround for some bufferization limitations.
[MLIR] Generalize expand_shape to take shape as explicit input (#90040)

This patch generalizes tensor.expand_shape and memref.expand_shape to consume the output shape as a list of SSA values. This enables us to implement generic reshape operations with dynamic shapes using collapse_shape/expand_shape pairs.

The output_shape input to expand_shape follows the static/dynamic representation that's also used in tensor.extract_slice.

Differential Revision: https://reviews.llvm.org/D140821

---------

Signed-off-by: Gaurav Shukla<gaurav.shukla@amd.com>
Signed-off-by: Gaurav Shukla <gaurav.shukla@amd.com>
Co-authored-by: Ramiro Leal-Cavazos <ramiroleal050@gmail.com>
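The mixed static/dynamic shape representation and the consistency condition on an expand_shape result can be sketched in a few lines. This is a Python illustration, not MLIR code: the `DYNAMIC` sentinel, `materialize_shape`, and `check_expand` are hypothetical names, and runtime sizes are modeled as plain integers. It also hints at why an explicit output shape helps: if one reassociation group contains several dynamic dimensions, their individual sizes cannot be recovered from the source dimension alone.

```python
from math import prod

DYNAMIC = None  # hypothetical stand-in for a "dynamic dimension" marker


def materialize_shape(static_shape, dynamic_sizes):
    """Fill the dynamic slots of static_shape, in order, from dynamic_sizes."""
    if static_shape.count(DYNAMIC) != len(dynamic_sizes):
        raise ValueError("one runtime size is required per dynamic dimension")
    sizes = iter(dynamic_sizes)
    return [next(sizes) if d is DYNAMIC else d for d in static_shape]


def check_expand(src_shape, reassociation, out_shape):
    """Each source dim must equal the product of its group of output dims."""
    return all(
        prod(out_shape[i] for i in group) == src_dim
        for src_dim, group in zip(src_shape, reassociation)
    )


# Expand a 12x5 source into ?x4x5, with the dynamic size supplied as 3.
out_shape = materialize_shape([DYNAMIC, 4, 5], [3])
ok = check_expand([12, 5], [[0, 1], [2]], out_shape)
```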
[mlir][bufferization] Allow cyclic function graphs without tensors (#68632)

Cyclic function call graphs are generally not supported by One-Shot Bufferize. However, they can be allowed when a function does not have tensor arguments or results. This is because it is then no longer necessary for the callee to be bufferized before the caller.
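One way to picture the relaxed rule is a callee-before-caller ordering that only follows call edges into functions with tensor arguments or results. This Python sketch is an assumption about how such an ordering could be computed, not the One-Shot Bufferize implementation: `bufferization_order`, `calls`, and `has_tensors` are hypothetical names.

```python
def bufferization_order(calls, has_tensors):
    """Order functions so that tensor-carrying callees come before callers.

    calls:       function name -> names of the functions it calls.
    has_tensors: function name -> True if it has tensor arguments or results.
    Raises ValueError for a call cycle that runs through tensor signatures.
    """
    order, state = [], {}

    def visit(fn):
        if state.get(fn) == "done":
            return
        if state.get(fn) == "active":
            raise ValueError("cyclic call graph involving tensors: " + fn)
        state[fn] = "active"
        for callee in calls.get(fn, ()):
            # A callee without tensors in its signature imposes no ordering
            # constraint on its caller, so the edge is skipped.
            if has_tensors[callee]:
                visit(callee)
        state[fn] = "done"
        order.append(fn)

    for fn in has_tensors:
        visit(fn)
    return order


# @a and @b call each other, but neither has tensors in its signature.
calls = {"a": ["b"], "b": ["a"]}
order = bufferization_order(calls, {"a": False, "b": False})
```

With both signatures free of tensors the cycle is accepted. If both functions carried tensors, the same call graph would be rejected in this sketch.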
[mlir][Bufferization] castOrReallocMemRefValue: Use BufferizationOptions (#89175)

This makes it possible to configure both the op used for allocating and the op used for copying memrefs. It also changes the default behavior, because the default allocation in BufferizationOptions creates memref.alloc with alignment = 64, where we used to create memref.alloca without any alignment before.

Fixes:

    // TODO: Use alloc/memcpy callback from BufferizationOptions if called via
    // BufferizableOpInterface impl of ToMemrefOp.
[mlir][Bufferization] Add support for controlled bufferization of alloc_tensor (#70957)

This revision adds support to transform.structured.bufferize_to_allocation to bufferize bufferization.alloc_tensor() ops. This is useful as a means to control the bufferization of tensor.empty ops that have been previously bufferization.empty_tensor_to_alloc_tensor'ed.