Compilation Features
MindIE SD provides a custom backend MindieSDBackend() based on PyTorch's torch.compile compiler, delivering two complementary acceleration capabilities on Ascend chips:
- Pattern Fusion: Uses Pattern Matcher to automatically replace common operator combinations with Ascend fusion operators, reducing kernel launch overhead
- ACLGraph Acceleration: Captures the computation graph as a static execution graph via torch.npu.NPUGraph, skipping dynamic graph scheduling during replay
Both capabilities are uniformly controlled through CompilationConfig.
Note
When this feature is enabled, the first model executions incur compilation overhead (up to 8 compilation attempts by default); subsequent runs generally do not recompile. When benchmarking, exclude the warm-up phase from the measured time.
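
One way to keep compilation out of a measurement is to run a few untimed warm-up calls first. The helper below is a minimal sketch, not part of MindIE SD: `benchmark`, `warmup`, and `iters` are illustrative names, and `step` stands for any zero-argument callable that runs the compiled model once. On an asynchronous device the callable would also need to synchronize before returning, otherwise the timer mostly measures launch time.

```python
import time
from statistics import mean


def benchmark(step, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of `step`, excluding warm-up.

    `step` is a hypothetical zero-argument callable wrapping one model run.
    """
    # Untimed calls: compilation and graph capture are expected to happen here.
    for _ in range(warmup):
        step()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

If the reported time still drops over the first timed iterations, the warm-up count is probably too low for the model in question.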
Basic Usage
Both acceleration capabilities share the same entry point: call torch.compile on the model or its submodule and specify MindieSDBackend().
Compile the entire transformer:
import torch
from mindiesd.compilation import MindieSDBackend

pipe = FluxPipeline.from_pretrained(...)
# Replace the pipeline's transformer with its compiled version
pipe.transformer = torch.compile(pipe.transformer, backend=MindieSDBackend())
Use decorator on a single Module:
@torch.compile(backend=MindieSDBackend())
class FluxSingleTransformerBlock(nn.Module):
Use decorator on a forward function:
class FluxSingleTransformerBlock(nn.Module):
    @torch.compile(backend=MindieSDBackend())
    def forward(...):
Pattern Fusion
MindieSDBackend() has multiple built-in operator fusion patterns. At compile time, they are automatically matched and replaced with Ascend-optimized operators. Each pattern can be individually toggled via CompilationConfig.fusion_patterns:
from mindiesd.compilation import CompilationConfig
CompilationConfig.fusion_patterns.enable_rms_norm = False # Disable RMSNorm fusion
CompilationConfig.fusion_patterns.enable_rope = False # Disable RoPE fusion
CompilationConfig.fusion_patterns.enable_adalayernorm = False # Disable adaLN fusion
CompilationConfig.fusion_patterns.enable_fast_gelu = False # Disable fastGelu fusion
CompilationConfig.fusion_patterns.enable_mul_add = False # Disable Mul+Add fusion
For detailed API documentation of each fusion operator, see the fusion operator section in core_layers.md.
API Reference
MindieSDBackend
from mindiesd.compilation import MindieSDBackend
Pass it as the backend argument to torch.compile. It then applies Pattern Fusion and ACLGraph acceleration as configured in CompilationConfig.
CompilationConfig
from mindiesd.compilation import CompilationConfig
Pattern fusion toggles are controlled via fusion_patterns:
| Option | Default | Description |
|---|---|---|
| fusion_patterns.enable_rms_norm | True | RMSNorm fusion |
| fusion_patterns.enable_rope | True | RoPE fusion |
| fusion_patterns.enable_adalayernorm | True | AdaLayerNorm fusion |
| fusion_patterns.enable_fast_gelu | True | fastGELU fusion |
| fusion_patterns.enable_mul_add | True | Mul+Add fusion |
ACLGraph Acceleration
On top of Pattern Fusion, you can further enable ACLGraph to capture the optimized graph as a static execution plan.
Configuration
aclgraph_only and aclgraph_with_compile are mutually exclusive; if both are enabled, aclgraph_with_compile takes precedence.
| Option | Default | Description |
|---|---|---|
| aclgraph_only | False | ACLGraph only, skip Pattern Fusion |
| aclgraph_with_compile | False | Pattern Fusion first, then capture as ACLGraph |
| enable_freezing | True | Whether to perform constant folding before compilation |
| safe_output_mode | True | Whether to clone output during ACLGraph replay |
| graph_log_url | None | Graph transform log URL for debugging |
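
The precedence between the two ACLGraph flags can be summarized in a few lines. This is only an illustration of the behavior described here, not the backend's actual code: `resolve_mode` and the returned labels are hypothetical, and the "both flags off means Pattern Fusion alone" branch is an assumption inferred from the default values.

```python
def resolve_mode(aclgraph_only=False, aclgraph_with_compile=False):
    """Illustrate which path wins for a given pair of flags (hypothetical helper)."""
    if aclgraph_with_compile:
        # Takes priority even if aclgraph_only is also set.
        return "fusion_then_aclgraph"
    if aclgraph_only:
        return "aclgraph_only"
    # Assumed: with both flags off, only Pattern Fusion is applied.
    return "fusion_only"
```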
Usage Example
Based on the Basic Usage above, configure CompilationConfig before calling:
from mindiesd.compilation import CompilationConfig
CompilationConfig.aclgraph_with_compile = True
# Subsequent calls to torch.compile(..., backend=MindieSDBackend()) will automatically enable it
Variable-Length Input Handling
In scenarios such as audio, input lengths are not fixed and can be adapted via external padding:
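max_len = 512  # Fixed sequence length used for graph capture
model = torch.compile(transformer, backend=MindieSDBackend())
_ = model(torch.randn(max_len, dim, device="npu"))  # Trigger capture at the padded shape
for audio_chunk in chunks:
    actual_len = audio_chunk.shape[0]
    # Zero-pad the length dimension up to max_len; the last dimension is left unchanged
    padded = torch.nn.functional.pad(audio_chunk, (0, 0, 0, max_len - actual_len))
    output = model(padded)[:actual_len]  # Drop the padded rows from the output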
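
The loop above assumes every chunk is at most `max_len` long. A small guard makes that assumption explicit. The sketch below is illustrative (`pad_spec` is not a MindIE SD or PyTorch function) and only computes the padding tuple for a 2-D `(length, dim)` input, in the last-dimension-first order that `torch.nn.functional.pad` expects.

```python
def pad_spec(actual_len, max_len):
    """Return the pad tuple for a (length, dim) tensor padded to max_len rows.

    Tuple order follows torch.nn.functional.pad:
    (dim_left, dim_right, len_before, len_after).
    Hypothetical helper for illustration.
    """
    if actual_len <= 0:
        raise ValueError("chunk is empty")
    if actual_len > max_len:
        # Padding cannot shrink a chunk; split it or capture with a larger max_len.
        raise ValueError(f"chunk length {actual_len} exceeds max_len {max_len}")
    return (0, 0, 0, max_len - actual_len)
```

With this helper the padding call would read `torch.nn.functional.pad(audio_chunk, pad_spec(audio_chunk.shape[0], max_len))`.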
Limitations and Notes
- Environment dependency: Only supported on Ascend NPU environments
- Input shape: Runtime input shape must match the shape at capture time; changes will trigger re-capture
- Dynamic features: Dynamic shapes, dynamic control flow, or conditional branching are not supported
- First-time overhead: The first graph capture incurs a one-time overhead
- graph.update: The graph.update interface is not provided (this interface is used for dynamically injecting attention metadata in LLM scenarios and is not needed for SD scenarios)
- Configuration timing: CompilationConfig must be configured before calling torch.compile()
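
Because a change in input shape triggers re-capture, it can help to track which shapes actually reach the compiled module. The class below is a minimal sketch with hypothetical names (`ShapeTracker`, `observe`); it works on plain shape tuples, so in practice `tuple(tensor.shape)` would be passed in.

```python
class ShapeTracker:
    """Record distinct input shapes to spot unintended re-captures (hypothetical)."""

    def __init__(self):
        self.seen = []

    def observe(self, shape):
        """Return True if `shape` is new, i.e. the call would need a fresh capture."""
        shape = tuple(shape)
        if shape in self.seen:
            return False
        self.seen.append(shape)
        return True
```

The first shape observed always counts as new, since it corresponds to the initial capture. More than one entry in `seen` after warm-up suggests the inputs are not being padded to a single shape.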
Troubleshooting Tips
- Troubleshooting works the same way as for PyTorch's torch.compile. A logging module is defined in mindie_sd_backend.py; once enabled, it shows how the graph changes before and after patterns take effect. Combined with narrowing the region passed to torch.compile, this helps identify the cause of pattern failures.
- Limiting the compile scope also limits the troubleshooting scope.
- For other troubleshooting methods, refer to the PyTorch official documentation.
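
When a fused pattern is suspected, one approach is to enable the patterns one at a time and re-run a check. The sketch below is an illustration under stated assumptions: `find_failing_patterns` and `run_check` are hypothetical, and the demo uses a stand-in object in place of `CompilationConfig.fusion_patterns`. The flag names are the ones listed in the API Reference. In real use `run_check` would have to recompile and run the model, since the configuration must be set before `torch.compile()` is called.

```python
from types import SimpleNamespace

PATTERN_FLAGS = [
    "enable_rms_norm",
    "enable_rope",
    "enable_adalayernorm",
    "enable_fast_gelu",
    "enable_mul_add",
]


def find_failing_patterns(fusion_patterns, run_check):
    """Enable one pattern at a time and return the flags for which run_check() fails.

    `fusion_patterns` is any object exposing the flags as attributes;
    `run_check` is a hypothetical callable returning True on success.
    """
    original = {flag: getattr(fusion_patterns, flag) for flag in PATTERN_FLAGS}
    failing = []
    try:
        for flag in PATTERN_FLAGS:
            # Turn everything off, then switch on only the pattern under test.
            for other in PATTERN_FLAGS:
                setattr(fusion_patterns, other, other == flag)
            if not run_check():
                failing.append(flag)
    finally:
        # Restore the caller's configuration.
        for flag, value in original.items():
            setattr(fusion_patterns, flag, value)
    return failing


# Stand-in for CompilationConfig.fusion_patterns, used only for this demo.
demo = SimpleNamespace(**{flag: True for flag in PATTERN_FLAGS})
# Pretend the check fails whenever RoPE fusion is active.
print(find_failing_patterns(demo, lambda: not demo.enable_rope))
```

Testing patterns in isolation keeps each run simple, but it will miss failures that only appear when two patterns interact.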