ModelZoo-PyTorch/ACL_PyTorch/docs/torchair/图优化/2.迁移适配.md (commit ad2a1da9, created on December 3, 2025)

Basic usage

Import the TorchAir-related libraries
```
import torch
import torch_npu
import torchair
```

Configure the backend for torch.compile

```python
config = torchair.CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)
# Create the model instance
model = MyModel()
# Compile the model
model = torch.compile(model, backend=npu_backend)
# Or compile only some submodules of the model
model.encoder = torch.compile(model.encoder, backend=npu_backend)
model.decoder = torch.compile(model.decoder, backend=npu_backend)
# Or compile only one method of the model
model.forward = torch.compile(model.forward, backend=npu_backend)
```

Hugging Face model example

When migrating from the sample code on a Hugging Face Model Card, do not compile the whole pipeline directly. Compile the model inside the pipeline instead.

Example:

```python
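import torch
import torch_npu
import torchair
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Assumed settings for an Ascend NPU; adjust them to your environment
device = "npu:0"
torch_dtype = torch.float16

config = torchair.CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
# Compile the model itself, not the pipeline
model = torch.compile(model, backend=npu_backend)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # pass the compiled model into the pipeline
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
# Run inference; `sample` is the audio input prepared as in the Model Card
result = pipe(sample)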
```

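Models composed of multiple submodules

Some models are assembled from several submodules, and at inference time they may not go through their own forward function at all. Instead, other methods call the submodules. In that case, use debugging or a code walkthrough to find the submodules that are actually invoked during inference.

Example: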

```python
# Chinese-CLIP model
import torch
import torch_npu  # makes the 'npu' device available
from cn_clip.clip import load_from_name
device = torch.device('npu:0')
model, preprocess = load_from_name('ViT-B-16', device=device)

# Inference entry point
with torch.no_grad():
    logits_per_image, logits_per_text = model.get_similarity(image, text)
```

Printing the model directly shows the following structure:

```
CLIP(
    (visual): VisualTransformer(
        ...
    )
    (bert): BertModel(
        ...
    )
)
```

The API methods are implemented as follows:

```python
...
def encode_image(self, image, mask_ratio=0):
    if isinstance(self.visual, ModifiedResNet):
        # Calls the visual submodule
        return self.visual(image.type(self.dtype))
    return self.visual(image.type(self.dtype), mask_ratio)

def encode_text(self, text):
    ...
    # Calls the bert submodule
    x = self.bert(text, attention_mask=attn_mask)[0].type(self.dtype)
    return x[:, 0, :] @ self.text_projection

def get_similarity(self, image, text):
    # Process the image
    image_features = self.encode_image(image)
    # Process the text
    text_features = self.encode_text(text)

    # normalized features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    return logits_per_image, logits_per_text
```

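The model structure and the implementation of the inference methods above show that the CLIP model actually calls the forward of two submodules, visual and bert. These two submodules are therefore the targets to compile.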

```python
config = torchair.CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)
model, preprocess = load_from_name('ViT-B-16', device=device)
model.visual = torch.compile(model.visual, backend=npu_backend)
model.bert = torch.compile(model.bert, backend=npu_backend)
```
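As a complement to reading the source, forward hooks can confirm which submodules actually run during an inference call. The sketch below is a minimal illustration on CPU: ToyClip and find_called_submodules are hypothetical names, and ToyClip only imitates the shape of the CLIP example (an inference method that bypasses forward), not the real cn_clip implementation.

```python
import torch
from torch import nn


class ToyClip(nn.Module):
    """Hypothetical stand-in for a model whose inference API bypasses forward()."""

    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(8, 4)
        self.bert = nn.Linear(6, 4)
        self.unused_head = nn.Linear(4, 2)  # never used by get_similarity

    def get_similarity(self, image, text):
        image_features = self.visual(image)
        text_features = self.bert(text)
        return image_features @ text_features.t()


def find_called_submodules(model, run_inference):
    """Return the names of direct child modules whose forward ran, in call order."""
    called = []
    handles = []
    for name, module in model.named_children():
        # Bind `name` as a default argument so each hook records its own module
        def hook(mod, args, output, name=name):
            called.append(name)

        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            run_inference()
    finally:
        # Always remove the hooks so the model is left unchanged
        for handle in handles:
            handle.remove()
    return called


model = ToyClip()
called = find_called_submodules(
    model, lambda: model.get_similarity(torch.randn(2, 8), torch.randn(3, 6))
)
print(called)
```

The names reported this way are candidates for torch.compile. For deeply nested models, iterating over named_modules() instead of named_children() would report inner modules as well.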

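Common compilation settings

1. The torchair.CompilerConfig class

The experimental_config attribute of the torchair.CompilerConfig class enables several performance features:

experimental_config.frozen_parameter: in inference scenarios, marks tensors whose addresses do not change during model execution as Parameter type, which shortens graph dispatch time and improves dispatch performance. Recommended setting: True.
experimental_config.tiling_schedule_optimize: in static-shape scenarios, enables Tiling schedule optimization so that Tiling computation runs directly on the Device side, which improves the performance of static-shape models. Recommended setting in static-shape scenarios: True.

Usage example: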

```python
config = torchair.CompilerConfig()
config.experimental_config.frozen_parameter = True
config.experimental_config.tiling_schedule_optimize = True
npu_backend = torchair.get_npu_backend(compiler_config=config)

model = torch.compile(model, backend=npu_backend)
```

2. torch.compile()

A native interface provided by PyTorch. See the torch.compile documentation for details.

```
torch.compile(
    model: Callable[[_InputT], _RetT],
    *,
    fullgraph: bool = False,
    dynamic: Optional[bool] = None,
    backend: Union[str, Callable] = 'inductor',
    disable: bool = False
) -> Callable[[_InputT], _RetT]
```

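Parameter description

| Parameter | Description |
| --- | --- |
| model | The model or function to be compiled into a graph. Required. |
| fullgraph | bool, optional. Whether to optimize the whole graph. False (default): automatically finds the parts that can be optimized and breaks the graph at unsupported operators. True: captures and optimizes the whole graph, and raises an exception if a graph break occurs. |
| dynamic | bool or None, optional. Whether to use dynamic-shape tracing. None (default): automatically detects whether the graph is dynamic. False: runs as a static graph. True: runs as a dynamic graph. |
| backend | Backend selection. The default is "inductor", which Ascend NPU does not currently support. Ascend NPU has a single graph backend, obtained through torchair.get_npu_backend, so this parameter must be set on Ascend NPU. |
| disable | bool, optional. Whether to turn off torch.compile. False (default): torch.compile is enabled. True: torch.compile is disabled and the model runs in single-operator mode. |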

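Recommended settings:

fullgraph: a whole graph performs better than a partially captured one, so True is recommended. If some computation logic cannot be captured, try replacing the operator or moving the code to a different place.
dynamic: a static graph performs better than a dynamic graph, so False is recommended. A static graph requires fixed input shapes, and no dynamic shapes may appear during computation.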
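The behavior of fullgraph can be observed on any machine with a stand-in backend. The sketch below does not use the NPU backend: counting_backend is a hypothetical helper that records each captured graph and runs it unchanged, and torch._dynamo.graph_break() simulates an operation that cannot be captured.

```python
import torch
import torch._dynamo

captured_graphs = []


def counting_backend(gm, example_inputs):
    """Stand-in backend (not the NPU backend): record the FX graph, run it as-is."""
    captured_graphs.append(gm)
    return gm.forward


def fn(x):
    x = x + 1
    torch._dynamo.graph_break()  # simulates an operation that cannot be captured
    return x * 2


# fullgraph=False: the function is split into separate graphs around the break
partial = torch.compile(fn, backend=counting_backend, fullgraph=False)
out = partial(torch.ones(3))
num_graphs = len(captured_graphs)
print("graphs captured with fullgraph=False:", num_graphs)

# fullgraph=True: the same break is reported as an error instead
torch._dynamo.reset()
strict = torch.compile(fn, backend=counting_backend, fullgraph=True)
try:
    strict(torch.ones(3))
    fullgraph_failed = False
except Exception as exc:
    fullgraph_failed = True
    print("fullgraph=True raised:", type(exc).__name__)
```

Running with fullgraph=True first is a practical way to locate graph breaks, since the exception points at the code that could not be captured.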

Usage example:

```python
config = torchair.CompilerConfig()
config.experimental_config.frozen_parameter = True
config.experimental_config.tiling_schedule_optimize = True
npu_backend = torchair.get_npu_backend(compiler_config=config)

model = torch.compile(model, dynamic=False, fullgraph=True, backend=npu_backend)
```
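The fixed-shape requirement of dynamic=False shows up as recompilation when input shapes change. The following sketch counts compilations on CPU with a hypothetical stand-in backend (counting_backend, not the NPU backend). It only illustrates the tracing behavior of torch.compile and says nothing about NPU performance.

```python
import torch
import torch._dynamo

compile_count = 0


def counting_backend(gm, example_inputs):
    """Stand-in backend (not the NPU backend): count compilations, run the graph as-is."""
    global compile_count
    compile_count += 1
    return gm.forward


def fn(x):
    return torch.relu(x) * 2


# dynamic=False: every new input shape triggers a new compilation
static_fn = torch.compile(fn, backend=counting_backend, dynamic=False)
for batch in (2, 3, 4):
    static_fn(torch.randn(batch, 8))
static_compiles = compile_count

# dynamic=True: one graph with symbolic shapes serves all three batch sizes
torch._dynamo.reset()
compile_count = 0
dynamic_fn = torch.compile(fn, backend=counting_backend, dynamic=True)
for batch in (2, 3, 4):
    dynamic_fn(torch.randn(batch, 8))
dynamic_compiles = compile_count

print("dynamic=False compilations:", static_compiles)
print("dynamic=True compilations:", dynamic_compiles)
```

If the input shapes of a model really are fixed, the static setting compiles once and avoids the overhead of symbolic shapes. If they vary, padding inputs to a fixed shape or using dynamic=True avoids repeated compilation.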

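Solutions to graph-capture problems

1. Split the prefill and decode stages

The prefill and decode stages use different FlashAttention operators, so they must be compiled into separate graphs.

Original model code: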

```python
import torch, torch_npu, torchair
class Decoder(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, ...):
        ...
        return ...
```

Write separate prefill and decode methods, and decide in forward which one to call:

```python
import torch, torch_npu, torchair
class Decoder(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, ...):
        if x.size(1) > 1:
            ...
            return self.prefill(x, ...)
        ...
        return self.decode(x, kv_cache, ...)

    def prefill(self, x, ...):
        ...
        return ...

    def decode(self, x, kv_cache, ...):
        ...
        return ...

decoder = Decoder()
# Compile the prefill and decode methods
config = torchair.CompilerConfig()
config.experimental_config.frozen_parameter = True
config.experimental_config.tiling_schedule_optimize = True
npu_backend = torchair.get_npu_backend(compiler_config=config)
decoder.prefill = torch.compile(decoder.prefill, dynamic=False, fullgraph=True, backend=npu_backend)
# The decode stage usually runs autoregressively with varying inputs, so set dynamic=True
decoder.decode = torch.compile(decoder.decode, dynamic=True, fullgraph=True, backend=npu_backend)
```
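To make the dispatch rule concrete, here is a small runnable sketch on CPU. ToyDecoder is hypothetical and heavily simplified: a single linear layer stands in for attention, the "cache" is just the concatenated hidden states, and the compile step is left out because it needs the NPU backend. Only the routing by x.size(1) follows the pattern above.

```python
import torch
from torch import nn


class ToyDecoder(nn.Module):
    """Hypothetical decoder; only the prefill/decode dispatch mirrors the pattern above."""

    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.calls = []  # records which path each forward call took

    def forward(self, x, kv_cache=None):
        # More than one token in the sequence dimension means the prompt stage
        if x.size(1) > 1:
            return self.prefill(x)
        return self.decode(x, kv_cache)

    def prefill(self, x):
        self.calls.append("prefill")
        hidden = self.proj(x)
        # Return the last position as the next input, and all positions as the cache
        return hidden[:, -1:, :], hidden

    def decode(self, x, kv_cache):
        self.calls.append("decode")
        hidden = self.proj(x)
        # The cache grows by one position per step, so its shape is not fixed
        kv_cache = torch.cat([kv_cache, hidden], dim=1)
        return hidden, kv_cache


decoder = ToyDecoder()
with torch.no_grad():
    out, cache = decoder(torch.randn(1, 5, 16))  # 5-token prompt: prefill
    for _ in range(3):
        out, cache = decoder(out, cache)  # one token at a time: decode

print(decoder.calls)
print(tuple(cache.shape))
```

The growing cache is the reason the decode path sees changing shapes. Note that under this rule a one-token prompt would be routed to decode without a cache, so a real implementation may need an explicit stage flag instead of relying on the sequence length alone.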

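2. Move unsupported operators out of the compiled region

Some data types and operations cannot be captured by TorchAir. Such code can be moved out of the function being optimized. For example, speech models often apply torch.stft() to the audio input to compute a Fourier transform. TorchAir does not support this operation, so compilation fails. Moving that code elsewhere lets the main body of the function be compiled with fullgraph=True.

Example:

Before: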

```python
class Model(torch.nn.Module):
    ...
    def decode(self, x):
        spec = torch.stft(x, N_FFT, HOP_LENGTH, window=window, return_complex=True)
        ...
        return ...

    def forward(self, x):
        ...
        out = self.decode(x)
        ...
        return ...
```

After:

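```python
class Model(torch.nn.Module):
    ...
    def decode(self, x, spec):
        # spec is now computed in forward and passed in:
        # spec = torch.stft(x, N_FFT, HOP_LENGTH, window=window, return_complex=True)
        ...
        return ...

    def forward(self, x):
        # The stft call has been moved out of decode
        spec = torch.stft(x, N_FFT, HOP_LENGTH, window=window, return_complex=True)
        stft_real_img = torch.view_as_real(spec)  # the NPU does not support complex numbers
        out = self.decode(x, stft_real_img)
        ...
        return ...

model = Model()
# npu_backend is obtained from torchair.get_npu_backend, as in the earlier examples
model.decode = torch.compile(model.decode, fullgraph=True, backend=npu_backend)
```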
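The restructuring can be checked on CPU before moving to the NPU. In the sketch below, ToySpeechModel, the N_FFT and HOP_LENGTH values, and the power-spectrogram computation inside decode are all illustrative assumptions. The point is only that decode receives a real-valued tensor with a trailing dimension of size 2 (real and imaginary parts) and never touches complex numbers or torch.stft. The compile step is omitted because it needs the NPU backend.

```python
import torch
from torch import nn

N_FFT, HOP_LENGTH = 400, 160  # illustrative values, not taken from a specific model


class ToySpeechModel(nn.Module):
    """Hypothetical model: stft runs in forward, decode only sees real tensors."""

    def __init__(self):
        super().__init__()
        self.register_buffer("window", torch.hann_window(N_FFT))

    def decode(self, spec_real):
        # spec_real has shape (..., 2): real part at index 0, imaginary part at index 1
        real, imag = spec_real[..., 0], spec_real[..., 1]
        return real.pow(2) + imag.pow(2)  # power spectrogram, no complex dtype needed

    def forward(self, x):
        # Unsupported operation stays outside the function that would be compiled
        spec = torch.stft(x, N_FFT, HOP_LENGTH, window=self.window, return_complex=True)
        spec_real = torch.view_as_real(spec)  # complex -> real with a trailing dim of 2
        return self.decode(spec_real)


model = ToySpeechModel()
audio = torch.randn(2, 1600)
power = model(audio)

# Reference computed directly on the complex spectrogram
reference = torch.stft(
    audio, N_FFT, HOP_LENGTH, window=model.window, return_complex=True
).abs() ** 2
print(tuple(power.shape))
```

Comparing against the complex-valued reference is a cheap way to confirm that rewriting decode in terms of real and imaginary parts did not change the result.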

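3. Register and implement an operator converter

Once an operator is registered with PyTorch, it can be called as a custom operator in Eager mode but not in graph mode. A custom operator must also be registered with TorchAir, with a corresponding converter function that converts ATen IR to GE IR, so that the operator can enter the graph on the NPU. For the detailed steps, see: converter completion (converter补齐)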