
Stable Diffusion Model Inference Guide

Overview
- Input and output data
Inference environment preparation
Quick start
Model inference performance & accuracy

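Overview

Stable Diffusion is a text-to-image diffusion model that can generate photorealistic images from any text input. For more information about how Stable Diffusion works, see the Stable Diffusion blog.

Reference implementations: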

 # StableDiffusion v1.5
 https://huggingface.co/runwayml/stable-diffusion-v1-5

 # StableDiffusion v2.1
 https://huggingface.co/stabilityai/stable-diffusion-2-1-base

Input and output data

Input data	Size	Data type	Data layout
input	1 x 77	FLOAT32	ND

Output data	Size	Data type	Data layout
output1	1 x 512 x 512 x 3	FLOAT32	NHWC

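Inference environment preparation

This model requires the following components and drivers.

Table 1 Version compatibility

Component	Version	Environment setup guide
Firmware and driver	24.1.RC1	PyTorch framework inference environment preparation
CANN (+MindIE-RT)	8.0.RC1 (1.0.RC1)	-
Python	3.10	-

If the --FA_soc, --TOME_num, or --faster_gelu parameters are used when optimizing the model, install the MindIE version that matches the CANN package.

Model performance depends on the CPU specification. Setting the CPU to performance mode is recommended for best performance.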
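
One way to check the CPU mode on Linux is to read the cpufreq scaling governor from sysfs. The sketch below assumes that "performance mode" corresponds to the `performance` governor and that the kernel exposes `/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. On some servers the mode is set in the BIOS instead, in which case this check does not apply. The helper names are illustrative and are not part of this repository.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def cpus_not_in_performance_mode(governors):
    """List the CPUs whose governor is anything other than 'performance'."""
    return [cpu for cpu, gov in governors.items() if gov != "performance"]


if __name__ == "__main__":
    govs = read_governors()
    if not govs:
        print("No cpufreq information found; check the BIOS power settings instead.")
    else:
        slow = cpus_not_in_performance_mode(govs)
        if slow:
            print(f"Not in performance mode: {slow}")
        else:
            print("All CPUs use the performance governor.")
```

Taking the sysfs root as a parameter keeps the check easy to try against a copy of the directory tree.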

Quick start

Get the source code

Clone this repository:

git clone https://gitee.com/ascend/ModelZoo-PyTorch.git
cd ModelZoo-PyTorch/ACL_PyTorch/built-in/foundation_models/stable_diffusion

Install the dependencies.
```
pip3 install -r requirements.txt
```
Code modification

Run the command:
```
python3 stable_diffusion_clip_patch.py
```
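Install the Ascend inference tools
1. Visit the ais_bench inference tool page and install the tool by following its README.
2. Visit the msit repository and install the debug surgeon tool by following its README.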

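Prepare the dataset

Get the original dataset.

This model generates images from input text, so no dataset is required.

Model inference

Model conversion. Use PyTorch to convert the model weight file (.pth) to an .onnx file, then use the ATC tool to convert the .onnx file to an offline inference model (.om) file.

Get the weights (optional)

Downloading the weights in advance avoids possible download failures in later steps.

# Requires git-lfs (https://git-lfs.com)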
git lfs install

# v1.5
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5

# v2.1
git clone https://huggingface.co/stabilityai/stable-diffusion-2-1-base

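Export the ONNX models

Set the model name or path

# v1.5 (weights are downloaded at run time)
model_base="runwayml/stable-diffusion-v1-5"

# v1.5 (use the weights downloaded in the previous step)
model_base="./stable-diffusion-v1-5"

# v2.1 (weights are downloaded at run time)
model_base="stabilityai/stable-diffusion-2-1-base"

# v2.1 (use the weights downloaded in the previous step)
model_base="./stable-diffusion-2-1-base"

Note: If the hardware allows, this model can run inference on two chips in parallel for a shorter end-to-end time. Where the commands differ for the parallel scheme, the later steps call this out separately.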

Run the commands:

# Set the model batch size
bs=1

python3 stable_diffusion_2_onnx.py --model ${model_base} --output_dir ./models_bs${bs} --batch_size ${bs}

# Use the parallel scheme
python3 stable_diffusion_2_onnx.py --model ${model_base} --output_dir ./models_bs${bs} --batch_size ${bs} --parallel

Parameters:

--model: model name or path to a local model directory
--output_dir: output directory for the ONNX models
--batch_size: model batch size
--parallel: export models suited to the parallel scheme

On success, the following ONNX models are generated:

models_bs${bs}/clip/clip.onnx
models_bs${bs}/unet/unet.onnx
models_bs${bs}/vae/vae.onnx
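
Before moving on, it can help to confirm that all three files were exported. The following is a minimal sketch that checks the relative paths listed above; the function name and the default directory `./models_bs1` (that is, `bs=1`) are illustrative assumptions.

```python
import sys
from pathlib import Path

# Relative paths that this guide lists as the result of the export step.
EXPECTED_ONNX = ("clip/clip.onnx", "unet/unet.onnx", "vae/vae.onnx")


def missing_files(model_dir, expected=EXPECTED_ONNX):
    """Return the expected files that are absent under model_dir."""
    root = Path(model_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical default: the output directory used when bs=1.
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "./models_bs1"
    missing = missing_files(model_dir)
    if missing:
        print(f"Missing under {model_dir}: {', '.join(missing)}")
        sys.exit(1)
    print("All expected ONNX files are present.")
```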

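Optimize the ONNX model
1. Quantization (optional; can improve performance on Duo/Pro cards but may reduce accuracy)
  
  For the quantization steps, see the quantization guide.
2. Model optimization
  
  Run the modify_onnx.py script.
  
  Without quantization, set TOME_num to 5 for the best performance gain. With quantization, a TOME_num of 4 is recommended for a good balance of accuracy and performance.
```
# Use the unquantized model
python3 modify_onnx.py \
      --model models_bs${bs}/unet/unet.onnx \
      --new_model models_bs${bs}/unet/unet_md.onnx \
      --FA_soc Duo \
      --TOME_num 5 \
      --faster_gelu

# Use the quantized model
python3 modify_onnx.py \
      --model models_bs${bs}/unet_quant/unet.onnx \
      --new_model models_bs${bs}/unet/unet_md.onnx \
      --FA_soc Duo \
      --TOME_num 4 \
      --faster_gelu
```
  Parameters:
  - --model: path to the ONNX model.
  - --new_model: path of the optimized ONNX model to generate.
  - --FA_soc: hardware type on which the FA operator is used. The FlashAttention operator currently supports Atlas 300I Duo/Pro and Atlas 800I A2. Set this to Duo or A2 according to your hardware, and to None for unsupported hardware. Default: None.
  - --TOME_num: number of TOME plugins to insert; valid values are in [0, 5]. The TOME plugin currently supports Atlas 300I Duo/Pro and Atlas 800I A2. Set this to 0 for unsupported hardware. Default: 0.
  - --faster_gelu: use the fused slice+gelu operator.
  The FA, TOME, and fused Gelu operators are provided by the inference engine package (MindIE) that matches your CANN version. If the inference engine is not installed, or the installed version does not support the FA, TOME, and SliceGelu operators, leave FA_soc and TOME_num at their defaults and do not set faster_gelu.
3. Use the cache scheme (optional; can improve performance but may reduce accuracy)
  
  Run the unet_cache.py script.
```
python3 unet_cache.py --model models_bs${bs}/unet/unet_md.onnx --save_dir models_bs${bs}/unet/
```
  Parameters:
  - --model: path to the optimized ONNX model.
  - --save_dir: directory in which to save the cache models.
  On success, unet_cache.onnx and unet_skip.onnx are generated under save_dir.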
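
The constraints on the modify_onnx.py parameters in step 2 can be checked before the script is launched. The sketch below only assembles the argument list from the flags documented above and validates the documented value ranges. The helper name `build_modify_onnx_args` is hypothetical, and modify_onnx.py may perform its own validation.

```python
def build_modify_onnx_args(model, new_model, fa_soc="None", tome_num=0, faster_gelu=False):
    """Assemble the modify_onnx.py command line, enforcing the documented ranges."""
    if fa_soc not in ("Duo", "A2", "None"):
        raise ValueError(f"--FA_soc must be Duo, A2 or None, got {fa_soc!r}")
    if not isinstance(tome_num, int) or not 0 <= tome_num <= 5:
        raise ValueError(f"--TOME_num must be an integer in [0, 5], got {tome_num!r}")
    args = [
        "python3", "modify_onnx.py",
        "--model", model,
        "--new_model", new_model,
        "--FA_soc", fa_soc,
        "--TOME_num", str(tome_num),
    ]
    if faster_gelu:
        args.append("--faster_gelu")
    return args


if __name__ == "__main__":
    # Unquantized model on an Atlas 300I Duo, as in the first command above.
    print(" ".join(build_modify_onnx_args(
        "models_bs1/unet/unet.onnx",
        "models_bs1/unet/unet_md.onnx",
        fa_soc="Duo", tome_num=5, faster_gelu=True,
    )))
```

Calling the helper with no optional arguments yields the default configuration recommended when MindIE is not installed: `--FA_soc None`, `--TOME_num 0`, and no `--faster_gelu`.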

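Use the ATC tool to convert the ONNX models to OM models.

Configure the environment variables.
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# If the inference engine operator package is installed, also configure the inference engine path
source /usr/local/Ascend/mindie-rt/set_env.sh
```
Note: The environment variables in this script are for reference only; configure them according to your actual installation. For details, see the CANN Development Auxiliary Tools Guide (Inference).

Run the following command to look up the chip name (${chip_name}).

npu-smi info
# On this device ${chip_name} is 310P3, so the SoC version is Ascend310P3 (replace with your own)
Example output: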
+-------------------+-----------------+------------------------------------------------------+
| NPU     Name      | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device    | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
+===================+=================+======================================================+
| 0       310P3     | OK              | 15.8         42                0    / 0              |
| 0       0         | 0000:82:00.0    | 0            1074 / 21534                            |
+===================+=================+======================================================+
| 1       310P3     | OK              | 15.4         43                0    / 0              |
| 0       1         | 0000:89:00.0    | 0            1070 / 21534                            |
+===================+=================+======================================================+

Run the ATC commands.

# clip
atc --framework=5 \
    --model=./models_bs${bs}/clip/clip.onnx \
    --output=./models_bs${bs}/clip/clip \
    --input_format=ND \
    --log=error \
    --soc_version=Ascend${chip_name}

# unet
cd ./models_bs${bs}/unet/

# Without the cache scheme
atc --framework=5 \
    --model=./unet_md.onnx \
    --output=./unet \
    --input_format=NCHW \
    --log=error \
    --optypelist_for_implmode="Gelu,Sigmoid" \
    --op_select_implmode=high_performance \
    --soc_version=Ascend${chip_name}

# With the cache scheme
atc --framework=5 \
    --model=./unet_cache.onnx \
    --output=./unet_cache \
    --input_format=NCHW \
    --log=error \
    --optypelist_for_implmode="Gelu,Sigmoid" \
    --op_select_implmode=high_performance \
    --soc_version=Ascend${chip_name}

atc --framework=5 \
    --model=./unet_skip.onnx \
    --output=./unet_skip \
    --input_format=NCHW \
    --log=error \
    --optypelist_for_implmode="Gelu,Sigmoid" \
    --op_select_implmode=high_performance \
    --soc_version=Ascend${chip_name}

cd ../../

# vae
atc --framework=5 \
    --model=./models_bs${bs}/vae/vae.onnx \
    --output=./models_bs${bs}/vae/vae \
    --input_format=NCHW \
    --log=error \
    --soc_version=Ascend${chip_name}

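Parameters:

--model: the ONNX model file.
--output: the output OM model.
--framework: 5 indicates an ONNX model.
--log: log level.
--soc_version: processor model.
--input_shape: input shape information of the model.

On success, the following OM models are generated: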

models_bs${bs}/clip/clip.om
models_bs${bs}/unet/unet.om
models_bs${bs}/unet/unet_cache.om
models_bs${bs}/unet/unet_skip.om
models_bs${bs}/vae/vae.om
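
The ATC invocations above differ only in the model path, the output prefix, the input format, and whether the high-performance implementation is selected for Gelu and Sigmoid. As a rough sketch, the argument list could be assembled as follows. The helper `build_atc_args` is hypothetical, and only flags that appear in the commands above are used.

```python
import shlex


def build_atc_args(model, output, input_format, chip_name, high_performance_ops=None):
    """Assemble an atc command line for converting one ONNX model to OM."""
    args = [
        "atc",
        "--framework=5",  # 5 indicates an ONNX model
        f"--model={model}",
        f"--output={output}",
        f"--input_format={input_format}",
        "--log=error",
    ]
    if high_performance_ops:
        # Used for the UNet variants in this guide.
        args.append("--optypelist_for_implmode=" + ",".join(high_performance_ops))
        args.append("--op_select_implmode=high_performance")
    args.append(f"--soc_version=Ascend{chip_name}")
    return args


if __name__ == "__main__":
    # The three UNet variants share every flag except the model and output names.
    for name in ("unet_md", "unet_cache", "unet_skip"):
        output = "./unet" if name == "unet_md" else f"./{name}"
        cmd = build_atc_args(f"./{name}.onnx", output, "NCHW", "310P3", ["Gelu", "Sigmoid"])
        print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the CANN environment variables have been sourced.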

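Start inference verification.
1. Install a CPU-binding tool and map the task process to a NUMA node according to NUMA affinity, to rule out the influence of the CPU
  
  Install the CPU-binding tool
```
yum install numactl
```
  Query the NUMA node of the card
```
lspci -vs bus-id
```
  The bus-id can be obtained from npu-smi info. Once you know the NUMA node, prefix the inference command with numactl -C and the number of a CPU core on that node.
  
  The CPU core range of each NUMA node can be obtained with lscpu
```
NUMA node0: 0-23
NUMA node1: 24-47
NUMA node2: 48-71
NUMA node3: 72-95
```
  In this example the NUMA node is 0, which corresponds to cores 0-23. Binding to a single core in that range is recommended for better performance.
2. Run the inference script.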
```
# Normal mode
numactl -C 0 python3 stable_diffusion_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models_bs${bs} \
        --prompt_file ./prompts.txt \
        --device 0 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50 \
        --use_cache

# Parallel mode
numactl -C 0 python3 stable_diffusion_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models_bs${bs} \
        --prompt_file ./prompts.txt \
        --device 0,1 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50 \
        --use_cache
```
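  Parameters:
  - --model: model name or path to a local model directory.
  - --model_dir: directory containing the exported models.
  - --prompt_file: input text file, one prompt per line.
  - --save_dir: directory in which to store the generated images.
  - --batch_size: model batch size.
  - --steps: number of iterations used to generate an image.
  - --device: inference device ID. Two device IDs separated by a comma can be passed, in which case inference runs in parallel mode.
  - --use_cache: use the cache during inference.
  - --cache_steps: number of iterations that use the cache. More cached iterations give better performance, but too many may reduce accuracy.
  When the run completes, the generated images are saved in the ./results directory and the inference time is printed in the terminal, for example:
```
[info] infer number: 16; use time: 292.648s; average time: 18.290s
```
  Note:
  
  On an Arm machine, if the error *torch*.so*: cannot allocate memory in static TLS block occurs, add an environment variable that points to the path reported in the error
```
# Replace the placeholder with the .so path reported in the error
export LD_PRELOAD=/path/to/reported.so:$LD_PRELOAD
```
3. Sample inference images are provided in the ./test_results directory. Note: the generated images differ on every run. Some test results: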
  
  Prompt: "Beautiful illustration of The ocean. in a serene landscape, magic realism, narrative realism, beautiful matte painting, heavenly lighting, retrowave, 4 k hd wallpaper"
  
  Prompt: "Beautiful illustration of Islands in a serene landscape, magic realism, narrative realism, beautiful matte painting, heavenly lighting, retrowave, 4 k hd wallpaper"
  
  Prompt: "Beautiful illustration of Seaports in a serene landscape, magic realism, narrative realism, beautiful matte painting, heavenly lighting, retrowave, 4 k hd wallpaper"
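
The NUMA lookup in step 1 can be scripted. The sketch below parses NUMA node lines, accepting both the abbreviated form shown in step 1 and the `NUMA nodeN CPU(s):` form that lscpu commonly prints, and builds the `numactl -C` prefix for the first core of a node. The function names are illustrative, and picking the first core is just one possible choice within the node's range.

```python
import re

# Accepts "NUMA node0: 0-23" as well as "NUMA node0 CPU(s):   0-23".
_NODE_LINE = re.compile(r"NUMA node(\d+)(?: CPU\(s\))?:\s*(\S+)")


def numa_cpu_lists(lscpu_text):
    """Return {node_id: cpu_list_string}, e.g. {0: '0-23', 1: '24-47'}."""
    return {int(m.group(1)): m.group(2) for m in _NODE_LINE.finditer(lscpu_text)}


def first_core(cpu_list):
    """Return the first core of a list such as '0-23' or '24-47,72-95'."""
    first_range = cpu_list.split(",")[0]
    return int(first_range.split("-")[0])


def numactl_prefix(lscpu_text, node):
    """Build the command prefix that binds a process to one core of the node."""
    cpu_list = numa_cpu_lists(lscpu_text)[node]
    return ["numactl", "-C", str(first_core(cpu_list))]


if __name__ == "__main__":
    sample = "NUMA node0: 0-23\nNUMA node1: 24-47\nNUMA node2: 48-71\nNUMA node3: 72-95\n"
    print(" ".join(numactl_prefix(sample, 0)))
```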

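Accuracy verification

Because the generated images are random, accuracy verification uses the CLIP-score to evaluate how well each image matches its input text. Scores lie in the range [-1, 1]; higher is better.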
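
The [-1, 1] range is what a cosine similarity between two embedding vectors produces. The sketch below illustrates that calculation on toy vectors. It assumes the score is the cosine similarity between a CLIP image embedding and a CLIP text embedding, which this guide does not state explicitly; see clip_score.py for the actual computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; the result lies in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real embeddings.
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
    print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```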

Note that because a large number of images must be generated, a full accuracy verification takes a long time.

Download the Parti dataset

wget https://raw.githubusercontent.com/google-research/parti/main/PartiPrompts.tsv --no-check-certificate

Download the CLIP model weights

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
cd ./CLIP-ViT-H-14-laion2B-s32B-b79K

# Download with git-lfs
git lfs pull

# Or visit https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/blob/main/open_clip_pytorch_model.bin, download the weights, and place them in this directory

Use the inference script to read the Parti dataset and generate images
```
# Normal mode
numactl -C 0 python3 stable_diffusion_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models_bs${bs} \
        --prompt_file ./PartiPrompts.tsv \
        --prompt_file_type parti \
        --num_images_per_prompt 4 \
        --max_num_prompts 0 \
        --device 0 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50 \
        --use_cache

# Parallel mode
numactl -C 0 python3 stable_diffusion_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models_bs${bs} \
        --prompt_file ./PartiPrompts.tsv \
        --prompt_file_type parti \
        --num_images_per_prompt 4 \
        --max_num_prompts 0 \
        --device 0,1 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50 \
        --use_cache
```
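Parameters:
- --model: model name or path to a local model directory.
- --model_dir: directory containing the exported models.
- --prompt_file: input text file, one prompt per line.
- --prompt_file_type: prompt file type, which determines how the file is read.
- --num_images_per_prompt: number of images generated per prompt.
- --max_num_prompts: limit the prompts to the first X; 0 means no limit.
- --save_dir: directory in which to store the generated images.
- --batch_size: model batch size.
- --steps: number of iterations used to generate an image.
- --device: inference device ID. Two device IDs separated by a comma can be passed, in which case inference runs in parallel mode.
- --use_cache: use the cache during inference. The more iterations use the cache, the better the performance, but too many may reduce accuracy.
When the run completes, the generated images are saved in the ./results directory, and an image_info.json file that records the mapping between images and prompts is created in the current directory.
Compute the CLIP-score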
```
python clip_score.py \
       --device=cpu \
       --image_info="image_info.json" \
       --model_name="ViT-H-14" \
       --model_weights_path="./CLIP-ViT-H-14-laion2B-s32B-b79K/open_clip_pytorch_model.bin"
```
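Parameters:
- --device: inference device.
- --image_info: the image_info.json file generated in the previous step.
- --model_name: CLIP model name.
- --model_weights_path: path to the CLIP model weights file.
When the run completes, the accuracy results are printed to the screen.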

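Model inference performance & accuracy

Inference is performed through the ACL interface. The following data gives the reference performance.

StableDiffusion v2.1

Accelerator card	Server	Run mode	Optimizations	Iterations	Average time
Atlas 300I Duo	Atlas 800 3000 + 2 processors, processor specification: 48 cores at 3.0 GHz	Parallel	FA+TOME*5+faster_gelu+cache	50	1.513s

Reference accuracy results for 50 iterations: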

average score: 0.379
category average scores:
[Abstract], average score: 0.285
[Vehicles], average score: 0.379
[Illustrations], average score: 0.378
[Arts], average score: 0.425
[World Knowledge], average score: 0.388
[People], average score: 0.382
[Animals], average score: 0.389
[Artifacts], average score: 0.374
[Food & Beverage], average score: 0.367
[Produce & Plants], average score: 0.367
[Outdoor Scenes], average score: 0.372
[Indoor Scenes], average score: 0.382

Public network address statement

For the public network addresses referenced in the code, see public_address_statement.md