
stable-diffusionxl_refiner Model Inference Guide

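Overview
- Input and output data
Inference environment setup
Quick start
- Get the source code
- Model inference
Model inference performance & accuracy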

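Overview

SDXL consists of an ensemble of expert pipelines for latent diffusion: in the first step, a base model generates (noisy) latents, which are then processed by a refinement model specialized for the final denoising steps. The refinement model covered by this guide provides SDXL's image-to-image capability.

Reference implementation: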

 # StableDiffusionxl
 https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

Input and output data

Input data	Size	Data type	Data layout format
prompt	1 x 77	INT64	ND

Output data	Size	Data type	Data layout format
output1	1 x 3 x 1024 x 1024	FLOAT32	NCHW

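Inference environment setup

This model requires the following plugins and drivers.

Table 1 Version compatibility

Component	Version	Environment setup guide
Firmware and driver	24.1.rc1	PyTorch framework inference environment setup
CANN (+MindIE)	8.0.RC1 (1.0.RC1)	-
Python	3.10	-

If the --FA_soc, --TOME_num, or --faster_gelu parameters are used when optimizing the model, install the MindIE version that matches the CANN package.

Model performance depends on the CPU specification. A 64-core CPU (Arm) is recommended to reproduce the reported performance.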

Quick start

Get the source code

Get the source code of this repository.

git clone https://gitee.com/ascend/ModelZoo-PyTorch.git
cd ModelZoo-PyTorch/ACL_PyTorch/built-in/foundation_models/stable_diffusionxl_refiner

Install the dependencies.

pip3 install -r requirements.txt

git clone https://github.com/tgxs002/HPSv2.git
cd HPSv2
pip3 install -e .

Modify the code

Run the following commands:

TRANSFORMERS_PATH=`python3 -c "import transformers; print(transformers.__path__[0])"`
patch  -p0 ${TRANSFORMERS_PATH}/models/clip/modeling_clip.py clip.patch

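Install the Ascend unified inference tool (AIT)

Visit the AIT code repository and install the tool by following its readme. You may install only the required component, debug surgeon; the other components are optional.

Visit ais_bench and install the tool by following its readme.

Model inference

Model conversion. Use PyTorch to convert the model weight files into .onnx files, then use the ATC tool to convert the .onnx files into .om offline inference model files.

Get the weights (optional)

You can download the weights in advance and place them in the same directory as the code, to avoid download failures in later steps.

# Requires git-lfs (https://git-lfs.com)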
git lfs install

# Download the weights
git clone https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0

Export the ONNX models

Set the model name or path

# Model name (weights are downloaded at run time)
model_base="stabilityai/stable-diffusion-xl-refiner-1.0"

# Local path (weights downloaded in advance)
model_base="./stable-diffusion-xl-refiner-1.0"
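The choice between the two `model_base` values can be made automatically. The following is a minimal sketch, not part of the repository: the helper name `pick_model_base` is hypothetical, and it simply prefers the local directory from this guide when it exists.

```python
from pathlib import Path

# Both values are taken from this guide.
HUB_ID = "stabilityai/stable-diffusion-xl-refiner-1.0"
LOCAL_DIR = "./stable-diffusion-xl-refiner-1.0"


def pick_model_base(local_dir: str = LOCAL_DIR, hub_id: str = HUB_ID) -> str:
    """Return the local weight directory if it exists, otherwise the model name.

    Hypothetical helper: it only checks that the directory exists, not that
    the weights inside it are complete.
    """
    return local_dir if Path(local_dir).is_dir() else hub_id


if __name__ == "__main__":
    # Print the value that could be passed to --model.
    print(pick_model_base())
```

The printed value can then be assigned to `model_base` before running the export script.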

Run the following command:

python3 stable_diffusionxl_2_onnx.py --model ${model_base} --output_dir ./models

Parameters:

--model: path to the model weights
--output_dir: output directory for the ONNX models

On success, the following ONNX models are generated:

|—— models
       |—— text_encoder
              |—— text_encoder_2.onnx
       |—— unet
              |—— unet.onnx
       |—— vae
              |—— vae_encoder.onnx
              |—— vae_decoder.onnx
       |—— ddim
              |—— ddim.onnx

Optimize the ONNX model
1. Model optimization
  
  Run the modify_onnx.py script.
```
bs=1

# Non-parallel scheme
python3 modify_onnx.py \
      --model models/unet/unet.onnx \
      --new_model models/unet/unet_md.onnx \
      --FA_soc Duo \
      --faster_gelu \
      --batch_size ${bs}

# Parallel scheme
python3 modify_onnx.py \
      --model models/unet/unet.onnx \
      --new_model models/unet/unet_md.onnx \
      --FA_soc Duo \
      --faster_gelu \
      --batch_size ${bs} \
      --parallel
```
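  Parameters:
  - --model: path to the ONNX model.
  - --new_model: path of the optimized ONNX model to generate.
  - --FA_soc: hardware type on which the FA operator is used. The FlashAttention operator currently supports Atlas 300I Duo/Pro and Atlas 800I A2; set this parameter to Duo or A2 according to your hardware, and to None on unsupported hardware.
  - --faster_gelu: use the fused slice+gelu operator.
  - --batch_size: generate a model for the specified batch size. The default is 1.
  - --parallel: generate a model for the parallel scheme.
  The FA and SliceGelu fused operators are provided by the inference engine package (MindIE) that matches the CANN version. If the inference engine is not installed, or the installed version does not support the FA and SliceGelu operators, leave FA_soc at its default and do not set faster_gelu.
  
  Multi-batch limitation: in multi-batch scenarios, FA operator optimization is not yet supported on A2, so set FA_soc to None.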
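The parameter rules above can be collected in one place. The following is a sketch under stated assumptions: `build_modify_onnx_cmd` and its `mindie_installed` argument are hypothetical names, "multi-batch" is read as a batch size greater than 1, and only flags named in this guide are emitted.

```python
from typing import List


def build_modify_onnx_cmd(soc: str, batch_size: int = 1, parallel: bool = False,
                          mindie_installed: bool = True) -> List[str]:
    """Assemble a modify_onnx.py command line following the rules in this guide."""
    if soc not in ("Duo", "A2", "None"):
        raise ValueError("soc must be 'Duo', 'A2' or 'None'")
    cmd = [
        "python3", "modify_onnx.py",
        "--model", "models/unet/unet.onnx",
        "--new_model", "models/unet/unet_md.onnx",
        "--batch_size", str(batch_size),
    ]
    if mindie_installed:
        # Multi-batch on A2: FA optimization is not supported, so use None.
        fa_soc = "None" if (soc == "A2" and batch_size > 1) else soc
        cmd += ["--FA_soc", fa_soc, "--faster_gelu"]
    # Without MindIE, FA_soc keeps its default and --faster_gelu is omitted.
    if parallel:
        cmd.append("--parallel")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_modify_onnx_cmd("Duo", batch_size=1, parallel=True)))
```

The returned list can be passed to `subprocess.run` or joined into a shell command.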

Use the ATC tool to convert the ONNX models into OM models.

Configure the environment variables.
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# If the inference engine operator package is installed, also configure the inference engine path
source /usr/local/Ascend/mindie/set_env.sh
```
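Note: the environment variables in this script are for reference only; configure them according to your actual installation. For details, see the "CANN Development Auxiliary Tools Guide (Inference)".

Run the following command to check the chip name (${chip_name}).

npu-smi info
# In this example the chip name is 310P3, so soc_version is Ascend310P3 (replace with your own)
Sample output: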
+-------------------+-----------------+------------------------------------------------------+
| NPU     Name      | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device    | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
+===================+=================+======================================================+
| 0       310P3     | OK              | 15.8         42                0    / 0              |
| 0       0         | 0000:82:00.0    | 0            1074 / 21534                            |
+===================+=================+======================================================+
| 1       310P3     | OK              | 15.4         43                0    / 0              |
| 0       1         | 0000:89:00.0    | 0            1070 / 21534                            |
+===================+=================+======================================================+
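The chip name and bus-id can also be read from this output programmatically. The sketch below assumes the table layout shown in the sample above; the real layout may differ between driver versions, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Row such as "| 0       310P3     | OK              | ..."
NAME_ROW = re.compile(r"^\|\s*(\d+)\s+(\S+)\s*\|\s*[A-Za-z]+\s*\|")
# Row such as "| 0       0         | 0000:82:00.0    | ..."
BUS_ROW = re.compile(r"^\|\s*\d+\s+\d+\s*\|\s*([0-9A-Fa-f:.]+)\s*\|")

SAMPLE = """\
| NPU     Name      | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device    | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
| 0       310P3     | OK              | 15.8         42                0    / 0              |
| 0       0         | 0000:82:00.0    | 0            1074 / 21534                            |
| 1       310P3     | OK              | 15.4         43                0    / 0              |
| 0       1         | 0000:89:00.0    | 0            1070 / 21534                            |
"""


def parse_npu_smi(text: str) -> List[Tuple[int, str, str]]:
    """Return (npu_id, chip_name, bus_id) for each NPU found in the text."""
    result = []
    pending = None
    for raw in text.splitlines():
        line = raw.strip()
        name = NAME_ROW.match(line)
        if name:
            pending = (int(name.group(1)), name.group(2))
            continue
        bus = BUS_ROW.match(line)
        if bus and pending is not None:
            result.append((pending[0], pending[1], bus.group(1)))
            pending = None
    return result


def soc_version(chip_name: str) -> str:
    """Build the value passed to --soc_version, as in Ascend${chip_name}."""
    return f"Ascend{chip_name}"


if __name__ == "__main__":
    for npu_id, chip, bus_id in parse_npu_smi(SAMPLE):
        print(npu_id, soc_version(chip), bus_id)
```

The bus-id extracted here is the value used later with lspci to find the NUMA node of the card.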

Run the ATC commands.

# text_encoder
cd ./models/text_encoder
atc --framework=5 \
    --model=./text_encoder_2.onnx \
    --output=./text_encoder_2 \
    --input_format=ND \
    --input_shape="prompt:${bs},77" \
    --log=error \
    --soc_version=Ascend${chip_name}

# unet
cd ../unet/

atc --framework=5 \
    --model=./unet_md.onnx \
    --output=./unet \
    --input_format=NCHW \
    --log=error \
    --optypelist_for_implmode="Gelu,Sigmoid" \
    --op_select_implmode=high_performance \
    --soc_version=Ascend${chip_name}

cd ../../

# vae
atc --framework=5 \
    --model=./models/vae/vae_encoder.onnx \
    --output=./models/vae/vae_encoder \
    --input_format=NCHW \
    --input_shape="image:${bs},3,1024,1024" \
    --log=error \
    --soc_version=Ascend${chip_name}

atc --framework=5 \
    --model=./models/vae/vae_decoder.onnx \
    --output=./models/vae/vae_decoder \
    --input_format=NCHW \
    --input_shape="latents:${bs},4,128,128" \
    --log=error \
    --soc_version=Ascend${chip_name}

# If the ddim sampler is used
atc --framework=5 \
    --model=./models/ddim/ddim.onnx \
    --output=./models/ddim/ddim \
    --input_format=ND \
    --input_shape="noise_pred:${bs},4,128,128;latents:${bs},4,128,128" \
    --log=error \
    --soc_version=Ascend${chip_name}

Parameters:

--model: the ONNX model file.
--output: the output OM model.
--framework: 5 denotes an ONNX model.
--log: log level.
--soc_version: processor model.
--input_shape: shape information of the model inputs.
On success, the following OM models are generated:

|—— models
        |—— text_encoder
               |—— text_encoder_2.om
        |—— unet
               |—— unet.om
        |—— vae
               |—— vae_encoder.om
               |—— vae_decoder.om
        |—— ddim
               |—— ddim.om

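Run inference for verification.
1. Install a core-binding tool and, based on NUMA affinity, map the task process to a NUMA node. This removes the influence of the CPU on the measurements.
  
  Install the core-binding tool: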
```
yum install numactl
```
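  Use npu-smi info to look up the device's bus-id, then use lspci -vs bus-id to look up the NUMA node of the card.
  
  Once the NUMA node is known, use lscpu to get the CPU cores that belong to it. Binding to a single one of these cores is recommended for better performance.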
```
NUMA node0: 0-23
NUMA node1: 24-47
NUMA node2: 48-71
NUMA node3: 72-95
```
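  For example, if the device's NUMA node is 3, pick one of the CPU cores belonging to NUMA node3, such as 72.
2. Run the inference script.
  
  Before inference, prepare the required text and images and save their information in a JSON file. For how to generate it, refer to SDXL_Base.
  
  The image paths stored in the JSON file are relative to the JSON file.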
```
# Non-parallel scheme
numactl -C 72 python3 stable_diffusionxl_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models \
        --image_info image_info.json \
        --info_file_save_path refiner_image_info.json \
        --device 0 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50

# Parallel scheme
numactl -C 72 python3 stable_diffusionxl_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models \
        --image_info image_info.json \
        --info_file_save_path refiner_image_info.json \
        --device 0,1 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50
```
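  Parameters:
  - --model: model name or path to a local model directory.
  - --model_dir: directory containing the exported models.
  - --image_info: JSON file holding the input prompts and image paths.
  - --info_file_save_path: JSON file in which the output prompts and image paths are saved.
  - --save_dir: directory in which generated images are saved.
  - --max_num_prompts: limit the prompts to the first X; 0 means no limit.
  - --batch_size: model batch size.
  - --steps: number of iterations used to generate an image.
  - --device: inference device ID. Two device IDs separated by a comma may be passed, in which case inference runs in parallel mode.
  - --use_cache: use the cache during inference.
  - --cache_steps: number of iterations that use the cache. More iterations give better performance, but too many may reduce accuracy. The value range is [1, steps-1].
  - --scheduler: sampler. Options are None, DDIM, Euler, DPM, EulerAncestral, and DPM++SDEKarras. None selects the default scheduler.
  After the run, the inferred images are generated in the ./results directory and the inference time is printed in the terminal, for example: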
```
[info] infer number: 16; use time: 104.6s; average time: 6.542s
```
  Note:
  
  On an Arm machine, if the error *torch*.so*: cannot allocate memory in static TLS block occurs, add an environment variable that points to the path in the error message.
```
export LD_PRELOAD=<path_to_reported_so>:$LD_PRELOAD
```
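For the core-binding step in this section, the core passed to `numactl -C` can be chosen from the NUMA listing. The sketch below is an illustration with hypothetical function names: it parses lines in the form shown in this guide (`NUMA node3: 72-95`) and also accepts the `NUMA node3 CPU(s): 72-95` form, which is assumed to be what lscpu prints.

```python
import re
from typing import Dict, List

NUMA_LINE = re.compile(r"\s*NUMA node(\d+)(?: CPU\(s\))?:\s*(\S+)")

SAMPLE = """\
NUMA node0: 0-23
NUMA node1: 24-47
NUMA node2: 48-71
NUMA node3: 72-95
"""


def parse_numa_cpus(text: str) -> Dict[int, List[int]]:
    """Map each NUMA node index to the list of CPU ids that belong to it."""
    nodes: Dict[int, List[int]] = {}
    for line in text.splitlines():
        match = NUMA_LINE.match(line)
        if not match:
            continue
        cpus: List[int] = []
        # A CPU list may mix ranges and single ids, e.g. "0-3,8".
        for part in match.group(2).split(","):
            low, _, high = part.partition("-")
            cpus.extend(range(int(low), int(high or low) + 1))
        nodes[int(match.group(1))] = cpus
    return nodes


def pick_core(text: str, numa_node: int) -> int:
    """Return the first CPU core of the given NUMA node."""
    return parse_numa_cpus(text)[numa_node][0]


if __name__ == "__main__":
    core = pick_core(SAMPLE, 3)
    print(f"numactl -C {core} python3 stable_diffusionxl_ascend_infer.py ...")
```

Taking the first core of the node is an arbitrary choice; any single core belonging to the device's NUMA node matches the recommendation in this guide.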

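Accuracy verification

Because the generated images are random, two accuracy verification methods are provided:

CLIP-score (text-image matching metric): evaluates how relevant an image is to the input text. Scores lie in [-1, 1]; higher is better. Verified on the Parti dataset.
HPSv2 (image aesthetics metric): evaluates the human preference score of the generated images. Scores lie in [0, 1]; higher is better. Verified on the HPSv2 dataset.

Note that because many images have to be generated, a complete accuracy verification takes a long time.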
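The [-1, 1] range of CLIP-score is consistent with a cosine similarity between an image embedding and a text embedding, which is how the metric is commonly defined. The sketch below shows only that arithmetic on plain lists of floats. It is not the repository's clip_score.py, which may normalize or aggregate scores differently.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; the result lies in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy embeddings, for illustration only.
    image_embedding = [0.2, 0.5, 0.1]
    text_embedding = [0.1, 0.4, 0.3]
    print(cosine_similarity(image_embedding, text_embedding))
```

In practice the embeddings would come from the CLIP image and text encoders, and the per-image scores would be averaged over the dataset.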

Download the Parti dataset

wget https://raw.githubusercontent.com/google-research/parti/main/PartiPrompts.tsv --no-check-certificate

Download the model weights

# Weights required by both CLIP-score and HPSv2
GIT_LFS_SKIP_SMUDGE=1 
git clone https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K

# HPSv2 weights
wget https://huggingface.co/spaces/xswu/HPSv2/resolve/main/HPS_v2_compressed.pt --no-check-certificate

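Alternatively, download the CLIP weights manually and place them in the CLIP-ViT-H-14-laion2B-s32B-b79K directory, and download the HPSv2 weights manually and place them in the current path.

Generate images with the inference script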
```
# Non-parallel scheme
python3 stable_diffusionxl_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models \
        --image_info image_info.json \
        --info_file_save_path refiner_image_info.json \
        --max_num_prompts 0 \
        --device 0 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50 \
        --use_cache
        
# Parallel scheme
python3 stable_diffusionxl_ascend_infer.py \
        --model ${model_base} \
        --model_dir ./models \
        --image_info image_info.json \
        --info_file_save_path refiner_image_info.json \
        --max_num_prompts 0 \
        --device 0,1 \
        --save_dir ./results \
        --batch_size ${bs} \
        --steps 50 \
        --use_cache
```
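Parameters:
- --model: model name or path to a local model directory.
- --model_dir: directory containing the exported models.
- --image_info: JSON file holding the input prompts and image paths.
- --info_file_save_path: JSON file in which the output prompts and image paths are saved.
- --num_images_per_prompt: number of images generated per prompt.
- --max_num_prompts: limit the prompts to the first X; 0 means no limit.
- --save_dir: directory in which generated images are saved.
- --batch_size: model batch size.
- --steps: number of iterations used to generate an image.
- --device: inference device ID. Two device IDs separated by a comma may be passed, in which case inference runs in parallel mode.
- --use_cache: use the cache during inference.
- --cache_steps: number of iterations that use the cache. More iterations give better performance, but too many may reduce accuracy.
After the run, the inferred images are generated in the ./results directory, and a refiner_image_info.json file (the path given by --info_file_save_path) is written to record the mapping between images and prompts.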
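When --use_cache is combined with --cache_steps, the value has to respect the range [1, steps-1] given in the parameter description earlier in this guide. The check below is a minimal sketch with a hypothetical function name; the inference script may validate its arguments differently.

```python
def validate_cache_steps(cache_steps: int, steps: int) -> int:
    """Return cache_steps if it lies in [1, steps - 1], otherwise raise ValueError."""
    if steps < 2:
        raise ValueError("steps must be at least 2 when the cache is used")
    if not 1 <= cache_steps <= steps - 1:
        raise ValueError(
            f"cache_steps must be in [1, {steps - 1}], got {cache_steps}"
        )
    return cache_steps


if __name__ == "__main__":
    # With --steps 50, valid values are 1 through 49.
    print(validate_cache_steps(25, 50))
```

Running the check before launching a long accuracy run avoids discovering an invalid value only after the models have been loaded.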
Compute the accuracy metrics
1. CLIP-score
```
python3 clip_score.py \
      --device=cpu \
      --image_info="refiner_image_info.json" \
      --model_name="ViT-H-14" \
      --model_weights_path="./CLIP-ViT-H-14-laion2B-s32B-b79K/open_clip_pytorch_model.bin"
```
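  Parameters:
  - --device: inference device.
  - --image_info: the refiner_image_info.json file generated in the previous step.
  - --model_name: CLIP model name.
  - --model_weights_path: path to the CLIP model weight file.
  After the run, the accuracy result is printed to the screen.
2. HPSv2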
```
python3 hpsv2_score.py \
      --image_info="refiner_image_info.json" \
      --HPSv2_checkpoint="./HPS_v2_compressed.pt" \
      --clip_checkpoint="./CLIP-ViT-H-14-laion2B-s32B-b79K/open_clip_pytorch_model.bin"
```
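  Parameters:
  - --image_info: the refiner_image_info.json file generated in the previous step.
  - --HPSv2_checkpoint: path to the HPSv2 model weight file.
  - --clip_checkpoint: path to the CLIP model weight file.
  After the run, the accuracy result is printed to the screen.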

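Model inference performance & accuracy

Inference is computed through the ACL interface. Refer to the data below for performance.

StableDiffusionxl

Hardware	Batch size	Iterations	Average time	Optimization scheme	CLIP score	Sampler
DUO	1	50	7.54s	Parallel, FA+faster_gelu	0.372	ddim

Performance tests require exclusive use of the NPU and CPU.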
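When collecting an average time like the one in the table, the timing line printed by the inference script can be parsed instead of being copied by hand. The sketch below assumes the exact line format of the sample shown earlier in this guide; `parse_timing` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

TIMING_LINE = re.compile(
    r"\[info\] infer number: (\d+); use time: ([\d.]+)s; average time: ([\d.]+)s"
)


def parse_timing(line: str) -> Optional[Tuple[int, float, float]]:
    """Return (infer_number, total_seconds, average_seconds), or None if no match."""
    match = TIMING_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


if __name__ == "__main__":
    sample = "[info] infer number: 16; use time: 104.6s; average time: 6.542s"
    number, total, average = parse_timing(sample)
    # The recomputed average may differ slightly because the total is rounded.
    print(number, total, average, round(total / number, 3))
```

In the sample line, 104.6 / 16 is about 6.54, which agrees with the printed average up to the rounding of the total time.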