dd6fe347创建于 4月9日历史提交

文件	最后提交记录	最后更新时间
authority_repository	!6719 [built-in][Pytorch] 调整多模态模型存放目录 Merge pull request !6719 from zhangjunyi08/master	1 年前
README.md	fix link validity Co-authored-by: frozenleaves<914814442@qq.com> # message auto-generated for no-merge-commit merge: !7517 merge master into master fix link validity Created-by: frozenn Commit-by: frozenleaves Merged-by: ascend-robot Description: ## Motivation Please describe the motivation of this PR and the goal you want to achieve through this PR. ## Modification Please briefly describe what modification is made in this PR. ## Self-test (Optional) If modifications to this PR may cause/fix function/accuracy/performance DTSs/issues, a self-inspection record needs to be attached. ## BC-breaking (Optional) If there are compatibility issues, such as dependencies on cann/torch_npu versions, they need to be explained in the PR. ## Checklist Before PR: - [ ] The new code needs to comply with the Clean Code specification. - [ ] The PR content is self-checked, and the expression can be clear and the writing standardized After PR: - [ ] CLA has been signed and all committers have signed the CLA in this PR. - [ ] The ci-pipeline is passed, Code Check is passed. See merge request: Ascend/ModelZoo-PyTorch!7517	1 个月前
requirements.txt	!6719 [built-in][Pytorch] 调整多模态模型存放目录 Merge pull request !6719 from zhangjunyi08/master	1 年前

CogVLM2 for Pytorch

简介

模型介绍

官方仓：https://github.com/THUDM/CogVLM2

说明：本仓代码仅为适配官方仓脚本，执行训练与推理需要在官方仓cogvlm2项目路径下进行。

支持任务列表

本仓已支持以下模型任务类型。

模型	模型大小	任务类型	是否支持
CogVLM2	cogvlm2-llama3-chinese-chat-19B	lora微调	✅

代码实现

参考实现

CogVLM仓: https://github.com/THUDM/CogVLM2
commit id: 3adb5ce3243a9c81c1df5336d3297c94d0f9e1cc
参考链接：https://github.com/THUDM/CogVLM2/tree/3adb5ce3243a9c81c1df5336d3297c94d0f9e1cc

适配昇腾 AI 处理器的实现：

url=https://gitcode.com/ascend/ModelZoo-PyTorch.git
code_path=PyTorch/built-in/foundation/CogVLM2

CogVLM2

准备训练环境

安装昇腾环境

请参考昇腾社区中《Pytorch框架训练环境准备》文档搭建昇腾环境，本仓已支持表1中软件版本。

表 1 昇腾软件版本支持表

软件类型	支持版本
FrameworkPTAdapter	6.0.RC3
CANN	8.0.RC3
昇腾NPU固件	24.1.RC3
昇腾NPU驱动	24.1.RC3

安装模型环境

表 2 三方库版本支持表

三方库	支持版本
PyTorch	2.1.0

安装模型对应PyTorch版本需要的依赖, 需要先安装PTA包。
下载model_zoo下面的CogVLM2相关文件，可以根据cogvlm2官方仓安装三方件或者使用本仓提供的requirements.txt进行三方件安装。

pip install -r requirements.txt

准备数据集

微调数据集: 训练与评估所使用的数据集为CogVLM-SFT-311K下载，该数据集是官网提供的一个lora微调数据集。
数据集准备：进行微调时，可以根据官方提醒指定到具体文件，如CogVLM-SFT-311K/llava_instruction_multi_conversations_formate/

获取预训练权重

官方提供微调权重cogvlm2-llama3-chinese-chat-19B下载。

快速开始

微调任务

主要提供基于CogVLM-SFT-311K数据集微调的lora微调训练脚本。

模型适配

替换和新增本仓authority_repository/finetune_demo下的文件到cogvlm2官方仓的finetune_demo/文件下替换和新增本仓llama3_chinese_chat_19B下的文件到预训练权重文件下

执行单机8卡微调

替换authority_repository/finetune_demo/cogvlm2_lora_finetune.sh文件中的"训练数据路径"，"预训练权重路径"和"模型保存路径"为实际路径
CogVLM2官方仓下执行训练，如下命令：

cd finetune_demo
bash cogvlm2_lora_finetune.sh

性能

芯片	卡数	s/it	micro_batch_size	AMP_Type	Torch_Version
竞品A	8p	3.3	1	bf16	2.1
Atlas 900 A2 PODc	8p	3.9	1	bf16	2.1

执行双机16卡微调

替换authority_repository/finetune_demo/cogvlm2_lora_finetune_2nodes.sh文件中的"训练数据路径"，"预训练权重路径"和"模型保存路径"为实际路径
authority_repository/finetune_demo/hostfile中的内容为服务器名(server1和server2)和每台服务器的卡数
由于torch.util.data中的DataLoader在shuffle为False的情况下广播容易超时，所以多机情况下peft_lora.py中DataLoader的shuffle参数建议不要设置为False，不影响模型收敛
双机配置可参考"配置双机通信环境"
CogVLM2官方仓下执行训练，如下命令：

cd finetune_demo
bash cogvlm2_lora_finetune_2nodes.sh

配置双机通信环境

安装pdsh url： https://github.com/chaos/pdsh/tree/pdsh-2.29 安装

git clone https://github.com/chaos/pdsh/archive/refs/tags/pdsh-2.29.tar.gz

tar -zxvf pdsh-2.29.tar.gz
cd pdsh-2.29
./configure --with-ssh --with-rsh --with-mrsh --with-mqshel --with-qshell  --with-dshgroups --with-machines=/etc/pdsh/machines  --without-pam

make
make install

安装完成后，执行pdsh -h命令。显示如下信息，表示安装成功。

# pdsh -h
Usage: pdsh [-options] command ...
-S                return largest of remote command return values
-h                output usage menu and quit
-V                output version information and quit
-q                list the option settings and quit
-b                disable ^C status feature (batch mode)
-d                enable extra debug information from ^C status
-l user           execute remote commands as user
-t seconds        set connect timeout (default is 10 sec)
-u seconds        set command timeout (no default)
-f n              use fanout of n nodes
-w host,host,...  set target node list on command line
-x host,host,...  set node exclusion list on command line
-R name           set rcmd module to name
-M name,...       select one or more misc modules to initialize first
-N                disable hostname: labels on output lines
-L                list info on all loaded modules and exit
-g groupname      target hosts in dsh group "groupname"
-X groupname      exclude hosts in dsh group "groupname"
-a                target all nodes
available rcmd modules: ssh,rsh,exec (default: rsh)

双机通信配置首先，我们需要编辑两台服务器的/etc/hosts文件，添加两台服务器的IP地址，并将node1和node2替换为两台服务器的实际IP地址 ``shell vim /etc/hosts

```shell
node1 server1
node2 server2

然后，我们需要执行以下命令来生成sshkey。

ssh-keygen -t rsa

接着，将ssh-key拷贝到每个节点，本机也要拷贝。

ssh-copy-id root@server1
ssh-copy-id root@server2

然后，在每个节点上运行以下代码，首次执行时需要手动输入yes，然后执行exit退出。再次执行以下命令时，如果不需要输入密码，则表示配置成功。

ssh server1
ssh server2

随机性说明

模型中包含多种随机问题，会影响loss曲线和竞品的对齐，用户可根据需要自行修改，部分确定性问题本代码不做更换：

Cogvlm2项目路径的finetune_demo/peft_lora.py 中DataLoader是开启了shuffle，根据需要进行关闭：
模型本身有确定性问题，需要固定随机种子
triton中的FastRotaryEmbedding和RotaryEmbedding精度上也略有差异

在线推理任务

推理前准备

Cogvlm2项目路径的finetune_demo/peft_infer.py文件中在import torch后新增npu依赖，如下所示。

import torch_npu
from torch_npu.contrib import transfer_to_npu

需要指定推理卡号，如下所示，指定1卡：

export ASCEND_RT_VISIBLE_DEVICES=1

替换peft_infer.py中MODEL_PATH和PEFT_MODEL_PATH为实际路径，执行推理
由于npu暂时不支持get_device_capability算子，因此需要在peft_infer.py中做如下替换

原内容:

TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

修改为:

TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available()  else torch.float16

执行推理

python peft_infer.py

公网地址变更说明

暂无。

变更说明

2024.08.08：CogVLM2 bf16微调任务首次发布。

FAQ

暂无。