GPT-2_for_PyTorch

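Overview

Introduction

Megatron and DeepSpeed are two important pretraining frameworks. Megatron is NVIDIA's framework for very large-scale pretrained models; it relies mainly on tensor parallelism, together with model parallelism, for performance optimization. DeepSpeed is a deep learning acceleration framework developed by a Microsoft team. The two teams collaborated to build the Megatron-DeepSpeed framework, which combines the features of both, and used it to train a 530B-parameter model.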

Reference implementation:

url=https://github.com/microsoft/Megatron-DeepSpeed.git
commit_id=b4d4a0e

Implementation adapted for Ascend AI processors:

url=https://gitcode.com/ascend/ModelZoo-PyTorch.git
code_path=PyTorch/built-in/nlp

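To get the code with Git:

git clone {url}       # clone the repository
cd {code_path}        # switch to the model code path; not needed if the repository contains only this model

Alternatively, click "Download Now" (立即下载) to download the source package.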

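Prepare the training environment

This model does not evolve with software releases (the list of models that do can be viewed here) and has not been adapted or verified on the latest Ascend software stack. You can:

- Based on the PyTorch versions listed below, select the matching CANN and other software from the software version compatibility table, then download and use them.
- If, after checking the compatibility table, you need this model adapted to a newer PyTorch and CANN version, raise the request in modelzoo/issues. Adapting the model yourself does not guarantee that accuracy and performance meet the targets.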

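Prepare the environment

The historical PyTorch versions supported by this model, and the known third-party dependencies, are listed in the table below.

Table 1 Version support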

Torch_Version	Third-party dependency version
PyTorch 1.8	-
PyTorch 1.11	-
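Environment preparation guide.

See "PyTorch Framework Training Environment Preparation" (《Pytorch框架训练环境准备》).
Install dependencies.

In the root directory of the model source package, run the following command to install the dependencies required by your PyTorch version.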
```
pip3 install -r requirements.txt
```

Install Ascend DeepSpeed

cd GPT-2_for_PyTorch
pip3 install deepspeed==0.9.2  # install the upstream deepspeed first
cd ./DeepSpeed
pip3 install ./
cd -
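
After the installation steps above, it can be useful to confirm which `deepspeed` version pip actually installed. The following is a minimal sketch, not part of the original scripts; the helper name `installed_version` is hypothetical. It only queries the upstream `deepspeed` distribution name used in the pip command, because the distribution name of the Ascend adapter installed from ./DeepSpeed is not stated in this document.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "deepspeed" is the distribution name used by the pip command above.
    version = installed_version("deepspeed")
    print("deepspeed:", version if version else "not installed")
```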

Clone the upstream repository

cd GPT-2_for_PyTorch
git clone https://github.com/microsoft/Megatron-DeepSpeed.git
cd ./Megatron-DeepSpeed
git checkout b4d4a0e
cd -

Prepare the dataset

Obtain the dataset.

bash ./test/dataset_preprocess_gpt.sh

Dataset directory structure: by default the dataset is placed under ./data/, with the following layout:

├── ./data/
      ├── gpt2-vocab.json         
      ├── gpt2-merges.txt
      ├── my-gpt_text_sentence.bin
      ├── my-gpt_text_sentence.idx

Note: the training scripts for this dataset serve only as a reference example.
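
Before launching training, the directory layout above can be checked with a short script. This is a sketch under the assumption that the dataset lives in the default ./data/ location; the function name `missing_dataset_files` is hypothetical, and the file names are copied from the listing above.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "gpt2-vocab.json",
    "gpt2-merges.txt",
    "my-gpt_text_sentence.bin",
    "my-gpt_text_sentence.idx",
)


def missing_dataset_files(data_dir: str = "./data") -> List[str]:
    """Return the expected dataset files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected dataset files are present.")
```

An empty list means all four files are in place; otherwise the preprocessing script can be rerun.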

Start training

Train the model

Go to the root directory of the extracted source package.
```
cd ./${model_folder_name}   # replace with the actual model folder name
```
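Run the training script.

This model supports single-node 8-card training. Multi-card training enables several parallelism features for which there is no matching single-card model, so single-card training is not supported.
- Single-node 8-card training

  Start 8-card training.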
```
# Full training run
bash ./test/train_full_8p.sh --data_path=$real_data_path --model_size=$selected_model_size --train_iters=$train_iters
# Performance benchmark run
bash ./test/train_performance_8p.sh --data_path=$real_data_path --model_size=$selected_model_size --train_iters=$train_iters
```
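- Note 1: model_size is the model parameter count. Five options are currently provided: 345M, 1.3B, 2.7B, 3.7B, and 345M_without_mp. Choose one of them; if it is omitted, the default is model_size=345M.
- Note 2: train_iters is the number of training iterations. If it is not set, the default is 500000.
- After training completes, the weight files are saved under ./ckpts/ckpts_tmp, and the training accuracy and performance information is printed.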

Training results

Table 2 Training results

NAME	params	PPL	samples/s	seq_len	Steps
8p-Competitor V	345M	26.375	37.8	1024	500000
8p-NPU	345M	26.485	59.7	1024	500000
8p-Competitor V	1.3B	-	15.71	1024	-
8p-NPU	1.3B	-	21.47	1024	-
8p-Competitor V	2.7B	-	4.125	2048	-
8p-NPU	2.7B	-	5.280	2048	-
8p-Competitor V	3.7B	-	3.120	2048	-
8p-NPU	3.7B	-	4.558	2048	-

Table 3 345M_without_mp training results

NAME	params	PPL	samples/s	seq_len	Steps
8p-Competitor A	345M	28.67	235	1024	100000
8p-NPU	345M	28.71	227	1024	100000
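
To compare the throughput figures in the two tables, the NPU-to-competitor ratio of samples/s can be computed per configuration. The sketch below only restates numbers from the tables; the helper name `speed_ratio` is hypothetical. Note that Table 2 compares against competitor V and Table 3 against competitor A, so the ratios are not comparable across tables.

```python
# (NPU samples/s, competitor samples/s), copied from Table 2 and Table 3.
THROUGHPUT = {
    "345M": (59.7, 37.8),
    "1.3B": (21.47, 15.71),
    "2.7B": (5.280, 4.125),
    "3.7B": (4.558, 3.120),
    "345M_without_mp": (227.0, 235.0),
}


def speed_ratio(npu: float, competitor: float) -> float:
    """Return NPU throughput relative to the competitor (values above 1 favor the NPU)."""
    return npu / competitor


if __name__ == "__main__":
    for name, (npu, competitor) in THROUGHPUT.items():
        print(f"{name}: NPU/competitor = {speed_ratio(npu, competitor):.2f}")
```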


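Public network address statement

For the public network addresses referenced in the code, see public_address_statement.md.

Version notes

Changes

2023.03.20: Initial release.
2023.05.24: First update.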

FAQ

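For PyTorch 2.0 and later, the torch._six interface has been deprecated, and the NPU currently supports only deepspeed version 0.6.0, so the source files of that third-party library can be modified accordingly.

# Edit the source files at the following paths (adjust them to your environment), changing "from torch._six import inf" to "from math import inf"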
vim /usr/local/python3.8.10/lib/python3.8/site-packages/deepspeed/runtime/utils.py +18
vim /usr/local/python3.8.10/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py +8
vim /usr/local/python3.8.10/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py +19
vim /home/xxx/GPT-2_for_PyTorch/Megatron-DeepSpeed/megatron/optimizer/clip_grads.py +19
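
The manual edit above can also be scripted. The following is a minimal sketch, assuming each target file contains the import exactly as quoted in the comment; the function name `patch_six_import` is hypothetical, and the file paths must be supplied by the user because they depend on the local Python installation.

```python
from pathlib import Path

OLD_IMPORT = "from torch._six import inf"
NEW_IMPORT = "from math import inf"


def patch_six_import(path: str) -> bool:
    """Replace the torch._six import in one file; return True if the file changed."""
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")
    if OLD_IMPORT not in text:
        return False
    file_path.write_text(text.replace(OLD_IMPORT, NEW_IMPORT), encoding="utf-8")
    return True


if __name__ == "__main__":
    import sys

    # Usage: pass the paths of the files to patch as command-line arguments.
    for target in sys.argv[1:]:
        status = "patched" if patch_six_import(target) else "unchanged"
        print(f"{target}: {status}")
```

Running it a second time on the same file reports "unchanged", so the script is safe to repeat. Backing up the site-packages files first is advisable.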