CodeShell-7B for PyTorch

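Overview

Model introduction

CodeShell is a multilingual code large language model (LLM) base developed by the Knowledge Computing Lab of Peking University together with the AI team of Sichuan Tianfu Bank. CodeShell has 7 billion parameters, was trained on 500 billion tokens, and has a context window of 8192 tokens. On the authoritative code evaluation benchmarks HumanEval and MBPP, CodeShell achieves the best performance among models of comparable size. Deployment solutions and IDE plugins that accompany CodeShell are also available; see the CodeShell repository.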

Supported tasks

This repository currently supports the following model task types.

Model	Task	Supported
CodeShell-7B	Fine-tuning	✔

Code implementation

Reference implementation:

url=https://github.com/WisdomShell/codeshell
commit_id=a33262e348eff888a28dd7226ee11ebc083c9df0

Implementation adapted for Ascend AI processors:

url=https://gitcode.com/ascend/ModelZoo-PyTorch.git
code_path=PyTorch/built-in/foundation/CodeShell

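Prepare the training environment

This model does not evolve with software releases (the list of models that do can be viewed here) and has not been adapted or verified on the latest Ascend software stack. You can:

Use the PyTorch versions listed below to select and download the matching CANN and related software from the software version compatibility table.
If, after checking the version compatibility table, you confirm that you need this model adapted to a newer PyTorch and CANN release, file a request in modelzoo/issues. Adaptations you make yourself are not guaranteed to meet the accuracy and performance targets.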

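Install the environment

Table 1 Supported third-party library versions

Library	Supported version (PT2.1)	Supported version (PT2.4)
PyTorch	2.1.0	2.4.0
accelerate	0.29.3	0.29.3
deepspeed	0.12.6	0.15.3
transformers	4.40.1	4.40.1

From the model root directory, run the command below that matches your PyTorch version to install the required dependencies.

# For PyTorch 2.1, use requirements_2_1.txt
pip install -r requirements_2_1.txt

# For PyTorch 2.4, use requirements_2_4.txt
pip install -r requirements_2_4.txt

Note: Run only the one install command that matches your PyTorch version.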
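After installation, it can be useful to confirm that the environment matches Table 1. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `find_mismatches` are hypothetical, and it assumes PyTorch is installed under the pip distribution name `torch`.

```python
from importlib import metadata

# Pinned versions copied from Table 1.
# Assumption: "torch" is the pip distribution name for PyTorch.
PINNED = {
    "2.1": {"torch": "2.1.0", "accelerate": "0.29.3",
            "deepspeed": "0.12.6", "transformers": "4.40.1"},
    "2.4": {"torch": "2.4.0", "accelerate": "0.29.3",
            "deepspeed": "0.15.3", "transformers": "4.40.1"},
}


def installed_versions(names):
    """Return {name: version or None} for the given pip distributions."""
    found = {}
    for name in names:
        try:
            # Drop local build suffixes such as "+cpu" before comparing.
            found[name] = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pt_series, installed):
    """List (package, expected, actual) tuples that differ from Table 1."""
    expected = PINNED[pt_series]
    return [
        (name, want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    ]


if __name__ == "__main__":
    series = "2.1"  # or "2.4", depending on which requirements file was used
    for name, want, got in find_mismatches(series, installed_versions(PINNED[series])):
        print(f"{name}: expected {want}, found {got}")
```

An empty result means every package in the chosen column of Table 1 is installed at the listed version.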

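Set up the Ascend environment

Follow the "PyTorch Framework Training Environment Preparation" document in the Ascend community to set up the Ascend environment. This repository supports the software versions listed in Table 2.

Table 2 Supported Ascend software versions

Software	Supported version
FrameworkPTAdapter	6.0.RC2
CANN	8.0.RC2
Ascend NPU firmware	24.1.RC2
Ascend NPU driver	24.1.RC2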

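Prepare the pretrained weights

Download the CodeShell-7B pretrained weights manually from the official site (all files except modeling_codeshell.py) and place them in the model directory.

The model directory should look like this: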

├── model
      ├──added_tokens.json
      ├──config.json
      ├──configuration_codeshell.py
      ├──generation_config.json
      ├──merges.txt
      ├──model-00001-of-00002.safetensors
      ├──model-00002-of-00002.safetensors
      ├──model.safetensors.index.json
      ├──modeling_codeshell.py
      ├──pytorch_model.bin.index.json
      ├──quantizer.py
      ├──special_tokens_map.json
      ├──tokenizer.json
      ├──tokenizer_config.json
      ├──vocab.json
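
Before starting a run, the model directory can be checked against the listing above. This is a minimal sketch under the assumption that every file in the listing is required; the function name `missing_model_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the reference directory listing above.
EXPECTED_FILES = [
    "added_tokens.json",
    "config.json",
    "configuration_codeshell.py",
    "generation_config.json",
    "merges.txt",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "model.safetensors.index.json",
    "modeling_codeshell.py",
    "pytorch_model.bin.index.json",
    "quantizer.py",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
]


def missing_model_files(model_dir="model"):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```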

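Prepare the dataset

Download the alpaca_data.json dataset yourself and place it in the finetune directory.
From the CodeShell-7B root directory, run the following Python command:

python convert_alpaca.py --in-file finetune/alpaca_data.json --out-file finetune/data.json

After processing, the dataset directory has the following structure:

├── finetune
      ├──alpaca_data.json (original dataset)
      ├──data.json (dataset used for fine-tuning)

Note:
The training script for this dataset is provided only as a reference example.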
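
Before running convert_alpaca.py, it may help to sanity-check the downloaded file. The sketch below assumes alpaca_data.json follows the standard Stanford Alpaca layout, a JSON list of objects with string fields `instruction`, `input`, and `output`. What convert_alpaca.py itself requires is not documented here, so treat this check as an assumption; the helper names are hypothetical.

```python
import json

# Assumption: standard Stanford Alpaca record fields.
REQUIRED_KEYS = ("instruction", "input", "output")


def invalid_records(records):
    """Return indices of records lacking a string value for any required key."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            bad.append(i)
        elif any(not isinstance(rec.get(key), str) for key in REQUIRED_KEYS):
            bad.append(i)
    return bad


def check_alpaca_file(path="finetune/alpaca_data.json"):
    """Load the dataset and return the indices of malformed records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return invalid_records(records)
```

Calling `check_alpaca_file()` from the CodeShell-7B root directory returns an empty list when every record has the three expected fields.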

Quick start

Training task

This task provides a single-node, 8-card training script.

Start training

Go to the source root directory.
```
cd /${model_folder_name}
```

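Run the training script.

The model supports single-node, 8-card training.

Single-node, 8-card training

bash finetune/run_finetune.sh --model=./model --data_path=./finetune/data.json --exp_id=0

The training script parameters are described below.

--model                 // path to the pretrained weights
--data_path             // path to the dataset
--exp_id                // identifier used to distinguish log output and model save paths; customizable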
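
When launching several experiments, the command above can be assembled programmatically. The following is a minimal sketch that uses only the three documented flags; the wrapper names `build_finetune_command` and `run_finetune` are hypothetical, and the script path assumes the current directory is the model root.

```python
import shlex
import subprocess


def build_finetune_command(model="./model",
                           data_path="./finetune/data.json",
                           exp_id="0"):
    """Build the argument list for finetune/run_finetune.sh."""
    return [
        "bash",
        "finetune/run_finetune.sh",
        f"--model={model}",
        f"--data_path={data_path}",
        f"--exp_id={exp_id}",
    ]


def run_finetune(**kwargs):
    """Print and run the training command; raises if the script exits non-zero."""
    cmd = build_finetune_command(**kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_finetune(...) to actually start training.
    print(shlex.join(build_finetune_command(exp_id="demo")))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.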

Training results

The training loss and train_samples_per_second can be read from the training log. The results are as follows:

Chip	Cards	Batch size	Steps	Train_Samples_Per_Second
GPU	8p	6	2000	40.952
Atlas-A2	8p	6	2000	36.801
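
To put the two throughput figures side by side, the ratio can be computed directly from the table. This is a small worked example using only the reported numbers; the function name `relative_throughput` is hypothetical.

```python
def relative_throughput(candidate_sps, baseline_sps):
    """Return candidate throughput as a fraction of the baseline."""
    return candidate_sps / baseline_sps


# Values from the results table (train_samples_per_second, 8 cards each).
GPU_SPS = 40.952
ATLAS_A2_SPS = 36.801

ratio = relative_throughput(ATLAS_A2_SPS, GPU_SPS)
print(f"Atlas-A2 throughput relative to GPU: {ratio:.1%}")
```

With the values above, the Atlas-A2 run reaches about 89.9% of the GPU throughput.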

Public address statement

For the public network addresses referenced in the code, see public_address_statement.md.

Change log

2024.05.16: Initial release.

FAQ

None