CodeGeex
CodeGeeX是一个具有130亿参数的多编程语言代码生成预训练模型。CodeGeeX采用华为MindSpore框架实现,在鹏城实验室“鹏城云脑II”中的192个节点(共1536个国产昇腾910 AI处理器)上训练而成。
快速使用
MindFormers安装以及软硬件配套关系参考MindFormers安装和版本匹配关系。
CodeGeex-13B 预训练权重转换
通过该链接申请权重,您将收到一个包含临时下载链接文件urls.txt的邮件。推荐使用aria2通过以下命令快速下载(请保证有足够的硬盘空间存放权重(~26GB)):
aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt
使用以下命令合并得到完整的权重:
cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
tar xvf codegeex_13b.tar.gz
执行权重转换脚本
python research/codegeex/convert_weight.py --torch_path TORCH_CKPT_DIR --mindspore_path MS_CKPT_NAME
# 参数说明
TORCH_CKPT_DIR: torch权重保存目录路径
mindspore_path: 权重保存文件名,保存为TORCH_CKPT_DIR/OUTPUT_NAME, 也可以指定为自定义保存路径
基于API接口推理
使用MindSpore API进行推理.
Atlas 800T A2需要配置环境变量
# node 1
export MS_ENABLE_GE=1
export MS_GE_TRAIN=1
export MS_ENABLE_REF_MODE=1
export MS_GE_ATOMIC_CLEAN_POLICY=1
# >>> `chat.py`文件
import numpy as np
from typing import *
from mindspore.parallel import set_algo_parameters
from mindformers import PanguAlphaConfig, init_context
from code_tokenizer import CodeTokenizer
from codegeex import CodeGeexHeadModel
# set context
context_config = {"device_target": "Ascend", "mode": 0, "max_device_memory": "31GB", "device_id": 2}
parallel_context_config = {"parallel_mode": 1, "gradients_mean": False, "full_batch": True}
rank_id, device_num = init_context(use_parallel=False, context_config=context_config, parallel_config=parallel_context_config)
set_algo_parameters(elementwise_op_strategy_follow=True, fully_use_devices=True)
config = PanguAlphaConfig(
checkpoint_name_or_path=CKPT_PATH,
batch_size = 1,
seq_length = 2048,
vocab_size = 52224,
hidden_size = 5120,
ffn_hidden_size = 20480,
num_layers = 40,
num_heads = 40,
pad_token_id = 50256,
eos_token_id = 50256,
post_layernorm_residual = False,
param_init_type = 'float16',
compute_dtype = 'float16',
softmax_compute_type = 'float32',
dropout_rate = 0.1,
hidden_act = 'fast_gelu',
use_past = True,
use_moe = False,
expert_num = 1,
per_token_num_experts_chosen = 1,
repetition_penalty = 1,
max_decode_length = 1024,
top_k = 100,
top_p = 0.95,
temperature = 0.8,
do_sample = True,
eod_mask_loss = False,
)
def chat():
model = CodeGeexHeadModel(config)
model.set_train(False)
question_list = [
"def add(a, b):\n '''\n Find the sum of a and b.\n '''\n",
"bool prime(int n) {\n // Find whether n is a prime number\n",
]
# Define tokenizer
tokenizer = CodeTokenizer(config.vocab_size)
i = 0
for question in question_list:
inputs = tokenizer.encode_code(question)
inputs = np.array([inputs]).astype(np.int32) # add batch dim
outputs = model.generate(inputs, max_length=1024, top_p=0.95, temperature=0.8, eos_token_id=50256)
output_samples = tokenizer.decode_code(outputs)
output_samples_str = "".join(output_samples)
print(f"=================== prompt {i} ====================")
print(question, flush=True)
print(f"=================== generation {i} ====================")
print(output_samples_str, flush=True)
i = i + 1
if __name__ == "__main__":
chat()
单机多卡运行训练
# node 1
export MS_ENABLE_GE=1
export MS_GE_TRAIN=1
export MS_ENABLE_REF_MODE=1
export MS_GE_ATOMIC_CLEAN_POLICY=1
cd mindformers/research
bash run_singlenode.sh "python codegeex/run_codegeex.py --config codegeex/run_codegeex_910b.yaml --run_mode=train --train_data path/to/mindrecord_dir" path/to/rank_table_file [0,8] 8
参数说明
config: code_geex相关配置文件
run_mode:运行模式,包括train,finetune,eval,predict
train_data:train数据,训练时需要填入。
更多输入可参考run_codegeex.py 脚本内入参