
GLM4-9B series model support on the MindSpore backend

Model    Download link  Sequence length  Implementation  Cluster  Supported
GLM4-9B  9B             4K               Mcore           1x8
                        32K              Mcore           1x8

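Step-by-step tutorial: running the GLM4-9B model on the MindSpore backend

Environment setup

For the installation steps of the MindSpeed-LLM MindSpore backend, see the MindSpeed LLM installation guide.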

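Weight conversion

  1. Download the weights
    Download the model weights and the other configuration files from Hugging Face. If you need to continue pretraining, fine-tune, or run inference on the open-source weights, also download the network model files.

  2. Convert the weights
    A script is provided to convert the open-source Hugging Face weights into mcore weights for training, inference, evaluation, and other tasks. Usage is shown below; edit the conversion script to match the TP/PP partitioning strategy and weight paths you actually need: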

    cd MindSpeed-LLM
    bash examples/mindspore/glm4/ckpt_convert_glm4_hf2mcore.sh
    

    After the script runs, log output similar to the following indicates that the weight conversion succeeded:

    successfully saved checkpoint from iteration 1 to ./model_weights/glm4_mcore/
    INFO:root:Done!
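
When the conversion runs unattended, the success check above can be automated. The following is a minimal sketch, assuming the log wording matches the sample shown here (other versions may phrase it differently); `conversion_succeeded` is a hypothetical helper, not part of MindSpeed-LLM.

```python
import re

# Pattern copied from the sample log above; treat the exact wording as an assumption.
SAVED_RE = re.compile(r"successfully saved checkpoint from iteration\s+(\d+) to (\S+)")


def conversion_succeeded(log_text: str) -> bool:
    """Return True if the log has both the save message and a final 'Done!' line."""
    saved = SAVED_RE.search(log_text) is not None
    done = any(line.strip().endswith("Done!") for line in log_text.splitlines())
    return saved and done


sample_log = """\
successfully saved checkpoint from iteration 1 to ./model_weights/glm4_mcore/
INFO:root:Done!
"""
print(conversion_succeeded(sample_log))
```

In practice the log text would come from a file captured while running the conversion script, for example by redirecting its output.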
    

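Note:

  • By default, the MindSpore backend performs weight conversion on the device side, which carries a risk of OOM for large models. It is therefore recommended to edit convert_ckpt.py manually and add the following code at the package imports so that the conversion runs on the CPU: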
import mindspore as ms
ms.set_context(device_target="CPU", pynative_synchronize=True)
import torch
torch.configs.set_pyboost(False)
  • Model weights converted by the MindSpore backend cannot be used for training or inference on the Torch backend.

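Data preprocessing

The MindSpore backend now fully supports MindSpeed-LLM data preprocessing for its various task scenarios. For the data preprocessing guide, see Data Preprocessing.

Taking pretraining as an example, configure the data input/output paths and the tokenizer model path in the pretraining data preprocessing script data_convert_glm4_pretrain.sh, then launch it: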

bash examples/mindspore/glm4/data_convert_glm4_pretrain.sh

The processed pretraining dataset consists of the following files:

./dataset/alpaca_text_document.bin
./dataset/alpaca_text_document.idx
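
Before starting training, it can be worth confirming that both output files exist. The following is a minimal sketch, assuming the output prefix shown above; `missing_dataset_files` is a hypothetical helper introduced only for illustration.

```python
from pathlib import Path


def missing_dataset_files(prefix: str) -> list:
    """Return the .bin/.idx files that do not exist for the given output prefix."""
    expected = [Path(prefix + ext) for ext in (".bin", ".idx")]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Prefix taken from the example output above; adjust it to your own output path.
    missing = missing_dataset_files("./dataset/alpaca_text_document")
    print("missing files:", missing if missing else "none")
```

An empty result means the .bin and .idx pair is present under that prefix.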

Training

Pretraining

Launch pretraining as follows:

cd MindSpeed-LLM
bash examples/mindspore/glm4/pretrain_glm4_8k_ms.sh

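Modify the following variables in the script to match your environment:

Variable        Meaning
MASTER_ADDR     IP address of the master node in a multi-node run
NODE_RANK       Index of each node in a multi-node run
CKPT_SAVE_DIR   Path where weights are saved during training
DATA_PATH       Path to the preprocessed data
TOKENIZER_PATH  GLM4-9B tokenizer directory
CKPT_LOAD_DIR   Path to the weights saved by weight conversion, used to load the initial weights; if no initial weights are given, the model is randomly initialized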
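
Before launching, it can help to review what a script currently assigns to these variables. The following is a minimal sketch, assuming the variables are set as plain `NAME=value` lines (an assumption about the script layout); `read_assignments` is a hypothetical helper, and the script excerpt uses placeholder values that are not taken from the real script.

```python
import re
from typing import Dict, List, Optional

# Variable names taken from the table above.
PRETRAIN_VARS = ["MASTER_ADDR", "NODE_RANK", "CKPT_SAVE_DIR",
                 "DATA_PATH", "TOKENIZER_PATH", "CKPT_LOAD_DIR"]

ASSIGN_RE = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def read_assignments(script_text: str, names: List[str]) -> Dict[str, Optional[str]]:
    """Map each name to the value of its first NAME=value line, or None if absent."""
    found: Dict[str, Optional[str]] = {name: None for name in names}
    for line in script_text.splitlines():
        match = ASSIGN_RE.match(line)
        if match and match.group(1) in found and found[match.group(1)] is None:
            # Strip surrounding whitespace and quotes; trailing comments are not handled.
            found[match.group(1)] = match.group(2).strip().strip("\"'")
    return found


# Hypothetical script excerpt with placeholder values.
example = '''
MASTER_ADDR=localhost
NODE_RANK=0
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_PATH="your tokenizer path"
'''

for name, value in read_assignments(example, PRETRAIN_VARS).items():
    print(f"{name:15} {value if value is not None else '<not set>'}")
```

To inspect a real script, read its contents with `open(path).read()` and pass the text in place of `example`.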

Fine-tuning

Fine-tuning is launched in much the same way as pretraining.

cd MindSpeed-LLM
bash examples/mindspore/glm4/tune_glm4_9b_32k_full_ms.sh

As with pretraining, modify the same variables in the script to match your environment.

Inference

Launch inference as follows:

# Using the 9B model as an example
cd MindSpeed-LLM
bash examples/mindspore/glm4/generate_glm4_9b_ms.sh

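Modify the following variables in the script to match your environment:

Variable        Meaning
MASTER_ADDR     IP address of the master node in a multi-node run
NODE_RANK       Index of each node in a multi-node run
CHECKPOINT      Path to the weights saved by training
TOKENIZER_PATH  GLM4-9B tokenizer directory

Evaluation

Launch evaluation as follows: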

# Using the 9B model as an example
cd MindSpeed-LLM
bash examples/mindspore/glm4/evaluate_glm4_9b_ms.sh

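Modify the following variables in the script to match your environment:

Variable        Meaning
MASTER_ADDR     IP address of the master node in a multi-node run
NODE_RANK       Index of each node in a multi-node run
TOKENIZER_PATH  GLM4-9B tokenizer directory
CKPT_LOAD_DIR   Path to the weights saved by weight conversion, or to the weights saved by training
DATA_PATH       Path to the evaluation dataset; MMLU is currently recommended
TASK            Dataset used for evaluation; MMLU is currently recommended

Evaluation with MMLU produces results like the following: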

INFO:mindspeed_llm.tasks.evaluation.eval_impl.mmlu_eval:mmlu acc = 321/387=0.8294573643410853
total: 100%|█████████████████████████████████████████████████████| 3/3 [06:16<00:00, 128.12s/it]INFO:main:
                         subject   question_n   acc
0                      sociology          201   0.915423
1   high_school_european_history          165   0.884848
2                      astronomy          152   0.763158
3                      nutrition          306   0.781046
...
57                         total        14042   0.747543
INFO:main:MMLU Running Time:, 4332.02222272872925
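
To compare runs, the accuracy figures can be pulled out of a log like the one above. The following is a minimal sketch, assuming the log keeps the line formats shown here; `parse_mmlu_log` and both regular expressions are hypothetical and not part of MindSpeed-LLM.

```python
import re

# Patterns written against the sample log above; treat the formats as assumptions.
ACC_RE = re.compile(r"mmlu acc = (\d+)/(\d+)=([0-9.]+)")
ROW_RE = re.compile(r"^\s*\d+\s+(\S+)\s+(\d+)\s+([0-9.]+)\s*$")


def parse_mmlu_log(log_text):
    """Return (accuracy records, per-subject rows) found in the log text."""
    # Each record is (correct, total, reported accuracy) from an "mmlu acc" line.
    accs = [(int(c), int(n), float(a)) for c, n, a in ACC_RE.findall(log_text)]
    rows = {}
    for line in log_text.splitlines():
        match = ROW_RE.match(line)
        if match:
            # subject -> (number of questions, accuracy)
            rows[match.group(1)] = (int(match.group(2)), float(match.group(3)))
    return accs, rows


sample = """\
INFO:mindspeed_llm.tasks.evaluation.eval_impl.mmlu_eval:mmlu acc = 321/387=0.8294573643410853
0                      sociology          201   0.915423
57                         total        14042   0.747543
"""
accs, rows = parse_mmlu_log(sample)
correct, count, reported = accs[0]
# Sanity check: the reported ratio should match correct / count.
print(abs(correct / count - reported) < 1e-9)
print(rows["total"])
```

The `total` row carries the overall question count and accuracy, so it is usually the value to track across checkpoints.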