
Qwen3_VL User Guide

Version Notes

Reference Implementation

url=https://github.com/huggingface/transformers.git
commit_id=c0dbe09

Change Log

2025.09.28: Initial support for the Qwen3-VL model.


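Environment Installation

1. Environment Preparation

[Use the matching environment versions recommended for model development]

Follow the installation guide to install the Ascend software.

Python 3.10 is recommended, along with version 2.7.1 of torch and torch_npu.

‼️ The MoE acceleration features depend on newer versions of torch_npu and CANN; the following versions are recommended.

2. Environment Setup

⚠️ If you have previously used other MindSpeed-MM models, we strongly recommend switching to a new working directory and creating a new Conda environment, to avoid risks from mismatched versions of third-party libraries.

Clone the MindSpeed MM repository and enter its root directory: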

git clone --branch 26.0.0 https://gitcode.com/Ascend/MindSpeed-MM.git
cd MindSpeed-MM
bash scripts/install.sh --megatron --msid 96bc0a3bf3398bf45ac26e0bded95ee174ac449b && pip install -r examples/qwen3vl/requirements.txt

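Weight Download and Conversion

1. Weight Download

Download the corresponding model weights from Hugging Face:

Save the downloaded weights locally under ckpt/hf_path/Qwen3-VL-*B-Instruct (where * stands for the model size).

If the model is initialized with FSDP2 meta init, first convert the weights as follows: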

mm-convert Qwen3VLConverter hf_to_dcp \
  --hf_dir Qwen3-VL-xxB \
  --dcp_dir Qwen3-VL-xxB-dcp

# Directory structure after conversion:
# ———— Qwen3-VL-xxB-dcp
#   |—— release
#   |—— latest_checkpointed_iteration.txt

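Then set init_model_with_meta_device to true under gpt_args in examples/qwen3vl/qwen3vl_full_sft_xxB.yaml, and change MM_MODEL_LOAD_PATH in the same YAML to the converted DCP weight path (point it at the parent directory of the release folder, for example Qwen3-VL-xxB-dcp).

Note: Qwen3VL-30B and Qwen3VL-235B must load weights through meta init; init_model_with_meta_device is enabled by default in the repository.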


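Dataset Preparation and Processing

1. Dataset Download (using COCO2017 as an example)

(1) Download the COCO2017 dataset yourself and extract it to ./data/COCO2017 under the project directory.

(2) Obtain the description file for the image dataset (LLaVA-Instruct-150K) and download it to ./data/.

(3) Run the data conversion script python examples/qwen2vl/llava_instruct_2_mllm_demo_format.py. After conversion, the data directory should look like this: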

$playground
├── data
    ├── COCO2017
        ├── train2017

    ├── llava_instruct_150k.json
    ├── mllm_format_llava_instruct_data.json
    ...

Multiple datasets can be read by separating their paths with , (do not add spaces). To configure this, change the DATASET_PATH parameter in qwen3vl_full_sft_xxB.yaml from ./data/mllm_format_llava_instruct_data.json to ./data/mllm_format_llava_instruct_data.json,./data/mllm_format_llava_instruct_data2.json
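As a small illustration of the comma-separated format, the following Python sketch joins several dataset paths into one DATASET_PATH value. The helper name join_dataset_paths is hypothetical and not part of MindSpeed-MM; only the rule "comma-separated, no spaces" comes from this guide.

```python
def join_dataset_paths(paths):
    """Join dataset JSON paths with commas and no surrounding spaces."""
    cleaned = [p.strip() for p in paths]
    for p in cleaned:
        if not p:
            raise ValueError("empty dataset path")
        if "," in p:
            raise ValueError(f"path must not contain a comma: {p!r}")
    return ",".join(cleaned)


dataset_path = join_dataset_paths([
    "./data/mllm_format_llava_instruct_data.json",
    " ./data/mllm_format_llava_instruct_data2.json",  # stray space is stripped
])
print(dataset_path)
```

The resulting string can be pasted as the DATASET_PATH value in the YAML file.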

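Also note the data->dataset_param->basic_parameters->max_samples setting in qwen3vl_full_sft_xxB.yaml: it caps the data read at max_samples samples, which is useful for quick functional checks. For full training, remove this parameter so that all data is read.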

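2. Text-only or mixed image/non-image training data (using LLaVA-Instruct-150K as an example)

The framework now supports text-only data and mixed data (training on samples with and without images together).

When constructing the data, keep the image key for samples that contain an image.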

{
  "id": your_id,
  "image": your_image_path,
  "conversations": [
      {"from": "human", "value": your_query},
      {"from": "gpt", "value": your_response},
  ],
}

When constructing the data, the image key can be omitted for text-only samples.

{
  "id": your_id,
  "conversations": [
      {"from": "human", "value": your_query},
      {"from": "gpt", "value": your_response},
  ],
}
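The two record shapes above can be produced programmatically. The following is a minimal Python sketch, assuming only the keys shown in this guide (id, image, conversations); the helper name make_record and all sample values are hypothetical.

```python
import json


def make_record(sample_id, query, response, image_path=None):
    """Build one record; the "image" key is added only when an image is given."""
    record = {
        "id": sample_id,
        "conversations": [
            {"from": "human", "value": query},
            {"from": "gpt", "value": response},
        ],
    }
    if image_path is not None:
        record["image"] = image_path
    return record


records = [
    # Sample with an image: the "image" key is kept.
    make_record("sample_0", "What is in the picture?", "A cat on a sofa.",
                image_path="COCO2017/train2017/example.jpg"),
    # Text-only sample: the "image" key is omitted.
    make_record("sample_1", "What is 2 + 3?", "5"),
]

# Serialize as a JSON list; ensure_ascii=False keeps non-ASCII text readable.
serialized = json.dumps(records, ensure_ascii=False, indent=2)
print(serialized)
```

Unlike the hand-written snippets above, json.dumps emits no trailing commas, so the output is strictly valid JSON.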

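Fine-tuning

1. Preparation

Before configuring the scripts, complete the prerequisites: Environment Installation, Weight Download and Conversion, and Dataset Preparation and Processing. See the corresponding sections for details.

2. Parameter Configuration

[Model type] The non-thinking model is fine-tuned by default. To fine-tune the Qwen3-VL-thinking model, set template to qwen3_vl and enable_thinking to true in qwen3vl_full_sft_xxB.yaml.

[Data directory]

Adjust the dataset paths in qwen3vl_full_sft_xxB.yaml to match your setup, including the model_name_or_path, dataset_dir, and dataset fields.

Example: if the data and its JSON file are both under /home/user/data/, with the JSON file at /home/user/data/video_data_path.json, configure dataset_dir as /home/user/data/ and dataset as ./data/video_data_path.json. Note that dataset must be a relative path here.

Taking Qwen3VL-xxB as an example, modify qwen3vl_full_sft_xxB.yaml as follows. Note that model_name_or_path points to the weights before conversion, that is, the original HF weight path.

Note: on multi-node setups, do not point cache_dir at the same mounted directory on every node, to avoid conflicts from writing to the same file.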

HF_MODEL_LOAD_PATH: &HF_MODEL_LOAD_PATH ./ckpt/hf_path/Qwen3-VL-8B-Instruct
DATASET_PATH: &DATASET_PATH ./data/mllm_format_llava_instruct_data.json
data:
  dataset_param:
    dataset_type: huggingface
    preprocess_parameters:
      model_name_or_path: *HF_MODEL_LOAD_PATH

    basic_parameters:
      dataset_dir: ./data
      dataset: *DATASET_PATH
      cache_dir: ./data/cache_dir

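To load large volumes of data, use streaming: modify the sampler_type field and add the streaming field in qwen3vl_full_sft_xxB.yaml. (Note: with streaming enabled, only num_workers=0 is currently supported, so data is processed in a single process; performance may fluctuate, and resuming training from a checkpoint is not supported.)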

data:
  dataset_param:
    basic_parameters:
      streaming: true
  dataloader_param:
      sampler_type: stateful_distributed_sampler

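[Module freezing]

Freezing is currently supported for the vision encoder, vision projector, text decoder, and lm head modules. The vision encoder and vision projector are frozen by default during training.

To change whether a module is frozen, set the freeze field of that module under the model field in qwen3vl_full_sft_xxB.yaml.

[MoE acceleration]

Enabling MoE fusion can improve training performance. To enable it, set the use_npu_fused_moe field to true in qwen3vl_full_sft_xxB.yaml.

Note: the FusedMoE feature depends on newer software versions; for download links and installation instructions, see the Environment Preparation section.

[MoE expert parallelism]

Enabling MoE expert parallelism can effectively lower peak memory. Expert parallelism currently requires MoE fused acceleration to be enabled first, that is, use_npu_fused_moe set to true in qwen3vl_full_sft_xxB.yaml. To enable expert parallelism, set expert_parallel_size > 1 in fsdp2_config.yaml, for example: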

expert_parallel_size: 16

Note: the number of experts in the model must be divisible by the expert-parallel size.
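The divisibility constraint can be checked before launching a job. The sketch below reads the constraint as "experts are split evenly across expert-parallel ranks"; the function name and the expert count of 128 are illustrative assumptions, so take the real expert count from your model's configuration.

```python
def experts_per_rank(num_experts: int, expert_parallel_size: int) -> int:
    """Return how many experts each expert-parallel rank holds.

    Raises ValueError when the experts cannot be split evenly.
    """
    if expert_parallel_size < 1:
        raise ValueError("expert_parallel_size must be >= 1")
    if num_experts % expert_parallel_size != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by "
            f"expert_parallel_size={expert_parallel_size}"
        )
    return num_experts // expert_parallel_size


# Hypothetical model with 128 experts and expert_parallel_size: 16.
print(experts_per_rank(128, 16))
```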

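[Sequence parallelism]

Ulysses sequence parallelism is supported. When training on long sequences, enable the CP (context parallel) feature by setting context_parallel_size > 1 in qwen3vl_full_sft_xxB.yaml, for example: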

gpt_args:
  context_parallel_size: 4

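[Attention]

  • Whether to compute AttnMask: configure this through the is_causal field in qwen3vl_full_sft_xxB.yaml, which controls whether a causal mask is used. When set to true, attention is computed with a causal mask; when set to false, a full attention mask is created. Enabling it is recommended for long sequences to save device memory.

  • attn_implementation and layout: the vision and text modules can use different attention implementations, selected through the attn_implementation field in qwen3vl_full_sft_xxB.yaml. Current support is shown in the table below.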

    | Module | Supported FA and layout | Supported CP types |
    | --- | --- | --- |
    | ViT | flash_attention_2: TND | ulysses, ring, usp |
    | ViT | flash_attention_2: BNSD | ulysses |
    | ViT | sdpa: BNSD | ulysses |
    | LLM | flash_attention_2: TND | ulysses |
    | LLM | flash_attention_2: BNSD | ulysses, ring, usp |
    | LLM | flash_attention_2: BSND | ulysses |
    | LLM | sdpa: BNSD | ulysses |

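[synchronize_per_layer] When training with FSDP2, device memory may not be released in time, leading to OOM. Enabling synchronize_per_layer forces a synchronization at every transformer layer, which mitigates delayed memory release caused by multi-stream reuse and reduces memory usage somewhat. To enable it, set the synchronize_per_layer field in qwen3vl_full_sft_xxB.yaml; it is currently set to true by default.

[activation_offload] activation_offload moves the activations saved at checkpoint points during recomputation to the host and transfers them back to the device asynchronously during the backward pass, reducing device activation memory. To enable it, set the activation_offload field to True in qwen3vl_full_sft_xxB.yaml.

[FSDP2 offload_to_cpu] Setting offload_to_cpu to True in fsdp2_config.yaml offloads parameters, gradients, and optimizer states to CPU memory, further reducing device memory usage. Training becomes slower, however, so enabling it is not recommended when device memory is sufficient. For a description of the feature, see docs/zh/features/fsdp2.md. When this feature is enabled, also set distributed_backend: npu:hccl,cpu:gloo under gpt_args in qwen3vl_full_sft_xxB.yaml to enable the dual communication backends.

[chunkloss] See the chunk loss documentation.

[Load-balancing loss] The coefficient of the expert load-balancing aux_loss in MoE models is configurable through router_aux_loss_coef in qwen3vl_full_sft_xxB.yaml. The default is 0.0, meaning the loss is not computed.

[Model saving, loading, and logging]

Configure the parameters in qwen3vl_full_sft_xxB.yaml as needed, including the load path, the save path, and the save interval save_interval (note: distributed optimizer checkpoints are large and slow to save, so choose the save interval carefully).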

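# Load path for the converted DCP weights or for checkpoint-resume weights
MM_MODEL_LOAD_PATH: &MM_MODEL_LOAD_PATH ./ckpt/save_dir/Qwen3-VL-xxB-Instruct
SAVE_PATH: &SAVE_PATH save_dir
gpt_args:
  ## training:
  no_load_optim: true  # do not load optimizer state; remove this line to load it
  no_load_rng: true  # do not load RNG state; remove this line to load it
  no_save_optim: true  # do not save optimizer state; remove this line to save it
  no_save_rng: true  # do not save RNG state; remove this line to save it

  ## save_and_logging:
  log_interval: 1  # logging interval
  save_interval: 10000   # save interval
  save: *SAVE_PATH  # save path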

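Set the init_from_hf_path parameter in qwen3vl_full_sft_xxB.yaml as needed; it specifies the path from which the initial weights are loaded. Set the image_encoder.vision_encoder.freeze, image_encoder.vision_projector.freeze, and text_decoder.freeze parameters in qwen3vl_full_sft_xxB.yaml as needed; they control whether the vision model, projector, and language model modules are frozen, respectively. Note: the layer counts in qwen3vl_full_sft_xxB.yaml are currently unvalidated and have no effect; to reduce the number of layers, edit the relevant configuration files under the original HF path.

[Single-node run]

Configure the parameters in examples/qwen3vl/finetune_qwen3vl_xxB.sh as follows:

# Adjust the ascend-toolkit path to match your installation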
source /usr/local/Ascend/cann/set_env.sh
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=29501
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE * $NNODES))
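The WORLD_SIZE arithmetic in the script above can be mirrored in Python to sanity-check a planned layout. This is a sketch only: the function name node_layout is hypothetical, and the assumption that global ranks are numbered contiguously node by node is common for PyTorch launchers but is not stated in this guide.

```python
def node_layout(npus_per_node: int, nnodes: int, node_rank: int):
    """Return (world_size, global ranks hosted on the given node)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    # Same arithmetic as WORLD_SIZE=$(($NPUS_PER_NODE * $NNODES))
    world_size = npus_per_node * nnodes
    # Assumed contiguous rank numbering per node.
    first_rank = node_rank * npus_per_node
    return world_size, list(range(first_rank, first_rank + npus_per_node))


world_size, ranks = node_layout(npus_per_node=8, nnodes=1, node_rank=0)
print(world_size, ranks)
```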

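[LoRA fine-tuning (optional)]

LoRA is a general framework capability. LoRA fine-tuning of the language module is currently supported for the 30B model; for parameter descriptions, see the LoRA feature documentation.

For LoRA fine-tuning, first convert the original weights as follows: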

mm-convert Qwen3VLConverter hf_to_dcp \
  --hf_dir Qwen3-VL-30B-A3B-Instruct \
  --dcp_dir Qwen3-VL-30B-A3B-Instruct-dcp \
  --is_lora_base true

# Directory structure after conversion:
# ———— Qwen3-VL-30B-A3B-Instruct-dcp
#   |—— release
#   |—— latest_checkpointed_iteration.txt

To load pretrained LoRA weights, first convert the LoRA weights as follows:

mm-convert Qwen3VLConverter lora_hf_to_dcp \
  --hf_dir Qwen3-VL-30B-A3B-Instruct-lora \
  --dcp_dir Qwen3-VL-30B-A3B-Instruct-lora-dcp

# Directory structure after conversion:
# ———— Qwen3-VL-30B-A3B-Instruct-lora-dcp
#   |—— release
#   |—— latest_checkpointed_iteration.txt

Then add the path to the pretrained LoRA weights in examples/qwen3vl/qwen3vl_lora_sft_30B.yaml, changing the relevant settings as follows:

MM_MODEL_LOAD_PATH: &MM_MODEL_LOAD_PATH ./ckpt/mm_path/Qwen3-VL-30B-A3B-Instruct
LORA_MODEL_LOAD_PATH: &LORA_MODEL_LOAD_PATH ./ckpt/mm_path/Qwen3-VL-30B-A3B-Instruct-lora

...
# Replace the original load: *MM_MODEL_LOAD_PATH with load_base_model: *MM_MODEL_LOAD_PATH
load: *LORA_MODEL_LOAD_PATH
load_base_model: *MM_MODEL_LOAD_PATH
...

Run the following command to start LoRA fine-tuning:

bash examples/qwen3vl/finetune_lora_qwen3vl_30B.sh

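3. Launch Fine-tuning

Taking Qwen3VL-xxB as an example, launch the fine-tuning job. Different loss calculation methods affect training results differently. Before launching, read the loss calculation document vlm_model_loss_calculate_type.md and choose a suitable method; the loss_type field in qwen3vl_full_sft_xxB.yaml switches between the loss calculation methods.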

bash examples/qwen3vl/finetune_qwen3vl_xxB.sh

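Optimization features:

  • ChunkLoss: see the ChunkLoss documentation to enable this feature and reduce device memory usage for long sequences.

4. Launch Inference

After training, taking Qwen3VL-xxB as an example, convert the weights saved under save_dir to Hugging Face format: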

mm-convert Qwen3VLConverter dcp_to_hf \
  --load_dir save_dir/iter_000xx/ \
  --save_dir save_dir/iter_000xx_hf/ \
  --model_assets_dir ./ckpt/Qwen3-VL-xxB-Instruct \
  --to_bf16 False

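Here, iter_000xx denotes the weights saved at step xx, --save_dir is the output path for the converted weights, --model_assets_dir is the path to the original Hugging Face weights, and --to_bf16 controls whether the weights are converted from fp32 to bf16.

Once the weights are converted, follow the tutorial below to run inference with the transformers library.

This script is only a convenience tool for checking training results; inference performance is not guaranteed.
Usage:
1. Set MODEL_PATH, MODEL_TYPE, and DATA_JSON_PATH to your own paths.
2. cd into the MindSpeed-MM directory.
3. source your CANN path.
4. You must specify which devices to use with export ASCEND_RT_VISIBLE_DEVICES; otherwise multiple devices may not be detected automatically, which can cause OOM.
5. Run python examples/qwen3vl/inference_demo.py

[Multi-node run]

To launch multi-node training, modify the MASTER_ADDR, NODE_ADDR, NODES, and NODE_RANK variables in the launch script.

MASTER_ADDR: IP address of the master node
NODE_ADDR: IP address of the local node
NODE_RANK: index of this node
NODES: total number of nodes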

PMCC(Privacy and Model Confidential Computing)

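PMCC is a privacy computing solution provided by Ascend that protects model weights and data privacy during model training. To enable PMCC when fine-tuning the Qwen3VL-32B model, first install the PMCC components in the Ascend AI software stack.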

pip install ai_asset_obfuscate
pip install opencv-python
pip install pandas==2.3.3

Run PMCC weight encryption and data preprocessing encryption with the following commands:

# Encrypt the HF model weights
python mindspeed_mm/tools/pmcc/pmcc_qwen3vl.py \
    --obf-type model \
    --hf-model-path "/data/ckpt/Qwen3-VL-32B-Instruct/" \
    --obf-seed "22222222222222222222222222222222" \
    --model-save-path "/data/pmcc/obf_hf_ckpt/" \
    --device-id 0 1 2 3 4 5 6 7

# Encrypt the dataset
python mindspeed_mm/tools/pmcc/pmcc_qwen3vl.py \
    --obf-type data \
    --hf-model-path "/data/ckpt/Qwen3-VL-32B-Instruct/" \
    --obf-seed "22222222222222222222222222222222" \
    --src-json-path "/data/dataset/llava_instruct_150k.json" \
    --src-img-dir "/data/dataset/COCO2017/train2017" \
    --obf-json-path "/data/pmcc/obf_json_2000.json" \
    --obf-img-dir "/data/pmcc/obf_images" \
    --data-limit 2000

# Convert the encrypted HF model weights to DCP format
mm-convert Qwen3VLConverter hf_to_dcp \
    --hf_dir /data/pmcc/obf_hf_ckpt \
    --dcp_dir /data/pmcc/obf_dcp_ckpt

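After encrypting the model and data and converting the encrypted HF weights to DCP format, set HF_MODEL_LOAD_PATH, MM_MODEL_LOAD_PATH, DATASET_PATH, and DATASET_DIR in qwen3vl_full_sft_32B.yaml to the encrypted HF weight path, the DCP weight path, the encrypted dataset JSON path, and the dataset directory path, respectively, and set use_pmcc_data to true to enable PMCC data loading.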


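Environment Variables

| Environment variable | Description | Values |
| --- | --- | --- |
| ASCEND_SLOG_PRINT_TO_STDOUT | Whether to print logs to the screen | 0: disable printing logs to the screen<br>1: enable printing logs to the screen |
| ASCEND_GLOBAL_LOG_LEVEL | Sets the log level for application logs and for each module; only debug logs are supported | 0: DEBUG level<br>1: INFO level<br>2: WARNING level<br>3: ERROR level<br>4: NULL level, no log output |
| TASK_QUEUE_ENABLE | Controls the optimization level of the task_queue operator dispatch queue | 0: disabled<br>1: Level 1 optimization<br>2: Level 2 optimization |
| COMBINED_ENABLE | Sets the combined flag, which optimizes scenarios where two non-contiguous operators are combined | 0: disabled<br>1: enabled |
| CPU_AFFINITY_CONF | Controls processor affinity for CPU-side operator tasks, that is, core binding for tasks | 0 or unset: core binding disabled<br>1: coarse-grained core binding<br>2: fine-grained core binding |
| HCCL_CONNECT_TIMEOUT | Limits the timeout for socket link establishment between devices | Integer in the range [120, 7200]; default 120; unit: s |
| PYTORCH_NPU_ALLOC_CONF | Controls the behavior of the caching allocator | expandable_segments:<value>: enables the memory pool expandable-segments feature, that is, the virtual memory feature |
| HCCL_EXEC_TIMEOUT | Controls how long devices wait for synchronization during execution; within this period, each device process waits for the other devices to perform communication synchronization | Integer in the range [68, 17340]; default 1800; unit: s |
| ACLNN_CACHE_LIMIT | Number of operator-information entries cached on the host side by single-operator execution APIs | Integer in the range [1, 10,000,000]; default 10000 |
| TOKENIZERS_PARALLELISM | Controls the behavior of tokenizers in the Hugging Face transformers library in multi-threaded environments | False: disable parallel tokenization<br>True: enable parallel tokenization |
| MULTI_STREAM_MEMORY_REUSE | Whether multi-stream memory reuse is enabled | 0: disable multi-stream memory reuse<br>1: enable multi-stream memory reuse |
| NPU_ASD_ENABLE | Controls the feature-value detection function of Ascend Extension for PyTorch | 0 or unset: detection disabled<br>1: detection enabled; only abnormal logs are printed, no alarm<br>2: detection enabled, with alarms<br>3: detection enabled, with alarms, and process data is recorded in device-side info-level logs |
| ASCEND_LAUNCH_BLOCKING | Controls whether operators run in synchronous mode | 0: asynchronous execution<br>1: force operators to run synchronously |
| NPUS_PER_NODE | Number of NPUs used on one compute node | Integer (for example 1 or 8) |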
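The integer ranges listed above can be checked before a job is launched. The following Python sketch validates the three range-limited variables; the helper name validate_int_env and the dictionary are illustrative and not part of any Ascend tooling, while the bounds are copied from the table.

```python
# Inclusive bounds taken from the environment variable table.
INT_RANGES = {
    "HCCL_CONNECT_TIMEOUT": (120, 7200),    # seconds
    "HCCL_EXEC_TIMEOUT": (68, 17340),       # seconds
    "ACLNN_CACHE_LIMIT": (1, 10_000_000),   # cached entries
}


def validate_int_env(name: str, raw_value: str) -> int:
    """Parse raw_value as an integer and check it against the documented range."""
    low, high = INT_RANGES[name]
    value = int(raw_value)  # raises ValueError for non-integer input
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return value


print(validate_int_env("HCCL_CONNECT_TIMEOUT", "1200"))
```

In practice the raw value would come from os.environ.get(name); it is passed in explicitly here so the check can be exercised without touching the environment.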