
Qwen3_VL User Guide

Version Notes

Reference Implementation

url=https://github.com/huggingface/transformers.git
commit_id=c0dbe09

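Change Log

2025.09.28: Initial support for the Qwen3-VL model

Environment Installation

1. Environment Preparation

[Use the matching environment versions when developing models]

Refer to the installation guide to install the Ascend software.

Python 3.10 is recommended, along with version 2.7.1 of both torch and torch_npu.

‼️ The MoE acceleration features depend on recent versions of torch_npu and CANN; the following versions are recommended

2. Environment Setup

Clone the MindSpeed MM repository and change into its root directory: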

git clone https://gitcode.com/Ascend/MindSpeed-MM.git
cd MindSpeed-MM

On x86 machines, run:

bash scripts/install.sh --arch x86 --msid d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab && pip install -r examples/qwen3vl/requirements.txt

On ARM machines, run:

bash scripts/install.sh --arch arm --msid d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab && pip install -r examples/qwen3vl/requirements.txt
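
The two install commands differ only in the value passed to --arch. As a minimal sketch, the choice could be derived from the host architecture string that Python's platform.machine() reports. The helper name install_command is hypothetical and not part of MindSpeed MM; it only builds the command text and does not run anything.

```python
import platform

# Value passed to --msid, copied from the install commands in this guide.
MSID = "d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab"


def install_command(machine: str) -> str:
    """Return the install command line for a platform.machine() string.

    Hypothetical helper for illustration only.
    """
    m = machine.lower()
    if m in ("x86_64", "amd64"):
        arch = "x86"
    elif m in ("aarch64", "arm64"):
        arch = "arm"
    else:
        raise ValueError(f"unsupported architecture: {machine}")
    return (
        f"bash scripts/install.sh --arch {arch} --msid {MSID} "
        "&& pip install -r examples/qwen3vl/requirements.txt"
    )
```

Calling install_command(platform.machine()) on the target host yields the matching command.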

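Weight Download and Conversion

1. Weight Download

Download the model weights from the Hugging Face Hub:

Model: Qwen3-VL-*B

Save the downloaded weights to the local directory ckpt/hf_path/Qwen3-VL-*B-Instruct (* stands for the model size).

If the model is initialized with FSDP2 meta init, first convert the weights as follows: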

mm-convert Qwen3VLConverter hf_to_dcp \
  --hf_dir Qwen3-VL-xxB \
  --dcp_dir Qwen3-VL-xxB-dcp

# Directory structure after conversion:
# ———— Qwen3-VL-xxB-dcp
#   |—— release
#   |—— latest_checkpointed_iteration.txt

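Then add the --init-model-with-meta-device argument to GPT_ARGS in examples/qwen3vl/finetune_qwen3vl_xxB.sh, and set LOAD_PATH in that script to the converted DCP weight path (the parent directory of the release folder, e.g. Qwen3-VL-xxB-dcp).

Note: the Qwen3VL-30B and Qwen3VL-235B models must load their weights with meta init.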
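
Before pointing LOAD_PATH at a converted directory, it can help to confirm that it has the layout shown above. The following is a minimal sketch that checks only for the two entries listed in that layout; the function name check_dcp_dir is hypothetical, and the check does not inspect the checkpoint contents.

```python
from pathlib import Path


def check_dcp_dir(dcp_dir: str) -> list[str]:
    """Return a list of problems found in a converted DCP directory.

    Only the two entries shown in the layout above are checked.
    """
    root = Path(dcp_dir)
    problems = []
    if not (root / "release").is_dir():
        problems.append("missing 'release' directory")
    if not (root / "latest_checkpointed_iteration.txt").is_file():
        problems.append("missing 'latest_checkpointed_iteration.txt'")
    return problems
```

An empty list means both entries exist, so the directory itself (not its release subfolder) is the value to use for LOAD_PATH.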

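Dataset Preparation and Processing

1. Dataset Download (using the COCO2017 dataset as an example)

(1) Download the COCO2017 dataset yourself and extract it to the ./data/COCO2017 folder under the project directory.

(2) Obtain the annotation file for the image dataset (LLaVA-Instruct-150K) and download it to ./data/.

(3) Run the data conversion script python examples/qwen2vl/llava_instruct_2_mllm_demo_format.py. The resulting reference directory layout is: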

$playground
├── data
    ├── COCO2017
        ├── train2017

    ├── llava_instruct_150k.json
    ├── mllm_format_llava_instruct_data.json
    ...

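Multiple datasets separated by , (without spaces) can be read. To configure this, change dataset_param->basic_parameters->dataset in data_xxB.json from "./data/mllm_format_llava_instruct_data.json" to "./data/mllm_format_llava_instruct_data.json,./data/mllm_format_llava_instruct_data2.json".

Also note dataset_param->basic_parameters->max_samples in data_xxB.json: it limits loading to max_samples samples, which is useful for quick functional verification. For a real training run, remove this parameter so that all data is read.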
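
The two settings above can be applied programmatically. This is a minimal sketch that works on a dict loaded from data_xxB.json; the function name set_datasets is hypothetical, and only the dataset_param -> basic_parameters keys named in this section are touched.

```python
import copy


def set_datasets(config: dict, paths: list[str], keep_max_samples: bool = False) -> dict:
    """Return a copy of a data_xxB.json-style config with the dataset list updated."""
    if not paths:
        raise ValueError("at least one dataset path is required")
    for p in paths:
        # A comma or surrounding whitespace would corrupt the comma-separated list.
        if "," in p or p != p.strip():
            raise ValueError(f"invalid dataset path: {p!r}")
    new_cfg = copy.deepcopy(config)
    basic = new_cfg["dataset_param"]["basic_parameters"]
    basic["dataset"] = ",".join(paths)  # comma-separated, no spaces
    if not keep_max_samples:
        basic.pop("max_samples", None)  # without max_samples, all data is read
    return new_cfg
```

The input dict is left unchanged, so the result can be written to a new JSON file with json.dump while the original file stays as a reference.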

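2. Text-only or mixed training data with and without images (using LLaVA-Instruct-150K as an example)

The framework now supports text-only data and mixed data (samples with images and samples without images trained together).

When constructing the data, samples that contain an image must keep the image key.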

{
  "id": "your_id",
  "image": "your_image_path",
  "conversations": [
      {"from": "human", "value": "your_query"},
      {"from": "gpt", "value": "your_response"}
  ]
}

For text-only samples, the image key can be omitted.

{
  "id": "your_id",
  "conversations": [
      {"from": "human", "value": "your_query"},
      {"from": "gpt", "value": "your_response"}
  ]
}
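
When preparing a mixed dataset, a quick pass over the samples can show how many carry an image. The sketch below assumes only the structure shown in the two examples above (an optional image key and a conversations list whose turns have from and value keys); classify_sample is a hypothetical helper, not a framework function.

```python
def classify_sample(sample: dict) -> str:
    """Return "image" or "text" for one sample; raise ValueError if malformed."""
    conversations = sample.get("conversations")
    if not conversations:
        raise ValueError("sample needs a non-empty 'conversations' list")
    for turn in conversations:
        if "from" not in turn or "value" not in turn:
            raise ValueError(f"malformed turn: {turn!r}")
    # The presence of the "image" key is what marks an image sample.
    return "image" if "image" in sample else "text"
```

Counting the returned labels over a dataset gives the image/text ratio of the mix.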

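Fine-tuning

1. Preparation

Before configuring the scripts, complete the prerequisites: environment installation, weight download and conversion, and dataset preparation and processing. See the corresponding sections for details.

2. Parameter Configuration

[Data directory configuration]

Modify the dataset paths in data_xxB.json according to your setup, including the model_name_or_path, dataset_dir, and dataset fields.

Example: if the data and its JSON file are both under /home/user/data/, and the JSON file is /home/user/data/video_data_path.json, set dataset_dir to /home/user/data/ and dataset to ./data/video_data_path.json. Note that dataset must be a relative path in this case.

Taking Qwen3VL-xxB as an example, modify data_xxB.json as follows. Note that model_name_or_path is the weight path before conversion, i.e. the original HF weight path.

Note: in multi-node training, do not point cache_dir on different nodes to the same mounted directory, to avoid conflicts from writing to the same file.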

{
    "dataset_param": {
        "dataset_type": "huggingface",
        "preprocess_parameters": {
            "model_name_or_path": "./ckpt/hf_path/Qwen3-VL-xxB-Instruct",
            ...
        },
        "basic_parameters": {
            ...
            "dataset_dir": "./data",
            "dataset": "./data/mllm_format_llava_instruct_data.json",
            "cache_dir": "./data/cache_dir",
            ...
        },
        ...
    },
    ...
}

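To load large volumes of data, use streaming loading: modify the sampler_type field in data_xxB.json and add the streaming field. (Note: with streaming loading, only num_workers=0 is currently supported, so data is processed in a single process; performance may fluctuate, and resuming training from a checkpoint is not supported.)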

{
    "dataset_param": {
        ...
        "basic_parameters": {
            ...
            "streaming": true
            ...
        },
        ...
    },
    "dataloader_param": {
        ...
        "sampler_type": "stateful_distributed_sampler",
        ...
    }
}

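[Module freezing configuration]

Freezing is currently supported for the vision encoder, vision projector, text decoder, and LM head modules. The vision encoder and vision projector are frozen by default during training.

Set the freeze field of each module in the model.json file to control whether that module is frozen.

[MoE acceleration configuration]

Enabling fused MoE can improve training performance. To enable it, set the use_npu_fused_moe field in the model.json file to true.

Note: the FusedMoE feature depends on recent software versions; install the environment as described in the Environment Installation section.

[Sequence parallel configuration]

Ulysses sequence parallelism is supported. For long-sequence training, enable the CP feature by setting CP > 1 in the training bash script, for example: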

CP=4
GPT_ARGS="
    ...
    --context-parallel-size ${CP} \
    ...
"

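[Attention configuration]

Attention mask computation: configured via the is_causal field in the model.json file, which controls whether a causal mask is used. When set to true, attention is computed with a causal mask; when set to false, a full attention mask is created. Setting it to true is recommended for long sequences to save memory.

attn_implementation and layout: the vision and text modules can use different attention implementations, selected via the attn_implementation field in the model.json file. Current support is listed in the table below.

Module	Supported FA and layout	Supported CP types
ViT	`flash_attention_2`: `TND`	ulysses, ring, usp
ViT	`flash_attention_2`: `BNSD`	ulysses
ViT	`sdpa`: `BNSD`	ulysses
LLM	`flash_attention_2`: `TND`	/
LLM	`flash_attention_2`: `BNSD`	ulysses, ring, usp
LLM	`flash_attention_2`: `BSND`	ulysses
LLM	`sdpa`: `BNSD`	ulysses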

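[synchronize_per_layer configuration] When training with FSDP2, memory may not be released in time, which can cause OOM. Enabling synchronize_per_layer forces a synchronization at every transformer layer, which mitigates the delayed memory release caused by multi-stream reuse and reduces memory usage somewhat. To enable it, set the synchronize_per_layer field in the model.json file; it is currently set to true by default.

[activation_offload configuration] activation_offload moves the activations saved at checkpoint points during recomputation to the host and transfers them back from host to device asynchronously during the backward pass, reducing activation memory on the device. To configure it, set the activation_offload field in model_xxB.json to true.

[chunkloss configuration] See the chunk loss documentation.

[Load-balancing loss configuration] The coefficient of the aux_loss used for expert load balancing in MoE models can be customized via router_aux_loss_coef in model_xxB.json. The default is 0.0, which means the loss is not computed. (Note: enabling aux_loss together with --calculate-per-token-loss is currently not supported.)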
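
The merge request that introduced this option describes the combined loss as lm_loss + router_aux_loss_coef * aux_loss, with the coefficient placed under a "loss_cfg" entry, e.g. "loss_cfg": {"router_aux_loss_coef": 0.1}. The sketch below illustrates only that arithmetic with plain floats; the function name total_loss is hypothetical and this is not the framework's implementation.

```python
def total_loss(lm_loss: float, aux_loss: float, router_aux_loss_coef: float = 0.0) -> float:
    """Combine the language-model loss with the router auxiliary loss.

    With the default coefficient of 0.0 the auxiliary loss is ignored.
    """
    if router_aux_loss_coef == 0.0:
        return lm_loss
    return lm_loss + router_aux_loss_coef * aux_loss
```

With lm_loss = 2.0, aux_loss = 0.5 and a coefficient of 0.1, the auxiliary term contributes 0.05 to the total.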

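[Model save/load and logging configuration]

Configure the parameters in examples/qwen3vl/finetune_qwen3vl_xxB.sh according to your setup, including the load path, the save path, and the save interval --save-interval (note: the files saved by the distributed optimizer are large and take a long time to write, so choose the save interval carefully).

...
# Load path for the converted DCP weights or for resuming from a checkpoint
LOAD_PATH="./ckpt/save_dir/Qwen3-VL-xxB-Instruct"
# Save path
SAVE_PATH="save_dir"
...
# --no-load-optim / --no-load-rng: do not load optimizer / RNG state; remove to load it
# --no-save-optim / --no-save-rng: do not save optimizer / RNG state; remove to save it
GPT_ARGS="
    ...
    --no-load-optim \
    --no-load-rng \
    --no-save-optim \
    --no-save-rng \
    ...
"
...
# --log-interval: logging interval; --save-interval: save interval; --save: save path
OUTPUT_ARGS="
    --log-interval 1 \
    --save-interval 5000 \
    --save $SAVE_PATH \
"

Set the init_from_hf_path parameter in examples/qwen3vl/model_xxB.json according to your setup; it specifies the path from which the initial weights are loaded. Set the image_encoder.vision_encoder.freeze, image_encoder.vision_projector.freeze, and text_decoder.freeze parameters in the same file as needed; they control whether the vision model, projector, and language model modules are frozen, respectively. Note: the layer counts in examples/qwen3vl/model_xxB.json are currently unvalidated and have no effect; to reduce the number of layers, modify the relevant configuration files under the original HF weight path.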
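
The dotted parameter names above suggest a nested JSON layout. As a minimal sketch under that assumption (that each dot corresponds to one level of nesting in model_xxB.json), a freeze flag could be changed on a loaded dict as follows; set_by_path is a hypothetical helper.

```python
def set_by_path(config: dict, dotted_key: str, value) -> None:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c", in place.

    Raises KeyError if any part of the path is missing, so a typo in the
    key cannot silently create a new, unused field.
    """
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]
    if leaf not in node:
        raise KeyError(dotted_key)
    node[leaf] = value
```

For example, set_by_path(cfg, "image_encoder.vision_projector.freeze", False) would unfreeze the projector while leaving the other modules as configured.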

[Single-node configuration]

Configure the parameters in examples/qwen3vl/finetune_qwen3vl_xxB.sh as follows:

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=29501
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE * $NNODES))
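
The WORLD_SIZE line is plain multiplication of devices per node by node count. The sketch below repeats that arithmetic and, as an assumption not stated in this guide, shows the global ranks a node would own if ranks are assigned node by node (the usual layout for torchrun-style launchers). Both function names are hypothetical.

```python
def world_size(npus_per_node: int, nnodes: int) -> int:
    """Total number of processes: one per NPU on every node."""
    return npus_per_node * nnodes


def node_ranks(node_rank: int, npus_per_node: int) -> range:
    """Global ranks on one node, assuming node-major rank assignment."""
    start = node_rank * npus_per_node
    return range(start, start + npus_per_node)
```

With the single-node values above (NPUS_PER_NODE=8, NNODES=1), world_size returns 8 and node 0 owns ranks 0 through 7.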

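3. Start Fine-tuning

Taking Qwen3VL-xxB as an example, start the fine-tuning job.
The way the loss is calculated affects training results. Before starting a training job, read the loss calculation document vlm_model_loss_calculate_type.md and choose a suitable calculation method.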

bash examples/qwen3vl/finetune_qwen3vl_xxB.sh

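Optimization features:

ChunkLoss: see the ChunkLoss documentation to enable this feature, which reduces memory usage for long sequences.

4. Start Inference

After training, taking Qwen3VL-xxB as an example, convert the weights saved in the save_dir directory to Hugging Face format: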

mm-convert Qwen3VLConverter dcp_to_hf \
  --load_dir save_dir/iter_000xx/ \
  --save_dir save_dir/iter_000xx_hf/ \
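  --model_assets_dir ./ckpt/Qwen3-VL-xxB-Instruct

Here iter_000xx denotes the weights saved at step xx, --save_dir is the path where the converted weights are written, and --model_assets_dir is the path to the original Hugging Face weights.

After the weights are converted, inference can be run with the transformers library.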
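
A minimal inference sketch with transformers might look like the following. It assumes an installed transformers version that includes Qwen3-VL support, and that the converted directory (for example save_dir/iter_000xx_hf/) contains the processor and tokenizer files as well as the weights; if it does not, load the processor from the original Hugging Face weight directory instead. The function names build_messages and generate are hypothetical, and device placement is left at the library defaults.

```python
def build_messages(image_path: str, prompt: str) -> list[dict]:
    """Build a single-turn chat message with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def generate(model_dir: str, image_path: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Run one generation with converted weights (needs torch and transformers)."""
    # Imported lazily so build_messages stays usable without these packages.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(model_dir)
    processor = AutoProcessor.from_pretrained(model_dir)
    inputs = processor.apply_chat_template(
        build_messages(image_path, prompt),
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only newly generated text is decoded.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

A call such as generate("save_dir/iter_000xx_hf/", "./data/COCO2017/train2017/<image>.jpg", "Describe this image.") returns the decoded reply as a string; the image path here is a placeholder.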

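[Multi-node configuration]

To launch multi-node training, modify the MASTER_ADDR, NODE_ADDR, NODES, and NODE_RANK variables in the launch script.

MASTER_ADDR: IP address of the master node
NODE_ADDR: IP address of the local machine
NODE_RANK: index of this node
NODES: total number of nodes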

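Environment Variables

Environment variable	Description	Values
`ASCEND_SLOG_PRINT_TO_STDOUT`	Whether logs are printed to the screen	`0`: disable printing logs to the screen; `1`: enable printing logs to the screen
`ASCEND_GLOBAL_LOG_LEVEL`	Sets the log level for application logs and the per-module log levels; applies to debug logs only	`0`: DEBUG; `1`: INFO; `2`: WARNING; `3`: ERROR; `4`: NULL, no log output
`TASK_QUEUE_ENABLE`	Controls the optimization level of the task_queue operator dispatch queue	`0`: disabled; `1`: Level 1 optimization; `2`: Level 2 optimization
`COMBINED_ENABLE`	Sets the combined flag. 0 disables the feature; 1 enables it, to optimize scenarios in which two non-contiguous operators are combined	`0`: disabled; `1`: enabled
`CPU_AFFINITY_CONF`	Controls processor affinity for CPU-side operator tasks, i.e. core binding	`0` or unset: core binding disabled; `1`: coarse-grained core binding; `2`: fine-grained core binding
`HCCL_CONNECT_TIMEOUT`	Limits the timeout for establishing socket connections between devices	Integer in `[120,7200]`, default `120`, unit `s`
`PYTORCH_NPU_ALLOC_CONF`	Controls the behavior of the caching allocator	`expandable_segments:<value>`: enables expandable segments in the memory pool, i.e. the virtual memory feature
`HCCL_EXEC_TIMEOUT`	Controls how long devices wait for synchronization during execution; within this period, each device process waits for the other devices to perform communication synchronization	Integer in `[68,17340]`, default `1800`, unit `s`
`ACLNN_CACHE_LIMIT`	Number of operator information entries cached on the host side by the single-operator execution API	Integer in `[1, 10,000,000]`, default `10000`
`TOKENIZERS_PARALLELISM`	Controls the behavior of the tokenizer in the Hugging Face transformers library in multi-threaded environments	`False`: disable parallel tokenization; `True`: enable parallel tokenization
`MULTI_STREAM_MEMORY_REUSE`	Whether multi-stream memory reuse is enabled	`0`: disable multi-stream memory reuse; `1`: enable multi-stream memory reuse
`NPU_ASD_ENABLE`	Controls whether the feature value detection function of Ascend Extension for PyTorch is enabled	`0` or unset: detection disabled; `1`: detection enabled, only abnormal logs are printed, no alarm; `2`: detection enabled with alarm; `3`: detection enabled with alarm, and process data is recorded in device-side info-level logs
`ASCEND_LAUNCH_BLOCKING`	Controls whether operators run in synchronous mode	`0`: asynchronous execution; `1`: force operators to run synchronously
`NPUS_PER_NODE`	Number of NPUs used on one compute node	Integer (e.g. `1`, `8`)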
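
Three of the variables above have documented integer ranges, which makes them easy to check before a job is launched. The sketch below transcribes those ranges from the table; INT_RANGES and check_env are hypothetical names, and variables that are unset are skipped on the understanding that the documented default then applies.

```python
import os

# Allowed integer ranges (inclusive), transcribed from the table above.
INT_RANGES = {
    "HCCL_CONNECT_TIMEOUT": (120, 7200),
    "HCCL_EXEC_TIMEOUT": (68, 17340),
    "ACLNN_CACHE_LIMIT": (1, 10_000_000),
}


def check_env(env=None) -> list[str]:
    """Return a message for every range-checked variable with an invalid value."""
    env = os.environ if env is None else env
    problems = []
    for name, (low, high) in INT_RANGES.items():
        raw = env.get(name)
        if raw is None:
            continue  # unset: nothing to validate
        try:
            value = int(raw)
        except ValueError:
            problems.append(f"{name}={raw!r} is not an integer")
            continue
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems
```

Calling check_env() with no argument inspects the current process environment; passing a dict allows a planned configuration to be checked without exporting it first.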