Qwen3_5 User Guide

Version Notes

Reference Implementation

url=https://github.com/huggingface/transformers.git
commit_id=fc91372

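Change Log

2026.02.10: Initial support for the Qwen3_5 model

Environment Installation

1. Environment Preparation

[Using the matching environment versions is recommended for model development]

Refer to the installation guide to install the Ascend software stack.

Python 3.10 is recommended, along with torch and TorchNPU 2.7.1 and CANN 8.5.2.

‼️ The acceleration features for the MoE part depend on newer TorchNPU and CANN releases; the following versions are recommended

2. Environment Setup

Clone the MindSpeed MM repository and enter its root directory: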

git clone https://gitcode.com/Ascend/MindSpeed-MM.git
cd MindSpeed-MM

Run the following command for a one-step installation:

bash scripts/install.sh --msbranch master && bash examples/qwen3_5/install_extensions.sh

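3. Install the Matching Version of Triton-Ascend

To install the matching version of Triton-Ascend, see the "Installing Triton-Ascend via pip" section of the Triton-Ascend documentation for the installation command.

The following command can be used as a reference:

# Note: with triton-ascend 3.2.0 and earlier, Triton-Ascend and Triton cannot coexist. Uninstall the community Triton first, then install Triton-Ascend.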
pip install triton-ascend==3.2.1 --extra-index-url=https://triton-ascend.osinfra.cn/pypi/simple

4. Install fla-npu for AscendC Support

Clone the flash-linear-attention-npu repository, enter its root directory, and check out the corresponding commit ID:

git clone https://github.com/flashserve/flash-linear-attention-npu
cd flash-linear-attention-npu
git checkout c2e3d83f

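Installation steps: see the README of the fla-npu repository (flash-linear-attention-npu).

The following installation commands are recommended:

# Source the environment script from your actual CANN path
source /usr/local/Ascend/cann/set_env.sh

# Build the operator run package; --soc must match the chip type of the current machine {ascend910b/ascend910_93/ascend950}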
bash build.sh --soc=ascend910b --pkg --vendor_name=fla_npu
bash build_out/fla-npu-*.run
cd torch_custom/fla_npu/
bash build.sh

Verify that fla_npu was installed successfully:

pip list | grep fla_npu
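
The same check can be scripted, for example as a preflight step before launching training. The sketch below uses only the Python standard library; the helper name installed_version is hypothetical, and it assumes the build registers a distribution named fla_npu (the name grepped for above).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the one shown by pip list.
    version = installed_version("fla_npu")
    print("fla_npu:", version if version else "not installed")
```

A return value of None means the package is not visible to the current Python environment, which usually points to a different virtual environment or a failed build.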

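Weight Download and Conversion

1. Weight Download

Download the corresponding model weights from Hugging Face:

Note

If you cannot reach the Hugging Face Hub, downloading from ModelScope is recommended; verify the correctness and security of the downloaded files.

Model address: Qwen3.5-*B

Save the downloaded model weights to the local directory ckpt/hf_path/xxxxxxx. (* denotes the model size.)

If you initialize the model with FSDP2 meta init, or if an MoE model needs MTP support, first run the following weight conversion according to the model configuration: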

mm-convert Qwen35Converter hf_to_dcp \
--hf_dir ckpt/hf_path/xxxxxxx \
--dcp_dir ckpt/dcp_path/xxxxxxx \
--num_workers 0

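# Where:
# hf_dir: Hugging Face weight directory
# dcp_dir: directory in which the converted DCP-format weights are saved
# num_workers: number of parallel worker threads; 0 means serial execution. If storage I/O performance allows, increase it to speed up conversion; 4 is recommended

# Directory structure after conversion: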
# ———— xxxxxxx
#   |—— release
#   |—— latest_checkpointed_iteration.txt
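
To confirm that a conversion produced the layout shown above, a small check like the following can help. It is a minimal sketch using only the standard library; the function name check_dcp_layout is hypothetical, and it assumes release is a directory and latest_checkpointed_iteration.txt is a regular file.

```python
from pathlib import Path


def check_dcp_layout(dcp_dir):
    """Return the names of expected entries that are missing from dcp_dir."""
    root = Path(dcp_dir)
    missing = []
    # Assumption: "release" is a folder holding the converted weights.
    if not (root / "release").is_dir():
        missing.append("release")
    if not (root / "latest_checkpointed_iteration.txt").is_file():
        missing.append("latest_checkpointed_iteration.txt")
    return missing
```

An empty list means both entries are present, so the directory passed as --dcp_dir has the expected shape.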

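Then set init_model_with_meta_device to True in xxx_config.yaml and point the load parameter to the converted DCP weight path (the parent directory of the release folder). Note: if the MoE model does not need MTP support, you can set mtp_num_hidden_layers to 0 in ckpt/hf_path/xxxxxxx/config.json before running mm-convert. This skips merging the MTP expert weights and shortens the conversion, for example by about 5 minutes for the 397B model.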
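
The config.json edit described in the note can be scripted. The sketch below is one way to do it with the standard library; the function name is hypothetical, and it assumes mtp_num_hidden_layers is a top-level key of config.json. If your checkpoint nests it under another section, adjust the lookup accordingly, and keep a backup of the original file.

```python
import json
from pathlib import Path


def disable_mtp_in_hf_config(config_path):
    """Set mtp_num_hidden_layers to 0 and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the key lives at the top level of config.json.
    previous = config.get("mtp_num_hidden_layers")
    config["mtp_num_hidden_layers"] = 0
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return previous
```

The returned value is the setting that was in place before the edit, which makes it easy to restore later.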

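MindSpeed MM also saves weights in DCP format. Use the following command to convert DCP weights back to HF weights:

# Example directory structure of the DCP weights to be converted: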
# ———— xxxxxxx
#   |—— release
#   |—— latest_checkpointed_iteration.txt

mm-convert Qwen35Converter dcp_to_hf \
--save_hf_dir ckpt/save_hf_path/Qwen3.5-xxB-hf-save \
--dcp_dir ./save_path/iter_000xx \
--origin_hf_dir ckpt/hf_path/Qwen3.5-xxB \
--to_bf16 false \
--num_workers 0

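# Where:
# save_hf_dir: directory in which the converted Hugging Face-format weights are saved
# dcp_dir: directory of the saved DCP-format weights; `iter_000xx` denotes the weights saved at step xx
# origin_hf_dir: directory of the original Hugging Face-format weights
# to_bf16: whether to convert the weight data type from fp32 to bf16
# num_workers: number of parallel worker threads; 0 means serial execution. If storage I/O performance allows, increase it to speed up conversion; 4 is recommended

Note: if MTP is not enabled for the model (that is, mtp_num_layers under model in xxx_config.yaml is set to 0 or not set), the converted weights do not include the MTP layer weights by default. Set --keep_origin_mtp_weights true to keep them.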

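Dataset Preparation and Processing

Training with a real dataset: see "Data Construction for VL Models · Using a Real Dataset" (download COCO2017 → download the LLaVA-Instruct-150K annotations → run the conversion script to generate mllm_format_llava_instruct_data.json).
Functional or performance testing with synthetic data: see "Data Construction for VL Models · Using Synthetic Data".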

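Fine-tuning

1. Preparation

Before configuring the scripts, complete the prerequisites: environment installation, weight download and conversion, and dataset preparation and processing. See the corresponding sections for details.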

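2. Parameter Configuration

[Data directory configuration]

Modify the dataset paths in xxx_config.yaml to match your setup, including the model_name_or_path, dataset_dir, and dataset fields.

Example: if the data and its JSON file are both under /home/user/data/, with the JSON file at /home/user/data/video_data_path.json, configure the fields as follows:
- dataset_dir: /home/user/data/
- dataset: ./data/video_data_path.json
Note that dataset must be set to a relative path in this case.
Note: for multi-node runs, do not point cache_dir at the same mounted directory on every node, to avoid conflicts caused by writing to the same file.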

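[Module freezing configuration]

Custom module freezing is supported: list the modules to freeze in the model->freeze field of xxx_config.yaml.

[Model saving, loading, and logging configuration]

Configure the training parameters in xxx_config.yaml as needed, including the save path (save) and the save interval (save_interval). Also configure the init_from_hf_path parameter in xxx_config.yaml, which specifies the path from which the initial weights are loaded.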

[Ulysses-CP parallel configuration]

Set ulysses_parallel_size in xxx_config.yaml to adjust the degree of Ulysses-CP parallelism. (Ulysses-CP is disabled when ulysses_parallel_size is 1.)

Note: when Ulysses-CP is enabled, set attn_implementation to flash_attention_2 in xxx_config.yaml.

[EP parallel configuration]

Set expert_parallel_size in xxx_config.yaml as needed (it takes effect only for MoE models).

Choose ep_plan.dispatcher based on expert_parallel_size: allgather is recommended when expert_parallel_size is less than topk, and alltoall when expert_parallel_size is greater than topk.
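
The recommendation above can be written as a small helper. This is a sketch, not code from MindSpeed MM: the function name recommend_dispatcher is hypothetical, and because the guide gives no recommendation for expert_parallel_size equal to topk, returning alltoall in that case is an assumption.

```python
def recommend_dispatcher(expert_parallel_size, topk):
    """Suggest an ep_plan.dispatcher value from the EP size and router top-k."""
    if expert_parallel_size < 1 or topk < 1:
        raise ValueError("expert_parallel_size and topk must be positive")
    if expert_parallel_size < topk:
        return "allgather"
    # Assumption: the equal case is not covered by the guide; it is
    # treated like the "greater than" case here.
    return "alltoall"
```

For example, with topk of 8, an expert_parallel_size of 4 yields allgather and an expert_parallel_size of 16 yields alltoall. Treat the result as a starting point and confirm it with a throughput measurement on your own cluster.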

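[MoE aux loss configuration]

For MoE models, to add router_aux_loss on top of the cross-entropy loss so that expert load becomes more balanced during training, set the features.loss_cfg.router_aux_loss_coef field in xxx_config.yaml. This field is the coefficient of the load-balancing loss.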
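
As a rough illustration of what the coefficient does, the sketch below adds the weighted auxiliary term to the cross-entropy loss. The function name is hypothetical and the plain weighted sum is an assumption; the actual reduction and normalization inside MindSpeed MM may differ.

```python
def loss_with_router_aux(ce_loss, router_aux_loss, router_aux_loss_coef):
    """Cross-entropy loss plus the weighted load-balancing term (assumed form)."""
    return ce_loss + router_aux_loss_coef * router_aux_loss
```

A larger coefficient pushes the router harder toward balanced expert usage, at the cost of weighting the language-modeling objective relatively less.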

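[MTP configuration] The model supports an MTP module: set mtp_num_layers under model in xxx_config.yaml to 1 to enable it (the default is 0). The mtp_loss_scaling_factor field is also configurable (the default is 0.1). Note: Qwen3.5 currently supports only one MTP layer.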

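[Performance optimization configuration]

Recomputation
- Configured via features.recompute; true enables it and false disables it. It is enabled by default.
- Enabling it reduces device memory usage.
chunkloss
- Configured via features.enable_chunk_loss; true enables it and false disables it.
- features.chunkloss_plan.chunk_size sets the block size: the loss is computed over blocks of chunk_size along the sequence dimension.
- Enabling it greatly reduces the memory spike during loss computation and lowers overall memory usage.
async activation offload
- Configured via features.enable_activation_offload; true enables it and false disables it.
- When enabled, the activations at the recomputation entry points are offloaded asynchronously to the host, which saves further device memory when recomputation is enabled.
chunkmbs
- Configured via features.enable_chunk_mbs; true enables it and false disables it.
- features.chunkmbs_plan.chunk_mbs is the micro_batch_size of each computation after splitting.
- This feature requires recomputation and async activation offload to be enabled as well. It increases the compute density per FSDP2 unshard and improves end-to-end throughput.
Selective recomputation
- With recomputation enabled, you can skip the GDN recomputation in linear attention layers or the flash attention recomputation in full attention layers, and asynchronously offload the intermediate tensors that are saved instead. This reduces computation and raises training throughput without changing device memory usage.
- model.skip_gdn_recompute controls whether the GDN recomputation in linear attention layers is skipped; true skips it and false does not.
- model.skip_flash_attn_recompute controls whether the flash attention recomputation in full attention layers is skipped; true skips it and false does not.
- This feature requires recomputation and async activation offload to be enabled as well.
gdn_implementation and causal_conv1d_implementation
- gdn_implementation and causal_conv1d_implementation each support eager, triton, and ascendc. ascendc gives the best performance and requires the fla_npu library.
- When gdn_implementation is set to ascendc, causal_conv1d_implementation supports only triton and ascendc, to prevent layout mismatches between operators.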
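
Several of the options above depend on one another, so it can be useful to check a parsed configuration before launching. The sketch below takes the configuration as a nested dict; the function name is hypothetical, placing gdn_implementation and causal_conv1d_implementation under model is an assumption, and so are the False defaults for options whose defaults the guide does not state.

```python
def check_feature_dependencies(config):
    """Return a list of problems; an empty list means none were found."""
    features = config.get("features", {})
    model = config.get("model", {})
    problems = []

    # Recomputation is enabled by default per the guide; the other
    # defaults (False) are assumptions made for this sketch.
    recompute = features.get("recompute", True)
    offload = features.get("enable_activation_offload", False)
    prerequisites_met = bool(recompute and offload)

    dependents = [
        ("features.enable_chunk_mbs", features.get("enable_chunk_mbs", False)),
        ("model.skip_gdn_recompute", model.get("skip_gdn_recompute", False)),
        ("model.skip_flash_attn_recompute",
         model.get("skip_flash_attn_recompute", False)),
    ]
    for name, enabled in dependents:
        if enabled and not prerequisites_met:
            problems.append(
                name + " requires features.recompute and "
                "features.enable_activation_offload"
            )

    # Assumption: both implementation switches live under "model".
    # An unset causal_conv1d_implementation is not flagged.
    gdn = model.get("gdn_implementation")
    conv = model.get("causal_conv1d_implementation")
    if gdn == "ascendc" and conv is not None and conv not in ("triton", "ascendc"):
        problems.append(
            "gdn_implementation=ascendc requires "
            "causal_conv1d_implementation to be triton or ascendc"
        )
    return problems
```

Load xxx_config.yaml with your YAML parser of choice and pass the resulting dict; each returned string names one violated dependency.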

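[Single-node configuration] Taking the qwen3_5 model as an example, configure the parameters in examples/qwen3_5/finetune_qwen3_5.sh as follows:

# Adjust the ascend-toolkit path to match your installation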
source /usr/local/Ascend/ascend-toolkit/set_env.sh
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

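[Multi-node configuration] To launch multi-node training, modify the MASTER_ADDR, NODE_ADDR, NNODES, and NODE_RANK variables in the launch script:

MASTER_ADDR: IP address of the master node
NODE_ADDR: IP address of the local node
NODE_RANK: index of this node
NNODES: total number of nodes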
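
To sanity-check these values before starting every node, the arithmetic can be sketched as below. world_size mirrors the WORLD_SIZE formula in the launch script; the contiguous per-node rank numbering in global_ranks is an assumption about how the launcher assigns ranks, and both function names are hypothetical.

```python
def world_size(npus_per_node, nnodes):
    """Total number of processes: NPUS_PER_NODE * NNODES."""
    return npus_per_node * nnodes


def global_ranks(node_rank, npus_per_node, nnodes):
    """Global ranks hosted on one node, assuming contiguous numbering."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("NODE_RANK must be in [0, NNODES)")
    start = node_rank * npus_per_node
    return range(start, start + npus_per_node)
```

With two nodes of 8 NPUs each, the world size is 16 and, under the stated assumption, the node with NODE_RANK=1 hosts ranks 8 through 15. Each node must use a distinct NODE_RANK and the same MASTER_ADDR.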

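3. Launch Fine-tuning

Different loss calculation methods affect training results differently. Before launching a training job, read the loss calculation document vlm_model_loss_calculate_type.md and choose a suitable method. The loss_type described in that document can be set in the model section of xxx_config.yaml.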

bash examples/qwen3_5/finetune_qwen3_5_xxB.sh

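Environment Variables

Environment variable	Description	Values
`ASCEND_SLOG_PRINT_TO_STDOUT`	Whether logs are printed to the screen	`0`: do not print logs to stdout `1`: print logs to stdout
`ASCEND_GLOBAL_LOG_LEVEL`	Sets the log level of application logs and of each module; only debug logs are supported	`0`: DEBUG `1`: INFO `2`: WARNING `3`: ERROR `4`: NULL, no log output
`TASK_QUEUE_ENABLE`	Controls the optimization level of the task_queue operator dispatch queue	`0`: disabled `1`: Level 1 optimization `2`: Level 2 optimization
`COMBINED_ENABLE`	Sets the combined flag, which optimizes scenarios that combine two non-contiguous operators	`0`: disabled `1`: enabled
`CPU_AFFINITY_CONF`	Controls the processor affinity of CPU-side operator tasks, that is, core binding	`0` or unset: core binding disabled `1`: coarse-grained core binding `2`: fine-grained core binding
`HCCL_CONNECT_TIMEOUT`	Limits the timeout for establishing socket links between devices	Integer in the range `[120,7200]`, default `120`, unit `s`
`PYTORCH_NPU_ALLOC_CONF`	Controls the behavior of the caching allocator	`expandable_segments:<value>`: enables the expandable segments feature of the memory pool, that is, the virtual memory feature
`HCCL_EXEC_TIMEOUT`	Controls how long devices wait for synchronization during execution; within this time, each device process waits for the other devices to perform communication synchronization	Integer in the range `[68,17340]`, default `1800`, unit `s`
`ACLNN_CACHE_LIMIT`	Number of operator information entries cached on the host side by the single-operator execution APIs	Integer in the range `[1, 10,000,000]`, default `10000`
`TOKENIZERS_PARALLELISM`	Controls the behavior of the tokenizers in the Hugging Face transformers library in multi-threaded environments	`False`: disable parallel tokenization `True`: enable parallel tokenization
`MULTI_STREAM_MEMORY_REUSE`	Whether multi-stream memory reuse is enabled	`0`: disabled `1`: enabled
`NPU_ASD_ENABLE`	Whether the TorchNPU feature value detection is enabled	`0` or unset: detection disabled `1`: detection enabled, anomalies are logged only, no alarm `2`: detection enabled, with alarms `3`: detection enabled, with alarms, and process data is also recorded in the device-side info-level logs
`ASCEND_LAUNCH_BLOCKING`	Whether operators run in synchronous mode	`0`: asynchronous execution `1`: force operators to run synchronously
`NPUS_PER_NODE`	Number of NPUs used on one compute node	Integer (for example `1` or `8`)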
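
The integer ranges listed in the table can be checked before launch. The sketch below covers only the three variables whose ranges the table states; the function name check_env_ranges is hypothetical, and unset variables are skipped on the assumption that their defaults then apply.

```python
import os

# Inclusive ranges copied from the table above.
INTEGER_RANGES = {
    "HCCL_CONNECT_TIMEOUT": (120, 7200),
    "HCCL_EXEC_TIMEOUT": (68, 17340),
    "ACLNN_CACHE_LIMIT": (1, 10_000_000),
}


def check_env_ranges(env=None):
    """Return messages for range-limited variables that are set incorrectly."""
    env = os.environ if env is None else env
    errors = []
    for name, (low, high) in INTEGER_RANGES.items():
        raw = env.get(name)
        if raw is None:
            continue  # Unset: assumed to fall back to the default.
        try:
            value = int(raw)
        except ValueError:
            errors.append(f"{name}={raw!r} is not an integer")
            continue
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside [{low}, {high}]")
    return errors
```

Calling check_env_ranges() with no argument inspects the current process environment; passing a dict makes the check easy to try out without exporting anything.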

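Notes

While the processor is being loaded, a version compatibility issue in the third-party mistral_common library can prevent the processor from being found, which causes training to fail and exit. Resolve it in either of the following ways:
- Uninstall mistral_common: pip uninstall -y mistral_common
- Upgrade mistral_common to the latest version: pip install --upgrade mistral_common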