torchtitan-npu/torchtitan_npu · CANN/torchtitan-npu - AtomGit

cann-robot: [fix] fix mtp hf mapping strategy for master

File	Latest commit	Last updated
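config	[refactor] breaking update ! refactor config system to adapt torchtitan main branch
Co-authored-by: 1Fire4<wangdingyi2@huawei.com>
# message auto-generated for no-merge-commit merge: !175 merge refactor/adapt_dsv3 into master
Created-by: hitwdy
Commit-by: 1Fire4
Merged-by: cann-robot
Description:

## Description

### 🚨 [Breaking update] Full refactor of the config system and flattening of the model layout

> ⚠️ Note: this is a breaking update. Models that have not yet been adapted to the refactor will not run at all. This PR adapts only DeepSeek-V3; the remaining models will be adapted as soon as possible.

### 💡 Core usage differences (Before vs. After)

This refactor aligns with upstream's latest Config Registry mechanism, which is built on Python dataclasses, and drops the previous rigid TOML configuration files entirely. The most visible user-facing changes are:

| Aspect | Before | After |
|---|---|---|
| Config entry point | Hard dependency on hard-coded TOML files (e.g. `train_configs/.toml`) and a large, flat global `JobConfig` class. | TOML is removed entirely. Configuration lives in Python-native `config_registry.py`; each component carries its own nested Config class, with strict type safety. |
| NPU-specific parameters* | A separate `custom_config.py` had to be maintained, and a runtime `_merge_configs` function forcibly attached its parameters to JobConfig. | Injection through class inheritance. The upstream config classes (e.g. `ParallelismConfig`) are subclassed directly, NPU-specific features (Ulysses CP, the swap optimizer, and so on) are defined as native fields, and the external merge step is gone. |
| Command-line interaction | Passing and overriding command-line arguments relied on custom parsing of flat TOML key/value pairs. | Nested parsing based on `tyro`. Because everything is a dataclass, every NPU-specific parameter can be overridden with a hierarchical command-line argument just like a native one (e.g. `--training.swap_optimizer`). |
| Code layout | Parallelism configuration and model definitions were scattered across deeply nested directories (`model/`, `infra/`), causing heavy dependencies and long import paths. | Directories are fully flattened. Core components such as `model.py`, `parallelize.py`, and `state_dict_adapter.py` are all lifted to the root directory of each model (e.g. `models/deepseek_v3/`). |

---

### 🔧 The five core adaptation and refactoring points in detail

#### 1. Deep integration and simplification of the config system

This is one of the most fundamental changes; it alters how NPU-specific parameters are attached and passed around:

* Drop `custom_config.py`: removes the standalone NPU custom-config logic that lived outside the upstream system.
* Override by subclassing config classes: `torchtitan_npu/config/configs.py` directly subclasses the upstream config classes (e.g. `ParallelismConfig`, `OptimizerConfig`).
* Native CLI exposure: NPU-specific features (e.g. `enable_custom_context_parallel`, `swap_optimizer`, `match_rms_adamw`) are added as new fields on the subclasses. Through the `tyro` parser, NPU-specific parameters can then be changed from the command line exactly like native ones.

#### 2. Directory flattening of the model layout

Following upstream's latest model directory convention, the `deepseek_v3` file tree has been de-nested:

* Remove redundant levels: the deeply nested `model/model.py` moves directly to the model root directory.
* Unify component entry points: the parallelism code `infra/parallelize.py` is lifted to the model root and named `parallelize.py`; the weight adapter `state_dict_adapter.py` is likewise lifted to the first level.
* Clean up obsolete configs: deletes the many redundant, test-only hard-coded TOML files under each model directory (e.g. `train_configs/deepseek_v32_671b_debug.toml`).

#### 3. Refactoring of model-internal logic and classes

* Adopt Dataclass Components: the single, bloated `model_args` structure is dropped and the internal definitions are changed to match the upstream pattern. Each configurable model component gets its own inner `Config` dataclass, which keeps configuration handling cohesive and loosely coupled.

#### 4. Adaptive fixes to the patch layer (Patches & Converters)

Because the config entry point and loading path changed fundamentally, the underlying AST interception and replacement logic was updated to match:

* Dynamic argument unpacking: substantial updates to `converters/framework/model_custom_config_converter.py` and `quant_converter.py` so that they work with the new inheritance-based Config structure.
* Upstream core patch updates: modifies core files under `patches/torchtitan/` (e.g. `expert_parallel.py`, `hf_datasets.py`, `loss.py`).
* New state-stashing mechanism: adds `_trainer_config_stash.py`, which stashes and intercepts the Trainer's runtime environment state more reliably during the NPU training lifecycle.

#### 5. Refresh of the surrounding ecosystem (Tools, Scripts & Docs)

* Entry point and toolchain updates: import paths are replaced throughout the main entry point `torchtitan_npu/train.py` and all surrounding diagnostic tools (`checkpoint_patch.py`, `flight_recorder.py`, `profiling.py`).
* Launch script normalization: `scripts/run_train.sh` and `run_train_multinodes.sh` are refactored to work with the new `tyro`-based command-line argument parsing.

## Type

- [ ] Bug fix
- [ ] New feature
- [x] Refactor (a code change that neither adds a feature nor fixes a bug)
- [x] Change to the build process or auxiliary tools
- [x] Documentation update

## Checklist:

- [x] My code follows this project's code style
- [x] I have tested my code myself
- [x] I have updated the corresponding documentation
- [x] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

This PR depends on python3.11 and specific versions of torch, torchtitan, and torch_npu.

```
# 1. torchtitan
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan
git checkout ac13e536c84e7f6647b14fa9375c3c8a8a2b8578
cd <torchtitan_npu>
ln -s <path>/torchtitan/torchtitan .
# 2. torch
pip3 install --no-cache-dir "https://download-r2.pytorch.org/whl/nightly/cpu/torch-2.12.0.dev20260317%2Bcpu-cp311-cp311-manylinux_2_28_aarch64.whl"
# 3. torch_npu
git clone https://gitcode.com/Ascend/pytorch.git /tmp/pytorch_npu && git -C /tmp/pytorch_npu checkout f9cbf1f179b59e75b915a72cfc3187f0aadfdea3
cd /tmp/pytorch_npu && bash ci/build.sh --python=3.11
pip3 install --upgrade /tmp/pytorch_npu/dist/torch_npu*.whl
```

## Other information

Remaining to-dos: [RFC](https://gitcode.com/cann/torchtitan-npu/issues/21)

See merge request: cann/torchtitan-npu!175	18 days ago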
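The inheritance-based injection described in merge request !175 can be illustrated with a minimal standard-library sketch. It is not the repository's code: the base classes below are stand-ins for the upstream config classes, their fields are hypothetical, the placement of `enable_custom_context_parallel` under the parallelism section is an assumption, and `apply_override` is only a toy substitute for what `tyro` does with hierarchical flags such as `--training.swap_optimizer`.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ParallelismConfig:
    """Stand-in for an upstream config class (fields are hypothetical)."""
    tensor_parallel_degree: int = 1


@dataclass
class TrainingConfig:
    """Stand-in for an upstream config class (fields are hypothetical)."""
    steps: int = 10


@dataclass
class NpuParallelismConfig(ParallelismConfig):
    # NPU-only field added by subclassing; its section is an assumption.
    enable_custom_context_parallel: bool = False


@dataclass
class NpuTrainingConfig(TrainingConfig):
    # Matches the `--training.swap_optimizer` example in the PR text.
    swap_optimizer: bool = False


@dataclass
class RootConfig:
    """Hypothetical top-level container holding the nested sections."""
    parallelism: NpuParallelismConfig = field(default_factory=NpuParallelismConfig)
    training: NpuTrainingConfig = field(default_factory=NpuTrainingConfig)


def apply_override(config, dotted_key: str, raw_value: str) -> None:
    """Set a nested field from a 'section.field' key, coercing by current type."""
    *path, leaf = dotted_key.split(".")
    target = config
    for name in path:
        target = getattr(target, name)
    if leaf not in {f.name for f in fields(target)}:
        raise AttributeError(f"unknown config field: {dotted_key}")
    current = getattr(target, leaf)
    if isinstance(current, bool):
        value = raw_value.lower() in ("1", "true", "yes")
    else:
        value = type(current)(raw_value)
    setattr(target, leaf, value)


cfg = RootConfig()
apply_override(cfg, "training.swap_optimizer", "true")
apply_override(cfg, "parallelism.tensor_parallel_degree", "4")
```

Because the NPU classes are true subclasses, any code that accepts the upstream type keeps working, and no separate merge step is needed to attach the extra fields.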
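converters	【fix】fix deepseep v4 pp
Co-authored-by: zhangwei1177<zhangwei1177@huawei.com>
# message auto-generated for no-merge-commit merge: !261 merge dsv4-pp-master into master
Created-by: zhangwei1177
Commit-by: zhangwei1177
Merged-by: cann-robot
Description:

## Description

This PR mainly adds support for running DeepSeek-V4 with Pipeline Parallel (PP) on NPU. DeepSeek-V4's structure differs from an ordinary LLM: the last stage must keep hc_head -> norm -> output, and intermediate PP stages still need the original input_ids for attention/indexer computation, so the generic pipeline_llm cannot be reused directly.

The PR adds a DeepSeek-V4-specific PP entry point, torchtitan_npu/models/deepseek_v4/pipeline_parallel.py::pipeline_deepseek_v4. This function drives PP stage construction for DeepSeek-V4: it computes the virtual stages via _get_num_virtual_stages, generates the modules each stage should keep via generate_deepseek_v4_fqn_per_model_part, and then checks the split with _validate_deepseek_v4_stage_modules, ensuring that tok_embeddings appears only in the first stage, hc_head/norm/output appear only in the last stage, and every layer appears in order exactly once. It then calls pipeline_module_split to split the model, calls parallelize_fn on each model_part, and finally builds the PP schedule.

To solve the problem that non-first PP stages cannot see the original token ids, pipeline_parallel.py adds _is_deepseek_v4_pp_target and _with_deepseek_v4_pp_input_ids. The former checks whether the current trainer is a DeepSeek-V4 + PP scenario; the latter puts input_ids into extra_kwargs after dataloading. The corresponding patch entry point is torchtitan_npu/patches/torch/pipelining.py::_patch_post_dataloading_process_for_deepseek_v4_pp_input_ids, which only hooks the trainer's post_dataloading_process; the DeepSeek-V4-specific logic stays in the model's own pipeline_parallel.py.

For PP + TP compatibility, the root parallelize logic in torchtitan_npu/models/deepseek_v4/parallelize.py now builds root_parallelize_plan dynamically from the modules the current PP chunk actually holds. That is, RowwiseParallel, SequenceParallel, ColwiseParallel, or hc_head_plan is added only when tok_embeddings, norm, output, or hc_head, respectively, actually exists in the current model_part, so intermediate stages no longer fail because top-level modules are missing. In addition, hc_head_fn/base/scale are moved into the HcHead module and registered in parallelize.py as parameters replicated within the TP group via _register_distributed_parameter(model.hc_head, ..., [Replicate()]), which makes parameter ownership and the PP split clearer.

The model forward pass is adapted accordingly, mainly in torchtitan_npu/models/deepseek_v4/model.py. DeepSeekV4Model.forward now distinguishes the first stage from later stages: if the current stage has tok_embeddings, it computes embeddings from the token ids; if it does not, it treats the input as hidden states passed from the previous stage and obtains the original input_ids from kwargs via _normalize_pp_input_ids. Before producing output, the last stage checks via _validate_last_stage_hc_head that the hc_head modules and parameters are complete.

The PR also fixes how the DSA indexer loss is aggregated under combinations of PP/TP/DP/CP. model.py::DSAIndexerLossLoggingHelper records the per-layer indexer loss; parallelize.py::apply_distributed_indexer_loss_tracking uses compress_ratios == 4 to find the layers that actually have an indexer, has all ranks take part in the all_reduce, and normalizes by world_size / pp and by the number of gradient accumulation steps, so that indexer loss logs are neither dropped nor averaged twice in PP runs.

torchtitan_npu/patches/torch/pipelining.py::_patch_fork_rng_for_npu_pipeline adapts the fork_rng call inside PyTorch pipelining to NPU. When the pipeline passes devices without an explicit device_type, the patch fills in device_type="npu" so that the call does not default to CUDA; if the caller already specified device_type, the patch leaves it alone, which limits the side effects of a global patch.

The PR also changes the ownership of the hc_head parameters. Previously hc_head_fn, hc_head_base, and hc_head_scale were attached to the top level of DeepSeekV4Model; after a PP split they could become separated from the hc_head module that uses them, leaving last-stage parameter ownership unclear. They now live inside torchtitan_npu/models/deepseek_v4/model.py::HcHead, and HcHead.forward uses its own parameters directly. Accordingly, DeepSeekV4Model.init_weights, the weight key mapping in state_dict_adapter.py, and the TP parameter registration in parallelize.py all now access model.hc_head.hc_head_fn/base/scale. With this change, hc_head and the parameters it depends on stay together on the last stage during a PP split, and no extra cleanup of top-level parameters is needed.

Finally, torchtitan_npu/entry.py gains the indexer loss patch entry point for DeepSeek-V4. The logic previously covered only deepseek_v32 and is now extended to model_name in ("deepseek_v32", "deepseek_v4"). When the model is DeepSeek-V4, torchtitan_npu.train::_patch_train_step_for_dsv4_indexer_loss is called so that the training step triggers DeepSeek-V4's DSA indexer loss computation and logging in addition to the main loss; _patch_init_for_dsa_set_loss_scale is then called in both cases so that the indexer loss backward scale matches the main loss scale. Only then does DeepSeek-V4 training actually enable the indexer loss rather than just the ordinary LM loss.

## Type

- [x] Bug fix
- [ ] New feature
- [ ] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [x] My code follows this project's code style
- [x] I have tested my code myself
- [x] I have updated the corresponding documentation
- [x] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

In config_registry.py, set pipeline_parallel_degree to 2 and change pipeline_parallel_schedule to 1F1B.

## Other information

Comparison of results for the deepseek-v4 model on 8 devices with pp=1 versus 16 devices with pp=2:

![compare_pp2_164652_vs_pp1_164122.png](https://raw.gitcode.com/user-images/assets/9028822/993b57cc-cb94-4acc-88ad-80d31f7350ca/compare_pp2_164652_vs_pp1_164122.png 'compare_pp2_164652_vs_pp1_164122.png')

See merge request: cann/torchtitan-npu!261	8 hours ago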
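The stage-split rules stated in merge request !261 (tok_embeddings only on the first stage, hc_head/norm/output only on the last, every layer in order exactly once) can be sketched as a small checker. This is an illustration under assumptions, not the body of _validate_deepseek_v4_stage_modules: the function name, the `layers.<i>` naming of layer modules, and the list-of-lists input format are all hypothetical.

```python
import re

FIRST_ONLY = {"tok_embeddings"}
LAST_ONLY = {"hc_head", "norm", "output"}
_LAYER = re.compile(r"^layers\.(\d+)$")  # assumed naming of layer modules


def validate_stage_modules(stages, num_layers):
    """Raise ValueError if a per-stage module assignment breaks the rules."""
    if not stages:
        raise ValueError("at least one stage is required")
    last = len(stages) - 1
    seen_layers = []
    for idx, names in enumerate(stages):
        for name in names:
            match = _LAYER.match(name)
            if match:
                seen_layers.append(int(match.group(1)))
            elif name in FIRST_ONLY:
                if idx != 0:
                    raise ValueError(f"{name} must be on the first stage")
            elif name in LAST_ONLY:
                if idx != last:
                    raise ValueError(f"{name} must be on the last stage")
            else:
                raise ValueError(f"unknown module name: {name}")
    if not FIRST_ONLY <= set(stages[0]):
        raise ValueError("first stage is missing tok_embeddings")
    if not LAST_ONLY <= set(stages[last]):
        raise ValueError("last stage is missing hc_head/norm/output")
    # Layers must appear in order, with no gaps and no repeats.
    if seen_layers != list(range(num_layers)):
        raise ValueError(f"layers out of order or duplicated: {seen_layers}")


good_split = [
    ["tok_embeddings", "layers.0", "layers.1"],
    ["layers.2", "layers.3", "hc_head", "norm", "output"],
]
validate_stage_modules(good_split, num_layers=4)  # passes silently
```

Checking the flattened layer sequence against `range(num_layers)` covers ordering, gaps, and duplicates in a single comparison.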
distributed	[fix] fix fake backend bug
Co-authored-by: 1Fire4<wangdingyi2@huawei.com>
# message auto-generated for no-merge-commit merge: !262 merge fix/dsv4_fake_backend_ep into master
Created-by: hitwdy
Commit-by: 1Fire4
Merged-by: cann-robot
Description:

## Description

Fixes incorrect handling of the all-to-all split/output shapes for MoE expert parallelism when fake_backend is enabled. Solution: only when dist.get_backend(group) == "fake", bypass the token-count exchange of the fake all-to-all, synthesize a stable layout from the local num_tokens_per_expert, and call _unpermute directly in the combine stage.

## Type

- [x] Bug fix
- [ ] New feature
- [ ] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [x] My code follows this project's code style
- [x] I have tested my code myself
- [x] I have updated the corresponding documentation
- [x] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

COMM_MODE="fake_backend" bash scripts/run_train.sh

- Test result: deepseekv3, v32, and v4 all ran successfully
- v3: loss nan (probably expected with the fake PG; to be checked further)
- v32: loss 12.28
- v4: loss 12.28

See merge request: cann/torchtitan-npu!262	15 hours ago
models	[fix] fix mtp hf mapping strategy for master
Co-authored-by: zhangjianshe<1603088851@qq.com>
# message auto-generated for no-merge-commit merge: !263 merge mtp-master into master
Created-by: zhangjianshe
Commit-by: zhangjianshe
Merged-by: cann-robot
Description:

## Description

- Fixes the HF weight conversion logic for the MTP module of the deepseek_v4 model so that it matches the layout of the official HF weights.

## Type

- [x] Bug fix
- [ ] New feature
- [ ] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [x] My code follows this project's code style
- [x] I have tested my code myself
- [x] I have updated the corresponding documentation
- [x] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

Key counts before and after HF conversion for a reduced 4-layer model plus the MTP layer:

| Module | Sub-module | original hf | saved by titan |
|--|--|--|--|
| embed | embed.weight | 1 | 1 |
| hc_head | hc_head_base/fn/scale | 3 | 3 |
| head | head.weight (lm_head) | 1 | 1 |
| norm | norm.weight | 1 | 1 |
| layers.0-3 | attn.* (wq_a/b, wkv, wo_a/b, q_norm, kv_norm, attn_sink) | 4 each | 4 each |
| layers.0-3 | attn.compressor | 8 | 8 |
| layers.0-3 | attn.indexer | 6 | 6 |
| layers.0-3 | attn_norm.weight | 4 | 4 |
| layers.0-3 | ffn.experts..w1 | 1024 | 1024 |
| layers.0-3 | ffn.experts..w2 | 1024 | 1024 |
| layers.0-3 | ffn.experts..w3 | 1024 | 1024 |
| layers.0-3 | ffn.gate | 8 | 8 |
| layers.0-3 | ffn.shared_experts.w1/w2/w3 | 4 each | 4 each |
| layers.0-3 | ffn_norm.weight | 4 | 4 |
| layers.0-3 | hc_attn_base/fn/scale | 4 each | 4 each |
| layers.0-3 | hc_ffn_base/fn/scale | 4 each | 4 each |
| mtp.0 | attn. (wq_a/b, wkv, wo_a/b, q_norm, kv_norm, attn_sink) | 1 each | 1 each |
| mtp.0 | attn_norm | 1 | 1 |
| mtp.0 | ffn.experts..w1/w2/w3 | 256 each | 256 each |
| mtp.0 | ffn.gate | 2 | 2 |
| mtp.0 | ffn.shared_experts.w1/w2/w3 | 1 each | 1 each |
| mtp.0 | ffn_norm / norm / hnorm | 1 each | 1 each |
| mtp.0 | e_proj / emb.tok_emb / enorm / h_proj | 1 each | 1 each |
| mtp.0 | hc_attn_ / hc_ffn_* / hc_head_* | 1 each | 1 each |
| mtp.0 | head | 1 | 1 |

## Other information

Any other notes related to this Pull Request can be added here.

See merge request: cann/torchtitan-npu!263	6 hours ago
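A key-count comparison like the one reported in merge request !263 can be reproduced with a few lines of standard-library Python. The sketch below is not the script used for that PR: the bucketing rule (all `layers.<i>` keys together, each `mtp.<i>` block separately, everything else by its first name component) and the sample keys are assumptions chosen for illustration.

```python
import re
from collections import Counter

_BLOCK = re.compile(r"^(layers|mtp)\.(\d+)\.")


def module_group(key: str) -> str:
    """Map a parameter key to a coarse module bucket."""
    match = _BLOCK.match(key)
    if match:
        kind, index = match.groups()
        # All transformer layers share one bucket; MTP blocks stay separate.
        return "layers" if kind == "layers" else f"mtp.{index}"
    return key.split(".", 1)[0]


def count_by_module(keys) -> Counter:
    return Counter(module_group(k) for k in keys)


def diff_counts(original_keys, saved_keys) -> dict:
    """Return {bucket: (original_count, saved_count)} for mismatched buckets."""
    a, b = count_by_module(original_keys), count_by_module(saved_keys)
    return {g: (a[g], b[g]) for g in sorted(set(a) | set(b)) if a[g] != b[g]}


# Hypothetical sample keys, only to show the call pattern.
original = ["embed.weight", "layers.0.attn_norm.weight", "mtp.0.head.weight"]
saved_ok = list(original)
saved_missing_mtp = original[:2]

mismatch_ok = diff_counts(original, saved_ok)
mismatch_bad = diff_counts(original, saved_missing_mtp)
```

An empty result means every bucket has the same number of keys on both sides; it does not prove the keys or tensor shapes are identical, so a name-level or shape-level check is still worthwhile.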
ops	[fix] Adapt cann 9.0.0 + triton_ascend 3.2.1 + torch 2.12.0 + torch_npu 2.12.0rc1 , fix triton ascend slice extension
Co-authored-by: 1Fire4<wangdingyi2@huawei.com>
# message auto-generated for no-merge-commit merge: !239 merge fix/triton-ascend-slice-extension into master
Created-by: hitwdy
Commit-by: 1Fire4
Merged-by: cann-robot
Description:

## Description

Adapts to the cann 9.0.0 + triton_ascend 3.2.1 + torch 2.12.0 + torch_npu 2.12.0rc1 environment and fixes compilation errors in the triton operators.

## Type

- [x] Bug fix
- [ ] New feature
- [ ] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [x] My code follows this project's code style
- [x] I have tested my code myself
- [x] I have updated the corresponding documentation
- [x] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

Comparison with compile enabled versus disabled:

![dsv4_compile_vs_eager.png](https://raw.gitcode.com/user-images/assets/9028822/6da34f48-15c0-4cd9-8ac2-4f1f6d92fa1a/dsv4_compile_vs_eager.png 'dsv4_compile_vs_eager.png')

See merge request: cann/torchtitan-npu!239	3 days ago
patches	【fix】fix deepseep v4 pp (same commit and description as the `converters` entry above) See merge request: cann/torchtitan-npu!261	8 hours ago
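The fork_rng adaptation described in merge request !261 (fill in device_type="npu" only when devices is passed and the caller gave no device_type) follows a common wrapper pattern, sketched below without importing torch. This is not the body of _patch_fork_rng_for_npu_pipeline: `with_default_device_type` and `fake_fork_rng` are hypothetical names, and the sketch assumes device_type is always passed by keyword.

```python
import functools


def with_default_device_type(fn, device_type="npu"):
    """Wrap fn so device_type is filled in only when the caller omitted it."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # `devices` is assumed to be the first positional parameter.
        devices = args[0] if args else kwargs.get("devices")
        if devices is not None and "device_type" not in kwargs:
            kwargs["device_type"] = device_type
        return fn(*args, **kwargs)

    return wrapper


def fake_fork_rng(devices=None, enabled=True, device_type="cuda"):
    """Hypothetical stand-in that just reports which device_type it received."""
    return device_type


patched = with_default_device_type(fake_fork_rng)
implicit = patched(devices=[0])                       # filled in by the wrapper
explicit = patched(devices=[0], device_type="cuda")   # caller's choice is kept
untouched = patched()                                 # no devices: left alone
```

Leaving explicit arguments untouched is what keeps a global patch like this low-risk: callers that already chose a device type see no change in behavior.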
tools	[ci] Re-enabled lint and fixed errors
Co-authored-by: mystri<hanboyou@huawei.com>
# message auto-generated for no-merge-commit merge: !243 merge lint-fixes into master
Created-by: mystri
Commit-by: mystri
Merged-by: cann-robot
Description:

## Description

Re-enables lint and fixes all lint errors.

Script changes:

- .ci/lint.sh: adds a step that installs torchtitan from source into /tmp/torchtitan (the search-path configured for pyrefly), so that CI uses the same torchtitan version as smoke_test/unit_test
- .ci/setup_torchtitan.sh (new): a shared torchtitan install script pinned to commit ac13e536c84e7f6647b14fa9375c3c8a8a2b8578, reused by all CI scripts

## Type

- [ ] Bug fix
- [ ] New feature
- [ ] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [ ] My code follows this project's code style
- [ ] I have tested my code myself
- [ ] I have updated the corresponding documentation
- [ ] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

Briefly describe the test plan and attach the self-verification record.

## Other information

Any other notes related to this Pull Request can be added here.

See merge request: cann/torchtitan-npu!243	6 days ago
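__init__.py	[feat]adapter virtual optimizer for master
Co-authored-by: CjianForBetter<2318164299@qq.com>
# message auto-generated for no-merge-commit merge: !235 merge virtual_optimizer_for_master into master
Created-by: CjianForBetter
Commit-by: CjianForBetter
Merged-by: cann-robot
Description:

## Description

This PR adapts virtual_optimizer to the latest code on the master repository. It has been validated on two nodes; the results show that checkpoint weights can be saved and loaded and that accuracy is unaffected.

Changes: migrated from the v0.2.2-dev branch with the main body unchanged. The patch architecture is changed so that the optimizer is chosen through optimizer_selector; swap_optimizer and virtual_optimizer are no longer imported in init, which makes switching optimizers easier. The patch architecture change depends on the latest Containers.

Validated scenarios:

1. With the latest image, the virtual optimizer can save and load weights (single-node scenario), and device memory usage drops.
2. By changing the optimizer configuration in config_registry.py, one can switch to swap freely; the two do not affect each other.
3. Accuracy. This validation used a single-node 16-die A3 environment with the network deepseek_v4_285b_4layers_debug, ep16, and 16 experts.

## Type

- [ ] Bug fix
- [ ] New feature
- [x] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [x] My code follows this project's code style
- [x] I have tested my code myself
- [x] I have updated the corresponding documentation
- [x] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

bash scripts/run_train.sh

## Test results

1. The virtual optimizer can save the weights after 5 training steps and load them again.

![ScreenShot_20260528205048.PNG](https://raw.gitcode.com/user-images/assets/9028822/1009ca46-73f0-4b5d-895e-42698a280768/ScreenShot_20260528205048.PNG 'ScreenShot_20260528205048.PNG')

Training can resume from step 6 with the loaded weights.

![ScreenShot_20260528205109.PNG](https://raw.gitcode.com/user-images/assets/9028822/080552ec-c5ea-4833-ab2a-736a8072fb14/ScreenShot_20260528205109.PNG 'ScreenShot_20260528205109.PNG')

2. Accuracy comparison

(1) Accuracy with the virtual optimizer enabled versus disabled

![1.PNG](https://raw.gitcode.com/user-images/assets/9028822/b96f7098-fdfe-4cfe-a06a-d0324eb799b7/1.PNG '1.PNG')

(2) Accuracy with the virtual optimizer enabled versus the swap optimizer enabled

![2.PNG](https://raw.gitcode.com/user-images/assets/9028822/cfd48e97-9890-4812-8b39-d6f009b2d971/2.PNG '2.PNG')

3. Performance comparison

(1) Device memory usage with the virtual optimizer enabled versus disabled; usage is lower with the virtual optimizer

![3.PNG](https://raw.gitcode.com/user-images/assets/9028822/4f7ae605-210e-4d1b-a17b-c2e136837870/3.PNG '3.PNG')

(2) Device memory usage of the virtual optimizer versus the swap optimizer; usage is lower with the virtual optimizer

![4.PNG](https://raw.gitcode.com/user-images/assets/9028822/056467ec-1e8a-4830-9499-506c2f1fdd0e/4.PNG '4.PNG')

See merge request: cann/torchtitan-npu!235	15 hours ago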
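Merge request !235 says the optimizer patch is now chosen through optimizer_selector instead of importing swap_optimizer and virtual_optimizer in init. One plausible shape for such a selector is a small registry, sketched below. Every name here (`register_optimizer_patch`, `select_optimizer_patch`, the "swap" and "virtual" keys, the returned strings) is hypothetical and is not taken from the repository.

```python
from typing import Callable, Dict, Optional

# Registry mapping a config value to the function that applies that patch.
_PATCHES: Dict[str, Callable[[], str]] = {}


def register_optimizer_patch(name: str):
    """Decorator that records a patch function under a selector name."""

    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        _PATCHES[name] = fn
        return fn

    return decorator


@register_optimizer_patch("swap")
def _apply_swap() -> str:
    # A real patch would install the swap optimizer here.
    return "swap optimizer patch applied"


@register_optimizer_patch("virtual")
def _apply_virtual() -> str:
    # A real patch would install the virtual optimizer here.
    return "virtual optimizer patch applied"


def select_optimizer_patch(name: Optional[str]) -> Optional[str]:
    """Apply exactly one patch chosen by config, or none when name is None."""
    if name is None:
        return None
    try:
        apply = _PATCHES[name]
    except KeyError:
        raise ValueError(f"unknown optimizer patch: {name!r}") from None
    return apply()


result = select_optimizer_patch("virtual")
```

Because nothing is patched at import time, switching between the two optimizers becomes a configuration change, and at most one of them can be active in a run.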
entry.py	[refactor] Refactored kl loss to avoid recomputation cost
Co-authored-by: mystri<hanboyou@huawei.com>
# message auto-generated for no-merge-commit merge: !236 merge refactor-kl-loss-master into master
Created-by: mystri
Commit-by: mystri
Merged-by: cann-robot
Description:

## Description

Moves the liloss operator into backward and removes the activation checkpoint patch. The changes are the same as in [https://gitcode.com/cann/torchtitan-npu/pull/191](https://gitcode.com/cann/torchtitan-npu/pull/191), ported back to master.

Accuracy validation:

![ScreenShot_20260523175054.JPG](https://raw.gitcode.com/user-images/assets/9028822/0ac3e686-03b2-4ae9-b735-320c8a8211e2/ScreenShot_20260523175054.JPG 'ScreenShot_20260523175054.JPG')

Run-through test, AF + deepseek v3.2:

![0528af.JPG](https://raw.gitcode.com/user-images/assets/9028822/3c26532a-63e6-4165-be24-a6fadd8a1400/0528af.JPG '0528af.JPG')

## Type

- [x] Bug fix
- [ ] New feature
- [x] Refactor (a code change that neither adds a feature nor fixes a bug)
- [ ] Change to the build process or auxiliary tools
- [ ] Documentation update

## Checklist:

- [ ] My code follows this project's code style
- [ ] I have tested my code myself
- [ ] I have updated the corresponding documentation
- [ ] I have used the correct type tag in the title (e.g. `feat`, `fix`, `refactor`, `docs`, `test`)

## How to test

Briefly describe the test plan and attach the self-verification record.

## Other information

Any other notes related to this Pull Request can be added here.

See merge request: cann/torchtitan-npu!236	1 day ago
train.py	[refactor] Refactored kl loss to avoid recomputation cost (same commit and description as the `entry.py` entry above) See merge request: cann/torchtitan-npu!236	1 day ago