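TorchTitan Ascend Plugin

Overview

Introduction

This project provides torchtitan_npu_patch, an Ascend plugin for TorchTitan. It adapts the TorchTitan framework to Ascend NPUs so that developers using TorchTitan can run on the compute power of Ascend AI processors.

Ascend provides full-stack AI computing infrastructure for industry applications and services built on Huawei Ascend processors and software. Visit the Ascend community to learn more about Ascend.

Preparing the training environment

Version compatibility

Hardware: Ascend server; verified on Atlas 800T A2 (Kunpeng CPU)
System: Linux; verified on openEuler 22.03 (LTS-SP4)
Python: verified with 3.10 and 3.11

Verified software stacks: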

CANN	torch	torch_npu	torchtitan	triton-ascend
8.5.0.B120	2.9.0	2.9.0.post1.dev20260108	0.2.0	3.4.0.dev2026010713
8.5.0.B120	2.10.0	2.10.0.post1.dev20260320	0.2.1	3.4.0.dev2026010713
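
To compare an environment against the table above, the verified combinations can be encoded and matched against the installed package versions. The following is a minimal sketch, not part of the plugin: the names `VERIFIED_STACKS`, `installed_versions`, and `matching_stack` are hypothetical, and CANN is left out because it is not a pip package.

```python
from importlib import metadata

# Verified pip-package combinations copied from the table above (CANN omitted).
VERIFIED_STACKS = [
    {"torch": "2.9.0", "torch_npu": "2.9.0.post1.dev20260108",
     "torchtitan": "0.2.0", "triton-ascend": "3.4.0.dev2026010713"},
    {"torch": "2.10.0", "torch_npu": "2.10.0.post1.dev20260320",
     "torchtitan": "0.2.1", "triton-ascend": "3.4.0.dev2026010713"},
]


def installed_versions(packages):
    """Return {package: version or None} for the current environment."""
    found = {}
    for name in packages:
        try:
            # Drop a local version label such as "+cpu" before comparing.
            found[name] = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def matching_stack(installed):
    """Return the first verified row equal to the installed versions, else None."""
    for row in VERIFIED_STACKS:
        if all(installed.get(name) == version for name, version in row.items()):
            return row
    return None


current = installed_versions(VERIFIED_STACKS[0].keys())
print(current)
print("verified" if matching_stack(current) else "not a verified combination")
```

If torchtitan is used from source through PYTHONPATH rather than installed with pip, it has no package metadata and is reported as None, so the sketch would report no match even on a verified stack.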

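Installing and configuring Ascend CANN

Make sure the Ascend NPU firmware and driver are installed correctly. The following command shows NPU information, health status, and usage: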

npu-smi info
# output e.g.
# +------------------------------------------------------------------------------------------------+
# | npu-smi 25.0.rc1.1               Version: 25.0.rc1.1                                           |
# +---------------------------+---------------+----------------------------------------------------+
# | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
# | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
# +===========================+===============+====================================================+
# | 0     DEVICE_NAME         | OK            | 181.4       47                0    / 0             |
# | 0                         | 0000:C1:00.0  | 49          0    / 0          44421/ 65536         |
# ...

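Run the following command to check whether CANN is installed and which version it is. The example uses the default installation path; your actual path may differ: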

cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
# output e.g.
# package_name=Ascend-cann-toolkit
# version: 8.5.0
# ...

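For download and installation details, see the Ascend environment setup guide or the Ascend community.

Once the Ascend firmware, driver, and CANN are installed and configured, you can either jump to the Quick start section to install the remaining software in one pass, or follow the per-component instructions in the sections below.

Installing PyTorch and torch_npu

See the torch_npu environment deployment guide for the Ascend version compatibility matrix, then install PyTorch and torch_npu.

Installing triton-ascend

Install triton-ascend: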

pip install triton-ascend

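Installing TorchTitan

Install TorchTitan by following TorchTitan Installation. Alternatively, download the TorchTitan v0.2.0 source directly and set PYTHONPATH to the torchtitan root directory.

Quick start

Install PyTorch, torch_npu, triton-ascend, and the remaining dependencies as follows: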

# After ascend firmware/driver and CANN are installed correctly
# After python is installed or conda python env is activated

# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# For torch-npu dev version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"

# Install pytorch
# Choose the version based on torchtitan version
pip install torch==2.9.0

# Install torch-npu prerequisites
pip install pyyaml
pip install setuptools
pip install psutil
pip install scipy
pip install decorator
# 1.2x is a placeholder: substitute a concrete numpy 1.2x release before running
pip install numpy==1.2x
# Install torch-npu
# Choose the version based on torchtitan version
pip install torch_npu==2.9.0.post1.dev20260108 --extra-index-url https://mirrors.huaweicloud.com/ascend/repos/pypi

# Install triton-ascend
pip install triton-ascend

# Download torchtitan
git clone --branch v0.2.0 https://github.com/pytorch/torchtitan.git
# Install torchtitan prerequisites
pip install -r torchtitan/requirements.txt
# Add torchtitan to PYTHONPATH
export PYTHONPATH=`pwd`/torchtitan/:$PYTHONPATH

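Patch file description

Copy torchtitan_npu_patch.py into the torchtitan source tree, in the same directory as torchtitan/train.py, and add the following at the top of train.py. Note: the patch file torchtitan_npu_patch.py is in the same directory as this README.md.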

from . import torchtitan_npu_patch
# before import torch

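The patch file mainly adds compatibility fixes and NPU-affinity performance optimizations.

Running

Downloading the pretraining dataset

You can use the c4_test test dataset, which needs no download. It is already the default in the model configuration files, so no change is required. To use the full C4 dataset, download it and edit the corresponding model configuration file: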

# set or comment dataset and add dataset_path
dataset = "c4"
dataset_path = "/PATH/TO/c4"

# or save the c4 dataset in the default directory "allenai/c4" under the torchtitan source root, and set:
dataset = "c4"

Llama3 configuration

See the Llama3 model in the torchtitan community.

The Llama3 configuration files are located in ./torchtitan/models/llama3/train_configs/.

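Downloading the tokenizer

Run the following command to download specific assets (such as the tokenizer) from Hugging Face:

# cd torchtitan source root

python scripts/download_hf_assets.py --repo_id <hf_repo_name> --assets tokenizer --hf_token=...

Available values for <hf_repo_name>:

For the llama3-8b model, the Hugging Face repository name is meta-llama/Llama-3.1-8B.

Training configuration

The llama3-8B FSDP2 training configuration is: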

[training]
local_batch_size = 1
seq_len = 8192

[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 1
pipeline_parallel_degree = 1
context_parallel_degree = 1

[compile]
enable=false
components = ["model", "loss"]
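
Setting data_parallel_shard_degree = -1 leaves the shard degree to be derived from the number of devices, since the product of all parallel degrees has to equal the world size. The arithmetic can be sketched as follows, assuming a job with 8 NPUs; `resolve_dp_shard` is a hypothetical helper written for illustration, not a TorchTitan API.

```python
def resolve_dp_shard(world_size, dp_replicate=1, tp=1, pp=1, cp=1, dp_shard=-1):
    """Return the effective data-parallel shard degree for a given world size."""
    other = dp_replicate * tp * pp * cp
    if dp_shard == -1:
        # -1 means "use whatever ranks are left after the other dimensions".
        if world_size % other != 0:
            raise ValueError("world_size is not divisible by the other parallel degrees")
        return world_size // other
    if dp_shard * other != world_size:
        raise ValueError("the product of all parallel degrees must equal world_size")
    return dp_shard


print(resolve_dp_shard(8))              # pure FSDP on 8 NPUs
print(resolve_dp_shard(8, tp=2, pp=2))  # FSDP combined with TP=2 and PP=2
```

With the pure FSDP configuration all 8 devices shard the model; adding TP=2 and PP=2 on the same 8 devices leaves a shard degree of 2.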

The llama3-8B FSDP2+TP2+PP2 training configuration is:

[training]
local_batch_size = 2
seq_len = 4096

[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 2
pipeline_parallel_degree = 2
context_parallel_degree = 1

[compile]
enable=false
components = ["model", "loss"]

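With pipeline parallelism (PP), the loss is reported only on the last stage of the partitioned model. You therefore need to change the logged rank in the launch script run_train.sh as follows; otherwise the reported loss is -1: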

export LOG_RANK=${LOG_RANK:-4}
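
The choice of rank 4 can be reasoned about with a short calculation. This is a sketch under two assumptions that are not stated in this README: PP is the outermost dimension of the device mesh, and the schedule places the last stage on the last PP rank. `last_stage_ranks` is a hypothetical helper.

```python
def last_stage_ranks(world_size, pp_degree):
    """Global ranks holding the last pipeline stage, assuming PP is the outermost mesh dimension."""
    if pp_degree < 1 or world_size % pp_degree != 0:
        raise ValueError("world_size must be a positive multiple of pp_degree")
    ranks_per_stage = world_size // pp_degree
    return list(range((pp_degree - 1) * ranks_per_stage, world_size))


print(last_stage_ranks(8, 2))  # [4, 5, 6, 7]
```

Under these assumptions, on 8 NPUs with PP=2 the first rank that holds the last stage is rank 4, which matches the LOG_RANK default above.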

The llama3-8B HSDP2+TP2+PP2 training configuration is:

[training]
local_batch_size = 2
seq_len = 4096

[parallelism]
data_parallel_replicate_degree = 2
data_parallel_shard_degree = -1
tensor_parallel_degree = 2
pipeline_parallel_degree = 2
context_parallel_degree = 1

[compile]
enable=false
components = ["model", "loss"]

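With PP, the loss is reported only on the last stage of the partitioned model. Taking PP=2 above as an example, when this parallel strategy runs on two nodes, the model stages are placed on different nodes, so only one node reports the correct loss. For the PP setup above, change the logged rank in run_train.sh as follows so that the secondary node reports the correct loss: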

export LOG_RANK=${LOG_RANK:-4}

Set config_file in the launch script run_train.sh:

CONFIG_FILE=${CONFIG_FILE:-"./torchtitan/models/llama3/train_configs/llama3_8b.toml"}

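Deterministic computation configuration

For accuracy verification or alignment, fix the random seeds and enable deterministic computation and communication.

Set the environment variables: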

#Set HCCL Determinism
export HCCL_DETERMINISTIC=true
#Set NPU Hardware Determinism
export ASCEND_LAUNCH_BLOCKING=1

After the imports in torchtitan/train.py, add the following code:

import os
import random
import numpy as np
import torch
import torch_npu
seed = 1
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.use_deterministic_algorithms(True)
torch_npu.npu.manual_seed_all(seed)
torch_npu.npu.manual_seed(seed)
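
The seeding steps above can also be wrapped in one function so that the same seed is applied consistently wherever it is needed. The following is a minimal sketch: `seed_everything` is a hypothetical helper, and the optional imports are guarded only so that the function also runs on a machine without NumPy, PyTorch, or torch_npu installed.

```python
import os
import random


def seed_everything(seed):
    """Seed Python's RNG, and NumPy / PyTorch / torch_npu when they are importable."""
    random.seed(seed)
    # Changing PYTHONHASHSEED at run time only affects child processes.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)
    except ImportError:
        return
    try:
        import torch_npu
        torch_npu.npu.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random sequence.
seed_everything(1)
first = [random.random() for _ in range(3)]
seed_everything(1)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```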

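Graph compilation configuration

Graph compilation can be enabled in the model configuration file. It uses torch.compile to compile the model. Once enabled, the model is compiled automatically and the compiled model is used directly at run time. Graph compilation currently supports only FSDP; graph compilation with other parallel configurations is still experimental.

Enabling through the configuration file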

[compile]
enable=true
backend="inductor" # ["inductor", "aot_eager"]
components = ["model", "loss"]

When the inductor backend is used together with aclgraph, some operators are not yet supported, so the environment variable NPU_INDUCTOR_FALLBACK_ALL must be set to 1.

#Enable NPUGraph
export TORCHINDUCTOR_CUDAGRAPHS=1
#Set NPU Inductor FallBack for operators
export NPU_INDUCTOR_FALLBACK_ALL=1

In addition, when aclgraph is configured, TASK_QUEUE_ENABLE cannot be set to 2. To use aclgraph, modify torchtitan_npu_patch.py as follows:

## Set environment variables to improve performance on NPUs
def set_environ_variable():
    logger.warning('set_environ_variable is called')
    os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:True'
    os.environ['TASK_QUEUE_ENABLE'] = '1'
    os.environ['MULTI_STREAM_MEMORY_REUSE'] = '2'
    os.environ['CPU_AFFINITY_CONF'] = '2'
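
The two aclgraph constraints described in this section (operator fallback enabled, TASK_QUEUE_ENABLE not equal to 2) can be checked before launching a job. The sketch below is illustrative only: `aclgraph_env_problems` is a hypothetical helper, and it assumes that TORCHINDUCTOR_CUDAGRAPHS=1 is how aclgraph is switched on, as in the export shown earlier.

```python
import os


def aclgraph_env_problems(env=None):
    """Return a list of environment settings that conflict with aclgraph."""
    env = os.environ if env is None else env
    problems = []
    if env.get("TORCHINDUCTOR_CUDAGRAPHS") != "1":
        return problems  # aclgraph is not enabled, nothing to check
    if env.get("TASK_QUEUE_ENABLE") == "2":
        problems.append("TASK_QUEUE_ENABLE=2 is not supported with aclgraph; use 1")
    if env.get("NPU_INDUCTOR_FALLBACK_ALL") != "1":
        problems.append("NPU_INDUCTOR_FALLBACK_ALL should be set to 1")
    return problems


example = {"TORCHINDUCTOR_CUDAGRAPHS": "1", "TASK_QUEUE_ENABLE": "2"}
for problem in aclgraph_env_problems(example):
    print(problem)
```

Passing a plain dictionary instead of reading os.environ directly keeps the check easy to try out without touching the real environment.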

SimpleFSDP

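SimpleFSDP is a lightweight experimental FSDP implementation that TorchTitan provides in torchtitan/experiments/simple_fsdp. It builds on torch.compile and supports compiling the whole model into a graph. This capability is currently supported on PyTorch Ascend versions 2.9 and 2.10.

For details and support status, see the official tutorial: TorchTitan SimpleFSDP.

SimpleFSDP currently depends on the RMSNorm operator from the PyTorch community, so the RMSNorm patch at the end of torchtitan_npu_patch.py must be skipped: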

set_environ_variable()
torchtitan.tools.utils.get_peak_flops = get_peak_flops
profiling.maybe_enable_profiling = maybe_enable_profiling
# nn.RMSNorm = RMSNorm

Launching the Llama3 model with SimpleFSDP

CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.name simple_fsdp.llama3 --compile.enable --job.custom_config_module=torchtitan.experiments.simple_fsdp.job_config

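Compiled Autograd

Compiled Autograd compiles the backward graph. Because it integrates directly with autograd, it can be combined with torch.compile to increase how much of the backward computation is captured in the graph. Before using it, place compiled_autograd.py in the same directory as torchtitan/train.py and add the following at the top of train.py: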

from . import compiled_autograd
# before import torch

Then turn on both enable and compiled_autograd_enable in the [compile] section of the model configuration file:

[compile]
enable=true
backend="inductor" # ["inductor", "aot_eager"]
components = ["model", "loss"]
compiled_autograd_enable=true

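compiled_autograd_enable is off by default, and Compiled Autograd takes effect only when both enable and compiled_autograd_enable are true. Once it is enabled, set TORCH_LOGS="compiled_autograd" when running the training script to print detailed Compiled Autograd logs.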
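
The rule that both switches must be on can be written as a small predicate over the [compile] section. This is a sketch of the rule as described here, not the code in compiled_autograd.py; `compiled_autograd_active` is a hypothetical name, and the section is represented as a plain dictionary.

```python
def compiled_autograd_active(compile_cfg):
    """True only when both `enable` and `compiled_autograd_enable` are set."""
    return bool(compile_cfg.get("enable", False)) and bool(
        compile_cfg.get("compiled_autograd_enable", False)
    )


# Walk through all four combinations of the two switches.
for enable in (False, True):
    for ca_enable in (False, True):
        cfg = {"enable": enable, "compiled_autograd_enable": ca_enable}
        print(cfg, "->", compiled_autograd_active(cfg))
```

A missing compiled_autograd_enable key is treated as false, which mirrors the documented default.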

Environment variables and launching training

Activate the environment variables. Skip this step if they are already active.

# source env of CANN, the real path on your server may be different
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# If encounter a version mismatch issue between ATB and torch_npu, you can add and try switching the parameter --cxx_abi=0 or --cxx_abi=1.
source /usr/local/Ascend/nnal/atb/set_env.sh

# activate python env

Run the script to start training.

# cd torchtitan source root
bash ./run_train.sh

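To run a two-node job, modify the torchrun command in run_train.sh. Start the script on the primary node first, then on the secondary node. The modifications are as follows:

Primary node configuration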

torchrun --nnodes 2 --node_rank 0 --local-ranks-filter ${LOG_RANK} --nproc_per_node=${NGPU} --master_addr="master_node_address" --master_port 29500 -m ${TRAIN_FILE} --job.config_file ${CONFIG_FILE} "$@"

Secondary node configuration

torchrun --nnodes 2 --node_rank 1 --local-ranks-filter ${LOG_RANK} --nproc_per_node=${NGPU} --master_addr="master_node_address" --master_port 29500 -m ${TRAIN_FILE} --job.config_file ${CONFIG_FILE} "$@"
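
The two commands above differ only in the value of --node_rank. The sketch below builds both from one function to make that explicit. `torchrun_command` is a hypothetical helper; the default values for nproc_per_node (8), the training module (torchtitan.train), and the log rank are assumptions standing in for ${NGPU}, ${TRAIN_FILE}, and ${LOG_RANK}, and master_node_address remains a placeholder.

```python
import shlex


def torchrun_command(node_rank, nnodes=2, nproc_per_node=8, log_rank="4",
                     master_addr="master_node_address", master_port=29500,
                     train_file="torchtitan.train",
                     config_file="./torchtitan/models/llama3/train_configs/llama3_8b.toml"):
    """Build the torchrun argument list for one node of a multi-node job."""
    return [
        "torchrun",
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--local-ranks-filter", str(log_rank),
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        "--master_port", str(master_port),
        "-m", train_file,
        "--job.config_file", config_file,
    ]


primary = torchrun_command(node_rank=0)
secondary = torchrun_command(node_rank=1)
print(shlex.join(primary))
print(shlex.join(secondary))
```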