Index-TTS-vLLM-v2 NPU Inference Adaptation Guide

Overview
Inference Environment Setup
Quick Start
Adaptation Changes
Performance

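Overview

This project adapts index-tts-vllm to Ascend NPUs so that GPT model inference can be accelerated with vllm-ascend in an Ascend NPU environment, with TTS synthesis exposed as a service through FastAPI. Index-TTS-vLLM-v2 is the v2 model released by the Index team; compared with v1, it improves emotion control and semantic understanding.

Version information: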

url=https://github.com/Ksuriuri/index-tts-vllm
commit_id=5f9d0244f626f6a51fbbcee8f266d7f8927cf981
model_name=IndexTTS-2-vLLM

Inference Environment Setup

The model requires the following components and drivers.

Table 1 Version compatibility

Component	Version
Firmware and driver	25.5.1+
CANN	8.5.1
Python	3.11.14
PyTorch / torch_npu	2.9.0
vllm	0.18.0
vllm-ascend	0.18.0rc1
torchaudio	2.9.0
soundfile	0.13.1

Note: For Atlas 800I A2 inference cards, select the firmware and driver version that matches your CANN version.

Quick Start

Get the source code

Get the index-tts-vllm source (pinned to the commit used for the v2 model)

git clone https://github.com/Ksuriuri/index-tts-vllm.git
cd index-tts-vllm
git checkout 5f9d0244f626f6a51fbbcee8f266d7f8927cf981

Get the adaptation patch from this repository

git clone https://gitcode.com/Ascend/ModelZoo-PyTorch.git
cp ModelZoo-PyTorch/ACL_PyTorch/built-in/audio/Index-TTS-vLLM-v2/diff_index_tts_vllm_v2.patch .

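Apply the adaptation patch (run from the index-tts-vllm directory, where the patch was copied in the previous step)

git apply diff_index_tts_vllm_v2.patch

Install dependencies

pip install "soundfile>=0.13.1"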

Download the model weights

From ModelScope (recommended for users in mainland China):

modelscope download --model kusuriuri/IndexTTS-2-vLLM --local_dir ./checkpoints/IndexTTS-2-vLLM

Start the API service

OMP_NUM_THREADS=1 python3 api_server_v2.py \
    --model_dir ./checkpoints/IndexTTS-2-vLLM \
    --host 0.0.0.0 \
    --port 6008 \
    --gpu_memory_utilization 0.35 \
    --qwenemo_gpu_memory_utilization 0.10

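Launch parameters

Parameter	Description	Default
`--model_dir`	Path to the model weights	`./checkpoints/IndexTTS-2-vLLM`
`--host`	Address the service listens on	`0.0.0.0`
`--port`	Port the service listens on	`6008`
`--gpu_memory_utilization`	Fraction of device memory used by vllm	`0.35`
`--qwenemo_gpu_memory_utilization`	Fraction of device memory used by the QwenEmotion model	`0.10`
`--is_fp16`	Whether to enable FP16 inference	`False`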
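The flags and defaults in the table above can be sketched with argparse as follows. This is a minimal illustration only: `build_parser` is a hypothetical helper, and treating `--is_fp16` as a boolean switch is an assumption; the actual parser in api_server_v2.py may be written differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Index-TTS-vLLM-v2 API server (sketch)")
    parser.add_argument("--model_dir", type=str, default="./checkpoints/IndexTTS-2-vLLM")
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--port", type=int, default=6008)
    parser.add_argument("--gpu_memory_utilization", type=float, default=0.35)
    parser.add_argument("--qwenemo_gpu_memory_utilization", type=float, default=0.10)
    # Assumption: --is_fp16 is a plain on/off switch that defaults to False.
    parser.add_argument("--is_fp16", action="store_true", default=False)
    return parser


# Override only the port; every other value falls back to its documented default.
args = build_parser().parse_args(["--port", "7000"])
print(args.port, args.gpu_memory_utilization, args.is_fp16)
```

Any flag that is not passed keeps the default listed in the table, so the launch command shown earlier could omit `--host` and `--port` without changing behavior under this sketch.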

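Environment variables

Environment variable	Description	Recommended value
`OMP_NUM_THREADS`	Number of OpenMP threads	`1`
`PYTORCH_NPU_ALLOC_CONF`	NPU memory allocation policy	`expandable_segments:True`
`VLLM_ASCEND_BALANCE_SCHEDULING`	Ascend balanced-scheduling switch	`1`
`ATB_USE_TILING_COPY_STREAM`	ATB tiling copy stream	`1`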

Call the inference API

Once the service has started, run the following script to verify inference:

python api_example_v2.py

Adaptation Changes

This project makes the following NPU adaptation changes to the original index-tts-vllm:

1. api_server_v2.py

① Environment variable defaults

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")
os.environ.setdefault("VLLM_ASCEND_BALANCE_SCHEDULING", "1")
os.environ.setdefault("ATB_USE_TILING_COPY_STREAM", "1")
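Because these calls use `setdefault`, a value exported by the user before launch takes precedence over the recommended default. The short sketch below illustrates that behavior; the value `"4"` is an arbitrary example, not a recommendation.

```python
import os

# Simulate a user who exported their own value before launching the server.
os.environ["OMP_NUM_THREADS"] = "4"
# Simulate a variable the user did not set at all.
os.environ.pop("ATB_USE_TILING_COPY_STREAM", None)

os.environ.setdefault("OMP_NUM_THREADS", "1")             # stays "4": already set
os.environ.setdefault("ATB_USE_TILING_COPY_STREAM", "1")  # becomes "1": was unset

print(os.environ["OMP_NUM_THREADS"])             # 4
print(os.environ["ATB_USE_TILING_COPY_STREAM"])  # 1
```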

2. indextts/infer_vllm_v2.py

① vLLM engine parameter tuning

_compilation_config = {
    "cudagraph_mode": "FULL_DECODE_ONLY",
    "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, ...],
    "pass_config": {"fuse_rope_kvcache": True}
}
engine_args = AsyncEngineArgs(
    ...
    optimization_level=1,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
    additional_config={
        "fuse_muls_add": True,
        "ascend_compilation_config": {"enable_npugraph_ex": True},
    },
)

② NPU branch added to device detection

elif hasattr(torch, "npu") and torch.npu.is_available():
    self.device = "npu:0"
    self.is_fp16 = is_fp16
    self.use_cuda_kernel = False

③ Speaker cache

Encoding results for the same speaker reference audio are reused, which avoids repeated computation.
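The idea can be sketched as a small memoizing wrapper. Everything here is hypothetical: `SpeakerCache`, `fake_encode`, and the choice of the reference-audio path as the cache key are assumptions for illustration, and the patch may key and store its cache differently.

```python
from typing import Any, Callable, Dict


class SpeakerCache:
    """Hypothetical cache that stores one encoding per reference-audio path."""

    def __init__(self, encode_fn: Callable[[str], Any]) -> None:
        self._encode_fn = encode_fn
        self._cache: Dict[str, Any] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, audio_path: str) -> Any:
        # Run the encoder only the first time a given path is seen.
        if audio_path not in self._cache:
            self.misses += 1
            self._cache[audio_path] = self._encode_fn(audio_path)
        return self._cache[audio_path]


def fake_encode(path: str) -> dict:
    """Stand-in for the expensive speaker encoder (assumption, not the real one)."""
    return {"path": path, "embedding": [float(len(path))]}


cache = SpeakerCache(fake_encode)
first = cache.get("spk1.wav")
second = cache.get("spk1.wav")  # served from the cache, encoder not called again
other = cache.get("spk2.wav")
print(cache.misses)  # 2
```

Keying by path alone would return a stale encoding if the file content changes under the same name; a content hash would be a safer key in that case.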

④ Audio loading replacement

librosa.load is replaced with soundfile, which speeds up loading and avoids NPU compatibility issues.

⑤ More detailed error logging

logger.error(f"TTS inference failed:\n{tb_str}")

3. indextts/gpt/model_vllm_v2.py

① GPT2 configuration

attn_implementation="eager"

Flash attention is disabled on NPU; eager mode is used to ensure compatibility.

4. patch_vllm.py

vllm-ascend GPUModelRunner patch

The patch first calls the original _prepare_inputs, then applies the position_ids offset logic that is specific to GPT2TTSModel:

import numpy as np
from vllm_ascend.worker.model_runner_v1 import GPUModelRunner
# GPT2TTSModel must also be in scope; its import is not shown in this excerpt.

_original_prepare_inputs = GPUModelRunner._prepare_inputs

def _prepare_inputs(self, scheduler_output, num_scheduled_tokens):
    result = _original_prepare_inputs(self, scheduler_output, num_scheduled_tokens)

    model = self.get_model()
    if isinstance(model, GPT2TTSModel):
        # Position correction for GPT2TTSModel
        total_num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens
        num_reqs = self.input_batch.num_reqs
        req_indices = np.repeat(self.arange_np[:num_reqs], num_scheduled_tokens)

        prompt_tokens_offset = []
        for req_id in self.input_batch.req_ids:
            prompt_tokens_offset.append(-(len(self.requests[req_id].prompt_token_ids) - 1))
        positions_np = self.positions.np[:total_num_scheduled_tokens]
        np.add(np.array(prompt_tokens_offset)[req_indices],
                positions_np,
                out=positions_np)
        self.positions.copy_to_gpu(total_num_scheduled_tokens)

    return result

GPUModelRunner._prepare_inputs = _prepare_inputs
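The offset arithmetic in the patch can be followed on plain Python lists. The sketch below is a simplified stand-in, not the patched code: `shift_positions` is a hypothetical helper, and it replaces the NumPy `np.repeat` indexing with an explicit loop over requests.

```python
from typing import List


def shift_positions(
    positions: List[int],
    num_scheduled_tokens: List[int],
    prompt_lens: List[int],
) -> List[int]:
    """Subtract (prompt_len - 1) from every position that belongs to a request.

    positions: flat positions of all scheduled tokens, grouped request by request
    num_scheduled_tokens: number of tokens scheduled for each request in this step
    prompt_lens: prompt length of each request
    """
    shifted: List[int] = []
    start = 0
    for n_tokens, prompt_len in zip(num_scheduled_tokens, prompt_lens):
        offset = -(prompt_len - 1)  # same offset the patch computes per request
        shifted.extend(p + offset for p in positions[start:start + n_tokens])
        start += n_tokens
    return shifted


# Two requests, one token each in this step.
# Request 0: 5-token prompt, raw position 5. Request 1: 3-token prompt, raw position 5.
print(shift_positions([5, 5], [1, 1], [5, 3]))  # [1, 3]
```

With this offset, the last prompt token of each request maps to position 0 and generated tokens count up from 1, regardless of how long the prompt was.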

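5. indextts/s2mel/modules/flow_matching.py

CFG cache

During CFM inference, the full CFG (classifier-free guidance) computation runs only once every N steps, and the steps in between reuse the cached result, which substantially reduces computation:

# Run the full CFG computation once every cfg_cache_interval steps
need_full_cfg = (cfg_cache_interval <= 1) or (step % cfg_cache_interval == 1)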
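To see which steps this condition selects, the sketch below enumerates them for a given interval. It assumes that `step` counts from 1, which the excerpt does not state; `full_cfg_steps` is a hypothetical helper written for this illustration.

```python
from typing import List


def full_cfg_steps(num_steps: int, cfg_cache_interval: int) -> List[int]:
    """Return the steps (assumed 1-based) on which the full CFG computation runs."""
    return [
        step
        for step in range(1, num_steps + 1)
        if cfg_cache_interval <= 1 or step % cfg_cache_interval == 1
    ]


print(full_cfg_steps(10, 3))  # [1, 4, 7, 10]
print(full_cfg_steps(4, 1))   # [1, 2, 3, 4]
```

The `cfg_cache_interval <= 1` guard matters because `step % 1` is always 0 and never equals 1, so without it an interval of 1 would never trigger a full computation.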

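Performance

RTF (real-time factor) = inference time / generated audio duration; it measures synthesis speed. RTF < 1 means synthesis is faster than real time, and lower values are better.

Hardware	Generated audio duration	Inference time	RTF
A2	3.81s	2.95s	0.77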
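As a quick check of the figure in the table, the RTF can be recomputed from the two measured durations. `real_time_factor` is a hypothetical helper name used only for this example.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = inference time / generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return inference_seconds / audio_seconds


rtf = real_time_factor(2.95, 3.81)
print(f"{rtf:.2f}")  # 0.77
```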

Time spent in each stage:

Stage	Time	Share
gpt_gen_time (GPT decode generation)	1.53s	51.9%
s2mel_time (CFM diffusion)	0.75s	25.4%
bigvgan_time (vocoder)	0.08s	2.7%
Other (audio loading, encoding, etc.)	0.59s	20.0%
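The shares in the table follow directly from the per-stage times, which sum to the 2.95s total inference time. A minimal recomputation, using the stage names from the table as dictionary keys:

```python
# Per-stage times in seconds, copied from the table above.
stages = {
    "gpt_gen_time": 1.53,
    "s2mel_time": 0.75,
    "bigvgan_time": 0.08,
    "other": 0.59,
}

total = sum(stages.values())
shares = {name: round(100 * seconds / total, 1) for name, seconds in stages.items()}

print(f"total = {total:.2f}s")
for name, share in shares.items():
    print(f"{name}: {share}%")
```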

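Main optimizations:

vLLM engine parameter tuning (optimization_level, fuse_rope_kvcache, prefix_caching)
Speaker cache reuses the encoding results of the same speaker
CFG cache reduces the computation across CFM diffusion steps
Environment variable tuning (OMP_NUM_THREADS, ATB_USE_TILING_COPY_STREAM, etc.)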