
Index-TTS-vLLM-v2 NPU Inference Adaptation Guide


Overview

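  This project adapts index-tts-vllm to Ascend NPUs so that GPT model inference can be accelerated with vllm-ascend in an Ascend NPU environment, and it exposes a TTS synthesis service through FastAPI. Index-TTS-vLLM-v2 is the v2 model released by the Index team; compared with v1, it strengthens emotion control and semantic understanding.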

  • Version information:
    url=https://github.com/Ksuriuri/index-tts-vllm
    commit_id=5f9d0244f626f6a51fbbcee8f266d7f8927cf981
    model_name=IndexTTS-2-vLLM
    

Inference environment preparation

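  • The model requires the following components and drivers

    Table 1 Version compatibility

    Component               Version
    Firmware and driver     25.5.1+
    CANN                    8.5.1
    Python                  3.11.14
    PyTorch / torch_npu     2.9.0
    vllm                    0.18.0
    vllm-ascend             0.18.0rc1
    torchaudio              2.9.0
    soundfile               0.13.1

    Note: For Atlas 800I A2 inference hardware, choose the actual firmware and driver versions according to the CANN version.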
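Before continuing, it can be useful to compare the installed Python packages against Table 1. The following is a minimal sketch, not part of the repository: the helper names (`installed_version`, `check_versions`) are hypothetical, the pip distribution names in `EXPECTED` are assumptions, and firmware, driver, and CANN versions are not covered because they are not Python packages.

```python
from importlib import metadata

# Expected versions copied from Table 1 (pip-installable packages only).
# The distribution names used as keys are assumptions.
EXPECTED = {
    "torch": "2.9.0",
    "torch_npu": "2.9.0",
    "vllm": "0.18.0",
    "vllm-ascend": "0.18.0rc1",
    "torchaudio": "2.9.0",
    "soundfile": "0.13.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for package, want in expected.items():
        found = lookup(package)
        # Prefix match so that local build suffixes such as "2.9.0+cpu" are accepted.
        if found is None or not found.startswith(want):
            mismatches[package] = (want, found)
    return mismatches


if __name__ == "__main__":
    for package, (want, found) in check_versions(EXPECTED).items():
        print(f"{package}: expected {want}, found {found}")
```

An empty result means every listed package matches the table; the prefix match is deliberately loose, so treat the output as a hint rather than a strict compatibility check.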

Quick start

Get the source code

  1. Get the index-tts-vllm source code (the commit corresponding to v2)

    git clone https://github.com/Ksuriuri/index-tts-vllm.git
    cd index-tts-vllm
    git checkout 5f9d0244f626f6a51fbbcee8f266d7f8927cf981
    
  2. Get the adaptation patch from this repository

    git clone https://gitcode.com/Ascend/ModelZoo-PyTorch.git
    cp ModelZoo-PyTorch/ACL_PyTorch/built-in/audio/Index-TTS-vLLM-v2/diff_index_tts_vllm_v2.patch .
    

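Apply the adaptation patch

The patch must be applied from the index-tts-vllm repository root, where it was copied in the previous step. Skip the cd if you are already in that directory.

cd index-tts-vllm
git apply diff_index_tts_vllm_v2.patch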

Install dependencies

pip install "soundfile>=0.13.1"

Download model weights

From ModelScope (recommended for users in mainland China):

modelscope download --model kusuriuri/IndexTTS-2-vLLM --local_dir ./checkpoints/IndexTTS-2-vLLM

Start the API service

OMP_NUM_THREADS=1 python3 api_server_v2.py \
    --model_dir ./checkpoints/IndexTTS-2-vLLM \
    --host 0.0.0.0 \
    --port 6008 \
    --gpu_memory_utilization 0.35 \
    --qwenemo_gpu_memory_utilization 0.10

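Startup parameters

Parameter                          Description                                               Default
--model_dir                        Path to the model weights                                 ./checkpoints/IndexTTS-2-vLLM
--host                             Address the service listens on                            0.0.0.0
--port                             Port the service listens on                               6008
--gpu_memory_utilization           Fraction of device memory used by vLLM                    0.35
--qwenemo_gpu_memory_utilization   Fraction of device memory used by the QwenEmotion model   0.10
--is_fp16                          Whether to enable FP16 inference                          False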

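Environment variables

Environment variable             Description                         Recommended value
OMP_NUM_THREADS                  Number of OpenMP threads            1
PYTORCH_NPU_ALLOC_CONF           NPU memory allocation policy        expandable_segments:True
VLLM_ASCEND_BALANCE_SCHEDULING   Ascend balanced-scheduling switch   1
ATB_USE_TILING_COPY_STREAM       ATB tiling copy stream              1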

Call the inference API

After the service has started successfully, run the following script to verify inference:

python api_example_v2.py
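Model loading can take a while, so it may help to confirm that the port is accepting connections before running the example script. The sketch below only checks TCP reachability with the standard library; `wait_for_port` is a hypothetical helper, not part of the repository, and a successful connection does not guarantee that the models have finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the launch command in this guide, the call would be `wait_for_port("127.0.0.1", 6008)`, where 6008 is the value passed to --port.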

Adaptation changes

This project makes the following NPU adaptation changes to the original index-tts-vllm:

1. api_server_v2.py

① Environment variable settings

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")
os.environ.setdefault("VLLM_ASCEND_BALANCE_SCHEDULING", "1")
os.environ.setdefault("ATB_USE_TILING_COPY_STREAM", "1")

2. indextts/infer_vllm_v2.py

① vLLM engine parameter tuning

_compilation_config = {
    "cudagraph_mode": "FULL_DECODE_ONLY",
    "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, ...],
    "pass_config": {"fuse_rope_kvcache": True}
}
engine_args = AsyncEngineArgs(
    ...
    optimization_level=1,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
    additional_config={
        "fuse_muls_add": True,
        "ascend_compilation_config": {"enable_npugraph_ex": True},
    },
)

② NPU branch added to device detection

elif hasattr(torch, "npu") and torch.npu.is_available():
    self.device = "npu:0"
    self.is_fp16 = is_fp16
    self.use_cuda_kernel = False
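The branch above is one arm of a longer device-detection chain. A standalone sketch of such a chain follows. The function name `select_device`, the CUDA-first ordering, and the CPU fallback are assumptions for illustration, and the sketch assumes that importing torch_npu is what makes `torch.npu` available.

```python
def select_device():
    """Pick a device string, preferring CUDA, then Ascend NPU, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed, so no accelerator can be used
    try:
        import torch_npu  # noqa: F401  (assumed to register torch.npu when present)
    except ImportError:
        pass
    if torch.cuda.is_available():
        return "cuda:0"
    # Same guard as the patched branch: check the attribute before calling it.
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "npu:0"
    return "cpu"
```

The `hasattr` guard keeps the check safe on machines where the NPU plugin is not installed.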

③ Speaker cache optimization

The encoding results for the same speaker reference audio are reused, which avoids repeated computation.
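The reuse idea can be sketched as a small cache keyed by the content of the reference audio. This is an illustration only, not the code in the patch: the class name `SpeakerCache`, the `encode_fn` callback, the SHA-256 key, and the LRU eviction limit are all assumptions.

```python
import hashlib
from collections import OrderedDict


class SpeakerCache:
    """Small LRU cache mapping reference-audio bytes to their encoded features."""

    def __init__(self, encode_fn, max_entries=32):
        self._encode_fn = encode_fn        # hypothetical: the expensive speaker encoder
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, audio_bytes):
        # Key on content rather than file name so renamed copies still hit the cache.
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        features = self._encode_fn(audio_bytes)
        self._entries[key] = features
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return features
```

Bounding the cache matters in a long-running service, because encoded speaker features can be large and would otherwise accumulate for every distinct reference clip.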

④ Audio loading replacement

librosa.load is replaced with soundfile, which speeds up loading and avoids NPU compatibility issues.

⑤ Enhanced error logging

logger.error(f"TTS inference failed:\n{tb_str}")
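The log call above needs a traceback string. One common way to produce it is `traceback.format_exc()` inside an `except` block, as in this minimal sketch; the wrapper name `run_with_logged_errors` and the logger name are assumptions, and the patch may obtain `tb_str` differently.

```python
import logging
import traceback

logger = logging.getLogger("tts")


def run_with_logged_errors(fn, *args, **kwargs):
    """Call fn; on failure, log the full traceback and re-raise the exception."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        tb_str = traceback.format_exc()  # full stack trace of the active exception
        logger.error(f"TTS inference failed:\n{tb_str}")
        raise
```

Re-raising keeps the failure visible to the caller, so the API layer can still return an error response while the log retains the full stack.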

3. indextts/gpt/model_vllm_v2.py

① GPT2 configuration

attn_implementation="eager"

Flash attention is disabled on NPU; eager mode is used to ensure compatibility.

4. patch_vllm.py

vllm-ascend GPUModelRunner patch

The patch first calls the original _prepare_inputs and then appends the position_ids offset logic that is specific to GPT2TTSModel:

import numpy as np
from vllm_ascend.worker.model_runner_v1 import GPUModelRunner
# GPT2TTSModel (the project's TTS model class) must also be in scope here.

_original_prepare_inputs = GPUModelRunner._prepare_inputs

def _prepare_inputs(self, scheduler_output, num_scheduled_tokens):
    result = _original_prepare_inputs(self, scheduler_output, num_scheduled_tokens)

    model = self.get_model()
    if isinstance(model, GPT2TTSModel):
        # Position correction for GPT2TTSModel: shift each request's positions by -(prompt_len - 1)
        total_num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens
        num_reqs = self.input_batch.num_reqs
        req_indices = np.repeat(self.arange_np[:num_reqs], num_scheduled_tokens)

        prompt_tokens_offset = []
        for req_id in self.input_batch.req_ids:
            prompt_tokens_offset.append(-(len(self.requests[req_id].prompt_token_ids) - 1))
        positions_np = self.positions.np[:total_num_scheduled_tokens]
        np.add(np.array(prompt_tokens_offset)[req_indices],
                positions_np,
                out=positions_np)
        self.positions.copy_to_gpu(total_num_scheduled_tokens)

    return result

GPUModelRunner._prepare_inputs = _prepare_inputs
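The offset arithmetic in the patch can be checked by hand on a small batch. The sketch below reproduces only that arithmetic with plain lists instead of NumPy buffers; `shift_positions` and its arguments are hypothetical names, and it does not touch any vLLM state.

```python
def shift_positions(positions, prompt_lens, num_scheduled_tokens):
    """Subtract (prompt_len - 1) from every scheduled position of each request.

    positions: flat list of positions, grouped by request in batch order.
    prompt_lens: prompt length of each request.
    num_scheduled_tokens: number of tokens scheduled for each request in this step.
    """
    shifted = []
    cursor = 0
    for prompt_len, count in zip(prompt_lens, num_scheduled_tokens):
        offset = -(prompt_len - 1)  # same offset as prompt_tokens_offset in the patch
        for pos in positions[cursor:cursor + count]:
            shifted.append(pos + offset)
        cursor += count
    return shifted


# Two decoding requests with prompt lengths 5 and 3, each scheduling one token
# at the position right after its prompt.
print(shift_positions([5, 3], [5, 3], [1, 1]))  # → [1, 1]
```

With this offset, the last prompt token maps to position 0 and the first generated token to position 1, whatever the prompt length.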

5. indextts/s2mel/modules/flow_matching.py

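CFG cache optimization

During CFM inference, the full CFG (Classifier-Free Guidance) is computed only once every N steps; the intermediate steps reuse the cached result, which significantly reduces computation:

# Compute the full CFG once every cfg_cache_interval steps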
need_full_cfg = (cfg_cache_interval <= 1) or (step % cfg_cache_interval == 1)
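To see which steps the condition above selects, the sketch below evaluates it over a whole sampling run. It assumes that steps are numbered from 1; `needs_full_cfg` and `full_cfg_steps` are hypothetical helpers, and what exactly is cached between full computations is not shown here.

```python
def needs_full_cfg(step, cfg_cache_interval):
    """Same condition as above: True when this step recomputes the full CFG."""
    return (cfg_cache_interval <= 1) or (step % cfg_cache_interval == 1)


def full_cfg_steps(num_steps, cfg_cache_interval):
    """List the 1-based steps that run the full CFG computation."""
    return [s for s in range(1, num_steps + 1) if needs_full_cfg(s, cfg_cache_interval)]


print(full_cfg_steps(10, 3))  # → [1, 4, 7, 10]
print(full_cfg_steps(4, 1))   # → [1, 2, 3, 4]
```

With an interval of 3 over 10 steps, only 4 of the 10 steps pay for the full computation; an interval of 1 or less disables the cache.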

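Performance

RTF (Real-Time Factor) = inference time / generated audio duration. It measures synthesis speed: RTF < 1 means synthesis is faster than real time, and smaller values mean better performance.

Hardware   Generated audio duration   Inference time   RTF
A2         3.81s                      2.95s            0.77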

Time breakdown by stage:

Stage                                  Time    Share
gpt_gen_time (GPT decode generation)   1.53s   51.9%
s2mel_time (CFM diffusion)             0.75s   25.4%
bigvgan_time (vocoder)                 0.08s   2.7%
Other (audio loading, encoding, etc.)  0.59s   20%
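The reported figures can be reproduced from the raw timings. The short sketch below recomputes the RTF and the per-stage shares from the numbers in the two tables; the variable and function names are illustrative only.

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF = inference time / generated audio duration."""
    return inference_seconds / audio_seconds


# Per-stage timings in seconds, copied from the breakdown table.
stage_seconds = {
    "gpt_gen_time": 1.53,
    "s2mel_time": 0.75,
    "bigvgan_time": 0.08,
    "other": 0.59,
}

total = sum(stage_seconds.values())
shares = {name: round(100 * t / total, 1) for name, t in stage_seconds.items()}
rtf = round(real_time_factor(2.95, 3.81), 2)

print(round(total, 2))  # → 2.95
print(rtf)              # → 0.77
print(shares)
```

The stage timings sum to the 2.95s total, and GPT decoding accounts for roughly half of it.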

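Main optimizations:

  • vLLM engine parameter tuning (optimization_level, fuse_rope_kvcache, prefix_caching)
  • Speaker cache that reuses the encoding results for the same speaker
  • CFG cache that reduces the computation spent on CFM diffusion steps
  • Environment variable tuning (OMP_NUM_THREADS, ATB_USE_TILING_COPY_STREAM, etc.)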