Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM
Contributed by Yuekai Zhang (NVIDIA).
Quick Start
Launch the service directly with Docker Compose:
docker compose -f docker-compose.cosyvoice3.yml up
Build the Docker Image
To build the image from scratch:
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
Run a Docker Container
your_mount_dir=/mnt:/mnt  # host_path:container_path
docker run -it --name "cosyvoice-server" --gpus all --net host -v "$your_mount_dir" --shm-size=2g soar97/triton-cosyvoice:25.06
Understanding run_cosyvoice3.sh
The run_cosyvoice3.sh script orchestrates the entire workflow through numbered stages.
You can run a subset of stages with:
bash run_cosyvoice3.sh <start_stage> <stop_stage>
- <start_stage>: The stage to start from.
- <stop_stage>: The stage to stop after.
Stages:
- Stage -1: Clones the CosyVoice repository.
- Stage 0: Downloads the Fun-CosyVoice3-0.5B-2512 model and its HuggingFace LLM checkpoint.
- Stage 1: Converts the HuggingFace checkpoint for the LLM to the TensorRT-LLM format and builds the TensorRT engines.
- Stage 2: Creates the Triton model repository, including configurations for cosyvoice3, token2wav, vocoder, audio_tokenizer, and speaker_embedding.
- Stage 3: Launches the Triton Inference Server for the Token2Wav module and uses trtllm-serve to deploy the CosyVoice3 LLM.
- Stage 4: Runs the gRPC benchmark client for performance testing.
- Stage 5: Runs the offline TTS inference benchmark test.
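The start/stop arguments select an inclusive range of stages. The sketch below illustrates that gating logic in isolation. It is an assumption about how such a script is commonly structured, not code copied from run_cosyvoice3.sh, and the helper names should_run and stages_for are hypothetical.

```shell
# Hypothetical helpers, not taken from run_cosyvoice3.sh.
# Succeeds (exit status 0) when stage n lies within [start_stage, stop_stage].
should_run() {
  local n=$1 start_stage=$2 stop_stage=$3
  [ "$start_stage" -le "$n" ] && [ "$stop_stage" -ge "$n" ]
}

# Print the stages that a given range would select (stages are numbered -1 through 5).
stages_for() {
  local start_stage=$1 stop_stage=$2 n out=""
  for n in -1 0 1 2 3 4 5; do
    if should_run "$n" "$start_stage" "$stop_stage"; then
      out="$out $n"
    fi
  done
  echo "${out# }"
}

stages_for 0 3   # prints: 0 1 2 3
stages_for 4 4   # prints: 4
```

Under this reading, passing the same number twice runs exactly one stage, and passing -1 5 would run everything from cloning the repository to the offline benchmark.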
Export Models and Launch Server
Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
# This command runs stages 0, 1, 2, and 3
bash run_cosyvoice3.sh 0 3
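Before running a benchmark it can help to wait until the server actually answers. The following is a minimal sketch of a generic retry helper. The name wait_until is hypothetical, and the URL in the commented usage line assumes Triton's default HTTP port (8000), which may not match the ports configured in stage 3.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up: the command never succeeded
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Assumption: Triton's HTTP endpoint listens on localhost:8000; adjust to your setup.
# wait_until 60 5 curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```

The probe is passed as an ordinary command, so the same helper could poll the trtllm-serve endpoint as well once its address is known.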
Benchmark with client-server mode
To benchmark the running Triton server, run stage 4:
bash run_cosyvoice3.sh 4 4
# You can customize parameters such as the number of tasks inside the script.
The following results were obtained by decoding on a single L20 GPU.
Streaming TTS (Concurrent Tasks = 4)
First Chunk Latency
| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
|---|---|---|---|---|---|
| 4 | 750.42 | 740.31 | 941.05 | 977.55 | 1002.37 |
Benchmark with offline inference mode
To benchmark offline inference mode, run stage 5:
bash run_cosyvoice3.sh 5 5
Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)
| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF |
|---|---|---|---|---|---|
| TRTLLM | 1 | 13.21 | 5.72 | 19.48 | 0.1091 |
| TRTLLM | 2 | 8.46 | 6.02 | 14.91 | 0.0822 |
| TRTLLM | 4 | 5.07 | 5.95 | 11.43 | 0.0630 |
| TRTLLM | 8 | 2.98 | 6.11 | 9.53 | 0.0562 |
| TRTLLM | 16 | 2.12 | 6.27 | 8.83 | 0.0501 |
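To make the batching trend in the table easier to read, the sketch below computes speedups relative to batch size 1. The rows are copied from the table above. The speedups helper is hypothetical and is not part of the benchmark scripts.

```shell
# Rows copied from the offline table: batch_size llm_time pipeline_time (seconds).
rows='1 13.21 19.48
2 8.46 14.91
4 5.07 11.43
8 2.98 9.53
16 2.12 8.83'

# Hypothetical helper: print batch size, LLM speedup, and pipeline speedup
# relative to the first row (batch size 1).
speedups() {
  awk 'NR == 1 { base_llm = $2; base_pipe = $3 }
       { printf "%d %.2f %.2f\n", $1, base_llm / $2, base_pipe / $3 }'
}

printf '%s\n' "$rows" | speedups
```

Going from batch size 1 to 16, llm_time improves by roughly 6.2x while pipeline_time improves by roughly 2.2x, because token2wav_time stays near 6 s at every batch size and comes to dominate the pipeline.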