XSched Integration for NVIDIA Triton Inference Server

This example shows how XSched can be integrated into NVIDIA Triton Inference Server to enable priority-based scheduling of multi-model inference tasks.

Basic Idea

Triton Server lets users set a priority for each served model. We modify Triton's TensorRT-Backend to create an XQueue for each CUDA stream, and each XQueue inherits the priority of the model that uses the stream. These XQueues are then scheduled by the local scheduler under the highest-priority-first (HPF) policy. With XSched, inference tasks of models on higher-priority XQueues can preempt those on lower-priority XQueues (Triton itself does not currently support task preemption), which can significantly reduce the latency of the higher-priority models.
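
To make the policy concrete, the following shell sketch picks the next queue under a highest-priority-first rule. It is only an illustration and not XSched's implementation: the `pick_next_queue` function, the queue names, and the numeric priorities (a larger number means more urgent) are all hypothetical.

```shell
# Illustrative only: choose the next queue under a highest-priority-first rule.
# Input (stdin): one ready queue per line, "<queue_name> <priority>".
# A larger number means a higher priority (an assumption of this sketch).
pick_next_queue() {
  # Sort by priority, descending; -s keeps arrival order among equal priorities.
  sort -s -k2,2nr | head -n 1 | cut -d' ' -f1
}
```

For example, `printf 'resnet_low 0\nbert_high 2\n' | pick_next_queue` selects `bert_high`. Under a preemptive scheduler, a queue selected this way would also displace a lower-priority queue that is currently running.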

Usage

Apply Integration Patch

# commit id: 7f94a8ee1daab23046ef4d689bd56411101f207c
git clone https://github.com/triton-inference-server/tensorrt_backend -b r22.06
cd tensorrt_backend
git apply <xsched_dir>/integration/triton/trt-xsched-2206.patch
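
Because the patch targets one specific commit of the r22.06 branch, it can help to verify that it applies cleanly before touching the working tree. The sketch below relies on `git apply --check`, which validates a patch without applying it; the `apply_patch_checked` helper name is hypothetical.

```shell
# Hypothetical helper: apply a patch only if it applies cleanly.
# Usage: apply_patch_checked <repo_dir> <patch_file>
# Pass an absolute patch path, because `git -C` changes directory first.
apply_patch_checked() {
  local repo_dir="$1" patch_file="$2"
  # --check reports whether the patch would apply, without modifying any files.
  if ! git -C "$repo_dir" apply --check "$patch_file"; then
    echo "patch does not apply cleanly to $repo_dir" >&2
    return 1
  fi
  git -C "$repo_dir" apply "$patch_file"
}
```

If the check fails, a likely cause is that the checkout is not at the commit noted above.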

Build XSched

# Set up the Triton Server container and mount the xsched and tensorrt_backend directories (replace `<xsched_dir>` and `<tensorrt_backend_dir>` with your own locations)
docker run --privileged -itd --name xsched-triton-server --gpus all --net=host -v <xsched_dir>:/xsched -v <tensorrt_backend_dir>:/root/tensorrt_backend nvcr.io/nvidia/tritonserver:22.06-py3 bash
docker exec -it xsched-triton-server bash

# Install Miniconda in the container
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

source ~/miniconda3/bin/activate
conda init --all

# Install dependencies: CMake 3.17+ and RapidJSON are required
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda install -y cmake rapidjson

# Build XSched
cd /xsched
make cuda
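
The following checks can catch two common setup problems: a CMake older than 3.17, and a missing XSched CMake package directory after the build. This is a sketch under stated assumptions: the helper names are hypothetical, `sort -V` assumes GNU coreutils, and the default path `/xsched/output/lib/cmake/XSched` is taken from the `-DXSched_DIR` argument used when building the TensorRT-Backend.

```shell
# Hypothetical helpers for sanity-checking the build environment.

# version_ge A B: succeeds if version A >= version B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_cmake MIN: succeeds if the cmake on PATH is at least version MIN.
check_cmake() {
  local min="$1" found
  # `cmake --version` prints "cmake version X.Y.Z" on its first line.
  found="$(cmake --version 2>/dev/null | head -n 1 | awk '{print $3}')"
  [ -n "$found" ] && version_ge "$found" "$min"
}

# check_xsched_output [DIR]: succeeds if the XSched CMake package directory exists.
check_xsched_output() {
  [ -d "${1:-/xsched/output/lib/cmake/XSched}" ]
}
```

A possible use is `check_cmake 3.17 || echo "CMake 3.17+ is required"` before `make cuda`, and `check_xsched_output || echo "XSched build output not found"` after it.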

Build TensorRT-Backend

docker exec -it xsched-triton-server bash
source ~/miniconda3/bin/activate

# Build TensorRT-Backend
cd /root/tensorrt_backend

mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_COMMON_REPO_TAG=r22.06 -DTRITON_CORE_REPO_TAG=r22.06 -DTRITON_BACKEND_REPO_TAG=r22.06 -DXSched_DIR=/xsched/output/lib/cmake/XSched ..
make -j$(nproc)
make install
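
After `make install`, a quick check that the backend shared library landed where Triton will look for it can save a failed server start. This sketch assumes the usual Triton backend layout, `<prefix>/backends/tensorrt/libtriton_tensorrt.so`; that file name follows Triton's `libtriton_<backend>.so` convention and is not stated in this README, so adjust it if your build differs. The helper name is hypothetical.

```shell
# Hypothetical helper: locate the TensorRT backend library under an install prefix.
# Prints the path and succeeds if found; prints an error and fails otherwise.
find_trt_backend() {
  local prefix="${1:-./install}"
  local lib="$prefix/backends/tensorrt/libtriton_tensorrt.so"
  if [ -f "$lib" ]; then
    echo "$lib"
  else
    echo "backend library not found: $lib" >&2
    return 1
  fi
}
```

Run from the build directory, `ldd "$(find_trt_backend ./install)"` then lists the shared libraries the backend links against, which is one way to inspect whether the XSched libraries are being picked up.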

Run Triton Server

Set the priority of each model in its config.pbtxt, for example:

optimization {
  priority: PRIORITY_MAX
}

or,

optimization {
  priority: PRIORITY_MIN
}
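
As a sketch of how models in one repository could be given different priorities, the helper below appends an `optimization` block to a model's `config.pbtxt`. The helper name and the model directory names in the usage note are hypothetical; the accepted values are the three members of Triton's `ModelPriority` enum (`PRIORITY_DEFAULT`, `PRIORITY_MAX`, `PRIORITY_MIN`). It assumes the file does not already contain an `optimization` block and refuses to add a second one.

```shell
# Hypothetical helper: append a priority setting to a model's config.pbtxt.
# Usage: set_model_priority <config.pbtxt> <PRIORITY_DEFAULT|PRIORITY_MAX|PRIORITY_MIN>
set_model_priority() {
  local config="$1" priority="$2"
  case "$priority" in
    PRIORITY_DEFAULT|PRIORITY_MAX|PRIORITY_MIN) ;;
    *) echo "unknown priority: $priority" >&2; return 1 ;;
  esac
  # Refuse to add a second optimization block; edit the existing one by hand.
  if grep -q '^[[:space:]]*optimization' "$config" 2>/dev/null; then
    echo "$config already has an optimization block" >&2
    return 1
  fi
  # The leading newline protects against a file that lacks a trailing newline.
  printf '\noptimization {\n  priority: %s\n}\n' "$priority" >> "$config"
}
```

For instance, `set_model_priority <model_repo_dir>/latency_critical/config.pbtxt PRIORITY_MAX` and `set_model_priority <model_repo_dir>/best_effort/config.pbtxt PRIORITY_MIN` would mark one model as high priority and another as low priority.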

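Then start the server from the build directory (/root/tensorrt_backend/build), so that ./install/backends resolves to the backend installed above: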

export LD_LIBRARY_PATH=/xsched/output/lib:$LD_LIBRARY_PATH
export XSCHED_POLICY=HPF
tritonserver --backend-directory ./install/backends --model-repository=<model_repo_dir> --strict-model-config false
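
Once the server is starting up, a client script may want to wait until it is ready before sending requests. The sketch below polls Triton's HTTP readiness endpoint, `/v2/health/ready`, with `curl`. The helper name is hypothetical, and the default URL assumes Triton's default HTTP port 8000 on the local host.

```shell
# Hypothetical helper: poll Triton's HTTP readiness endpoint until it succeeds.
# Usage: wait_for_triton [base_url] [max_attempts] [delay_seconds]
wait_for_triton() {
  local base_url="${1:-http://localhost:8000}" attempts="${2:-30}" delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; the endpoint returns 200 when ready.
    if curl -sf -o /dev/null "$base_url/v2/health/ready"; then
      echo "server ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after $attempts attempts" >&2
  return 1
}
```

Calling `wait_for_triton` with no arguments tries for roughly 30 seconds and returns a non-zero status if the server never reports ready.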

Example

See the inference serving example for concrete details.