rasko1Init: upload initial source code for xsched

a2f7f099创建于 2025年11月28日历史提交

文件	最后提交记录	最后更新时间
BERT	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
client	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
model-repo	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
python	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
results	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
server	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
.gitignore	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前
README.md	Init: upload initial source code for xsched Signed-off-by: Guotianyi <guotianyi@sjtu.edu.cn>	6 个月前

Use XSched to Schedule Tasks inside one Inference Server Process

This example demonstrates how XSched can be integrated into Triton Inference Server to enable priority-based scheduling of multi-model inference tasks. This example is one of the experiments (Figure 15a) in the XSched paper and is included in the XSched Artifacts.

The example requires three docker images:

TensorRT docker image: nvcr.io/nvidia/tensorrt:22.06-py3
Triton Server docker image: nvcr.io/nvidia/tritonserver:22.06-py3
Triton Client docker image: nvcr.io/nvidia/tritonserver:22.06-py3-sdk

Steps:

Download the BERT model from NGC
Build TensorRT model
Build TensorRT-Backend for Triton Server
Run Triton Client (Standalone, Triton, Triton + Priority Config)
Build Triton Server with XSched & Run Triton Client (XSched)
Plot the results

Basic Idea

Triton Server allows users to set the priority of each serving model, see config.pbtxt for example. With only 10 lines of code change, we modify the TensorRT-Backend to create an XQueue for each CUDA stream and inherit the priority of the model using the stream. Then, we use local scheduler and highest priority first policy to schedule these XQueues, similar to the Intra-Process Scheduling Example. With XSched, the inference tasks of the model using the higher-priority XQueues can preempt the lower-priority ones (Triton does not support task preemption currently), so that their latencies can be significantly reduced.

Download the BERT model from NGC

bash BERT/download.sh

Build TensorRT model

Convert the TensorFlow model to TensorRT executable model.

# Use TensorRT 8.5 docker image and mount the xsched directory (replace `<xsched_dir>` with your own location)
docker run --privileged -it --rm --gpus all --net=host -v <xsched_dir>:/xsched nvcr.io/nvidia/tensorrt:22.06-py3 bash

# Install dependencies
python3 -m pip install --upgrade pip
pip install onnx==1.16.0 tensorflow==2.13.0

# Build TensorRT model
cd /xsched/examples/5_infer_serving

python python/builder.py -m BERT/bert_large_tf1_ckpt_mode-qa_ds-squad11_insize-384_19.03.1/model.ckpt -b 1 -s 384 -c BERT/bert_large_tf1_ckpt_mode-qa_ds-squad11_insize-384_19.03.1/ --fp16 -o model.plan

# Copy the model to the model repository
cp model.plan /xsched/examples/5_infer_serving/model-repo/bert-high/1/
cp model.plan /xsched/examples/5_infer_serving/model-repo/bert-low/1/
cp model.plan /xsched/examples/5_infer_serving/model-repo/bert-norm/1/

Build TensorRT-Backend for Triton Server

# Setup Triton Server container and mount the xsched directory (replace `<xsched_dir>` with your own location)
docker run --privileged -itd --name xsched-triton-server --gpus all --net=host -v <xsched_dir>:/xsched nvcr.io/nvidia/tritonserver:22.06-py3 bash
docker exec -it xsched-triton-server bash

# Install Miniconda in the container
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

source ~/miniconda3/bin/activate
conda init --all

# Install dependencies
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda install -y cmake rapidjson

# Build TensorRT-Backend for Triton Server

cd /root

# commit id: 7f94a8ee1daab23046ef4d689bd56411101f207c
git clone https://github.com/triton-inference-server/tensorrt_backend -b r22.06
cd tensorrt_backend

mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_COMMON_REPO_TAG=r22.06 -DTRITON_CORE_REPO_TAG=r22.06 -DTRITON_BACKEND_REPO_TAG=r22.06 ..
make -j$(nproc)
make install

# Run Triton Server
tritonserver --backend-directory ./install/backends --model-repository=/xsched/examples/5_infer_serving/model-repo/ --strict-model-config false

# You can see the output of the server in the container terminal, like this:
# +----------+---------------------------------------------------+--------------------------------------------------------------------------------------------------+
# | Backend  | Path                                              | Config                                                                                           |
# +----------+---------------------------------------------------+--------------------------------------------------------------------------------------------------+
# | tensorrt | ./install/backends/tensorrt/libtriton_tensorrt.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory |
# |          |                                                   | ":"./install/backends","default-max-batch-size":"4"}}                                            |
# +----------+---------------------------------------------------+--------------------------------------------------------------------------------------------------+

# I0608 15:29:19.380337 2941 server.cc:626] 
# +-----------+---------+--------+
# | Model     | Version | Status |
# +-----------+---------+--------+
# | bert-high | 1       | READY  |
# | bert-low  | 1       | READY  |
# | bert-norm | 1       | READY  |
# +-----------+---------+--------+

# I0608 15:29:19.391999 2941 metrics.cc:650] Collecting metrics for GPU 0: Quadro GV100
# I0608 15:29:19.392079 2941 tritonserver.cc:2159] 
# +----------------------------------+------------------------------------------------------------------------------------------------------------------------------+
# | Option                           | Value                                                                                                                        |
# +----------------------------------+------------------------------------------------------------------------------------------------------------------------------+
# | server_id                        | triton                                                                                                                       |
# | server_version                   | 2.23.0                                                                                                                       |
# | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shar |
# |                                  | ed_memory cuda_shared_memory binary_tensor_data statistics trace                                                             |
# | model_repository_path[0]         | /xsched/examples/5_infer_serving/model-repo/                                                                                 |
# | model_control_mode               | MODE_NONE                                                                                                                    |
# | strict_model_config              | 0                                                                                                                            |
# | rate_limit                       | OFF                                                                                                                          |
# | pinned_memory_pool_byte_size     | 268435456                                                                                                                    |
# | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                     |
# | response_cache_byte_size         | 0                                                                                                                            |
# | min_supported_compute_capability | 6.0                                                                                                                          |
# | strict_readiness                 | 1                                                                                                                            |
# | exit_timeout                     | 30                                                                                                                           |
# +----------------------------------+------------------------------------------------------------------------------------------------------------------------------+

# I0608 15:29:19.396067 2941 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
# I0608 15:29:19.396875 2941 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
# I0608 15:29:19.437883 2941 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Run Triton Client (Standalone, Triton, Triton + Priority Config)

# mount the xsched directory (replace `<xsched_dir>` with your own location)
# Evaluate the standalone performance
docker run --privileged -it --rm --gpus all --net=host -v <xsched_dir>:/xsched nvcr.io/nvidia/tritonserver:22.06-py3-sdk bash -c "cd /xsched/examples/5_infer_serving; bash client/run.sh standalone"

# Evaluate the performance with triton server
docker run --privileged -it --rm --gpus all --net=host -v <xsched_dir>:/xsched nvcr.io/nvidia/tritonserver:22.06-py3-sdk bash -c "cd /xsched/examples/5_infer_serving; bash client/run.sh triton"

# Evaluate the performance with triton + priority config
docker run --privileged -it --rm --gpus all --net=host -v <xsched_dir>:/xsched nvcr.io/nvidia/tritonserver:22.06-py3-sdk bash -c "cd /xsched/examples/5_infer_serving; bash client/run.sh triton-p"

Results are saved in <xsched_dir>/examples/5_infer_serving/results/raw.

Build Triton Server with XSched & Run Triton Client (XSched)

Let's go back to the Triton Server container (xsched-triton-server) and build XSched (remember to stop the triton server by typing Ctrl-C).

# Build XSched

cd /xsched
make cuda


# Integrate XSched into Triton Server TensorRT-Backend
cd /root/tensorrt_backend

git apply /xsched/examples/5_infer_serving/server/triton-backend-xsched.patch

# Rebuild the TensorRT-Backend
cd build
cmake -DXSched_DIR=/xsched/output/lib/cmake/XSched ..
make
make install

Now, let's start the Triton Server again with XSched.

export LD_LIBRARY_PATH=/xsched/output/lib:$LD_LIBRARY_PATH
export XSCHED_POLICY=HPF
tritonserver --backend-directory ./install/backends --model-repository=/xsched/examples/5_infer_serving/model-repo/ --strict-model-config false

Finally, let's run the evaluation script again.

docker run --privileged -it --rm --gpus all --net=host -v <xsched_dir>:/xsched nvcr.io/nvidia/tritonserver:22.06-py3-sdk bash -c "cd /xsched/examples/5_infer_serving; bash client/run.sh xsched"

Plot the results

Raw results are under <xsched_dir>/examples/5_infer_serving/results/raw. To plot the figure in the paper, please install gnuplot and epstopdf.

sudo apt update
sudo apt install gnuplot texlive-font-utils

# process data and plot
python3 <xsched_dir>/examples/5_infer_serving/results/scripts/plot.py

Figure 13a are under <xsched_dir>/examples/5_infer_serving/results/plot.