vLLM Serving Prometheus Metric Monitoring Tool User Guide
Introduction
This metric monitoring tool enhances the native monitoring capability of the vLLM-Ascend inference framework. While vLLM-Ascend provides basic metrics out of the box, this tool adds the following capabilities:
- KVCache monitoring: tracks total blocks, idle block count, and the number of blocks allocated per DP domain.
- Token and throughput: monitors the number of input/output tokens and total tokens per DP domain.
- Custom metric: adds a timer metric for the execution duration of any function.
Supported Products
Note
For details about Ascend product models, see Ascend Product Models.
| Product Type | Supported (Yes/No) |
|---|---|
| Atlas A3 training products and Atlas A3 inference products | Yes |
| Atlas A2 training products and Atlas A2 inference products | Yes |
| Atlas 200I/500 A2 inference products | Yes |
| Atlas inference products | Yes |
| Atlas training products | No |
Preparations
Environment Setup
- In the Ascend environment, install the matching CANN Toolkit and ops operator packages, and configure CANN environment variables. For details, see CANN Installation Guide.
- Install vLLM and vLLM-Ascend. Verify that vLLM-Ascend runs properly and that the metrics endpoint is accessible. For details, see vLLM-Ascend Installation Guide.
- Build the .run package from the source code and upgrade the tool. For details, see the section Upgrade in msServiceProfiler Installation Guide.
Restrictions
- Version compatibility: Ensure that vLLM-Ascend, CANN, and collection tool versions meet the requirements in the Appendix.
- Resource usage: Data monitoring requires enabling Prometheus multi-process mode (PROMETHEUS_MULTIPROC_DIR). This may impact inference performance.
- Function restrictions: Some advanced features may require specific vLLM-Ascend versions.
Instructions
Quick Start
Follow these steps to complete the metric monitoring process:
- Set environment variables and start the service (with Prometheus multi-process mode enabled).
- Enable the collection function by modifying the metric_enable field in the configuration file (independent of the enable field).
- Send an inference request.
- View metrics by accessing the metrics endpoint or Grafana.
Step 1: Environment Variable Setup and Service Startup
Before starting the inference service, set the following environment variables:
| Environment Variable | Description |
|---|---|
| SERVICE_PROF_CONFIG_PATH | Path to the performance profiling configuration file. |
| METRIC_SYMBOLS_PATH | (Optional) Path to the symbol/event tracking configuration file. If the path is not specified, the default configuration is used. |
| PROMETHEUS_MULTIPROC_DIR | (Mandatory) Prometheus multi-process mode directory (an empty directory must be created in advance). |
cd ${path_to_store_config_files}
export SERVICE_PROF_CONFIG_PATH=ms_service_profiler_config.json
export METRIC_SYMBOLS_PATH=service_metrics_symbols.yaml
# (Mandatory) Enable the Prometheus multi-process mode.
mkdir -p /path/to/your/prometheus/dir
export PROMETHEUS_MULTIPROC_DIR=/path/to/your/prometheus/dir
# Start the vLLM service.
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
- ms_service_profiler_config.json is the collection configuration file (shared with performance profiling). metric_enable controls Prometheus metric collection, and enable controls performance profiling. If the file does not exist, the default configuration is automatically generated. For details, see Data Collection.
- service_metrics_symbols.yaml configures symbols. For details about how to customize symbols, see [Symbol Configuration User Guide](#Symbol Configuration User Guide).
- 8000 is the default port for vLLM serving inference startup. This document uses port 8000 as an example. To change the serving startup port, you can use the --port command line parameter to specify a port when starting the vLLM service. For details, see vllm serve Command Parameter Description.
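Because PROMETHEUS_MULTIPROC_DIR must point to an empty directory that exists before the service starts, a small pre-flight check can catch a stale or missing directory early. The following is a minimal sketch, not part of the tool: the helper name prepare_multiproc_dir is hypothetical, and the demo uses a temporary directory rather than a real deployment path.

```python
import tempfile
from pathlib import Path


def prepare_multiproc_dir(path: str) -> Path:
    """Create the directory if needed and verify that it is empty.

    Hypothetical helper for illustration; it is not shipped with the tool.
    """
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    leftovers = sorted(entry.name for entry in directory.iterdir())
    if leftovers:
        # Files left over from an earlier run would violate the
        # "empty directory" requirement stated in the table above.
        raise RuntimeError(f"{directory} is not empty: {leftovers[:5]}")
    return directory


if __name__ == "__main__":
    # Demo only: a throwaway location stands in for /path/to/your/prometheus/dir.
    demo_dir = Path(tempfile.mkdtemp()) / "prometheus"
    print(prepare_multiproc_dir(str(demo_dir)))
```

Running the check before `vllm serve` and exporting the returned path as PROMETHEUS_MULTIPROC_DIR keeps the two steps consistent.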
Step 2: Profiler Collection Enabling
In this tool, metric_enable in the ms_service_profiler_config.json configuration file controls metric collection. metric_enable is independent of the enable field (which controls performance profiling), so you can enable them separately or together.
To enable metric collection, set metric_enable in ms_service_profiler_config.json to 1. If the field is not set or does not exist, it defaults to 0 and metric collection is disabled. If the JSON file does not contain the metric_enable field, add it manually with the value 1. The following command changes an existing "metric_enable": 0 entry to 1:
sed -i 's/"metric_enable":\s*0/"metric_enable": 1/' ./ms_service_profiler_config.json
The modification takes effect without restarting the service. When the function is enabled or disabled, the corresponding information (for example, "Metric collection enabled" or "Metric collection disabled") is recorded in the log.
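The sed command only rewrites an existing "metric_enable": 0 entry, so it has no effect when the field is missing. As an alternative, the sketch below sets the field whether or not it is already present. It assumes that metric_enable is a top-level key of the JSON object, which the source does not state explicitly, and the function name set_metric_enable is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def set_metric_enable(config_path: str, enabled: bool = True) -> dict:
    """Set metric_enable to 1 or 0, adding the field when it is missing.

    Assumption: metric_enable lives at the top level of the JSON object.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object at the top level")
    config["metric_enable"] = 1 if enabled else 0
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Demo on a throwaway copy; point it at ms_service_profiler_config.json in practice.
    demo = Path(tempfile.mkdtemp()) / "ms_service_profiler_config.json"
    demo.write_text('{"enable": 0}\n', encoding="utf-8")
    print(set_metric_enable(str(demo)))
```

The function deliberately fails if the file does not exist, so that the automatically generated default configuration is not replaced by a file containing only one field.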
Step 3: Request Sending
Send an inference request to generate monitoring data.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a",
"max_tokens": 5,
"temperature": 0
}' | python3 -m json.tool
Step 4: Metrics Viewing
Obtain data through the metrics endpoint of the vLLM service.
# Replace localhost:8000 with the serving IP address and port.
curl -s http://localhost:8000/metrics
You can also configure Prometheus to scrape metrics from this endpoint and visualize the metrics using Grafana.
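When a full Prometheus and Grafana setup is not yet available, the endpoint can also be read from a script. The sketch below fetches the text exposition with the Python standard library and keeps only the sample lines that carry the vllm_profiling_ prefix. The function names are hypothetical, and the URL is the example address used in this document.

```python
from urllib.request import urlopen

PREFIX = "vllm_profiling_"


def profiling_lines(metrics_text: str, prefix: str = PREFIX) -> list:
    """Return the sample lines whose metric name starts with the prefix.

    Comment lines (# HELP / # TYPE) start with '#', so they are dropped.
    """
    return [line for line in metrics_text.splitlines() if line.startswith(prefix)]


def fetch_metrics(url: str = "http://localhost:8000/metrics", timeout: float = 5.0) -> str:
    """Download the raw metrics text; requires a running vLLM service."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the service running, `profiling_lines(fetch_metrics())` returns the tool's metrics without the native vLLM ones.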
Symbol Configuration User Guide
Function
The symbol configuration file defines the functions/methods to be monitored, and supports flexible configuration and custom attribute collection.
Precautions
- After updating the collection symbols, restart the vLLM service to load the new configuration.
- All custom metric names are automatically prefixed with vllm_profiling_ and must comply with Prometheus Metric and label naming.
Configuration Fields
| Field | Description | Example |
|---|---|---|
| symbol | Python import path + attribute chain (format: module:class.method) | vllm.v1.core.kv_cache_manager:KVCacheManager.free |
| min_version | Earliest compatible version | "0.9.1" |
| max_version | Latest compatible version | "0.11.0" |
| metrics | List of custom metrics. Currently, the timer type is supported (which measures function execution duration). | See the following example. |
Configuration Example
# ===== Custom Metrics =====
- symbol: vllm.entrypoints.openai.serving_chat:OpenAIServingChat.create_chat_completion
  min_version: "0.9.1"
  metrics:
    - name: server:create_chat_completion:duration
      type: timer
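To make the symbol format and the timer type more concrete, the following sketch shows one way a module:class.method string could be resolved and wrapped so that each call records its execution duration. This is an illustration under stated assumptions, not the tool's actual implementation: resolve_symbol and attach_timer are hypothetical names, durations are collected in a plain list instead of a Prometheus histogram, and the demo patches a standard-library method so that it runs without vLLM installed.

```python
import functools
import importlib
import json
import time


def resolve_symbol(symbol: str):
    """Split 'module:attr.chain' and return (owner, leaf_name, target)."""
    module_path, _, attr_chain = symbol.partition(":")
    if not module_path or not attr_chain:
        raise ValueError(f"expected 'module:class.method', got {symbol!r}")
    owner = importlib.import_module(module_path)
    *parents, leaf = attr_chain.split(".")
    for name in parents:
        owner = getattr(owner, name)
    return owner, leaf, getattr(owner, leaf)


def attach_timer(symbol: str, durations: list):
    """Wrap the target so each call appends its duration; return a restore callable.

    Limitation of this sketch: it handles plain functions and instance
    methods only, not classmethods or staticmethods.
    """
    owner, leaf, target = resolve_symbol(symbol)

    @functools.wraps(target)
    def timed(*args, **kwargs):
        start = time.perf_counter()
        try:
            return target(*args, **kwargs)
        finally:
            durations.append(time.perf_counter() - start)

    setattr(owner, leaf, timed)
    return lambda: setattr(owner, leaf, target)


if __name__ == "__main__":
    recorded = []
    restore = attach_timer("json.decoder:JSONDecoder.decode", recorded)
    json.loads('{"prompt": "Beijing is a"}')
    restore()
    print(f"calls timed: {len(recorded)}")
```

Recording in a `finally` block means that calls which raise an exception are still timed, which is one reasonable design choice for a duration metric.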
Result Description
Output Example
You can access the metrics endpoint of the vLLM service (for example, http://localhost:8000/metrics) to view the current metrics. The following is an example output for the custom timer metric server:create_chat_completion:duration, which can be visualized with tools such as Grafana.
# Replace localhost:8000 with the serving IP address and port.
curl -s http://localhost:8000/metrics | grep -E "server:create_chat_completion:duration"
# HELP vllm_profiling_server:create_chat_completion:duration Execution duration of server:create_chat_completion:duration
# TYPE vllm_profiling_server:create_chat_completion:duration histogram
vllm_profiling_server:create_chat_completion:duration_sum{dp="-1"} 15.44140076637268
vllm_profiling_server:create_chat_completion:duration_bucket{dp="-1",le="0.001"} 0.0
...
vllm_profiling_server:create_chat_completion:duration_bucket{dp="-1",le="0.2"} 1.0
...
vllm_profiling_server:create_chat_completion:duration_bucket{dp="-1",le="+Inf"} 9.0
vllm_profiling_server:create_chat_completion:duration_count{dp="-1"} 9.0
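A timer metric is exported as a histogram, so the average duration per call can be derived by dividing the _sum sample by the _count sample. The sketch below does this for text in the format shown above. The function name mean_duration is hypothetical, the parser is deliberately minimal (it assumes label values contain no closing brace), and samples from all dp label sets are added together.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r"^(?P<name>[^\s{]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$")


def mean_duration(metrics_text: str, metric: str) -> float:
    """Return sum / count for one histogram, aggregated over all label sets."""
    total = 0.0
    count = 0.0
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        if match["name"] == f"{metric}_sum":
            total += float(match["value"])
        elif match["name"] == f"{metric}_count":
            count += float(match["value"])
    if count == 0:
        raise ValueError(f"no observations found for {metric}")
    return total / count


SAMPLE = """\
vllm_profiling_server:create_chat_completion:duration_sum{dp="-1"} 15.44140076637268
vllm_profiling_server:create_chat_completion:duration_bucket{dp="-1",le="+Inf"} 9.0
vllm_profiling_server:create_chat_completion:duration_count{dp="-1"} 9.0
"""

if __name__ == "__main__":
    name = "vllm_profiling_server:create_chat_completion:duration"
    print(f"mean duration: {mean_duration(SAMPLE, name):.3f}")
```

For the sample output above, this is 15.44140076637268 divided by 9, or roughly 1.716 per call.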
Built-in Metric Symbols
The following metric symbols are built into the tool. All metric names are automatically prefixed with vllm_profiling_ and carry the dp label. In the scheduling process, dp is the actual DP domain ID; in the request process, dp is -1.
Metrics for scheduling and batch processing
| Metric Name | Type | Description |
|---|---|---|
| batch_size | Histogram | Number of requests that are being executed |
| waiting_batch_size | Histogram | Number of requests to be scheduled |
| num_spec_tokens | Gauge | Number of draft tokens in speculative decoding |
Token metrics
| Metric Name | Type | Description |
|---|---|---|
| total_tokens | Histogram | Sum of prompt and generation tokens in a single iteration |
| input | Counter | Number of tokens in the input prompt |
| output | Counter | Number of tokens in the generated output |
| second_token_latency | Histogram | Latency for generating the second token |
| fine_grained_ttft | Histogram | Fine-grained time to first token (TTFT) |
| fine_grained_tpot | Histogram | Fine-grained time per output token (TPOT) |
KVCache metrics
| Metric Name | Type | Description |
|---|---|---|
| total_kvcache_blocks | Gauge | Total number of KVCache blocks in the current DP domain. |
| free_kvcache_blocks | Gauge | Number of idle KVCache blocks in the current DP domain. |
| allocated_kvcache_blocks | Gauge | Number of blocks allocated to the KVCache in the current DP domain. |
Note: The KVCache usage can be approximately calculated as (1 - free_kvcache_blocks / total_kvcache_blocks) * 100%. This helps monitor memory usage and load balancing.
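The usage formula in the note can be applied per DP domain to compare load across domains. The following is a minimal sketch with made-up gauge values; the function names and the sample numbers are illustrative assumptions, not output from a real service.

```python
def kvcache_usage_percent(free_blocks: float, total_blocks: float) -> float:
    """Apply (1 - free / total) * 100 to one DP domain's gauge values."""
    if total_blocks <= 0:
        raise ValueError("total_kvcache_blocks must be positive")
    return (1.0 - free_blocks / total_blocks) * 100.0


def usage_spread(per_domain: dict) -> float:
    """Difference between the busiest and the least busy DP domain, in percentage points."""
    usages = [kvcache_usage_percent(free, total) for free, total in per_domain.values()]
    return max(usages) - min(usages)


if __name__ == "__main__":
    # Hypothetical readings: dp label -> (free_kvcache_blocks, total_kvcache_blocks)
    readings = {"0": (250.0, 1000.0), "1": (500.0, 1000.0)}
    for dp, (free, total) in readings.items():
        print(f"dp={dp}: {kvcache_usage_percent(free, total):.1f}% used")
    print(f"spread: {usage_spread(readings):.1f} percentage points")
```

A large spread between domains suggests uneven load, which is the load-balancing signal the note refers to.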
Relevant Documents
- [vLLM Service Profiler](./vLLM_service_oriented_performance_collection_tool.md): Performance Profiling and Trace Analysis
- [Data Collection Configuration Description](./msserviceprofiler_serving_tuning_instruct.md#Data Collection): ms_service_profiler_config.json Configuration Description