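Provides a routing solution for AI inference scenarios, including a base routing module, user configuration, and kv-aware routing strategies with resource and load awareness, in support of AI inference optimization.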

Hermes-router

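Hermes-router is an intelligent routing EPP (Endpoint Picker) component built on the K8s GIE (Gateway API Inference Extension) framework. It supports several advanced routing strategies, including kv cache aware, prediction, and bucket, and improves the efficiency and performance of large language model (LLM) inference services by routing each inference request to the most suitable backend instance.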

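Key Features

Designed around the K8s GIE framework, so it fits natively into the K8s gateway ecosystem, integrates with open-source gateways such as Istio, and can be added as a pluggable component to clusters that already run a gateway to improve AI inference performance.
Provides several routing strategies and supports multiple inference backend architectures, including aggregated and PD-disaggregated (Prefill/Decode) deployments.
- kv-cache-aware (PD aggregate/disaggregate): a multi-factor, KV Cache-aware routing strategy with customizable scoring functions. It significantly improves inference performance in workloads with high KV Cache reuse (tool and agent calls, multi-turn conversations, and so on).
- pd-bucket (PD disaggregate): a bucket-based scheduling strategy with customizable parameters. It improves inference throughput for mixed long and short requests at medium to high concurrency.
- prediction (PD aggregate/disaggregate): a multi-factor routing strategy based on compute saturation and latency prediction. It collects compute saturation metrics in real time and scores endpoints together with TTFT/TPOT latency predictions, significantly reducing tail latency under high NPU load.
Dynamic inference service discovery: inference backends can be added or removed at runtime, and routing adapts accordingly.
Disaster recovery through open-source gateways such as Istio.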
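
As a rough illustration of what a multi-factor, KV Cache-aware score with configurable weights can look like, the sketch below combines a prefix cache hit ratio with queue load and KV cache usage. The names `EndpointStats`, `Weights`, `score`, and `pick_best`, the chosen factors, and the default weights are all hypothetical. They are not taken from the Hermes-router code base, whose scoring functions may use different factors and normalization.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EndpointStats:
    """Per-endpoint signals, each normalized to [0, 1] (illustrative fields)."""
    name: str
    cache_hit_ratio: float  # share of the prompt prefix already cached
    queue_load: float       # 0 = empty queue, 1 = saturated
    kv_cache_usage: float   # 0 = KV cache empty, 1 = full


@dataclass
class Weights:
    """Contribution of each factor to the score (hypothetical defaults)."""
    cache_hit: float = 2.0
    queue: float = 1.0
    kv_usage: float = 1.0


def score(s: EndpointStats, w: Weights) -> float:
    """Weighted sum: cache hits raise the score, load and KV usage lower it."""
    return (w.cache_hit * s.cache_hit_ratio
            + w.queue * (1 - s.queue_load)
            + w.kv_usage * (1 - s.kv_cache_usage))


def pick_best(candidates: List[EndpointStats], w: Weights) -> Optional[EndpointStats]:
    """Return the highest-scoring endpoint, or None for an empty list.

    On ties, the earlier candidate is kept.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(c, w))
```

Raising `cache_hit` relative to the other weights favors cache reuse even on busier endpoints, while raising `queue` or `kv_usage` spreads load more evenly.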

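Overall Architecture

Service discovery: Hermes-router discovers the matching backend model-serving Pods through the InferencePool resource.
Model matching: when a request reaches the Gateway, the Gateway forwards it to the InferencePool through an HTTPRoute.
Routing decision: the InferencePool invokes Hermes-router as its EPP, which makes the routing decision through plugins in three cooperating layers: data, request control, and scheduling (see the Plugin System section below).
Request forwarding: the Gateway forwards the request to the backend Pod selected by Hermes-router.
Request retry (optional): if the Gateway detects a request failure, it selects another backend Pod and resends the request.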

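Plugin System

GIE v1.5.0 organizes plugins into three layers. Each layer has its own responsibility and can be extended independently:

Data Layer: periodically collects metrics from external systems (such as the NPU exporter) and parses the raw data into structured endpoint attributes for the upper-layer plugins to consume;
Request Control Layer: preprocesses the request before the routing decision:
- PreRequest plugins: run preparatory actions when a request arrives (such as injecting the PD role header);
- DataProducer plugins: produce request-level data for endpoints (tokenization, prefix cache matching, inflight counting, prediction feature extraction, and so on);
Scheduling Layer: makes the routing decision based on the data above:
- Filter plugins: filter out endpoints that do not qualify (for example by Prefill/Decode role or prefix cache hit ratio);
- Scorer plugins: score the candidate endpoints on multiple factors (KV Cache, compute saturation, latency prediction, and so on);
- Picker plugins: select the best endpoint or endpoint group based on the scores.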
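
The Filter, Scorer, and Picker stages of the scheduling layer can be pictured as a small pipeline. The following is a minimal sketch under stated assumptions: `Endpoint`, `schedule`, and `role_filter` are hypothetical names, the role strings are placeholders, and the plugin interfaces in GIE and Hermes-router are richer than plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Endpoint:
    """Simplified candidate backend; all fields are illustrative."""
    name: str
    role: str         # placeholder values such as "prefill" or "decode"
    hit_ratio: float  # prefix cache hit ratio in [0, 1]
    load: float       # normalized load in [0, 1]


Filter = Callable[[Endpoint], bool]   # True keeps the endpoint
Scorer = Callable[[Endpoint], float]  # higher is better


def schedule(endpoints: Sequence[Endpoint],
             filters: Sequence[Filter],
             scorers: Sequence[Scorer]) -> Optional[Endpoint]:
    """Run Filter, Scorer, and Picker stages; return None if nothing qualifies."""
    # Filter stage: keep endpoints accepted by every filter.
    candidates = [ep for ep in endpoints if all(f(ep) for f in filters)]
    if not candidates:
        return None
    # Scorer stage sums all scorers; Picker stage takes the maximum.
    # On ties, the earlier endpoint is kept.
    return max(candidates, key=lambda ep: sum(s(ep) for s in scorers))


def role_filter(role: str) -> Filter:
    """Build a filter that keeps only endpoints with the given PD role."""
    return lambda ep: ep.role == role
```

Keeping each stage as an independent callable mirrors the idea that filters and scorers can be added or swapped without touching the selection logic.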

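Quick Start

Choose one of the following two deployment modes, depending on how you want to integrate:

standalone: ships with a built-in Envoy sidecar and needs no separate Gateway. It suits environments that already have inference backends and only need to add the routing layer;
gateway integration mode (charts/hermes-router): integrates through Gateway, HTTPRoute, and InferencePool. It suits environments that manage everything through a gateway stack such as Istio.

This section demonstrates the former: deploying Hermes-router in standalone mode with the PD KVCache Aware routing strategy enabled.

Scenario

This example uses standalone mode because its built-in Envoy sidecar makes it easy to validate routing quickly on top of existing inference backends. To manage HTTPRoute and InferencePool centrally through a gateway such as Istio, use charts/hermes-router instead.

In PD KVCache Aware routing mode, Hermes-router distinguishes Prefill and Decode roles and queries the KV Cache manager, cache-indexer, in real time for the prefix cache hits of each Prefill endpoint. It prefers the instances with the higher hit ratio, which reduces repeated computation and improves inference efficiency.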
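
To make the idea of a prefix cache hit ratio concrete, the sketch below compares a request's ordered blocks with the blocks reported as cached on each Prefill endpoint. It assumes the prompt has already been split into blocks identified by opaque strings, and that a block is reusable only when all preceding blocks also hit, which is the usual behavior of prefix caching. The functions `prefix_hit_ratio` and `best_prefill` and the shape of the `index` mapping are hypothetical; the actual cache-indexer API is not described here.

```python
from typing import Dict, List, Optional, Set, Tuple


def prefix_hit_ratio(request_blocks: List[str], cached: Set[str]) -> float:
    """Fraction of the request's blocks that form a cached leading run.

    Counting stops at the first miss, because later blocks cannot be
    reused once the prefix is broken.
    """
    if not request_blocks:
        return 0.0
    matched = 0
    for block in request_blocks:
        if block not in cached:
            break
        matched += 1
    return matched / len(request_blocks)


def best_prefill(request_blocks: List[str],
                 index: Dict[str, Set[str]]) -> Optional[Tuple[str, float]]:
    """Return (endpoint name, hit ratio) for the best Prefill endpoint.

    `index` maps endpoint names to their cached block identifiers.
    On ties, the endpoint inserted first is kept. Returns None if empty.
    """
    if not index:
        return None
    ratios = {name: prefix_hit_ratio(request_blocks, cached)
              for name, cached in index.items()}
    name = max(ratios, key=ratios.get)
    return name, ratios[name]
```

In this sketch an endpoint that caches many blocks of the request but misses the first one still scores zero, which is why the order of blocks matters.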

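Prerequisites

Before deploying, confirm that the vLLM inference backends meet the following conditions:

Listening port matches inferencePool.targetPorts: the InferencePool forwards traffic to the ports configured in targetPorts (8000 in this example). Make sure the vLLM Pods listen on the same port, or change inferencePool.targetPorts to the actual port;
Pods carry PD role labels: the PD KVCache Aware strategy relies on openfuyao.com/pdRole to distinguish Prefill and Decode roles and on openfuyao.com/pdGroupID to pair Prefill and Decode instances into groups. The label names can be customized through routing.pd.pdLabelName / routing.pd.pdGroupLabelName, but the vLLM Pods must still carry the corresponding label values;
Pod labels match inferencePool.modelServers.matchLabels: the InferencePool discovers candidate Pods through this selector, so the selector's labels must exactly match those set on the Pods;
cache-indexer is deployed and reachable: KV Cache-aware routing queries cache-indexer during the request control stage for prefix cache hit information, so routing.cacheIndexer.address must be set to the correct service address.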
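
The label-related prerequisites can be checked mechanically before deployment. The sketch below is a hypothetical helper, not part of Hermes-router. It assumes the usual Kubernetes matchLabels semantics (every selector key must be present on the Pod with the same value, and extra Pod labels are allowed) and takes the Pod labels as a plain dictionary, however they were obtained.

```python
from typing import Dict, List


def matches_selector(pod_labels: Dict[str, str],
                     match_labels: Dict[str, str]) -> bool:
    """True if every selector key is on the Pod with the same value."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def check_pod_labels(pod_labels: Dict[str, str],
                     match_labels: Dict[str, str],
                     role_label: str = "openfuyao.com/pdRole",
                     group_label: str = "openfuyao.com/pdGroupID") -> List[str]:
    """Return a list of label problems; an empty list means the Pod looks fine.

    The default label names are the ones named in the prerequisites; pass
    different names if they were customized in the chart values.
    """
    problems = []
    if not matches_selector(pod_labels, match_labels):
        problems.append(
            "labels do not satisfy inferencePool.modelServers.matchLabels")
    for key in (role_label, group_label):
        if not pod_labels.get(key):
            problems.append(f"missing label {key}")
    return problems
```

A Pod that passes the selector check but lacks either PD label would be discovered by the InferencePool yet could not be assigned a role or group by the PD KVCache Aware strategy.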

Deployment Steps

Change to the repository root
```
cd /path/to/hermes-router
```

Prepare the configuration

charts/standalone enables PD KVCache Aware routing by default. Create my-values.yaml and override only the following parameters for your environment:

```
inferenceExtension:
  routing:
    cacheIndexer:
      address: http://<cache-indexer-service>:8080  # replace with the actual address

inferencePool:
  targetPorts:
    - number: 8000
  modelServerType: vllm
  modelServers:
    matchLabels:
      openfuyao.com/model: qwen-qwen3-8b  # must match the label on the vLLM Pods
```

Deploy Hermes-router

```
helm dependency update charts/standalone

helm install hermes-standalone charts/standalone \
  -n <namespace> --create-namespace \
  -f my-values.yaml
```

To enable disaster recovery, see "Enabling disaster recovery in standalone deployments".

Verify the deployment (optional)

```
# Check the InferencePool status
kubectl get inferencepool -n <namespace>

# Check the Hermes-router Pod status
kubectl get pods -n <namespace> -l inferencepool=hermes-standalone-epp

# Check the vLLM Pod labels
kubectl get pods -n <namespace> -l openfuyao.com/pdRole
```

Send a test request (optional)

```
# Get the EPP Pod IP (the Envoy sidecar listens on port 8081)
EPP_POD_IP=$(kubectl get pod -n <namespace> \
  -l inferencepool=hermes-standalone-epp \
  -o jsonpath='{.items[0].status.podIP}')

# Send an inference request
curl -X POST http://${EPP_POD_IP}:8081/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Hello, how are you?",
    "max_tokens": 100
  }'
```

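Security Notes

TLS certificate verification in Istio gateway integration mode

When charts/hermes-router is deployed with provider.name: istio, the chart creates an Istio DestinationRule for the EPP service whose TLS policy defaults to:

```
trafficPolicy:
  tls:
    mode: SIMPLE
    insecureSkipVerify: true
```

This default skips verification of the EPP server certificate. Its main purpose is to accommodate clusters that use self-signed certificates or lack a fully configured PKI, which lowers the configuration barrier for first-time integration.

Security risk: with verification skipped, the TLS connection between the Gateway and the EPP cannot confirm the peer's identity, which exposes it to man-in-the-middle (MITM) attacks, service impersonation, and similar risks. This default is therefore not suitable for direct use in production environments with security or compliance requirements.

Recommendations for production:

Configure the EPP service with a valid certificate issued by a trusted CA (or the cluster's internal CA), and make sure the Gateway side trusts that CA;
Turn off certificate verification skipping (insecureSkipVerify: false) to enable full TLS identity verification;
Before making the change, review the cluster's existing Istio, mTLS, and certificate distribution (for example cert-manager) configuration to avoid conflicts with global security policies.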
