InferNex

End-to-end one-click integrated deployment for the openFuyao AI inference service framework

Updates

[26-06] The inference backend now uses LeaderWorkerSet (LWS) for deployment orchestration, with native multi-DP coordination. PD-Orchestrator's elastic-scaler adds the APA scaling algorithm for multi-metric scaling. Hermes-router adds routing strategies based on compute saturation and latency prediction. cache-indexer implements L3-level KV awareness and works with Mooncake to support global KVCache indexing. eagle-eye adds weight distribution and Lingqu network dynamic metric collection. InferNex adds a Helm pre-deployment validation tool that checks the NPU driver, hardware resources, and network communication environment to identify deployment risks in advance.
[26-05] Added InferNex-Bridge component, compatible with KServe integration into the InferNex inference suite, supporting dual CRD declarative deployment via LLMInferenceService and InferNexService, with the adaptation layer automatically completing orchestration and routing integration.
[26-03] Added PD-Orchestrator component, supporting dynamic PD group scaling; intelligent routing added disaster recovery capabilities, including automatic traffic switching, fault awareness, and request retry; inference backend component refactored and updated, supporting configuration of different vLLM inference engine versions and non-Hugging Face models.
[25-12] Added inference observability sub-component; intelligent routing implemented gateway plugin based on the GIE framework.
[25-09] Released the alpha version of the AI inference integrated deployment, with features including Hermes-router intelligent routing with KVCache-aware and other strategies, the xPyD disaggregated inference engine, global KVCache metadata management, and integration of the Mooncake distributed KVCache management system.

Overview

This project builds on mainstream LLM inference technology stacks and GIE (Gateway API Inference Extension), an official Kubernetes project. It integrates the following K8s-native, high-performance, and scalable sub-features to improve inference throughput and reduce latency, providing efficient and reliable technical support for AI service deployment.

Intelligent Routing System (Hermes-router): A gateway plugin built on the GIE framework that provides dynamic request distribution and load balancing. It supports multi-dimensional awareness, including diverse compute load awareness, KV hit awareness, request pressure awareness, request length awareness, and semantic awareness. Users can apply the built-in strategy extensions (KVCache-aware strategy, PD long/short request bucketing strategy) to route each inference request to the most suitable node.

xPyD Disaggregated Inference Engine: An AI inference backend built on the high-performance vLLM inference engine, supporting xPyD architecture, inference node auto-discovery (Proxy Server), Mooncake KVCache storage, and flexible multi-instance deployment; the inference engine is deployed via LeaderWorkerSet (LWS), natively supporting multi-DP coordination, and DP load balancing strategies can be configured through dataParallelSize and dataParallelSizeLocal.

PD-Orchestrator: An elastic orchestration component for PD disaggregated scenarios, integrating tidal algorithms, scaling decision frameworks, and dynamic PD group scaling capabilities; supports metric-driven and event-driven resource scaling, group-level and intra-group proportional or custom strategy scaling, and exposes extensibility for user-defined decision and resource management logic.

Distributed KVCache Management: Utilizes the Mooncake Hccl Transfer Engine for high-speed KVCache transfer between PD nodes.

Global KVCache Index (cache-indexer): Builds a distributed global KVCache metadata prefix tree from vLLM's KV Event mechanism and exposes it through RESTful interfaces, giving the router KV awareness so that global KVCache resources are used efficiently.

Inference Observability System (eagle-eye): Using the standard Prometheus data collection and reporting formats, provides key observability metrics for inference scenarios, covering business runtime metrics, system runtime metrics, and resource health metrics. Through publish-subscribe over the NATS asynchronous message queue, it also provides real-time (millisecond-level) observability of key business runtime metrics, supporting near-real-time decisions in critical acceleration modules.

Dependent Components

This integrated deployment solution includes the following sub-components and their version information:

Component	Version	Required/Optional	Description
inference-backend	latest	Required	Inference engine backend based on vllm/vllm-ascend
pd-orchestrator	latest	Required	PD dynamic scaling component
Hermes-router	latest	Optional	Intelligent routing system
cache-indexer	latest	Optional	Distributed global KVCache metadata management component
eagle-eye	latest	Optional	Observability system
vLLM-Ascend	0.18.0	Optional	Inference engine framework. Users can configure other versions of vLLM-Ascend images.
Mooncake	0.3.8	Optional	Distributed KVCache management. The version changes with the vLLM-Ascend image version.

Note: Some components are optional and must be enabled through configuration. For detailed configuration, refer to the Configuration section.

Quick Start

Prerequisites

Kubernetes v1.29.0 or later (v1.33.0 or later recommended).
The npu-operator component is installed.
The LWS component is installed.
Metrics server v0.8.0 or later is installed in the cluster.
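
The version floors above can be checked with a small helper before installing. The following is a minimal sketch and not part of InferNex: `version_ge` is a hypothetical helper name, it assumes GNU `sort -V` is available, and the version strings passed to it are placeholders to be replaced with the values your own cluster reports (for example, from `kubectl version`).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with InferNex): succeeds when version $1
# is greater than or equal to version $2. Relies on GNU `sort -V`.
version_ge() (
  have="${1#v}"   # drop an optional leading "v"
  need="${2#v}"
  lowest="$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)"
  [ "$lowest" = "$need" ]
)

# Placeholder values; substitute the versions your cluster actually reports.
k8s_version="v1.30.2"
metrics_server_version="v0.8.0"

if version_ge "$k8s_version" "v1.29.0"; then
  echo "Kubernetes $k8s_version: OK"
else
  echo "Kubernetes $k8s_version: below the v1.29.0 minimum"
fi

if version_ge "$metrics_server_version" "v0.8.0"; then
  echo "Metrics server $metrics_server_version: OK"
else
  echo "Metrics server $metrics_server_version: below the v0.8.0 minimum"
fi
```

The helper compares plain dotted versions only; vendor suffixes that some distributions append to the Kubernetes version are not handled and would need to be stripped first.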

Binary Deployment

Pull the project installation package.
```
helm pull oci://cr.openfuyao.cn/charts/infernex --version xxx
```
Replace xxx with the chart version to install, such as 0.0.0-latest. The chart is downloaded as a compressed .tgz archive.
Decompress the installation package.
```
tar -xzvf infernex-xxx.tgz
```
Replace xxx with the specific project installation package version, such as 0.0.0-latest.
Install and deploy.

Using namespace ai-inference and release name infernex as an example, run the following command from the directory that contains the extracted infernex directory:
```
helm install -n ai-inference infernex ./infernex
```
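
The three steps above can be collected into one script so that the chart version replaces xxx consistently. This is a sketch under stated assumptions: the `run` wrapper and the `DRY_RUN`, `CHART_VERSION`, `NAMESPACE`, and `RELEASE` variables are illustrative names, not part of InferNex, and `0.0.0-latest` is the example version from the text. The sketch also adds Helm's `--create-namespace` flag, so the install does not fail when the ai-inference namespace does not exist yet, and ends with `helm status` and `kubectl get pods` as one possible way to confirm the release.

```shell
#!/bin/sh
set -e  # stop at the first failing step

# Illustrative variables; override them in the environment as needed.
CHART_VERSION="${CHART_VERSION:-0.0.0-latest}"
NAMESPACE="${NAMESPACE:-ai-inference}"
RELEASE="${RELEASE:-infernex}"

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pull and unpack the chart, then install it.
run helm pull oci://cr.openfuyao.cn/charts/infernex --version "$CHART_VERSION"
run tar -xzvf "infernex-$CHART_VERSION.tgz"
run helm install -n "$NAMESPACE" --create-namespace "$RELEASE" ./infernex

# Assumed post-install checks: release status and pod state.
run helm status -n "$NAMESPACE" "$RELEASE"
run kubectl get pods -n "$NAMESPACE"
```

By default the script only prints the commands it would run. Setting `DRY_RUN=0` executes them, which requires `helm` and `kubectl` on the PATH and access to the target cluster.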

Source Deployment

Clone the project from the repository.

```
git clone https://gitcode.com/openFuyao/InferNex.git
```

Install and deploy.

Using namespace ai-inference and release name infernex as an example, run the following commands from the directory that contains the cloned InferNex directory:
```
cd InferNex/charts/infernex
helm dependency build
helm install -n ai-inference infernex .
```

Configuration

For detailed configuration instructions, refer to the Configuring AI Inference Integrated Deployment section of the InferNex User Guide.

Performance

Fixed-length system prompt reuse scenario: This scenario simulates a workload with fixed-length system prompts where KVCache can be reused. The dataset contains 120 distinct system prompts of 8k length, each repeated 4 times, for 480 requests in total. Request arrivals follow a Poisson distribution with a concurrency of 8. With random routing as the performance baseline, enabling the InferNex optimization capabilities in aggregate deployment reduces average TTFT by approximately 54% and improves average TPS by approximately 20%.
Multi-turn conversation scenario: This scenario simulates multiple users holding continuous conversations with an LLM. A single session contains multiple rounds of requests, with 120 independent users generating 480 requests in total. The first-round request is 16k tokens, each round returns 128 tokens, and each subsequent round appends 1k tokens, for 4 rounds in total and an average of approximately 17.5k tokens per request. Request arrivals follow a Poisson distribution with a concurrency of 8. With random routing as the performance baseline, enabling the InferNex optimization capabilities in aggregate deployment reduces average TTFT by approximately 60% and improves average TPS by approximately 44%.
Inference observability system: Pod resource consumption is CPU < 20m and memory 3000 to 3500 M. With second-level reporting, the publisher's average collection duration is < 10 ms and the subscriber's average log receiving latency is < 1 ms. For details, see the eagle-eye Performance Test Report.
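
The request counts and the 17.5k average quoted for the two routing scenarios above follow from simple arithmetic. The sketch below reproduces them. It assumes that "k" means 1,000 tokens and that the quoted average counts only the 1k appended per round, which is an interpretation rather than something the text states. Carrying each 128-token reply into the next round's request would give a slightly higher figure, computed here for comparison.

```shell
#!/bin/sh
# Request counts: 120 prompts x 4 repeats, and 120 users x 4 rounds.
fixed_prompt_requests=$((120 * 4))
multi_turn_requests=$((120 * 4))

rounds=4
total_prompt_only=0    # assumption: each round adds only the appended 1k tokens
total_with_replies=0   # assumption: each round also carries the prior 128-token reply
round=0
while [ "$round" -lt "$rounds" ]; do
  total_prompt_only=$((total_prompt_only + 16000 + round * 1000))
  total_with_replies=$((total_with_replies + 16000 + round * (1000 + 128)))
  round=$((round + 1))
done

echo "fixed-prompt requests: $fixed_prompt_requests"
echo "multi-turn requests:   $multi_turn_requests"
echo "avg tokens per request (appended prompt only): $((total_prompt_only / rounds))"
echo "avg tokens per request (replies carried over): $((total_with_replies / rounds))"
```

Under these assumptions the script reports 480 requests for each scenario, 17500 tokens for the first average (matching the quoted 17.5k), and 17692 tokens for the second.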

For detailed performance data and comparative analysis of each optimization strategy, see the InferNex Performance Test Report.

Components

InferNex-Checker: The InferNex pre-deployment validation tool. It checks the hardware, K8s cluster, and configuration environment before installation to identify deployment risks in advance.
InferNex-Bridge: Adaptation layer for integrating InferNex with KServe, supporting dual CRD declarative deployment of InferNex via LLMInferenceService / InferNexService. See Component README.

Roadmap

[26-06] Hermes-router intelligent routing supports perception and scheduling based on instance resource saturation, and secondary scheduling among instances under PD disaggregated architecture.
[26-06] InferNex-Deployer improves the continuous integration pipeline, adding production-grade validation for large model PD disaggregated and multi-instance deployment.
[26-06] Elastic-scaler focuses on workload distribution acceleration (weights, images, process startup), supporting high-performance elastic scenarios, while complementing event signal-driven scaling and compute-aware strategies.
[26-X] eagle-eye near-real-time observability will extend dynamic network resource metrics, broaden hardware health and sub-health perception, adapt to A5 generation specifications, and advance error code standardization.
[26-X] KVCacheX covers cache-indexer / conductor and Lingqu enablement-related directions; plans to incorporate iterative enhancements such as DSA and Hybrid Attention KV offloading capabilities.
[26-X] Plan KServe integration adaptation for unified management of different types of inference serving (predictive, LLM, etc.) and compute stack traffic (InferNex, llm-d, etc.); collaborate with the vLLM-Ascend community to publish recommended deployment cases based on InferNex.