feat: vLLM plugin support (#53)

Commit 873efd5e, created May 30, 2025

vLLM ExpertMesh Plugin

ExpertMesh plugin for the vLLM framework.

Installation

Install the plugin in development mode:

pip install -e .

Usage

1. Setup Expert-Kit Service

First, make sure your Expert-Kit service is running and reachable from the machine that runs vLLM. Refer to the "Deploying Qwen3-30B-A3B with Expert-Kit" guide for details.

2. Model Configuration

Expert-Kit configuration can be set through model configuration parameters or environment variables. The plugin supports the following configuration options:

Configuration Parameters

ek_mode: Operation mode, default: "expert_mode"
ek_backend_addr: Address of your Expert-Kit service, default: "localhost:5002"
ek_debug_mode: Enable debug mode, default: False
ek_client_timeout: gRPC timeout in seconds, default: 2
ek_model_name: Model name for Expert-Kit service (required)

Method 1: Model Configuration

When loading a model with vLLM, add Expert-Kit parameters to your model configuration:

from vllm import LLM

# Configure Expert-Kit through model config
model_config = {
    "ek_mode": "expert_mode",
    "ek_backend_addr": "localhost:5002",
    "ek_debug_mode": False,
    "ek_client_timeout": 2,
    "ek_model_name": "Qwen/Qwen3-MoE-A3B"
}

# Create LLM with Expert-Kit configuration
llm = LLM(
    model="Qwen/Qwen3-MoE-A3B", 
    tensor_parallel_size=1,
    trust_remote_code=True,
    model_config=model_config
)

Method 2: Environment Variables

Alternatively, configure Expert-Kit using environment variables:

export EK_ENABLE=1
export EK_MODE="expert_mode"
export EK_ADDR="localhost:5002"
export EK_DEBUG_MODE="0"
export EK_CLIENT_TIMEOUT="2"
export EK_MODEL_NAME="Qwen/Qwen3-MoE-A3B"

Note: Environment variables take precedence over model configuration parameters.

3. Enable Expert-Kit Plugin

Set the EK_ENABLE environment variable to 1 to activate the plugin:

export EK_ENABLE=1

4. Generate Text

Generate text as you normally would with vLLM:

import os

# Enable Expert-Kit
os.environ["EK_ENABLE"] = "1"

from vllm import LLM, SamplingParams

# Method 1: Using model config
llm = LLM(
    model="Qwen/Qwen3-MoE-A3B",
    tensor_parallel_size=1,
    trust_remote_code=True,
    model_config={
        "ek_backend_addr": "localhost:5002",
        "ek_model_name": "Qwen/Qwen3-MoE-A3B"
    }
)

# Generate text; vLLM takes generation limits through SamplingParams,
# not as keyword arguments to generate()
sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate("Hello, world!", sampling_params)
print(outputs[0].outputs[0].text)

Supported Models

This plugin currently supports:

Qwen3-MoE-A3B: Qwen/Qwen3-MoE-A3B (requires vLLM >= 0.8.4)
DeepSeek-V2: deepseek-ai/deepseek-v2-base

Architecture

This plugin replaces the DeepseekV2MoE implementation with ExpertKitMoE, which routes expert computation to the Expert-Kit service.
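
To make the routing idea concrete, here is a toy sketch in plain Python. It is not the plugin's ExpertKitMoE code: `FakeExpertService`, `moe_forward`, the top-k gating, and the scalar "experts" are all hypothetical stand-ins, and the real plugin talks to the Expert-Kit service over gRPC rather than calling an in-process object.

```python
import math


class FakeExpertService:
    """Hypothetical in-process stand-in for the remote Expert-Kit service."""

    def __init__(self, expert_scales):
        # Each toy "expert" just multiplies the hidden vector by a scalar.
        self.expert_scales = expert_scales

    def forward(self, expert_id, hidden):
        scale = self.expert_scales[expert_id]
        return [scale * x for x in hidden]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(hidden, router_logits, service, top_k=2):
    """Pick the top-k experts locally, run them on the service, mix the results."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    output = [0.0] * len(hidden)
    for expert_id in chosen:
        weight = probs[expert_id] / norm
        expert_out = service.forward(expert_id, hidden)  # stands in for the remote call
        output = [acc + weight * y for acc, y in zip(output, expert_out)]
    return output


service = FakeExpertService(expert_scales=[1.0, 2.0, 3.0, 4.0])
print(moe_forward([1.0, -1.0], [0.0, 0.0, 5.0, 5.0], service))
```

The point of the sketch is the split of responsibilities: gating stays in the vLLM process, while the per-expert computation happens behind a service boundary.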

Requirements

vLLM >= 0.8.4 (required for Qwen3-MoE support)
grpcio >= 1.71.0
protobuf >= 5.29.4
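
A quick way to check an environment against the minimums above is sketched below. The helper names are hypothetical, and the version comparison is deliberately naive: it compares only the leading numeric components, so pre-release and local version suffixes are ignored.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above; keys are PyPI distribution names.
MINIMUMS = {"vllm": "0.8.4", "grpcio": "1.71.0", "protobuf": "5.29.4"}


def numeric_prefix(version):
    """Turn '0.8.5.post1' into (0, 8, 5); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
        if match.group() != piece:
            break  # e.g. '4rc1': keep the 4, ignore the suffix and the rest
    return tuple(parts)


def meets_minimum(installed, minimum):
    return numeric_prefix(installed) >= numeric_prefix(minimum)


def check_requirements(minimums=MINIMUMS):
    """Return {package: problem} for every missing or too-old package."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


print(check_requirements() or "all requirements satisfied")
```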

Configuration Priority

Configuration parameters are resolved in the following order (higher priority overrides lower):

Environment variables (highest priority)
Model configuration parameters
Default values (lowest priority)
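
The resolution order can be illustrated with a small function. This is a sketch under stated assumptions, not the plugin's own loader: `resolve_ek_config` and `_OPTIONS` are hypothetical names, and the string-to-value parsing of environment variables (for example treating "1" as true) is assumed rather than documented. Only the option names, environment variable names, and defaults come from this README.

```python
import os

# (option name, environment variable, default, parser for the env string)
_OPTIONS = [
    ("ek_mode", "EK_MODE", "expert_mode", str),
    ("ek_backend_addr", "EK_ADDR", "localhost:5002", str),
    ("ek_debug_mode", "EK_DEBUG_MODE", False, lambda v: v == "1"),
    ("ek_client_timeout", "EK_CLIENT_TIMEOUT", 2, float),
    ("ek_model_name", "EK_MODEL_NAME", None, str),
]


def resolve_ek_config(model_config=None, environ=None):
    """Resolve each option as: environment variable > model config > default."""
    model_config = model_config or {}
    environ = os.environ if environ is None else environ
    resolved = {}
    for name, env_name, default, parse in _OPTIONS:
        if env_name in environ:
            resolved[name] = parse(environ[env_name])  # highest priority
        elif name in model_config:
            resolved[name] = model_config[name]
        else:
            resolved[name] = default  # lowest priority
    if not resolved["ek_model_name"]:
        raise ValueError("ek_model_name / EK_MODEL_NAME is required")
    return resolved


cfg = resolve_ek_config(
    model_config={"ek_backend_addr": "10.0.0.5:5002", "ek_model_name": "Qwen/Qwen3-MoE-A3B"},
    environ={"EK_ADDR": "localhost:6000"},
)
print(cfg["ek_backend_addr"])  # localhost:6000, because the environment wins
```

Passing `environ` explicitly, as in the usage above, keeps the function easy to test without touching the real process environment.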

Deployment Example

import os

# Set all environment variables before vLLM is imported
os.environ["VLLM_MLA_DISABLE"] = "1"

os.environ["EK_ENABLE"] = "1"  # "1" activates the plugin (see step 3)
os.environ["EK_MODEL_NAME"] = "qwen3"
os.environ["EK_MODE"] = "expert_mode"
os.environ["EK_ADDR"] = "localhost:5002"
os.environ["EK_CLIENT_TIMEOUT"] = "2"
os.environ["EK_DEBUG_MODE"] = "0"

from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United",
]

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    trust_remote_code=True,
    max_model_len=16,
    enforce_eager=True,
    cpu_offload_gb=64,
    max_num_batched_tokens=1024,
)

outputs = llm.generate(prompts)

# Print each prompt together with its generated continuation
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")

Troubleshooting

Common Issues

Missing EK_MODEL_NAME: Set ek_model_name in the model config or set the EK_MODEL_NAME environment variable.
Connection timeout: Increase ek_client_timeout (or EK_CLIENT_TIMEOUT) if your Expert-Kit service is slow to respond.
Debug mode: Set ek_debug_mode=True or EK_DEBUG_MODE=1 to enable detailed logging.
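
When a connection timeout shows up, it can help to first rule out a plain network problem. The sketch below uses only the standard library and hypothetical helper names (`split_addr`, `backend_reachable`). It checks that something accepts TCP connections at the configured address; it does not speak gRPC, so a successful result does not prove the Expert-Kit service itself is healthy.

```python
import socket


def split_addr(addr):
    """Split 'host:port' (the ek_backend_addr format) into (host, port)."""
    host, _, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {addr!r}")
    return host, int(port)


def backend_reachable(addr="localhost:5002", timeout=2.0):
    """Return True if a TCP connection to addr succeeds within timeout seconds."""
    host, port = split_addr(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("reachable" if backend_reachable() else "unreachable")
```

If the address is unreachable here, raising ek_client_timeout will not help; check the service address and network path first.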