快速入门
本文档通过简单快速的部署案例(以Atlas 800I A2服务器、Qwen3-8B模型、P/D实例各一个的场景为例)指导开发者体验基于MindIE-Motor的PD分离服务部署流程。
如果详细的PD分离部署指导,请参考PD分离部署指导。
什么是PD分离?
模型推理的Prefill阶段和Decode阶段分别实例化部署在不同的硬件资源上进行推理,提升推理性能,其特性介绍详情请参见PD分离部署。
环境要求
-
支持Atlas 800I A2或者Atlas 800 A3 超节点服务器。
-
至少需要1台已完成环境准备的服务器。
模型下载
请自行下载Qwen3-8B模型的权重文件并将权重文件上传至服务器任意目录(以/mnt/weight为例)。执行以下命令,修改文件权限:
chmod -R 755 /mnt/weight
镜像准备
进入昇腾官方镜像仓库,在搜索框查询 motor,进入搜索结果后根据设备型号下载对应的MindIE-Motor镜像。
服务部署
-
准备服务启动脚本。
MindIE-Motor官方完整镜像内已保存服务启动脚本(
/tmp/motor/examples),可通过以下命令将镜像内的文件拷贝至宿主机。```bash IMAGE="<镜像名或镜像ID>" cid=$(docker create "$IMAGE") docker cp "$cid:/tmp/motor/examples" ./examples docker rm "$cid" ```请将上述脚本目录(examples目录)上传至k8s集群的管理节点(master节点),后续部署操作均在管理节点执行。
-
配置服务化参数。
在管理节点执行以下命令,进入服务启动脚本所在目录并修改配置文件。
cd examples/deployer/ vim ../infer_engines/vllm/user_config.jsonuser_config.json文件完整示例如下(可直接复制使用,4项xxxxxx内容需用户自行修改,如需了解各字段含义可参考 user_config 全量参数说明。):
{ "version": "v2.0", "motor_deploy_config": { "p_instances_num": 1, "d_instances_num": 1, "single_p_instance_pod_num": 1, "single_d_instance_pod_num": 1, "p_pod_npu_num": 4, "d_pod_npu_num": 4, "image_name": "xxxxxxx 镜像名称。例如:mindie-motor-vllm:dev-26.1.0.B050-800I-A2-py311-Ubuntu24.04-lts-aarch64", "job_id": "mindie-motor", "hardware_type": "xxxxxx 硬件类型。A2:800I_A2 A3:800I_A3", "weight_mount_path": "/mnt/weight/" }, "motor_controller_config": {}, "motor_coordinator_config": {}, "motor_engine_prefill_config": { "engine_type": "vllm", "motor_nodemanger_config": {}, "engine_config": { "served_model_name": "qwen3-8B", "model": "xxxxxx。权重文件路径。例如:/mnt/weight/qwen3_8B", "gpu_memory_utilization": 0.9, "data_parallel_size": 1, "tensor_parallel_size": 2, "pipeline_parallel_size": 1, "data_parallel_rpc_port": 9000, "enable_expert_parallel": false, "enforce-eager": true, "max_model_len": 2048, "kv_transfer_config": { "kv_connector": "MooncakeLayerwiseConnector", "kv_buffer_device": "npu", "kv_role": "kv_producer", "kv_parallel_size": 1, "kv_port": "30001", "engine_id": "0", "kv_rank": 0, "kv_connector_extra_config": {} } } }, "motor_engine_decode_config": { "engine_type": "vllm", "motor_nodemanger_config": {}, "engine_config": { "served_model_name": "qwen3-8B", "model": "xxxxxx。权重文件路径。例如:/mnt/weight/qwen3_8B", "gpu_memory_utilization": 0.9, "data_parallel_size": 1, "tensor_parallel_size": 2, "pipeline_parallel_size": 1, "data_parallel_rpc_port": 9000, "enable_expert_parallel": false, "max_model_len": 2048, "kv_transfer_config": { "kv_connector": "MooncakeLayerwiseConnector", "kv_buffer_device": "npu", "kv_role": "kv_consumer", "kv_parallel_size": 1, "kv_port": "30001", "engine_id": "0", "kv_rank": 0, "kv_connector_extra_config": {} } } } } -
配置环境变量。
执行以下命令修改环境变量配置文件。
vim ../infer_engines/vllm/env.jsonenv.json文件完整示例如下(可直接复制使用):
{ "version": "2.0.0", "motor_common_env": { "CANN_INSTALL_PATH": "/usr/local/Ascend", "MOTOR_LOG_ROOT_PATH": "/root/ascend/log" }, "motor_controller_env": {}, "motor_coordinator_env": {}, "motor_engine_prefill_env": { "HCCL_BUFFSIZE": 200, "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:True", "HCCL_OP_EXPANSION_MODE": "AIV", "OMP_PROC_BIND": "false", "OMP_NUM_THREADS": 100, "ASCEND_BUFFER_POOL": "0:0" }, "motor_engine_decode_env": { "HCCL_BUFFSIZE": 200, "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:True", "HCCL_OP_EXPANSION_MODE": "AIV", "OMP_PROC_BIND": "false", "OMP_NUM_THREADS": 100, "ASCEND_BUFFER_POOL": "0:0" }, "motor_kv_cache_pool_env": {}, "motor_kv_conductor_env": {} } -
启动与终止服务
创建命名空间(namespace),namespace 的值必须与
user_config.json中的job_id字段相同(默认值为mindie-motor)。kubectl create ns mindie-motor执行以下命令,部署PD分离服务:
python3 deploy.py --config_dir ../infer_engines/vllm需要终止服务时,执行以下命令即可:
bash delete.sh 命名空间(填入手动创建的命名空间名称,例如:mindie-motor) -
查看日志。
执行
vim log_collect/log_config.ini命令,将name_space填写为命名空间名称(例如:mindie-motor),然后执行以下命令收集日志:bash show_log.sh所有业务日志(controller、coordinator、P/D实例)均会保存于
examples/deployer/log_collect/log目录下,并持续刷新,直到服务被终止。
推理验证
新建一个命令行窗口,在k8s集群的管理节点(master节点)执行以下命令:
curl -X POST http://127.0.0.1:31015/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-8B",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "who are you?"
}
],
"max_tokens":36,
"stream":true
}'
返回结果如果如下,则说明尚未启动就绪:
{"detail":"Service is not available"}
等待一段时间后再次尝试。回显类似如下内容说明推理服务已就绪:
data: {"id":"17658563046856100000c836403d","object":"chat.completion.chunk","created":1765856304,"model":"qwen3","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
data: {"id":"17658563046856100000c836403d","object":"chat.completion.chunk","created":1765856304,"model":"qwen3","choices":[{"index":0,"delta":{"content":"<think>"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"17658563046856100000c836403d","object":"chat.completion.chunk","created":1765856304,"model":"qwen3","choices":[{"index":0,"delta":{"content":"\n"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"17658563046856100000c836403d","object":"chat.completion.chunk","created":1765856304,"model":"qwen3","choices":[{"index":0,"delta":{"content":"Okay"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
...
data: [DONE]