One-Click Deployment of PD Co-location on a Single A2 Machine

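The configuration below uses CANN B080 with DeepSeek-R1 w4a8 as the example. NPU_MEMORY_FRACTION=0.92 is tuned for the DeepSeek-R1 w8a8 environment; the other parameters and the weight path must be adjusted to the model actually in use.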

1. Configure environment variables

After updating the libjemalloc.so path, run:

source A2_single_machine.sh

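1.1 Environment variable notes

  • NPU_MEMORY_FRACTION: the device-memory fraction factor. The default is 0.92; for high-concurrency scenarios a value <= 0.92 is recommended.
    • Recommended approach: set this to the smallest value at which the service still starts. Start the service with the default configuration. If the service fails to start, raise the value until it starts; if it starts, lower the value until the service only just starts. In short, as long as the service starts normally, a lower value gives better service stability.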
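
The tuning procedure above can be sketched as a small search loop. This is only an illustration: `can_start` is a hypothetical callback that would launch the service with the given NPU_MEMORY_FRACTION and report whether it came up, and the 1% step and the 0.50 to 0.99 search bounds are assumptions, not values from this guide.

```python
def find_min_fraction(can_start, default_pct=92, step_pct=1, low_pct=50, high_pct=99):
    """Return the smallest fraction (on a step_pct grid) at which can_start() succeeds.

    Works in integer percent to avoid floating-point drift.
    Returns None if no value up to high_pct lets the service start.
    """
    pct = default_pct
    if can_start(pct / 100):
        # Service starts with the default: lower the value until the next step fails.
        while pct - step_pct >= low_pct and can_start((pct - step_pct) / 100):
            pct -= step_pct
        return pct / 100
    # Service does not start with the default: raise the value until it starts.
    while pct + step_pct <= high_pct:
        pct += step_pct
        if can_start(pct / 100):
            return pct / 100
    return None


if __name__ == "__main__":
    # Stand-in for a real launch attempt: pretend the service needs at least 0.88.
    print(find_min_fraction(lambda fraction: fraction >= 0.88))
```

In practice each `can_start` call is a full service launch, so a coarser step may be preferable when start-up is slow.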

2. Configure mindie_llm/conf/config.json for performance testing

Open mindie_llm/conf/config.json and change the following parameters:

Original  ->  Changed
"httpsEnabled" : true,  ->  "httpsEnabled" : false,
"interCommTLSEnabled" : true,  ->  "interCommTLSEnabled" : false,
"tokenTimeout" : 600,  ->  "tokenTimeout" : 3600,
"e2eTimeout" : 600,  ->  "e2eTimeout" : 65535,
"npuDeviceIds" : [[0,1,2,3]],  ->  "npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"multiNodesInferEnabled" : false,  ->  "multiNodesInferEnabled" : true,
"interNodeTLSEnabled" : true,  ->  "interNodeTLSEnabled" : false,
"maxSeqLen" : 2560,  ->  "maxSeqLen" : 28672,
"maxInputTokenLen" : 2560,  ->  "maxInputTokenLen" : 28672,
"modelName" : "llama_65b",  ->  "modelName" : "dsr1",
"modelWeightPath" : "/data/atb_testdata/weights/llama1-65b-safetensors",  ->  "modelWeightPath" : "/mnt/mindie_data/weight/Deepseek_w4a8_0625",
"worldSize" : 4,  ->  "worldSize" : 8,
"maxPrefillTokens" : 8192,  ->  "maxPrefillTokens" : 28672,
"maxIterTimes" : 512,  ->  "maxIterTimes" : 28672,
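
For repeated deployments, the manual edits above could be applied with a short script. The sketch below makes no assumption about where each key sits in config.json: it walks the nested JSON and replaces every key listed in the table. The helper names `apply_overrides` and `patch_file` are illustrative and not part of MindIE. Rewriting the file with `json.dump` changes its formatting, so keeping a backup is advisable.

```python
import json

# Target values taken from the "Changed" column of the table above.
PERF_OVERRIDES = {
    "httpsEnabled": False,
    "interCommTLSEnabled": False,
    "tokenTimeout": 3600,
    "e2eTimeout": 65535,
    "npuDeviceIds": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "multiNodesInferEnabled": True,
    "interNodeTLSEnabled": False,
    "maxSeqLen": 28672,
    "maxInputTokenLen": 28672,
    "modelName": "dsr1",
    "modelWeightPath": "/mnt/mindie_data/weight/Deepseek_w4a8_0625",
    "worldSize": 8,
    "maxPrefillTokens": 28672,
    "maxIterTimes": 28672,
}


def apply_overrides(node, overrides, seen=None):
    """Replace matching keys anywhere in nested dicts/lists; return the keys replaced."""
    seen = set() if seen is None else seen
    if isinstance(node, dict):
        for key in node:
            if key in overrides:
                node[key] = overrides[key]
                seen.add(key)
            else:
                apply_overrides(node[key], overrides, seen)
    elif isinstance(node, list):
        for item in node:
            apply_overrides(item, overrides, seen)
    return seen


def patch_file(path, overrides):
    """Load a JSON config, apply the overrides, and write it back in place."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    missing = set(overrides) - apply_overrides(config, overrides)
    if missing:
        # Fail loudly rather than silently leaving a parameter unchanged.
        raise KeyError(f"keys not found in {path}: {sorted(missing)}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
```

A call such as `patch_file("mindie_llm/conf/config.json", PERF_OVERRIDES)` would then apply all fourteen changes in one pass.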

Add the following immediately after "async_scheduler_wait_time": 120,

"moe_ep": 1,
"moe_tp": 8,
"sp": 1,
"tp": 8,
"dp": 1,
"ignore_eos": true,
"plugin_params": "{\"plugin_type\":\"mtp\",\"num_speculative_tokens\": 1}",
"models": {
    "deepseekv2":{
        "enable_mlapo_prefetch": true,
        "kv_cache_options": {"enable_nz": true}
    }
}
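
The same insertion can be scripted. The sketch below finds the JSON object that contains "async_scheduler_wait_time", wherever it is nested, and places the extra fields directly after that key. `insert_after_key` is an illustrative helper, not a MindIE API; key order does not matter to a JSON parser, and it is preserved here only so the file matches the manual edit.

```python
EXTRA_FIELDS = {
    "moe_ep": 1,
    "moe_tp": 8,
    "sp": 1,
    "tp": 8,
    "dp": 1,
    "ignore_eos": True,
    # Kept as a JSON-encoded string, exactly as in the fragment above.
    "plugin_params": '{"plugin_type":"mtp","num_speculative_tokens": 1}',
    "models": {
        "deepseekv2": {
            "enable_mlapo_prefetch": True,
            "kv_cache_options": {"enable_nz": True},
        }
    },
}


def insert_after_key(node, anchor, extra):
    """Insert `extra` right after `anchor` in the first dict that contains it.

    Returns True if the anchor was found. Existing keys that also appear in
    `extra` are replaced by the new values.
    """
    if isinstance(node, dict):
        if anchor in node:
            rebuilt = {}
            for key, value in node.items():
                if key not in extra:
                    rebuilt[key] = value
                if key == anchor:
                    rebuilt.update(extra)
            node.clear()
            node.update(rebuilt)
            return True
        return any(insert_after_key(v, anchor, extra) for v in node.values())
    if isinstance(node, list):
        return any(insert_after_key(v, anchor, extra) for v in node)
    return False
```

After loading config.json with `json.load`, calling `insert_after_key(config, "async_scheduler_wait_time", EXTRA_FIELDS)` and writing the result back reproduces the manual edit.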

3. Configure the performance test script ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py

Replace host_ip with the actual service IP, then copy the following as is:

from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models=[
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="/mnt/mindie_data/weight/Deepseek_w4a8_0625",
        model="dsr1",
        request_rate=0,
        retry=2,
        host_ip="host_ip",  # placeholder: replace with the actual service IP
        host_port=1025,
        max_out_len=1500,
        batch_size=25,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.6,
            ignore_eos=True,
        )
    )
]

4. Configure mindie_llm/conf/config.json for accuracy testing

Starting from the performance-test config.json, only one change is needed: replace "ignore_eos": true, with "ignore_eos": false,
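
If both variants are kept side by side, the accuracy config can be derived from the performance config instead of being edited by hand. This is a minimal sketch that assumes the config has already been loaded with `json.load`; `to_accuracy_config` is an illustrative name, and the input dict is left untouched.

```python
import copy


def _set_ignore_eos(node, value):
    """Set every "ignore_eos" key in nested dicts/lists; return how many were set."""
    changed = 0
    if isinstance(node, dict):
        for key in node:
            if key == "ignore_eos":
                node[key] = value
                changed += 1
            else:
                changed += _set_ignore_eos(node[key], value)
    elif isinstance(node, list):
        for item in node:
            changed += _set_ignore_eos(item, value)
    return changed


def to_accuracy_config(perf_config):
    """Return a deep copy of the performance config with ignore_eos switched off."""
    accuracy_config = copy.deepcopy(perf_config)
    if _set_ignore_eos(accuracy_config, False) == 0:
        raise KeyError('"ignore_eos" not found; was the performance config applied first?')
    return accuracy_config
```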

5. Configure the accuracy test script ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py

Replace host_ip with the actual service IP, then copy the following as is:

from ais_bench.benchmark.models import VLLMCustomAPIChat

models=[
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="/mnt/mindie_data/weight/Deepseek_w4a8_0625",
        model="dsr1",
        request_rate=0,
        retry=2,
        host_ip="host_ip",  # placeholder: replace with the actual service IP
        host_port=1025,
        max_out_len=24576,
        batch_size=2,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.6,
        )
    )
]

6. Performance test command

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --summarizer default_perf --mode perf --num-prompts 224

7. Accuracy test command

ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt --summarizer example
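
When the two runs are driven from a script, the commands from sections 6 and 7 could be wrapped as below. The argument lists repeat the flags shown above unchanged; the `run` helper and its `dry_run` switch are illustrative additions, and the sketch assumes `ais_bench` is on the PATH.

```python
import shlex
import subprocess

# Performance run (section 6), flags copied verbatim.
PERF_CMD = [
    "ais_bench",
    "--models", "vllm_api_stream_chat",
    "--datasets", "gsm8k_gen_0_shot_cot_str_perf",
    "--debug",
    "--summarizer", "default_perf",
    "--mode", "perf",
    "--num-prompts", "224",
]

# Accuracy run (section 7), flags copied verbatim.
ACCURACY_CMD = [
    "ais_bench",
    "--models", "vllm_api_general_chat",
    "--datasets", "aime2024_gen_0_shot_chat_prompt",
    "--summarizer", "example",
]


def run(cmd, dry_run=True):
    """Print the command line; execute it only when dry_run is False."""
    line = shlex.join(cmd)
    print(line)
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
    return line


if __name__ == "__main__":
    run(PERF_CMD)
    run(ACCURACY_CMD)
```

Remember to switch "ignore_eos" in config.json and restart the service between the performance and accuracy runs, as described in section 4.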