Quick reference

The official vLLM CPU Docker images
Maintained by: openEuler CloudNative SIG
Where to get help: openEuler CloudNative SIG, openEuler

vLLM | openEuler

The current vLLM Docker images are built on openEuler. This repository is free to use and exempt from per-user rate limits.

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantizations: GPTQ, AWQ, INT4, INT8, and FP8.
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill

Read more about vLLM in the vLLM paper (SOSP 2023) and explore the vLLM technical documentation at docs.vllm.ai.

Supported tags and respective Dockerfile links

Each vLLM Docker image tag consists of the vLLM version and the version of the base image. The details are as follows:

Tags	Description	Architectures
0.20.1-oe2403sp3	vLLM 0.20.1 on openEuler 24.03-LTS-SP3	amd64, arm64
0.18.0-oe2403sp3	vLLM 0.18.0 on openEuler 24.03-LTS-SP3	amd64, arm64
0.16.0-oe2403sp3	vLLM 0.16.0 on openEuler 24.03-LTS-SP3	amd64, arm64
0.6.3-oe2403lts	vLLM 0.6.3 on openEuler 24.03-LTS	amd64
0.8.3-oe2203sp4	vLLM 0.8.3 on openEuler 22.03-LTS-SP4	amd64, arm64
0.8.3-oe2403lts	vLLM 0.8.3 on openEuler 24.03-LTS	amd64, arm64
0.8.4-oe2203sp4	vLLM 0.8.4 on openEuler 22.03-LTS-SP4	amd64
0.8.4-oe2403lts	vLLM 0.8.4 on openEuler 24.03-LTS	amd64
0.8.5-oe2203sp4	vLLM 0.8.5 on openEuler 22.03-LTS-SP4	amd64, arm64
0.8.5-oe2403lts	vLLM 0.8.5 on openEuler 24.03-LTS	amd64, arm64
0.9.0-oe2203sp4	vLLM 0.9.0 on openEuler 22.03-LTS-SP4	amd64, arm64
0.9.0-oe2403lts	vLLM 0.9.0 on openEuler 24.03-LTS	amd64, arm64
0.9.1-oe2203sp4	vLLM 0.9.1 on openEuler 22.03-LTS-SP4	amd64, arm64
0.9.1-oe2403lts	vLLM 0.9.1 on openEuler 24.03-LTS	amd64, arm64
0.10.1-oe2203sp4	vLLM 0.10.1 on openEuler 22.03-LTS-SP4	amd64, arm64
0.10.1-oe2403lts	vLLM 0.10.1 on openEuler 24.03-LTS	amd64, arm64
0.14.0-oe2403sp3	vLLM 0.14.0 on openEuler 24.03-LTS-SP3	amd64, arm64
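
The tag scheme in the table above can be decoded mechanically. The following is a minimal sketch, assuming every tag follows the `<vLLM version>-oe<YYMM><lts|spN>` pattern seen in the table; the helper name `parse_tag` is hypothetical and is not part of any openEuler or vLLM tooling.

```python
import re

# Pattern inferred from the tags listed in the table, e.g. "0.20.1-oe2403sp3".
TAG_PATTERN = re.compile(
    r"^(?P<vllm>\d+\.\d+\.\d+)-oe(?P<yy>\d{2})(?P<mm>\d{2})(?P<suffix>lts|sp\d+)$"
)


def parse_tag(tag):
    """Split an image tag into (vLLM version, openEuler release)."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag format: {tag!r}")
    # "2403" + "lts" -> "24.03-LTS"; "2403" + "sp3" -> "24.03-LTS-SP3"
    release = f"{match['yy']}.{match['mm']}-LTS"
    if match["suffix"] != "lts":
        release += "-" + match["suffix"].upper()
    return match["vllm"], release


for tag in ("0.20.1-oe2403sp3", "0.9.1-oe2403lts", "0.8.3-oe2203sp4"):
    print(tag, "->", parse_tag(tag))
```

A tag that does not match the assumed pattern, such as `latest`, raises `ValueError` rather than being guessed at.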

Usage

Quick start 1: supported devices

Intel/AMD x86
ARM AArch64

Quick start 2: setup environment using container

# Start an interactive shell in the vLLM container and publish port 8000
docker run --rm --name vllm -p 8000:8000 -it --entrypoint bash openeuler/vllm-cpu:latest

Quick start 3: offline inference

You can use the ModelScope mirror to speed up model downloads:

export VLLM_USE_MODELSCOPE=true
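
If you work from a python3 shell instead of exporting the variable beforehand, the same switch can be set from Python. This is a minimal sketch, assuming the variable should be in place before `vllm` is imported so that it is already visible when vLLM reads its environment settings.

```python
import os

# Same effect as `export VLLM_USE_MODELSCOPE=true`, but limited to this Python process.
# Do this before `import vllm` so the setting is already present when vLLM reads it.
os.environ["VLLM_USE_MODELSCOPE"] = "true"

print(os.environ["VLLM_USE_MODELSCOPE"])
```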

With vLLM installed, you can generate text for a list of input prompts (i.e. offline batch inference).

Run the Python script below directly, or enter it in a python3 shell, to generate text:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# The first run downloads the model weights, which can take a while depending on model size and bandwidth
llm = LLM(model="Qwen/Qwen3-8B")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
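
The script above passes `temperature=0.8` and `top_p=0.95` to `SamplingParams`. To give a feel for what these two settings do, here is a toy pure-Python sketch of temperature scaling followed by top-p (nucleus) truncation on a made-up four-token distribution. It is an illustration only, not vLLM's internal implementation; the function name `top_p_filter` and the logit values are hypothetical.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return renormalised probabilities of the tokens kept by top-p truncation."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


# Made-up logits for four candidate tokens.
candidates = top_p_filter([2.0, 1.0, 0.0, -3.0], temperature=0.8, top_p=0.95)
for token_id, prob in candidates.items():
    print(f"token {token_id}: {prob:.3f}")
```

With these inputs the least likely token falls outside the 0.95 nucleus and is dropped, so sampling would only ever pick among the remaining three.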

Questions and answers

If you have any questions or want to use some special features, please submit an issue or a pull request on openeuler-docker-images.