Atlas 300I DUO
Atlas 300I DUO does not support `triton` or `triton-ascend`.
Run vLLM on Atlas 300I DUO
Install Notes
When installed from source, vllm and vllm-ascend may automatically pull in `triton` and `triton-ascend` as dependencies, which can cause unexpected issues on Atlas 300I DUO. Uninstall both packages before running:
pip uninstall -y triton-ascend triton
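As a quick sanity check after uninstalling, a sketch like the following reports whether either package is still visible to the current Python environment. It uses only the standard library; the helper name `installed_distributions` is an illustrative choice and not part of vLLM or vllm-ascend.

```python
from importlib import metadata


def installed_distributions(names):
    """Return {name: version} for each distribution that is still installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment.
            continue
    return found


# Distribution names taken from the uninstall command above.
leftover = installed_distributions(["triton", "triton-ascend"])
if leftover:
    print(f"Still installed, uninstall before running: {leftover}")
else:
    print("Neither triton nor triton-ascend is installed.")
```

Run it with the same interpreter that will launch vLLM, since a different virtual environment may report a different result.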
Graph Mode Notes
The current release supports `FULL_DECODE_ONLY` graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:
- When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.
- There is no such limitation when TP=1.
- We are working with the relevant experts on a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.
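Because of this limit, a multi-TP deployment has to choose which batch sizes to capture. One possible approach is to keep only the largest few candidates, which matches the `[16, 32]` setting used for Qwen3-32B later in this guide. The sketch below assumes you already know the graph budget for your model (2 for Qwen3-32B, per the note above); `select_capture_sizes` and its `graph_budget` argument are hypothetical helpers, not vllm-ascend APIs.

```python
def select_capture_sizes(candidates, tensor_parallel_size, graph_budget=None):
    """Choose cudagraph_capture_sizes under an optional graph-count budget.

    graph_budget is an assumed, user-supplied number of capturable graphs;
    it is only applied when tensor_parallel_size > 1.
    """
    sizes = sorted(set(candidates))
    if tensor_parallel_size == 1 or graph_budget is None:
        # No limitation applies when TP=1.
        return sizes
    if graph_budget < 1:
        return []
    # Keep the largest sizes so high-concurrency decode still hits a graph.
    return sizes[-graph_budget:]


print(select_capture_sizes([1, 2, 4, 8, 16, 32], tensor_parallel_size=1))
print(select_capture_sizes([1, 2, 4, 8, 16, 32], tensor_parallel_size=4, graph_budget=2))
```

Keeping the largest sizes is a judgment call. Whether smaller sizes would serve a workload better depends on its typical decode batch size.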
Deployment
Run the Docker container:
# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0-310p
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8080:8080 \
-it $IMAGE bash
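The `--device` list above assumes eight NPU device nodes (`/dev/davinci0` to `/dev/davinci7`). On a host with a different count, a small sketch like the following could generate the flags from the device nodes that are actually present. `device_flags` is an illustrative helper, and the three management nodes are copied from the command above.

```python
import glob
import re

# Management device nodes copied from the docker command above.
MANAGEMENT_NODES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]


def device_flags(device_paths):
    """Build docker --device flags for numbered davinci nodes plus management nodes."""
    numbered = [p for p in device_paths if re.fullmatch(r"/dev/davinci\d+", p)]
    # Sort numerically so davinci10 comes after davinci9.
    numbered.sort(key=lambda p: int(p.rsplit("davinci", 1)[1]))
    flags = []
    for path in numbered + MANAGEMENT_NODES:
        flags += ["--device", path]
    return flags


# On the host, list whatever numbered davinci nodes exist.
print(" ".join(device_flags(glob.glob("/dev/davinci*"))))
```

The printed string can be pasted into the `docker run` command in place of the hard-coded `--device` lines.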
Set up environment variables:
export VLLM_USE_MODELSCOPE=True
Online Inference on NPU
For Atlas 300I DUO (310P), always pass `--max-model-len` explicitly. Relying on automatic detection of the maximum model length
(that is, omitting the `--max-model-len` argument) may cause an out-of-memory (OOM) error.
The reason lies in the current 310P attention path:
- `AscendAttentionMetadataBuilder310` passes `model_config.max_model_len`
to `AttentionMaskBuilder310`.
- `AttentionMaskBuilder310` builds a full causal mask with shape
`[max_model_len, max_model_len]` in float16, then converts it to FRACTAL_NZ.
- In the 310P `attention_v1` prefill/chunked-prefill path
(`_npu_flash_attention` / `_npu_paged_attention_splitfuse`),
this explicit mask tensor is consumed directly, and there is currently
no compressed-mask path.
If auto detection resolves to a large context length, the mask allocation
(`O(max_model_len^2)`) may exceed NPU memory and trigger OOM.
Always set an explicit and conservative value, for example `--max-model-len 16384`.
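To make the quadratic growth concrete, the following sketch estimates the size of one dense float16 mask of shape `[max_model_len, max_model_len]`. It is a lower-bound estimate only: it assumes 2 bytes per element and ignores any temporary memory used by the FRACTAL_NZ conversion. `causal_mask_bytes` is an illustrative helper, not a vllm-ascend function.

```python
def causal_mask_bytes(max_model_len, bytes_per_element=2):
    """Size of a dense [max_model_len, max_model_len] mask; float16 is 2 bytes."""
    return max_model_len * max_model_len * bytes_per_element


for length in (16384, 20480, 40960, 131072):
    mib = causal_mask_bytes(length) / 2**20
    print(f"max_model_len={length:>6}: {mib:>8.0f} MiB")
```

Under these assumptions the mask takes 512 MiB at 16384 and 800 MiB at 20480, but 32 GiB at 131072. This is why a large auto-detected context length can exhaust NPU memory before any weights or KV cache are counted.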
Run the following commands to start the vLLM server on NPU for the Qwen3 Dense series.
Prepare Model Weights
Use the W8A8SC quantized weights from the Eco-Tech official ModelScope repository.
| Model | ModelScope Link |
| --- | --- |
| Qwen3-8B-W8A8SC-310 | [Eco-Tech/Qwen3-8B-w8a8sc-310-vllm](https://www.modelscope.cn/models/Eco-Tech/Qwen3-8B-w8a8sc-310-vllm) |
| Qwen3-14B-W8A8SC-310 | [Eco-Tech/Qwen3-14B-w8a8sc-310-vllm](https://www.modelscope.cn/models/Eco-Tech/Qwen3-14B-w8a8sc-310-vllm) |
| Qwen3-32B-W8A8SC-310 | [Eco-Tech/Qwen3-32B-w8a8sc-310-vllm](https://www.modelscope.cn/models/Eco-Tech/Qwen3-32B-w8a8sc-310-vllm) |
Qwen3-8B-W8A8SC
vllm serve Eco-Tech/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1 \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.90 \
--max_num_seqs 32 \
--served_model_name qwen \
--dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
--quantization ascend \
--max_model_len 16384 \
--no-enable-prefix-caching \
--load_format sharded_state
Qwen3-14B-W8A8SC
vllm serve Eco-Tech/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1 \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.90 \
--max_num_seqs 16 \
--served_model_name qwen \
--dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
--quantization ascend \
--max_model_len 16384 \
--no-enable-prefix-caching \
--load_format sharded_state
Qwen3-32B-W8A8SC
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
vllm serve Eco-Tech/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4 \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 4 \
--gpu_memory_utilization 0.90 \
--max_num_seqs 32 \
--served_model_name qwen \
--dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
--quantization ascend \
--max_model_len 20480 \
--no-enable-prefix-caching \
--load_format sharded_state
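The three commands differ only in model path, tensor-parallel size, `max_num_seqs`, capture sizes, and `max_model_len`. As a sketch, those differences can be kept in one table and expanded into an argument list. All values and flag spellings are copied from the commands above; `PRESETS` and `build_serve_args` are illustrative names, not part of vLLM, and the path pattern is inferred from the three examples only.

```python
import json
import shlex

# Per-model settings copied from the three commands above.
PRESETS = {
    "8B": {"tp": 1, "max_num_seqs": 32, "capture_sizes": [1, 2, 4, 8, 16, 32], "max_model_len": 16384},
    "14B": {"tp": 1, "max_num_seqs": 16, "capture_sizes": [1, 2, 4, 8, 16], "max_model_len": 16384},
    "32B": {"tp": 4, "max_num_seqs": 32, "capture_sizes": [16, 32], "max_model_len": 20480},
}


def build_serve_args(size, host="127.0.0.1", port=8080):
    """Expand one preset into the argument list for `vllm serve`."""
    preset = PRESETS[size]
    tp = preset["tp"]
    name = f"Qwen3-{size}-w8a8sc-310-vllm"
    # Path pattern inferred from the examples; other TP variants may not exist.
    model = f"Eco-Tech/{name}/TP{tp}/{name}-tp{tp}"
    compilation = {
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": preset["capture_sizes"],
    }
    additional = {"ascend_compilation_config": {"fuse_norm_quant": False}}
    return [
        "vllm", "serve", model,
        "--host", host,
        "--port", str(port),
        "--tensor-parallel-size", str(tp),
        "--gpu_memory_utilization", "0.90",
        "--max_num_seqs", str(preset["max_num_seqs"]),
        "--served_model_name", "qwen",
        "--dtype", "float16",
        "--additional-config", json.dumps(additional),
        "--compilation-config", json.dumps(compilation),
        "--quantization", "ascend",
        "--max_model_len", str(preset["max_model_len"]),
        "--no-enable-prefix-caching",
        "--load_format", "sharded_state",
    ]


# Print a shell-quoted command line for the 32B preset.
print(shlex.join(build_serve_args("32B")))
```

For the 32B preset, `ASCEND_RT_VISIBLE_DEVICES=0,1,2,3` still has to be exported separately, as in the command above.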
Once the server is started, you can query the model with input prompts:
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0.0
    }'
If the request succeeds, the response contains the generated text.
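The same query can be sent from Python. The sketch below uses only the standard library and targets the OpenAI-compatible `/v1/completions` endpoint, where the output length is controlled by `max_tokens`. The model name `qwen` is taken from `--served_model_name` in the serve commands; `build_completion_request` and `complete` are illustrative helper names.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8080",
                             max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": "qwen",  # matches --served_model_name in the serve commands
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("The future of AI is"))` prints the generated continuation.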
Offline Inference
Save the following script as `example.py` to run offline inference on NPU.
:::::{tab-set}
:sync-group: inference

::::{tab-item} Qwen3-8B-W8A8SC
:selected:
:sync: qwen3-8b

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    # Tear down distributed state and release cached NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)
llm = LLM(
    model="Eco-Tech/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1",
    tensor_parallel_size=1,
    max_model_len=16384,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
::::
::::{tab-item} Qwen3-14B-W8A8SC
:sync: qwen3-14b

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    # Tear down distributed state and release cached NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)
llm = LLM(
    model="Eco-Tech/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1",
    tensor_parallel_size=1,
    max_model_len=16384,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
::::
::::{tab-item} Qwen3-32B-W8A8SC
:sync: qwen3-32b

import gc
import os

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    # Tear down distributed state and release cached NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


# Restrict the run to four NPUs before the engine is created.
os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1,2,3"

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)
llm = LLM(
    model="Eco-Tech/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4",
    tensor_parallel_size=4,
    max_model_len=20480,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [16, 32],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
::::
:::::
Run the script:
python example.py
If the script runs successfully, you can see the generated result.
Closing Notes
For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.