[Doc][310p] Add the 310p guide (#8640)

Atlas 300I DUO

Atlas 300I DUO does not support `triton` or `triton-ascend`.

Run vLLM on Atlas 300I DUO

Install Notes

When installing from source, vllm and vllm-ascend may automatically pull in `triton` and `triton-ascend` as dependencies, which can cause unexpected issues on Atlas 300I DUO. Uninstall both packages before running on Atlas 300I DUO:

pip uninstall -y triton-ascend triton
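
To confirm that the uninstall took effect, the installed distributions can be queried from Python. The following is a minimal sketch that uses only the standard library; the helper name `find_installed` is illustrative and is not part of vLLM or vllm-ascend.

```python
from importlib import metadata


def find_installed(packages):
    """Return {name: version} for each listed distribution that is still installed."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, which is the desired state on Atlas 300I DUO.
            pass
    return found


if __name__ == "__main__":
    leftover = find_installed(["triton", "triton-ascend"])
    if leftover:
        print(f"Still installed, uninstall before running: {leftover}")
    else:
        print("Neither triton nor triton-ascend is installed.")
```

Run it inside the same Python environment (or container) that will launch vLLM, since the check only sees packages visible to that interpreter.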

Graph Mode Notes

The current release supports `FULL_DECODE_ONLY` graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:

- When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.
- There is no such limitation when TP=1.
- A solution is being worked on with the relevant experts. A software-based fix is considered feasible, but full support will take additional time.
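
One way to respect a limit on the number of captured graphs is to keep only the largest batch sizes in `cudagraph_capture_sizes`. The sketch below is an illustrative heuristic, not a vLLM or vllm-ascend API: the helper names `pick_capture_sizes` and `compilation_config` are hypothetical, and the choice to keep the largest power-of-two sizes is an assumption modeled on the example configurations in this guide. The limit of 2 graphs is the value reported above for Qwen3-32B; other models may differ.

```python
import json


def pick_capture_sizes(max_num_seqs, max_graphs=None):
    """Powers of two up to max_num_seqs, keeping only the largest max_graphs entries."""
    sizes = []
    size = 1
    while size <= max_num_seqs:
        sizes.append(size)
        size *= 2
    if max_graphs is not None:
        # Drop the smallest sizes first so large decode batches stay captured.
        sizes = sizes[max(0, len(sizes) - max_graphs):]
    return sizes


def compilation_config(max_num_seqs, max_graphs=None):
    """Build the JSON string passed to --compilation-config."""
    return json.dumps({
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": pick_capture_sizes(max_num_seqs, max_graphs),
    })


if __name__ == "__main__":
    print(compilation_config(32))                # TP=1: no graph-count limit
    print(compilation_config(32, max_graphs=2))  # TP>1 with a limit of 2 graphs
```

With `max_num_seqs=32` and no limit this yields `[1, 2, 4, 8, 16, 32]`; with a limit of 2 it yields `[16, 32]`.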

Deployment

Run docker container:

# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0-310p

docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8080:8080 \
-it $IMAGE bash
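
Before starting the container, it can help to check that every device node and driver path passed to `docker run` exists on the host, since a missing path makes the command fail or leaves the NPU unusable inside the container. This is a small sketch using only the standard library; `REQUIRED_PATHS` and `missing_paths` are illustrative names, and the list assumes the eight `davinci` devices shown in the command above. Adjust it to the number of devices on your host.

```python
import os

# Host paths referenced by the docker run command above.
REQUIRED_PATHS = [f"/dev/davinci{i}" for i in range(8)] + [
    "/dev/davinci_manager",
    "/dev/devmm_svm",
    "/dev/hisi_hdc",
    "/usr/local/dcmi",
    "/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info",
]


def missing_paths(paths):
    """Return the subset of paths that do not exist on this host."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    missing = missing_paths(REQUIRED_PATHS)
    if missing:
        print("Missing on host:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All device nodes and driver paths are present.")
```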

Set up environment variables:

export VLLM_USE_MODELSCOPE=True

Online Inference on NPU

For Atlas 300I DUO (310P), do not rely on `max-model-len` auto detection
(that is, do not omit the `--max-model-len` argument), because it may cause OOM.

Reason, based on the current 310P attention path:

- `AscendAttentionMetadataBuilder310` passes `model_config.max_model_len`
  to `AttentionMaskBuilder310`.
- `AttentionMaskBuilder310` builds a full causal mask with shape
  `[max_model_len, max_model_len]` in float16, then converts it to FRACTAL_NZ.
- In the 310P `attention_v1` prefill/chunked-prefill path
  (`_npu_flash_attention` / `_npu_paged_attention_splitfuse`),
  this explicit mask tensor is consumed directly, and there is currently
  no compressed-mask path.

If auto detection resolves to a large context length, the mask allocation
(`O(max_model_len^2)`) may exceed NPU memory and trigger OOM.
Always set an explicit and conservative value, for example `--max-model-len 16384`.
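
To make the quadratic growth concrete, the size of the dense float16 mask alone can be estimated as `max_model_len * max_model_len * 2` bytes. The sketch below is back-of-the-envelope arithmetic only: it counts the mask tensor itself and ignores any extra memory used during the FRACTAL_NZ conversion, weights, or KV cache. The function name `causal_mask_bytes` is illustrative, and the two larger lengths are hypothetical values chosen to show the trend.

```python
def causal_mask_bytes(max_model_len, bytes_per_element=2):
    """Size of a dense [max_model_len, max_model_len] mask; float16 uses 2 bytes per element."""
    return max_model_len * max_model_len * bytes_per_element


if __name__ == "__main__":
    for length in (16384, 20480, 40960, 131072):
        mib = causal_mask_bytes(length) / 2**20
        print(f"max_model_len={length}: {mib:.0f} MiB")
```

This gives 512 MiB at 16384 and 800 MiB at 20480, but 3200 MiB at 40960 and 32768 MiB at 131072, which is why a large auto-detected context length can exhaust NPU memory before the model serves a single request.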

Run the following commands to start the vLLM server on NPU for the Qwen3 Dense series.

Prepare Model Weights

Use the W8A8SC quantized weights from the Eco-Tech official ModelScope repository.

| Model | ModelScope Link |
| --- | --- |
| Qwen3-8B-W8A8SC-310 | [Eco-Tech/Qwen3-8B-w8a8sc-310-vllm](https://www.modelscope.cn/models/Eco-Tech/Qwen3-8B-w8a8sc-310-vllm) |
| Qwen3-14B-W8A8SC-310 | [Eco-Tech/Qwen3-14B-w8a8sc-310-vllm](https://www.modelscope.cn/models/Eco-Tech/Qwen3-14B-w8a8sc-310-vllm) |
| Qwen3-32B-W8A8SC-310 | [Eco-Tech/Qwen3-32B-w8a8sc-310-vllm](https://www.modelscope.cn/models/Eco-Tech/Qwen3-32B-w8a8sc-310-vllm) |

Qwen3-8B-W8A8SC

vllm serve Eco-Tech/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1 \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 1 \
    --gpu_memory_utilization 0.90 \
    --max_num_seqs 32 \
    --served_model_name qwen \
    --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
    --quantization ascend \
    --max_model_len 16384 \
    --no-enable-prefix-caching \
    --load_format sharded_state

Qwen3-14B-W8A8SC

vllm serve Eco-Tech/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1 \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 1 \
    --gpu_memory_utilization 0.90 \
    --max_num_seqs 16 \
    --served_model_name qwen \
    --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
    --quantization ascend \
    --max_model_len 16384 \
    --no-enable-prefix-caching \
    --load_format sharded_state

Qwen3-32B-W8A8SC

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

vllm serve Eco-Tech/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4 \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 4 \
    --gpu_memory_utilization 0.90 \
    --max_num_seqs 32 \
    --served_model_name qwen \
    --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
    --quantization ascend \
    --max_model_len 20480 \
    --no-enable-prefix-caching \
    --load_format sharded_state

Once the server is started, you can query the model with input prompts:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "prompt": "The future of AI is",
    "max_tokens": 64,
    "temperature": 0.0
  }'

If the request succeeds, the response contains the generated text.
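
The same endpoint can also be queried from Python without extra dependencies. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8080` and was started with `--served_model_name qwen`, as in the commands above. It uses the `max_tokens` field of the completions API. The helper names `build_completion_request` and `complete` are illustrative.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8080",
                             model="qwen", max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("The future of AI is"))` prints the generated continuation.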

Offline Inference

Run the following script, example.py, to execute offline inference on NPU.

:::::{tab-set}
:sync-group: inference

::::{tab-item} Qwen3-8B-W8A8SC
:selected:
:sync: qwen3-8b

import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)

llm = LLM(
    model="Eco-Tech/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1",
    tensor_parallel_size=1,
    max_model_len=16384,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()

::::

::::{tab-item} Qwen3-14B-W8A8SC
:sync: qwen3-14b

import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)

llm = LLM(
    model="Eco-Tech/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1",
    tensor_parallel_size=1,
    max_model_len=16384,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()

::::

::::{tab-item} Qwen3-32B-W8A8SC
:sync: qwen3-32b

import gc
import os
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1,2,3"

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)

llm = LLM(
    model="Eco-Tech/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4",
    tensor_parallel_size=4,
    max_model_len=20480,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [16, 32],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()

::::
:::::

Run script:

python example.py

If the script runs successfully, you can see the generated result.

Closing Notes

For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.