Foundation Model Support Matrix

Notes:

  • Click a link in the Model Type column to open the best practices page recommended by msModelSlim, which describes the quantization commands and configurations in detail.
  • √ indicates that the quantization strategy has been officially verified by msModelSlim. It is fully functional, performs stably, and is recommended.
  • - indicates that the quantization strategy has not been officially verified by msModelSlim. You can configure and try it as needed, but quantization quality and functional stability are not guaranteed.
  • For a model and quantization mode combination (for example, Qwen3-8B with w8a8s) marked "quick quantization", you can run the following quick quantization command after installation to quantize the model:

    msmodelslim quant --model_path ${MODEL_PATH} --save_path ${SAVE_PATH} --device npu --model_type ${MODEL_TYPE} --quant_type ${QUANT_TYPE} --trust_remote_code True

  • Because newer, more capable versions of the Qwen series have been released, the Qwen1.5-14B/32B/72B models are past their maintenance period. These older models will be sunset, and maintenance of the quantization modes for their live-network versions will be discontinued.
  • For combinations not marked "quick quantization", read the best practices page of the corresponding model type, then run the command from that model type's subdirectory in the example directory (for example, DeepSeek or Qwen3).
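
The quick quantization command from the notes above can also be assembled in a script. The following is a minimal sketch, not an official msModelSlim API: the helper name `build_quick_quant_command`, the example paths, and the assumption that `--model_type` and `--quant_type` accept the model names and mode names exactly as written in the tables are all illustrative. Only the flags that appear in the command above are used.

```python
import shlex


def build_quick_quant_command(model_path, save_path, model_type, quant_type):
    """Return the quick quantization command as an argument list.

    Hypothetical helper: it only mirrors the flags shown in the notes above.
    """
    return [
        "msmodelslim", "quant",
        "--model_path", model_path,
        "--save_path", save_path,
        "--device", "npu",
        "--model_type", model_type,
        "--quant_type", quant_type,
        "--trust_remote_code", "True",
    ]


# Example values are assumptions; substitute your own paths and a
# combination that the tables mark as "quick quantization".
cmd = build_quick_quant_command(
    model_path="/path/to/Qwen3-8B",
    save_path="/path/to/Qwen3-8B-w8a8s",
    model_type="Qwen3-8B",
    quant_type="w8a8s",
)

# Print a shell-safe rendering for review before running it,
# for example with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.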

Quantization Mode Naming Convention

The format of the quantization mode name is W{weight_bit}A{activation_bit}[C{cache_bit}][S], where

  • {weight_bit} indicates weight quantization bits (such as 8, 4, or 16).
  • {activation_bit} indicates activation quantization bits (such as 8 or 16).
  • {cache_bit} (optional) indicates KV cache quantization bits (such as 8).
  • S (optional) indicates sparse quantization.
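
As an illustration of the naming convention, the sketch below splits a mode name into its components with a regular expression. The `QuantMode` type and `parse_quant_mode` function are hypothetical names introduced for this example; they are not part of msModelSlim. Names that fall outside the convention described above (for example, `w8a8f8`) are rejected.

```python
import re
from typing import NamedTuple, Optional


class QuantMode(NamedTuple):
    weight_bit: int
    activation_bit: int
    cache_bit: Optional[int]  # None when the name has no C{cache_bit} part
    sparse: bool              # True when the name ends with S


# W{weight_bit}A{activation_bit}[C{cache_bit}][S], matched case-insensitively
_MODE_PATTERN = re.compile(r"w(\d+)a(\d+)(?:c(\d+))?(s)?", re.IGNORECASE)


def parse_quant_mode(name: str) -> QuantMode:
    """Parse a quantization mode name such as 'w8a8c8' or 'w16a16s'."""
    match = _MODE_PATTERN.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not a valid quantization mode name: {name!r}")
    weight, activation, cache, sparse = match.groups()
    return QuantMode(
        weight_bit=int(weight),
        activation_bit=int(activation),
        cache_bit=int(cache) if cache is not None else None,
        sparse=sparse is not None,
    )


print(parse_quant_mode("w8a8c8"))
# QuantMode(weight_bit=8, activation_bit=8, cache_bit=8, sparse=False)
print(parse_quant_mode("w16a16s"))
# QuantMode(weight_bit=16, activation_bit=16, cache_bit=None, sparse=True)
```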

Supported LLMs

!!! info "Note"
    This table contains a large amount of data. If it is not fully visible, use the scroll bar at the bottom or hold the mouse wheel to scroll horizontally.

Model Type Model Name Dependency Library w8a16[1] w8a8 w4a8 w4a16 w8a8c8[2] w4a8c8[2] w8a8s (Sparse Quantization)[3] w16a16s (Floating-point Sparse Quantization)[3] w4a4
DeepSeek series DeepSeek-V2-16B - - - - - - - -
DeepSeek-V2-236B - - - - - - - -
DeepSeek-Coder-33B - - - - - - -
DeepSeek-V3 transformers==4.48.2 - - - - - - - -
DeepSeek-V3.1 transformers==4.48.2 - - √ (quick quantization) - - -
DeepSeek-V3.2-Exp transformers==4.48.2 - √ (quick quantization) √ (quick quantization) - - - - - -
DeepSeek-V3.2 transformers==4.48.2 - √ (quick quantization) - - - - - - -
DeepSeek-R1 transformers==4.48.2 - - - - - -
DeepSeek-R1-0528 transformers==4.48.2 - √ (quick quantization) - √ (quick quantization) - - -
DeepSeek-R1-Distill series DeepSeek-R1-Distill-Llama-8B - - - - - - - -
DeepSeek-R1-Distill-Llama-70B - - - - - - - - -
DeepSeek-R1-Distill-Qwen-1.5B - - - - - - - -
DeepSeek-R1-Distill-Qwen-7B - - - - - - - -
DeepSeek-R1-Distill-Qwen-14B - - - - - - - -
DeepSeek-R1-Distill-Qwen-32B - - - - - - - -
Qwen3 series Qwen3-8B transformers==4.51.0 - - - - - - √ (quick quantization) - -
Qwen3-14B transformers==4.51.0 - √ (quick quantization, supported only by MindIE)[4] - - - - √ (quick quantization) - -
Qwen3-32B transformers==4.51.0 - √ (quick quantization, supported only by MindIE)[4] - - √ (quick quantization) - √ (quick quantization) √ (quick quantization)
Qwen3-MOE series Qwen3-30B-A3B transformers==4.51.0 - √ (quick quantization) √ (quick quantization) - - - - - -
Qwen3-235B-A22B transformers==4.51.0 - √ (quick quantization) - - - - - -
Qwen3-Coder-480B-A35B transformers==4.51.0 - - √ (quick quantization) - - - - - -
Qwen3.5 series Qwen3.5-397B-A17B transformers==5.2.0 - √ (quick quantization) √ (quick quantization) - - - - - -
Qwen3.5-122B-A10B transformers==5.2.0 - √ (quick quantization) - - - - - - -
Qwen3.5-35B-A3B transformers==5.2.0 - √ (quick quantization) - - - - - - -
Qwen3.5-27B transformers==5.2.0 - √ (quick quantization) - - - - - - -
Qwen3-Next series Qwen3-Next-80B-A3B-Instruct transformers>=4.57.0 - √ (quick quantization, supported only by vLLM Ascend) - - - - - - -
Qwen2.5 series Qwen2.5-7B-Instruct - - √ (quick quantization) - - - - - -
Qwen2.5-14B-Instruct - - - - - - - -
Qwen2.5-32B-Instruct - - √ (quick quantization) - - - - - - -
Qwen2.5-72B-Instruct - - - - √ (quick quantization) - - - -
Qwen2.5-Coder-7B-Instruct - - - - - - - √ (quick quantization) - -
Qwen2 series Qwen2-7B - - - - - - - -
Qwen2-72B - - - - - -
Qwen series Qwen-7B - - - - - - - - -
Qwen-14B - - - - - - - - -
Qwen-72B - - - - - - - - -
Qwen1.5-14B - - - - - - - -
Qwen1.5-32B - - - - - - - - -
Qwen1.5-72B - - - - - - - - -
Qwen1.5-110B - - - - - - - - -
QwQ series QwQ-32B - - √ (quick quantization) - - - - √ (quick quantization) - -
GLM5-MOE series GLM-5 transformers==5.2.0 - - - - - - -
GLM series GLM-4-9B - - - - - - -
GLM4-MOE series GLM-4.7 transformers==4.57.3 - √ (quick quantization, supported only by vLLM Ascend) - - - - - - -
HunYuan series Hunyuan-A52B-Instruct transformers>=4.48.2 - - - - - - - -
InternLM series InternLM2-20B - - - - - - -
Llama series LLaMA-33B - - - - - - - - -
LLaMA-65B - - - - - - - - -
LLaMA2-13B - - - - - - - -
LLaMA2-7B - - - - - - - -
LLaMA2-70B - - - - - - - -
LLaMA3-70B - - - - - - - - -
LLaMA3.1-8B - - - - - - - - -
LLaMA3.1-70B - - - - - - -

Notes:

  • [1] The w8a16 quantization mode is supported only by MindIE.
  • [2] KVCache quantization and FA3 quantization are both categorized under c8, because both quantize the K and V caches within the LLM. Only MindIE supports c8 quantization modes, such as w8a8c8 and w4a8c8.
  • [3] For optimal performance, use the decompression features of the Atlas 300I Duo products after compression. Only MindIE supports sparse quantization modes, including w8a8s and w16a16s.
  • [4] Only MindIE supports best practices that employ the PDMIX quantization scheme.
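
The mode-level restrictions in notes 1 to 3 can be summarized as a small check. This is a sketch under stated assumptions: `requires_mindie` is a hypothetical helper, it treats any mode with a cache component as a c8-style mode, and it does not cover the per-model restrictions recorded in individual table cells (such as note 4 or "supported only by vLLM Ascend").

```python
import re

# W{weight_bit}A{activation_bit}[C{cache_bit}][S], matched case-insensitively
_MODE_PATTERN = re.compile(r"w(\d+)a(\d+)(?:c(\d+))?(s)?", re.IGNORECASE)


def requires_mindie(mode: str) -> bool:
    """Return True if notes 1 to 3 list the mode as supported only by MindIE."""
    match = _MODE_PATTERN.fullmatch(mode.strip())
    if match is None:
        raise ValueError(f"not a valid quantization mode name: {mode!r}")
    weight, activation, cache, sparse = match.groups()
    is_w8a16 = (weight, activation) == ("8", "16")   # note 1
    has_cache_quant = cache is not None              # note 2 (assumed: any C part)
    is_sparse = sparse is not None                   # note 3
    return is_w8a16 or has_cache_quant or is_sparse


for mode in ("w8a16", "w8a8", "w8a8c8", "w16a16s", "w4a4"):
    print(mode, requires_mindie(mode))
```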

Supported MLLMs

Model Type Model Name Dependency Library w8a8 w8a8c8/w8a8f8 w8a8s (Sparse Quantization)[1]
Qwen3-VL series Qwen3-VL-4B-Instruct transformers==4.57.1 - -
Qwen3-VL-8B-Instruct transformers==4.57.1 - -
Qwen3-VL-32B-Instruct transformers==4.57.1 - -
Qwen3-VL-MoE series Qwen3-VL-235B-A22B transformers==4.57.1, flax √ (quick quantization) - -
Qwen3-Omni series Qwen3-Omni-30B-A3B-Thinking transformers==4.57.3 - -
Qwen3-Omni-30B-A3B-Instruct transformers==4.57.3 - -
Qwen2.5-VL series Qwen2.5-VL-7B transformers==4.49.0, qwen_vl_utils - -
Qwen2.5-VL-72B transformers==4.49.0, qwen_vl_utils - -
Qwen2.5-Omni series Qwen2.5-Omni-7B transformers==4.57.3 - -
Qwen2-VL series Qwen2-VL-7B transformers==4.46.0, qwen_vl_utils - -
Qwen2-VL-72B transformers==4.46.0, qwen_vl_utils - -
Qwen-VL series Qwen-VL transformers-stream-generator - -
InternVL2 series InternVL2-8B transformers==4.46.0, timm, fastchat - -
InternVL2-40B transformers==4.46.0, timm, fastchat - -
LLaVA series LLaVA-1.5-7B transformers==4.37.2 - -
GLM-4.1V series GLM-4.1V-9B-Thinking transformers==4.53.0 - -
GLM-4.6V GLM-4.6V transformers==5.0.0rc0 - -
Multimodal generative models SD3-Medium diffusers - -
Open-Sora-Plan v1.2 huggingface_hub==0.25.2 - -
FLUX.1-dev - √ (quick quantization) -
HunyuanVideo - √ (quick quantization) -
Wan2.1 - √ (quick quantization) - -
Wan2.2 - - √ (quick quantization) -
Qwen-Image-Edit - - √ (quick quantization) -

Notes:

  • [1] For optimal performance, use the decompression features of the Atlas 300I Duo series after compression. Only MindIE supports sparse quantization modes.
  • [2] FLUX.1-dev, HunyuanVideo, Wan2.2, and Qwen-Image-Edit-2509 support MXFP quantization.