Foundation Model Support Matrix

Notes:

Click the link in the Model Type column to open the msModelSlim best practices page for that model type, which describes the quantization commands and configurations in detail.
"√" indicates that the quantization strategy is officially verified by msModelSlim: it is fully functional, performs stably, and is recommended.
"-" indicates that the quantization strategy has not been officially verified by msModelSlim. You can configure and try it as needed, but quantization results and functional stability are not guaranteed.
A "model name + quantization mode" combination (for example, w8a8s) marked "quick quantization" can be quantized after installation by running the quick quantization command below.
Newer, more capable Qwen releases are available, and the Qwen1.5-14B/32B/72B models have passed their maintenance period. These older models will be retired, and maintenance of the quantization modes for their deployed versions will be discontinued.

msmodelslim quant --model_path ${MODEL_PATH} --save_path ${SAVE_PATH} --device npu --model_type ${MODEL_TYPE} --quant_type ${QUANT_TYPE} --trust_remote_code True
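
As a rough illustration, the command above can be wrapped in a small shell function that fills in the four placeholders. The function name `quick_quant`, the `DRY_RUN` switch, the paths, and the `Qwen3-32B` and `w8a8` values are assumptions made for this sketch. Check the best practices page of your model type for the exact `--model_type` and `--quant_type` strings it expects.

```shell
# quick_quant MODEL_PATH SAVE_PATH MODEL_TYPE QUANT_TYPE
# Hypothetical helper: assembles the quick quantization command shown above.
# By default it only prints the command; set DRY_RUN=0 to actually run it.
quick_quant() {
  model_path=$1
  save_path=$2
  model_type=$3
  quant_type=$4

  # Rebuild the positional parameters as the full command line.
  set -- msmodelslim quant \
    --model_path "$model_path" \
    --save_path "$save_path" \
    --device npu \
    --model_type "$model_type" \
    --quant_type "$quant_type" \
    --trust_remote_code True

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"      # show what would be executed
  else
    "$@"           # execute msmodelslim with the assembled arguments
  fi
}

# Example with placeholder paths (assumed values, not taken from the documentation).
quick_quant /path/to/Qwen3-32B /path/to/Qwen3-32B-w8a8 Qwen3-32B w8a8
```

Printing the command first makes it easy to confirm the model type and quantization mode against the support matrix before starting a long quantization run.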

For combinations not marked "quick quantization", follow the best practices page of the corresponding model type, and run its commands from that model type's subdirectory (for example, DeepSeek or Qwen3) in the example directory.

Quantization Mode Naming Convention

The format of the quantization mode name is W{weight_bit}A{activation_bit}[C{cache_bit}][S] (written in lowercase in the tables below, for example w8a8c8), where:

{weight_bit} indicates weight quantization bits (such as 8, 4, or 16).
{activation_bit} indicates activation quantization bits (such as 8 or 16).
{cache_bit} (optional) indicates KV cache quantization bits (such as 8).
S (optional) indicates sparse quantization.
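
To make the convention concrete, the sketch below splits a mode name into its fields using only POSIX shell features. The function name `parse_quant_mode` and its output format are illustrative and are not part of msModelSlim. Names that fall outside the pattern above (such as w8a8f8 in the MLLM table) are rejected.

```shell
# parse_quant_mode NAME
# Illustrative helper: decodes W{weight_bit}A{activation_bit}[C{cache_bit}][S].
parse_quant_mode() {
  # Accept lowercase names as used in the tables.
  mode=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')

  # Reject anything that does not match the naming convention.
  if ! printf '%s\n' "$mode" | grep -Eq '^W[0-9]+A[0-9]+(C[0-9]+)?S?$'; then
    echo "invalid quantization mode: $1" >&2
    return 1
  fi

  # Optional trailing S marks sparse quantization.
  sparse=no
  case $mode in
    *S) sparse=yes; mode=${mode%S} ;;
  esac

  # Optional C{cache_bit} gives the KV cache bit width.
  cache=none
  case $mode in
    *C*) cache=${mode##*C}; mode=${mode%C*} ;;
  esac

  # What remains is W{weight_bit}A{activation_bit}.
  weight=${mode#W}
  weight=${weight%A*}
  activation=${mode##*A}

  echo "weight=$weight activation=$activation cache=$cache sparse=$sparse"
}

parse_quant_mode w8a8c8    # prints: weight=8 activation=8 cache=8 sparse=no
parse_quant_mode w16a16s   # prints: weight=16 activation=16 cache=none sparse=yes
```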

Supported LLMs

!!! info "Note"
    This table contains a large amount of data. If it is not fully visible, use the scroll bar at the bottom or hold the mouse wheel to scroll horizontally.

Model Type	Model Name	Dependency Library	w8a16¹	w8a8	w4a8	w4a16	w8a8c8²	w4a8c8²	w8a8s (Sparse Quantization)³	w16a16s (Floating-point Sparse Quantization)³	w4a4
DeepSeek series	DeepSeek-V2-16B	-	√	√	-	-	-	-	-	-	-
	DeepSeek-V2-236B	-	√	√	-	-	-	-	-	-	-
	DeepSeek-Coder-33B	-	√	√	-	-	√	-	-	-	-
	DeepSeek-V3	transformers==4.48.2	-	√	-	-	-	-	-	-	-
	DeepSeek-V3.1	transformers==4.48.2	-	√	√	-	√	√ (quick quantization)	-	-	-
	DeepSeek-V3.2-Exp	transformers==4.48.2	-	√ (quick quantization)	√ (quick quantization)	-	-	-	-	-	-
	DeepSeek-V3.2	transformers==4.48.2	-	√ (quick quantization)	-	-	-	-	-	-	-
	DeepSeek-R1	transformers==4.48.2	-	√	√	-	√	-	-	-	-
	DeepSeek-R1-0528	transformers==4.48.2	-	√	√ (quick quantization)	-	√	√ (quick quantization)	-	-	-
DeepSeek-R1-Distill series	DeepSeek-R1-Distill-Llama-8B	-	-	√	-	-	-	-	√	-	-
	DeepSeek-R1-Distill-Llama-70B	-	-	√	-	-	-	-	-	-	-
	DeepSeek-R1-Distill-Qwen-1.5B	-	-	√	-	-	-	-	√	-	-
	DeepSeek-R1-Distill-Qwen-7B	-	-	√	-	-	-	-	√	-	-
	DeepSeek-R1-Distill-Qwen-14B	-	-	√	-	-	-	-	√	-	-
	DeepSeek-R1-Distill-Qwen-32B	-	-	√	-	-	-	-	√	-	-
Qwen3 series	Qwen3-8B	transformers==4.51.0	-	-	-	-	-	-	√ (quick quantization)	-	-
	Qwen3-14B	transformers==4.51.0	-	√ (quick quantization, supported only by MindIE)⁴	-	-	-	-	√ (quick quantization)	-	-
	Qwen3-32B	transformers==4.51.0	-	√ (quick quantization, supported only by MindIE)⁴	-	-	√ (quick quantization)	-	√ (quick quantization)	√ (quick quantization)	√
Qwen3-MOE series	Qwen3-30B-A3B	transformers==4.51.0	-	√ (quick quantization)	√ (quick quantization)	-	-	-	-	-	-
	Qwen3-235B-A22B	transformers==4.51.0	-	√	√ (quick quantization)	-	-	-	-	-	-
	Qwen3-Coder-480B-A35B	transformers==4.51.0	-	-	√ (quick quantization)	-	-	-	-	-	-
Qwen3.5 series	Qwen3.5-397B-A17B	transformers==5.2.0	-	√ (quick quantization)	√ (quick quantization)	-	-	-	-	-	-
	Qwen3.5-122B-A10B	transformers==5.2.0	-	√ (quick quantization)	-	-	-	-	-	-	-
	Qwen3.5-35B-A3B	transformers==5.2.0	-	√ (quick quantization)	-	-	-	-	-	-	-
	Qwen3.5-27B	transformers==5.2.0	-	√ (quick quantization)	-	-	-	-	-	-	-
Qwen3-Next series	Qwen3-Next-80B-A3B-Instruct	transformers>=4.57.0	-	√ (quick quantization, supported only by vLLM Ascend)	-	-	-	-	-	-	-
Qwen2.5 series	Qwen2.5-7B-Instruct	-	-	√ (quick quantization)	-	-	-	-	√	-	-
	Qwen2.5-14B-Instruct	-	-	√	-	-	-	-	√	-	-
	Qwen2.5-32B-Instruct	-	-	√ (quick quantization)	-	-	-	-	-	-	-
	Qwen2.5-72B-Instruct	-	-	-	-	√	√ (quick quantization)	-	-	-	-
	Qwen2.5-Coder-7B-Instruct	-	-	-	-	-	-	-	√ (quick quantization)	-	-
Qwen2 series	Qwen2-7B	-	-	√	-	-	-	-	√	-	-
Qwen2 series	Qwen2-72B	-	√	√	-	-	√	-	√	-	-
Qwen series	Qwen-7B	-	-	√	-	-	-	-	-	-	-
	Qwen-14B	-	-	√	-	-	-	-	-	-	-
	Qwen-72B	-	√	-	-	-	-	-	-	-	-
	Qwen1.5-14B	-	-	√	-	-	-	-	√	-	-
	Qwen1.5-32B	-	-	√	-	-	-	-	-	-	-
	Qwen1.5-72B	-	√	-	-	-	-	-	-	-	-
	Qwen1.5-110B	-	√	-	-	-	-	-	-	-	-
QwQ series	QwQ-32B	-	-	√ (quick quantization)	-	-	-	-	√ (quick quantization)	-	-
GLM5-MOE series	GLM-5	transformers==5.2.0	-	√	√	-	-	-	-	-	-
GLM series	GLM-4-9B	-	-	√	-	-	√	-	√	-	-
GLM4-MOE series	GLM-4.7	transformers==4.57.3	-	√ (quick quantization, supported only by vLLM Ascend)	-	-	-	-	-	-	-
HunYuan series	Hunyuan-A52B-Instruct	transformers>=4.48.2	-	√	-	-	-	-	-	-	-
InternLM series	InternLM2-20B	-	√	√	-	-	√	-	-	-	-
Llama series	LLaMA-33B	-	-	-	-	-	-	-	√	-	-
	LLaMA-65B	-	√	-	-	-	-	-	-	-	-
	LLaMA2-13B	-	-	√	-	-	-	-	√	-	-
	LLaMA2-7B	-	-	√	-	-	-	-	√	-	-
	LLaMA2-70B	-	√	√	-	-	-	-	-	-	-
	LLaMA3-70B	-	√	-	-	-	-	-	-	-	-
	LLaMA3.1-8B	-	-	√	-	-	-	-	-	-	-
	LLaMA3.1-70B	-	-	√	-	-	√	-	√	-	-

Notes:

¹ The w8a16 quantization mode is supported only by MindIE.
² Both KV cache quantization and FA3 quantization are categorized under c8 because both quantize the K and V caches within the LLM. Only MindIE supports c8 quantization modes, such as w8a8c8 and w4a8c8.
³ For optimal performance, use the decompression features of the Atlas 300I Duo products after compression. Only MindIE supports sparse quantization modes, including w8a8s and w16a16s.
⁴ Only MindIE supports best practices that employ the PDMIX quantization scheme.
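
If you keep a copy of the table above as a tab-separated file, a short filter can list every model and mode pair marked "quick quantization". This is only a sketch: the function name `list_quick_quant` and the file name in the usage comment are hypothetical, and the code assumes the same column layout as the table (model name in column 2, quantization modes from column 4 onward).

```shell
# list_quick_quant FILE
# Illustrative helper: prints "model<TAB>mode" for each cell that contains
# the text "quick quantization" in a tab-separated copy of the support table.
list_quick_quant() {
  awk -F '\t' '
    NR == 1 {
      # Header row: remember the mode name of each column, dropping
      # footnote markers and parenthesised descriptions.
      for (i = 4; i <= NF; i++) {
        mode[i] = $i
        sub(/[^0-9a-z].*$/, "", mode[i])
      }
      next
    }
    {
      for (i = 4; i <= NF; i++)
        if ($i ~ /quick quantization/)
          print $2 "\t" mode[i]
    }
  ' "$1"
}

# Usage (hypothetical file name):
#   list_quick_quant llm_support_matrix.tsv
```

Each output line gives a model name and a mode that can be passed to the quick quantization command, subject to the framework restrictions in the notes above.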

Supported MLLMs

Model Type	Model Name	Dependency Library	w8a8	w8a8c8/w8a8f8	w8a8s (Sparse Quantization)¹
Qwen3-VL series	Qwen3-VL-4B-Instruct	transformers==4.57.1	√	-	-
	Qwen3-VL-8B-Instruct	transformers==4.57.1	-	-	√
	Qwen3-VL-32B-Instruct	transformers==4.57.1	√	-	-
Qwen3-VL-MoE series	Qwen3-VL-235B-A22B	transformers==4.57.1, flax	√ (quick quantization)	-	-
Qwen3-Omni series	Qwen3-Omni-30B-A3B-Thinking	transformers==4.57.3	√	-	-
Qwen3-Omni series	Qwen3-Omni-30B-A3B-Instruct	transformers==4.57.3	√	-	-
Qwen2.5-VL series	Qwen2.5-VL-7B	transformers==4.49.0, qwen_vl_utils	√	-	-
Qwen2.5-VL series	Qwen2.5-VL-72B	transformers==4.49.0, qwen_vl_utils	√	-	-
Qwen2.5-Omni series	Qwen2.5-Omni-7B	transformers==4.57.3	√	-	-
Qwen2-VL series	Qwen2-VL-7B	transformers==4.46.0, qwen_vl_utils	√	-	-
Qwen2-VL series	Qwen2-VL-72B	transformers==4.46.0, qwen_vl_utils	√	-	-
Qwen-VL series	Qwen-VL	transformers-stream-generator	√	-	-
InternVL2 series	InternVL2-8B	transformers==4.46.0, timm, fastchat	√	-	-
InternVL2 series	InternVL2-40B	transformers==4.46.0, timm, fastchat	√	-	-
LLaVA series	LLaVA-1.5-7B	transformers==4.37.2	√	-	-
GLM-4.1V series	GLM-4.1V-9B-Thinking	transformers==4.53.0	-	-	√
GLM-4.6V	GLM-4.6V	transformers==5.0.0rc0	√	-	-
Multimodal generative models	SD3-Medium	diffusers	√	-	-
	Open-Sora-Plan v1.2	huggingface_hub==0.25.2	√	-	-
	FLUX.1-dev	-	√	√ (quick quantization)	-
	HunyuanVideo	-	√	√ (quick quantization)	-
	Wan2.1	-	√ (quick quantization)	-	-
	Wan2.2	-	-	√ (quick quantization)	-
	Qwen-Image-Edit	-	-	√ (quick quantization)	-

Notes:

¹ For optimal performance, use the decompression features of the Atlas 300I Duo series after compression. Only MindIE supports sparse quantization modes.
² FLUX.1-dev, HunyuanVideo, Wan2.2, and Qwen-Image-Edit-2509 support MXFP quantization.