Foundation Model Support Matrix
Notes:
- You can click the link in the Model Type column to go to the best practices page recommended by msModelSlim. This page describes the quantization commands and configurations in detail.
- √ indicates that the quantization strategy is officially verified by msModelSlim. It is functionally complete, performs stably, and is recommended for use.
- A hyphen (-) indicates that the quantization strategy has not been officially verified by msModelSlim. You can configure and try the strategy as needed, but quantization quality and functional stability are not guaranteed.
- For a combination of model name and quantization mode (such as w8a8s) that is marked "quick quantization", you can run the following quick quantization command after installation to quantize the model:
msmodelslim quant --model_path ${MODEL_PATH} --save_path ${SAVE_PATH} --device npu --model_type ${MODEL_TYPE} --quant_type ${QUANT_TYPE} --trust_remote_code True
- Because newer, more capable versions of the Qwen series have been released, the Qwen1.5-14B/32B/72B models have passed their maintenance period. These older models will be retired, and maintenance of the quantization modes for their deployed (live-network) versions will be discontinued.
- For combinations not marked "quick quantization", follow the best practices page of the corresponding model type, and run the commands from that model type's subdirectory (such as DeepSeek or Qwen3) in the example directory.
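The quick quantization command above can be assembled programmatically, which helps when scripting several model and mode combinations. The following is a minimal sketch, not part of msModelSlim: the helper name `build_quick_quant_command` and the paths are hypothetical, and the values passed as `--model_type` and `--quant_type` are assumed to match the names shown in the tables. Confirm the exact accepted strings on the best practices page before running.

```python
import shlex


def build_quick_quant_command(model_path, save_path, model_type, quant_type):
    """Return the quick quantization command as an argument list.

    Only the flags shown in this document are used; no other options are assumed.
    """
    return [
        "msmodelslim", "quant",
        "--model_path", model_path,
        "--save_path", save_path,
        "--device", "npu",
        "--model_type", model_type,
        "--quant_type", quant_type,
        "--trust_remote_code", "True",
    ]


# Placeholder values for illustration only (hypothetical paths; model type and
# quantization mode strings are assumed to follow the table entries).
cmd = build_quick_quant_command(
    model_path="/path/to/Qwen3-32B",
    save_path="/path/to/output",
    model_type="Qwen3-32B",
    quant_type="w8a8",
)

# Dry run: print the command instead of executing it.
print(shlex.join(cmd))

# To execute on a machine with msModelSlim installed:
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.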
Quantization Mode Naming Convention
The format of the quantization mode name is W{weight_bit}A{activation_bit}[C{cache_bit}][S], where:
- {weight_bit} indicates the weight quantization bit width (such as 8, 4, or 16).
- {activation_bit} indicates the activation quantization bit width (such as 8 or 16).
- {cache_bit} (optional) indicates the KV cache quantization bit width (such as 8).
- S (optional) indicates sparse quantization.
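The naming convention can be expressed as a regular expression. The sketch below is illustrative only: `QuantMode` and `parse_quant_mode` are hypothetical names, not msModelSlim APIs, and matching is case-insensitive on the assumption that the lowercase names used in the tables (such as w8a8c8) follow the same pattern.

```python
import re
from typing import NamedTuple, Optional


class QuantMode(NamedTuple):
    weight_bit: int
    activation_bit: int
    cache_bit: Optional[int]  # None when the optional C{cache_bit} part is absent
    sparse: bool              # True when the optional trailing S is present


# W{weight_bit}A{activation_bit}[C{cache_bit}][S]
_PATTERN = re.compile(r"W(\d+)A(\d+)(?:C(\d+))?(S)?", re.IGNORECASE)


def parse_quant_mode(name: str) -> QuantMode:
    """Split a quantization mode name into its documented components."""
    match = _PATTERN.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not a W..A..[C..][S] mode name: {name!r}")
    weight, activation, cache, sparse = match.groups()
    return QuantMode(
        weight_bit=int(weight),
        activation_bit=int(activation),
        cache_bit=int(cache) if cache is not None else None,
        sparse=sparse is not None,
    )


print(parse_quant_mode("w8a8c8"))
# QuantMode(weight_bit=8, activation_bit=8, cache_bit=8, sparse=False)
print(parse_quant_mode("W16A16S"))
# QuantMode(weight_bit=16, activation_bit=16, cache_bit=None, sparse=True)
```

Names that fall outside this convention, such as w8a8f8 in the MLLM table, are rejected with a `ValueError` by this sketch.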
Supported LLMs
!!! info "Note"
    This table contains a large amount of data. If it is not fully visible, use the scroll bar at the bottom or hold the mouse wheel to scroll horizontally.
| Model Type | Model Name | Dependency Library | w8a16<sup>1</sup> | w8a8 | w4a8 | w4a16 | w8a8c8<sup>2</sup> | w4a8c8<sup>2</sup> | w8a8s (Sparse Quantization)<sup>3</sup> | w16a16s (Floating-point Sparse Quantization)<sup>3</sup> | w4a4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek series | DeepSeek-V2-16B | - | √ | √ | - | - | - | - | - | - | - |
| DeepSeek series | DeepSeek-V2-236B | - | √ | √ | - | - | - | - | - | - | - |
| DeepSeek series | DeepSeek-Coder-33B | - | √ | √ | - | - | √ | - | - | - | - |
| DeepSeek series | DeepSeek-V3 | transformers==4.48.2 | - | √ | - | - | - | - | - | - | - |
| DeepSeek series | DeepSeek-V3.1 | transformers==4.48.2 | - | √ | √ | - | √ | √ (quick quantization) | - | - | - |
| DeepSeek series | DeepSeek-V3.2-Exp | transformers==4.48.2 | - | √ (quick quantization) | √ (quick quantization) | - | - | - | - | - | - |
| DeepSeek series | DeepSeek-V3.2 | transformers==4.48.2 | - | √ (quick quantization) | - | - | - | - | - | - | - |
| DeepSeek series | DeepSeek-R1 | transformers==4.48.2 | - | √ | √ | - | √ | - | - | - | - |
| DeepSeek series | DeepSeek-R1-0528 | transformers==4.48.2 | - | √ | √ (quick quantization) | - | √ | √ (quick quantization) | - | - | - |
| DeepSeek-R1-Distill series | DeepSeek-R1-Distill-Llama-8B | - | - | √ | - | - | - | - | √ | - | - |
| DeepSeek-R1-Distill series | DeepSeek-R1-Distill-Llama-70B | - | - | √ | - | - | - | - | - | - | - |
| DeepSeek-R1-Distill series | DeepSeek-R1-Distill-Qwen-1.5B | - | - | √ | - | - | - | - | √ | - | - |
| DeepSeek-R1-Distill series | DeepSeek-R1-Distill-Qwen-7B | - | - | √ | - | - | - | - | √ | - | - |
| DeepSeek-R1-Distill series | DeepSeek-R1-Distill-Qwen-14B | - | - | √ | - | - | - | - | √ | - | - |
| DeepSeek-R1-Distill series | DeepSeek-R1-Distill-Qwen-32B | - | - | √ | - | - | - | - | √ | - | - |
| Qwen3 series | Qwen3-8B | transformers==4.51.0 | - | - | - | - | - | - | √ (quick quantization) | - | - |
| Qwen3 series | Qwen3-14B | transformers==4.51.0 | - | √ (quick quantization, supported only by MindIE)<sup>4</sup> | - | - | - | - | √ (quick quantization) | - | - |
| Qwen3 series | Qwen3-32B | transformers==4.51.0 | - | √ (quick quantization, supported only by MindIE)<sup>4</sup> | - | - | √ (quick quantization) | - | √ (quick quantization) | √ (quick quantization) | √ |
| Qwen3-MOE series | Qwen3-30B-A3B | transformers==4.51.0 | - | √ (quick quantization) | √ (quick quantization) | - | - | - | - | - | - |
| Qwen3-MOE series | Qwen3-235B-A22B | transformers==4.51.0 | - | √ | √ (quick quantization) | - | - | - | - | - | - |
| Qwen3-MOE series | Qwen3-Coder-480B-A35B | transformers==4.51.0 | - | - | √ (quick quantization) | - | - | - | - | - | - |
| Qwen3.5 series | Qwen3.5-397B-A17B | transformers==5.2.0 | - | √ (quick quantization) | √ (quick quantization) | - | - | - | - | - | - |
| Qwen3.5 series | Qwen3.5-122B-A10B | transformers==5.2.0 | - | √ (quick quantization) | - | - | - | - | - | - | - |
| Qwen3.5 series | Qwen3.5-35B-A3B | transformers==5.2.0 | - | √ (quick quantization) | - | - | - | - | - | - | - |
| Qwen3.5 series | Qwen3.5-27B | transformers==5.2.0 | - | √ (quick quantization) | - | - | - | - | - | - | - |
| Qwen3-Next series | Qwen3-Next-80B-A3B-Instruct | transformers>=4.57.0 | - | √ (quick quantization, supported only by vLLM Ascend) | - | - | - | - | - | - | - |
| Qwen2.5 series | Qwen2.5-7B-Instruct | - | - | √ (quick quantization) | - | - | - | - | √ | - | - |
| Qwen2.5 series | Qwen2.5-14B-Instruct | - | - | √ | - | - | - | - | √ | - | - |
| Qwen2.5 series | Qwen2.5-32B-Instruct | - | - | √ (quick quantization) | - | - | - | - | - | - | - |
| Qwen2.5 series | Qwen2.5-72B-Instruct | - | - | - | - | √ | √ (quick quantization) | - | - | - | - |
| Qwen2.5 series | Qwen2.5-Coder-7B-Instruct | - | - | - | - | - | - | - | √ (quick quantization) | - | - |
| Qwen2 series | Qwen2-7B | - | - | √ | - | - | - | - | √ | - | - |
| Qwen2 series | Qwen2-72B | - | √ | √ | - | - | √ | - | √ | - | - |
| Qwen series | Qwen-7B | - | - | √ | - | - | - | - | - | - | - |
| Qwen series | Qwen-14B | - | - | √ | - | - | - | - | - | - | - |
| Qwen series | Qwen-72B | - | √ | - | - | - | - | - | - | - | - |
| Qwen series | Qwen1.5-14B | - | - | √ | - | - | - | - | √ | - | - |
| Qwen series | Qwen1.5-32B | - | - | √ | - | - | - | - | - | - | - |
| Qwen series | Qwen1.5-72B | - | √ | - | - | - | - | - | - | - | - |
| Qwen series | Qwen1.5-110B | - | √ | - | - | - | - | - | - | - | - |
| QwQ series | QwQ-32B | - | - | √ (quick quantization) | - | - | - | - | √ (quick quantization) | - | - |
| GLM5-MOE series | GLM-5 | transformers==5.2.0 | - | √ | √ | - | - | - | - | - | - |
| GLM series | GLM-4-9B | - | - | √ | - | - | √ | - | √ | - | - |
| GLM4-MOE series | GLM-4.7 | transformers==4.57.3 | - | √ (quick quantization, supported only by vLLM Ascend) | - | - | - | - | - | - | - |
| HunYuan series | Hunyuan-A52B-Instruct | transformers>=4.48.2 | - | √ | - | - | - | - | - | - | - |
| InternLM series | InternLM2-20B | - | √ | √ | - | - | √ | - | - | - | - |
| Llama series | LLaMA-33B | - | - | - | - | - | - | - | √ | - | - |
| Llama series | LLaMA-65B | - | √ | - | - | - | - | - | - | - | - |
| Llama series | LLaMA2-13B | - | - | √ | - | - | - | - | √ | - | - |
| Llama series | LLaMA2-7B | - | - | √ | - | - | - | - | √ | - | - |
| Llama series | LLaMA2-70B | - | √ | √ | - | - | - | - | - | - | - |
| Llama series | LLaMA3-70B | - | √ | - | - | - | - | - | - | - | - |
| Llama series | LLaMA3.1-8B | - | - | √ | - | - | - | - | - | - | - |
| Llama series | LLaMA3.1-70B | - | - | √ | - | - | √ | - | √ | - | - |
Notes:
- <sup>1</sup> The w8a16 quantization mode is supported only by MindIE.
- <sup>2</sup> Both KVCache and FA3 quantization are categorized under c8, as both methods quantize the K and V caches within the LLM. Only MindIE supports c8 quantization modes, such as w8a8c8 and w4a8c8.
- <sup>3</sup> For optimal performance, use the decompression features of the Atlas 300I Duo products after compression. Only MindIE supports sparse quantization modes, including w8a8s and w16a16s.
- <sup>4</sup> Only MindIE supports best practices that employ the PDMIX quantization scheme.
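Notes 1 to 3 can be read as a rule over the quantization mode name. The sketch below encodes them for quick checks when picking a mode for a given inference framework. The function name `mindie_only_reasons` is hypothetical. Restrictions stated in individual table cells (note 4 on PDMIX, and "supported only by vLLM Ascend") are not derivable from the mode name and are deliberately not covered.

```python
import re
from typing import List

# Mode names as written in the tables: w{weight}a{activation}[c{cache}][s]
_MODE = re.compile(r"w(\d+)a(\d+)(?:c(\d+))?(s)?", re.IGNORECASE)


def mindie_only_reasons(mode: str) -> List[str]:
    """Return the notes (1 to 3) that restrict this mode to MindIE, if any."""
    match = _MODE.fullmatch(mode.strip())
    if match is None:
        raise ValueError(f"unrecognized quantization mode name: {mode!r}")
    weight, activation, cache, sparse = match.groups()

    reasons = []
    if (weight, activation) == ("8", "16") and cache is None and sparse is None:
        reasons.append("note 1: w8a16 is supported only by MindIE")
    if cache == "8":
        reasons.append("note 2: c8 modes are supported only by MindIE")
    if sparse is not None:
        reasons.append("note 3: sparse modes are supported only by MindIE")
    return reasons


for name in ("w8a8", "w8a16", "w4a8c8", "w16a16s"):
    print(name, mindie_only_reasons(name))
```

An empty list means none of the three notes applies. It does not by itself guarantee support in other frameworks, so check the table cell and the best practices page as well.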
Supported MLLMs
| Model Type | Model Name | Dependency Library | w8a8 | w8a8c8/w8a8f8 | w8a8s (Sparse Quantization)<sup>1</sup> |
|---|---|---|---|---|---|
| Qwen3-VL series | Qwen3-VL-4B-Instruct | transformers==4.57.1 | √ | - | - |
| Qwen3-VL series | Qwen3-VL-8B-Instruct | transformers==4.57.1 | - | - | √ |
| Qwen3-VL series | Qwen3-VL-32B-Instruct | transformers==4.57.1 | √ | - | - |
| Qwen3-VL-MoE series | Qwen3-VL-235B-A22B | transformers==4.57.1, flax | √ (quick quantization) | - | - |
| Qwen3-Omni series | Qwen3-Omni-30B-A3B-Thinking | transformers==4.57.3 | √ | - | - |
| Qwen3-Omni series | Qwen3-Omni-30B-A3B-Instruct | transformers==4.57.3 | √ | - | - |
| Qwen2.5-VL series | Qwen2.5-VL-7B | transformers==4.49.0, qwen_vl_utils | √ | - | - |
| Qwen2.5-VL series | Qwen2.5-VL-72B | transformers==4.49.0, qwen_vl_utils | √ | - | - |
| Qwen2.5-Omni series | Qwen2.5-Omni-7B | transformers==4.57.3 | √ | - | - |
| Qwen2-VL series | Qwen2-VL-7B | transformers==4.46.0, qwen_vl_utils | √ | - | - |
| Qwen2-VL series | Qwen2-VL-72B | transformers==4.46.0, qwen_vl_utils | √ | - | - |
| Qwen-VL series | Qwen-VL | transformers-stream-generator | √ | - | - |
| InternVL2 series | InternVL2-8B | transformers==4.46.0, timm, fastchat | √ | - | - |
| InternVL2 series | InternVL2-40B | transformers==4.46.0, timm, fastchat | √ | - | - |
| LLaVA series | LLaVA-1.5-7B | transformers==4.37.2 | √ | - | - |
| GLM-4.1V series | GLM-4.1V-9B-Thinking | transformers==4.53.0 | - | - | √ |
| GLM-4.6V | GLM-4.6V | transformers==5.0.0rc0 | √ | - | - |
| Multimodal generative models | SD3-Medium | diffusers | √ | - | - |
| Multimodal generative models | Open-Sora-Plan v1.2 | huggingface_hub==0.25.2 | √ | - | - |
| Multimodal generative models | FLUX.1-dev | - | √ | √ (quick quantization) | - |
| Multimodal generative models | HunyuanVideo | - | √ | √ (quick quantization) | - |
| Multimodal generative models | Wan2.1 | - | √ (quick quantization) | - | - |
| Multimodal generative models | Wan2.2 | - | - | √ (quick quantization) | - |
| Multimodal generative models | Qwen-Image-Edit | - | - | √ (quick quantization) | - |
Notes:
- <sup>1</sup> For optimal performance, use the decompression features of the Atlas 300I Duo series after compression. Only MindIE supports sparse quantization modes.
- <sup>2</sup> FLUX.1-dev, HunyuanVideo, Wan2.2, and Qwen-Image-Edit-2509 support MXFP quantization.
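Before quantizing, it can help to check that the environment matches the Dependency Library column. The following is a rough sketch using only the Python standard library. The helper names are hypothetical. It verifies `==` pins exactly and, for `>=` entries or unpinned names, only confirms that the package is installed. It also assumes that each name in the table is the installed distribution name, which may not hold for every entry.

```python
from importlib import metadata


def parse_requirements(cell):
    """Split a Dependency Library cell into (name, operator, version) tuples."""
    requirements = []
    for item in cell.split(","):
        item = item.strip()
        if not item or item == "-":
            continue  # "-" means no extra dependency is listed
        for operator in ("==", ">="):
            if operator in item:
                name, version = item.split(operator, 1)
                requirements.append((name.strip(), operator, version.strip()))
                break
        else:
            requirements.append((item, None, None))
    return requirements


def check_dependencies(cell, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    for name, operator, wanted in parse_requirements(cell):
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Only exact pins are compared; ">=" entries are checked for presence only.
        if operator == "==" and installed != wanted:
            problems.append(f"{name}: installed {installed}, table lists {wanted}")
    return problems


# Example with the cell listed for Qwen3-VL-235B-A22B; the result depends on
# the local environment.
print(check_dependencies("transformers==4.57.1, flax"))
```

The `get_version` parameter exists so that the lookup can be replaced in tests or pointed at a different environment.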