Quantization
MindIE SD provides two types of quantization capabilities, applied to different parts of the model:
- Linear Quantization: Applies low-bit processing (INT8/FP8/W8A16, etc.) to linear layer weights and activations, reducing model storage and computation overhead.
- FA Quantization: Applies FP8 block quantization to Q/K/V activations in attention computation, reducing memory bandwidth requirements.
The following two sections describe the principles and usage of each quantization type.
Linear Quantization
General Principles
Quantization is the process of mapping model weights and activations from high precision (e.g., FP32) to low precision (e.g., INT8, FP8). Low-precision computation reduces memory usage and bandwidth requirements, improving inference throughput.
Based on whether retraining is needed, quantization is divided into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). This section focuses on PTQ, which falls into three main types:
- Dynamic Quantization: Only weights are quantized offline; activation quantization factors are dynamically computed during inference.
- Static Quantization: Weights are quantized offline, and activation quantization factors are also determined offline rather than computed during inference.
- Time-Aware Quantization: Quantization strategy is dynamically adjusted based on the time dimension.
The following diagram shows an INT8 quantization example that maps FP32 to INT8: [-max(|xf|), max(|xf|)] is the floating-point range before quantization, and [-128, 127] is the integer range after quantization.
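The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric per-tensor INT8 quantization, not the MindIE SD implementation; the helper names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization of a list of floats (illustrative)."""
    max_abs = max(abs(v) for v in values)
    # The scale maps max|x_f| onto 127; guard against an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the INT8 range [-128, 127].
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [qi * scale for qi in q]


x = [-2.54, -0.5, 0.0, 1.0, 2.54]
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
print(q, scale, x_hat)
```

The round-trip error of each element is bounded by half of the scale, which is why a tighter floating-point range gives a finer quantization step.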

Technical Features
This repository handles Linear quantization uniformly through the quantize API, supporting the following algorithms.
Weight Quantization (only weights are quantized; activations remain at original precision):
| Algorithm | Weight Precision | Description |
|---|---|---|
| W8A16 | INT8 | Basic weight quantization |
| W4A16 | INT4 | Higher compression ratio |
| W4A16_AWQ | INT4 + AWQ | Activation-aware weight quantization |
| W8A16_GPTQ | INT8 + GPTQ | GPTQ post-training weight quantization |
| W4A16_GPTQ | INT4 + GPTQ | Same as above, INT4 version |
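
As a rough illustration of weight-only quantization, the sketch below stores weights as INT8 codes and dequantizes them on the fly while the activation stays in floating point. It assumes one scale per output row of the weight matrix, which is an assumption made for this example and not a statement about how W8A16 is implemented in this repository; the function names are hypothetical.

```python
def quantize_weight_rows(weight):
    """Quantize each output row of a weight matrix to INT8 with its own scale."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-128, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def linear_weight_only(x, q_rows, scales):
    """Compute y = W x with INT8 weights dequantized on the fly; x stays in float."""
    return [scale * sum(xi * qi for xi, qi in zip(x, q_row))
            for q_row, scale in zip(q_rows, scales)]


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
x = [1.0, 2.0, 3.0]

q_rows, scales = quantize_weight_rows(weight)
y_quant = linear_weight_only(x, q_rows, scales)
# Floating-point reference for comparison.
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
print(y_quant, y_ref)
```

Only the stored weights shrink in this scheme; the arithmetic still runs at activation precision, which is the difference from the weight-activation algorithms.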
Weight-Activation Quantization (both weights and activations are quantized; computation done in low precision):
| Algorithm | Quantization Granularity | Description |
|---|---|---|
| W8A8 | Per-layer | Basic INT8 weight-activation quantization |
| W8A8_TIMESTEP | Per-layer + timestep | Dynamically switch quantization strategy during inference |
| W8A8_DYNAMIC | Per-layer | Dynamic activation quantization |
| W8A8_PER_CHANNEL | Per-channel | Channel-granularity quantization |
| W8A8_PER_TENSOR | Per-tensor | Tensor-granularity quantization |
| W8A8_MXFP8 | Per-layer | MXFP8 format quantization |
| W4A4_DYNAMIC | Per-token + per-channel | INT4 weight-activation quantization |
| W4A4_MXFP4_SVD | Per-layer | MXFP4 format quantization |
| W4A4_MXFP4_DUALSCALE | Per-layer | MXFP4 dual-scale quantization |
| W4A4_MXFP4_DYNAMIC | Per-token + per-channel | MXFP4 dynamic quantization |
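
To make the "per-token + per-channel" granularity concrete, here is a minimal sketch of a weight-activation quantized linear layer. Weight scales are computed once per output channel, activation scales are computed per token at call time, and the inner product runs on integer codes before being rescaled. The helper names are hypothetical, and the INT8 range is used only for illustration.

```python
def quantize_symmetric(values, qmax=127):
    """Return integer codes and the scale for one vector (symmetric quantization)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def linear_w8a8(tokens, weight):
    """Quantized linear layer: per-channel weight scales, per-token activation scales."""
    # Offline step: one scale per output channel (row of the weight matrix).
    q_weight = [quantize_symmetric(row) for row in weight]
    outputs = []
    for token in tokens:
        # Online step: one scale per token, computed from the live activation.
        q_x, s_x = quantize_symmetric(token)
        outputs.append([s_x * s_w * sum(a * b for a, b in zip(q_x, q_row))
                        for q_row, s_w in q_weight])
    return outputs


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
tokens = [[1.0, 2.0, 3.0], [0.1, -0.2, 0.05]]

y_quant = linear_w8a8(tokens, weight)
y_ref = [[sum(t * w for t, w in zip(token, row)) for row in weight] for token in tokens]
print(y_quant, y_ref)
```

Because the second token has a much smaller magnitude than the first, a per-token scale keeps its quantization step proportionally small, which a single shared scale would not.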
API and Usage
All Linear quantization algorithms are triggered uniformly through the quantize API.
from mindiesd import quantize
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | nn.Module | Yes | - | Initialized floating-point model |
| quant_json_path | str | Yes | - | Path to the quantization descriptor JSON, which contains the algorithm, layer configuration, etc. |
Usage Examples
Basic quantization:
model = from_pretrain()
model = quantize(model, "quant_model_description_w8a16_0.json")
model.to("npu")
Timestep quantization:
from mindiesd import TimestepManager
model = quantize(model, "quant_model_description_w8a8_timestep_0.json",
                 timestep_policy=TimestepPolicyConfig(...))
for i, t in enumerate(timesteps):
    TimestepManager.set_timestep_idx(i)
    ...
Quantized Weight File Naming
Quantized weights and descriptor files are exported by the msmodelslim tool with the following naming convention:
- Weight file: quant_model_weight_{quant_algo.lower()}_{rank}.safetensors
- Descriptor file: quant_model_description_{quant_algo.lower()}_{rank}.json
For single-card quantization, rank is 0. For multi-card parallelism, each card's files carry that card's own rank number.
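The naming convention can be expressed as a small helper. This is only a sketch for building the expected file names; `quant_file_names` is a hypothetical function and is not part of mindiesd or msmodelslim.

```python
def quant_file_names(quant_algo: str, rank: int = 0) -> tuple[str, str]:
    """Build the weight and descriptor file names for one rank."""
    algo = quant_algo.lower()
    weight_file = f"quant_model_weight_{algo}_{rank}.safetensors"
    descriptor_file = f"quant_model_description_{algo}_{rank}.json"
    return weight_file, descriptor_file


# Single-card case: rank defaults to 0.
print(quant_file_names("W8A16"))
# Multi-card case: one pair of files per rank.
for rank in range(2):
    print(quant_file_names("W8A8_TIMESTEP", rank))
```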
FA Quantization
General Principles
FA (Flash Attention) quantization applies low-bit processing to the Q/K/V activations in attention computation. Quantizing Q/K/V to FP8 before they enter the attention kernels reduces memory bandwidth requirements and improves inference throughput. Unlike weight quantization, FA quantization operates on activations that are generated dynamically during inference, so it relies on block-level dynamic quantization to balance accuracy and acceleration.
Technical Features
This repository provides FA quantization via the FP8_DYNAMIC algorithm, with a three-step processing flow:
1. Rotate: Apply pre-trained rotation matrices (q_rot, k_rot) to Q and K, dispersing outliers across dimensions to mitigate FP8 quantization sensitivity to outliers.
2. Block Quantization: Dynamically quantize the rotated Q/K/V into FP8 (float8_e4m3fn) block by block. Q uses a block size of 128, and K/V use a block size of 256. This step is performed via the npu_dynamic_block_quant operator.
3. FP8 Attention: Invoke the Ascend npu_fused_infer_attention_score_v2 kernel to perform attention computation in the FP8 domain, with outputs dequantized back to the original precision.
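The first two steps can be illustrated with a small plain-Python sketch. It makes several assumptions that are not taken from this repository: a single 4x4 orthogonal Hadamard matrix stands in for both q_rot and k_rot (the real matrices are pre-trained and exported by msmodelslim), blocks are taken over consecutive rows, and the actual cast to float8_e4m3fn is omitted so that only the per-block scale is computed. All function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def rotate(x, rot):
    """Multiply the row vector x by the matrix rot (x @ rot)."""
    return [sum(x[i] * rot[i][j] for i in range(len(x))) for j in range(len(rot[0]))]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def block_scales(rows, block_size):
    """One scale per block of consecutive rows, mapping each block's max onto FP8 max."""
    scales = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        amax = max(abs(v) for row in block for v in row)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


# A 4x4 Hadamard matrix scaled by 1/2 is orthogonal.
ROT = [[0.5 * s for s in row] for row in
       [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]

q = [8.0, 0.0, 0.0, 0.0]  # a single large outlier
k = [1.0, 2.0, 3.0, 4.0]

q_r = rotate(q, ROT)  # the outlier is spread evenly over all dimensions
k_r = rotate(k, ROT)

print(q_r, dot(q, k), dot(q_r, k_r))
print(block_scales([[448.0, 1.0], [2.0, 3.0], [4.0, -896.0]], block_size=2))
```

Because the rotation is orthogonal and applied to both Q and K, the attention score q·k is unchanged, while the peak magnitude that the FP8 scale has to cover is halved in this example.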
API Reference
FA quantization is triggered uniformly through the quantize API; no separate FA quantization API call is needed.
from mindiesd import quantize
Usage Example
from mindiesd import quantize
# Load original floating-point model
model = from_pretrain()
# Execute quantization conversion (automatically identifies Attention layers and injects FA quantization)
model = quantize(model, "path/to/exported/quantization/config")
# Move model to NPU and run inference
model.to("npu")
Internally, quantize traverses the model layers and calls add_fa_quant on each matching Attention layer. This injects an FP8RotateQuantFA module, which replaces the layer's forward computation with the rotate -> block quantize -> FP8 Attention flow described under Technical Features.
Notes
- Hardware requirement: Only Atlas 800I A2 inference servers support this feature.
- Q/K/V input layout supports both BNSD and BSND.
- FA quantization weights (q_rot, k_rot) must be pre-exported using the msmodelslim model compression tool. See the msmodelslim tool documentation for details.