Quantization

MindIE SD provides two types of quantization capabilities, applied to different parts of the model:

Linear Quantization: Applies low-bit quantization (INT8, INT4, FP8, etc.) to the weights of linear layers and, depending on the algorithm, to their activations as well, reducing model storage and computation overhead.
FA Quantization: Applies FP8 block quantization to Q/K/V activations in attention computation, reducing memory bandwidth requirements.

The following two sections describe the principles and usage of each quantization type.

Linear Quantization

General Principles

Quantization is the process of mapping model weights and activations from high precision (e.g., FP32) to low precision (e.g., INT8, FP8). Low-precision computation reduces memory usage and bandwidth requirements, improving inference throughput.

Depending on whether retraining is required, quantization is divided into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). This section focuses on PTQ, which falls into three main types:

Dynamic Quantization: Only weights are quantized offline; activation quantization factors are dynamically computed during inference.
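Static Quantization: Weights are quantized offline, and the activation quantization factors are also determined offline, so they stay fixed during inference.
Time-Aware Quantization: The quantization strategy is adjusted dynamically along the time dimension (for example, per inference timestep).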
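
The difference between static and dynamic activation quantization can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names, the max-abs scale rule, and the sample data are hypothetical and are not taken from MindIE SD.

```python
def int8_scale(values):
    """Symmetric INT8 scale from the largest magnitude (illustrative rule)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize_with_scale(values, scale):
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


# Static: the activation scale is fixed before inference (here from sample data).
calibration_batches = [[0.1, -0.8, 0.5], [0.9, -0.3, 0.2]]
static_scale = int8_scale([v for batch in calibration_batches for v in batch])

# Dynamic: the activation scale is recomputed from each runtime input.
runtime_input = [0.2, -3.0, 1.2]  # wider range than the calibration data
dynamic_scale = int8_scale(runtime_input)

static_q = quantize_with_scale(runtime_input, static_scale)    # out-of-range values saturate
dynamic_q = quantize_with_scale(runtime_input, dynamic_scale)  # range follows the input
```

In this toy case the static scale saturates the two large inputs, while the dynamic scale adapts to them at the cost of computing a scale during inference.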

The following diagram shows an INT8 quantization example that maps FP32 to INT8: [-max(|xf|), max(|xf|)] is the floating-point range before quantization, and [-128, 127] is the integer range after quantization.
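
The FP32-to-INT8 mapping described above can be written out as a minimal sketch. It assumes symmetric quantization with a single max-abs scale; the function names are hypothetical and the actual MindIE SD kernels may compute the scale differently.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization of a list of floats (illustrative)."""
    max_abs = max(abs(v) for v in values)
    # Map [-max_abs, max_abs] onto the integer range; guard against all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map the integers back to approximate floating-point values."""
    return [qi * scale for qi in q]


x_f = [-2.0, -0.5, 0.0, 0.75, 1.2]
x_q, scale = quantize_int8(x_f)      # x_q == [-127, -32, 0, 48, 76]
x_hat = dequantize_int8(x_q, scale)  # each value is within scale / 2 of the original
```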

Technical Features

This repository handles Linear quantization uniformly through the quantize API, supporting the following algorithms.

Weight Quantization (only weights are quantized; activations remain at original precision):

Algorithm	Weight Precision	Description
W8A16	INT8	Basic weight quantization
W4A16	INT4	Higher compression ratio
W4A16_AWQ	INT4 + AWQ	Activation-aware weight quantization
W8A16_GPTQ	INT8 + GPTQ	GPTQ post-training weight quantization
W4A16_GPTQ	INT4 + GPTQ	Same as above, INT4 version

Weight-Activation Quantization (both weights and activations are quantized; computation done in low precision):

Algorithm	Quantization Granularity	Description
W8A8	Per-layer	Basic INT8 weight-activation quantization
W8A8_TIMESTEP	Per-layer + timestep	Dynamically switch quantization strategy during inference
W8A8_DYNAMIC	Per-layer	Dynamic activation quantization
W8A8_PER_CHANNEL	Per-channel	Channel-granularity quantization
W8A8_PER_TENSOR	Per-tensor	Tensor-granularity quantization
W8A8_MXFP8	Per-layer	MXFP8 format quantization
W4A4_DYNAMIC	Per-token + per-channel	INT4 weight-activation quantization
W4A4_MXFP4_SVD	Per-layer	MXFP4 format quantization
W4A4_MXFP4_DUALSCALE	Per-layer	MXFP4 dual-scale quantization
W4A4_MXFP4_DYNAMIC	Per-token + per-channel	MXFP4 dynamic quantization
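
To make the granularity column more concrete, the sketch below compares one scale for a whole weight tensor with one scale per output channel. It is a plain-Python illustration with hypothetical helper names and INT8 max-abs scaling as an assumption; per-token granularity applies the same idea to each activation row.

```python
def max_abs_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto qmax (illustrative rule)."""
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def fake_quant(values, scale, qmax=127):
    """Quantize and immediately dequantize, to expose the rounding error."""
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def max_error(row, scale):
    return max(abs(a - b) for a, b in zip(row, fake_quant(row, scale)))


weight = [
    [0.010, -0.020, 0.015],  # output channel 0: small values
    [4.000, -2.500, 1.000],  # output channel 1: large values
]

# Per-tensor: a single scale, dominated by the largest channel.
per_tensor_scale = max_abs_scale([w for row in weight for w in row])
# Per-channel: one scale per output channel (row).
per_channel_scales = [max_abs_scale(row) for row in weight]

err_tensor = max_error(weight[0], per_tensor_scale)
err_channel = max_error(weight[0], per_channel_scales[0])
```

With these sample values, the small-magnitude channel loses most of its information under the shared scale and is represented much more accurately with its own scale, which is the usual motivation for finer granularity.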

API and Usage

All Linear quantization algorithms are triggered uniformly through the quantize API.

from mindiesd import quantize

Parameters

Parameter	Type	Required	Default	Description
`model`	`nn.Module`	Yes	-	Initialized floating-point model
`quant_json_path`	`str`	Yes	-	Path to quantization descriptor JSON containing algorithm, layer configuration, etc.

Usage Examples

Basic quantization:

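# Load the original floating-point model (from_pretrain() stands in for your own loading code)
model = from_pretrain()
# Apply the quantization described in the descriptor file
model = quantize(model, "quant_model_description_w8a16_0.json")
model.to("npu")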

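Timestep quantization (pass a timestep_policy to quantize, then report the current timestep index on every iteration of the inference loop):

from mindiesd import TimestepManager

# TimestepPolicyConfig also needs to be imported before use
model = quantize(model, "quant_model_description_w8a8_timestep_0.json",
                 timestep_policy=TimestepPolicyConfig(...))

for i, t in enumerate(timesteps):
    # Tell the quantized layers which timestep is being executed
    TimestepManager.set_timestep_idx(i)
    ...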

Quantized Weight File Naming

Quantized weights and descriptor files are exported by the msmodelslim tool with the following naming convention:

Weight file: quant_model_weight_{quant_algo.lower()}_{rank}.safetensors
Descriptor file: quant_model_description_{quant_algo.lower()}_{rank}.json

For single-card quantization, rank is 0. For multi-card parallelism, the files for each card carry that card's rank number.
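
As a small illustration of this naming convention, the helper below builds both file names from an algorithm name and a rank. The function quant_file_names is hypothetical and is not part of mindiesd or msmodelslim.

```python
def quant_file_names(quant_algo: str, rank: int = 0):
    """Return (weight_file, descriptor_file) following the naming convention above."""
    suffix = f"{quant_algo.lower()}_{rank}"
    return (
        f"quant_model_weight_{suffix}.safetensors",
        f"quant_model_description_{suffix}.json",
    )


weight_file, descriptor_file = quant_file_names("W8A16", rank=0)
# descriptor_file == "quant_model_description_w8a16_0.json"
```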

FA Quantization

General Principles

FA (Flash Attention) quantization applies low-bit processing to Q/K/V activations in attention computation. Quantizing Q/K/V to FP8 before feeding into attention computation kernels significantly reduces memory bandwidth requirements and improves inference throughput. Unlike weight quantization, FA quantization processes dynamically generated activations during inference and requires block-level dynamic quantization strategies to balance accuracy and acceleration.

Technical Features

This repository provides FA quantization via the FP8_DYNAMIC algorithm, with a three-step processing flow:

Rotate

Apply pre-trained rotation matrices (q_rot, k_rot) to Q and K, dispersing outliers across dimensions to mitigate FP8 quantization sensitivity to outliers.
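
The effect of the rotation can be illustrated with a small plain-Python sketch. It rests on assumptions: a normalized Hadamard matrix stands in for the pre-trained q_rot and k_rot, and the same orthogonal matrix is applied to Q and K. The real matrices come from msmodelslim and are not reproduced here.

```python
def hadamard(n):
    """Normalized Hadamard matrix of size n (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = n ** 0.5
    return [[v / norm for v in row] for row in h]


def rotate(vec, rot):
    """Row vector times matrix: vec @ rot."""
    return [sum(vec[i] * rot[i][j] for i in range(len(vec))) for j in range(len(rot[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rot = hadamard(4)            # stand-in for a pre-trained rotation (assumption)
q = [8.0, 0.1, -0.2, 0.1]    # one outlier channel
k = [0.3, -0.4, 0.5, 0.2]

q_r, k_r = rotate(q, rot), rotate(k, rot)

score_before = dot(q, k)     # attention logit before rotation
score_after = dot(q_r, k_r)  # same logit, because the rotation is orthogonal
peak_before = max(abs(v) for v in q)
peak_after = max(abs(v) for v in q_r)
```

Because the rotation is orthogonal, the Q·K score is unchanged, while the single large value in q is spread over all four dimensions, so a quantizer sees a smaller peak.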

Block Quantization

Dynamically quantize rotated Q/K/V into FP8 (float8_e4m3fn) block by block. Q uses a block size of 128, K/V use a block size of 256, performed via the npu_dynamic_block_quant operator.
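
The following plain-Python sketch illustrates block-wise dynamic quantization. It makes several assumptions: FP8 rounding is simulated with ordinary floats, blocks are taken over a flat list with a demo block size of 4, and each block uses a max-abs scale. The actual npu_dynamic_block_quant operator may lay out blocks and compute scales differently.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn


def round_to_e4m3(x):
    """Round a float to the nearest float8_e4m3fn value (simulated)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # Binade exponent; values below 2**-6 are subnormal and share its spacing.
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per binade
    return math.copysign(min(round(mag / step) * step, FP8_E4M3_MAX), x)


def block_quantize(values, block_size):
    """Quantize block by block; returns (fp8_values, per_block_scales)."""
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Dynamic scale: computed from the block itself at inference time.
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in block)
    return quantized, scales


def block_dequantize(quantized, scales, block_size):
    return [v * scales[i // block_size] for i, v in enumerate(quantized)]


values = [0.01, -0.02, 0.03, 0.015, 5.0, -3.0, 1.0, 2.5]
fp8_vals, scales = block_quantize(values, block_size=4)
restored = block_dequantize(fp8_vals, scales, block_size=4)
```

Each block gets its own scale, so the small values in the first block are not crushed by the large values in the second block.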

FP8 Attention

Invoke the Ascend npu_fused_infer_attention_score_v2 kernel to perform attention computation in the FP8 domain, with outputs dequantized back to original precision.

API Reference

FA quantization is triggered uniformly through the quantize API; no separate FA quantization API call is needed.

from mindiesd import quantize

Usage Example

from mindiesd import quantize

# Load original floating-point model
model = from_pretrain()

# Execute quantization conversion (automatically identifies Attention layers and injects FA quantization)
model = quantize(model, "path/to/exported/quantization/config")

# Move model to NPU and run inference
model.to("npu")

Internally, quantize traverses the model layers and calls add_fa_quant on each matching Attention layer. This injects an FP8RotateQuantFA module, which implements the FA quantization layer, and replaces the layer's forward computation with the rotate -> block quantize -> FP8 Attention flow described under Technical Features.

Notes

Hardware requirement: Only Atlas 800I A2 inference servers support this feature.
Input layout: Q/K/V inputs may use either the BNSD or the BSND layout.
FA quantization weights (q_rot, k_rot) must be pre-exported using the msmodelslim model compression tool. See the msmodelslim tool documentation for details.