Quantization Algorithm Overview
msModelSlim supports a range of quantization algorithms, from outlier suppression to low-bit optimization. The following tables summarize the currently supported core algorithms and their primary characteristics by category.
Outlier Suppression Algorithms
Outlier suppression algorithms aim to smooth the distribution of activation values and reduce the accuracy drop caused by quantization.
| Algorithm | Core Idea | Application Scenario | Description |
|---|---|---|---|
| QuaRot | Uses an orthogonal rotation matrix to smooth the distribution of activation values. | Scenarios requiring activation outlier suppression to improve model accuracy | View details |
| Adapt Rotation | Implements calibration-data-based iterative optimization of the Hadamard rotation matrix on top of QuaRot. | Scenarios where the rotation matrix must be optimized to further improve low-bit quantization accuracy | View details |
| SmoothQuant | Applies per-channel scaling that shifts quantization difficulty from activations to weights, mitigating the impact of activation outliers. | Scenarios requiring activation outlier suppression | View details |
| Iterative Smooth | Applies iterative smooth scaling for precise distribution adjustments. | Scenarios requiring accuracy optimization involving complex distributions | View details |
| Flex Smooth Quant | Automatically searches for the optimal alpha and beta values using a two-phase grid search. | Scenarios requiring flexible adaptation to different architectures | View details |
| Flex AWQ SSZ | Combines AWQ and SSZ to evaluate errors using real quantizers. | Scenarios requiring automated searches for fine-grained smoothing parameters | View details |
| KV Smooth | Implements a smooth suppression algorithm tailored for the KV cache. | Scenarios requiring reduced memory usage by the KV cache | View details |
| AWQ | Performs a grid search for the optimal scaling factor based on the statistical characteristics of activation values. | Scenarios requiring automated searches for fine-grained smoothing parameters | View details |
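
The smoothing idea shared by several algorithms in this table can be illustrated with a minimal sketch of SmoothQuant-style per-channel scaling. It is written in plain Python and is not the msModelSlim implementation: the function names, the `alpha` default of 0.5, and the toy data are assumptions made for illustration. Each input channel `j` gets a factor `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`; activations are divided by it and the matching weight rows are multiplied by it, so the layer output is unchanged while the activation peak shrinks.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth_scales(x, w, alpha=0.5, eps=1e-8):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    x has shape (tokens, in_channels); w has shape (in_channels, out_channels).
    """
    scales = []
    for j, w_row in enumerate(w):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w_row)
        # eps guards against all-zero channels (an illustrative safeguard).
        scales.append(max(act_max, eps) ** alpha / max(w_max, eps) ** (1 - alpha))
    return scales


def apply_smoothing(x, w, scales):
    """Divide activation columns by s and multiply the matching weight rows by s."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in w_row] for w_row, s in zip(w, scales)]
    return x_s, w_s


# Toy calibration batch: channel 1 carries the activation outliers.
x = [[1.0, 40.0], [-2.0, -60.0]]
w = [[0.5, -1.0], [0.25, 0.5]]

scales = smooth_scales(x, w)
x_s, w_s = apply_smoothing(x, w, scales)

reference = matmul(x, w)      # output of the original layer
smoothed = matmul(x_s, w_s)   # output after smoothing, mathematically identical
peak_before = max(abs(v) for row in x for v in row)
peak_after = max(abs(v) for row in x_s for v in row)
print(f"activation peak: {peak_before:.2f} -> {peak_after:.2f}")
```

Raising `alpha` moves more of the quantization difficulty onto the weights. Search-based variants such as Flex Smooth Quant tune this kind of parameter automatically rather than fixing it by hand.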
Quantization Algorithms
The following table summarizes the supported weight quantization, activation quantization, and structure-specific quantization solutions.
| Algorithm | Type | Core Idea | Application Scenario | Description |
|---|---|---|---|---|
| AutoRound | Weight quantization optimization | Optimizes the rounding offset based on SignSGD to reduce reconstruction error. | Scenarios requiring ultra-low-bit quantization such as 4-bit | View details |
| FA3 Quant | Activation quantization | Performs per-head INT8 quantization for attention activations. | Scenarios involving long sequences or the MLA architecture | View details |
| GPTQ | Weight quantization optimization | Minimizes quantization error through column-wise optimization and error compensation. | Scenarios requiring high-precision weight quantization | View details |
| KVCache Quant | KV cache quantization | Provides a quantization solution designed for the KV cache. | Scenarios requiring improved long-sequence inference efficiency | View details |
| Linear Quant | Basic quantization | Performs weight quantization and activation quantization on linear layers. | Basic quantization scenarios | View details |
| PDMIX | Phase-wise mixed quantization | Uses dynamic quantization for prefilling and static quantization for decoding. | Large-model inference acceleration scenarios that must balance accuracy and performance | View details |
| Histogram | Activation quantization | Analyzes the histogram distribution to search for the optimal clipping interval. | Scenarios requiring outlier filtering to improve accuracy | View details |
| MinMax | Basic quantization | Determines the quantization range based on the maximum and minimum statistics. | Basic quantization scenarios requiring low computational overhead | View details |
| SSZ | Weight quantization | Iteratively searches for the optimal scaling factor and offset. | Scenarios requiring accuracy optimization for uneven weight distributions | View details |
| LAOS | Low-bit quantization | Optimizes quantization for ultra-low-bit scenarios such as W4A4. | Scenarios requiring extreme compression | View details |
| Float Sparse | Sparsification | Implements model floating-point sparsification based on the ADMM algorithm. | Scenarios requiring high compression ratios | View details |
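
As a point of reference for the basic entries in this table, the following is a minimal sketch of MinMax-style asymmetric quantization in plain Python. It is not the msModelSlim API: `minmax_quantize`, `dequantize`, and the INT8 default are hypothetical names and choices used only to show how a quantization range derived from minimum and maximum statistics becomes a scale and a zero point.

```python
def minmax_quantize(values, num_bits=8):
    """Quantize floats to signed integers using the observed min/max range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Extend the range to include zero so that 0.0 maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to a scale of 1.0 when every value is zero.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floating-point values."""
    return [(q - zero_point) * scale for q in quantized]


values = [-1.0, 0.0, 0.5, 2.0]
quantized, scale, zero_point = minmax_quantize(values)
restored = dequantize(quantized, scale, zero_point)
errors = [abs(v - r) for v, r in zip(values, restored)]
print(f"scale={scale:.6f}, zero_point={zero_point}, max error={max(errors):.6f}")
```

Because the range is taken directly from the extremes, a single outlier widens the scale for every value. Clipping-based approaches such as Histogram address this by searching for a tighter interval.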
Automatic Tuning Strategies
The following table summarizes the strategies that enable automated searches for the optimal quantization configuration.
| Algorithm | Core Idea | Application Scenario | Description |
|---|---|---|---|
| Standing High | Combines outlier suppression strategies with a binary search approach to minimize the number of fallback layers while meeting accuracy requirements. | Scenarios requiring fine-grained control over templates and strategies alongside a complete quantization configuration | View details |
| Standing High With Experience | Automatically generates a quantization configuration based on expert knowledge, requiring only the quantization type and structural layout. | Scenarios where users are familiar with the model architecture and prefer not to provide a complete quantization configuration | View details |
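
The binary search behind a Standing High-style strategy can be sketched as follows. This is an illustrative assumption, not the msModelSlim implementation: it presumes that layers are already ranked from most to least quantization-sensitive, that accuracy does not decrease as more layers fall back to higher precision, and that `evaluate` is a hypothetical user-supplied callback returning an accuracy score for a given set of fallback layers.

```python
def min_fallback_layers(ranked_layers, evaluate, target):
    """Return the shortest prefix of ranked_layers whose fallback meets target.

    Returns None when even falling back every layer misses the target.
    """
    lo, hi = 0, len(ranked_layers)
    if evaluate(ranked_layers[:hi]) < target:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(ranked_layers[:mid]) >= target:
            hi = mid          # target met: try fewer fallback layers
        else:
            lo = mid + 1      # target missed: more layers must fall back
    return ranked_layers[:lo]


# Toy accuracy model in integer percentage points (hypothetical numbers).
BASE_ACCURACY = 70
GAIN = {"layer3": 10, "layer7": 5, "layer1": 2, "layer5": 1}


def toy_evaluate(fallback):
    """Pretend evaluation: each fallback layer recovers a fixed amount."""
    return BASE_ACCURACY + sum(GAIN[name] for name in fallback)


ranked = ["layer3", "layer7", "layer1", "layer5"]
chosen = min_fallback_layers(ranked, toy_evaluate, target=85)
print(chosen)
```

Each probe of `evaluate` stands for a full quantize-and-evaluate run, so the binary search needs only about log2(N) such runs instead of N.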
Algorithm Selection Suggestions
- Beginners: Use Quick Quantization (V1), which integrates a suitable algorithm combination.
- For maximum model accuracy: Use QuaRot and AutoRound in combination.
- For long-sequence inference: Enable FA3 Quant and KVCache Quant.