

Quantization Algorithm Overview

msModelSlim supports multiple advanced quantization algorithms, covering all aspects from outlier suppression to low-bit optimization. The following tables summarize the currently supported core algorithms and their primary characteristics by category.

Outlier Suppression Algorithms

Outlier suppression algorithms aim to smooth the distribution of activation values and reduce the accuracy drop caused by quantization.

Algorithm	Core Idea	Application Scenario	Description
QuaRot	Uses an orthogonal rotation matrix to smooth the distribution of activation values.	Scenarios requiring activation outlier suppression to improve model accuracy	View details
Adapt Rotation	Implements calibration-data-based iterative optimization of the Hadamard rotation matrix on top of QuaRot.	Scenarios where the rotation matrix must be optimized to further improve low-bit quantization accuracy	View details
SmoothQuant	Smooths activation and weight scaling to mitigate the impact of outliers.	Scenarios requiring activation outlier suppression	View details
Iterative Smooth	Applies iterative smooth scaling for precise distribution adjustments.	Scenarios requiring accuracy optimization involving complex distributions	View details
Flex Smooth Quant	Automatically searches for the optimal alpha and beta values across a two-phase grid search.	Scenarios requiring flexible adaptation to different architectures	View details
Flex AWQ SSZ	Combines AWQ and SSZ to evaluate errors using real quantizers.	Scenarios requiring automated searches for fine-grained smoothing parameters	View details
KV Smooth	Implements a smooth suppression algorithm tailored for the KV cache.	Scenarios requiring the reduction of device memory usage by the KV cache	View details
AWQ	Performs a grid search for the optimal scaling factor based on the statistical characteristics of activation values.	Scenarios requiring automated searches for fine-grained smoothing parameters	View details
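
To make the smoothing idea in the table concrete, the following is a minimal sketch of SmoothQuant-style scaling as described in the published method, not msModelSlim's implementation. The function names (`smooth_scales`, `apply_smoothing`, `matmul`), the toy data, and the default `alpha=0.5` are assumptions chosen for illustration. Each input channel gets a scale `s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)`; activations are divided by it and the matching weight rows are multiplied by it, so the layer output stays the same while activation outliers shrink.

```python
def smooth_scales(acts, weights, alpha=0.5, eps=1e-8):
    """Per-input-channel scales: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    scales = []
    for j in range(len(weights)):
        act_max = max(max(abs(row[j]) for row in acts), eps)  # eps avoids a zero scale
        w_max = max(max(abs(w) for w in weights[j]), eps)
        scales.append((act_max ** alpha) / (w_max ** (1.0 - alpha)))
    return scales


def apply_smoothing(acts, weights, scales):
    """Divide activation columns by s_j and multiply weight rows by s_j."""
    smoothed_acts = [[x / s for x, s in zip(row, scales)] for row in acts]
    smoothed_weights = [[w * s for w in row] for row, s in zip(weights, scales)]
    return smoothed_acts, smoothed_weights


def matmul(a, b):
    """Plain-Python matrix product, enough for this toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy data (hypothetical): input channel 1 carries activation outliers.
acts = [[0.5, 40.0], [-0.25, -32.0]]   # shape: tokens x in_channels
weights = [[0.2, -0.4], [0.1, 0.3]]    # shape: in_channels x out_channels

scales = smooth_scales(acts, weights)
s_acts, s_weights = apply_smoothing(acts, weights, scales)

original_out = matmul(acts, weights)
smoothed_out = matmul(s_acts, s_weights)
```

Because the scaling cancels inside the matrix product, `original_out` and `smoothed_out` agree up to floating-point error, while the largest activation magnitude in channel 1 drops well below its original value of 40. A larger `alpha` shifts more of the quantization difficulty from activations to weights.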

Quantization Algorithms

The following table summarizes the supported weight quantization, activation quantization, and structure-specific quantization solutions.

Algorithm	Type	Core Idea	Application Scenario	Description
AutoRound	Weight quantization optimization	Optimizes the rounding offset based on SignSGD to reduce reconstruction error.	Scenarios requiring ultra-low-bit quantization such as 4-bit	View details
FA3 Quant	Activation quantization	Performs per-head INT8 quantization for attention activations.	Scenarios involving long sequences or the MLA architecture	View details
GPTQ	Weight quantization optimization	Minimizes quantization error through column-wise optimization and error compensation.	Scenarios requiring high-precision weight quantization	View details
KVCache Quant	KV cache quantization	Provides a quantization solution designed for the KV cache.	Scenarios requiring improved long-sequence inference efficiency	View details
Linear Quant	Basic quantization	Performs weight quantization and activation quantization on linear layers.	Basic quantization scenarios	View details
PDMIX	Phase-wise mixed quantization	Uses dynamic quantization for prefilling and static quantization for decoding.	Scenarios requiring large model inference acceleration to balance accuracy and performance	View details
Histogram	Activation quantization	Analyzes the histogram distribution to search for the optimal clipping interval.	Scenarios requiring outlier filtering to improve accuracy	View details
MinMax	Basic quantization	Determines the quantization range based on the maximum and minimum statistics.	Basic quantization scenarios requiring low computational overhead	View details
SSZ	Weight quantization	Iteratively searches for the optimal scaling factor and offset.	Scenarios requiring accuracy optimization for uneven weight distributions	View details
LAOS	Low-bit quantization	Optimizes ultra-low-bit scenarios such as W4A4.	Scenarios requiring extreme compression	View details
Float Sparse	Sparsification	Implements model floating-point sparsification based on the ADMM algorithm.	Scenarios requiring high compression ratios	View details
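
As an illustration of the MinMax row above, the sketch below derives asymmetric integer quantization parameters from the minimum and maximum of a tensor and then quantizes and dequantizes it. This is a generic textbook formulation under stated assumptions, not msModelSlim's code: the function names, the signed 8-bit default, and the choice to widen the range so that it always contains zero are all illustrative.

```python
def minmax_params(values, num_bits=8):
    """Derive scale and zero point from the observed min/max range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers and clamp them to the representable range."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


values = [-1.0, -0.2, 0.0, 0.7, 3.0]  # hypothetical weights or activations
scale, zero_point, qmin, qmax = minmax_params(values)
q_values = quantize(values, scale, zero_point, qmin, qmax)
restored = dequantize(q_values, scale, zero_point)
```

For values inside the observed range, the reconstruction error of each element is at most half a quantization step (`scale / 2`). A single extreme value stretches the range and coarsens every step, which is the weakness that clipping-based approaches such as the Histogram method are meant to address.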

Automatic Tuning Strategies

The following table summarizes the strategies that enable automated searches for the optimal quantization configuration.

Algorithm	Core Idea	Application Scenario	Description
Standing High	Combines outlier suppression strategies with a binary search approach to minimize the number of fallback layers while meeting accuracy requirements.	Scenarios requiring fine-grained control over templates and strategies alongside a complete quantization configuration	View details
Standing High With Experience	Automatically generates a quantization configuration based on expert knowledge, requiring only the quantization type and structural layout.	Scenarios where users are familiar with the model architecture and prefer not to provide a complete quantization configuration	View details
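
The binary search mentioned for Standing High can be pictured with the following minimal sketch. It is a simplified illustration, not the msModelSlim implementation: it assumes the layers are already ranked from most to least quantization-sensitive, that accuracy does not decrease as more layers fall back to higher precision, and that a hypothetical `evaluate` callback returns the accuracy for a given set of fallback layers. The layer names and accuracy numbers are invented toy values.

```python
def min_fallback_layers(ranked_layers, evaluate, target):
    """Find the shortest prefix of ranked_layers whose fallback meets the target.

    Assumes evaluate() is non-decreasing in the number of fallback layers.
    Returns None if even a full fallback misses the target.
    """
    if evaluate(ranked_layers) < target:
        return None
    lo, hi = 0, len(ranked_layers)
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(ranked_layers[:mid]) >= target:
            hi = mid          # target met: try fewer fallback layers
        else:
            lo = mid + 1      # target missed: more layers must fall back
    return ranked_layers[:lo]


# Toy accuracy model (hypothetical): each fallback layer recovers a fixed gain.
BASE_ACCURACY = 0.90
GAINS = {"layer.3": 0.04, "layer.0": 0.02, "layer.7": 0.01, "layer.1": 0.005}


def evaluate(fallback_layers):
    return BASE_ACCURACY + sum(GAINS[name] for name in fallback_layers)


ranked = ["layer.3", "layer.0", "layer.7", "layer.1"]  # most sensitive first
chosen = min_fallback_layers(ranked, evaluate, target=0.955)
```

With these toy numbers, falling back one layer gives roughly 0.94 and two layers roughly 0.96, so the search settles on the first two layers. In practice each call to `evaluate` means a full quantize-and-measure run, which is why halving the search space at every step is attractive compared with testing every layer count in turn.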

Algorithm Selection Suggestions

Beginners: Use Quick Quantization (V1), which integrates a suitable algorithm combination.
For maximum model accuracy: Use QuaRot and AutoRound in combination.
For long-sequence inference: Enable FA3 Quant and KVCache Quant.