
toc_depth: 3

Quantization Algorithm Overview

msModelSlim supports multiple advanced quantization algorithms, covering all aspects from outlier suppression to low-bit optimization. The following tables summarize the currently supported core algorithms and their primary characteristics by category.

Outlier Suppression Algorithms

Outlier suppression algorithms aim to smooth the distribution of activation values and reduce the accuracy drop caused by quantization.
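To make the effect of outliers concrete, the sketch below quantizes a toy activation vector to INT8 twice: once directly, and once after multiplying it by a normalized Hadamard matrix, which is one way to build an orthogonal rotation. It is an illustration of the general idea only, not msModelSlim code; every function name here is hypothetical, and the per-tensor symmetric quantizer is an assumption chosen for simplicity. In a linear layer the inverse rotation can be folded into the weights, because (XQ)(QᵀW) = XW.

```python
import math

def hadamard(n):
    """Normalized n x n Hadamard matrix via the Sylvester construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]

def transpose(m):
    return [list(col) for col in zip(*m)]

def vecmat(x, m):
    """Row vector x times matrix m."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]

def fake_quant_int8(x):
    """Symmetric per-tensor INT8 quantization followed by dequantization."""
    scale = max(abs(v) for v in x) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in x]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Toy activation vector (made-up values) with one large outlier channel.
x = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 50.0]
q = hadamard(len(x))

direct = fake_quant_int8(x)                      # quantize as-is
x_rot = vecmat(x, q)                             # rotate, spreading the outlier
restored = vecmat(fake_quant_int8(x_rot), transpose(q))  # quantize, rotate back

print("max |x|     :", max(abs(v) for v in x))
print("max |x @ Q| :", max(abs(v) for v in x_rot))
print("MSE direct  :", mse(x, direct))
print("MSE rotated :", mse(x, restored))
```

The outlier forces a coarse quantization step on the unrotated vector, so the small channels lose most of their resolution. After rotation the energy is spread across all channels, the dynamic range shrinks, and the step size shrinks with it.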

| Algorithm | Core Idea | Application Scenario | Description |
| --- | --- | --- | --- |
| QuaRot | Uses an orthogonal rotation matrix to smooth the distribution of activation values. | Scenarios requiring activation outlier suppression to improve model accuracy | View details |
| Adapt Rotation | Implements calibration-data-based iterative optimization of the Hadamard rotation matrix on top of QuaRot. | Scenarios where the rotation matrix must be optimized to further improve low-bit quantization accuracy | View details |
| SmoothQuant | Smooths activation and weight scaling to mitigate the impact of outliers. | Scenarios requiring activation outlier suppression | View details |
| Iterative Smooth | Applies iterative smooth scaling for precise distribution adjustments. | Scenarios requiring accuracy optimization involving complex distributions | View details |
| Flex Smooth Quant | Automatically searches for the optimal alpha and beta values across a two-phase grid search. | Scenarios requiring flexible adaptation to different architectures | View details |
| Flex AWQ SSZ | Combines AWQ and SSZ to evaluate errors using real quantizers. | Scenarios requiring automated searches for fine-grained smoothing parameters | View details |
| KV Smooth | Implements a smooth suppression algorithm tailored for the KV cache. | Scenarios requiring the reduction of GPU memory usage by the KV cache | View details |
| AWQ | Performs a grid search for the optimal scaling factor based on the statistical characteristics of activation values. | Scenarios requiring automated searches for fine-grained smoothing parameters | View details |
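As an illustration of the scaling idea behind SmoothQuant, the sketch below computes a per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), divides the activations by it, and multiplies the matching weight rows by it. This follows the published SmoothQuant formulation rather than msModelSlim's implementation; the function names, the toy matrices, and the choice alpha = 0.5 are assumptions, and the code assumes every channel has a nonzero maximum.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smooth_factors(x, w, alpha=0.5):
    """s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha) per input channel."""
    factors = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        factors.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return factors

def apply_smoothing(x, w, s):
    """Move scale from activations into weights: X / s and diag(s) @ W."""
    x_s = [[row[j] / s[j] for j in range(len(s))] for row in x]
    w_s = [[w[j][k] * s[j] for k in range(len(w[j]))] for j in range(len(s))]
    return x_s, w_s

# Toy data (made-up values): 2 tokens x 3 input channels, channel 1 is an outlier.
x = [[0.5, -40.0, 0.2],
     [-0.3, 60.0, 0.1]]
# 3 input channels x 2 output channels.
w = [[0.2, -0.1],
     [0.05, 0.1],
     [-0.3, 0.25]]

s = smooth_factors(x, w)
x_s, w_s = apply_smoothing(x, w, s)

print("max |X| before:", max(abs(v) for row in x for v in row))
print("max |X| after :", max(abs(v) for row in x_s for v in row))
print("output unchanged:", matmul(x, w), matmul(x_s, w_s))
```

The layer output is mathematically unchanged because the factors cancel inside the product. Only the difficulty of quantizing each operand shifts: a larger alpha moves more of the range from activations into weights.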

Quantization Algorithms

The following table summarizes the supported weight quantization, activation quantization, and structure-specific quantization solutions.

| Algorithm | Type | Core Idea | Application Scenario | Description |
| --- | --- | --- | --- | --- |
| AutoRound | Weight quantization optimization | Optimizes the rounding offset based on SignSGD to reduce reconstruction error. | Scenarios requiring ultra-low-bit quantization such as 4-bit | View details |
| FA3 Quant | Activation quantization | Performs per-head INT8 quantization for attention activations. | Scenarios involving long sequences or the MLA architecture | View details |
| GPTQ | Weight quantization optimization | Minimizes quantization error through column-wise optimization and error compensation. | Scenarios requiring high-precision weight quantization | View details |
| KVCache Quant | KV cache quantization | Provides a quantization solution designed for the KV cache. | Scenarios requiring improved long-sequence inference efficiency | View details |
| Linear Quant | Basic quantization | Performs weight quantization and activation quantization on linear layers. | Basic quantization scenarios | View details |
| PDMIX | Phase-wise mixed quantization | Uses dynamic quantization for prefilling and static quantization for decoding. | Scenarios requiring large model inference acceleration to balance accuracy and performance | View details |
| Histogram | Activation quantization | Analyzes the histogram distribution to search for the optimal clipping interval. | Scenarios requiring outlier filtering to improve accuracy | View details |
| MinMax | Basic quantization | Determines the quantization range based on the maximum and minimum statistics. | Basic quantization scenarios requiring low computational overhead | View details |
| SSZ | Weight quantization | Iteratively searches for the optimal scaling factor and offset. | Scenarios requiring accuracy optimization for uneven weight distributions | View details |
| LAOS | Low-bit quantization | Optimizes ultra-low-bit scenarios such as W4A4. | Scenarios requiring extreme compression | View details |
| Float Sparse | Sparsification | Implements model floating-point sparsification based on the ADMM algorithm. | Scenarios requiring high compression ratios | View details |
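The simplest entry above, range selection from minimum and maximum statistics, can be sketched in a few lines. The example below derives a scale and zero point for asymmetric unsigned 8-bit quantization from the observed range, then quantizes and dequantizes a toy tensor. It is a generic min-max sketch and not msModelSlim's implementation; the function names, the unsigned integer grid, and the choice to extend the range to include zero are assumptions.

```python
def minmax_params(values, num_bits=8):
    """Derive scale and zero point from the observed min and max."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax

def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers on the grid, clamping to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]

# Toy tensor (made-up values).
values = [-1.0, -0.25, 0.0, 0.33, 2.0]
scale, zero_point, qmin, qmax = minmax_params(values)
q_values = quantize(values, scale, zero_point, qmin, qmax)
recovered = dequantize(q_values, scale, zero_point)

print("scale:", scale, "zero_point:", zero_point)
print("quantized:", q_values)
print("max abs error:", max(abs(a - b) for a, b in zip(values, recovered)))
```

Because the range is taken straight from the extremes, a single outlier stretches the scale for every other value. That is the trade-off the clipping-based and search-based methods in the table try to improve on, at a higher calibration cost.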

Automatic Tuning Strategies

The following table summarizes the strategies that automatically search for the optimal quantization configuration.

| Algorithm | Core Idea | Application Scenario | Description |
| --- | --- | --- | --- |
| Standing High | Combines outlier suppression strategies with a binary search approach to minimize the number of fallback layers while meeting accuracy requirements. | Scenarios requiring fine-grained control over templates and strategies alongside a complete quantization configuration | View details |
| Standing High With Experience | Automatically generates a quantization configuration based on expert knowledge, requiring only the quantization type and structural layout. | Scenarios where users are familiar with the model architecture and prefer not to provide a complete quantization configuration | View details |
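The binary search over fallback layers can be pictured with the following sketch. It is not the Standing High implementation: it assumes the layers have already been ranked from most to least sensitive, that accuracy never decreases as more layers fall back to higher precision, and that a caller-supplied `evaluate` function returns the accuracy for a given fallback set. All names, the layer list, and the accuracy numbers are hypothetical.

```python
def min_fallback_layers(ranked_layers, evaluate, target):
    """Return the shortest prefix of ranked_layers whose fallback meets target.

    Assumes evaluate(prefix) is non-decreasing in the prefix length.
    Returns None if even falling back every layer misses the target.
    """
    lo, hi = 0, len(ranked_layers)
    if evaluate(ranked_layers[:hi]) < target:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(ranked_layers[:mid]) >= target:
            hi = mid          # mid layers are enough; try fewer
        else:
            lo = mid + 1      # mid layers are not enough; need more
    return list(ranked_layers[:lo])

# Toy setup (made-up numbers): accuracy gained by keeping each layer unquantized.
BASE_ACCURACY = 0.60
GAINS = {"layer_3": 0.15, "layer_0": 0.10, "layer_7": 0.05, "layer_5": 0.02}
calls = []

def evaluate(fallback):
    """Stand-in for a real accuracy evaluation of the quantized model."""
    calls.append(list(fallback))
    return BASE_ACCURACY + sum(GAINS[name] for name in fallback)

ranked = ["layer_3", "layer_0", "layer_7", "layer_5"]  # most sensitive first
chosen = min_fallback_layers(ranked, evaluate, target=0.84)

print("fallback layers:", chosen)
print("evaluations run:", len(calls))
```

With n candidate layers, this needs on the order of log2(n) accuracy evaluations instead of n, which matters when each evaluation means running the quantized model over a calibration set.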

Algorithm Selection Suggestions

  • For beginners: Use Quick Quantization (V1), which integrates a suitable algorithm combination.
  • For the highest model accuracy: Use QuaRot and AutoRound in combination.
  • For long-sequence inference: Enable FA3 Quant and KVCache Quant.