Quick Quantization Results

Overview

This document describes the output files generated by the quick quantization feature: how the quantization results are organized and what each file is for.

Generated Files

After executing the quick quantization command, the system generates the following files in the specified save_path directory:

├── config.json                          # Original model configuration file
├── generation_config.json               # Original generation configuration file
├── quant_model_description.json         # Description file for quantized weights
├── quant_model_weights.safetensors      # Quantized model weight file (sharded and indexed by index.json if the weights are large)
├── tokenizer_config.json                # Original tokenizer configuration file
├── tokenizer.json                       # Original tokenizer vocabulary
├── {model_type}_best_practice.yaml       # Quantization configuration protocol file
├── vocab.json                           # Original vocabulary mapping file (for some models)
├── optional/                            # Optional export directory (generated when certain algorithms are enabled)
│   └── quarot.safetensors               # QuaRot global rotation matrix (generated when export_extra_info is enabled)
└── debug_info/                          # Debug information directory (generated only when the --debug option is specified)
    ├── debug_info.json                  # Debug metadata in JSON format
    └── debug_info.safetensors           # Debug tensor data in SafeTensors format

File Description

File	Description
`config.json`	Original model configuration file, which contains key architectural parameters such as the model framework, number of layers, and hidden dimensions.
`generation_config.json`	Generation configuration file of the original model, which contains inference-related parameters such as the sampling strategy and maximum generation length.
`quant_model_description.json`	Quantization weight description file, which records the exact quantization type and metadata for each weight tensor.
`quant_model_weights.safetensors`	Quantized model weight file, which contains the quantized weight data of the model. If the weights are large, they are sharded across multiple files and indexed through `model.safetensors.index.json`.
`tokenizer_config.json`	Configuration file of the original tokenizer, which stores parameters such as special tokens and the vocabulary size.
`tokenizer.json`	Vocabulary file of the original tokenizer, which defines the mapping relationship between tokens and unique IDs.
`{model_type}_best_practice.yaml`	Quantization configuration protocol file, which records the complete configuration parameters used during execution. For details, see Quantization Configuration Protocol.
`vocab.json`	Original vocabulary mapping file, which is included automatically with certain architectures, such as GPT-style models.
`optional/quarot.safetensors`	QuaRot global rotation matrix file, which is generated only when QuaRot is used with `export_extra_info` set to `True`. It stores the global rotation matrix `Q`. For details, see QuaRot - Rotation Quantization.
`debug_info/`	Debug information directory (generated only when the `--debug` option is specified), which contains the quantization context for troubleshooting and algorithm analysis. For details, see Debug Information Output.
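The listing above can be checked with a short script. The following is a minimal sketch, not part of the tool: the `inventory` helper and the split into "required" and "optional" files are assumptions made for illustration, and weight and YAML files are matched by pattern because their exact names depend on sharding and on the model type.

```python
from pathlib import Path

# File names come from the directory listing above; which ones count as
# "required" is an assumption of this sketch.
REQUIRED_FILES = ("config.json", "quant_model_description.json")
OPTIONAL_FILES = (
    "generation_config.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.json",
    "optional/quarot.safetensors",
    "debug_info/debug_info.json",
    "debug_info/debug_info.safetensors",
)


def inventory(save_path):
    """Return (missing_required, present_optional, weight_files, yaml_files)."""
    root = Path(save_path)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    present = [name for name in OPTIONAL_FILES if (root / name).is_file()]
    # Weights may be one file or several shards, so match by suffix.
    weights = sorted(p.name for p in root.glob("*.safetensors"))
    # {model_type}_best_practice.yaml: the prefix depends on the model.
    yamls = sorted(p.name for p in root.glob("*_best_practice.yaml"))
    return missing, present, weights, yamls
```

Calling `inventory(save_path)` after a run gives a quick view of which optional outputs (QuaRot export, debug information) were actually produced.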

`quant_model_description.json`

The quant_model_description.json file records the quantization type and related metadata for each weight tensor in a model. The inference framework relies on it to decide how to load each tensor of the quantized model.

File Structure Example

{
  "model_quant_type": "W8A8",
  "version": "1.0.0",
  "group_size": 128,
  "kv_quant_type": "KV8",
  "model.layers.0.self_attn.qkv_proj.weight": "W8A8",
  "model.layers.0.self_attn.o_proj.weight": "W8A8",
  "model.layers.0.mlp.gate_proj.weight": "W8A8",
  "model.layers.0.mlp.up_proj.weight": "W8A8",
  "model.layers.0.mlp.down_proj.weight": "W8A8",
  "metadata": {},
  "optional": {}
}

Note

The *.weight field name is determined by the model architecture itself.

Field Description

Global Metadata Fields

Field	Type	Description
`model_quant_type`	`string`	Quantization type of the entire model, which identifies the quantization scheme.
`version`	`string`	Version number of the quantization tool, in the format of `x.x.x`.
`group_size`	`int`	Quantization group size, which applies to group quantization.
`kv_quant_type`	`string`	KV Cache quantization type, where `KV8` indicates 8-bit KV Cache.
`kv_cache_type`	`string`	Alias for the KV Cache quantization type, which shares an identical meaning with `kv_quant_type`.
`fa_quant_type`	`string`	Flash Attention quantization type, such as `FAQuant`.
`reduce_quant_type`	`string`	Communication quantization type, such as `per_channel`.
`metadata`	`object`	Additional metadata, such as the QuaRot information.
`optional`	`object`	Optional functional module information, such as the QuaRot global rotation matrix path.

Quantization Type Enumerated Values

Enumerated Value	Description
`FLOAT`	Floating-point (unquantized)
`W16A16S`	W16A16S sparse quantization with 16-bit sparse weights
`W8A8`	W8A8 quantization with 8-bit weights and 8-bit activations
`W8A8_DYNAMIC`	W8A8 dynamic quantization with 8-bit weights and per-token dynamic activations
`W8A8_MIX`	W8A8 mixed quantization combining static and dynamic quantization
`W8A16`	W8A16 quantization with 8-bit weights and 16-bit activations
`W4A4_DYNAMIC`	W4A4 dynamic quantization with 4-bit weights and per-token dynamic activations
`WFP8AFP8_DYNAMIC`	WFP8/AFP8 dynamic quantization
`W8A8_MXFP8`	W8A8 MXFP8 quantization
`W4A8_MXFP`	W4A8 MXFP quantization
`W4A4_MXFP4`	W4A4 MXFP4 quantization
`W4A8_DYNAMIC`	W4A8 dynamic quantization with 4-bit weights and per-token dynamic activations
`C8`	KV Cache 8-bit quantization
`FAQuant`	Flash Attention quantization
`FLATQUANT_DYNAMIC`	FlatQuant dynamic quantization
`FLATQUANT`	FlatQuant static quantization

Weight Tensor Quantization Type

In addition to the global metadata fields described above, other key-value pairs in the JSON file represent the quantization types of individual weight tensors in the model. The key is the weight name, and the value is the quantization type used by the weight.

Example:

{
  "model.layers.0.self_attn.qkv_proj.weight": "W8A8"
}

This pair indicates that the model.layers.0.self_attn.qkv_proj.weight weight uses the W8A8 quantization type.
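Because global metadata and per-tensor entries share one flat JSON object, a reader has to separate them by key. The sketch below does this using the field names from the Global Metadata Fields table. The `split_description` helper is hypothetical, and the sample input is a shortened variant of the example above with one `FLOAT` entry added for illustration.

```python
import json

# Global metadata keys listed in the "Global Metadata Fields" table.
GLOBAL_FIELDS = {
    "model_quant_type", "version", "group_size", "kv_quant_type",
    "kv_cache_type", "fa_quant_type", "reduce_quant_type",
    "metadata", "optional",
}


def split_description(description):
    """Split a parsed description into (global metadata, per-tensor types)."""
    meta = {k: v for k, v in description.items() if k in GLOBAL_FIELDS}
    tensors = {k: v for k, v in description.items() if k not in GLOBAL_FIELDS}
    return meta, tensors


example = json.loads("""
{
  "model_quant_type": "W8A8",
  "version": "1.0.0",
  "group_size": 128,
  "model.layers.0.self_attn.qkv_proj.weight": "W8A8",
  "model.layers.0.mlp.down_proj.weight": "FLOAT",
  "metadata": {},
  "optional": {}
}
""")
meta, tensors = split_description(example)
print(meta["model_quant_type"])       # W8A8
print(sorted(set(tensors.values())))  # ['FLOAT', 'W8A8']
```

Keeping the known metadata keys in one set means that any key not in the set is treated as a weight name, which matches the rule stated above.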

Weight Type Details

This section describes the parameter structure of different quantization modes based on the fields in quant_model_description.json. Quantization parameters vary depending on the quantization mode, and the .safetensors weight files and .json description files differ accordingly.

Quantization Mode Parameters

The following content describes the parameter structure of each quantization mode based on the AscendV1Saver implementation.

FLOAT (Unquantized)

FLOAT indicates weights that have not undergone quantization. These weights retain their original floating-point precision.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	float16/bfloat16	Original floating-point weight
`bias`	float16/bfloat16	Bias (optional)

Typical Weight Names

model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.bias

quant_model_description.json Identifier: FLOAT

W16A16S (Sparse Quantization)

W16A16S is a floating-point weight sparse quantization mode, where weights retain floating-point precision and undergo sparsification processing.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	float16/bfloat16	Sparsified weights (non-zero values)
`scale`	float16/bfloat16	Scaling factor

Typical Weight Names

model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.scale

quant_model_description.json Identifier: W16A16S

W8A8 (Static Quantization)

W8A8 is a static quantization mode, which applies int8 quantization to both weights and activations.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8	Quantized weight data
`quant_bias`	int32	Quantization bias
`input_scale`	float16/bfloat16	Activation quantization scale factor
`input_offset`	float16/bfloat16	Activation quantization offset factor
`deq_scale`	int64/float32	Combined dequantization scale factor calculated as `input_scale` × `weight_scale`
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.quant_bias
model.layers.0.self_attn.qkv_proj.input_scale
model.layers.0.self_attn.qkv_proj.input_offset
model.layers.0.self_attn.qkv_proj.deq_scale

quant_model_description.json Identifier: W8A8
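The relationship `deq_scale` = `input_scale` × `weight_scale` from the table can be illustrated with plain numbers. This is only a sketch: it assumes one `weight_scale` per output channel and a single `input_scale`, and it assumes that `deq_scale` rescales integer matmul accumulators back to floating point. It does not model how `quant_bias`, `input_offset`, or the int64 storage variant are handled.

```python
def combined_deq_scale(input_scale, weight_scales):
    """Per-output-channel deq_scale = input_scale * weight_scale."""
    return [input_scale * ws for ws in weight_scales]


def dequantize_accumulator(acc, deq_scale):
    """Rescale integer accumulators (one per output channel) to floats.

    Assumed usage for illustration; the real kernel may differ.
    """
    return [a * s for a, s in zip(acc, deq_scale)]


deq = combined_deq_scale(0.5, [0.25, 0.125])
print(deq)                                     # [0.125, 0.0625]
print(dequantize_accumulator([40, -16], deq))  # [5.0, -1.0]
```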

W8A8_DYNAMIC (Dynamic Quantization)

W8A8_DYNAMIC is a mode with int8 weight quantization and per-token dynamic activation quantization.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8	Quantized weight data
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor, which contains all zeros for symmetric quantization
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale
model.layers.0.self_attn.qkv_proj.weight_offset

Dequantization Formula

deq_weight = (weight - weight_offset) * weight_scale
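The formula above can be applied directly to stored values. The following sketch uses plain Python lists and assumes per-output-channel granularity (one scale and one offset per weight row); the actual tensor shapes in the weight file may differ.

```python
def dequantize(weight, weight_scale, weight_offset):
    """Apply deq_weight = (weight - weight_offset) * weight_scale per row.

    weight is a list of rows (one per output channel); weight_scale and
    weight_offset hold one value per row (an assumption of this sketch).
    """
    return [
        [(q - off) * scale for q in row]
        for row, scale, off in zip(weight, weight_scale, weight_offset)
    ]


w_q = [[-128, 0, 127], [64, -64, 2]]
scales = [0.5, 0.25]
offsets = [0.0, 0.0]  # all zeros for symmetric quantization
print(dequantize(w_q, scales, offsets))
# [[-64.0, 0.0, 63.5], [16.0, -16.0, 0.5]]
```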

quant_model_description.json Identifier: W8A8_DYNAMIC

W8A8_MIX (Mixed Quantization)

W8A8_MIX is a mixed mode that combines static and dynamic quantization.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8	Quantized weight data
`quant_bias`	int32	Quantization bias
`input_scale`	float16/bfloat16	Activation quantization scale factor
`input_offset`	float16/bfloat16	Activation quantization offset factor
`deq_scale`	int64/float32	Combined dequantization scale factor
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.quant_bias
model.layers.0.self_attn.qkv_proj.input_scale
model.layers.0.self_attn.qkv_proj.input_offset
model.layers.0.self_attn.qkv_proj.deq_scale
model.layers.0.self_attn.qkv_proj.weight_scale
model.layers.0.self_attn.qkv_proj.weight_offset

quant_model_description.json Identifier: W8A8_MIX

W8A16 (Weight-Only Quantization)

W8A16 is a weight quantization mode that applies int8 quantization only to weights, while activations retain floating-point precision.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8	Quantized weight data
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor, which contains all zeros for symmetric quantization
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale
model.layers.0.self_attn.qkv_proj.weight_offset

Dequantization Formula

deq_weight = (weight - weight_offset) * weight_scale

quant_model_description.json Identifier: W8A16

W4A4_DYNAMIC (W4A4 Dynamic Quantization)

W4A4_DYNAMIC is an ultra-low-bit mode featuring int4 weight quantization and per-token dynamic activation quantization.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8	Quantized weight data packed as int4 storage
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale
model.layers.0.self_attn.qkv_proj.weight_offset

Note: Activation quantization parameters are calculated dynamically during inference and are not saved to the weight file.
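Since the int4 weights are stored packed inside int8 data, a consumer has to unpack two 4-bit values from each byte. This document does not specify the packing layout, so the sketch below assumes two's-complement nibbles with the low nibble first. Both helper functions are hypothetical.

```python
def unpack_int4(packed):
    """Unpack signed int4 values from bytes, low nibble first (assumed layout)."""
    out = []
    for byte in packed:
        byte &= 0xFF  # accept int8 values (-128..127) as well as 0..255
        for nibble in (byte & 0x0F, byte >> 4):
            # Sign-extend the 4-bit two's-complement value.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def pack_int4(values):
    """Inverse of unpack_int4 for an even-length list of values in -8..7."""
    return [
        (lo & 0x0F) | ((hi & 0x0F) << 4)
        for lo, hi in zip(values[0::2], values[1::2])
    ]


packed = pack_int4([-8, 7, 3, -1])
print(packed)               # [120, 243]
print(unpack_int4(packed))  # [-8, 7, 3, -1]
```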

quant_model_description.json Identifier: W4A4_DYNAMIC

WFP8AFP8_DYNAMIC (FP8 Dynamic Quantization)

WFP8AFP8_DYNAMIC is an FP8 floating-point dynamic quantization mode.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	float8_e4m3fn	Weight in FP8 format
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale
model.layers.0.self_attn.qkv_proj.weight_offset

quant_model_description.json Identifier: WFP8AFP8_DYNAMIC

W8A8_MXFP8 (MXFP8 Quantization)

W8A8_MXFP8 is a quantization mode using Microscaling (MX) formats and FP8.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	float8_e4m3fn	Weight in FP8 format
`weight_scale`	uint8	Scale factor stored with a +127 offset
`bias`	float32	Original floating-point bias (optional)

Note: A +127 offset is added to weight_scale before it is stored. This shifts its value range from [-127, 128] to [0, 255], which is exactly the range of uint8.
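Reading a stored `weight_scale` value back means undoing the +127 offset. The sketch below shows that step. Interpreting the recovered value as a power-of-two exponent follows the usual Microscaling convention and is an assumption here, not something this document states. Both function names are hypothetical.

```python
def decode_shared_exponent(stored):
    """Undo the +127 offset on one uint8 weight_scale value."""
    if not 0 <= stored <= 255:
        raise ValueError("weight_scale must fit in uint8")
    return stored - 127


def exponent_to_scale(exponent):
    """Assumed interpretation: the scale is the power of two 2**exponent."""
    return 2.0 ** exponent


for stored in (0, 127, 130, 255):
    print(stored, decode_shared_exponent(stored))
# 0 -127
# 127 0
# 130 3
# 255 128
print(exponent_to_scale(decode_shared_exponent(130)))  # 8.0
```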

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale

quant_model_description.json Identifier: W8A8_MXFP8

W4A8_MXFP (W4A8 MXFP Quantization)

W4A8_MXFP is an MXFP quantization mode with 4-bit weights and 8-bit activations.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	float8_e4m3fn	Weight in FP8 format
`weight_scale`	uint8	Scale factor stored with a +127 offset
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale

quant_model_description.json Identifier: W4A8_MXFP

W4A4_MXFP4 (W4A4 MXFP4 Quantization)

W4A4_MXFP4 is an ultra-low-bit MXFP4 quantization mode.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	float8_e4m3fn	Weight in FP8 format
`weight_scale`	uint8	Scale factor stored with a +127 offset
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale

quant_model_description.json Identifier: W4A4_MXFP4

C8 (KV Cache Quantization)

KV Cache 8-bit quantization is a quantization mode for the Key-Value Cache.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`kv_cache_scale`	float32/float16	Scale factor for KV cache quantization
`kv_cache_offset`	float32/float16	Offset factor for KV cache quantization

Typical Weight Names

model.layers.0.self_attn.k_proj.kv_cache_scale
model.layers.0.self_attn.k_proj.kv_cache_offset
model.layers.0.self_attn.v_proj.kv_cache_scale
model.layers.0.self_attn.v_proj.kv_cache_offset

quant_model_description.json Identifier: C8

W4A8_DYNAMIC (W4A8 Dynamic Quantization)

W4A8_DYNAMIC is a mode with int4 weight quantization and per-token dynamic activation quantization.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8	Quantized weight data packed as int4 storage
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor
`scale_bias`	float32	Scale bias factor used for additional adjustments during dequantization
`bias`	float32	Original floating-point bias (optional)

Typical Weight Names

model.layers.0.self_attn.qkv_proj.weight
model.layers.0.self_attn.qkv_proj.weight_scale
model.layers.0.self_attn.qkv_proj.weight_offset
model.layers.0.self_attn.qkv_proj.scale_bias

Dequantization Formula

deq_weight = (weight - weight_offset) * weight_scale + scale_bias

quant_model_description.json Identifier: W4A8_DYNAMIC

FlatQuant_DYNAMIC (FlatQuant Dynamic Quantization)

FlatQuant_DYNAMIC is a FlatQuant dynamic quantization mode.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`weight`	int8/int32	Quantized weight data
`weight_scale`	float32	Weight quantization scale factor
`weight_offset`	float32	Weight quantization offset factor
`input_scale`	float32	Activation quantization scale factor (dynamic)
`input_offset`	float32	Activation quantization offset factor (dynamic)
`deq_scale`	float32	Combined dequantization scale factor
`quant_bias`	int32	Quantization bias
`left_trans`	float32	Left matrix for linear transformation
`right_trans`	float32	Right matrix for linear transformation
`clip_ratio`	float32	Clipping ratio factor
`bias`	float32	Original floating-point bias (optional)

Notes:

FlatQuant is a quantization method combined with linear transformation.
The left_trans and right_trans parameters are matrices used for feature transformation.
The clip_ratio parameter controls the quantization range.

quant_model_description.json Identifier: W8A8_FLATQUANT_DYNAMIC or W4A8_FLATQUANT_DYNAMIC

NonFusionSmoothQuant (Smooth Quantization)

NonFusionSmoothQuant is a smooth quantization mode that reduces quantization error.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`div.mul_scale`	float32	Smooth scale factor
Other parameters	-	Determined by the quantization type of the internal linear layer

Typical Weight Names

model.layers.0.self_attn.q_proj.div.mul_scale
model.layers.0.self_attn.q_proj.linear.weight

quant_model_description.json Identifier: FLOAT (Inner Weights)

FAQuant (Flash Attention Quantization)

FAQuant is a Flash Attention quantization mode.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`scale`	float16/bfloat16	Quantization scale factor
`offset`	float16/bfloat16	Quantization offset factor

Typical Weight Names

model.layers.0.self_attn.q_proj.scale
model.layers.0.self_attn.q_proj.offset

quant_model_description.json Identifier: FAQuant

Notes:

A checkmark (✓) indicates that the quantization mode contains this parameter, a hyphen (-) indicates that the quantization mode does not contain this parameter, and ✓ (+127) indicates that a +127 offset adjustment is required.

The parameters of NonFusionSmoothQuant are determined by the quantization type of the internal linear layer, and this mode additionally includes the div.mul_scale parameter.

The rotation matrix parameters of QuaRot may include a subset or the entirety of the rotation matrices based on the specific implementation.

QuaRot - Rotation Quantization

Parameter Description

QuaRot is a rotation-based quantization method that maintains the functional equivalence of a quantized model.

Quantization Parameters (Stored in .safetensors)

Parameter	Data Type	Description
`heads_rotation`	float32	Rotation matrix for multi-head attention
`kronecker_rotation_m`	float32	Kronecker rotation matrix `M`
`kronecker_rotation_n`	float32	Kronecker rotation matrix `N`
`global_rotation`	float32	Global rotation matrix saved in the `optional` directory

Notes:

The heads_rotation parameter handles rotation for multi-head attention.
The kronecker_rotation_m and kronecker_rotation_n parameters are used for rotation at the MLP layer.
The global_rotation parameter is saved in the optional/quarot.safetensors file.

File Description

optional/quarot.safetensors

When using the QuaRot algorithm with export_extra_info set to True, the quantization tool generates an additional optional/ subdirectory within the save_path directory to store the global rotation matrix Q in SafeTensors format. The directory structure is as follows:

optional/
└── quarot.safetensors       # QuaRot global rotation matrix file.

Global rotation matrix Q

Key	Data Type	Description
`global_rotation`	float32	QuaRot global rotation matrix `Q`

Description Fields in `quant_model_description.json`

When online rotation is enabled, a metadata.quarot field is added to the quant_model_description.json file:

{
  "metadata": {                                 // Additional metadata information.
    "quarot": {                                 // Additional QuaRot export domain.
      "max_tp_size": 4,                         // Maximum TP size, configured by the max_tp_size parameter in the QuaRot quantization settings.
      "heads_rotation": {                       // Rotation matrix for multi-head attention.
        "layers": [                             // Layers that use online rotation (attention o layers).
          "model.layers.0.self_attn.o_proj.",
          "model.layers.1.self_attn.o_proj.",
          "model.layers.2.self_attn.o_proj."
        ]
      },
      "kronecker_rotation": {                   // Kronecker rotation matrix.
        "layers": [                             // Layers using online rotation (down layers), specified by down_proj_online_layers in the QuaRot configurations, and described by kronecker_rotation_m and kronecker_rotation_n in the safetensors file.
          "model.layers.2.mlp.down_proj."
        ]
      }
    }
  }
}
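A loader can use the `metadata.quarot` structure above to decide which layers need online rotation. The following sketch assumes that each entry in `layers` is a name prefix (the entries end with a dot) and matches weight names against it. The `online_rotation_kinds` helper is hypothetical.

```python
def online_rotation_kinds(description, weight_name):
    """Return which online rotations apply to weight_name per metadata.quarot."""
    quarot = description.get("metadata", {}).get("quarot", {})
    kinds = []
    for kind in ("heads_rotation", "kronecker_rotation"):
        prefixes = quarot.get(kind, {}).get("layers", [])
        # Entries end with "." so they are treated as name prefixes.
        if any(weight_name.startswith(p) for p in prefixes):
            kinds.append(kind)
    return kinds


description = {
    "metadata": {"quarot": {
        "max_tp_size": 4,
        "heads_rotation": {"layers": ["model.layers.0.self_attn.o_proj."]},
        "kronecker_rotation": {"layers": ["model.layers.2.mlp.down_proj."]},
    }}
}
print(online_rotation_kinds(description, "model.layers.0.self_attn.o_proj.weight"))
# ['heads_rotation']
```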

When export_extra_info is enabled, an optional.quarot field is added to the quant_model_description.json file:

{
  "optional": {                                           // Primary entry for optional export assets
    "quarot": {                                           // Additional QuaRot export domain
      "rotation_map": {                                   // Rotation information mapping table
        "global_rotation": "optional/quarot.safetensors"  // Global rotation matrix file path (relative path)
      }
    }
  }
}
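Because `global_rotation` is stored as a relative path, it has to be resolved before the file can be opened. The sketch below assumes the path is relative to `save_path`, the directory that contains quant_model_description.json. The `global_rotation_path` helper is hypothetical.

```python
from pathlib import Path


def global_rotation_path(save_path, description):
    """Return the resolved path of the global rotation file, or None if absent."""
    rel = (
        description.get("optional", {})
        .get("quarot", {})
        .get("rotation_map", {})
        .get("global_rotation")
    )
    if rel is None:
        return None
    # Assumption: the stored path is relative to save_path.
    return Path(save_path) / rel
```

Returning `None` when the field is missing lets callers treat models exported without `export_extra_info` the same way as models that do not use QuaRot.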

Application Scenarios

Inference framework loading: The inference framework reads the optional.quarot.rotation_map entry in the quant_model_description.json file and loads the global rotation matrix from the path it contains to perform online rotation.
Algorithm reproduction and debugging: Load the rotation matrix directly to verify the mathematical equivalence of the QuaRot transformation.
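One basic check when verifying a rotation matrix is that it is orthogonal, that is, Q·Qᵀ equals the identity. The sketch below performs this check on a matrix given as nested Python lists. With a real export, the matrix would first be read from the `global_rotation` key of optional/quarot.safetensors and converted to this form. The tolerance value is an arbitrary choice.

```python
def is_orthogonal(q, tol=1e-6):
    """Return True if the square matrix q (list of rows) satisfies Q @ Q.T == I."""
    n = len(q)
    if any(len(row) != n for row in q):
        return False
    for i in range(n):
        for j in range(n):
            dot = sum(q[i][k] * q[j][k] for k in range(n))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    return True


# A normalized 2x2 Hadamard matrix is orthogonal.
h = 2 ** -0.5
print(is_orthogonal([[h, h], [h, -h]]))         # True
print(is_orthogonal([[1.0, 1.0], [0.0, 1.0]]))  # False
```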

Debug Information Output

When the --debug option is specified in the quantization command, msModelSlim automatically saves the execution context during the quantization process to the debug_info directory after quantization is complete.

Debug Information Directory Structure

debug_info/
├── debug_info.json                  # Debug metadata in JSON format
└── debug_info.safetensors           # Debug tensor data in SafeTensors format

Debug Information Files

`debug_info.json`

Stores non-tensor data and tensor metadata generated during the quantization process. The data is organized by namespace.

File Structure Example

{
  "linear_quant_namespace": {
    "layer_name": "model.layers.0.self_attn.qkv_proj",
    "quant_config": {
      "weight_dtype": "int8",
      "act_dtype": "int8"
    },
    "statistics": {
      "weight_min": -0.5,
      "weight_max": 0.5
    },
    "scale_tensor": {
      "_type": "tensor",
      "_file": "debug_info.safetensors",
      "_key": "tensor_0"
    }
  },
  "iter_smooth_namespace": {
    "smoothing_factors": {
      "_type": "tensor",
      "_file": "debug_info.safetensors",
      "_key": "tensor_1"
    }
  }
}

Field Description

Namespace (namespace): An independent namespace is created by each processor or module to isolate the debug information of different phases.
Common fields: record scalar values such as integers, floating-point numbers, strings, and Boolean values directly.
Tensor fields: store reference information for PyTorch tensors.
- _type: indicates that the field is a tensor reference with the fixed value "tensor".
- _file: indicates the name of the file where the tensor data is stored (debug_info.safetensors).
- _key: indicates the key name of the tensor in the SafeTensors file.
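The tensor reference convention described above can be followed with a small recursive walk over the parsed JSON. The sketch below collects every reference together with its location. The `find_tensor_refs` helper is hypothetical, and the sample input is a trimmed copy of the file structure example.

```python
def find_tensor_refs(node, path=()):
    """Yield (dotted_path, file, key) for every tensor reference under node."""
    if isinstance(node, dict):
        if node.get("_type") == "tensor":
            yield ".".join(path), node.get("_file"), node.get("_key")
            return
        for name, child in node.items():
            yield from find_tensor_refs(child, path + (name,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_tensor_refs(child, path + (str(index),))


debug_meta = {
    "linear_quant_namespace": {
        "layer_name": "model.layers.0.self_attn.qkv_proj",
        "scale_tensor": {
            "_type": "tensor",
            "_file": "debug_info.safetensors",
            "_key": "tensor_0",
        },
    },
    "iter_smooth_namespace": {
        "smoothing_factors": {
            "_type": "tensor",
            "_file": "debug_info.safetensors",
            "_key": "tensor_1",
        },
    },
}
for ref in find_tensor_refs(debug_meta):
    print(ref)
# ('linear_quant_namespace.scale_tensor', 'debug_info.safetensors', 'tensor_0')
# ('iter_smooth_namespace.smoothing_factors', 'debug_info.safetensors', 'tensor_1')
```

Each yielded `_key` can then be looked up in the corresponding SafeTensors file to fetch the actual tensor.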

`debug_info.safetensors`

Stores all tensor data generated during the quantization process in SafeTensors format, including:

Quantization parameters such as scale and zero_point
Statistics such as minimum values, maximum values, and histograms
Intermediate result tensors
Smoothing factors for the outlier suppression algorithm
Other tensors used for debugging

Features

Efficient storage: The SafeTensors format supports fast loading and memory mapping.
Cross-platform compatibility: The file can be shared across different frameworks and platforms.
Security: Unlike the pickle format, loading a SafeTensors file does not execute arbitrary code, which avoids code injection risks.

Debug Information Usage

Debug information applies to the following scenarios:

Quantization accuracy tuning: Identify layers with significant quantization error and verify whether the outlier suppression algorithm takes effect.
Algorithm research and development: Compare different quantization algorithms and develop new quantization strategies.
Issue locating and reporting: Quickly locate issues and provide detailed diagnostic information to technical support.
Model analysis and optimization: Understand activation distribution characteristics across model layers and identify quantization-sensitive layers.

Debug Information Loading Example

import json
from safetensors import safe_open

# Load JSON metadata
with open("debug_info/debug_info.json", "r") as f:
    debug_meta = json.load(f)

# Load SafeTensors tensor data
with safe_open("debug_info/debug_info.safetensors", framework="pt") as f:
    # Obtain the key names of all tensors
    tensor_keys = f.keys()
    
    # Load specific tensors
    for key in tensor_keys:
        tensor = f.get_tensor(key)
        print(f"{key}: shape={tensor.shape}, dtype={tensor.dtype}")

Precautions

Debug information may occupy significant disk space, typically 10% to 50% of the model size.
Enabling debug mode slightly increases quantization time, typically by 5% to 10%.
Debug information may contain sensitive model information. Keep it secure.

For details, see the Debug Mode User Guide.
