AMCT

Ascend Model Compression Toolkit

Ascend NPU Native Model Compression Toolkit

Quick Start · Features · Samples · FAQ · Contribution

🔥 Latest Updates

[2026/05/28] Added quantization and PTQ algorithm support for current mainstream LLMs, with one-stop samples for DeepSeek-V4 and Qwen3.6-MoE
[2026/04/24] Added DeepSeek-V4 model INT8 quantization support
[2026/04/17] Added HiFloat8 quantile quantization (Quantile) algorithm
[2026/03/02] Added HiFloat8 direct data conversion (Cast) algorithm
[2026/02/02] Added HiFloat8 / MXFP8 / MXFP4 data quantization
[2025/12/22] AMCT project launched 🎉

🚀 Overview

AMCT is a model quantization and compression tool native to Ascend NPUs. Quantization shrinks the model and lets it run on the NPU's low-bit compute units, which significantly improves inference performance. The deployment architecture is as follows:

Highlights:

🎯 Hardware Affinity: Quantized results map directly onto the Ascend NPU low-bit compute units
🔢 Multi-Precision Full Stack: INT8 / INT4 / MXFP8 / MXFP4 / HiFloat8 available
🚀 Large Model Ready: Native support for frontier models such as DeepSeek-V3.2 / V4

✨ Core Features

Feature Category	Introduction
PTQ Quantization Algorithms	Post-training quantization algorithms including Min-Max, AWQ, GPTQ, and SmoothQuant; see Algorithm Introduction
HiFloat8 Quantization	8-bit floating-point format developed by Huawei, with tapered precision and a large dynamic range; see HiFloat8 Introduction
NPU Custom Operators	In-house NPU operators implemented as Ascend C kernels; see amct_ops
Large Model Quantization	Quantization solutions for DeepSeek-V3.2 / V4; see DeepSeek-V4

📊 Performance Benefits

Quantization significantly reduces deployment costs:

Precision Format	Weight Only (W)	Full Quantization (W+A)	Benefits
INT8	✅ Min-Max / AWQ / GPTQ	✅ Min-Max / SmoothQuant	Size ↓50% · Throughput ↑
INT4	✅ AWQ / GPTQ	✅ FlatQuant	Size ↓75% · Low bandwidth friendly
HiFloat8	✅ Cast / Quantile / OFMR	✅ Cast / Quantile / OFMR	Size ↓50% · Large dynamic range
MXFP8	✅ MXQuant	✅ MXQuant	Size ↓50% · High precision
MXFP4	✅ MXQuant	✅ MXQuant	Size ↓75% · Micro-scaling floating-point
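
The size figures in the table can be approximated with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the table does not state: the unquantized baseline is a 16-bit format, and the MX formats store one shared 8-bit scale per block of 32 elements (the usual micro-scaling layout). Scale overhead for INT8, INT4, and HiFloat8 is ignored. The function names are hypothetical and not part of AMCT.

```python
# Rough per-element storage estimate relative to an assumed 16-bit baseline.
BASELINE_BITS = 16  # assumption: FP16/BF16 source weights


def effective_bits(element_bits, block_size=0, scale_bits=0):
    """Average bits stored per element, including any shared block scale."""
    if block_size <= 0:
        return float(element_bits)
    return element_bits + scale_bits / block_size


def size_reduction(element_bits, block_size=0, scale_bits=0):
    """Fractional size reduction versus the 16-bit baseline."""
    return 1.0 - effective_bits(element_bits, block_size, scale_bits) / BASELINE_BITS


# (element bits, block size, shared scale bits); MX values are assumptions.
FORMATS = {
    "INT8": (8, 0, 0),
    "INT4": (4, 0, 0),
    "HiFloat8": (8, 0, 0),
    "MXFP8": (8, 32, 8),
    "MXFP4": (4, 32, 8),
}

for name, args in FORMATS.items():
    print(f"{name}: about {size_reduction(*args):.1%} smaller")
```

Under these assumptions the MX formats come out slightly below the rounded 50% and 75% figures because of the shared scale.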

📦 Quick Start

Environment Requirements

Dependency	Version
Python	≥ 3.9
PyTorch	2.7.1 or 2.1.0 (requires matching `torch_npu`)
GCC / CMake / patch	≥ 7.3 / ≥ 3.16 (recommended 3.20) / ≥ 2.7
CANN (Toolkit & Ops)	≥ 8.5.0 (requires pre-installed NPU driver / firmware)

For full environment setup, see Quick Installation.

Installation & Verification

# 1. Clone source code and install dependencies
git clone https://gitcode.com/cann/amct.git

# 2. Build and package from source
cd amct
bash build.sh --torch

# 3. Install (the artifact is written to build_out/)
#    ${version} comes from the file name in build_out/, e.g. amct_pytorch-1.1.0-py3-none-linux_aarch64.tar.gz
#    ${arch}    is the CPU architecture, e.g. x86_64 or aarch64
pip3 install build_out/amct_pytorch-${version}-py3-none-linux_${arch}.tar.gz --user

⚠️ Note: With pip versions later than 25.2, add --no-build-isolation to the install command; otherwise the build may fail with ModuleNotFoundError: No module named 'torch'.

# Verify AMCT installation
python3 -c "import amct_pytorch as amct; print('AMCT installed successfully')"

For more build options and local verification, please refer to Build Guide.

🏃 One-Stop Platform Quick Experience

The "One-Stop Platform" is an NPU environment provided for developers. It ships with a complete CANN environment and can be used directly. For this platform, the corresponding sample READMEs include a simplified "Quick Start" path that gets users to NPU inference in as few steps as possible. The list of supported models is still growing:

Practice	Introduction
Qwen3.6-MoE	Runs Qwen3.6-MoE quantization, data extraction, and PTQ on Atlas A3, with a standard launch process and configuration for the One-Stop Platform so that users can quickly complete an end-to-end NPU inference run.
DeepSeek-V4	Runs single-card inference for the DeepSeek-V4 Flash model on Atlas A3, with a standard launch process and configuration for the One-Stop Platform so that users can quickly complete an end-to-end NPU inference run.

📖 Documentation Samples

Topic	Content
Compression Concepts	Quantization, sparsity, distillation and other basic concepts
LLM Quantization	Quantization features for large language models (LLM)
Compression Features	Basic compression features supported by AMCT
API Documentation	Interface usage instructions
Algorithm Introduction	AWQ, GPTQ, SmoothQuant and other algorithm principles

🔍 Directory Structure

amct/
├── amct_pytorch/                  # PyTorch quantization compression core source code
│   ├── algorithms/                 # Quantization algorithm implementation
│   ├── cli/                        # Command-line entry
│   ├── common/                     # Common utilities, models, and data processing
│   ├── configs/                    # Quantization configuration templates
│   ├── experimental/               # Experimental features (HiFloat8, DeepSeek, etc.)
│   ├── quantization/               # Quantization data types and basic modules
│   └── workflows/                  # LLM quantization, evaluation, and deployment workflows
├── amct_ops/                      # AMCT custom NPU operators
├── examples/                      # End-to-end samples and usage examples
├── tests/                         # Unit tests
├── docs/                          # Tool documentation (concepts, APIs, algorithms, etc.)
├── cmake/                         # CMake build configuration
├── build.sh                       # Project compilation script
├── setup.py                       # Python package packaging entry
└── requirements.txt               # Python third-party dependencies

❓ FAQ

Algorithm Selection: When to use AWQ / GPTQ / SmoothQuant?

Algorithm	Applicable Scenario	Core Idea
AWQ	Large model PTQ where low quantization error is the priority	Activation-aware weight quantization that protects the ~1% most salient weights
GPTQ	Large model PTQ with layer-by-layer optimization	Hessian-based weight adjustment that minimizes quantization error
SmoothQuant	Models whose activation distributions are hard to quantize	Migrates activation quantization difficulty to the weights, smoothing activation outliers
Min-Max	Entry-level scenarios, simple and fast	Computes quantization factors directly from the observed min and max values
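
As an illustration of the Min-Max row, here is a minimal sketch of asymmetric Min-Max quantization in plain Python. It is not AMCT's implementation; the function names and the 8-bit unsigned range are assumptions chosen for the example.

```python
def minmax_quant_params(values, num_bits=8):
    """Derive a scale and zero point from the observed min and max values."""
    qmax = 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero stays exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # guard against an all-zero tensor
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clamping at the ends."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zp = minmax_quant_params(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
print(q)
print(restored)
```

Because the scale is set by the extreme values alone, a single outlier widens the step size for every other value, which is the weakness the other algorithms in the table try to address.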

Recommendation: For weight-only quantization of large models, prefer AWQ or GPTQ; for W8A8 full quantization, use SmoothQuant; for getting started, use Min-Max.
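
To make the SmoothQuant idea concrete, the sketch below applies the per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α), divides the activations by s, and multiplies the matching weight rows by s, so the matrix product is unchanged while the activation outlier shrinks. This is a plain-Python illustration with made-up data and hypothetical helper names, not the AMCT API; α = 0.5 is an assumed default.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing factors."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def matmul(x, w):
    """Plain-Python matrix product of x (tokens x in) and w (in x out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


# X: 2 tokens x 3 input channels; channel 1 carries an activation outlier.
X = [[0.5, 40.0, -1.0],
     [-0.2, -36.0, 0.8]]
# W: 3 input channels x 2 output channels.
W = [[0.3, -0.1],
     [0.02, 0.05],
     [-0.4, 0.2]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(3)]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(3)]
s = smooth_scales(act_absmax, weight_absmax)

# Divide activations by s and multiply the matching weight rows by s.
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(3)]

print(max(abs(v) for row in X for v in row))         # activation range before
print(max(abs(v) for row in X_smooth for v in row))  # activation range after
```

The smoothed activations and weights would then each be quantized to 8 bits; that step is omitted here.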

How to handle accuracy drop after quantization?

Steps to try, in priority order:

Adjust the amount of calibration data: increase batch_num (recommended: batch_num × batch_size = 16 or 32)
Fall back sensitive layers: identify quantization-sensitive layers (first layer, last layer, layers with few parameters) and set quant_enable: false for them in the configuration
Adjust the quantization algorithm: analyze the model's data distribution and choose an algorithm that suits it
Try Quantization-Aware Training (QAT): if PTQ cannot meet the accuracy target, retrain with QAT
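
One way to find candidates for the layer fallback step is to simulate quantization per layer and rank layers by signal-to-quantization-noise ratio (SQNR). The sketch below is a hypothetical helper in plain Python, not an AMCT API; the layer names and weights are made up, and it uses symmetric per-tensor Min-Max quantization as the assumed scheme. Layers at the top of the ranking would be the first to try with quant_enable: false.

```python
import math


def fake_quant_symmetric(values, num_bits=8):
    """Quantize then dequantize with a symmetric per-tensor Min-Max scale."""
    qmax = 2 ** (num_bits - 1) - 1
    absmax = max(abs(v) for v in values) or 1.0  # guard against all zeros
    scale = absmax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def sqnr_db(values, num_bits=8):
    """Signal-to-quantization-noise ratio in dB; lower means more sensitive."""
    restored = fake_quant_symmetric(values, num_bits)
    signal = sum(v * v for v in values)
    noise = sum((v - r) ** 2 for v, r in zip(values, restored))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)


def rank_sensitive_layers(layer_weights, num_bits=8):
    """Return layer names ordered from most to least quantization-sensitive."""
    scores = {name: sqnr_db(w, num_bits) for name, w in layer_weights.items()}
    return sorted(scores, key=scores.get)


# Made-up weights: "attn.q" has many mid-sized values squeezed by one outlier.
layers = {
    "attn.q": [0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 4.0],
    "mlp.0": [0.5, -0.4, 0.3, -0.6],
}
ranking = rank_sensitive_layers(layers, num_bits=4)
print(ranking)
```

In practice the ranking would be computed on real weights or on layer outputs over calibration data; weight-only SQNR is just the simplest proxy.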

"ModuleNotFoundError: No module named 'torch'" during installation?

Reason: With pip versions later than 25.2, build isolation hides the already-installed torch package from the build, so it cannot be imported.

Solution:

# Solution 1: Downgrade pip
pip install pip==25.2

# Solution 2: Add --no-build-isolation
pip3 install build_out/amct_pytorch-${version}-py3-none-linux_${arch}.tar.gz --user --no-build-isolation
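
For scripted installs, the choice between the two commands can be automated. The following is a small hypothetical helper, not something shipped with AMCT: it takes a pip version string and returns the install command, adding --no-build-isolation only when the version is later than 25.2. The package path reuses the example file name from the installation section.

```python
import re


def parse_version(text):
    """Return the first three numeric components, padded with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def needs_no_build_isolation(pip_version, threshold=(25, 2, 0)):
    """True when the pip version is later than 25.2."""
    return parse_version(pip_version) > threshold


def build_install_command(package_path, pip_version):
    """Assemble the pip3 install command as an argument list."""
    cmd = ["pip3", "install", package_path, "--user"]
    if needs_no_build_isolation(pip_version):
        cmd.append("--no-build-isolation")
    return cmd


package = "build_out/amct_pytorch-1.1.0-py3-none-linux_aarch64.tar.gz"
print(" ".join(build_install_command(package, "25.2")))
print(" ".join(build_install_command(package, "25.3")))
```

The returned list can be passed to subprocess.run; the installed pip version string is available from `pip3 --version`.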

For other questions, see the Compression Features Documentation or open an Issue.

💬 Community Discussion

Join the AMCT community to discuss and share:

Platform	Usage
GitCode Issue	Problem feedback, feature suggestions, technical discussion
GitCode Discussions	Experience sharing, best practices, community interaction
SIG Discussions	Technical decisions, problem handling, project rollout

🤝 Contributing

Contributions of code, algorithms, and documentation are welcome; see the Contributing Guide:

Simple bug fixes: submit a PR directly
New features / interface changes: discuss the approach in an Issue first, then submit a PR once consensus is reached
Code style: C/C++ follows the Google style (based on .clang-format), Python follows PEP8; enable pre-commit before submitting

🙏 Acknowledgments

Thanks to all developers who contributed to AMCT!

This project is inspired by the following open source projects:

AWQ - Activation-aware weight quantization
GPTQ - GPTQ implementation reference
SmoothQuant - Smooth activation quantization
FlatQuant - Matrix flat quantization

📝 License

This project is open source under the Apache License 2.0. Please read the Security Statement and Disclaimer before use.