Release Notes

Version Mapping

msModelSlim Version	CANN Version	PyTorch Version	torch_npu Version	Python Version	Transformers Version
26.0.0.alpha02	No specific version requirement	Depends on the specific model. See the corresponding model documentation.	Depends on the specific model. See the corresponding model documentation.	Python 3.10 and 3.11	Depends on the specific model. See the corresponding case description in the example directory.
26.0.0.alpha01	No specific version requirement	Depends on the specific model. See the corresponding model documentation.	Depends on the specific model. See the corresponding model documentation.	Python 3.10 and 3.11	Depends on the specific model. See the corresponding case description in the example directory.
8.3.0	8.2.RC1 or later	Depends on the specific model. See the corresponding model documentation.	Depends on the specific model. See the corresponding model documentation.	Python 3.10 and 3.11	Depends on the specific model. See the corresponding case description in the example directory.
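
The Python requirement in the table can be checked before installation. The following is a minimal sketch based only on the "Python 3.10 and 3.11" column; the helper name `python_is_supported` is illustrative and is not part of msModelSlim.

```python
import sys

# Supported interpreter versions, taken from the "Python Version" column above.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the major.minor version appears in the mapping table."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


if __name__ == "__main__":
    current = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    if python_is_supported():
        print("Python " + current + " is listed as supported.")
    else:
        print("Python " + current + " is not listed; use 3.10 or 3.11.")
```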

Version	Download Link	Checksum
26.0.0-alpha.2	msmodelslim-26.0.0a2-py3-none-any.whl	4711edb30c4354fcb99fb69a2e0351561b013bb1298d6f54a0ee409bf979a264
26.0.0-alpha.1	msmodelslim-26.0.0a1-py3-none-any.whl	60383c42bf103cf2f78304b3b974e2dac0190f0f20706a5ef347e55855048f42
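
A downloaded wheel can be compared against the checksum column before installation. The sketch below assumes the checksums are SHA-256 digests (their 64-hexadecimal-character length is consistent with SHA-256, but the table does not name the algorithm). The function names `sha256_of` and `verify_wheel` are illustrative.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the checksum table above.
# Assumption: the digests are SHA-256.
EXPECTED_SHA256 = {
    "msmodelslim-26.0.0a2-py3-none-any.whl":
        "4711edb30c4354fcb99fb69a2e0351561b013bb1298d6f54a0ee409bf979a264",
    "msmodelslim-26.0.0a1-py3-none-any.whl":
        "60383c42bf103cf2f78304b3b974e2dac0190f0f20706a5ef347e55855048f42",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path):
    """Return True if the file's digest matches the table entry for its name."""
    path = Path(path)
    expected = EXPECTED_SHA256.get(path.name)
    if expected is None:
        raise KeyError("No checksum listed for " + path.name)
    return sha256_of(path) == expected
```

For example, `verify_wheel("msmodelslim-26.0.0a2-py3-none-any.whl")` returns `True` only when the local file hashes to the listed value; a `False` result suggests a corrupted or altered download.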

For more details, see the release page.

For the compatibility information of each version, refer to the Version Mapping table above.

Supports custom practice directories through an entry point, laying the groundwork for the plugin-based model_adapter capability.
Improves automatic tuning.
Supports W4A8 quantization for the Qwen3-Coder-480B model and W8A8 quantization for the Qwen3.5 MoE model.
Supports W8A8 quantization for the GLM-4.7 model and W4A8 quantization for the GLM-5 model.
Supports W8A8 quantization for the Qwen2.5-Omni-7B model and the Qwen3-Omni-30B-A3B model.

Supports W8A8 quantization for Qwen3-VL-32B-Instruct.
Supports automatic tuning based on quantization-accuracy feedback and can automatically search for the optimal quantization configuration based on accuracy requirements.
Supports self-managed quantization for multimodal understanding models, along with quantization integration for those models.
Quick quantization supports multi-card quantization and distributed layer-by-layer quantization, improving the efficiency of large-model quantization.
Supports W8A8 quantization for DeepSeek-V3.2. Quantization can run on a single card with 64 GB of accelerator memory and 100 GB of system memory.
Supports W4A8 quantization for DeepSeek-V3.2-Exp. Quantization can run on a single card with 64 GB of accelerator memory and 100 GB of system memory.
Supports W8A8 quantization for Qwen3-VL-235B-A22B.