CATLASS


⚠ Important Changes

At the first community meeting in March 2026, the CATLASS community officially confirmed that the mainline will add support for the next-generation Ascend hardware, Ascend 950PR/Ascend 950DT. To distinguish the underlying interface implementations on different platforms, this support introduces a new compilation macro, and users need to update their build commands accordingly.

  • New macro: CATLASS_ARCH, used to specify the target architecture. You can query its value in SIMD BuiltIn Keywords (the __NPU_ARCH__ column).

    • Atlas A2 Training Series Products / Atlas A2 Inference Series Products: 2201
    • Atlas A3 Training Series Products / Atlas A3 Inference Series Products: 2201
    • Ascend 950PR/Ascend 950DT: 3510
  • How to set the macro in each scenario:

    • bisheng command-line scenario: bisheng ... -DCATLASS_ARCH=2201 ...
    • cmake scenario: add_compile_definitions(CATLASS_ARCH=2201)
    • msopgen/aclnn project scenario:
      • Old usage: add_ops_compile_options(ALL OPTIONS -DCATLASS_ARCH=2201 ...)
      • New usage: npu_op_kernel_options(ascendc_kernels ALL OPTIONS -DCATLASS_ARCH=2201) (in an msopgen project, the first parameter defaults to ascendc_kernels and can be adjusted as needed)
    • CATLASS source repository: bash scripts/build.sh -DCATLASS_ARCH=2201 ...
    • Reference usage in this repository: examples/CMakeLists.txt
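
As a rough illustration of how a single macro can separate platform-specific code paths, the sketch below branches on CATLASS_ARCH at compile time. The values 2201 and 3510 come from the list above. The helper `ArchFamily`, the constant `kSelectedFamily`, the fallback default, and the returned strings are assumptions made for this example only and are not part of CATLASS.

```cpp
#include <string_view>

// Assumption for this sketch only: fall back to 2201 when the build does not
// pass -DCATLASS_ARCH=... on the command line.
#ifndef CATLASS_ARCH
#define CATLASS_ARCH 2201
#endif

// Hypothetical helper: maps an architecture value to a product family name.
constexpr std::string_view ArchFamily(int arch)
{
    if (arch == 2201) {
        return "Atlas A2/A3";
    }
    if (arch == 3510) {
        return "Ascend 950PR/950DT";
    }
    return "unknown";
}

// Compile-time selection of a platform-specific path. An unsupported value
// stops the build early instead of failing later at run time.
#if CATLASS_ARCH == 2201
constexpr std::string_view kSelectedFamily = ArchFamily(2201);
#elif CATLASS_ARCH == 3510
constexpr std::string_view kSelectedFamily = ArchFamily(3510);
#else
#error "CATLASS_ARCH must be 2201 or 3510 in this sketch"
#endif
```

Compiling the same file with -DCATLASS_ARCH=3510 selects the other branch without any source change, which is why every build entry point listed above needs to pass the macro.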

Latest News

See CHANGELOG for detailed updates in current and historical versions.


📌 Introduction

CATLASS (CANN Templates for Linear Algebra Subroutines), known in Chinese as the Ascend Operator Template Library, is a code repository focused on providing base templates for high-performance matrix multiplication operators.

CATLASS turns matrix operator code into templates through layered abstraction. This enables white-box assembly of operator compute logic and makes operator code reusable, replaceable, and partially modifiable. The library is designed around Ascend hardware characteristics and supports complex pipeline layouts for operators such as Flash Attention. It shares upper-layer code logic across platforms while allowing specialization for differences in the underlying hardware.
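
The layering idea can be sketched in plain host-side C++. This is a minimal CPU illustration of composing a matmul from replaceable template layers. The names `NaiveTileMma`, `BlockMatmul`, `MatmulKernel`, and `ExampleKernel` are hypothetical and are not CATLASS APIs, and real CATLASS kernels run on Ascend NPUs rather than on the host.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Lowest layer (hypothetical): multiply-accumulate into one tile of C.
// All matrices are row-major; lda/ldb/ldc are the row strides.
struct NaiveTileMma {
    static void Run(const float* a, const float* b, float* c,
                    std::size_t lda, std::size_t ldb, std::size_t ldc,
                    std::size_t m, std::size_t n, std::size_t k)
    {
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float acc = c[i * ldc + j];
                for (std::size_t p = 0; p < k; ++p) {
                    acc += a[i * lda + p] * b[p * ldb + j];
                }
                c[i * ldc + j] = acc;
            }
        }
    }
};

// Middle layer: walks the K dimension in chunks of TileK and delegates
// each chunk to the tile layer it was instantiated with.
template <class TileMma, std::size_t TileK>
struct BlockMatmul {
    static void Run(const float* a, const float* b, float* c,
                    std::size_t lda, std::size_t ldb, std::size_t ldc,
                    std::size_t m, std::size_t n, std::size_t k)
    {
        for (std::size_t k0 = 0; k0 < k; k0 += TileK) {
            const std::size_t kk = std::min(TileK, k - k0);
            TileMma::Run(a + k0, b + k0 * ldb, c, lda, ldb, ldc, m, n, kk);
        }
    }
};

// Top layer: splits C into BlockM x BlockN blocks (with edge handling)
// and hands each block to the block layer.
template <class Block, std::size_t BlockM, std::size_t BlockN>
struct MatmulKernel {
    static std::vector<float> Run(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t m, std::size_t n, std::size_t k)
    {
        std::vector<float> c(m * n, 0.0f);
        for (std::size_t i0 = 0; i0 < m; i0 += BlockM) {
            for (std::size_t j0 = 0; j0 < n; j0 += BlockN) {
                const std::size_t mm = std::min(BlockM, m - i0);
                const std::size_t nn = std::min(BlockN, n - j0);
                Block::Run(a.data() + i0 * k, b.data() + j0,
                           c.data() + i0 * n + j0, k, n, n, mm, nn, k);
            }
        }
        return c;
    }
};

// Assembling a kernel is a matter of choosing one type per layer.
using ExampleKernel = MatmulKernel<BlockMatmul<NaiveTileMma, 2>, 2, 2>;
```

Replacing `NaiveTileMma` with another type that exposes the same `Run` signature changes the innermost computation while the block and kernel layers stay untouched, which is the kind of partial modification the layered design is meant to allow.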

The template library speeds up development for custom scenarios. It provides performance optimization modules for different scenarios, which developers can assemble and customize. For custom shapes, performance can reach 0.98 to 1.2 times that of the corresponding benchmark operator.

[Figure: Matmul Performance Comparison]
[Figure: GroupedMatmul Performance Comparison]

This is the community co-development repository for CATLASS. It draws on the strengths of the Ascend ecosystem to jointly design and develop operator templates, and provides high-performance example implementations of typical operators. For an overview, see here.

⚡️ Quick Start

To quickly try out developing and using CATLASS operators, see the following guides.

  • Quick Start: Quickly get started with the template library, and compile and run existing operator examples.

  • Basic Development Guide: Uses the basic Matmul operator as an example to introduce CATLASS-based operator development practices.

  • Developer Practices: Provides beginner-to-advanced practice examples that cover writing code at each operator layer, compiling and testing, Tiling tuning, and operator optimization.

📚 Advanced References

The following materials can help you further develop and tune CATLASS operators and implement GEMM-class operators with better performance.

  • CATLASS API: Introduces the layered features of CATLASS and the general matrix multiplication GEMM API.

  • CATLASS Design Summary: Collects the design documents of the CATLASS project, including example algorithm design, swizzle strategies, and TLA design.

📁 Directory Structure Description

The key directories are as follows. For the detailed directory structure, see Project Directory.

catlass
├── cmake                     # cmake project files
├── docs                      # Documentation directory
├── examples                  # Root directory for kernel operator examples
|   ├── 00_basic_matmul       # Single-operator example
|   |   ├── basic_matmul.cpp  # Host-side operator invocation
|   |   ├── CMakeLists.txt
|   |   └── README.md         # Operator description example
|   ├── ...  
|   └── python_extension      # Project component for calling CATLASS operators from Python
├── include                   # Template header file set
|   ├── catlass               # Operator implementation logic at different layers
|   └── tla                   # Basic data structures related to computation
├── scripts                   # Build scripts
|   └── build.sh              # Operator example build script
├── tests                     # Test cases
└── tools                     # Related tools
    └── tuner                 # Tiling auto-tuning tool

💻 Software and Hardware Requirements

CATLASS has the following software and hardware requirements. The table below lists the hardware platforms supported by each CATLASS release and the minimum CANN version each release requires:

| CATLASS Community Version | Minimum Supported CANN Package Version | Supported Ascend Products |
|---|---|---|
| Current | 8.5.0; 9.0.0.beta2 (Ascend 950PR/Ascend 950DT) | Atlas A2 Training Series Products / Atlas A2 Inference Series Products; Atlas A3 Training Series Products / Atlas A3 Inference Series Products; Ascend 950PR/Ascend 950DT |
| v1.5.0 | 8.2.RC1; 9.0.0.beta2 (Ascend 950PR/Ascend 950DT) | Atlas A2 Training Series Products / Atlas A2 Inference Series Products; Atlas A3 Training Series Products / Atlas A3 Inference Series Products; Ascend 950PR/Ascend 950DT |
| v1.4.0 to v1.2.2 | 8.2.RC1 | Atlas A2 Training Series Products / Atlas A2 Inference Series Products; Atlas A3 Training Series Products / Atlas A3 Inference Series Products |
| v1.2.1 to v1.0.0 | 8.2.RC1.alpha002 | Atlas A2 Training Series Products / Atlas A2 Inference Series Products; Atlas A3 Training Series Products / Atlas A3 Inference Series Products |

The current CATLASS version has been tested to build in the following environments:

| System | CANN | gcc | cmake | python |
|---|---|---|---|---|
| Ubuntu 20.04.5 | 8.5.0 | 9.3 | 3.16 | 3.10 |
| Ubuntu 22.04.5 | 8.5.0 | 11.3 | 3.22 | 3.10 |
| openEuler 22.03 SP4 | 8.5.0 | 10.3 | 3.22 | 3.10 |
| Ubuntu 22.04.5 (Compiling 950 Examples) | 9.0.0.beta2 | 11.3 | 3.22 | 3.10 |

👥 Collaborators

Professor Lu Lu's Team, South China University of Technology

iFLYTEK Research Institute Engineering Group