Triton-Ascend Debugging Guide

1 Overview

This document is the Triton-Ascend Debugging Guide, which is intended for engineers who participate in adapting Triton to Ascend NPU. It systematically describes the common debugging methods and tools used during Triton-Ascend compilation and running.

The contents of this document are as follows:

Section	Description
1. Overview	Describes the core objectives of debugging (focusing on the `ttir.mlir` → `ttadapter.mlir` conversion) and provides guidance on common issues.
2. Compilation Process Overview	Describes the key phases of the Triton-Ascend end-to-end compilation chain, providing a context basis for subsequent debugging.
3. Temporary File Guide	Describes the storage locations and functions of intermediate files (such as the `.mlir`, `.ll`, and`.o` files) generated during the compilation, facilitating manual check.
4. Interpreter Mode	Describes how to set `TRITON_INTERPRET` to `1` to run the kernel on the CPU and use the result as the accuracy benchmark of the NPU computing result.
5. Debugging Methods	The following practical debugging methods are provided: • Static/Runtime printing • Compilation error debugging
Appendix A	Provides a quick reference table of common environment variables to improve debugging efficiency.

You are advised to refer to the corresponding sections as required to efficiently locate and resolve various exceptions in Triton-Ascend integration.

1.1 Triton-Ascend Common Issue Classification and Debugging Guide

During development, issues can be classified into different types. The following table provides guidance for quickly identifying issue types and preferred debugging methods.

Issue Type	Typical Symptom/Description	Preferred Debugging Method
Accuracy issue	The NPU running result is different from the benchmark reference result (such as the PyTorch or Triton CPU interpreter).	4. Interpreter mode 5.1 Debugging by printing
Compilation error (MLIRCompileError)	If the compilation fails in the conversion phase, `MLIRCompileError` is thrown on the Python side.	5.2 Compilation error debugging

2 Triton-Ascend Compilation Process Overview

Understanding the complete compilation chain is the basis for effective debugging. The compilation process of Triton-Ascend consists of the following phases:

Phase	Input	Output	Tool/Component	Description
Python Kernel compilation	`triton_kernel.py` (Python)	`ttir.mlir` (MLIR)	Triton JIT compiler	Compiles the Triton Python kernel written by users into the standard Triton IR (TTIR).
Triton IR adaptation and transformation	`ttir.mlir`	`ttadapter.mlir`	Ascend-adapted Triton backend	Key debugging phase. Converts TTIR into the adapter IR for the Ascend NPU backend.
MLIR compilation and code generation	`ttadapter.mlir`	`.o` (executable object file)	BiSheng compiler (`bishengir-compile`)	The adapter IR is further compiled and optimized to generate binary code that can be executed on the NPU.

# Triton-Ascend compilation process
[Python Kernel]
     ↓ (triton.compile)
[ttir.mlir]
     ↓        │ (TRITON_DEBUG=1 → ~/.triton/dump/)
[ttadapter.mlir]
     ↓ (bishengir-compile)
[NPU executable file.o]

This guide focuses on the second phase, that is, the ttir.mlir → ttadapter.mlir conversion. This phase is the main function of Triton-Ascend.

3 Triton-Ascend Temporary File Guide

During the compilation of Triton-Ascend, the system generates multiple temporary files for caching and debugging. Understanding the location and usage of these files is critical for efficient debugging.

3.1 Cache

Triton uses the cache mechanism to accelerate the repeated compilation process. Intermediate files generated during compilation are cached in the user directory to avoid repeated compilation of the same kernel.

Cache directory structure:

Default path: ~/.triton/cache/

Main cache content:

Input file cache: ttir.mlir file generated by the original Triton kernel
Output file cache: ttadapter.mlir file converted to adapt to Ascend
Compilation product cache: executable file generated after compilation

Naming conventions of cache files: Cache files are usually named using MD5 hash values to ensure that the same kernel code corresponds to the same cache file.

Recommendations for cache management:

Periodic clearing: Cache files may occupy a large amount of disk space. You can periodically clear the cache files.

rm -rf ~/.triton/cache

Disabling cache during debugging: You are advised to temporarily disable the cache to ensure that the compilation is performed each time when debugging compilation issues.

export TRITON_DISABLE_CACHE=1

Cache verification: If you suspect that the issue is caused by the cache, delete related cache files and perform the test again.

3.2 Dump Files

You can set the environment variable TRITON_DEBUG to 1 to dump intermediate representation files to disks during compilation. These files are key resources for debugging compilation issues.

Dump directory structure:

Default path: ~/.triton/dump/

Directory naming: A subdirectory named by a timestamp or unique ID is generated for each compilation session.

Main dump files:

kernel.ttir.mlir: Triton IR file (compilation input)
kernel.ttadapter.mlir: adapter IR file (conversion output)

Enabling debug dump: Even if the cache is enabled, the system still generates dump files (overriding files in the directory with the same name) each time the system runs as long as TRITON_DEBUG=1 is set. However, if the cache is hit and compilation is skipped, IR conversion may not be triggered. As a result, no new dump file is generated. Therefore, during debugging, you are advised to set as follows:

# Set environment variables before running the Triton program.
export TRITON_DEBUG=1
export TRITON_DISABLE_CACHE=1

# Run Triton kernel.
python your_triton_program.py

3.3 File Lifecycle Management

Understanding when these temporary files are generated and how they are cleared helps you better manage the debugging environment.

File generation time table

File Type	Generation Phase	Triggering Condition	Clearance Suggestion
Cache file	During each compilation	Generated when the cache is not hit	Periodic clearing or clearing during troubleshooting
Dump file	After TRITON_DEBUG=1 is set	Generated during each compilation	Manual clearing after debugging

In the production environment, debug dump should be disabled (that is, TRITON_DEBUG=1 is not set).
The cache mechanism can significantly improve performance and should not be disabled.

By properly using these temporary files, developers can efficiently locate and solve issues encountered during Triton-Ascend compilation.

3.4 IR File Parsing

The following uses the 01-vector-add.py test case as an example to describe the compilation process: This is a simple addition calculation of two tensors. For the calculation logic, see the comments in the sample case. You can enable the dump file output by setting TRITON_DEBUG=1 to obtain kernel.ttir.mlir and kernel.ttadapter.mlir.

Run the test case.

TRITON_DEBUG=1 python 01-vector-add.py

After the test case is executed, the dump file path is displayed. The default path is ~/.triton/dump. The following information is displayed:

Dumping intermediate results to ~/.triton/dump/xxx 
# xxx is a unique hash identifier.

Go to the dump path and view kernel.ttir.mlir and kernel.ttadapter.mlir.

3.4.1 Triton Intermediate Representation (TTIR)

TTIR example The kernel.ttir.mlir file is as follows:

module {
  tt.func public @add_kernel(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32} , %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32} , %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32} , %arg3: i32 {tt.divisibility = 16 : i32} ) attributes {noinline = false} {
    %cst = arith.constant dense<0.000000e+00> : tensor<1024xf32> loc(#loc1)
    %c1024_i32 = arith.constant 1024 : i32 loc(#loc1)
    %0 = tt.get_program_id x : i32 loc(#loc2)
    %1 = arith.muli %0, %c1024_i32 : i32 loc(#loc3)
    %2 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> loc(#loc4)
    %3 = tt.splat %1 : i32 -> tensor<1024xi32> loc(#loc5)
    %4 = arith.addi %3, %2 : tensor<1024xi32> loc(#loc5)
    %5 = tt.splat %arg3 : i32 -> tensor<1024xi32> loc(#loc6)
    %6 = arith.cmpi slt, %4, %5 : tensor<1024xi32> loc(#loc6)
    %7 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>> loc(#loc7)
    %8 = tt.addptr %7, %4 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32> loc(#loc7)
    %9 = tt.load %8, %6, %cst : tensor<1024x!tt.ptr<f32>> loc(#loc8)
    %10 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>> loc(#loc9)
    %11 = tt.addptr %10, %4 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32> loc(#loc9)
    %12 = tt.load %11, %6, %cst : tensor<1024x!tt.ptr<f32>> loc(#loc10)
    %13 = arith.addf %9, %12 : tensor<1024xf32> loc(#loc11)
    %14 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>> loc(#loc12)
    %15 = tt.addptr %14, %4 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32> loc(#loc12)
    tt.store %15, %13, %6 : tensor<1024x!tt.ptr<f32>> loc(#loc13)
    tt.return loc(#loc14)}}

TTIR analysis

TTIR is an intermediate representation generated by the frontend of the Triton compiler. It is expressed in the Multi-Level IR (MLIR) format and retains the semantic structure of the original Triton Python kernel. In kernel.ttir.mlir:

The @add_kernel function receives three pointer parameters (corresponding to the device memory addresses of input A, input B, and output C respectively) and an integer parameter n indicating the vector length.
Each triton program (vectorized execution unit) processes 1024 elements (represented by the %c1024_i32 constant), obtains the ID of the current block through the tt.get_program_id x, and calculates the global offset.
tt.make_range and tt.splat are used to construct a SIMD-style index tensor and they are used together with arith.addi to generate the global address offset processed by each thread.
tt.addptr and tt.load are used to implement vectorized loading, and the mask %6 (generated by arith.cmpi slt) is used to prevent out-of-bounds access.
The element-wise floating-point addition arith.addf is executed, and the result is returned to the global memory by using tt.store.

The TTIR layer is still based on the native abstraction (such as !tt.ptr<f32>, tt.load, and tt.store) of Triton and has not been mapped to the specific memory model or execution unit of the underlying hardware. It is a platform-independent high-level IR.

3.4.1 Target-Specific Adapter Representation (TTAdapter IR)

TTAdapter IR example The kernel.ttadapter.mlir file is as follows:

module {
  func.func @add_kernel(%arg0: memref<?xi8>, %arg1: memref<?xi8>, %arg2: memref<?xf32> {tt.divisibility = 16 : i32, tt.tensor_kind = 0 : i32}, %arg3: memref<?xf32> {tt.divisibility = 16 : i32, tt.tensor_kind = 0 : i32}, %arg4: memref<?xf32> {tt.divisibility = 16 : i32, tt.tensor_kind = 1 : i32}, %arg5: i32 {tt.divisibility = 16 : i32}, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: i32, %arg10: i32, %arg11: i32) attributes {SyncBlockLockArgIdx = 0 : i64, WorkspaceArgIdx = 1 : i64, global_kernel = "local", mix_mode = "aiv", parallel_mode = "simd"} {
    %cst = arith.constant 0.000000e+00 : f32
    %c1024 = arith.constant 1024 : index
    %c1024_i32 = arith.constant 1024 : i32
    %0 = arith.muli %arg9, %c1024_i32 : i32
    %1 = arith.index_cast %0 : i32 to index
    %reinterpret_cast = memref.reinterpret_cast %arg2 to offset: [%1], sizes: [1024], strides: [1] : memref<?xf32> to memref<1024xf32, strided<[1], offset: ?>>
    %alloc = memref.alloc() : memref<1024xf32>
    %2 = arith.addi %1, %c1024 : index
    %3 = arith.index_cast %arg5 : i32 to index
    %4 = arith.maxsi %1, %3 : index
    %5 = arith.minsi %2, %4 : index
    %6 = arith.subi %5, %1 : index
    %7 = arith.cmpi slt, %6, %c1024 : index
    scf.if %7 {
      linalg.fill ins(%cst : f32) outs(%alloc : memref<1024xf32>)
    } {hivm.unlikely_condition}
    %subview = memref.subview %reinterpret_cast[0] [%6] [1] : memref<1024xf32, strided<[1], offset: ?>> to memref<?xf32, strided<[1], offset: ?>>
    %subview_0 = memref.subview %alloc[0] [%6] [1] : memref<1024xf32> to memref<?xf32, strided<[1]>>
    memref.copy %subview, %subview_0 : memref<?xf32, strided<[1], offset: ?>> to memref<?xf32, strided<[1]>>
    %8 = bufferization.to_tensor %alloc restrict writable : memref<1024xf32>
    %reinterpret_cast_1 = memref.reinterpret_cast %arg3 to offset: [%1], sizes: [1024], strides: [1] : memref<?xf32> to memref<1024xf32, strided<[1], offset: ?>>
    %alloc_2 = memref.alloc() : memref<1024xf32>
    scf.if %7 {
      linalg.fill ins(%cst : f32) outs(%alloc_2 : memref<1024xf32>)
    } {hivm.unlikely_condition}
    %subview_3 = memref.subview %reinterpret_cast_1[0] [%6] [1] : memref<1024xf32, strided<[1], offset: ?>> to memref<?xf32, strided<[1], offset: ?>>
    %subview_4 = memref.subview %alloc_2[0] [%6] [1] : memref<1024xf32> to memref<?xf32, strided<[1]>>
    memref.copy %subview_3, %subview_4 : memref<?xf32, strided<[1], offset: ?>> to memref<?xf32, strided<[1]>>
    %9 = bufferization.to_tensor %alloc_2 restrict writable : memref<1024xf32>
    %10 = arith.addf %8, %9 : tensor<1024xf32>
    %reinterpret_cast_5 = memref.reinterpret_cast %arg4 to offset: [%1], sizes: [1024], strides: [1] : memref<?xf32> to memref<1024xf32, strided<[1], offset: ?>>
    %extracted_slice = tensor.extract_slice %10[0] [%6] [1] : tensor<1024xf32> to tensor<?xf32>
    %subview_6 = memref.subview %reinterpret_cast_5[0] [%6] [1] : memref<1024xf32, strided<[1], offset: ?>> to memref<?xf32, strided<[1], offset: ?>>
    bufferization.materialize_in_destination %extracted_slice in writable %subview_6 : (tensor<?xf32>, memref<?xf32, strided<[1], offset: ?>>) -> ()
    return
  }
}

TTAdapter IR parsing

TTIR is converted to TTAdapter IR to adapt to the Ascend NPU architecture in the Triton-Ascend compilation process. TTAdapter IR uses standard MLIR dialect (such as memref, linalg, and scf) and introduces NPU-specific constraints and optimization policies. In kernel.ttadapter.mlir:

The function signature has been converted from the Triton pointer type to memref<?xi8> or memref<?xf32> with attributes. tt.divisibility = 16 indicates the memory alignment requirement, and tt.tensor_kind distinguishes input (marked with 0) and output (marked with 1).
The global offset is reconstructed as a local view of a fixed size (1024) by using memref.reinterpret_cast for subsequent vectorization.
The boundary check logic is introduced to calculate the number of valid elements %6 and use scf.if to control whether to fill zeros (linalg.fill) at the end to ensure that the SIMD width is aligned and does not exceed the boundary.
memref.alloc is used to allocate a local buffer, memref.copy is used to securely copy the global memory data to the local host, and bufferization.to_tensor is used to convert the data into tensors for operators.
The addition operation is performed by arith.addf on the tensor. The valid part of the result is truncated by tensor.extract_slice and written back to the target memref by bufferization.materialize_in_destination.

TTAdapter IR has been abstracted from Triton to adapt to the Ascend NPU format.

4 Interpreter Mode

The core value of the interpreter is to isolate hardware differences. You can set the environment variable TRITON_INTERPRET to 1 to forcibly execute kernel computation on the CPU. The result of the kernel computation can be used as the benchmark for determining the NPU computation accuracy.

Usage:

Set the environment variable TRITON_INTERPRET to 1 and run the program so that the Triton kernel is executed on the CPU interpreter.
Insert a Python breakpoint at the position to be checked in the Triton kernel source code.
```
breakpoint()  # Python built-in breakpoint function
```
The program execution is paused and you enter the Python debugger (Pdb). You can print and check the value of any intermediate variable.
```
(Pdb) p tmp0  # Print the value of variable tmp0.
```

Note: The interpreter mode performs all computations on the CPU, which significantly reduces the running efficiency. Therefore, after debugging or verification, you must cancel the setting of the environment variable TRITON_INTERPRET or explicitly set it to 0 to ensure that the system performance is not affected.

# Cancel the environment variable.
unset TRITON_INTERPRET

# Explicitly set it to 0.
export TRITON_INTERPRET=0

5 Debugging Methods

5.1 Debugging by Printing

5.1.1 Static Printing Debugging

This method uses tl.static_print to print the value of a constant expression during compilation. It is applicable to debugging configuration parameters and constants that are known during compilation.

Setting the environment variable TRITON_DEVICE_PRINT to 1 can enable the tl.static_print function. This function allows constant values to be printed during kernel compilation. It is an effective method for verifying configuration parameters and constant expressions.

Features:

tl.static_print is executed during compilation, not during runtime.
Only compilation constants (tl.constexpr parameters and constant expressions) can be printed.
The output is displayed in the standard output of the compiler.

Usage:

1.In the Triton kernel, add the tl.static_print statement for the constant parameters to be debugged.

import triton.language as tl

@triton.jit
def triton_kernel(
    out_ptr0, 
    in_ptr0, 
    in_ptr1, 
    XBLOCK: tl.constexpr,  # Constant parameter during compilation
    USE_FP16: tl.constexpr  # Constant parameter during compilation
):
    # Print constant parameters during compilation.
    tl.static_print("XBLOCK = ", XBLOCK)
    tl.static_print("USE_FP16 = ", USE_FP16)
    
    idx = tl.arange(0, XBLOCK)
    tmp0 = tl.load(in_ptr0 + idx)
    tmp1 = tl.load(in_ptr1 + idx)
    
    # Print the constant calculation result.
    elements_per_thread = XBLOCK // 32
    tl.static_print("Elements per thread = ", elements_per_thread)
    
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + idx, tmp2)

2.Set the environment variable and run the program for compilation.

# Enable Triton debugging output (including static_print).
export TRITON_DEVICE_PRINT=1

# Run the Python program. The output is displayed in the compilation phase.
python your_program.py

5.1.2 Runtime Debugging

You can use tl.device_print to flexibly print the values of the variables to be observed. Setting the environment variable TRITON_DEVICE_PRINT to 1 can enable the tl.device_print function. This function allows tensor values to be printed in the kernel. It is an efficient method for verifying the computation accuracy by phase.

Usage:

1.In the Triton kernel, add the tl.device_print statement for the variables to be printed.

import triton.language as tl

@triton.jit
def triton_kernel(out_ptr0, in_ptr0, in_ptr1, XBLOCK: tl.constexpr):
    idx = tl.arange(0, XBLOCK)
    tmp0 = tl.load(in_ptr0 + idx)
    tmp1 = tl.load(in_ptr1 + idx)
    tmp2 = tmp0 + tmp1
    tl.device_print("tmp2 after addition = ", tmp2)  # Print the intermediate result.
    tl.store(out_ptr0 + idx, tmp2)

2.Set the environment variable TRITON_DEVICE_PRINT to 1 and run the program. The window displays the value of the variable.

# Enable Triton debugging output (including device_print).
export TRITON_DEVICE_PRINT=1

# Run the Python program. The output is displayed in the compilation phase.
python your_program.py

Note: The print length is limited. The length of the tensor printed by tl.device_print is limited. When the tensor length exceeds a certain threshold, the output is truncated.

5.1.3 Comparing the Two Printing Methods

Feature	`tl.device_print`	`tl.static_print`
Execution time	Runtime (kernel execution)	Compilation (kernel compilation)
Output location	Runtime standard output	Compiler standard output
Print content	Runtime tensor values and variables	Compilation constants and constant expressions
Impact on performance	There is runtime overhead.	No runtime overhead.
Enabling environment variables	`TRITON_DEVICE_PRINT=1`	`TRITON_DEVICE_PRINT=1`

Description of environment variables:

TRITON_DEVICE_PRINT=1: enables runtime printing and compilation printing.

TRITON_DEBUG=1: enables all debugging outputs (including compilation and runtime printing).

5.2 Compilation Error Debugging

When the ttir.mlir → ttadapter.mlir conversion fails, the ttadapter.mlir cannot be generated and the MLIRCompileError error is reported. You need to locate the fault at the Triton-Ascend code layer. Triton-Ascend contains the Python and C++ code layers. You need to locate the error code segment based on the call stack information in the error log and use the corresponding debugging method.

5.2.1 Debugging Python Code

When the call stack information shows that the error is caused by the Python layer code of Triton-Ascend, you can use the built-in debugger pdb of Python for interactive debugging. As an effective tool for locating Python code logic errors, pdb allows you to set breakpoints, perform step-by-step execution, and check variable status.

Procedure:

Locating faults In the error log, find the Python call stack information closest to the user code, which is usually near the top of the stack. For example:

File "/path/to/triton/ascend/compiler.py", line 123, in compile_fn
    result = lower_function(...)

Inserting a debugging breakpoint Insert a pdb breakpoint in the Python source file that is suspected to be faulty.

def compile_fn(ttir):
    import pdb; pdb.set_trace()  # Compatible with all Python versions

Example: Assume that a breakpoint is set in line 123 of compiler.py. After the program is suspended, the following information is displayed:

python
(Pdb) l  # View the current code context.
118     def compile_fn(ttir):
120         import pdb; pdb.set_trace()
121         # Check the input parameter.
122         print(f"ttir type: {type(ttir)}")
123         result = lower_function(ttir)  # <-- The current suspension position.

(Pdb) p ttir  # Check the input parameter.
(Pdb) n # Execute the next line of code.
(Pdb) p result  # View the result.

5.2.2 Debugging Environment Variables

When developing or debugging Triton operators, you can set the following environment variables to enable IR printing in different phases, which helps locate faults. The following describes the two key debugging switches.

5.2.2.1 `MLIR_ENABLE_DUMP=1`

Function: Enables automatic dump of the MLIR high-level IR and outputs the IR of the current function in readable text to stderr before and after each MLIR pass is executed.

Feature: Small log size: usually dozens to hundreds of lines, which are easy to read. Focus on high-level logic: applicable to debugging operator conversion, memory layout, and parallel policies.

Suggestion: First choice for routine debugging: This log can be used to locate 90% of Triton operator issues. It can be used together with TRITON_DEBUG=1 to further enhance information.

Enabling method:

export MLIR_ENABLE_DUMP=1
export TRITON_DEBUG=1
python your_triton_script.py

5.2.2.2 `TRITON_ENABLE_LLVM_DEBUG=1`

Function: Enables full debugging logs in the LLVM backend CodeGen phase, including instruction selection, register allocation, instruction scheduling, and machine code generation.

Feature: Large log size: A single kernel can generate tens of thousands of lines of output. Bottom-layer details: Register name, physical/virtual register mapping, and stack frame layout are included. Only for LLVM experts: For common Triton developers, this is considered "noise."

Suggestion: Enable this function only when LLVM backend bugs are suspected (for example, invalid instructions are generated or performance exceptions occur). It can be used together with LLVM_DEBUG_ONLY to limit the output scope.

When TRITON_ENABLE_LLVM_DEBUG=1 is enabled, you can use the LLVM_DEBUG_ONLY environment variable to specify the module for which the logs will be output. The following is a brief description of the common DEBUG_TYPE:

## `isel` (Instruction Selection)
- **Function**: Converts LLVM IR instructions into machine instructions (MachineInstr) of the target architecture.
- **Debugging content**: Displays the mapping process and pattern matching result between IR and machine instructions.
- **Application scenario**: The instruction selection is suspected to be incorrect (for example, invalid instructions or inefficient instruction sequences are generated).

## `regalloc` (Register Allocation)
- **Function**: Allocates physical registers to virtual registers and processes spilling.
- **Debugging content**: Status before and after register allocation, conflict graph, and active interval analysis.
- **Application scenario**: The register pressure is high, the performance deteriorates, or unexpected memory access occurs.

## `spiller` (Spiller)
- **Function**: Spills some values to the stack memory when registers are insufficient.
- **Debugging content**: Which virtual registers are spilled and the positions of inserted load/store instructions.
- **Application scenario**: The performance deteriorates due to frequent memory access, and register usage needs to be optimized.

## `peephole` (Peephole Optimizer)
- **Function**: Performs partial optimization (such as constant folding and redundant instruction elimination) at the machine code layer.
- **Debugging content**: Comparison of instructions before and after optimization.
- **Application scenario**: The generated code is redundant, but high-level optimization is not overriding.

## `asm-printer` (Assembly Printer)
- **Function**: Converts MachineInstr into the final assembly text (such as PTX, AMDGCN, and CCE).
- **Debugging content**: Generated assembly code, symbol references, and instruction encoding.
- **Application scenario**: Assembly syntax errors, tag mismatch, or viewing final output.

Enabling method: In the following example, it is specified that only isel is output.

export TRITON_ENABLE_LLVM_DEBUG=1
export LLVM_DEBUG_ONLY="isel"
python your_triton_script.py

Recommended debugging process: Enable MLIR_ENABLE_DUMP=1 first. → Check whether the conversion at the MLIR layer is correct (for example, ReduceOp → scf.for). If the MLIR is normal but the result is incorrect: → It is suspected that the LLVM is faulty. Enable TRITON_ENABLE_LLVM_DEBUG=1 + LLVM_DEBUG_ONLY. Do not directly enable TRITON_ENABLE_LLVM_DEBUG=1. → Large log size may mask key information and severely affect the running speed.

Appendix A: Quick Reference Table for Common Environment Variables

Variable	Description
`TRITON_DEBUG=1`	Enables intermediate IR dump.
`TRITON_DISABLE_CACHE=1`	Disables compilation cache.
`TRITON_INTERPRET=1`	Uses the CPU interpreter to execute the kernel.
`TRITON_DEVICE_PRINT=1`	Enables runtime print output and compilation print output.
`MLIR_ENABLE_DUMP=1`	Enables automatic dump of the MLIR high-level IR. Outputs the IR of the current function in readable text before and after each MLIR pass is executed.
`TRITON_ENABLE_LLVM_DEBUG=1`	Enables full debugging logs in the LLVM backend CodeGen phase, including instruction selection, register allocation, instruction scheduling, and machine code generation.