Environment Variables and Compiler Options

This document summarizes Triton-Ascend behavior controls that developers can set explicitly, including environment variables configured before running a program and NPU compiler options passed through triton.Config or kernel launch meta-parameters.

Environment Variables

Environment Variable Usage Example

Set environment variables before starting the Python program. Example:

export TRITON_DEBUG=1
python run_kernel.py

Environment Variable Reference Table

The following table describes how to set environment variables.

Category Environment Variable Default Value Function Description Setting Description Change Description
Debugging and logging TRITON_DEBUG 0 or not set Specifies whether to enable the debugging output function of Triton to print detailed debugging information during running. This is useful for troubleshooting problems in the compilation or execution phase. When this parameter is set to 1, Triton outputs more information about the compilation, kernel generation, and execution. Some implementations may support more fine-grained debugging levels (such as 2 and 3), depending on the Triton version and implementation. 0: The debugging is disabled.
1: The debugging is enabled.
Debugging and logging MLIR_ENABLE_DUMP 0 or not set Specifies whether to dump the intermediate representation (IR) of all kernels before each MLIR optimization. You can set MLIR_ENABLE_DUMP to kernelName to dump the IR of a specific kernel. 0: Do not dump.
1: Dump the IR of all kernels.
kernelName: Dump the IR of a specific kernel.
The Triton cache may interfere with the dump. If MLIR_ENABLE_DUMP=1 does not take effect, you can run rm -r ~/.triton/cache to clear the Triton cache.
Debugging and logging LLVM_IR_ENABLE_DUMP 0 or not set Specifies whether to dump the IR before each LLVM IR optimization. 0: Do not dump.
1: Dump IRs.
Debugging and logging TRITON_REPRODUCER_PATH Not set Generates the MLIR reproduction file before each MLIR compilation phase. If a phase fails, <reproducer_path> saves the MLIR status before the failure. <reproducer_path>: save path.
Debugging and logging TRITON_INTERPRET 0 or not set Specifies whether to use the Triton interpreter instead of the GPU for running and support inserting Python breakpoints in kernel function code. 0: Breakpoints are not supported.
1: Breakpoints are supported.
Debugging and logging TRITON_ENABLE_LLVM_DEBUG 0 or not set Specifies whether to pass the-debug parameter to LLVM and outputs a large amount of debugging information. If there is too much information, you can use TRITON_LLVM_DEBUG_ONLY to limit the output scope. 0: Pass.
1: Do not pass.
Another method to reduce output interference is as follows: Set the running program by setting LLVM_IR_ENABLE_DUMP to 1, extract the IR before the target LLVM optimization channel, and run the opt tool of the LLVM separately. In this case, you can add -debug-only=foo to the command line to limit the debugging range.
Debugging and logging TRITON_LLVM_DEBUG_ONLY Not set Equivalent to the -debug-only command line option of LLVM. This parameter can be used to limit the LLVM debugging output to a specific optimization channel or component name (defined by the #define DEBUG_TYPE macro in LLVM and Triton), thereby effectively reducing redundant debugging output. You can specify one or more comma-separated values, for example, TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions" or TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions,regalloc". Comma-separated values: channel or component name
Debugging and logging USE_IR_LOC 0 or not set Specifies whether to include location information (such as file names and line numbers) in the generated IR. This information is helpful for debugging, but may increase the size of the generated IR. If this parameter is set to 1, the IR is re-parsed, and the location information is mapped to the line number of the IR file with a specific extension (not the line number of the Python source file). This enables a direct mapping from the IR to the LLVM IR/PTX. When used with the performance analysis tool, this parameter can be used to implement fine-grained performance analysis on IR instructions. 0: No location information is included.
1: The location information is included.
Debugging and logging TRITON_PRINT_AUTOTUNING 0 or not set After the automatic optimization is complete, the optimal configuration and total time of each kernel are output. 0: Do not output.
1: Output.
Debugging and logging MLIR_ENABLE_REMARK 0 or not set Specifies whether to enable the output of remarks during MLIR compilation, including performance warnings in remarks. 0: Disabled.
1: Enabled.
Debugging and logging TRITON_KERNEL_DUMP 0 or not set Specifies whether to enable the dump function of the Triton kernel. When this function is enabled, Triton saves the generated kernel code (IR and final PTX in each compilation phase) to the specified directory. 0: Disabled.
1: Enabled.
Debugging and logging TRITON_DUMP_DIR Current working directory or not set Specifies the directory for storing the Triton kernel dump file, which is the directory for saving the IR and PTX when TRITON_KERNEL_DUMP is set to 1. "path": save path.
Debugging and logging TRITON_DEVICE_PRINT 0 or not set If this parameter is set to 1 or true (TRUE is converted to true), the function of tl.device_print is enabled. Note: This function uses the GM buffer (the pointer of which is passed to the kernel). 0: Disabled.
1: The functionality of tl.device_print is enabled.
The maximum size of the GM buffer for each thread is 16 KB. If the buffer size exceeds 16 KB, the excess content will be discarded. The value is fixed currently and will be adjusted through an environment variable.
Compilation control TRITON_ALWAYS_COMPILE 0 or not set Specifies whether Triton forcibly recompiles the kernel each time it runs, instead of using the existing cached version. By default, Triton caches the compiled kernels (based on parameters and configurations) to improve performance. If this parameter is set to 1, Triton ignores the cache and recompiles the kernel each time it runs, which is useful for debugging or testing new compiler features. 0: Disabled.
1: All kernels are recompiled during each running.
Compilation control DISABLE_LLVM_OPT 0 or not set If this parameter is set to 1, the optimization steps (LLVM optimization of make_llir and make_ptx) during LLVM compilation can be disabled. If this parameter is set to a character string, the LLVM optimization flags to be disabled are parsed. For example, if DISABLE_LLVM_OPT is set to "disable-lsr", the loop strength optimization is disabled (this optimization may cause a performance fluctuation of up to 10% in some kernels with register pressure). 0: The LLVM optimization is enabled.
1: The optimization steps (LLVM optimization of make_llir and make_ptx) during LLVM compilation are disabled.
Compilation control MLIR_ENABLE_TIMING 0 or not set Specifies whether to enable the time statistics function during MLIR compilation. 0: Disabled.
1: Enabled.
Compilation control LLVM_ENABLE_TIMING 0 or not set Specifies whether to enable the time statistics function during LLVM compilation. 0: Disabled.
1: Enabled.
Compilation control TRITON_DEFAULT_FP_FUSION 1 (enabled) Specifies whether to enable the floating-point operation fusion optimization by default. The default floating-point operation fusion behavior (for example, mul+add->fma) is overwritten. 0: Disabled.
1: Enabled.
Compilation control TRITON_KERNEL_OVERRIDE 0 or not set Specifies whether to enable the Triton kernel override function. You can use the user-specified external file (such as IR/PTX) to override the default generated kernel code at the beginning of each compilation phase. 0: Disabled.
1: Enabled.
Compilation control TRITON_OVERRIDE_DIR Current working directory or not set Specifies the directory for searching the Triton kernel override file. Directory for loading the IR/PTX file when TRITON_KERNEL_OVERRIDE is set to 1. "path": save path.
Compilation control TRITON_ASCEND_COMPILE_SPEED_OPT 0 or not set Specifies whether the JIT compiler skips the subsequent compilation phase after detecting that the kernel compilation fails. Set the parameter to 1 to skip the attempt. (The default value 0 indicates that the attempt is continued.) 0: Continue the attempt.
1: Skip.
Compilation control TRITON_COMPILE_ONLY 0 or not set Specifies whether to perform only compilation without execution. This parameter is used when remote_launch is used. 0: Disabled.
1: Enabled.
Compilation control TRITON_DISABLE_FFTS 0 or not set Specifies whether to disable FFTS. 0: Enabled.
1: Disabled.
Running and scheduling TRITON_ALL_BLOCKS_PARALLEL 0 or not set Specifies whether to enable the automatic optimization of the number of logical cores based on the number of physical cores. This parameter can be enabled only when logical cores can execute in parallel. When the number of logical cores is greater than the number of physical cores, enabling this parameter will instruct the compiler to automatically adjust the number of logical cores to match the number of physical cores, thereby reducing scheduling overhead. After this parameter is enabled, the value of grid can be greater than 65535. Limitation: This option can be enabled only when the logic of the Triton kernel is insensitive to the execution sequence. Otherwise, a deadlock may occur. The per-kernel option enable_auto_blockify (see architecture_difference.md) takes precedence over this env var when set; the env var only acts as the default for kernels that leave enable_auto_blockify unset. 0: Disabled.
1: Enabled.
Running and scheduling TRITON_ENABLE_TASKQUEUE 0 or not set Specifies whether to enable task_queue. 0: Disabled.
1: Enabled.
Running and scheduling TRITON_ENABLE_SANITIZER 0 or not set Specifies whether to enable SANITIZER. 0: Disabled.
1: Enabled.
Running and scheduling ENABLE_PRINT_UB_BITS 0 or not set After this parameter is enabled, the current UB usage can be obtained for the inductor. 0: Disabled.
1: Enabled.
Others TRITON_BENCH_METHOD Not set When the Ascend NPU is used, change do_bench in testing.py to do_bench_npu. (This parameter is used when INDUCTOR_ASCEND_AGGRESSIVE_AUTOTUNE is set to 1.) If this parameter is set to default, the original do_bench function is still called even if the NPU is available. "npu": Switch to do_bench_npu.
Others TRITON_REMOTE_RUN_CONFIG_PATH path Specifies the configuration path for remote running. Specify the path directly.

Compiler Options

Compiler options control the compilation strategy for a single Triton kernel and can be passed through triton.Config, autotune parameters, or kernel launch meta-parameters.

Compiler Option Usage Example

For example, pass multibuffer directly during kernel launch:

import triton
import triton.language as tl

@triton.jit
def kernel(..., BLOCK_SIZE: tl.constexpr):
    ...

grid = (triton.cdiv(n_elements, 1024),)
kernel[grid](..., BLOCK_SIZE=1024, multibuffer=True)

Compiler Option Reference Table

The following table describes the options.

Category Compiler Option Default/Values Function Description Setting Description
General pipeline multibuffer True, False; disabled by default in 910_95 compilation scenarios Enables or disables ping-pong/double-buffer pipelines. triton.Config or launch meta-parameter
CV fusion enable_auto_bind_sub_block None, True, False Enables or disables automatic sub-block binding. triton.Config or launch meta-parameter
CV fusion enable_hivm_auto_cv_balance None, True, False Enables or disables automatic CV balance. triton.Config or autotune parameter
CV fusion/sync sync_solver None, True, False Enables or disables the HIVM synchronization solver. triton.Config or launch meta-parameter
Synchronization unit_flag None, True, False Cube-output synchronization option. triton.Config or autotune parameter
Synchronization inject_barrier_all None, True, False Enables or disables automatic barrier synchronization injection. triton.Config or launch meta-parameter
Synchronization inject_block_all None, True, False Enables or disables automatic block synchronization injection. triton.Config or launch meta-parameter
Multibuffer scope limit_auto_multi_buffer_only_for_local_buffer None, True, False Restricts automatic multi-buffering to local buffers. triton.Config or autotune parameter
Multibuffer scope limit_auto_multi_buffer_of_local_buffer None, "no-limit", "no-l0c" Configures the local-buffer automatic multi-buffering scope. triton.Config or autotune parameter
Workspace set_workspace_multibuffer None, 2, 4 Configures workspace multi-buffering. triton.Config or autotune parameter
CV fusion tiling tile_mix_vector_loop None, 2, 4, 8 Configures the Vector loop split count. triton.Config or autotune parameter
CV fusion tiling tile_mix_cube_loop None, 2, 4, 8 Configures the Cube loop split count. triton.Config or autotune parameter
CV fusion/sync disable_auto_inject_block_sync None, True, False Enables or disables automatic block sync injection. triton.Config or launch meta-parameter
Runtime stream stream None or NPU stream identifier Specifies the NPU stream. launch meta-parameter
Compiler pass enable_linearize Version-dependent Enables or disables the linearization pass. triton.Config or launch meta-parameter
CV fusion/layout enable_nd2nz_on_vector Default False Enables or disables ND-to-NZ layout transformation on the Vector path. triton.Config or launch meta-parameter
Large-grid optimization auto_blockify_size Default 1 Enables or disables AutoBlockify pass. Ignored when TRITON_ALL_BLOCKS_PARALLEL is not set. launch meta-parameter or triton.Config