
Environment Variables and Compiler Options

This document summarizes Triton-Ascend behavior controls that developers can set explicitly, including environment variables configured before running a program and NPU compiler options passed through triton.Config or kernel launch meta-parameters.

Environment Variables

Environment Variable Usage Example

Set environment variables before starting the Python program. Example:

export TRITON_DEBUG=1
python run_kernel.py
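
The same setting can be applied from a Python wrapper that launches the program. The following is a minimal sketch, not part of Triton-Ascend: `run_with_env` is a hypothetical helper, and the child process here only echoes the variable instead of running `run_kernel.py`.

```python
import os
import subprocess
import sys


def run_with_env(python_args, overrides):
    """Run a Python child process with extra environment variables set."""
    env = os.environ.copy()
    # Environment values must be strings.
    env.update({name: str(value) for name, value in overrides.items()})
    return subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for `python run_kernel.py`: the child only echoes the variable.
result = run_with_env(
    ["-c", "import os; print(os.environ['TRITON_DEBUG'])"],
    {"TRITON_DEBUG": 1},
)
print(result.stdout.strip())
```

Passing the variables through `env=` means they are already in place when the child interpreter starts, which matches the `export` form above.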

Environment Variable Reference Table

The following table lists the supported environment variables with their default values and accepted settings.

Category	Environment Variable	Default Value	Function Description	Setting Description	Notes
Debugging and logging	TRITON_DEBUG	0 or not set	Specifies whether to enable Triton debug output, which prints detailed information at run time. This is useful for troubleshooting problems in the compilation or execution phase. When set to 1, Triton outputs more information about compilation, kernel generation, and execution. Some implementations may support finer-grained debugging levels (such as 2 and 3), depending on the Triton version and implementation.	0: Debugging is disabled. 1: Debugging is enabled.
Debugging and logging	MLIR_ENABLE_DUMP	0 or not set	Specifies whether to dump the intermediate representation (IR) of all kernels before each MLIR optimization pass. Set `MLIR_ENABLE_DUMP` to `kernelName` to dump the IR of a specific kernel only.	0: Do not dump. 1: Dump the IR of all kernels. kernelName: Dump the IR of a specific kernel.	The Triton cache may interfere with the dump. If `MLIR_ENABLE_DUMP=1` does not take effect, run `rm -r ~/.triton/cache` to clear the Triton cache.
Debugging and logging	LLVM_IR_ENABLE_DUMP	0 or not set	Specifies whether to dump the IR before each LLVM IR optimization pass.	0: Do not dump. 1: Dump the IR.
Debugging and logging	TRITON_REPRODUCER_PATH	Not set	Generates an MLIR reproducer file before each MLIR compilation phase. If a phase fails, `<reproducer_path>` holds the MLIR state from just before the failure.	`<reproducer_path>`: save path.
Debugging and logging	TRITON_INTERPRET	0 or not set	Specifies whether to run kernels with the Triton interpreter instead of on the GPU, which allows Python breakpoints to be inserted in kernel function code.	0: The interpreter is not used; breakpoints are not supported. 1: The interpreter is used; breakpoints are supported.
Debugging and logging	TRITON_ENABLE_LLVM_DEBUG	0 or not set	Specifies whether to pass the `-debug` parameter to LLVM, which makes it output a large amount of debugging information. If there is too much information, use `TRITON_LLVM_DEBUG_ONLY` to limit the output scope.	0: Do not pass. 1: Pass.	Another way to reduce output noise: run the program with `LLVM_IR_ENABLE_DUMP` set to `1`, extract the IR that precedes the LLVM optimization pass of interest, and run LLVM's `opt` tool on it separately. In that case, add `-debug-only=foo` to the command line to limit the debugging scope.
Debugging and logging	TRITON_LLVM_DEBUG_ONLY	Not set	Equivalent to the `-debug-only` command-line option of LLVM. Limits the LLVM debugging output to specific optimization passes or components (named by the `#define DEBUG_TYPE` macro in LLVM and Triton), which effectively reduces redundant debugging output. You can specify one or more comma-separated values, for example, `TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions"` or `TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions,regalloc"`.	Comma-separated pass or component names.
Debugging and logging	USE_IR_LOC	0 or not set	Specifies whether to include location information (such as file names and line numbers) in the generated IR. This information is helpful for debugging, but may increase the size of the generated IR. If this parameter is set to 1, the IR is re-parsed, and the location information is mapped to the line number of the IR file with a specific extension (not the line number of the Python source file). This enables a direct mapping from the IR to the LLVM IR/PTX. When used with the performance analysis tool, this parameter can be used to implement fine-grained performance analysis on IR instructions.	0: No location information is included. 1: The location information is included.
Debugging and logging	TRITON_PRINT_AUTOTUNING	0 or not set	After autotuning completes, prints the best configuration and the total time spent for each kernel.	0: Do not output. 1: Output.
Debugging and logging	MLIR_ENABLE_REMARK	0 or not set	Specifies whether to enable the output of remarks during MLIR compilation, including performance warnings in remarks.	0: Disabled. 1: Enabled.
Debugging and logging	TRITON_KERNEL_DUMP	0 or not set	Specifies whether to enable the dump function of the Triton kernel. When this function is enabled, Triton saves the generated kernel code (IR and final PTX in each compilation phase) to the specified directory.	0: Disabled. 1: Enabled.
Debugging and logging	TRITON_DUMP_DIR	Current working directory or not set	Specifies the directory for storing the Triton kernel dump file, which is the directory for saving the IR and PTX when `TRITON_KERNEL_DUMP` is set to `1`.	"path": save path.
Debugging and logging	TRITON_DEVICE_PRINT	0 or not set	If set to `1` or `true` (the value is lowercased first, so `TRUE` also works), enables `tl.device_print`. Note: This function uses the GM buffer (the pointer of which is passed to the kernel).	0: Disabled. 1: `tl.device_print` is enabled.	The GM buffer for each thread is limited to 16 KB; output beyond 16 KB is discarded. The size is currently fixed and is planned to become adjustable through an environment variable.
Compilation control	TRITON_ALWAYS_COMPILE	0 or not set	Specifies whether Triton forcibly recompiles the kernel each time it runs, instead of using the existing cached version. By default, Triton caches the compiled kernels (based on parameters and configurations) to improve performance. If this parameter is set to 1, Triton ignores the cache and recompiles the kernel each time it runs, which is useful for debugging or testing new compiler features.	0: Disabled. 1: All kernels are recompiled on every run.
Compilation control	DISABLE_LLVM_OPT	0 or not set	If set to 1, disables the optimization steps during LLVM compilation (the LLVM optimization in make_llir and make_ptx). If set to a string, the string is parsed as the list of LLVM optimization flags to disable. For example, `DISABLE_LLVM_OPT="disable-lsr"` disables loop strength reduction (this optimization may cause a performance fluctuation of up to 10% in some kernels with register pressure).	0: LLVM optimization is enabled. 1: The optimization steps (LLVM optimization of make_llir and make_ptx) during LLVM compilation are disabled. String: The listed LLVM optimization flags are disabled.
Compilation control	MLIR_ENABLE_TIMING	0 or not set	Specifies whether to enable the time statistics function during MLIR compilation.	0: Disabled. 1: Enabled.
Compilation control	LLVM_ENABLE_TIMING	0 or not set	Specifies whether to enable the time statistics function during LLVM compilation.	0: Disabled. 1: Enabled.
Compilation control	TRITON_DEFAULT_FP_FUSION	1 (enabled)	Specifies whether floating-point operation fusion is enabled by default. Setting this variable overrides the default floating-point operation fusion behavior (for example, mul+add->fma).	0: Disabled. 1: Enabled.
Compilation control	TRITON_KERNEL_OVERRIDE	0 or not set	Specifies whether to enable the Triton kernel override function. You can use the user-specified external file (such as IR/PTX) to override the default generated kernel code at the beginning of each compilation phase.	0: Disabled. 1: Enabled.
Compilation control	TRITON_OVERRIDE_DIR	Current working directory or not set	Specifies the directory that is searched for Triton kernel override files, that is, the directory from which the IR/PTX files are loaded when `TRITON_KERNEL_OVERRIDE` is set to `1`.	"path": load path.
Compilation control	TRITON_ASCEND_COMPILE_SPEED_OPT	0 or not set	Specifies whether the JIT compiler skips the subsequent compilation phase after detecting that the kernel compilation fails. Set the parameter to `1` to skip the attempt. (The default value `0` indicates that the attempt is continued.)	0: Continue the attempt. 1: Skip.
Compilation control	TRITON_COMPILE_ONLY	0 or not set	Specifies whether to perform only compilation without execution. This parameter is used when remote_launch is used.	0: Disabled. 1: Enabled.
Compilation control	TRITON_DISABLE_FFTS	0 or not set	Specifies whether to disable FFTS.	0: Enabled. 1: Disabled.
Running and scheduling	TRITON_ALL_BLOCKS_PARALLEL	0 or not set	Specifies whether to enable the automatic optimization of the number of logical cores based on the number of physical cores. This parameter can be enabled only when logical cores can execute in parallel. When the number of logical cores is greater than the number of physical cores, enabling this parameter will instruct the compiler to automatically adjust the number of logical cores to match the number of physical cores, thereby reducing scheduling overhead. After this parameter is enabled, the value of grid can be greater than 65535. Limitation: This option can be enabled only when the logic of the Triton kernel is insensitive to the execution sequence. Otherwise, a deadlock may occur. The per-kernel option `enable_auto_blockify` (see `architecture_difference.md`) takes precedence over this env var when set; the env var only acts as the default for kernels that leave `enable_auto_blockify` unset.	0: Disabled. 1: Enabled.
Running and scheduling	TRITON_ENABLE_TASKQUEUE	0 or not set	Specifies whether to enable task_queue.	0: Disabled. 1: Enabled.
Running and scheduling	TRITON_ENABLE_SANITIZER	0 or not set	Specifies whether to enable SANITIZER.	0: Disabled. 1: Enabled.
Running and scheduling	ENABLE_PRINT_UB_BITS	0 or not set	After this parameter is enabled, the current UB usage can be obtained for the inductor.	0: Disabled. 1: Enabled.
Others	TRITON_BENCH_METHOD	Not set	Selects the benchmarking function used in `testing.py`. On an Ascend NPU, `npu` switches `do_bench` to `do_bench_npu`. (This parameter is used when `INDUCTOR_ASCEND_AGGRESSIVE_AUTOTUNE` is set to `1`.) If this parameter is set to `default`, the original `do_bench` function is still called even if the NPU is available.	"npu": Switch to `do_bench_npu`. "default": Keep the original `do_bench`.
Others	TRITON_REMOTE_RUN_CONFIG_PATH	path	Specifies the configuration path for remote running.	Specify the path directly.
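
Several of the variables above are typically combined for one debugging session. The sketch below builds such combinations as plain dictionaries. `dump_env` and `llvm_debug_env` are hypothetical helpers, not part of Triton-Ascend. Two things are assumptions rather than documented behavior: that forcing recompilation with `TRITON_ALWAYS_COMPILE` avoids the cache interference noted for `MLIR_ENABLE_DUMP`, and that `TRITON_ENABLE_LLVM_DEBUG` and `TRITON_LLVM_DEBUG_ONLY` can be set together.

```python
def dump_env(kernel_name=None, dump_dir=None):
    """Environment overrides for dumping MLIR, per the table above."""
    env = {
        # "1" dumps all kernels; a kernel name restricts the dump to it.
        "MLIR_ENABLE_DUMP": kernel_name or "1",
        # Assumption: recompiling on every run keeps cached kernels
        # from suppressing the dump.
        "TRITON_ALWAYS_COMPILE": "1",
    }
    if dump_dir is not None:
        # Also save the generated kernel code to the given directory.
        env["TRITON_KERNEL_DUMP"] = "1"
        env["TRITON_DUMP_DIR"] = str(dump_dir)
    return env


def llvm_debug_env(debug_types=()):
    """Environment overrides for LLVM debug output, optionally filtered."""
    env = {"TRITON_ENABLE_LLVM_DEBUG": "1"}
    if debug_types:
        # The variable takes comma-separated DEBUG_TYPE names.
        env["TRITON_LLVM_DEBUG_ONLY"] = ",".join(debug_types)
    return env


print(dump_env(kernel_name="add_kernel", dump_dir="/tmp/triton_dump"))
print(llvm_debug_env(["tritongpu-remove-layout-conversions", "regalloc"]))
```

The resulting dictionaries can be merged into the environment of the process that runs the kernel, before that process starts.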

Compiler Options

Compiler options control the compilation strategy for a single Triton kernel and can be passed through triton.Config, autotune parameters, or kernel launch meta-parameters.

Compiler Option Usage Example

For example, pass multibuffer directly during kernel launch:

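import torch
import torch_npu  # assumed: registers the "npu" device with PyTorch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program copies one BLOCK_SIZE-wide slice; the mask guards the tail.
    offsets = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(dst_ptr + offsets, tl.load(src_ptr + offsets, mask=mask), mask=mask)

src = torch.rand(4096, device="npu")
dst = torch.empty_like(src)
grid = (triton.cdiv(src.numel(), 1024),)
# multibuffer is a compiler option, not a kernel argument.
copy_kernel[grid](src, dst, src.numel(), BLOCK_SIZE=1024, multibuffer=True)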
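
When the same options are explored through autotuning, each candidate is a dictionary of meta-parameters and compiler options. The sketch below enumerates such candidates without importing Triton, so it runs anywhere. `option_grid` is a hypothetical helper. It is an assumption that compiler options belong in the same keyword dictionary that `triton.Config` receives for meta-parameters.

```python
from itertools import product


def option_grid(**choices):
    """Yield one dict per combination of the given option values."""
    names = list(choices)
    for combo in product(*(choices[name] for name in names)):
        yield dict(zip(names, combo))


candidates = list(
    option_grid(
        BLOCK_SIZE=[512, 1024],
        multibuffer=[True, False],
    )
)
for candidate in candidates:
    print(candidate)

# With Triton installed, each dict could then be wrapped, for example:
#   configs = [triton.Config(c) for c in candidates]  # assumed placement
```

Two values for each of the two options give four candidates. Keeping the list small matters because autotuning compiles and times every candidate.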

Compiler Option Reference Table

The following table describes the options.

Category	Compiler Option	Default/Values	Function Description	Setting Description
General pipeline	`multibuffer`	`True`, `False`; disabled by default in 910_95 compilation scenarios	Enables or disables ping-pong/double-buffer pipelines.	`triton.Config` or launch meta-parameter
CV fusion	`enable_auto_bind_sub_block`	`None`, `True`, `False`	Enables or disables automatic sub-block binding.	`triton.Config` or launch meta-parameter
CV fusion	`enable_hivm_auto_cv_balance`	`None`, `True`, `False`	Enables or disables automatic CV balance.	`triton.Config` or autotune parameter
CV fusion/sync	`sync_solver`	`None`, `True`, `False`	Enables or disables the HIVM synchronization solver.	`triton.Config` or launch meta-parameter
Synchronization	`unit_flag`	`None`, `True`, `False`	Cube-output synchronization option.	`triton.Config` or autotune parameter
Synchronization	`inject_barrier_all`	`None`, `True`, `False`	Enables or disables automatic barrier synchronization injection.	`triton.Config` or launch meta-parameter
Synchronization	`inject_block_all`	`None`, `True`, `False`	Enables or disables automatic block synchronization injection.	`triton.Config` or launch meta-parameter
Multibuffer scope	`limit_auto_multi_buffer_only_for_local_buffer`	`None`, `True`, `False`	Restricts automatic multi-buffering to local buffers.	`triton.Config` or autotune parameter
Multibuffer scope	`limit_auto_multi_buffer_of_local_buffer`	`None`, `"no-limit"`, `"no-l0c"`	Configures the local-buffer automatic multi-buffering scope.	`triton.Config` or autotune parameter
Workspace	`set_workspace_multibuffer`	`None`, `2`, `4`	Configures workspace multi-buffering.	`triton.Config` or autotune parameter
CV fusion tiling	`tile_mix_vector_loop`	`None`, `2`, `4`, `8`	Configures the Vector loop split count.	`triton.Config` or autotune parameter
CV fusion tiling	`tile_mix_cube_loop`	`None`, `2`, `4`, `8`	Configures the Cube loop split count.	`triton.Config` or autotune parameter
CV fusion/sync	`disable_auto_inject_block_sync`	`None`, `True`, `False`	Disables automatic block sync injection when set to `True`.	`triton.Config` or launch meta-parameter
Runtime stream	`stream`	`None` or NPU stream identifier	Specifies the NPU stream.	launch meta-parameter
Compiler pass	`enable_linearize`	Version-dependent	Enables or disables the linearization pass.	`triton.Config` or launch meta-parameter
CV fusion/layout	`enable_nd2nz_on_vector`	Default `False`	Enables or disables ND-to-NZ layout transformation on the Vector path.	`triton.Config` or launch meta-parameter
Large-grid optimization	`auto_blockify_size`	Default `1`	Enables or disables AutoBlockify pass. Ignored when `TRITON_ALL_BLOCKS_PARALLEL` is not set.	launch meta-parameter or `triton.Config`
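
Because several options accept only a small set of values, checking a configuration before launch can catch typos early. The following is a minimal sketch, not part of Triton-Ascend: `ALLOWED_VALUES` copies a subset of the rows above by hand, and `check_options` is a hypothetical helper. The values the compiler actually accepts may differ by version.

```python
# Subset of the table above, copied by hand for illustration.
ALLOWED_VALUES = {
    "multibuffer": (True, False),
    "sync_solver": (None, True, False),
    "limit_auto_multi_buffer_of_local_buffer": (None, "no-limit", "no-l0c"),
    "set_workspace_multibuffer": (None, 2, 4),
    "tile_mix_vector_loop": (None, 2, 4, 8),
    "tile_mix_cube_loop": (None, 2, 4, 8),
}


def _matches(value, allowed):
    # Compare the type as well as the value so that 1 is not accepted for True.
    return type(value) is type(allowed) and value == allowed


def check_options(options):
    """Return a list of problems; an empty list means no listed option is violated."""
    problems = []
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is None:
            continue  # Not covered by this sketch (for example, BLOCK_SIZE).
        if not any(_matches(value, candidate) for candidate in allowed):
            problems.append(f"{name}={value!r} is not one of {allowed!r}")
    return problems


print(check_options({"multibuffer": True, "set_workspace_multibuffer": 4}))
print(check_options({"tile_mix_cube_loop": 3}))
```

Names that are not in `ALLOWED_VALUES` are skipped rather than rejected, so kernel meta-parameters can pass through the same dictionary.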