Wwuying211docs: add language tags for all code blocks

Framework Integration

AscendNPU IR supports framework integration (PyTorch/TensorFlow/MindSpore) in two ways:

DSL integration: Integrate via domain-specific languages such as Triton and TileLang, which compile to AscendNPU IR.
IR integration: Integrate via IR representation, supporting multi-level IR (Torch IR, Linalg/HFusion IR, HIVM IR), with automatic fusion and tiling for Ascend-friendly kernels.

DSL integration

AscendNPU IR supports upstream integration with languages and frameworks such as Triton and TileLang, so that third-party DSLs can target Ascend hardware and run custom operators on the NPU.

Integration	Description
Triton interface	Use Triton to write high-performance kernels and run them on Ascend NPU via Triton Ascend. Covers installation, environment, op mapping, and Ascend extensions.
TileLang interface	Use TileLang Ascend (tile-lang/TVM-based DSL) to develop kernels for Ascend NPU (e.g., GEMM, vector ops, attention). Covers environment, build, and quick start.

IR integration

AscendNPU IR supports multi-level IR integration; each level differs in abstraction and control granularity (see Interface API - Multi-level IR Abstraction):

Torch IR: Framework-level ATen ops, lowered to Linalg/HFusion via Passes.
Linalg/HFusion IR: General tensor algebra and hardware-aware fusion layer; standard MLIR dialects for operator semantics, HFusion performs fusion, tiling, and scheduling automatically.
HIVM IR: NPU instruction layer; direct mapping to hardware instructions, explicit control of memory hierarchy (GM/UB/L1/L0) and compute pipelines (Vector/Cube/MTE) for fine-grained tuning.

Torch IR integration

Use Torch dialect ATen ops; Passes such as convert-torch-to-hfusion lower to Linalg/HFusion named ops, then enter the fusion and scheduling flow.

Torch → AscendNPU IR pipeline

Torch IR is integrated via the torch-backend-to-named-op-backend-pipeline conversion pipeline. The custom convert-torch-to-hfusion Pass lowers Torch ATen ops to Linalg/HFusion named ops first; uncovered ops fall back to the standard lowering path of upstream torch-mlir. Main conversion stages:

convert-torch-to-hfusion: BishengIR custom lowering for 55+ ATen ops to Linalg/HFusion named ops.
convert-torch-to-linalg: Upstream torch-mlir for remaining ops.
convert-torch-to-scf / arith / tensor: Upstream torch-mlir for control flow, arithmetic, and tensor conversion.
func-backend-type-conversion: Converts Torch types (!torch.vtensor) to builtin types (tensor).

Example `torch.mlir`

func.func @torch_mul(%arg0: !torch.vtensor<[4096],f16>, %arg1: !torch.vtensor<[1,56,4096],f16>) -> !torch.vtensor<[1,56,4096],f16>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %0 = torch.aten.mul.Tensor %arg0, %arg1 : !torch.vtensor<[4096],f16>, !torch.vtensor<[1,56,4096],f16> -> !torch.vtensor<[1,56,4096],f16>
  return %0 : !torch.vtensor<[1,56,4096],f16>
}

Invocation: two methods, both share the same compile pipeline.

Stepwise conversion: Converts Torch IR to Linalg/HFusion IR first, suitable for caching or inspecting intermediate IR. After conversion, use torch_to_hfusion.mlir as input and continue with the Linalg/HFusion IR integration flow to produce a binary.
- Command: bishengir-opt -torch-backend-to-named-op-backend-pipeline torch.mlir -o torch_to_hfusion.mlir
- Expected output: MLIR text file (.mlir format) containing the converted Linalg/HFusion IR. For example:

func.func @torch.aten.mul_tensor(%arg0: tensor<4096xf16>, %arg1: tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16> attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %0 = tensor.empty() : tensor<1x56x4096xf16>
  %broadcasted = linalg.broadcast ins(%arg0 : tensor<4096xf16>) outs(%0 : tensor<1x56x4096xf16>) dimensions = [0, 1] 
  %1 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%broadcasted, %arg1 : tensor<1x56x4096xf16>, tensor<1x56x4096xf16>) outs(%0 : tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
  return %1 : tensor<1x56x4096xf16>
}

End-to-end compilation: Uses bishengir-compile to compile Torch IR directly to an executable binary, running through the full Torch → HFusion → HIVM IR compile pipeline.
- Command: bishengir-compile -enable-torch-compile=true -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 torch.mlir -o torch_kernel.o
- Expected output: Ascend NPU operator binary (.o format), loadable and runnable on device via CANN runtime.

Supported Torch ops

Elementwise binary

Torch Op	Lowering target
`aten.add.Tensor` / `aten.add.Scalar`	`linalg.binary_fn<add>`
`aten.sub.Tensor` / `aten.sub.Scalar`	`linalg.binary_fn<sub>`
`aten.mul.Tensor` / `aten.mul.Scalar`	`linalg.binary_fn<mul>`
`aten.div.Tensor` / `aten.div.Scalar`	`linalg.binary_fn<div>`
`aten.maximum`	`linalg.binary_fn<max_signed>`
`aten.minimum`	`linalg.binary_fn<min_signed>`
`aten.clamp_min` / `aten.clamp_min.Tensor`	`linalg.binary_fn<max_signed>`
`aten.clamp_max` / `aten.clamp_max.Tensor`	`linalg.binary_fn<min_signed>`
`aten.clamp`	Combination of `max_signed` + `min_signed`
`aten.pow.Tensor_Tensor` / `aten.pow.Tensor_Scalar` / `aten.pow.Scalar`	`hfusion.binary_fn<powf>`
`aten.logical_and`	`hfusion.binary_fn<vand>`
`aten.logical_or`	`hfusion.binary_fn<vor>`

Elementwise unary

Torch Op	Lowering target
`aten.abs`	`linalg.unary_fn<abs>`
`aten.ceil`	`linalg.unary_fn<ceil>`
`aten.floor`	`linalg.unary_fn<floor>`
`aten.neg`	`linalg.unary_fn<negf>`
`aten.log`	`linalg.unary_fn<log>`
`aten.exp`	`linalg.unary_fn<exp>`
`aten.reciprocal`	`hfusion.unary_fn<rec>`
`aten.relu`	`hfusion.unary_fn<relu>`
`aten.rsqrt`	`hfusion.unary_fn<rsqrt>`
`aten.sqrt`	`hfusion.unary_fn<sqrt>`
`aten.erf`	`hfusion.unary_fn<erf>`
`aten.tanh`	`hfusion.unary_fn<tanh>`
`aten.sin`	`hfusion.unary_fn<sin>`
`aten.cos`	`hfusion.unary_fn<cos>`
`aten.bitwise_not`	`hfusion.unary_fn<vnot>`
`aten.sigmoid`	Decomposed to negf -> exp -> add -> div
`aten.gelu`	Decomposed to tanh-based approximation

Compare

Torch Op	Lowering target
`aten.gt.Scalar` / `aten.gt.Tensor`	`hfusion.compare_fn<vgt>`
`aten.lt.Scalar` / `aten.lt.Tensor`	`hfusion.compare_fn<vlt>`
`aten.ge.Scalar` / `aten.ge.Tensor`	`hfusion.compare_fn<vge>`
`aten.le.Scalar` / `aten.le.Tensor`	`hfusion.compare_fn<vle>`
`aten.eq.Scalar` / `aten.eq.Tensor`	`hfusion.compare_fn<veq>`
`aten.ne.Scalar` / `aten.ne.Tensor`	`hfusion.compare_fn<vne>`

Reduction

Torch Op	Lowering target
`aten.sum` / `aten.sum.dim_IntList`	`linalg.reduce` + `arith.addf/addi`
`aten.prod` / `aten.prod.dim_int`	`linalg.reduce` + `arith.mulf/muli`
`aten.max`	`linalg.reduce` + `arith.maximumf/maxsi`
`aten.min`	`linalg.reduce` + `arith.minimumf/minsi`
`aten.max.dim`	`hfusion.reduce_with_index` (MAX)
`aten.min.dim`	`hfusion.reduce_with_index` (MIN)
`aten.any` / `aten.any.dim` / `aten.any.dims`	`linalg.reduce` + `arith.ori`
`aten.all` / `aten.all.dim`	`linalg.reduce` + `arith.andi`

Data movement

Torch Op	Lowering target
`aten.permute`	`linalg.transpose`
`aten.broadcast_to`	`linalg.broadcast`

Other

Torch Op	Lowering target
`aten.to.dtype`	`hfusion.cast`
`aten.where.self`	`hfusion.select`
`aten.arange.start_step`	`hfusion.arange`

Linalg/HFusion IR integration

Use Linalg/Tensor, HFusion, and other standard MLIR dialects for operator semantics; input goes directly into the Linalg/HFusion IR layer's fusion and scheduling flow.

Example `hfusion.mlir`

func.func @hfusion_reduce_mul(%arg0: tensor<40960xf32>, %arg1: tensor<40960x1024xf32>, %arg2: tensor<40960x1024xf32>, %arg3: tensor<40960x1024xf32>) -> tensor<40960xf32>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %1 = tensor.empty() : tensor<40960x1024xf32>
  %3 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg1, %arg2 : tensor<40960x1024xf32>, tensor<40960x1024xf32>) outs(%arg3: tensor<40960x1024xf32>) -> tensor<40960x1024xf32>
  %4 = tensor.empty() : tensor<40960xf32>
  %sum = linalg.reduce {arith.addf} ins(%3 : tensor<40960x1024xf32>) 
                                    outs(%4 : tensor<40960xf32>) dimensions = [1]
  %5 = tensor.empty() : tensor<40960xf32>
  %6 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg0, %sum : tensor<40960xf32>, tensor<40960xf32>) 
                                                                  outs(%5: tensor<40960xf32>) -> tensor<40960xf32>
  return %6 : tensor<40960xf32>
}

Invocation:

Command: bishengir-compile -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 hfusion.mlir -o hfusion_kernel.o
Expected output: Ascend NPU operator binary (.o format), loadable and runnable on device via CANN runtime.

Automatic fusion

Once Linalg/HFusion IR is ingested, the HFusion compile flow performs automatic fusion and scheduling on eligible ops: multiple ops are merged into the same kernel so intermediate results are reused in on-chip memory and global memory traffic is reduced; scheduling and Tiling strategies are selected automatically based on fusion patterns and operator traits, producing efficient schedules for Ascend NPU. After fusion, the IR passes through Tiling, loop generation, Transform Dialect application, and similar steps before being lowered to HIVM and emitting an executable binary.

Supported op types:

Elemwise
Broadcast
Reduce
Transpose
Concat

For algorithm details, constraints, architecture, and related topics, see HFusion AutoSchedule: Automatic Fusion and Scheduling.

HIVM IR integration

For fine-grained hardware control, you can write kernels directly in the HIVM dialect, managing memory hierarchy and compute pipelines explicitly.

Example `hivm.mlir`

func.func @hivm_vadd(%valueA: memref<16xf16, #hivm.address_space<gm>>,
                       %valueB: memref<16xf16, #hivm.address_space<gm>>,
                       %valueC: memref<16xf16, #hivm.address_space<gm>>)
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %ubA = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueA : memref<16xf16, #hivm.address_space<gm>>)
                outs(%ubA : memref<16xf16, #hivm.address_space<ub>>)
  %ubB = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueB : memref<16xf16, #hivm.address_space<gm>>)
                outs(%ubB : memref<16xf16, #hivm.address_space<ub>>)
  %ubC = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.vadd ins(%ubA, %ubB : memref<16xf16, #hivm.address_space<ub>>,
                                 memref<16xf16, #hivm.address_space<ub>>)
                outs(%ubC : memref<16xf16, #hivm.address_space<ub>>)
  hivm.hir.store ins(%ubC : memref<16xf16, #hivm.address_space<ub>>)
                 outs(%valueC : memref<16xf16, #hivm.address_space<gm>>)
  return
}

HIVM uses #hivm.address_space to annotate memory hierarchy: gm (global memory), ub (Unified Buffer), l1 (L1 Buffer), l0a/l0b/l0c (L0 Buffer). Use hivm.hir.load/hivm.hir.store for explicit DMA transfers and hivm.hir.vadd and similar ops for on-chip compute.

Invocation: HIVM does not require the HFusion compile pipeline. The default HIVM compile pipeline performs sync insertion, memory planning, and other optimizations.

Command: bishengir-compile -enable-hfusion-compile=false -enable-hivm-compile=true -target=Ascend910B1 hivm.mlir -o hivm_kernel.o
Expected output: Ascend NPU operator binary (.o format), loadable and runnable on device via CANN runtime.

For IR-level concepts, common compile options, and other integration paths (e.g., Triton, TileLang), see Interface API.