Framework Integration

AscendNPU IR supports framework integration (PyTorch/TensorFlow/MindSpore) in two ways:

  • DSL integration: Integrate via domain-specific languages such as Triton and TileLang, which compile to AscendNPU IR.
  • IR integration: Integrate at the IR level through one of several entry points (Torch IR, Linalg/HFusion IR, HIVM IR), with automatic fusion and tiling that produce Ascend-friendly kernels.

DSL integration

AscendNPU IR supports upstream integration with languages and frameworks such as Triton and TileLang, so that third-party DSLs can target Ascend hardware and run custom operators on the NPU.

Integration         Description
Triton interface    Use Triton to write high-performance kernels and run them on Ascend NPU via Triton Ascend. Covers installation, environment, op mapping, and Ascend extensions.
TileLang interface  Use TileLang Ascend (tile-lang/TVM-based DSL) to develop kernels for Ascend NPU (e.g., GEMM, vector ops, attention). Covers environment, build, and quick start.

IR integration

AscendNPU IR supports multi-level IR integration; each level differs in abstraction and control granularity (see Interface API - Multi-level IR Abstraction):

  • Torch IR: Framework-level ATen ops, lowered to Linalg/HFusion via Passes.
  • Linalg/HFusion IR: General tensor algebra and hardware-aware fusion layer; standard MLIR dialects express operator semantics, and HFusion performs fusion, tiling, and scheduling automatically.
  • HIVM IR: NPU instruction layer; direct mapping to hardware instructions, explicit control of memory hierarchy (GM/UB/L1/L0) and compute pipelines (Vector/Cube/MTE) for fine-grained tuning.

Torch IR integration

Write operators as Torch dialect ATen ops. Passes such as convert-torch-to-hfusion lower them to Linalg/HFusion named ops, which then enter the fusion and scheduling flow.

Torch → AscendNPU IR pipeline

Torch IR is integrated via the torch-backend-to-named-op-backend-pipeline conversion pipeline. The custom convert-torch-to-hfusion Pass lowers Torch ATen ops to Linalg/HFusion named ops first; uncovered ops fall back to the standard lowering path of upstream torch-mlir. Main conversion stages:

  • convert-torch-to-hfusion: BishengIR custom lowering for 55+ ATen ops to Linalg/HFusion named ops.
  • convert-torch-to-linalg: Upstream torch-mlir for remaining ops.
  • convert-torch-to-scf / arith / tensor: Upstream torch-mlir for control flow, arithmetic, and tensor conversion.
  • func-backend-type-conversion: Converts Torch types (!torch.vtensor) to builtin types (tensor).
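
The "custom lowering first, upstream fallback second" order described above can be pictured with a small Python model. This is only a sketch: the table holds a few entries copied from the supported-op tables in this document, while choose_lowering, CUSTOM_LOWERINGS, and the op name aten.hypothetical_op are illustrative names and not part of bishengir or torch-mlir.

```python
# Toy model of the lowering order: the custom pass is tried first, and any op
# it does not cover is left to the upstream torch-mlir passes.
# All names in this block are hypothetical and exist only for illustration.

CUSTOM_LOWERINGS = {
    "aten.mul.Tensor": "linalg.binary_fn<mul>",
    "aten.relu": "hfusion.unary_fn<relu>",
    "aten.to.dtype": "hfusion.cast",
}


def choose_lowering(op_name):
    """Return (handler, target) for a single ATen op name."""
    target = CUSTOM_LOWERINGS.get(op_name)
    if target is not None:
        return ("convert-torch-to-hfusion", target)
    # Not covered by the custom pass: falls back to upstream torch-mlir.
    return ("upstream torch-mlir", None)


for op in ["aten.mul.Tensor", "aten.relu", "aten.hypothetical_op"]:
    print(op, "->", choose_lowering(op))
```

The real passes operate on MLIR operations with pattern rewrites rather than on strings; the dictionary lookup only mirrors the priority between the two paths.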

Example torch.mlir

func.func @torch_mul(%arg0: !torch.vtensor<[4096],f16>, %arg1: !torch.vtensor<[1,56,4096],f16>) -> !torch.vtensor<[1,56,4096],f16>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %0 = torch.aten.mul.Tensor %arg0, %arg1 : !torch.vtensor<[4096],f16>, !torch.vtensor<[1,56,4096],f16> -> !torch.vtensor<[1,56,4096],f16>
  return %0 : !torch.vtensor<[1,56,4096],f16>
}
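
The result type of the aten.mul.Tensor in this example follows the usual right-aligned broadcasting rules of PyTorch. The sketch below recomputes that shape in plain Python and lists the leading dimensions that the rank-1 operand lacks. The helper names broadcast_shape and added_leading_dims are hypothetical and are not part of any toolchain mentioned here.

```python
def broadcast_shape(a, b):
    """Right-aligned broadcasting of two static shapes."""
    rank = max(len(a), len(b))
    # Pad the shorter shape with leading 1s so both shapes have the same rank.
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"incompatible dimensions {x} and {y}")
        # A size-1 dimension stretches to match the other operand.
        out.append(y if x == 1 else x)
    return tuple(out)


def added_leading_dims(small, result):
    """Indices of the result dimensions that the lower-rank operand lacks."""
    return list(range(len(result) - len(small)))


result = broadcast_shape((4096,), (1, 56, 4096))
print(result)                               # (1, 56, 4096)
print(added_leading_dims((4096,), result))  # [0, 1]
```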

Invocation: there are two methods, and both share the same compile pipeline.

  • Stepwise conversion: Converts Torch IR to Linalg/HFusion IR first, which is useful for caching or inspecting the intermediate IR. After conversion, pass torch_to_hfusion.mlir to the Linalg/HFusion IR integration flow to produce a binary.
    • Command: bishengir-opt -torch-backend-to-named-op-backend-pipeline torch.mlir -o torch_to_hfusion.mlir
    • Expected output: MLIR text file (.mlir format) containing the converted Linalg/HFusion IR. For example:
func.func @torch_mul(%arg0: tensor<4096xf16>, %arg1: tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16> attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %0 = tensor.empty() : tensor<1x56x4096xf16>
  %broadcasted = linalg.broadcast ins(%arg0 : tensor<4096xf16>) outs(%0 : tensor<1x56x4096xf16>) dimensions = [0, 1] 
  %1 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%broadcasted, %arg1 : tensor<1x56x4096xf16>, tensor<1x56x4096xf16>) outs(%0 : tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
  return %1 : tensor<1x56x4096xf16>
}
  • End-to-end compilation: Uses bishengir-compile to compile Torch IR directly to an executable binary, running through the full Torch → HFusion → HIVM IR compile pipeline.
    • Command: bishengir-compile -enable-torch-compile=true -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 torch.mlir -o torch_kernel.o
    • Expected output: Ascend NPU operator binary (.o format), loadable and runnable on device via CANN runtime.
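
For scripting, the two invocation methods can be assembled as argument lists. The sketch below only builds and prints the command lines and does not run them. The tool names and flags are the ones shown in this document, and the second stepwise command reuses the flags documented for Linalg/HFusion IR integration. The function names and the choice of torch_kernel.o as the stepwise output file are assumptions made for illustration.

```python
import shlex

TARGET = "Ascend910B1"  # target string used by the commands in this document


def stepwise_commands(torch_mlir, hfusion_mlir, kernel_obj):
    """Stepwise flow: lower Torch IR first, then compile the lowered IR."""
    lower = [
        "bishengir-opt",
        "-torch-backend-to-named-op-backend-pipeline",
        torch_mlir, "-o", hfusion_mlir,
    ]
    compile_lowered = [
        "bishengir-compile",
        "-enable-hfusion-compile=true",
        "-enable-hivm-compile=true",
        f"-target={TARGET}",
        hfusion_mlir, "-o", kernel_obj,
    ]
    return [lower, compile_lowered]


def end_to_end_command(torch_mlir, kernel_obj):
    """End-to-end flow: a single bishengir-compile call on the Torch IR."""
    return [
        "bishengir-compile",
        "-enable-torch-compile=true",
        "-enable-hfusion-compile=true",
        "-enable-hivm-compile=true",
        f"-target={TARGET}",
        torch_mlir, "-o", kernel_obj,
    ]


for cmd in stepwise_commands("torch.mlir", "torch_to_hfusion.mlir", "torch_kernel.o"):
    print(shlex.join(cmd))
print(shlex.join(end_to_end_command("torch.mlir", "torch_kernel.o")))
```

On a machine where the tools are installed, each list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the script.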

Supported Torch ops

Elementwise binary
Torch Op                                                           Lowering target
aten.add.Tensor / aten.add.Scalar                                  linalg.binary_fn<add>
aten.sub.Tensor / aten.sub.Scalar                                  linalg.binary_fn<sub>
aten.mul.Tensor / aten.mul.Scalar                                  linalg.binary_fn<mul>
aten.div.Tensor / aten.div.Scalar                                  linalg.binary_fn<div>
aten.maximum                                                       linalg.binary_fn<max_signed>
aten.minimum                                                       linalg.binary_fn<min_signed>
aten.clamp_min / aten.clamp_min.Tensor                             linalg.binary_fn<max_signed>
aten.clamp_max / aten.clamp_max.Tensor                             linalg.binary_fn<min_signed>
aten.clamp                                                         Combination of max_signed + min_signed
aten.pow.Tensor_Tensor / aten.pow.Tensor_Scalar / aten.pow.Scalar  hfusion.binary_fn<powf>
aten.logical_and                                                   hfusion.binary_fn<vand>
aten.logical_or                                                    hfusion.binary_fn<vor>

Elementwise unary
Torch Op          Lowering target
aten.abs          linalg.unary_fn<abs>
aten.ceil         linalg.unary_fn<ceil>
aten.floor        linalg.unary_fn<floor>
aten.neg          linalg.unary_fn<negf>
aten.log          linalg.unary_fn<log>
aten.exp          linalg.unary_fn<exp>
aten.reciprocal   hfusion.unary_fn<rec>
aten.relu         hfusion.unary_fn<relu>
aten.rsqrt        hfusion.unary_fn<rsqrt>
aten.sqrt         hfusion.unary_fn<sqrt>
aten.erf          hfusion.unary_fn<erf>
aten.tanh         hfusion.unary_fn<tanh>
aten.sin          hfusion.unary_fn<sin>
aten.cos          hfusion.unary_fn<cos>
aten.bitwise_not  hfusion.unary_fn<vnot>
aten.sigmoid      Decomposed to negf -> exp -> add -> div
aten.gelu         Decomposed to tanh-based approximation

Compare
Torch Op                         Lowering target
aten.gt.Scalar / aten.gt.Tensor  hfusion.compare_fn<vgt>
aten.lt.Scalar / aten.lt.Tensor  hfusion.compare_fn<vlt>
aten.ge.Scalar / aten.ge.Tensor  hfusion.compare_fn<vge>
aten.le.Scalar / aten.le.Tensor  hfusion.compare_fn<vle>
aten.eq.Scalar / aten.eq.Tensor  hfusion.compare_fn<veq>
aten.ne.Scalar / aten.ne.Tensor  hfusion.compare_fn<vne>

Reduction
Torch Op                                 Lowering target
aten.sum / aten.sum.dim_IntList          linalg.reduce + arith.addf/addi
aten.prod / aten.prod.dim_int            linalg.reduce + arith.mulf/muli
aten.max                                 linalg.reduce + arith.maximumf/maxsi
aten.min                                 linalg.reduce + arith.minimumf/minsi
aten.max.dim                             hfusion.reduce_with_index (MAX)
aten.min.dim                             hfusion.reduce_with_index (MIN)
aten.any / aten.any.dim / aten.any.dims  linalg.reduce + arith.ori
aten.all / aten.all.dim                  linalg.reduce + arith.andi

Data movement
Torch Op           Lowering target
aten.permute       linalg.transpose
aten.broadcast_to  linalg.broadcast

Other
Torch Op                Lowering target
aten.to.dtype           hfusion.cast
aten.where.self         hfusion.select
aten.arange.start_step  hfusion.arange
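
Three entries in the tables above are decompositions rather than single named ops: aten.sigmoid, aten.gelu, and aten.clamp. The scalar Python sketch below spells out what those decompositions compute. The sigmoid steps follow the order given in the table. The GELU constants are those of the widely used tanh approximation and are an assumption, since this document does not give the formula the lowering uses. The function names are illustrative only.

```python
import math


def sigmoid_decomposed(x):
    """sigmoid(x) as the chain negf -> exp -> add -> div."""
    neg = -x             # negf
    e = math.exp(neg)    # exp
    denom = 1.0 + e      # add
    return 1.0 / denom   # div


def gelu_tanh(x):
    """Common tanh approximation of GELU (constants assumed, see lead-in)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def clamp(x, lo, hi):
    """clamp as a max against the lower bound, then a min against the upper."""
    return min(max(x, lo), hi)


print(sigmoid_decomposed(0.0))   # 0.5
print(round(gelu_tanh(1.0), 4))
print(clamp(5, 0, 3))            # 3
```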

Linalg/HFusion IR integration

Express operator semantics with standard MLIR dialects (Linalg, Tensor) together with HFusion; this input enters the fusion and scheduling flow of the Linalg/HFusion IR layer directly.

Example hfusion.mlir

func.func @hfusion_reduce_mul(%arg0: tensor<40960xf32>, %arg1: tensor<40960x1024xf32>, %arg2: tensor<40960x1024xf32>, %arg3: tensor<40960x1024xf32>) -> tensor<40960xf32>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  // %1 is not used below; the first multiply writes into %arg3.
  %1 = tensor.empty() : tensor<40960x1024xf32>
  // Elementwise product of the two 2-D inputs.
  %3 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg1, %arg2 : tensor<40960x1024xf32>, tensor<40960x1024xf32>) outs(%arg3 : tensor<40960x1024xf32>) -> tensor<40960x1024xf32>
  // Reduce each row of the product over dimension 1 with arith.addf.
  %4 = tensor.empty() : tensor<40960xf32>
  %sum = linalg.reduce {arith.addf} ins(%3 : tensor<40960x1024xf32>)
                                    outs(%4 : tensor<40960xf32>) dimensions = [1]
  // Scale each row result by the matching element of %arg0.
  %5 = tensor.empty() : tensor<40960xf32>
  %6 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg0, %sum : tensor<40960xf32>, tensor<40960xf32>)
                                                                  outs(%5 : tensor<40960xf32>) -> tensor<40960xf32>
}
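
As a cross-check when validating the compiled kernel, the computation in this example can be written as a plain Python reference: out[i] = arg0[i] * sum_j(arg1[i][j] * arg2[i][j]). This is a sketch under two assumptions: the row accumulator starts at zero, and %arg3 serves only as the destination buffer of the first multiply, so it is omitted. The function name is hypothetical.

```python
def hfusion_reduce_mul_reference(arg0, arg1, arg2):
    """Op-by-op reference for the mul -> reduce -> mul chain."""
    # linalg.elemwise_binary <mul> on the two 2-D inputs.
    prod = [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(arg1, arg2)]
    # linalg.reduce {arith.addf} over dimension 1 (accumulator assumed zero).
    row_sums = [sum(row) for row in prod]
    # linalg.elemwise_binary <mul> with the 1-D input.
    return [s * r for s, r in zip(arg0, row_sums)]


# Small shapes stand in for 40960 and 40960x1024.
out = hfusion_reduce_mul_reference(
    [2.0, 3.0],
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
)
print(out)  # [34.0, 159.0]
```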

Invocation:

  • Command: bishengir-compile -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 hfusion.mlir -o hfusion_kernel.o
  • Expected output: Ascend NPU operator binary (.o format), loadable and runnable on device via CANN runtime.

Automatic fusion

Once Linalg/HFusion IR is ingested, the HFusion compile flow automatically fuses and schedules eligible ops. Multiple ops are merged into one kernel so that intermediate results stay in on-chip memory and global memory traffic is reduced. Scheduling and tiling strategies are selected automatically from the fusion pattern and operator traits, producing efficient schedules for Ascend NPU. After fusion, the IR goes through tiling, loop generation, Transform Dialect application, and similar steps before it is lowered to HIVM and emitted as an executable binary.

Supported op types:

  • Elemwise
  • Broadcast
  • Reduce
  • Transpose
  • Concat

For algorithm details, constraints, architecture, and related topics, see HFusion AutoSchedule: Automatic Fusion and Scheduling.
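
The memory-traffic benefit of fusion can be illustrated with a toy count of global-memory element accesses. The sketch below compares an add followed by a relu run as two separate kernels against a single fused, tiled kernel. It is a conceptual model only: the tile size, the traffic accounting, and the function names are assumptions and do not reflect how HFusion actually selects tiles or schedules.

```python
def unfused(a, b):
    """Two kernels: add, then relu, with the sum written out in between."""
    traffic = 0
    tmp = [x + y for x, y in zip(a, b)]   # kernel 1
    traffic += 2 * len(a) + len(tmp)      # read a and b, write tmp
    out = [max(t, 0.0) for t in tmp]      # kernel 2
    traffic += len(tmp) + len(out)        # read tmp, write out
    return out, traffic


def fused(a, b, tile=4):
    """One kernel: each tile is added and clamped while it stays on chip."""
    out, traffic = [], 0
    for start in range(0, len(a), tile):
        ta = a[start:start + tile]        # load a tile of a
        tb = b[start:start + tile]        # load a tile of b
        tc = [max(x + y, 0.0) for x, y in zip(ta, tb)]
        out.extend(tc)                    # store a tile of the result
        traffic += len(ta) + len(tb) + len(tc)
    return out, traffic


a = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0]
b = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(unfused(a, b)[1])  # 40 element accesses
print(fused(a, b)[1])    # 24 element accesses
```

In this model the fused version never writes or re-reads the intermediate sum, so traffic drops from 5 to 3 accesses per element while the results stay identical.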

HIVM IR integration

For fine-grained hardware control, you can write kernels directly in the HIVM dialect, managing memory hierarchy and compute pipelines explicitly.

Example hivm.mlir

func.func @hivm_vadd(%valueA: memref<16xf16, #hivm.address_space<gm>>,
                       %valueB: memref<16xf16, #hivm.address_space<gm>>,
                       %valueC: memref<16xf16, #hivm.address_space<gm>>)
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %ubA = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueA : memref<16xf16, #hivm.address_space<gm>>)
                outs(%ubA : memref<16xf16, #hivm.address_space<ub>>)
  %ubB = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueB : memref<16xf16, #hivm.address_space<gm>>)
                outs(%ubB : memref<16xf16, #hivm.address_space<ub>>)
  %ubC = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.vadd ins(%ubA, %ubB : memref<16xf16, #hivm.address_space<ub>>,
                                 memref<16xf16, #hivm.address_space<ub>>)
                outs(%ubC : memref<16xf16, #hivm.address_space<ub>>)
  hivm.hir.store ins(%ubC : memref<16xf16, #hivm.address_space<ub>>)
                 outs(%valueC : memref<16xf16, #hivm.address_space<gm>>)
  return
}

HIVM uses #hivm.address_space to annotate memory hierarchy: gm (global memory), ub (Unified Buffer), l1 (L1 Buffer), l0a/l0b/l0c (L0 Buffer). Use hivm.hir.load/hivm.hir.store for explicit DMA transfers and hivm.hir.vadd and similar ops for on-chip compute.
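
The load, compute, store sequence of the hivm_vadd example can be mirrored in a few lines of Python, with dictionaries standing in for the gm and ub address spaces. This is a conceptual sketch of the data flow only. It does not model DMA behaviour, synchronisation, or f16 arithmetic, and the function name is hypothetical.

```python
def hivm_vadd_model(value_a, value_b):
    """Mimic the gm -> ub -> compute -> gm flow of the HIVM example."""
    gm = {"A": list(value_a), "B": list(value_b), "C": [0.0] * len(value_a)}
    ub = {}
    ub["A"] = list(gm["A"])                              # hivm.hir.load
    ub["B"] = list(gm["B"])                              # hivm.hir.load
    ub["C"] = [x + y for x, y in zip(ub["A"], ub["B"])]  # hivm.hir.vadd
    gm["C"] = list(ub["C"])                              # hivm.hir.store
    return gm["C"]


print(hivm_vadd_model([1.0] * 16, [2.0] * 16))
```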

Invocation: HIVM does not require the HFusion compile pipeline. The default HIVM compile pipeline performs sync insertion, memory planning, and other optimizations.

  • Command: bishengir-compile -enable-hfusion-compile=false -enable-hivm-compile=true -target=Ascend910B1 hivm.mlir -o hivm_kernel.o
  • Expected output: Ascend NPU operator binary (.o format), loadable and runnable on device via CANN runtime.

For IR-level concepts, common compile options, and other integration paths (e.g., Triton, TileLang), see Interface API.