Engine Feature Analysis

1 Feature Background

1.1 Problem Domain

The Ascend AI processor uses a heterogeneous computing architecture. The chip integrates multiple types of computing units: AI Core handles dense matrix operations (for example, convolution and MatMul), Vector Core handles vector operations (for example, ElementWise), AI CPU handles general-purpose computing that is not suitable for hardware acceleration, DVPP handles image and video preprocessing, and HCCL handles multi-card distributed communication. Different types of operators require execution by different computing units. A complete neural network model typically contains multiple types of operators.

GE (Graph Engine) serves as the graph compiler and executor for Ascend. The core problem that GE must solve is: how to map a computational graph containing heterogeneous operators correctly and efficiently to multiple different execution units on the chip. This is the purpose of the Engine feature.

1.2 Engine Overview

Engine	Registered Name	Typical Operators	Engine Directory
AI Core Fusion Engine	`AIcoreEngine`	Conv2D, MatMul, BiasAdd, ReLU, BatchNorm	`compiler/engines/nn_engine/`
Vector Core Engine	`VectorEngine`	ElementWise, Sqrt, Exp, Log, Cast	`compiler/engines/nn_engine/`
FFTS+ Composite Engine	`ffts_plus`	Cross-engine fusion operators, AIC+AIV mixed tasks	`compiler/engines/ffts_engine/`
AI CPU Ascend Engine	`aicpu_ascend_kernel`	SparseToDense, Minimum, Maximum, Round	`compiler/engines/cpu_engine/aicpu_engine/`
AI CPU TF Engine	`aicpu_tf_kernel`	TensorFlow-format AI CPU operators	`compiler/engines/cpu_engine/tf_engine/`
Host CPU Engine	`DNN_VM_HOST_CPU_OP_STORE`	Constant folding operators, fallback operators not supported on Ascend devices	`compiler/engines/cpu_engine/hostcpu_engine/`, `base/host_cpu_engine/`
HCCL Communication Engine	`ops_kernel_info_hccl`	AllReduce, Broadcast, AllGather, ReduceScatter	`compiler/engines/hccl_engine/`
DVPP Preprocessing Engine	`dvpp_ops_kernel`	ImageDecode, Resize, Crop, ColorConvert	`compiler/engines/dvpp_engine/`
RTS Runtime Service Engine	`DNN_VM_RTS_OP_STORE`	StreamSwitch, StreamActive, LabelGoto, LabelSet	`compiler/engines/rts_engine/`
GE Local Engine	`DNN_VM_GE_LOCAL_OP_STORE`	NetOutput, NoOp, Const, PhonyConcat, PhonySplit	`compiler/engines/local_engine/`
Custom Operator Engine	`DNN_VM_CUSTOM`	User-defined Ascend C operators	`compiler/engines/custom_engine/`
DSA Engine	`DSAEngine`	DSA-specific operators	`compiler/engines/nn_engine/` (DSA submodule)

Note: The "AI Core Fusion Engine" and "Vector Core Engine" are collectively referred to as FE (Fusion Engine).

1.3 Design Philosophy

The Engine feature follows three core design principles:

Pluggable Architecture: Each engine compiles into an independent dynamic library (.so) and interacts with the GE framework through a unified C function interface. Engines can be developed, compiled, and deployed independently. The GE framework does not need to understand the internal implementation of engines.
Priority-Driven Automatic Selection: The system automatically selects the optimal engine for each operator through a cost model. Users do not need to specify manually. Engines with lower cost (COST_0 to COST_10) are selected first. Higher priority selection means higher performance execution paths.
Compile-Time Decision + Runtime Execution: Engine selection and graph partitioning complete at compile time. Runtime directly executes according to the execution plan in the compilation output. This avoids runtime engine dispatch overhead.

2 User Scenarios

2.1 Scenario 1: Model Training (Default Engine Selection)

When users train models using PyTorch or TensorFlow, GE automatically assigns computational operators to the AI Core engine, communication operators to the HCCL engine, and control flow operators to the RTS engine. Users do not need to be aware of engine existence.

Typical workflow:

User calls Session::AddGraph to add a computational graph
GE compiler automatically selects an engine for each operator through EnginePlacer
EnginePartitioner partitions the graph into subgraphs by engine
Each engine performs engine-specific optimization on its assigned subgraph (for example, FE fusion engine performs operator fusion)
Compilation output loads onto the device for execution

2.2 Scenario 2: Engine Configuration and Exclusion

Users can intervene in engine selection behavior through configuration options:

ge.engineType (that is, CORE_TYPE): Sets the core computing engine to AIcoreEngine (default) or VectorEngine. The two are mutually exclusive. Selecting VectorCore automatically excludes the AI Core engine, and vice versa. This applies to scenarios where computing unit configurations differ across chip models.
ge.exec.exclude_engines: Excludes specific engines through accelerator names, for example, excluding the DVPP engine to make visual preprocessing operators fallback to other engines.

2.3 Scenario 3: ACL Single Operator Execution

When users execute a single operator through ACL C API, they can specify the operator execution engine through the aclopEngineType parameter:

ACL_ENGINE_SYS: System automatically selects the engine (recommended)
ACL_ENGINE_AICORE: Specifies operator execution on AI Core (affects the binary format of compilation output, uses RT_DEV_BINARY_MAGIC_ELF magic number)
ACL_ENGINE_VECTOR: Specifies operator execution on Vector Core (uses RT_DEV_BINARY_MAGIC_ELF_AIVEC magic number)

This parameter passes through APIs such as aclopCompile, aclopCompileAndExecute, aclopCompileAndExecuteV2, and aclopCreateKernel.

2.4 Scenario 4: Custom Operator Engine

When users develop custom operators, GE provides the DNN_VM_CUSTOM engine (highest priority, COST_0) to host custom operator implementations. The OpsKernelInfoStore for custom operators creates directly during OpsKernelManager initialization and does not need to compile into an independent SO file.

2.5 Scenario 5: Host CPU Fallback Execution

When certain operators do not support execution on Ascend devices, users can fallback operators to host CPU execution through the hostExecFlag option. HostcpuEngineUpdatePass marks these operators as DNN_VM_HOST_CPU engine during the engine reassignment phase. The Host CPU engine has the lowest priority (COST_10) and is selected only when all device engines do not support the operator.

3 Interfaces

3.1 ACL Layer Enumeration

Defined in inc/external/acl/acl_op.h, aclopEngineType enumeration:

Enumeration Value	Meaning
`ACL_ENGINE_SYS`	System automatically selects engine
`ACL_ENGINE_AICORE`	Specifies AI Core engine
`ACL_ENGINE_VECTOR`	Specifies Vector Core engine

Current Status: According to the official documentation in docs/graph_engine_api/aclopEngineType.md, currently only ACL_ENGINE_SYS is supported. ACL_ENGINE_AICORE and ACL_ENGINE_VECTOR are not supported yet.

3.2 ACL Operator Compilation and Execution API

The following APIs accept the aclopEngineType parameter:

API	Description
`aclopCompile`	Specifies engine type when compiling operator
`aclopCompileAndExecute`	Compiles and executes operator
`aclopCompileAndExecuteV2`	V2 version compiles and executes
`aclopCreateKernel`	Specifies engine type when creating custom Kernel
`aclGenGraphAndDumpForOp`	Generates and dumps operator graph

3.3 GE Internal Enumeration

Defined in inc/graph_metadef/common/ge_common/ge_types.h, ge::OpEngineType enumeration:

Enumeration Value	Meaning
`ENGINE_SYS`	Default engine
`ENGINE_AICORE`	AI Core engine
`ENGINE_VECTOR`	Vector engine
`ENGINE_AICUBE`	AI Cube (not supported)
`ENGINE_AIVECTOR`	AI Vector (not supported)

The ACL layer aclopEngineType converts to ge::OpEngineType internally in the compilation pipeline (conversion completes in MakeCompileParam in api/acl/acl_op_executor/single_op/compile/op_compiler.cpp).

3.4 Engine Configuration Options

Option Key	Type	Meaning
`ge.engineType`	string	Core engine type: `"AIcoreEngine"` (default) or `"VectorEngine"`, mutually exclusive
`ge.exec.exclude_engines`	string	Excludes specified engines, comma-separated accelerator names
`ge.aicoreNum`	int32	Configures AI Core quantity
`ge.exec.enableEngineParallel`	string	Whether to enable engine parallel execution
`ge.exec.engineParallelConfigPath`	string	Engine parallel configuration file path
`ac_parallel_enable`	string	`"1"` enables parallel execution between heterogeneous engines (for example, AI CPU and AI Core parallel)

3.5 DNNEngine Base Class Interface

Defined in inc/framework/engine/dnnengine.h, DNNEngine class:

Interface	Description
`Initialize(options)`	Initializes engine (default empty implementation, overridden by subclass)
`Finalize()`	Finalizes engine
`GetAttributes(attr)`	Gets engine attributes (`DNNEngineAttribute`)
`IsAtomic()`	Whether it is an atomic engine

The DNNEngineAttribute structure contains the following fields:

Field	Type	Meaning
`engine_name`	string	Engine name
`mem_type`	vector<string>	Memory type (for example, HBM)
`compute_cost`	PriorityEnum	Compute cost priority (`COST_0` ~ `COST_10`)
`runtime_type`	RuntimeType	Runtime type: `HOST` or `DEVICE`
`engine_input_format`	Format	Engine input format
`engine_output_format`	Format	Engine output format
`atomic_engine_flag`	bool	Whether it is an atomic engine

4 Implementation Details

4.1 Overall Architecture

flowchart TB
    subgraph Compiler["Compiler Engine System"]
        EM["EngineManager<br/>Engine Registry"]
        DEM["DNNEngineManager<br/>Engine Manager (Singleton)"]
        OKM["OpsKernelManager<br/>Operator Kernel Manager"]
        EP["EnginePlacer<br/>Engine Placer"]
        EPT["EnginePartitioner<br/>Engine Partitioner"]
    end

    subgraph Engines["Engine Plugins (SO Dynamic Libraries)"]
        FE["nn_engine<br/>AI Core / Vector Core<br/>(libfe.so)"]
        CE["cpu_engine<br/>AICPU / HostCpu<br/>(libaicpu_*.so)"]
        LE["local_engine<br/>GeLocal<br/>(libge_local_engine.so)"]
        RE["rts_engine<br/>RTS<br/>(librts_engine.so)"]
        DE["dvpp_engine<br/>DVPP<br/>(libdvpp_engine.so)"]
        HE["hccl_engine<br/>HCCL<br/>(libhcom_graph_adaptor.so)"]
        FF["ffts_engine<br/>FFTS+<br/>(libffts.so)"]
        CU["custom_engine<br/>Custom Operators"]
    end

    Compiler --> Engines

4.2 Engine Type System

The GE system contains three cooperating engine type systems:

Compile-Time Engines (DNNEngine System): 15 engine types, all inheriting from the DNNEngine base class, defined in compiler/engines/manager/engine/dnnengines.h.

Engine Class	Registered Name	Priority	Runtime Type	Atomic Engine	Description
`CustomDNNEngine`	`DNN_VM_CUSTOM`	COST_0	DEVICE	Yes	User-defined custom operator engine
`AICoreDNNEngine`	`AIcoreEngine`	COST_1	DEVICE	Yes	AI Core matrix computing engine
`FftsPlusDNNEngine`	`ffts_plus`	COST_1	DEVICE	No	FFTS+ fusion engine (composite engine)
`VectorCoreDNNEngine`	`VectorEngine`	COST_2	DEVICE	Yes	Vector Core vector computing engine
`DSADNNEngine`	`DSAEngine`	COST_2	DEVICE	Yes	DSA engine
`AICpuDNNEngine`	`DNN_VM_AICPU_ASCEND`	COST_3	DEVICE	Yes	AI CPU Ascend engine
`AICpuTFDNNEngine`	`DNN_VM_AICPU`	COST_4	DEVICE	Yes	AI CPU TensorFlow engine
`DvppDNNEngine`	`DNN_VM_DVPP`	COST_5	DEVICE	Yes	DVPP digital visual preprocessing engine
`GeLocalDNNEngine`	`DNN_VM_GE_LOCAL`	COST_9	DEVICE	Yes	GE local engine (fallback)
`HostCpuDNNEngine`	`DNN_VM_HOST_CPU`	COST_10	HOST	Yes	Host CPU engine (lowest priority)
`RtsDNNEngine`	`DNN_VM_RTS`	COST_2	DEVICE	Yes	Runtime service engine
`RtsFftsPlusDNNEngine`	`DNN_VM_RTS_FFTS_PLUS`	COST_2	DEVICE	Yes	RTS FFTS+ engine
`HcclDNNEngine`	`DNN_HCCL`	COST_2	DEVICE	Yes	Collective communication engine
`AICpuFftsPlusDNNEngine`	`DNN_VM_AICPU_FFTS_PLUS`	COST_2	DEVICE	Yes	AI CPU FFTS+ engine
`AICpuAscendFftsPlusDNNEngine`	`DNN_VM_AICPU_ASCEND_FFTS_PLUS`	COST_2	DEVICE	Yes	AI CPU Ascend FFTS+ engine

Priority arrangement pattern: Custom engine is highest (0), dedicated hardware engines are next (1-2), general-purpose CPU engines are lower (3-5), and fallback engines are lowest (9-10).

Runtime V2 Engines (NodeConverter System): Defined in runtime/v2/engine/, registered through REGISTER_NODE_CONVERTER macro, performs node lowering by engine name and execution placement.

Engine Directory	Description
`aicore/`	AI Core node lowering and kernel launch
`aicpu/`	AI CPU node conversion
`custom/`	Custom operator kernel
`dsacore/`	DSA Core node conversion
`dvpp/`	DVPP preprocessing node conversion
`ffts_plus/`	FFTS+ update operations
`gelocal/`	GE local engine (Const, NetOutput, Variable, If, While, and so on)
`rts/`	Runtime service node conversion

4.3 Engine Registration Mechanism

4.3.1 Compile-Time Engine Registration (EngineManager)

Engine registration uses a two-layer separation design: lightweight attribute carrier + heavyweight plugin capability.

First Layer: DNNEngine Registration (Attribute Carrier)

Defined in compiler/engines/manager/engine/engine_manager.cc, the entry function GetDNNEngineObjs() calls 15 registration functions one by one, such as RegisterAiCoreEngine() and RegisterVectorEngine(). Each registration function follows the same internal pattern:

Construct DNNEngineAttribute structure (name, priority, runtime type, and so on)
Create corresponding DNNEngine subclass instance (for example, AICoreDNNEngine)
Call EngineManager::RegisterEngine(name, ptr) to register to the global engine_map_

DNNEngine itself only carries attribute information and does not contain any operator compilation or execution logic.

Second Layer: OpsKernelInfoStore + GraphOptimizer + OpsKernelBuilder Registration (Engine Capability)

Each engine compiles into an independent .so file, stored in the plugin/opskernel/ directory:

SO File	Engine
`libfe.so`	AI Core / Vector Core fusion engine
`libge_local_engine.so`	GE local engine
`librts_engine.so`	RTS engine
`libaicpu_ascend_engine.so`	AI CPU Ascend engine
`libhost_cpu_engine.so`	Host CPU engine
`libaicpu_tf_engine.so`	AI CPU TF engine
`libffts.so`	FFTS+ engine
`libdvpp_engine.so`	DVPP engine
`libhcom_graph_adaptor.so`	HCCL engine

Each SO exports standard C interfaces: Initialize(), GetOpsKernelInfoStores(), GetGraphOptimizerObjs(), Finalize(). OpsKernelManager::Initialize() dynamically loads these SOs through PluginManager and calls interfaces to obtain component objects.

This two-layer separation design decouples engine discovery from engine capability. DNNEngine is responsible for registering identity, and OpsKernelInfoStore, GraphOptimizer, and OpsKernelBuilder are responsible for actual work.

4.3.2 Runtime V2 Engine Registration (NodeConverter)

V2 runtime registers node converters through REGISTER_NODE_CONVERTER and REGISTER_NODE_CONVERTER_PLACEMENT macros:

// Registration macro definition (inc/graph_metadef/register/node_converter_registry.h)
REGISTER_NODE_CONVERTER(type, func)
REGISTER_NODE_CONVERTER_PLACEMENT(type, placement, func)

Typical registration examples:

// AI Core engine, device side
REGISTER_NODE_CONVERTER_PLACEMENT(ge::kEngineNameAiCore.c_str(), kOnDeviceHbm, LoweringAiCoreNode);
// AI CPU engine, device side
REGISTER_NODE_CONVERTER_PLACEMENT(ge::kEngineNameAiCpu.c_str(), kOnDeviceHbm, LoweringAiCpuNode);
// Host CPU engine, host side
REGISTER_NODE_CONVERTER_PLACEMENT(ge::kEngineNameHostCpu.c_str(), kOnHost, LoweringAiCpuNode);

NodeConverterRegistry (singleton) manages the global registry and completes node-to-kernel mapping through GraphConverter at compile time.

4.4 Engine Management and Scheduling

4.4.1 Compile-Time Managers

DNNEngineManager (compiler/engines/manager/engine_manager/dnnengine_manager.h) is the core of compile-time engine management, using singleton pattern:

Core Method	Responsibility
`Initialize()`	Loads engine SOs, calls `GetDNNEngineObjs()` to register all DNNEngines, parses `engine_conf.json` configuration
`GetDNNEngineName(node, exclude_engines)`	Selects engine for operator - iterates through OpInfo list, calls `CheckSupported()`, returns the first supported engine
`GetExcludeEngines()`	Excludes specific engines according to configuration (`CORE_TYPE` mutual exclusion + `EXCLUDE_ENGINES` list)
`GetCompositeEngineName()`	Checks whether subgraph belongs entirely to the same composite engine
`IsStreamAssignSkip()`	Decides whether to skip stream assignment according to engine configuration

OpsKernelManager (compiler/engines/manager/opskernel_manager/ops_kernel_manager.h) connects engines and operator kernel libraries:

Core Method	Responsibility
`Initialize()`	Loads plugin SOs, collects all OpsKernelInfoStores and GraphOptimizers
`InitOpsKernelInfo()`	Builds global operator information index (sorted by `compute_cost` ascending)
`GetAllOpsKernelInfo()`	Queries OpInfo list of an operator type in all engines

engine_conf.json (compiler/engines/manager/engine_manager/engine_conf.json) defines scheduling configuration:

Configuration Item	Meaning
`independent`	Whether it needs an independent stream (only HCCL engine is true)
`attach`	Whether to attach to other streams
`skip_assign_stream`	Whether to skip stream assignment

4.4.2 Engine Assignment (EnginePlacer)

EnginePlacer (compiler/graph/partition/engine_place.h) assigns engines to each node in the graph. Core workflow:

flowchart TD
    A["EnginePlacer::Run()"] --> B["Multi-threaded concurrent execution of SelectEngine() for each node"]
    B --> C{Node already has specified engine?}
    C -->|Yes| D["Use already specified engine"]
    C -->|No| E["DNNEngineManager::GetDNNEngineName()"]
    E --> F["Get all OpInfo for this operator type"]
    F --> G["Iterate by compute_cost ascending"]
    G --> H["Check if in exclusion list"]
    H --> I["Call OpsKernelInfoStore::CheckSupported()"]
    I --> J{Supported?}
    J -->|Yes| K["Select this engine"]
    J -->|No| G
    A --> L["UpdateOpDescsWithOpInfos()"]
    L --> M["ReAssignEngine()"]
    M --> N["DynamicDataFlowEngineReassignPass"]
    M --> O["HostcpuEngineUpdatePass"]

Engine selection uses a greedy strategy - iterates from high to low priority, and the first engine that passes CheckSupported() is selected. Uses a thread pool (default 16 threads) to concurrently select engines for nodes, protects shared data through mutex.

Engine reassignment (ReAssignEngine()) implements through strategy pattern. EngineReAssignPass is the strategy interface, currently with two implementations:

DynamicDataFlowEngineReassignPass: Engine reassignment in dynamic data flow scenarios
HostcpuEngineUpdatePass: Reassigns operators marked with hostExecFlag to Host CPU engine

4.4.3 Partitioning Subgraphs by Engine (EnginePartitioner)

EnginePartitioner (compiler/graph/partition/engine_partitioner.h) partitions the computational graph into subgraphs according to engine assignment results:

flowchart TD
    A["StagePartition<br/>Partition by training stage"] --> B["EnginePlacer<br/>Engine assignment"]
    B --> C["HostcpuEngineUpdatePass<br/>Mark Host CPU engine"]
    C --> D["DynamicShapePartition<br/>Dynamic shape partitioning"]
    D --> E["CompositeEngine Partition<br/>Composite engine partitioning"]
    E --> F["AtomicEngine Partition<br/>Atomic engine partitioning"]
    F --> G["VerifyCommNodesOrder<br/>Verify communication node order"]

Partitioning uses a Cluster-Based Greedy Algorithm:

Initialization: Create a Cluster for each node
Marking: Mark the engine for each Cluster according to engine assignment results
Merging: If adjacent Clusters belong to the same engine and do not have multiple data paths (HasSecondPath check), merge them
Splitting: Insert PlaceHolder and End nodes at the boundaries of merged Clusters

The design reason for two-level partitioning (Composite + Atomic): Fusion engines (for example, FFTS+) need to see the complete fusion region first for fusion optimization. Atomic engine partitioning performs fine-grained partitioning only after fusion optimization.

After partitioning, each subgraph is assigned to different engines. The thread pool executes engine-specific subgraph optimization (OptimizeFusedGraph) in parallel.

4.5 Implementation Details of Various Engines

4.5.1 nn_engine (AI Core / Vector Core Fusion Engine)

Directory: compiler/engines/nn_engine/

SO File: libfe.so

This is the most core and complex engine, responsible for all dense computing on AI Core and Vector Core. Implementation contains three major components:

Component	File	Responsibility
`FEOpsKernelInfoStore`	`optimizer/ops_kernel_store/fe_ops_kernel_info_store.h`	Operator information management (loads operator descriptions from JSON), `CheckSupported()` checks support, `CompileOp()` calls TBE compiler
`FEGraphOptimizer`	`optimizer/graph_optimizer/fe_graph_optimizer.h`	Multi-stage graph optimization (original graph optimization, post-fusion optimization, stream graph optimization, full graph optimization), includes subcomponents such as FormatDtypeSetter, OpCompiler, TransNodeManager, BufferFusion
`AICoreOpsKernelBuilder`	`optimizer/ops_kernel_builder/aicore_ops_kernel_builder.h`	Calculates runtime parameters (workspace size), generates device execution tasks

Directory structure:

nn_engine/
  fusion/            -- Fusion rule management
  opskernel/         -- OpsKernelInfoStore implementation
  optimizer/
    graph_optimizer/  -- Core graph optimizer and sub-optimizers
    ops_kernel_store/ -- FE OpsKernelInfoStore
    ops_kernel_builder/ -- FE OpsKernelBuilder
    fusion_manager/   -- Fusion management
    cmo/              -- Cache Management Optimization
  utils/             -- Utility functions

4.5.2 cpu_engine (AICPU / HostCpu)

Directory: compiler/engines/cpu_engine/

SO Files: libaicpu_ascend_engine.so, libhost_cpu_engine.so, libaicpu_tf_engine.so

Architecture characteristics:

BaseEngine (common/engine/base_engine.h) is the internal base class for CPU engines, different from the public layer DNNEngine. Each CPU engine contains three components: AicpuOpsKernelInfoStore, AicpuGraphOptimizer, AicpuOpsKernelBuilder
Uses factory pattern (FACTORY_ENGINE::Register<CLASS> macro) to register sub-engines
Exports standard interfaces through extern "C"

Sub-engines:

AicpuEngine (cpu_engine/aicpu_engine/): Executes Ascend native operators on device-side AI CPU
HostCpuEngine (cpu_engine/hostcpu_engine/): Executes operators on host CPU, runtime_type = HOST, lowest priority
TfEngine (tf_engine/): Executes TensorFlow-format AI CPU operators

4.5.3 hccl_engine (HCCL Collective Communication Engine)

Directory: compiler/engines/hccl_engine/

SO File: libhcom_graph_adaptor.so

Handles collective communication operations in distributed training (AllReduce, Broadcast, AllGather, ReduceScatter, and so on)
Only engine that uses independent stream (independent: true in engine_conf.json), does not share stream with other engines
Supports communication fusion optimization (hcom_broadcast_fusion, hcom_reduce_fusion, and so on)
Includes auto-tuning module (auto_tuning/)

4.5.4 dvpp_engine (DVPP Digital Visual Preprocessing Engine)

Directory: compiler/engines/dvpp_engine/

SO File: libdvpp_engine.so

Handles preprocessing operations such as image/video decoding, scaling, color conversion
Has different implementations by chip model (for example, ascend910b/), performs capability adaptation through DvppChipCapability
Contains three standard components: DvppOpsKernelInfoStore, DvppGraphOptimizer, DvppOpsKernelBuilder

4.5.5 rts_engine (RTS Runtime Service Engine)

Directory: compiler/engines/rts_engine/

SO File: librts_engine.so

Handles runtime control flow operators (StreamSwitch, StreamActive, Label, and so on)
Contains two OpsKernelInfoStores: regular RTS and FFTS+ mode
Configured as skip_assign_stream: true, attach: true (attaches to other streams, does not need independent stream assignment)

4.5.6 local_engine (GeLocal Fallback Engine)

Directory: compiler/engines/local_engine/

SO File: libge_local_engine.so

Handles operators that do not belong to any dedicated engine (for example, NetOutput, NoOp, Const, and so on)
Priority COST_9 (second lowest), serves as fallback engine
OpFactory manages various local operators
Configured as skip_assign_stream: true, attach: true

4.5.7 ffts_engine (FFTS+ Fusion Engine)

Directory: compiler/engines/ffts_engine/

SO File: libffts.so

Only composite engine (atomic_engine_flag = false), priority COST_1
Unified management of cross-atomic-engine operator fusion scheduling in subgraphs
Task builders categorized by context type: AIC+AIV mixed, AICPU, DSA, RuntimeOps, CollectionOps
Configured as skip_assign_stream: true (FFTS+ manages streams itself)
Declares its contained atomic engine set through GetCompositeEngines()

4.5.8 custom_engine (Custom Operator Engine)

Directory: compiler/engines/custom_engine/

Highest priority (COST_0), ensures user-defined custom operators are selected first
Creates directly in OpsKernelManager::Initialize(), does not load from SO file
CustomOpsKernelInfoStore maintains OpInfo mapping for custom operators

4.5.9 HostCpuEngine (Base Layer Basic Implementation)

Directory: base/host_cpu_engine/

Files: host_cpu_engine.h / host_cpu_engine.cc

Complements the compiler-side HostCpuEngine (BaseEngine subclass), provides more basic host CPU execution capability
Singleton pattern (HostCpuEngine::GetInstance())
Dynamically loads libconstant_folding_ops.so and libops_host_cpu.so through dlopen
Used for compile-time constant folding and runtime Host CPU operator execution

4.6 Runtime Engine Execution

4.6.1 Runtime V2 (ExeGraph Execution Engine)

V2 adopts compile-time Lowering + runtime direct execution mode.

Core difference: V1 dynamically dispatches NodeExecutor at runtime. V2 lowers computational graph nodes to executable ExeGraph through NodeConverter at compile time. Runtime directly executes kernel function pointers according to ExeGraph topology order, eliminating runtime dispatch overhead.

Standard Lowering Process of NodeConverter:

InferShape — Infers output shape
AllocOutputMemory — Allocates output memory
AllocWorkspace — Allocates workspace
Build Launch Graph — Builds execution graph nodes for launching kernel
FreeWorkspace — Frees workspace
Returns LowerResult (contains order_holders, out_shapes, out_addrs)

Lowering of AI Core Engine is more complex, additionally involves:

Tiling data calculation (runtime/v2/graph_builder/bg_tiling.h)
Kernel compilation result lookup
Atomic operation handling
FFTS Plus subtask flushing

V2 Executor Types (runtime/v2/core/executor/):

Executor	Description
`SequentialExecutor`	Sequential execution (static graph)
`TopologicalExecutor`	Topological sort execution (dynamic graph)
`MultiThreadTopologicalExecutor`	Multi-threaded topological execution
`PriorityTopologicalExecutor`	Priority topological execution

StreamExecutor (runtime/v2/core/stream_executor.cc) creates independent ModelV2Executor for each stream, implements stream-level execution isolation.

Implementation of Host CPU in V2: Host CPU and AICPU share the same Lowering function LoweringAiCpuNode, distinguishes execution location through placement parameter (kOnHost vs kOnDeviceHbm). This design reduces code duplication and unifies Lowering logic for CPU-type operators.

4.7 Complete Position of Engine in Compilation Flow

flowchart TD
    subgraph Init["Initialization Phase"]
        I1["GELib::Initialize()"]
        I1 --> I2["DNNEngineManager::Initialize()<br/>Load engine SOs, register all DNNEngines"]
        I1 --> I3["OpsKernelManager::Initialize()<br/>Load OpsKernel SOs, get InfoStore and Optimizer"]
        I1 --> I4["OpsKernelBuilderManager::Initialize()<br/>Load operator compilation builders"]
    end

    subgraph Compile["Compilation Phase"]
        C1["Graph preprocessing"] --> C2["Each engine GraphOptimizer::OptimizeOriginalGraph()"]
        C2 --> C3["EnginePlacer::Run()<br/>Multi-threaded engine selection for each node"]
        C3 --> C4["Engine reassignment Passes"]
        C4 --> C5["EnginePartitioner<br/>Partition into subgraphs by engine"]
        C5 --> C6["Each engine GraphOptimizer::OptimizeFusedGraph()<br/>Multi-threaded parallel subgraph optimization"]
        C6 --> C7["Subgraph merging"]
        C7 --> C8["OpsKernelInfoStore::CompileOp()<br/>Operator compilation"]
        C8 --> C9["OpsKernelBuilder::CalcOpRunningParam()<br/>Calculate runtime parameters"]
        C9 --> C10["OpsKernelBuilder::GenerateTask()<br/>Task generation"]
        C10 --> C11["GeRootModel compilation output"]
    end

    subgraph Execute["Execution Phase"]
        E1["Load GeRootModel"]
        E1 --> E2["V1: NodeExecutorManager dispatch"]
        E1 --> E3["V2: NodeConverterRegistry Lowering"]
        E2 --> E4["Asynchronous execution"]
        E3 --> E5["ExeGraph direct execution"]
    end

    Init --> Compile --> Execute

4.8 Atomic Engine and Composite Engine

GE divides engines into two roles:

Atomic Engine: atomic_engine_flag = true, directly handles individual operators. Most engines are atomic engines.
Composite Engine: atomic_engine_flag = false (currently only ffts_plus), unified management of multiple atomic engines, supports cross-engine operator fusion scheduling.

OpsKernelManager maintains atomic_2_composite_ mapping table, maps atomic engines to their belonging composite engines. During engine partitioning, first partitions by composite engine (lets composite engine see the complete fusion region), then performs fine-grained partitioning by atomic engine. | FEGraphOptimizer | optimizer/graph_optimizer/fe_graph_optimizer.h | Multi-stage graph optimization (original graph optimization, post-fusion optimization, stream graph optimization, full graph optimization), includes subcomponents such as FormatDtypeSetter, OpCompiler, TransNodeManager, BufferFusion |