| Dump Module | Overall design of the Dump module: architecture layering, RT1.0/RT2.0 adaptation, HCCL processing, dynamic switch |
| External Weight | FileConstant feature: separating weights from the OM, compile-time Const→FileConstant conversion, RT V1/V2 loading flow, memory management, global weight manager |
| Constant Folding | Constant folding optimization: compile-time evaluation of constant expressions, dimension calculation, empty tensor replacement, delayed effect mechanism, multi-stage compilation pipeline |
| Fusion Pattern Pass | Fusion Pattern Pass mechanism: matching, filtering, and replacement in PatternFusionPass / DecomposePass, execution stages, and Python/C++ integration |
| Dynamic Gear | Dynamic gear feature: dynamic batch, dynamic resolution, and arbitrary ND dimension modes; gear enumeration, static subgraph generation, and runtime dispatch |
| Memory Conflict Handling | Memory conflict protection system: semantic read-write conflicts, memory layout conflicts, subgraph address isolation, Inplace reuse conflicts, multi-stream concurrency management |
| Model Cache | Persistence of compilation results: a three-level system of graph compilation cache, JIT compilation cache, and operator model cache; cache hit and invalidation strategies |
| Profiling | Performance collection and observability: layered collection architecture (API/Host/Device), on-demand enablement, unified reporting through msprof |
| SO in OM | Self-contained operator packaging: packs the operator .so files a model depends on into the OM on demand, removing the runtime dependency on OPP operator packages |
| TensorMove Elimination | Elimination of redundant TensorMove nodes: identifies and deletes redundant memory copy nodes (O3 optimization level) |
| Variable Management | Variable lifecycle management: the full flow of registration, memory allocation, format conversion, logical address mapping, and serialization/deserialization |
| Zero Copy | Zero copy feature: input zero copy (eliminates H2D), output zero copy (eliminates D2H/D2D), compile-time planning and runtime execution |
| Concat No Task | Concat contiguous-memory optimization: identifies, at compile time, Concat operators whose inputs are contiguous in memory and marks them as virtual operators so that Task generation and memory movement are skipped |
| GE Local Operator | GE Local engine: dedicated engine for non-computation nodes (Data, Constant, control flow, shape transformation, etc.), with zero runtime computation overhead |
| Engine | Engine system: plugin-based engine architecture, priority-driven automatic selection, compile-time engine registration and partitioning, runtime dispatch |
| Tiling Sink | Tiling sink feature: moves Tiling computation from the Host to AICPU execution on the Device, eliminating Host-Device synchronization overhead |
| Graph Splitter | Graph splitting feature: static/dynamic Shape split, engine-level split, pipeline stage split, JIT incremental split |
| Static Executor | Static subgraph executor: Task Sink pre-dispatch, DavinciModel loading and execution, address refresh in hybrid execution mode |
| Dynamic Executor | RT2.0 dynamic Shape executor: Lowering mechanism, ExecuteGraph, ModelV2Executor, three-subgraph lifecycle, Kernel registration system |
| Stream Allocator | Stream allocation feature: logical stream allocation, synchronization event management, physical stream split, stream activation mechanism |
| InferShape | Shape inference: dual OriginShape/StorageShape system, compile-time InferShapePass, runtime inference nodes, symbolic inference |
| Format Inference | Format inference: OriginFormat anchor propagation, automatic StorageFormat selection, TransData insertion optimization |
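
As a concrete illustration of one entry above, the compile-time evaluation of constant expressions listed under Constant Folding can be sketched as follows. This is a minimal sketch on a hypothetical toy IR: the `Node` class, the `fold_constants` function, and the `Add`/`Mul` op set are assumptions made for illustration and are not the classes or passes used by the actual implementation.

```python
import operator
from dataclasses import dataclass, field
from typing import Dict, List, Optional


# Hypothetical toy IR node; not the real graph or node classes.
@dataclass
class Node:
    name: str
    op: str                                  # "Const", "Data", "Add", or "Mul"
    inputs: List[str] = field(default_factory=list)
    value: Optional[float] = None            # set only for Const nodes


# Ops this sketch knows how to evaluate at compile time (an assumed subset).
FOLDABLE = {"Add": operator.add, "Mul": operator.mul}


def fold_constants(graph: Dict[str, Node]) -> Dict[str, Node]:
    """Return a new graph where foldable nodes with all-Const inputs become Const.

    Assumes the dict is in topological order (inputs appear before their users).
    Const nodes left without users are not removed; that would be a separate
    dead-node cleanup step.
    """
    folded: Dict[str, Node] = {}
    for name, node in graph.items():
        ins = [folded[i] for i in node.inputs]
        if node.op in FOLDABLE and ins and all(i.op == "Const" for i in ins):
            result = ins[0].value
            for other in ins[1:]:
                result = FOLDABLE[node.op](result, other.value)
            # Replace the computation with its precomputed value.
            folded[name] = Node(name, "Const", [], result)
        else:
            folded[name] = node
    return folded


graph = {
    "a": Node("a", "Const", value=2.0),
    "b": Node("b", "Const", value=3.0),
    "x": Node("x", "Data"),
    "ab": Node("ab", "Mul", ["a", "b"]),     # both inputs constant: foldable
    "y": Node("y", "Add", ["ab", "x"]),      # depends on runtime input: kept
}
result = fold_constants(graph)
```

After the pass, `ab` is a `Const` node holding the product of `a` and `b`, while `y` remains an `Add` because one of its inputs is only known at runtime.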