Runtime Execution System — Bridge from OM File to Silicon Chip Computation
Introduces how GE runtime completes complete execution loop from "model loading" to "task sinking" to "result return".
1. Global Architecture: Dual Version Coexisting Design
GE runtime simultaneously maintains v1 and v2 two execution architectures, this is conscious evolution strategy.
1.1 v1 Architecture: Static Shape Executor
v1 is current production主力, undertakes static shape model loading and execution responsibilities. Its core directory structure is:
runtime/v1/
├── graph/
│ ├── load/ # Model loading entry
│ │ ├── graph_loader.cc # Loading facade (Facade pattern)
│ │ └── model_manager/ # Model management core
│ │ ├── model_manager.cc # Global singleton model registry
│ │ ├── davinci_model.cc # Single model instance (core)
│ │ └── task_info/ # Various Task dispatch implementation
│ ├── execute/ # Model execution entry
│ │ ├── graph_executor.cc # Sync/async execution facade
│ │ └── model_executor.cc # Session-level execution coordinator
│ └── manager/ # Memory management
│ ├── mem_manager.cc
│ ├── caching_allocator.cc
│ └── session_scope_mem_allocator.cc
├── hybrid/ # Dynamic shape hybrid execution mode
├── single_op/ # Single operator execution mode
└── opskernel_executor/ # Operator kernel executor
1.2 v2 Architecture: Dynamic Shape Executor
v2 is next-generation runtime, design goal is through Lowering (converting ComputeGraph to ExecuteGraph) achieve more streamlined execution path:
runtime/v2/
├── core/
│ ├── model_v2_executor.cc # v2 model executor
│ ├── stream_executor.cc # Stream-level executor management
│ └── executor/ # Various execution strategies
│ ├── sequential/ # Sequential execution (C language implementation)
│ ├── topological/ # Topological sorting execution
│ └── multi_thread_topological/ # Multi-thread topological execution
├── lowering/ # ComputeGraph → ExecuteGraph conversion
├── engine/ # Engine adaptation layer (aicore/aicpu/dvpp...)
└── kernel/ # Kernel registration and execution
1.3 v1 and v2 Architecture Differences
| Dimension | v1 | v2 |
|---|---|---|
| Core Abstraction | DavinciModel + TaskDef | ExecuteGraph + Node/Kernel |
| Execution Model | rtModelExecute (Hardware Sink) | Host sequential/topological execution |
| Applicable Scenarios | Static shape models (Sink mode) | Dynamic shape / single operator |
| Memory Management | Segmented (FM/Weight/Var) | Unified Allocator |
| Code Language | C++ (heavy runtime) | C core + C++ kernel (pursuing ultimate performance) |
v1 guarantees existing OM model compatibility, v2 provides more flexible infrastructure for dynamic shape scenarios.
2. v1 Model Loading: Mapping from OM to Device
2.1 Loading Flow Overview
flowchart TD
A[GraphLoader::LoadModelOnline<br/>Facade Entry] --> B[ModelManager::LoadModelOnline<br/>Global Singleton Registry]
B --> C{Hybrid Model?}
C -->|No| D[Create DavinciModel]
D --> D1[InitModelMem<br/>Allocate FM/Weight/Var Memory]
D1 --> D2[InitIoNodes<br/>Zero-Copy I/O Mapping]
D2 --> D3[InitNodes<br/>Operator Engine Initialization]
D3 --> D4[DoTaskSink<br/>Task Sink to Device]
D4 --> D5[aclmdlRIBuildEnd<br/>Notify Device Build Complete]
C -->|Yes| E[Create HybridDavinciModel<br/>Dynamic Shape Path]
DavinciModel is v1 core, completes memory mapping, I/O initialization, operator initialization and task sinking through six-stage pipeline.
flowchart TD
A[GraphLoader::LoadModelOnline] --> B[ModelExecutor::ModelLoad]
B --> C{Check Memory/Stream/Event Resources}
C -->|Insufficient Resources| D[Unload Existing Models Release Resources]
C -->|Sufficient Resources| E[MallocFixedFeatureMemoryIfNeed]
D --> E
E --> F[GraphLoader::LoadModelOnline]
F --> G[ModelManager::LoadModelOnline]
G --> H{Is Hybrid Model?}
H -->|Yes| I[Create HybridDavinciModel]
H -->|No| J[Create DavinciModel and Init]
J --> K[DoTaskSink]
K --> L[DistributeTask]
L --> M[aclmdlRIBuildEnd]
M --> N[Model Ready]
I --> I1[Init Hybrid Model]
I1 --> N
2.2 GraphLoader: Minimalist Facade
GraphLoader (runtime/v1/graph/load/graph_loader.cc) is a pure Facade pattern — all methods are static methods, directly forwarding to ModelManager::GetInstance(). This design has a benefit:
Decoupling loading protocol and internal implementation: Upper layer (Session, ACL) only needs to know "load a model", does not need to know ModelManager's existence.
Key loading entries:
LoadModelOnline: Online mode, directly load from GeRootModel in memoryLoadModelFromData: Offline mode, load from serialized ModelDataLoadModelWithQ: Queue mode, bind input/output queues (for data flow scenarios)
2.3 ModelManager: Global Model Registry
ModelManager (runtime/v1/graph/load/model_manager/model_manager.h) is a process-level singleton, maintains two parallel model registries:
model_map_: map<uint32_t, shared_ptr<DavinciModel>> # Static/known shape models
hybrid_model_map_: map<uint32_t, shared_ptr<HybridDavinciModel>> # Dynamic shape models
Execution paths for two model types are completely different — DavinciModel goes through rtModelExecute (hardware Sink), HybridDavinciModel goes through Host-side subgraph scheduling. Separate storage allows each path to achieve zero-overhead abstraction.
ModelManager responsibilities:
- AICPU Kernel Lifecycle Management: Load/unload custom AICPU SO libraries
- Weight Sharing: Through
weights_mem_ids_to_addr_info_support multiple models sharing same weight memory - Session Binding:
sess_id_to_device_ids_tracks Session and device mapping relationship - Resource Cleanup: Clean all runtime resources during model unload
2.4 DavinciModel: Model's Runtime Avatar
DavinciModel (runtime/v1/graph/load/model_manager/davinci_model.h) is v1 runtime's absolute core — it transforms compilation product (GeModel) into executable state, and manages entire execution lifecycle.
2.4.1 Initialization Flow
DavinciModel's Init process is a carefully orchestrated six-stage pipeline:
InitModelMem → InitIoNodes → TransAllVarData → InitNodes → DoTaskSink → ...
Stage One: InitModelMem — Memory Mapping
GE has three device memory types needing management:
- Feature Map Memory (
mem_base_): Operator input/output workspace, corresponds to compile-time calculatedruntime_param_.mem_size - Weight Memory (
weights_mem_base_): Model parameters (Constant, Variable initial values) - Variable Memory (
var_mem_base_): Mutable parameters in training scenarios
Memory sources have two types:
- GE self-allocates (
MallocFeatureMapMem→aclrtMalloc) - External provides (user sets through
SetFeatureMemoryBase)
In multi-model co-deployment scenarios, user may wish to uniformly manage device memory, avoiding memory fragmentation caused by GE internal allocation.
Stage Two: InitIoNodes — I/O Node Initialization
Traverse Data and NetOutput nodes, establish input/output address mapping. Core is Zero-Copy mechanism:
User Tensor Address → ZeroCopyOffset → Model Internal Args Address
Zero-Copy eliminates additional copy of input/output data by directly writing user Tensor address into model internal Args table.
Stage Three: InitNodes — Operator Node Initialization
Execute engine-specific initialization for each node:
- TBE operators: Register Kernel Handle (
InitTbeHandle) - HCCL operators: Collect communication stream information
- LabelSet/StreamSwitch: Control flow hardware resource allocation
Stage Four: DoTaskSink — Task Sinking (Core!)
This is Sink mode's core implementation:
flowchart TD
A[DoTaskSink] --> B[BindModelStream]
B --> C[InitTaskInfo]
C --> D[LoadWithQueue]
D --> E[DistributeTask]
E --> F[aclmdlRIBuildEnd]
F --> G[LoadWithHardwareQueue]
- BindModelStream: Bind all logical streams to rtModel handle. On Ascend hardware, one rtModel contains multiple rtStreams, Tasks on each Stream can parallel execute.
- InitTaskInfo + DistributeTask: Traverse all TaskDefs (Kernel, Hccl, FftsPlus etc.) in ModelTaskDef, create corresponding TaskInfo object for each Task and call
Distribute()to distribute to device. - aclmdlRIBuildEnd: Notify underlying runtime "model build complete", after this model can be executed by
rtModelExecute.
TaskSink pre-loads all Tasks to device, Host only needs one rtModelExecute call. For models with thousands of operators, this avoids Host scheduling becoming bottleneck.
3. v1 Model Execution: Sink Mode's Two Trigger Ways
Sink mode is GE's core runtime optimization mechanism.
Traditional Host Scheduling: Host dispatch operator by operator → Device execute → Host dispatch next → ... (N interactions)
Sink Mode: Host one launch → Device autonomously executes all Tasks (1 interaction)
GE's TaskSink serializes Task sequence to OM file at compile time, runtime only needs one call to trigger all Task execution on device side.
3.1 Two Execution Modes
GE v1's Sink mode supports two trigger ways, corresponding to different usage scenarios:
flowchart LR
subgraph RunLoop["Run Loop Mode (Resident Execution)"]
S1[Independent Thread Loop] --> S2[Get Data from Queue]
S2 --> S3[rtModelExecute]
S3 --> S4[rtStreamSynchronize]
end
subgraph NnExecute["NnExecute Mode (On-demand Execution)"]
H1[Caller Thread] --> H2[CopyModelData]
H2 --> H3[rtModelExecute]
H3 --> H4[rtStreamSynchronize]
H4 --> H5[CopyOutputData]
end
3.2 Run Loop Mode: Independent Thread Resident Execution
DavinciModel::Run() loops execution in independent thread:
Loop {
1. Get input data from DataInputer queue
2. HandleInputData: Write input address to model Args table
3. rtModelExecute: One call, triggers all Task execution on device side
4. rtStreamSynchronizeWithTimeout: Wait for device completion
5. AssembleListenerOutput: Assemble output
6. Callback notification completion
}
Key Design Decisions:
-
Independent Thread:
ModelRunStart()creates a dedicated thread executing Run loop. Model is "resident" — after loading continuously receives data and executes, untilModelRunStop(). Thread and DataInputer queue cooperation, achieves producer-consumer pattern. -
Timeout Mechanism:
rtStreamSynchronizeWithTimeoutsupports configuring timeout. After timeout will callaclmdlRIAbortto abort model execution, avoiding device deadlock. -
Error Propagation: Device-side error propagates to Host through
rtStreamSynchronizeWithTimeoutreturn value. Special return codes likekSinkModelEndOfSequence(sequence end) andkSinkModelAbortNormal(normal abort) have specific meanings.
GE's TaskSink at compile time has already serialized Task to OM file, runtime through rtModelExecute one call can trigger all Task execution on device side.
3.3 NnExecute Mode: Caller Thread On-demand Execution
DavinciModel::NnExecute() is execution mode actively triggered by caller thread, each inference is synchronously or asynchronously initiated by caller:
NnExecute(stream, async_mode, input_tensor, output_tensor):
1. InitModelStream: Initialize or reuse execution stream
2. CopyModelData: Map input Tensor address to model Args (Zero-Copy) or copy
3. rtModelExecute: Dispatch model execution
4. rtStreamSynchronizeWithTimeout: Wait for completion (if non-async)
5. CopyOutputData: Copy output from model internal address to user Tensor
6. UpdateOutputTensorShape: If dynamic shape, update output shape
When using forbidden stream and timeout is set, will call rtModelExecuteSync interface, its internal will do stream sync, timeout will abort model.
Run Loop Mode vs NnExecute Mode differences:
| Dimension | Run Loop Mode | NnExecute Mode |
|---|---|---|
| Trigger Way | Independent thread loop, get data from queue | Caller thread active call |
| Thread Model | Independent thread | Caller thread |
| Data Copy | Pass through DataInputer queue | CopyModelData + CopyOutputData |
| Applicable Scenario | High throughput inference service | Interactive inference/small batch |
Both methods' underlying execution depends on rtModelExecute, belongs to Sink mode's different usage forms.
3.4 API Entry and Internal Execution Function Correspondence
External API through different paths finally reaches Run() or NnExecute():
┌──────────────────────────────────────────────────────────────┐
│ API Entry Layer │
├────────────────────┬─────────────────────────────────────────┤
│ ACL Layer │ GE Session Layer │
│ aclmdlExecuteV2 │ GeSession::RunGraph │
│ aclmdlExecuteAsyncV2 │ GeSession::RunGraphAsync │
│ │ GeSession::RunGraphAsyncWithStream │
└─────────┬──────────┴──────────────┬──────────────────────────┘
│ │
▼ ▼
NnExecute() Run() or NnExecute()
(Caller thread) (Depends on execution path)
Paths to NnExecute() entry:
| API | Call Chain | Description |
|---|---|---|
aclmdlExecuteV2 |
GeExecutor::ExecModel → GraphLoader::ExecuteModel → ModelManager::ExecuteModel → NnExecute |
Sync execution, caller thread blocks waiting |
aclmdlExecuteAsyncV2 |
GeExecutor::ExecModel(async_mode=true) → ModelManager::ExecuteModel → NnExecute(async_mode=true) |
Async execution |
GeSession::RunGraph |
GraphManager::RunGraph → ModelExecutor::RunGraph → GraphExecutor::ExecuteGraph → ModelManager::syncExecuteModel → NnExecute |
Note: Sync path actually goes NnExecute, not Run |
GeSession::RunGraphAsyncWithStream |
ModelExecutor::ExecuteGraphWithStream → ModelManager::ExecuteModelWithStreamAsync → NnExecute |
Async execution with specified Stream |
Paths to Run() entry:
| API | Call Chain | Description |
|---|---|---|
GeSession::RunGraphAsync |
GraphManager::RunGraphAsync → queue push → ModelExecutor::RunThread → GraphExecutor::ExecuteGraphAsync → ModelManager::DataInputTensor → model->Push(args) → data_inputer_ queue → Run() pops from queue and executes |
Only entry that truly goes Run loop |
A notable design detail: GeSession::RunGraph although name suggests "run graph", actually goes NnExecute rather than Run() loop. The one that truly drives Run() background thread through data_inputer_ queue is only GeSession::RunGraphAsync.
3.5 GraphExecutor and ModelExecutor Division of Work
GraphExecutor (runtime/v1/graph/execute/graph_executor.cc) and ModelExecutor (runtime/v1/graph/execute/model_executor.cc) constitute execution layer's dual-layer abstraction:
GraphExecutor — Pure execution proxy:
- All methods are static or const methods
- Does not hold state, directly forwards to
ModelManager - Responsibilities: Sync execution (
ExecuteGraph), async execution (ExecuteGraphAsync), stream-level execution (ExecuteGraphWithStream)
ModelExecutor — Session-level execution coordinator:
- Inherits from
Executorbase class - Holds
GraphNoderegistry (graph_nodes_) - Manages resource recycling (memory, stream, event)
- Supports async execution thread (
RunThread+run_args_q_)
ModelExecutor needs to handle "loading decision" — when resources are insufficient, needs to unload existing models to make space. This logic is decoupled from pure execution logic, enabling GraphExecutor to focus on execution path.
3.5.1 Resource Recycling Strategy
ModelExecutor::CheckAndReleaseMemory demonstrates a graceful degradation strategy:
1. Check if free memory is sufficient to load new model
2. If not, traverse all loaded models
3. For each model check if contains HCCL Task (communication operators cannot unload)
4. If does not contain, unload that model to release resources
5. Re-check free memory
6. If still not enough, continue unloading next model
Same logic also applies to event (CheckAndReleaseEvent) resource recycling. This design ensures in device resource-limited situations, system still can run new models through "replace old with new" way.
HCCL (Huawei Collective Communication Library) involves cross-device communication, unloading will破坏 communication topology integrity. This is an important constraint in distributed training scenarios.
4. v2 Architecture: New Generation Runtime Based on Lowering
4.1 Design Philosophy: Compilation is Execution Preparation
v2's core idea is Lowering — converting high-layer ComputeGraph to low-layer ExecuteGraph, enabling runtime to only need an extremely simple execution loop.
4.2 Lowering: From ComputeGraph to ExecuteGraph
GraphConverter::ConvertComputeGraphToExecuteGraph (runtime/v2/lowering/graph_converter.cc) is v2's core conversion flow:
flowchart TD
A[ComputeGraph] --> B[CreateInitNode]
B --> C[CreateDeInitNode]
C --> D[CreateMainNode]
D --> D1[LoweringExecuteArgFeeds: Stream/Event/Notify]
D1 --> D2[LoweringStreamResources]
D2 --> D3[LoweringComputeGraph: Node-by-node conversion]
D3 --> D4[ExpandPartitionedCall]
D4 --> D5[OfflineOptimizer optimization]
D5 --> D6[TopologicalSorting]
D6 --> D7[CalculatePriority]
D7 --> D8[AppendGraphLevelData]
D8 --> E[ExecuteGraph Ready]
Lowering's Core Steps:
- Init Graph Generation: Extract all initialization operations (constant loading, stream allocation, memory allocator creation) into an independent Init subgraph.
- Main Graph Generation: For each ComputeGraph node, find corresponding
NodeConverter(throughNodeConverterRegistry), call its lowering function to generate one or more ExecuteGraph nodes. - Event Synchronization:
LoweringEventSynchandles cross-stream Send/Wait event synchronization. - Optimization:
OfflineOptimizeroptimizes generated ExecuteGraph (such as constant folding, dead code elimination).
v1 at runtime through DistributeTask dispatches Task one by one to device, while v2 at compile time has already converted ComputeGraph to directly executable ExecuteGraph. Runtime does not need any "translation" step.
4.3 ModelV2Executor: Three-stage Lifecycle
ModelV2Executor (runtime/v2/core/model_v2_executor.cc) manages three subgraphs' lifecycle:
Init Graph → Main Graph → DeInit Graph
Load Phase:
- Load and execute Init Graph (memory allocation, stream allocation, constant initialization)
- Unload Init Graph
- Load Main Graph (do not execute, only prepare execution data)
Execute Phase:
- Specify input/output Tensor
- Specify runtime parameters (stream, event, notify, memory allocator)
- Execute Main Graph
UnLoad Phase:
- Unload Main Graph
- Load and execute DeInit Graph (resource cleanup)
- Unload DeInit Graph
v2's executor is pure C implemented sequential execution loop, cannot handle "allocate memory" operations needing interaction with runtime API. Extracting these operations to Init/DeInit subgraphs, can maintain Main Graph's purity — Main Graph only contains pure computation nodes.
4.4 Sequential Executor: Extremely Concise Execution Engine
v2's core execution engine (runtime/v2/core/executor/sequential/executor/sequential_executor.c) adopts extremely minimalist execution loop:
KernelStatus SequentialExecute(void *arg) {
SequentialExecutionData *execution_data = (SequentialExecutionData *)arg;
for (size_t i = 0U; i < execution_data->node_num; ++i) {
Node *node = execution_data->nodes[i];
KernelStatus ret = node->func(&(node->context));
if (ret != kStatusSuccess) { return ret; }
}
return kStatusSuccess;
}
Reasons for Using C Implementation:
- No Runtime Overhead: C language has no implicit overhead like exception handling, RTTI, virtual function table. This loop executes thousands of times per inference, every nanosecond counts.
- Portability: C code can directly execute on device side (AICORE's DSP), leaving space for future device-side scheduling.
SequentialExecutionData structure is extremely minimalist:
typedef struct {
size_t node_num;
Node **nodes; // Node array (pre-sorted)
size_t input_num;
AsyncAnyValue **input_values; // Input values (set externally)
size_t output_num;
AsyncAnyValue **output_values; // Output values (set externally)
} SequentialExecutionData;
Each Node only contains: node ID, execution function pointer, runtime context. This flattened design eliminates overhead like virtual function calls, pointer chasing.
4.5 StreamExecutor: Multi-stream Concurrency Management
StreamExecutor (runtime/v2/core/stream_executor.cc) creates independent ModelV2Executor instance for each ACL Stream:
streams_to_executor_: map<aclrtStream, unique_ptr<ModelV2Executor>>
Each Stream one Executor is because in async execution mode, multiple Streams may concurrently execute different inference requests of same model. Each Executor maintains its own execution state (input/output binding, iteration count etc.), mutually non-interfering.
This is consistent with Ascend Stream's design philosophy — Stream is ordered queue of device-side operations, different Streams can parallelize.
5. Hybrid Execution: Dynamic Shape Solution
5.1 Why Need Hybrid?
Static shape model's TaskSink path although efficient, but面对 dynamic shape (such as variable length sequence in NLP) is powerless — because different shapes correspond to different Task sequences and memory layouts.
HybridDavinciModel (runtime/v1/hybrid/hybrid_davinci_model.h) is dynamic shape scenario's execution entry. Its existence is determined by ModelManager::IsNeedHybridLoad — when GeRootModel is marked as dynamic shape, goes Hybrid path.
5.2 Hybrid Execution Architecture
Hybrid execution experienced evolution from v1 to v2. v1's HybridModelRtV1Executor built based on SubgraphExecutor, NodeDoneManager and ShapeInferenceEngine, adopting Host-side operator-by-operator scheduling way, this path is no longer evolving. Current Hybrid scenario's main executor is HybridModelRtV2Executor, it reuses v2 runtime's ExecuteGraph infrastructure.
flowchart TD
A[HybridDavinciModel] --> B[HybridModelRtV2Executor]
B --> C[RtV2ExecutorFactory::Create]
C --> D{Is Stage Partition?}
D -->|No| E[RtV2SimpleExecutor]
D -->|Yes| F[RtV2PipelineExecutor]
E --> G[ModelV2Executor]
F --> G
G --> H[Lowering: ComputeGraph → ExecuteGraph]
H --> I[ExecuteGraph Execution Loop]
I --> J[Sequential/Topological Executor]
HybridModelRtV2Executor's Execution Flow:
- Initialization: Create executor through
RtV2ExecutorFactory::Create. If graph containsPartitionedCallnodes (Stage partition), then createRtV2PipelineExecutor; otherwise createRtV2SimpleExecutor. - Lowering: Convert
ComputeGraphinGeRootModeltoExecuteGraphthroughModelConverter. This step is completed at compile time, runtime directly loads. - Execution:
ModelV2Executormanages Init/Main/DeInit three subgraphs' lifecycle, Main Graph executes throughSequentialExecutororTopologicalExecutor.
Different from v1's operator-by-operator Host scheduling, v2's Hybrid executor converts whole graph to ExecuteGraph, runtime only needs to execute minimalist node loop, significantly reducing Host-side scheduling overhead.
6. Single Operator Execution Mode
6.1 Design Background
Single operator mode was initially introduced to support PyTorch running on Ascend devices. PyTorch early adopted dynamic graph execution model, each time only executing single operator. To enable PyTorch to utilize GE's compilation capability, GE introduced single operator execution mode: when PyTorch calls aclopCompileAndExecuteV2 interface, GE will construct Data and NetOutput nodes for this single operator, form a minimized graph, then compile and execute.
6.2 Current Status
Currently PyTorch mainly uses aclnn as single operator execution path, only operators not supporting aclnn will go aclop line. Accordingly, GE's single_op module is also no longer evolving.
SingleOp (runtime/v1/single_op/single_op.h) and DynamicSingleOp provide operator-level execution capability:
- SingleOp: Fixed shape single operator execution. Shape determined at initialization, subsequent execution无需重新编译.
- DynamicSingleOp: Dynamic shape single operator execution. Each execution may pass different shape.
7. Session Management: Model's Lifecycle Context
7.1 Session Hierarchical Structure
flowchart TD
A[GeSession] --> B[InnerSession]
B --> D[GraphManager]
B --> E[ModelExecutor]
D --> F[GraphNode: Graph's Compilation State]
E --> G[GraphNode: Graph's Load/Execution State]
7.2 InnerSession: Session's Core Implementation
InnerSession (api/session/session/inner_session.h) is each user Session's complete context, containing:
- GraphManager: Manages graph compilation (AddGraph → BuildGraph → CompileGraph)
- ModelExecutor: Manages graph loading and execution (LoadGraph → RunGraph)
- Session ID: Globally unique identifier, used for variable management, memory isolation
InnerSession's Lifecycle:
Initialize() → AddGraph() → BuildGraph() / CompileGraph() → RunGraph() → Finalize()
Key Design Decisions:
-
Compilation and Execution Separation:
BuildGraphonly compiles不加载,RunGraphwill trigger first-time loading and execution. This lazy loading strategy avoids unnecessary device resource occupation. -
External Memory Management:
SetGraphConstMemoryBase/UpdateGraphFeatureMemoryBaseallow users to self-manage device memory, GE only负责constructing model on user-provided memory. -
ForkGraph (
inner_session.h): Supports fork a compiled graph, forked graph shares compilation product but can independently load and execute. This is key capability for multi-instance concurrent inference — avoiding repeated compilation.
8. Memory Management: Segmented Strategy
8.1 Memory Partition Model
GE divides device memory into multiple logical segments:
┌─────────────────────────────────────────────────────┐
│ Device Memory │
├──────────┬──────────────┬──────────┬────────────────┤
│ Weights │ Feature Map │ Variable │ Zero-Copy IO │
│ (Fixed) │ (Operator Activation Memory)│ (Training) │ (Model Input Output) │
├──────────┼──────────────┼──────────┴────────────────┤
│ Fixed FM │ Refreshable FM│ │
│ (Non-refreshable)│ (Refreshable) │ │
└──────────┴──────────────┴───────────────────────────┘
9. Multi-stream Parallelism
GE's multi-stream parallelism algorithm based on graph's topology structure and engine type:
- Assign execution engine for each node
- Assign Stream for each node based on topology and engine
- Insert synchronization between different Streams to ensure execution timing
Three parallel scenarios:
- Computation and Communication Parallel: AllReduce and Convolution无依赖can并发
- Different Engine Parallel: AI Core and DVPP can simultaneously work
- Same Engine Internal Parallel: When one operator cannot fully occupy engine, different topology sets can并发
10. Runtime Design Characteristics
| Dimension | GE Runtime v1 | GE Runtime v2 |
|---|---|---|
| Execution Model | TaskSink + Host Scheduling | Host Sequential/Topological Execution |
| Dynamic Shape | Hybrid Subgraph Scheduling (no longer evolving) | ExecuteGraph Node-level |
| Memory Management | Segmented + Zero-Copy | Unified Allocator |
| Multi-stream Parallelism | Multiple rtStream Bind rtModel | Multiple Executor Instances |
| Load/Execution Separation | Yes (Load + NnExecute) | Yes (Load + Execute) |
GE Runtime's Unique Features:
- TaskSink Mode: Pre-load entire execution sequence to device, Host zero scheduling overhead. This is Ascend hardware's unique capability — device-side Task scheduler can autonomously execute pre-loaded Task sequence.
- Dual-version Runtime: v1 pursues极致static performance (Sink mode), v2 pursues flexibility and extensibility (Lowering + pure C executor).
- Multi-engine Heterogeneous Execution: AICore, AICPU, DVPP, HCCE, HostCPU etc. engines协同work in same runtime.
Runtime system maps compiler-produced static execution plan to physical device, achieves extreme execution efficiency in Sink mode, but in dynamic shape scenarios still needs to承受 Host scheduling overhead — this naturally引出requirements for task sequence optimization, stream parallel scheduling, memory reuse etc. key optimization technologies.