Stream Allocator Feature Analysis

1 Feature Background

Computation tasks on Ascend AI processors are organized and scheduled through "Streams". A stream is a device-side execution queue. Tasks within the same stream execute strictly in order, while tasks between different streams can execute in parallel. The quality of stream allocation directly affects model execution efficiency. Allocating too few streams cannot fully utilize hardware parallel capability, while allocating too many streams brings excessive synchronization overhead (Event/Notify) and resource occupation.

During the process of compiling AscendIR to executable models (OM files), the GE graph compiler needs to complete stream allocation decisions at compilation time. This decision involves three core questions:

Which operators can execute in parallel? This requires deciding based on engine type, data dependency relationships, user annotations, and other information.
How to synchronize between operators executing in parallel? Events/Notify need to be inserted between different streams to ensure data consistency.
How to split when physical stream capacity is limited? A logical stream has an upper limit on the number of tasks it can carry. When exceeded, it needs to be split into multiple physical streams.

The stream allocation feature is designed to systematically solve these problems.

Applicable Scenarios

The stream allocation feature applies to the following typical scenarios:

Scenario	Description
Static Shape Model Compilation	Model input shape is known at compilation time. GE can perform fine-grained multi-stream allocation based on complete graph topology
Dynamic Shape Model Compilation	Model input shape is determined at runtime. GE needs to adopt a more conservative stream allocation strategy
Mixed Engine Model	Model contains operators from different engines such as AI Core, HCCL (collective communication), AI CPU, DVPP. Stream allocation needs to be based on engine characteristics
Training Scenario AllReduce Parallelism	Gradient aggregation (AllReduce) executes in parallel with backward computation to accelerate training
User-defined Stream Allocation	User specifies particular operators to be allocated to particular streams through StreamLabel attribute

2 Overall Architecture

The stream allocation feature spans both compiler and runtime phases, forming a complete pipeline of "logical stream allocation → synchronization event insertion → physical stream splitting → runtime stream creation".

flowchart TB
    subgraph Compiler["Compilation Phase"]
        direction TB
        A[Logical Stream Allocation<br/>LogicalStreamAllocator] --> B{Static or Dynamic?}
        B -->|Static| C[Static Stream Allocation Pass Chain]
        B -->|Dynamic| D[Dynamic Stream Allocation Strategy]
        C --> E[Attached Stream Allocation<br/>AssignAttachedStreamPass]
        D --> E
        E --> F[Synchronization Event Insertion<br/>InsertSyncEvents]
        F --> G[Synchronization Event Optimization<br/>OptimizeSyncEvents]
        G --> H{Device supports unlimited depth stream?}
        H -->|Yes| I[Automatically handled by RTS]
        H -->|No| J[Physical Stream Splitting<br/>SplitStreams]
        J --> K[Stream Activation Update<br/>UpdateActiveStreams]
        K --> L[Synchronization Event Generation<br/>GenerateSyncEventNodes]
    end

    subgraph Runtime["Runtime Phase"]
        direction TB
        M[gert::StreamAllocator<br/>Create rtStream on demand]
        N[ge::ReusableStreamAllocator<br/>Stream reuse pool]
    end

    L --> M
    L --> N

Module Responsibilities

Module	Directory	Responsibility
`StreamAllocator`	`compiler/graph/build/stream/`	Compilation phase stream allocation main entry, coordinating logical stream allocation, synchronization insertion, physical stream splitting
`LogicalStreamAllocator`	`compiler/graph/build/stream/`	Logical stream allocation for static shape, based on Pass chain architecture
`DynamicStreamAllocator`	`compiler/graph/build/stream/`	Stream allocation for dynamic shape, with simpler strategy
`StreamUtils`	`compiler/graph/build/stream/`	Common utility functions for stream allocation
`gert::StreamAllocator`	`inc/framework/runtime/`	Runtime V2 path stream creation interface
`ge::ReusableStreamAllocator`	`runtime/v1/`	Runtime V1 path stream reuse pool

3 External Interfaces

3.1 Compilation Phase API

Compilation phase stream allocation is part of the graph compilation pipeline and is not directly exposed to end users. However, after compilation completes, users can query stream allocation results through the following interfaces.

GetStreamAllocationSummary

Get stream allocation summary information, including logical stream, physical stream, and attached stream allocation status.

Header File: ge/ge_graph_compile_summary.h
Library File: libge_compiler.so
Function Prototype:

Status GetStreamAllocationSummary(
    std::shared_ptr<StreamAllocationSummary> &stream_allocation) const;

The returned StreamAllocationSummary object provides the following query interfaces:

Interface	Description
`GetAllLogicalStreamInfos()`	Get allocation information of all logical streams
`GetUsrStreamLabels()`	Get user stream label list
`GetPhysicalStreamNums()`	Get physical stream count
`GetAttachedStreamIds()`	Get attached stream ID list
`GetHcclFollowedStreamNums()`	Get HCCL followed stream count
`IsAssignedByStreamPass()`	Determine if assigned by StreamPass

LogicalStreamAllocationInfo

Detailed information for each logical stream, including:

Interface	Description
`GetLogicalStreamId()`	Logical stream ID
`GetUsrStreamLabel()`	User stream label
`GetAttachedStreamIds()`	Attached stream ID
`GetPhysicalStreamNum()`	Physical stream count
`GetHcclFollowedStreamNum()`	HCCL followed stream count
`GetAllNodes()`	All nodes on this stream

3.2 Runtime API

gert::StreamAllocator (V2 Path)

Runtime stream creation interface, creating and managing device streams on demand.

Header File: framework/runtime/stream_allocator.h
Core Interface:

namespace gert {
class StreamAllocator {
    // Supports up to 2024 streams
    static constexpr size_t kMaxStreamNum = 2024U;

    StreamAllocator(int32_t priority = RT_STREAM_PRIORITY_DEFAULT,
                    uint32_t flags = RT_STREAM_DEFAULT);
    ~StreamAllocator();

    // Acquire streams on demand, returns continuous vector, auto-creates if insufficient
    TypedContinuousVector<rtStream_t> *AcquireStreams(size_t stream_num) const;
};
}

This interface is called during model loading phase. Based on the stream count determined at compilation time, it creates all required device streams at once. The implementation uses ContinuousVector to pre-allocate maximum capacity (2024 streams), marking actual used stream count through SetSize, avoiding frequent memory allocation.

ge::ReusableStreamAllocator (V1 Path)

Runtime stream reuse pool, used to reuse device streams across models, reducing stream creation/destruction overhead.

Header File: runtime/v1/graph/load/model_manager/reusable_stream_allocator.h
Core Interface:

namespace ge {
class ReusableStreamAllocator {
    static ReusableStreamAllocator *Create();
    Status GetOrCreateRtStream(aclrtStream &stream, uint32_t rt_model_id,
                               int32_t priority, uint32_t stream_flag,
                               uint32_t task_num = 0U);
    Status DestroyStream(aclrtStream &stream, bool is_force_destroy = false);
};
}

ReusableStreamAllocator maintains a stream pool keyed by <priority, stream_flag>, sorted by task_num. When a new model loads, it first searches the existing stream pool for reusable streams, avoiding repeated calls to rtStreamCreate. Each stream tracks the models using it through rt_model_id set, ensuring it does not reuse streams from its own model.

3.3 User Configurable Options

Users can influence stream allocation behavior through the following methods:

Configuration Item	Scope	Description
`SINGLE_STREAM_ENABLE`	Static Shape	Enable single-stream mode, all operators execute on one stream
`AC_PARALLEL_ENABLE`	Dynamic Shape	Values are "0", "1" or empty, controls whether AI CPU and AI Core execute in parallel
`EVENT`	Static Shape	When set to "notify", use Notify instead of Event for synchronization
`STREAM_LABEL` (Node Attribute)	All Scenarios	Operator-level stream label, operators with same label are allocated to same stream
`USER_STREAM_LABEL` (Node Attribute)	All Scenarios	User-level stream label, highest priority
`PARALLEL_GROUP` (Node Attribute)	Static Shape	Parallel group identifier, operators in same group are allocated to independent streams
`ATTACHED_STREAM_INFO` (Node Attribute)	Static Shape	Attached stream information, one node can produce multiple streams

4 Specific Implementation

4.1 Static Shape Logical Stream Allocation

Logical stream allocation under static shape uses a Pass chain architecture, where each Pass is responsible for one type of stream allocation rule, executing in order by priority. The design philosophy of this architecture is "separation of concerns". Each stream allocation logic is independently encapsulated as a Pass. Adding new stream allocation rules only requires adding new Passes, without modifying existing logic.

flowchart LR
    subgraph PassChain["Multi-stream Mode Pass Chain"]
        direction TB
        P1["UpdateForMdeGroupPass<br/>MDE group stream allocation"]
        P2["AssignByLabelPass<br/>Label stream allocation"]
        P3["IndependentStreamPass<br/>Independent engine stream allocation"]
        P4["AssignByDependencyPass<br/>Dependency stream allocation + reuse"]
        P5["NodeStreamUpdatePass<br/>Subgraph → Node mapping"]
        P6["UpdateForParallelGroupPass<br/>Parallel group stream allocation"]
        P7["AllReduceParallelPass<br/>AllReduce parallelism"]
        P8["UpdateForSkippedEnginePass<br/>Skipped engine optimization"]
    end
    P1 --> P2 --> P3 --> P4 --> P5 --> P6 --> P7 --> P8

    subgraph TopoOptimize["Topology Optimization"]
        O1["OptimizeIneffectiveMultiStreamPass<br/>Eliminate ineffective multi-stream"]
    end
    P8 --> O1

    subgraph Attached["Attached Stream"]
        A1["AssignAttachedStreamPass<br/>Node-level attached stream"]
    end
    O1 --> A1

4.1.1 Pass Chain Details

UpdateForMdeGroupPass: Allocates new streams to nodes based on NewStreamId attribute. This is the highest priority stream allocation rule, used to support independent stream requirements for specific operators in MDE (Multi-Data Execution) scenarios.

AssignByLabelPass: Allocates streams based on StreamLabel attribute. Subgraphs with the same StreamLabel are allocated to the same stream, different StreamLabels get new streams. This allows upper-layer compilation optimizations (such as fusion Pass) to guide stream allocation by setting StreamLabel.

IndependentStreamPass: Allocates independent streams to subgraphs of independent engines (such as HCCL). Operators of independent engines need to exclusively occupy a stream and cannot reuse with other engines. Within the same independent engine, subgraphs with the same StreamLabel share streams.

AssignByDependencyPass: The most core stream allocation Pass. It allocates and reuses streams based on data dependency relationships between engine subgraphs. This Pass works as follows:

Traverse all subgraphs without stream allocation
Check if predecessor subgraphs have reusable streams
Reuse if possible, otherwise allocate new stream
Stream reuse requires three conditions: same scheduler_id, not independent engine/tagged stream, no engine conflict

NodeStreamUpdatePass: Maps subgraph-level stream allocation results to node level. Each node gets the stream_id of its belonging subgraph. Specifically, nodes with ATTR_NAME_RTS_LABEL_NODE attribute are allocated to parent stream (instead of subgraph stream), used to support control flow scenarios.

UpdateForParallelGroupPass: Reallocates streams to nodes based on PARALLEL_GROUP attribute. Nodes in the same parallel group are allocated to the same new stream. For HCOM operators, if the parallel group name is "-1" and has only one input, it tries to reuse the input node's stream.

AllReduceParallelPass: When hcom_parallel is enabled, allocates successor non-HCOM nodes of AllReduce operators to new streams, enabling AllReduce to execute in parallel with backward computation. This is a key optimization for training acceleration.

UpdateForSkippedEnginePass: Optimizes node stream allocation in skipped engine subgraphs. For patterns like NodeA(stream1) → Const(stream2) → NodeB(stream1), changes Const node's stream to stream1, reducing unnecessary synchronization events between two streams.

OptimizeIneffectiveMultiStreamPass: Topology optimization Pass, eliminating "nominally multi-stream but actually no parallel benefit" situations. If a node connects to another stream on all input/output directions, and no other nodes exist between input/output nodes on that stream, move the current node to that stream, reducing synchronization overhead.

4.1.2 Attached Stream Allocation

Attached Stream is an additional stream produced by a node besides the main stream. Some operators (such as SuperKernel) require multiple streams to execute different computation tasks. Attached stream allocation occurs after main stream allocation completes.

AssignAttachedStreamPass gets attached stream information through ATTR_NAME_ATTACHED_STREAM_INFO or ATTR_NAME_ATTACHED_STREAM_INFO_LIST attribute, including stream count and reuse_key. Attached streams with the same reuse_key share the same stream, avoiding unnecessary stream creation.

After attached stream allocation completes, total stream count = main stream count + attached stream count.

4.2 Dynamic Shape Stream Allocation

Stream allocation strategy under dynamic shape is more conservative compared to static shape. By default, only one stream is allocated (single-stream mode). Multi-stream is only enabled when configuration allows. This is because dynamic shape graph structure is incomplete at compilation time, preventing precise dependency analysis.

flowchart LR
    subgraph DynPassChain["Dynamic Shape Stream Allocation Strategy"]
        direction TB
        D1["AssignEnginesOwningStream<br/>Stream allocation by engine"]
        D2{ac_parallel_enable?}
        D2 -->|Yes| D3["AssignAicpuCanParallel<br/>AICPU parallel judgment"]
        D2 -->|No| D4["AssignIndependentAicpuNode<br/>Independent AICPU node stream allocation"]
        D3 --> D5["AssignWithReuse<br/>Stream reuse"]
        D4 --> D5
        D5 --> D6["AssignRemainSubgraphNeedAssignStream<br/>Remaining subgraph stream allocation"]
        D6 --> D7["ReassignStreamByStreamLabel<br/>Label reallocation"]
    end
    D1 --> D2
    D7 --> D8["Force main stream nodes<br/>Data/Variable/NetOutput/FILECONSTANT → stream 0"]

Key Differences from Static Shape:

Difference	Static Shape	Dynamic Shape
Default Mode	Multi-stream	Single-stream
Allocation Granularity	Pass chain processing, fine-grained rules	Stream allocation by engine, simple rules
Stream Reuse Strategy	Complex reuse judgment based on dependency	Predecessor/successor subgraph reuse
Node-level Constraints	Few	Data, Variable, NetOutput, FILECONSTANT forced on main stream
Attached Stream	Supported	Supported through independent interface `AssignAttachedResource`
Synchronization Mechanism	Event + Notify dual mode	Event only

4.3 Synchronization Event Management

After stream allocation completes, synchronization events are needed between operators on different streams to ensure correct execution order. Synchronization event management is the most complex part of the stream allocation feature.

4.3.1 Event Types

Type	Description	Applicable Scenario
`kEvent`	Normal Event, Send/Recv pairing	Default mode
`kNotify`	Notify, supports finer-grained synchronization	Enabled through `EVENT=notify` configuration

4.3.2 Event Insertion Rules

The system inserts synchronization events in the following scenarios:

flowchart TB
    subgraph InsertRules["Event Insertion Scenarios"]
        S1["Between adjacent nodes on different streams<br/>Cross-stream data dependency"]
        S2["Between stream split nodes<br/>Same logical stream split into multiple physical streams"]
        S3["Between main and secondary streams<br/>StreamActive and activated stream"]
        S4["Between attached and main streams<br/>ATTR_NAME_ATTACHED_STREAM_INFO dependency"]
        S5["Subgraph boundary<br/>Subgraph End node and parent node successor"]
        S6["Loop scenario<br/>NextIteration and StreamActive"]
    end

The core logic of event insertion is in StreamAllocator::InsertOneEventInTwoNodes. Traverse all data edges and control edges of the entire graph. When two adjacent nodes belong to different streams, insert a pair of Send/Recv events between them.

4.3.3 Event Optimization

After inserting events, the system eliminates redundant events through three optimizations:

OptimizeBySendEvents: Within the same stream, if the event of Send node A already ensures that Recv node C on stream B executes after A, then no additional event is needed between A and C.

OptimizeByRecvEvents: Similarly, eliminate redundancy in the receiving direction.

OptimizeByStreamActivate: Optimize cross-stream events through StreamActive mechanism. When a node on stream A activates stream B through StreamActive, no additional Event is needed from stream A to stream B, because StreamActive itself implies synchronization semantics. This optimization judges through IsRecvNodeActivatedBySendNode method, tracing back along the activation chain to check for activation relationship.

4.3.4 Event Reuse

In multi-dims scenarios, only one dim executes at any moment, so their Events can be reused. ReuseEventForMultiDims method independently numbers Events for each dim, then takes the maximum value as the final Event count. For example:

dim0: event 0, 1, 2, 3  →  0, 1, 2, 3
dim1: event 4, 5, 6, 7, 8  →  0, 1, 2, 3, 4
dim2: event 9, 10, 11  →  0, 1, 2
Final event_num = max(4, 5, 3) = 5

Additionally, operators can explicitly declare Event reuse relationship through ATTR_NAME_EVENT_MULTIPLEXING attribute. The system replaces corresponding event IDs based on this.

4.3.5 Event Continuity Guarantee

RTS (Runtime Service) requires Event IDs to be allocated continuously starting from 0. Therefore, after all optimization and reuse processing completes, the system remaps Event IDs through RefreshContinuousEvents method to ensure continuity. This logic also applies to Notify.

4.4 Physical Stream Splitting

Logical stream allocation does not consider task count limits, but physical streams have an upper limit on task count. Physical stream splitting phase is responsible for splitting logical streams that exceed the limit into multiple physical streams.

flowchart TB
    A[Detect device capability<br/>FEATURE_TYPE_PERSISTENT_STREAM_UNLIMITED_DEPTH] --> B{Supports unlimited depth?}
    B -->|Yes| C[Skip GE-side splitting<br/>Automatically handled by RTS]
    B -->|No| D[Calculate task count for each logical stream]
    D --> E{task count > limit?}
    E -->|Yes| F[Split into new physical stream<br/>Insert split point synchronization event]
    E -->|No| G[Keep as is]
    F --> H[Update stream activation relationship<br/>UpdateActiveStreams]
    H --> I[Generate synchronization event nodes after splitting<br/>GenerateSyncEventNodes]

Split Trigger Conditions (StreamAllocator::NeedSpiltNewStream):

Not the first node of the stream
Current stream task count + reserved count > limit
Node has no subgraph (non-control flow node)
Not the first node of StreamActive
Not control flow label nodes like LabelSet/LabelGotoEx/LabelSwitchByIndex

Items to Handle During Splitting:

Update node's stream_id to new physical stream ID
Insert synchronization events between nodes before and after the split point
Maintain split_stream_id_to_logic_stream_id_ mapping
Handle ContinuousStreamLabel: Nodes with the same label must be split to the same stream

Huge Stream: When single-stream mode task count exceeds normal stream limit, the system tries to use Huge Stream, which has higher task capacity.

4.5 Stream Activation Mechanism

Stream Activate is the stream scheduling mechanism on the Ascend device side. When a stream needs to wake up another stream, it sends an activation signal through StreamActive operator.

Establishment Process of Activation Relationship:

Label Activation (SetActiveStreamsByLabel): Traverse all nodes with ATTR_NAME_ACTIVE_LABEL_LIST attribute, map labels to actual stream IDs, write to ATTR_NAME_ACTIVE_STREAM_LIST attribute.
Subgraph Activation (SetActiveStreamsForSubgraphs): For While/For loop subgraphs, set activation stream list for the first StreamActive node, ensuring all streams within the subgraph are correctly activated.
Switch Node Activation (UpdateActiveStreamsForSwitchNode): A StreamActive node is inserted after StreamSwitch node, activating corresponding streams based on conditional branches.
Loop Activation (SetActiveStreamsForLoop): Handle FpBp loops in training scenarios. The StreamActive node needs to activate all streams not specifically activated, ensuring all streams are correctly started at the beginning of each iteration.

Activation Update After Stream Splitting: Physical stream splitting produces new stream IDs. The system needs to update activation lists of all StreamActive nodes, adding newly split streams to the activation scope.

4.6 Synchronization Event Node Generation

In the final stage of stream allocation, the system needs to convert event information recorded in data structures (node_to_send_events_, node_to_recv_events_, etc.) into actual nodes in the graph (Send/Recv operators).

GenerateSyncEventNodes method traverses event mappings of all nodes, creates corresponding Send and Recv nodes for each event, and inserts them into correct positions in the graph through control edges. These nodes are converted to device-side Event Record/Wait tasks in subsequent Task generation phase.

4.7 Runtime Stream Creation

After compilation determines stream count, runtime creates corresponding device streams during model loading phase.

V2 Path (gert::StreamAllocator)

gert::StreamAllocator pre-allocates a continuous vector that can hold up to 2024 rtStream_t. On first AcquireStreams call, it creates the specified number of streams in sequence (calling rtStreamCreateWithFlags). Subsequent calls reuse already created streams. Stream destruction completes uniformly in the destructor.

V1 Path (ge::ReusableStreamAllocator)

V1 path supports cross-model stream reuse. ReusableStreamAllocator maintains a stream pool keyed by <priority, stream_flag>. Each stream records its task_num and the model ID set using it. When a new model requests streams:

Search stream pool for matching <priority, stream_flag>
Filter streams not belonging to current model
Sort by task_num, preferentially reuse streams with similar task count
Create new stream if no available stream

5 Key Design Decisions

5.1 Why Do Static and Dynamic Use Different Stream Allocation Strategies?

Static Shape graph topology is completely known at compilation time. GE can precisely analyze all data dependency relationships and perform fine-grained stream reuse. Dynamic Shape graph structure is incomplete at compilation time (some subgraphs only expand at runtime), preventing precise dependency analysis. Therefore, Dynamic Shape adopts a more conservative strategy: default single-stream, only enabling multi-stream when user explicitly configures, with stream allocation rules primarily at engine granularity rather than dependency analysis.

5.2 Why Use Pass Chain Instead of Single Allocation Algorithm?

The core advantage of Pass chain architecture is extensibility and maintainability. Each stream allocation rule (label allocation, engine allocation, dependency allocation, AllReduce parallelism, etc.) is encapsulated as an independent Pass, each maintaining its own state without interference. Adding new stream allocation rules only requires adding new Passes to the chain, without modifying existing logic. If using a single algorithm, all rules would be interwoven, drastically reducing code readability and maintainability.

5.3 Why Need "Graph Structure Stability Principle"?

Stream allocation depends on topological sort and topo ID continuity. If graph structure changes during stream allocation (adding or deleting nodes), topo ID becomes discontinuous, affecting subsequent memory reuse (which allocates memory blocks based on topo order) and physical stream splitting (which calculates task count based on topo order). Therefore, stream allocation phase strictly prohibits graph modification operations. All nodes needing insertion (such as Send/Recv, StreamActive) are inserted uniformly after stream allocation completes.

5.4 Why Must Attached Stream Be Allocated After Main Stream Allocation Completes?

Attached stream IDs are incremented based on main stream IDs. If attached streams and main streams are allocated together, stream IDs become discontinuous, increasing synchronization management complexity. Allocating attached streams all at once after main stream allocation completes ensures stream ID continuity and predictability.

6 Constraints and Limitations

Constraint	Description
Single-stream mode does not support StreamLabel	Single-stream mode has only one stream, StreamLabel causes conflict
Event ID must be continuous	RTS has Event ID continuity validation, must ensure through `RefreshContinuousEvents`
Graph structure immutable	Graph structure cannot change during stream allocation, topo ID must be continuous
Notify count upper limit	Maximum 1024 Notify supported
Single node attached stream count	One node supports at most one attached stream (reuse through reuse_key)
Multi-thread safety	`StreamAllocator` supports multi-threading but needs to protect shared resources; `ScalableAllocator` does not support multi-thread concurrency
Dynamic graph While constraint	NetOutput of While operator's static body subgraph in dynamic graph must be on stream 0
Multi-dim scenario label constraint	When adding StreamLabel in multi-dim scenarios, need to add dim information to distinguish