Concat No Task Feature Analysis
1. Feature Background
1.1 Problem Scenario
In deep learning models, ConcatD/ConcatV2D operators concatenate multiple input tensors along a specified dimension into one output tensor. The traditional Concat operator execution flow is:
InputA ──┐
InputB ──┼──► Concat Task ──► Output (concatenated result)
InputC ──┘
The Concat Task launches a computation task on the Ascend AI Core, which uses DataMove instructions to copy each input into a contiguous memory region at the output address.
1.2 Optimization Approach
When the input tensors of a Concat operator are already laid out contiguously in memory, no data movement is needed: the address of the first input can be reused directly as the output address. The Concat No Task feature identifies this case at compile time and marks the Concat operator as a "virtual operator" (Virtual Op), which achieves:
- No hardware Task generation: AI Core task dispatch is skipped
- No data movement: the output directly reuses the input memory address
- No redundant computation: meaningless data movement is avoided
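The idea can be illustrated with a minimal C++ sketch. It is not the GE implementation, which makes this decision at compile time; the InputView type and the TryConcatWithoutCopy function are hypothetical names used only to show when the first input address can stand in for the output address.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of one Concat input: start address and size in bytes.
struct InputView {
  const std::uint8_t *addr;
  std::size_t size;
};

// Returns the address of the first input if every input starts exactly where
// the previous one ends; returns nullptr if a real copy would be required.
const std::uint8_t *TryConcatWithoutCopy(const std::vector<InputView> &inputs) {
  if (inputs.empty()) {
    return nullptr;
  }
  for (std::size_t i = 1; i < inputs.size(); ++i) {
    if (inputs[i - 1].addr + inputs[i - 1].size != inputs[i].addr) {
      return nullptr;  // gap or overlap: the inputs are not contiguous
    }
  }
  return inputs.front().addr;
}
```

With three 32-byte inputs placed back to back in one 96-byte buffer, the function returns the buffer start; if one input is missing from the middle, it returns nullptr.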
2. User Usage Scenarios
2.1 Typical Scenario: AllGather + Concat in Distributed Training
In distributed training, multiple cards collect their respective data through AllGather, and the results are then concatenated into a complete batch:
Card0: Data ──► AllGather ──┐
                            ├──► ConcatD ──► Complete Batch
Card1: Data ──► AllGather ──┘
The AllGather output is arranged contiguously in memory in card-number order. The Concat is therefore a purely logical concatenation; physically, no data needs to be moved.
2.2 Typical Scenario: Result Merging After Block Computation
A large tensor is split across multiple operators for parallel computation, and the results are then merged:
Input ──► Split ──┬──► Relu ──┐
                  ├──► Relu ──┼──► ConcatD ──► Output
                  └──► Relu ──┘
When Split splits along the batch dimension and the computation in each branch does not change the memory layout, the Concat inputs are naturally contiguous.
2.3 Applicable Conditions
The Concat No Task optimization applies only when all of the following strict conditions are satisfied simultaneously:
| Condition Category | Specific Requirement |
|---|---|
| Operator type | Only ConcatD and ConcatV2D |
| Shape constraint | All dimension values before concat_dim axis must be 1 |
| Alignment constraint | concat_dim axis original size must be integer multiple of align_shape corresponding dimension (no padding) |
| Input constraint | Cannot have Scalar input; all input tensor sizes must be 32-byte aligned |
| Source constraint | Cannot have multiple inputs from same output anchor |
| Predecessor node | Cannot be DATA, REFDATA, VARIABLE, CONSTANTOP, CONSTANT node types |
| Predecessor node | Cannot contain subgraph |
| Predecessor attribute | Cannot have continuous_input, continuous_output, reference attributes |
| Successor node | Successor node cannot already be virtual operator (has _no_task attribute) |
| Output constraint | Input cannot simultaneously be model output (NetOutput) |
| Memory type | All input memory types must be consistent |
| LxFusion | Cannot involve LxFusion (L1/L2/UB address, lxslice operator) |
| Shape mode | Does not support Unknown Shape (dynamic Shape) scenario |
| Graph mode | Does not support Single Op scenario and memory non-contiguous allocation scenario |
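The shape constraint is what makes the reuse valid: when every dimension before concat_dim is 1, concatenating along concat_dim is the same as appending the bytes of the inputs one after another, so contiguous inputs already form the output. If a leading dimension were larger than 1 (for example, two [2, 3] tensors concatenated along axis 1), the rows of the inputs would have to be interleaved, which needs a real copy. The sketch below shows the two simplest checks from the table; the function names are illustrative and are not taken from the GE source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True if every dimension before concat_dim equals 1.
bool LeadingDimsAreOne(const std::vector<std::int64_t> &shape, std::size_t concat_dim) {
  if (concat_dim >= shape.size()) {
    return false;  // concat_dim must be a valid axis of the shape
  }
  for (std::size_t i = 0; i < concat_dim; ++i) {
    if (shape[i] != 1) {
      return false;
    }
  }
  return true;
}

// True if a tensor size in bytes is a non-negative multiple of 32.
bool Is32ByteAligned(std::int64_t size_in_bytes) {
  return size_in_bytes >= 0 && size_in_bytes % 32 == 0;
}
```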
3. External Interfaces
3.1 Compilation Period Attribute Marking
Concat No Task interacts with other system modules through the following Graph attributes:
| Attribute Name | Type | Setting Object | Description |
|---|---|---|---|
| _no_task | bool | Concat operator OpDesc | Marks that this operator does not generate a hardware Task |
| _nopadding_continuous_input | bool | Concat operator OpDesc | Marks the inputs as continuous memory without padding |
| _output_reuse_input | bool | Concat operator OpDesc | Marks that the output reuses input memory |
| _reuse_input_on_dim_index | int64 | Concat operator OpDesc | Specifies the dimension index of the reused input memory (fixed as 0) |
| can_reused_for_concat_optimize | bool | Predecessor node output TensorDesc | Marks that this output is occupied by a Concat optimization and cannot be reused by another Concat |
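How the attributes act as an interface between modules can be shown with a small mock. MockOpDesc below is a hypothetical stand-in and not the GE OpDesc class; only the attribute names come from the table above. One module writes the marks and a later module reads them, treating a missing attribute as false.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for an operator description (NOT the GE OpDesc class).
struct MockOpDesc {
  std::map<std::string, bool> bool_attrs;
  std::map<std::string, std::int64_t> int_attrs;
};

// Producer side: write the four marks listed for the Concat operator.
void MarkConcatAsNoTask(MockOpDesc &concat) {
  concat.bool_attrs["_no_task"] = true;
  concat.bool_attrs["_nopadding_continuous_input"] = true;
  concat.bool_attrs["_output_reuse_input"] = true;
  concat.int_attrs["_reuse_input_on_dim_index"] = 0;
}

// Consumer side: a missing attribute counts as "not set".
bool IsNoTask(const MockOpDesc &op) {
  const auto it = op.bool_attrs.find("_no_task");
  return it != op.bool_attrs.end() && it->second;
}
```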
3.2 Pass Registration
ConcatNotaskPass is registered as a GraphPass at the O3 optimization level:
REG_PASS_OPTION("ConcatNotaskPass").LEVELS(OoLevel::kO3);
In the compilation flow, this Pass runs in the final stage of OptimizeStage2, after subgraph merging completes and before memory conflict handling:
graph_manager.cc: OptimizeAfterMergeSubGraph()
├── ... (early optimization)
├── BufferPoolMemoryPass
├── ParallelGroupPass
└── ConcatNotaskPass ← Execute after graph stabilizes
3.3 Runtime Behavior
At runtime (RT1 and RT2), operators marked with _no_task receive special handling:
- RT1 (Davinci): TbeKernelHandle::NeedInit detects the _no_task attribute and returns false, skipping Kernel initialization
- RT2 (V2): the IsVirtualOp function detects the _no_task attribute, skipping Task generation in the AICore Node Converter
4. Specific Implementation
4.1 Overall Architecture
┌─────────────────────────────────────────────────────────────┐
│ Compilation Period (Compiler) │
│ │
│ ┌──────────────────┐ ┌──────────────────────────────┐ │
│ │ ConcatNotaskPass │───►│ Attribute Marking │ │
│ │ │ │ _no_task │ │
│ │ Verification chain│ │ _nopadding_continuous_input │ │
│ │ ├─ InputCheck │ │ _output_reuse_input │ │
│ │ ├─ CheckConcatDim│ │ _reuse_input_on_dim_index │ │
│ │ ├─ OutputCheck │ └──────────────────────────────┘ │
│ │ └─ LxFusionCheck│ │
│ └──────────────────┘ │
└─────────────────────┬───────────────────────────────────────┘
│ Attribute transfer
▼
┌─────────────────────────────────────────────────────────────┐
│ Memory Allocation (Memory Assigner) │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ BlockMemAssigner │ │ GraphMemAssigner │ │
│ │ │ │ │ │
│ │ Detect NOPADDING_ │ │ Calculate continuous_type │ │
│ │ CONTINUOUS_INPUT │ │ kTypeInputNoPadding │ │
│ │ │ │ │ │
│ │ no_assign_mem=true │ │ Calculate nopadding_size │ │
│ │ (no independent │ │ by dim_index │ │
│ │ memory allocation) │ │ │ │
│ └─────────────────────┘ └──────────────────────────┘ │
└─────────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Task Generation (Task Generator) │
│ │
│ Detect _no_task attribute → Skip this node Task generation │
│ MarkFirstAndLastOps skip notask nodes │
└─────────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Runtime (Runtime) │
│ │
│ RT1: TbeKernelHandle skip initialization │
│ RT2: AICoreNodeConverter skip conversion │
│ Output address directly reuse first input address │
└─────────────────────────────────────────────────────────────┘
4.2 ConcatNotaskPass Core Verification Chain
ConcatNotaskPass::Run executes the following verification chain for each ConcatD/ConcatV2D node in the graph and sets the attributes only after every check passes:
4.2.1 Scenario Exclusion
- Single Op scenario: single-operator mode requires no optimization
- Memory non-contiguous allocation: a graph with ATTR_NAME_MEMORY_DISCONTIGUOUS_ALLOCATION set is skipped
- Unknown Shape: operators with a dynamic Shape in their inputs or outputs are skipped
- Dynamic Shape graph: nodes whose owning graph is marked as a dynamic Shape partition are skipped
4.2.2 InputCheck (Input Verification)
Each input anchor is checked in sequence:
- IsScalarInput: excludes Scalar inputs (dimension count 0)
- CheckTensorAlign: in the multi-input scenario, each tensor size must be 32-byte aligned
- HasSameSourceAnchor: traces the RefNode chain through GetFirstOutAnchorNotInRefNode to ensure that no two inputs trace back to the same output anchor
- IsPreNodeTypeValid: finds the actual producer node through GetFirstNotRefNode and excludes DATA/REFDATA/VARIABLE/CONSTANTOP/CONSTANT
- IsPreNodeWithSubgraph: the predecessor node cannot contain a subgraph instance
- IsPreOutAnchorCanReuseForConcatOptimize: checks the can_reused_for_concat_optimize attribute on the predecessor output TensorDesc to ensure it is not already occupied by another Concat
- IsPreOutAnchorValidMultiRef: if the predecessor output also connects to NetOutput, the optimization cannot be applied
- IsPreNodeAttrValid: the predecessor node cannot have the continuous_input, continuous_output, reference, _no_task, _output_reuse_input, or _nopadding_continuous_input attributes, nor an atomic output
- IsSameInputMemType: all input memory types must be consistent (checked through ATTR_NAME_OUTPUT_MEM_TYPE_LIST)
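The same-source check exists because one buffer cannot sit at two different positions of a contiguous output. A minimal sketch of such a duplicate check follows; the SourceAnchor alias (producer node name plus output index) is a hypothetical simplification, and the RefNode tracing done by the real pass is assumed to have happened already.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical identifier of an output anchor: (producer node name, output index).
using SourceAnchor = std::pair<std::string, int>;

// Returns true if two or more Concat inputs resolve to the same source anchor.
bool HasSameSourceAnchor(const std::vector<SourceAnchor> &sources) {
  std::set<SourceAnchor> seen;
  for (const auto &src : sources) {
    if (!seen.insert(src).second) {
      return true;  // insert failed: this anchor was already recorded
    }
  }
  return false;
}
```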
4.2.3 CheckConcatDim (Concat Dimension Verification)
This is the core verification logic, ensuring all dimensions before concat_dim axis are 1:
Original format (like NCHW) ──► Runtime format (like NC1HWC0)
        │                                   │
        │         GetTransferDims()         │
        │        (call FE interface)        │
        ▼                                   ▼
src_to_dst_transfer_dims          dst_to_src_transfer_dims
{0},{1,4},{2},{3}                 {0},{1},{2},{3},{1}
- GetAlignedShape: calls transformer::TransferShapeUtils::GetAlignedShape to get the aligned shape
- GetTransferDims: calls transformer::TransferShapeUtils::TransferDims to get the axis mapping between the original format and the runtime format
- CheckRealConcatDim: finds the actual concat_dim axis in the runtime format and verifies that all dimension values before that axis are 1
- CheckConcatDimAlignment: verifies that the original size of the concat_dim axis is an integer multiple of the corresponding align_shape dimension (ensuring no padding)
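As a worked example of the alignment rule, assume the aligned unit on the concat axis is 16 elements (a value chosen only for illustration). An original size of 32 fills two units exactly, while a size of 20 would be padded up to 32 and leave a gap between neighboring inputs. A minimal sketch with a hypothetical function name:

```cpp
#include <cstdint>

// True if the original concat_dim size fills whole aligned units, so that the
// runtime layout adds no padding on that axis.
bool ConcatDimNeedsNoPadding(std::int64_t original_dim_size, std::int64_t align_size) {
  if (original_dim_size <= 0 || align_size <= 0) {
    return false;  // reject invalid sizes instead of dividing by zero
  }
  return original_dim_size % align_size == 0;
}
```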
CheckRealConcatDim key logic:
- Find real_concat_dim in the runtime format through src_to_dst_transfer_dims[concat_dim][0]
- If real_concat_dim is produced by axis merging (multiple source axes in dst_to_src_transfer_dims), additional verification is needed:
  - All axis values before the real_concat_dim axis are 1
  - All values before concat_dim in the merged axis are 1
- If real_concat_dim is not produced by axis merging, only the preceding axis values need to be verified as 1
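The steps above can be sketched as follows. This is an interpretation of the description rather than the GE source: the parameter names, the Dims and AxisMap aliases, and the reading of "values before concat_dim in the merged axis" as the original axes listed ahead of concat_dim in the merge are all assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Dims = std::vector<std::int64_t>;
using AxisMap = std::vector<std::vector<std::size_t>>;

// src_shape / dst_shape: shape in the original / runtime format.
// src_to_dst: for each original axis, the runtime axes it maps to.
// dst_to_src: for each runtime axis, the original axes it is built from.
bool CheckRealConcatDim(const Dims &src_shape, const Dims &dst_shape,
                        const AxisMap &src_to_dst, const AxisMap &dst_to_src,
                        std::size_t concat_dim) {
  if (concat_dim >= src_to_dst.size() || src_to_dst[concat_dim].empty()) {
    return false;
  }
  const std::size_t real_concat_dim = src_to_dst[concat_dim][0];
  if (real_concat_dim >= dst_shape.size() || real_concat_dim >= dst_to_src.size()) {
    return false;
  }
  // Every runtime axis before real_concat_dim must have size 1.
  for (std::size_t i = 0; i < real_concat_dim; ++i) {
    if (dst_shape[i] != 1) {
      return false;
    }
  }
  // Merged axis: original axes listed before concat_dim must also be 1.
  const std::vector<std::size_t> &merged = dst_to_src[real_concat_dim];
  if (merged.size() > 1) {
    for (std::size_t src_axis : merged) {
      if (src_axis == concat_dim) {
        break;
      }
      if (src_axis >= src_shape.size() || src_shape[src_axis] != 1) {
        return false;
      }
    }
  }
  return true;
}
```

With the NCHW to NC1HWC0 mapping shown earlier and concat_dim = 1, a shape with N = 1 passes and a shape with N = 2 fails, because the runtime axis 0 lies before real_concat_dim.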
4.2.4 OutputCheck (Output Verification)
Traverse all successor nodes of Concat node:
- If a successor is a Reshape and has an output node, continue checking the output node of the Reshape
- A successor node cannot already have the _no_task, _output_reuse_input, or _nopadding_continuous_input attributes
4.2.5 LxFusionCheck (LxFusion Verification)
- IsLxFusionMem: checks whether the input/output memory type is L1/L2/UB (on-chip memory used by LxFusion)
- IsLxFusionOp: checks whether the operator name contains "lxslice"
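The overall shape of the verification chain in 4.2 (ordered checks, stop at the first failure, mark only if all pass) can be sketched as below. ConcatInfo and FirstFailedCheck are hypothetical; each flag stands for the already-computed result of one stage, so the sketch shows only the control flow and not the checks themselves.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pre-computed results for one Concat node.
struct ConcatInfo {
  bool input_ok;
  bool concat_dim_ok;
  bool output_ok;
  bool uses_lx_fusion;
};

using Check = std::pair<std::string, std::function<bool(const ConcatInfo &)>>;

// Runs the checks in order and returns the name of the first failing check,
// or an empty string if the node qualifies for the no-task marking.
std::string FirstFailedCheck(const ConcatInfo &info) {
  static const std::vector<Check> kChain = {
      {"InputCheck", [](const ConcatInfo &c) { return c.input_ok; }},
      {"CheckConcatDim", [](const ConcatInfo &c) { return c.concat_dim_ok; }},
      {"OutputCheck", [](const ConcatInfo &c) { return c.output_ok; }},
      {"LxFusionCheck", [](const ConcatInfo &c) { return !c.uses_lx_fusion; }},
  };
  for (const auto &check : kChain) {
    if (!check.second(info)) {
      return check.first;  // short-circuit: later checks are not evaluated
    }
  }
  return "";
}
```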
4.3 Attribute Setting
After all verifications pass, SetAttrForConcatNotask performs the following operations:
// Set the attributes on the Concat operator itself
AttrUtils::SetBool(op_desc, ATTR_NAME_NOTASK, true);
AttrUtils::SetBool(op_desc, ATTR_NAME_NOPADDING_CONTINUOUS_INPUT, true);
AttrUtils::SetBool(op_desc, ATTR_NAME_OUTPUT_REUSE_INPUT, true);
AttrUtils::SetInt(op_desc, ATTR_NAME_REUSE_INPUT_ON_DIM_INDEX, 0);
// Mark each predecessor output TensorDesc so that it cannot be reused by another Concat
// (pseudocode: repeated for every input of the Concat node)
for each input:
    AttrUtils::SetBool(output_tensor_desc,
                       "can_reused_for_concat_optimize", false);
4.4 Memory Allocation Linkage
4.4.1 BlockMemAssigner
In AnalyzeSymbolMemReuseInfo, when a node is detected to have the ATTR_NAME_NOPADDING_CONTINUOUS_INPUT attribute:
if (is_input_continuous) {
symbol_mem_reuse_info_[symbol].no_assign_mem_ = true;
}
This means that the symbol (memory block) is not allocated memory of its own and instead reuses the upstream memory address.
The GetContinuousNodeLifeTimeBegin method handles cascaded continuous-input scenarios:
a   b   c      (first layer)
|   |   |
d   e   f      (second layer)
|___|___|
    |
g   h   i      (third layer, h is continuous input)
|___|___|
    |
    j          (fourth layer, j is continuous input)
g cannot reuse memory with a/b/c, because the memory of d/e/f will be replaced by the address of g (cascaded continuous input). The lifetime start of g therefore has to be traced back to the earliest lifetime start among d/e/f.
4.4.2 GraphMemAssigner
GetContinuousType identifies the continuous type:
kTypeInputNoPadding = _nopadding_continuous_input && _output_reuse_input
kTypeOutputNoPadding = _nopadding_continuous_output && _output_reuse_input
In GetMemorySize, for the nopadding types:
- Get the dimension index through _reuse_input_on_dim_index
- Calculate nopadding_size (actual data size) and tensor_size (complete size with padding)
- The memory allocator allocates a memory block of the precise size based on these values
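To illustrate the difference between the two sizes, the sketch below computes the exact data size of a tensor and a padded size. The rounding rule (round up to a multiple of 32 bytes) and all names are assumptions made for this example; the actual GetMemorySize logic, including its use of the dimension index, may differ.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result pair mirroring the two sizes named in the text.
struct MemSizes {
  std::int64_t nopadding_size;  // exact bytes occupied by the data
  std::int64_t tensor_size;     // bytes after rounding up (assumed: to 32)
};

MemSizes ComputeMemSizes(const std::vector<std::int64_t> &shape,
                         std::int64_t bytes_per_element) {
  std::int64_t element_count = 1;
  for (std::int64_t dim : shape) {
    element_count *= dim;
  }
  const std::int64_t nopadding = element_count * bytes_per_element;
  const std::int64_t padded = (nopadding + 31) / 32 * 32;
  return {nopadding, padded};
}
```

For a shape of [1, 10] with 2-byte elements this gives 20 and 32 bytes. If each input were placed at its padded size, the 12 extra bytes would separate it from the next input, which is why the no-padding size is the one that matters for contiguous placement.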
4.4.3 SetInputOutputOffsetPass
For no_task Concat nodes that have ATTR_NAME_CONTINUOUS_INPUT or satisfy the BufferFusion condition, SetOutputOffsetForConcat is called to set the output offset, ensuring that the output address correctly points to the start of the input data.
4.5 Task Generation Skip
In TaskGenerator::MarkFirstAndLastOps:
bool attr_notask = false;
if (ge::AttrUtils::GetBool(op_desc, ATTR_NAME_NOTASK, attr_notask) && attr_notask) {
continue; // Skip notask node, not included in continuous node list
}
This ensures that no_task nodes are not treated as part of the continuous computation nodes in a stream and therefore do not affect the marking of the first and last operators.
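The effect of the skip can be seen in a small stand-alone sketch. MockNode and FindFirstAndLastOps are hypothetical names; the function returns the positions of the first and last task-generating nodes in a stream and ignores no_task nodes entirely.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node in one stream.
struct MockNode {
  std::string name;
  bool no_task;
};

// Returns {first, last} indexes among task-generating nodes,
// or {-1, -1} if every node in the stream is a no_task node.
std::pair<int, int> FindFirstAndLastOps(const std::vector<MockNode> &nodes) {
  int first = -1;
  int last = -1;
  for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
    if (nodes[i].no_task) {
      continue;  // virtual operators never become first or last
    }
    if (first < 0) {
      first = i;
    }
    last = i;
  }
  return {first, last};
}
```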
4.6 Runtime Processing
4.6.1 RT1 (Davinci Model)
In TbeKernelHandle::NeedInit:
bool attr_no_task = false;
const bool get_attr_no_task_flag = AttrUtils::GetBool(op_desc, ATTR_NAME_NOTASK, attr_no_task);
if (get_attr_no_task_flag && attr_no_task) {
GELOGI("Node[name:%s, type:%s] does not generate task, skip initialization.", ...);
return false; // Skip Kernel initialization
}
4.6.2 RT2 (V2 Engine)
In IsVirtualOp in aicore_node_converter.cc:
bool attr_no_task = false;
(void)ge::AttrUtils::GetBool(op_desc, ge::ATTR_NAME_NOTASK, attr_no_task);
return attr_no_task; // Return true means virtual operator, skip conversion
5. Coordination with Other Optimizations
5.1 Coordination with Split No Task
Concat No Task frequently works together with Split No Task, forming a "split-compute-merge" zero-copy pipeline:
Input ──► Split(no_task) ──► Computation ──► Concat(no_task) ──► Output
Split works in the reverse direction of Concat along the same dimension: each of its output addresses points directly to a different offset of the input address. Concat treats these offset addresses as one continuous memory region and reuses it directly.
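A simplified sketch of such a pipeline on a single buffer is shown below. The View type and the three functions are hypothetical, and the in-place Relu is a simplification chosen so that the branch outputs stay at the Split offsets; the real memory planning is done by the allocator at compile time.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view of a float region.
struct View {
  float *data;
  std::size_t count;
};

// "Split" without a task: each output is only an offset into the input buffer.
std::vector<View> SplitAsViews(float *base, std::size_t total, std::size_t parts) {
  std::vector<View> views;
  if (parts == 0 || total % parts != 0) {
    return views;
  }
  const std::size_t chunk = total / parts;
  for (std::size_t i = 0; i < parts; ++i) {
    views.push_back(View{base + i * chunk, chunk});
  }
  return views;
}

// Branch computation that keeps the memory layout unchanged.
void ReluInPlace(const View &v) {
  for (std::size_t i = 0; i < v.count; ++i) {
    if (v.data[i] < 0.0f) {
      v.data[i] = 0.0f;
    }
  }
}

// "Concat" without a task: the output is the first view if all are adjacent.
float *ConcatAsView(const std::vector<View> &views) {
  if (views.empty()) {
    return nullptr;
  }
  for (std::size_t i = 1; i < views.size(); ++i) {
    if (views[i - 1].data + views[i - 1].count != views[i].data) {
      return nullptr;
    }
  }
  return views.front().data;
}
```

Splitting a four-element buffer into two views, applying ReluInPlace to each, and calling ConcatAsView yields the original buffer address with the processed data in place, and no element is copied.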
5.2 Relationship with Memory Reuse
Concat No Task nodes receive special handling in memory reuse check:
- In ReuseChecker, nodes with the _no_task attribute are treated as buffer_pool type
- In mem_reuse_strategy.cc, nopadding continuous input nodes participate in a special memory reuse strategy
5.3 Relationship with Address Refresh
In MemLayoutConflictUtil, the Concat No Task scenario needs to consider address refresh:
      data
        |
    identity   (insert identity operator to support address refresh)
        |
split(no_task, no_padding_continuous_output)
      /   \
    op1   op2
When user input is involved and address refresh is required, an Identity operator is inserted between Data and Split.
6. Test Verification
The unit test file is located at tests/ge/ut/ge/graph/passes/concat_notask_pass_unittest.cc and covers the following scenarios:
| Test Case | Verification Content |
|---|---|
| allgather_connect_to_concat | AllGather output connects to Concat; verifies that the attributes are set correctly |
| allgather_connect_to_concat_reshape | AllGather → Reshape → Concat chain |
| Multiple RefData tests | Verify RefNode chain trace logic |
| Different Format tests | dim verification under ND, NCHW, NC1HWC0 formats |
7. Design Considerations
7.1 Why Choose Attribute Marking Rather Than Graph Rewriting
Concat No Task marks the node with attributes instead of deleting it, for the following reasons:
- Preserve debug information: the Dump function needs the operator's OpDesc and address information to be preserved (see InitNoTaskAndDumpNeededNode)
- Maintain graph structure integrity: deleting nodes breaks the graph topology and affects the execution of other Passes
- Support dynamic scenarios: attribute marking can be handled flexibly in different compilation stages
7.2 Why Verification Conditions Are So Strict
Concat No Task has more than ten verification conditions, because:
- Memory safety: if the inputs are not contiguous but the node is marked as no_task, wrong data will be read
- Alignment constraint: Ascend hardware has strict memory alignment requirements, and failing to satisfy 32B alignment may cause a hardware exception
- Cascading impact: one no_task node affects the downstream memory allocation strategy, so a wrong marking propagates
7.3 Why Execute at Stage2 End
The code comment states "graph stabilized then do ConcatNotaskPass":
- Before Stage2, the graph structure may still change (subgraph merging, operator fusion, etc.)
- Predecessor node attributes may be modified by subsequent Passes
- Executing at the end ensures that the judgment is based on the final, stable graph structure