## 1. Business Perspective: What Problems Does Profiling Solve
In AI model training and inference, performance bottlenecks can appear at any stage: operators that run too long, poor memory allocation, stream scheduling conflicts, blocked Host-Device data transfers, and so on. GE's Profiling feature is designed to address these observability problems.
**Typical User Scenarios**:

- **Training performance tuning**: Within one training step, developers need to know how long forward propagation (FP) and backward propagation (BP) each take, which AllReduce operators become communication bottlenecks, and whether there is idle waiting between iterations.
- **Inference latency analysis**: How long does model loading take? What is the execution time distribution across operators? Are any operators abnormally slow?
- **Memory analysis**: What is the memory lifecycle of static operators? Are there memory layout conflicts?
- **API-level performance tracing**: How long do user-called APIs such as `aclmdlExecute` and `aclopExecute` take?
The design philosophy of GE's Profiling system is layered collection, on-demand enablement, and unified reporting. Each layer (API layer, Host layer, Device layer) collects data independently and reports it through the unified msprof library to analysis tools such as MSProfiler. Users can enable profiling at whatever granularity they need.
## 2. How to Enable Profiling
### 2.1 Enable via GE Options (Recommended Method)
When calling `GEInitialize` or creating a Session, set `ge.exec.profilingMode` to `"1"` in the options parameter to enable profiling, and use `ge.exec.profilingOptions` to specify, in JSON format, the output path, the training trace switch, the `fp_point`/`bp_point` operator names, and options such as `task_trace`, `hccl`, `aicpu`, and `aic_metrics`.
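As a rough illustration of the two option entries described above, the sketch below only builds the key/value pairs; it does not call `GEInitialize`. The helper name `MakeProfilingOptions` and the use of `std::map<std::string, std::string>` as the container are assumptions made for the example, and the JSON is assembled by hand without escaping, so it is only suitable for simple paths.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: builds the two GE option entries described in the text.
// The container type is an assumption; the real GEInitialize signature may differ.
std::map<std::string, std::string> MakeProfilingOptions(const std::string &output_dir,
                                                        bool training_trace) {
  assert(!output_dir.empty());
  // The JSON keys are configuration items named in the text.
  // No escaping is performed, so output_dir must not contain quotes.
  std::string json = "{";
  json += "\"output\":\"" + output_dir + "\",";
  json += std::string("\"training_trace\":\"") + (training_trace ? "on" : "off") + "\",";
  json += "\"task_trace\":\"on\",";
  json += "\"aic_metrics\":\"PipeUtilization\"";
  json += "}";
  return {
      {"ge.exec.profilingMode", "1"},      // "1" enables profiling
      {"ge.exec.profilingOptions", json},  // JSON-formatted configuration
  };
}
```

The resulting map would then be passed to GE initialization or Session creation in whatever form the installed GE version expects.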
### 2.2 Enable via Environment Variables
Set the environment variable `GE_PROFILING_MODE=true`, and use `GE_PROFILING_OPTIONS` to specify the configuration items in JSON format (`output`, `training_trace`, `task_trace`, `hccl`, `aicpu`, `aic_metrics`, etc.).
### 2.3 Dynamic Control via C API
After including the header file `ge/ge_prof.h`, call the following in sequence: `aclgrphProfInit` (initialize and specify the output path) → `aclgrphProfCreateConfig` (create the device list and metrics configuration) → `aclgrphProfStart` (start collection) → execute the model → `aclgrphProfStop` (stop collection) → `aclgrphProfFinalize` (end profiling) → `aclgrphProfDestroyConfig` (destroy the configuration).
### 2.4 Profiling Configuration Items

| Configuration Item | Description | Example Value |
|---|---|---|
| `output` | Profiling data output path | `/tmp/profiling` |
| `training_trace` | Whether to enable training trace (FP/BP time points) | `on` / `off` |
| `fp_point` | Name of the operator where forward propagation starts | `data` (if not specified, Data/GetNext nodes are found automatically) |
| `bp_point` | Name of the operator where backward propagation ends | `gradients` (if not specified, AllReduce nodes are found automatically) |
| `task_trace` | Whether to enable operator task-level tracing | `on` / `off` |
| `hccl` | Whether to enable collective communication tracing | `on` / `off` |
| `aicpu` | Whether to enable AI CPU operator tracing | `on` / `off` |
| `aic_metrics` | AI Core performance metrics type | `PipeUtilization` / `ArithmeticUtilization` / `Memory`, etc. |
| `msproftx` | Whether to enable the msproftx function | `on` / `off` |
### 2.5 AI Core Metrics Types

| Enum Value | Description |
|---|---|
| `kAicoreArithmeticUtilization` (0) | Percentage of computation-type metrics |
| `kAicorePipeUtilization` (1) | Percentage of time spent in compute units and data-movement units |
| `kAicoreMemory` (2) | UB/L1/L2 read/write bandwidth |
| `kAicoreMemoryL0` (3) | L0 read/write bandwidth |
| `kAicoreResourceConflictRatio` (4) | Percentage of pipeline queue-type instructions |
| `kAicoreMemoryUB` (5) | Fine-grained UB read/write bandwidth |
| `kAicoreL2Cache` (6) | Cache hit/miss counts |
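To make the relationship between the `aic_metrics` string and the enum values concrete, here is a minimal sketch of a lookup. The enumerator names and numeric values come from the table above; the enum type name `AicoreMetrics`, the helper `ParseAicMetrics`, and the string spellings beyond `PipeUtilization`, `ArithmeticUtilization`, and `Memory` are assumptions that simply follow the enumerator names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Enumerator names and values follow the table; the type name is hypothetical.
enum AicoreMetrics : int {
  kAicoreArithmeticUtilization = 0,
  kAicorePipeUtilization = 1,
  kAicoreMemory = 2,
  kAicoreMemoryL0 = 3,
  kAicoreResourceConflictRatio = 4,
  kAicoreMemoryUB = 5,
  kAicoreL2Cache = 6,
};

// Hypothetical helper: maps an aic_metrics string to its enum value.
// Returns false and leaves *out untouched when the name is unknown.
bool ParseAicMetrics(const std::string &name, AicoreMetrics *out) {
  assert(out != nullptr);
  static const std::unordered_map<std::string, AicoreMetrics> kTable = {
      {"ArithmeticUtilization", kAicoreArithmeticUtilization},
      {"PipeUtilization", kAicorePipeUtilization},
      {"Memory", kAicoreMemory},
      {"MemoryL0", kAicoreMemoryL0},                            // assumed spelling
      {"ResourceConflictRatio", kAicoreResourceConflictRatio},  // assumed spelling
      {"MemoryUB", kAicoreMemoryUB},                            // assumed spelling
      {"L2Cache", kAicoreL2Cache},                              // assumed spelling
  };
  const auto it = kTable.find(name);
  if (it == kTable.end()) {
    return false;
  }
  *out = it->second;
  return true;
}
```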
## 3. Overall Architecture Design
The GE Profiling system adopts a layered architecture. From the user API call down to Device-side operator execution, each layer has its own collection mechanism, and all layers ultimately report through the msprof library.
```mermaid
graph TB
    subgraph "User Layer"
        A[User code] --> B[ACL API]
        A --> C[GE API]
    end
    subgraph "API Layer Profiling"
        B --> D[AclProfilingReporter]
        C --> E[GraphProfilingReporter]
        D --> F[MsprofReportApi]
        E --> F
    end
    subgraph "Host Layer Profiling"
        G[ProfilingProperties] --> H[GlobalProfilingWrapper]
        H --> I[GlobalProfiler]
        I --> J[ScopeProfiler RAII]
    end
    subgraph "Compilation Layer Profiling"
        K[ProfilingTaskUtils] --> L[Insert ProfilerTrace Task]
        L --> M[domi::TaskDef]
    end
    subgraph "Runtime V1 (Hybrid)"
        N[HybridProfiler] --> O[CannTracingProfiler]
        O --> P[ProfilerCollector]
        P --> Q[RecordStart/RecordEnd]
    end
    subgraph "Runtime V2 (Model Executor)"
        R[CannProfilerV2] --> S[CannHostProfiler]
        R --> T[GeHostProfiler]
        R --> U[CannMemoryProfiler]
        R --> V[BaseExecutorProfiler]
    end
    subgraph "Unified Reporting Layer"
        F --> W[msprof library]
        Q --> W
        S --> W
        T --> W
        U --> W
        M --> W
    end
    subgraph "Analysis Tools"
        W --> X[MSProfiler tool]
        X --> Y[Visual analysis]
    end
    style A fill:#e1f5fe
    style W fill:#fff3e0
    style X fill:#e8f5e9
```
### 3.1 Layered Responsibilities

| Layer | Core Components | Responsibilities |
|---|---|---|
| API Layer | `AclProfilingReporter`, `GraphProfilingReporter` | Collect user API call duration (e.g. `aclmdlExecute`, `GEInitialize`) |
| Host Layer | `GlobalProfilingWrapper`, `ScopeProfiler` | Collect Host-side framework execution duration (e.g. InferShape, Tiling, memory allocation) |
| Compilation Layer | `ProfilingTaskUtils` | Insert ProfilerTrace tasks into the model during compilation, for training trace collection |
| Runtime V1 | `HybridProfiler`, `CannTracingProfiler`, `ProfilerCollector` | Collect operator execution time under the Hybrid executor |
| Runtime V2 | `CannProfilerV2`, `CannHostProfiler`, `CannMemoryProfiler` | Collect operator execution, Host scheduling, and memory information under the V2 executor |
## 4. Code Chain: From Entry to Implementation
### 4.1 Initialization Chain
User calls GEInitialize(options)
↓
api/session/client/ge_api_v2.cc: InitProfiling(options)
↓
runtime/v1/common/profiling/profiling_init.cc: ProfilingInit::Instance().Init(options)
↓
1. Parse profilingMode and profilingOptions
2. Parse training_trace, fp_point, bp_point
3. Call MsprofInit() to initialize msprof library
4. Register GE control callback ProfRegisterCtrlCallback()
5. Set ProfilingProperties singleton state
**Initialization Entry Design**:
The `InitProfiling` function in `ge_api_v2.cc` receives the options parameter and calls `ProfilingInit::Instance().Init(options)` to perform initialization. If the call returns a non-SUCCESS status, it reports an error; otherwise it returns success.
**Options Parsing Logic**:
`ProfilingInit::InitProfOptions` follows a priority strategy. It first looks for `ge.exec.profilingMode` and `ge.exec.profilingOptions` in the GE options. If they are not configured (profilingMode is not `"1"`), it falls back to the environment variables identified by `MM_ENV_PROFILING_MODE` and `MM_ENV_PROFILING_OPTIONS`. If the environment variable is also unset or its value is not `"true"`, it returns SUCCESS directly, meaning profiling is not enabled. Once parsing completes, `ParseOptions` extracts fields such as `training_trace`, `fp_point`, and `bp_point` from the JSON, and `ProfilingProperties::Instance().SetExecuteProfiling(true)` sets the global state.
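The priority rule described above (GE options first, environment variables second, otherwise disabled) can be restated as a small standalone function. This is a sketch under stated assumptions, not the GE implementation: `ProfConfig`, `EnvLookup`, and `ResolveProfConfig` are hypothetical names, the environment variable names are the ones given in section 2.2, and the environment is injected as a callback so the logic can be exercised without touching the real process environment.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical result type: whether profiling is on, plus the raw JSON options.
struct ProfConfig {
  bool enabled = false;
  std::string options_json;
};

// Hypothetical environment accessor; returns nullptr when a variable is unset.
using EnvLookup = std::function<const char *(const std::string &)>;

ProfConfig ResolveProfConfig(const std::map<std::string, std::string> &ge_options,
                             const EnvLookup &get_env) {
  assert(static_cast<bool>(get_env));
  ProfConfig cfg;

  // 1. GE options take priority when profilingMode is exactly "1".
  const auto mode = ge_options.find("ge.exec.profilingMode");
  if (mode != ge_options.end() && mode->second == "1") {
    cfg.enabled = true;
    const auto opts = ge_options.find("ge.exec.profilingOptions");
    if (opts != ge_options.end()) {
      cfg.options_json = opts->second;
    }
    return cfg;
  }

  // 2. Otherwise fall back to the environment (names as given in section 2.2).
  const char *env_mode = get_env("GE_PROFILING_MODE");
  if (env_mode == nullptr || std::string(env_mode) != "true") {
    return cfg;  // 3. Neither source enables profiling: stay disabled.
  }
  cfg.enabled = true;
  const char *env_opts = get_env("GE_PROFILING_OPTIONS");
  if (env_opts != nullptr) {
    cfg.options_json = env_opts;
  }
  return cfg;
}
```

In a real process the callback could simply wrap `std::getenv`; injecting it keeps the priority logic testable in isolation.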
### 4.2 Compile-Time Profiling Task Insertion
During the graph compilation phase, `ProfilingTaskUtils` inserts ProfilerTrace tasks into the computational graph. When these tasks execute on the Device side they generate timestamps, which are used for training trace analysis.
Compilation graph build flow
↓
compiler/graph/build/profiling_task_utils.cc: ProfilingTaskUtils::FindProfilingTaskIndex()
↓
1. Check ProfilingProperties::ProfilingOn() or ProfilingTrainingTraceOn()
2. Find FP point (forward starting operator):
- User specified: Through fp_point config find matching operator name
- Auto find: Traverse graph to find first Data/GetNext/IteratorV2 node
3. Find BP point (backward ending operator):
- User specified: Through bp_point config find matching operator name
- Auto find: Find last AllReduce or NetOutput node
4. Find iteration end point: FlowCtrl related nodes or NetOutput
5. Find AllReduce node list (for communication trace)
6. Find GetNext node list (for data loading trace)
↓
InsertProfilingTaskBefore/After() Insert TaskDef before/after operator
↓
AssembleTaskForProfilerTrace() Generate MODEL_TASK_PROFILER_TRACE type task
**Profiling Task Insertion Logic**:
The `InsertProfilingTaskBefore` method (defined in `compiler/graph/build/profiling_task_utils.cc`) checks whether a profiling task must be inserted before an operator executes. It uses an operator attribute to determine whether the operator is marked as the FP insertion point and, if so, generates a ProfilerTrace task. For AllReduce operators it calls a dedicated method to insert a communication trace task, and for GetNext operators it inserts a data loading trace task.
The `AssembleTaskForProfilerTrace` method assembles the ProfilerTrace task: it creates a `TaskDef` object, sets the task type to `MODEL_TASK_PROFILER_TRACE`, binds the stream_id, writes the logid and the iteration end marker, and finally adds the task to the task list.
**LogID Definition**:

| LogID | Meaning |
|---|---|
| `kProfilingFpStartLogid = 2` | Forward propagation start |
| `kProfilingBpEndLogid = 3` | Backward propagation end |
| `kProfilingIterEndLogid = 4` | Iteration end |
| `kProfilingArStartLogid = 10000` | AllReduce start (each AR +2) |
| `kProfilingArEndLogid = 10001` | AllReduce end (each AR +2) |
| `kProfilingGetNextStartLogid = 20000` | GetNext start |
| `kProfilingGetNextEndLogid = 20001` | GetNext end |
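The "+2 per AllReduce" rule in the table can be written out as simple arithmetic. The constants below are the values from the table; the integer type, the zero-based `ar_index`, and the helper functions are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Values from the LogID table; the integer type is an assumption.
constexpr uint64_t kProfilingArStartLogid = 10000;
constexpr uint64_t kProfilingArEndLogid = 10001;
constexpr uint64_t kProfilingGetNextStartLogid = 20000;

// ar_index is assumed to be the zero-based position of the AllReduce node
// in the list collected at compile time; each node advances the logid by 2.
constexpr uint64_t ArStartLogid(uint64_t ar_index) {
  return kProfilingArStartLogid + 2 * ar_index;
}
constexpr uint64_t ArEndLogid(uint64_t ar_index) {
  return kProfilingArEndLogid + 2 * ar_index;
}

// Inverse mapping, useful when reading a trace: which AllReduce produced it?
inline uint64_t ArIndexFromLogid(uint64_t logid) {
  assert(logid >= kProfilingArStartLogid && logid < kProfilingGetNextStartLogid);
  return (logid - kProfilingArStartLogid) / 2;
}

static_assert(ArStartLogid(1) == 10002, "second AllReduce starts at 10002");
static_assert(ArEndLogid(1) == 10003, "second AllReduce ends at 10003");
```

Under this scheme even logids in the AllReduce range are start markers and odd ones are end markers, so a start/end pair for the same node always differs by exactly one.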
### 4.3 API Layer Profiling
The API layer collects the duration of user API calls through RAII-style Reporter classes.

User calls aclmdlExecute(modelId, input, output)
↓
ACL_PROFILING_REG(AclProfType::AclmdlExecute) macro expands
↓
AclProfilingReporter object is created (constructor records the start time)
↓
The actual API logic executes
↓
AclProfilingReporter is destructed (records the end time and reports)
↓
MsprofReportApi() reports to the msprof library
**API Layer RAII Profiling Mechanism**:
The `ACL_PROFILING_REG(apiId)` macro (defined in `api/acl/common/prof_api_reg.h`) declares a `const` local `AclProfilingReporter` object in the function scope. Its constructor checks the global profiling running state and records the start time; its destructor obtains the end time, builds an `MsprofApi` structure, and reports it to the msprof library.
**Graph API Layer Profiling** uses a similar macro, `GRAPH_PROFILING_REG(api_id)` (defined in `inc/framework/runtime/subscriber/global_profiler.h`), to create a `GraphProfilingReporter` object, which checks the enable state through `GlobalProfilingWrapper` before reporting.
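The RAII pattern described above can be sketched in a few lines. Everything here is a hypothetical stand-in rather than the ACL code: `ApiRecord` loosely plays the role of `MsprofApi`, the `ReportSink` callback plays the role of `MsprofReportApi`, the `enabled` flag stands in for the global running-state check, and `SKETCH_PROFILING_REG` only imitates the shape of `ACL_PROFILING_REG`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical record; stands in for the structure reported to msprof.
struct ApiRecord {
  uint32_t api_id;
  uint64_t begin_ns;
  uint64_t end_ns;
};

using ReportSink = std::function<void(const ApiRecord &)>;

class ScopedApiReporter {
 public:
  // Constructor: check the enable state and record the start time.
  ScopedApiReporter(uint32_t api_id, bool enabled, ReportSink sink)
      : api_id_(api_id), enabled_(enabled), sink_(std::move(sink)) {
    if (enabled_) {
      begin_ns_ = NowNs();
    }
  }
  // Destructor: record the end time and report exactly once.
  ~ScopedApiReporter() {
    if (enabled_ && sink_) {
      sink_(ApiRecord{api_id_, begin_ns_, NowNs()});
    }
  }
  ScopedApiReporter(const ScopedApiReporter &) = delete;
  ScopedApiReporter &operator=(const ScopedApiReporter &) = delete;

 private:
  static uint64_t NowNs() {
    const auto t = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t).count());
  }
  uint32_t api_id_;
  bool enabled_;
  ReportSink sink_;
  uint64_t begin_ns_ = 0;
};

// Declares a const local reporter, in the spirit of the macro described above.
#define SKETCH_PROFILING_REG(api_id, enabled, sink) \
  const ScopedApiReporter sketch_profiling_reporter((api_id), (enabled), (sink))
```

Placing the macro on the first line of an API function means every return path, including early error returns, is timed without any explicit "stop" call.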
### 4.4 Host Layer Profiling
The Host layer collects the duration of framework-internal execution through `GlobalProfilingWrapper` and `ScopeProfiler`.

Host-side execution flow (e.g. InferShape, Tiling)
↓
PROFILING_SCOPE(element, event) macro expands
↓
ScopeProfiler object is created (RAII)
↓
The actual logic executes
↓
ScopeProfiler is destructed (records start/end events)
↓
ProfilingContext::RecordCurrentThread() records the events
↓
GlobalProfiler::Record() writes to the ring buffer
↓
Final Dump output
**Host Layer Scope Profiling Mechanism**:
The `PROFILING_SCOPE(element, event)` macro (defined in `inc/framework/common/profiling_definitions.h`) expands to a local `ge::profiling::ScopeProfiler` object, which uses the RAII pattern to record the execution duration of the enclosing scope automatically. The constructor checks the profiling enable state and records the start timestamp; the destructor records both the start and end events to `ProfilingContext`.
**Runtime V2 Scope Profiling** uses the `RT2_PROFILING_SCOPE(element, event)` macro (defined in `inc/framework/runtime/subscriber/global_profiler.h`) to create a `gert::ScopeProfiler` object. It checks the enable state through `GlobalProfilingWrapper` and records the `kExecuteStart` and `kExecuteEnd` events on destruction.
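The flow above ends with events being written to a ring buffer. The sketch below shows the general idea of such a buffer: a fixed number of slots where the newest event overwrites the oldest once the buffer is full. It is a single-threaded illustration only; `ProfEvent`, `EventRing`, and the field layout are hypothetical, and the actual `GlobalProfiler` storage and its thread-safety strategy are not described in this article.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical event layout: which element, which event kind, and when.
struct ProfEvent {
  uint32_t element;
  uint32_t event;
  uint64_t timestamp;
};

// Fixed-capacity ring: once full, each new event overwrites the oldest one.
// Not thread-safe; a real profiler would need atomics or per-thread buffers.
template <std::size_t N>
class EventRing {
  static_assert(N > 0, "capacity must be positive");

 public:
  void Record(const ProfEvent &e) {
    slots_[next_ % N] = e;
    ++next_;
  }

  // Number of events currently retained (at most N).
  std::size_t Size() const { return next_ < N ? next_ : N; }

  // Index 0 is the oldest retained event, Size() - 1 the newest.
  const ProfEvent &At(std::size_t i) const {
    assert(i < Size());
    const std::size_t first = next_ < N ? 0 : next_ % N;
    return slots_[(first + i) % N];
  }

 private:
  std::array<ProfEvent, N> slots_{};
  std::size_t next_ = 0;
};
```

A bounded buffer keeps recording cheap and memory use predictable; the trade-off is that old events are lost if nothing dumps the buffer before it wraps.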
### 4.5 Runtime V2 Profiling (Core Implementation)
Runtime V2 is GE's main executor, and `CannProfilerV2` is its core profiling component.

Model execution flow
↓
CannProfilerV2::OnExecuteEvent() receives the execution event
↓
kModelStart event → profiler->Init() initializes profiling info
↓
kExecuteStart event → profiler->RecordLaunchBeginTime() records the operator start time
↓
Operator kernel execution
↓
kExecuteEnd event → profiler->DoProf() reports operator profiling data
↓
- Report MsprofApi (operator API level info)
- Report MsprofCompactInfo (operator basic info: name, type, taskType, blockDim)
- Report MsprofAdditionalInfo (Tensor info: shape, format, dataType)
- Report Context ID info (for PMU data matching)
**V2 Profiling Initialization Flow**:
The `CannProfilerV2::Init` method (defined in `runtime/v2/subscriber/profiler/cann_profiler_v2.cc`) checks the initialization flag and the enable state, then calls `InitForCannDevice` to perform the full initialization: it initializes the hash mapping for operator names and types, deserializes DfxExtendInfo from the execute_graph's zero-copy property, traverses all execution nodes to initialize their basic info and Tensor info, and fills the shape info into the tensor info wrapper.
**V2 Profiling Data Reporting Flow**:
The `DoProf` method is called when operator execution ends. It first checks whether the node is of DavinciModel type and, if so, triggers the model's internal profiling data report. For normal operators, it obtains the end time and calls `MsprofReportApi` to report API-level info, then traverses the related nodes and calls `DoProfByNodeId` to report operator basic info and Tensor info. The `RecordNodeBasicInfo` method fills in and reports the `MsprofCompactInfo` structure.
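The event-driven structure described above (initialize on `kModelStart`, stamp the start time on `kExecuteStart`, report on `kExecuteEnd`) can be sketched as a small dispatcher. The three event names come from the text; the class `SketchProfiler`, the `NodeRecord` layout, the injected clock, and the choice to ignore events that arrive before `kModelStart` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Event names from the text; the enum itself and its values are assumptions.
enum ExecuteEvent { kModelStart, kExecuteStart, kExecuteEnd };

// Hypothetical per-operator record produced at kExecuteEnd.
struct NodeRecord {
  uint32_t node_id;
  uint64_t begin;
  uint64_t end;
};

class SketchProfiler {
 public:
  // The clock is injected so the example stays deterministic under test.
  explicit SketchProfiler(std::function<uint64_t()> clock) : clock_(std::move(clock)) {
    assert(static_cast<bool>(clock_));
  }

  void OnExecuteEvent(ExecuteEvent ev, uint32_t node_id) {
    switch (ev) {
      case kModelStart:  // stands in for Init()
        initialized_ = true;
        break;
      case kExecuteStart:  // stands in for RecordLaunchBeginTime()
        if (initialized_) {
          begin_ = clock_();
        }
        break;
      case kExecuteEnd:  // stands in for DoProf()
        if (initialized_) {
          reports_.push_back({node_id, begin_, clock_()});
        }
        break;
    }
  }

  const std::vector<NodeRecord> &Reports() const { return reports_; }

 private:
  std::function<uint64_t()> clock_;
  bool initialized_ = false;
  uint64_t begin_ = 0;
  std::vector<NodeRecord> reports_;
};
```

Keeping the profiler as a passive subscriber to executor events means the executor's hot path carries only the event dispatch, and the profiling work can be skipped entirely when the subscriber is not installed.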
### 4.6 Runtime V1 Hybrid Profiling
The V1 Hybrid executor uses `ProfilerCollector` to collect time at the model execution level.

Model execution flow (V1 Hybrid)
↓
ProfilerCollector::RecordStart(stream) records the model start
↓
- Report kModelExecute event
- Report StepTrace Start Tag
↓
Model execution
↓
ProfilerCollector::RecordEnd(stream) records the model end
↓
- Report StepTrace End Tag
- Report kModelExecute event
- Report GraphIdMap (graph_id to model_id mapping)
**V1 Hybrid Profiling Implementation**:
The `ProfilerCollector::RecordStart` method (defined in `runtime/v1/common/profiling/profiling_manager.cc`) is called when model execution starts. It checks the enable state, reports a `kModelExecute` event, and reports the StepTrace Start Tag to the specified stream.
The `ProfilerCollector::RecordEnd` method is called when model execution ends. It reports the StepTrace End Tag, the `kModelExecute` event, and the graph_id to model_id mapping, and finally increments step_id.
---
## 5. Profiling Data Flow Panorama
```mermaid
sequenceDiagram
participant User as User code
participant API as API layer
participant GE as GE framework
participant Compiler as Compiler
participant Runtime as Runtime
participant Device as Device side
participant Msprof as msprof library
participant Tool as MSProfiler tool
Note over User,Compiler: Initialization phase
User->>API: GEInitialize(options with profiling)
API->>GE: ProfilingInit::Init(options)
GE->>GE: Parse profilingMode/profilingOptions
GE->>Msprof: MsprofInit()
GE->>Msprof: MsprofRegisterCallback(GE, ProfCtrlHandle)
GE-->>API: SUCCESS
Note over User,Compiler: Compilation phase
User->>GE: BuildGraph / AddGraph
GE->>Compiler: ProfilingTaskUtils::FindProfilingTaskIndex()
Compiler->>Compiler: Find FP/BP points
Compiler->>Compiler: Insert ProfilerTrace Task
Compiler-->>GE: Compilation complete (with profiling tasks)
Note over User,Device: Execution phase - API layer
User->>API: aclmdlExecute()
API->>API: ACL_PROFILING_REG record start time
API->>GE: Execute model
API->>API: ~AclProfilingReporter record end time
API->>Msprof: MsprofReportApi()
Note over User,Device: Execution phase - Host layer
GE->>GE: PROFILING_SCOPE(InferShape)
GE->>GE: ScopeProfiler RAII record duration
GE->>GE: GlobalProfiler::Record()
Note over User,Device: Execution phase - Runtime V2
GE->>Runtime: CannProfilerV2::OnExecuteEvent(kExecuteStart)
Runtime->>Runtime: RecordLaunchBeginTime()
Runtime->>Device: Launch operator kernel
Device-->>Runtime: Operator execution complete
Runtime->>Runtime: CannProfilerV2::DoProf()
Runtime->>Msprof: MsprofReportApi()
Runtime->>Msprof: MsprofReportCompactInfo()
Runtime->>Msprof: MsprofReportAdditionalInfo()
Note over User,Device: Execution phase - Device layer
Device->>Device: Execute ProfilerTrace Task
Device->>Msprof: Report timestamps (FP/BP/AR/IterEnd)
Note over User,Tool: End phase
User->>API: aclgrphProfStop()
API->>GE: ProfilingManager::ProfStopProfiling()
GE->>GE: Cleanup ProfilingProperties
User->>API: aclgrphProfFinalize()
API->>Msprof: MsprofFinalize()
    Msprof->>Tool: Output profiling data files
```
## 6. Core Data Structures
### 6.1 ProfilingProperties (Global State Management)
The `ProfilingProperties` class (defined in `base/common/profiling/profiling_properties.h`) is the global state singleton of the profiling system. It manages all profiling switches and configurations, including the load/execute profiling switches, training trace switches, operator detail switches, task event switches, fp/bp point configurations, and device configuration data.
### 6.2 Profiling Event Enumeration
Defined in the `ge::profiling` namespace of `inc/framework/common/profiling_definitions.h`, this enumeration contains more than 80 profiling event types that cover the stages from API call to operator execution, including the ACL interface layer, the ACL internal layer, the executor layer, the static single-operator layer, the V2 executor layer, and the FFTS Plus layer.
### 6.3 GeProfInfoType (GE Level Profiling Info Type)
Defined in `inc/framework/runtime/subscriber/global_profiler.h` and divided by layer into three categories: Model level, Node level, and ACL level.
### 6.4 AclProfType (ACL API Profiling Type)
Defined in `api/acl/common/prof_api_reg.h` and divided by function into four categories: operator compilation, operator execution, model, and CBLAS. The categories are distinguished by different starting offsets.
## 7. Profiling Type and Enable Bits
The GE Profiling system uses a bitmask (enable_flags) to control which profiling types are enabled. The `GlobalProfilingWrapper::IsEnabled(ProfilingType profiling_type)` method checks the enable bit of the given type with a bitwise AND.
The main ProfilingTypes are:

| ProfilingType | Description | Collection Content |
|---|---|---|
| `kTaskTime` | Task time profiling | API call duration, operator execution time |
| `kGeHost` | GE Host layer profiling | Framework-internal duration such as InferShape and Tiling |
| `kDevice` | Device layer profiling | Operator basic info, Tensor info |
| `kCannHost` | CANN Host layer profiling | Host-side scheduling info |
| `kCannHostL1` | CANN Host L1 layer profiling | Finer-grained Host scheduling info |
| `kMemory` | Memory profiling | Static operator memory info |
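A bitmask check of this kind takes only a few lines. In the sketch below the type names come from the table, while the bit positions, the `kNum` sentinel, the 64-bit width, and the `EnableFlags` class are assumptions; the real `GlobalProfilingWrapper` may lay out its flags differently.

```cpp
#include <cassert>
#include <cstdint>

// Type names from the table; the numeric bit positions are assumptions.
enum class ProfilingType : uint32_t {
  kTaskTime = 0,
  kGeHost,
  kDevice,
  kCannHost,
  kCannHostL1,
  kMemory,
  kNum,  // hypothetical sentinel: number of types
};

class EnableFlags {
 public:
  void SetEnabled(ProfilingType type, bool on) {
    const uint64_t bit = BitOf(type);
    flags_ = on ? (flags_ | bit) : (flags_ & ~bit);
  }

  // Bitwise AND against the type's bit, as described in the text.
  bool IsEnabled(ProfilingType type) const { return (flags_ & BitOf(type)) != 0; }

 private:
  static uint64_t BitOf(ProfilingType type) {
    assert(type < ProfilingType::kNum);
    return uint64_t{1} << static_cast<uint32_t>(type);
  }
  uint64_t flags_ = 0;
};
```

Because each type owns one bit, several types can be enabled at once and the check on the hot path is a single load and mask.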
## 8. msprof Library Integration
The GE Profiling system depends on the external msprof library for data collection and reporting. CMake defines three targets through `Findmsprof.cmake`: `msprofiler_fwk_share` corresponds to the main library `libmsprofiler.so`, `profapi_share` corresponds to the Profiling API library `libprofapi.so`, and `msprof_headers` provides header paths such as `profiling/aprof_pub.h`.
**Dynamic Loading Mechanism**: `runtime/c/dbg/profiling/profiling_dynamic.c` uses `dlsym` to load msprof function pointers dynamically, including core functions such as `MsprofInit`, `MsprofFinalize`, `MsprofGetHashId`, `MsprofSysCycleTime`, `MsprofReportData`, `MsprofRegisterCallback`, and `MsprofNotifySetDevice`.
**Core Reporting Functions**:

| Function | Usage |
|---|---|
| `MsprofInit()` | Initialize the msprof library |
| `MsprofFinalize()` | End profiling and trigger the data flush to disk |
| `MsprofSysCycleTime()` | Get a high-precision timestamp (CPU cycle) |
| `MsprofGetHashId()` | Compute a string hash (to reduce data transmission) |
| `MsprofReportApi()` | Report API-level profiling data |
| `MsprofReportEvent()` | Report event-level profiling data |
| `MsprofReportCompactInfo()` | Report compact info (operator basic info) |
| `MsprofReportAdditionalInfo()` | Report additional info (Tensor info, Context ID) |
| `MsprofRegisterCallback()` | Register a control callback (for dynamic start/stop of profiling) |
## 9. Summary
GE's Profiling system is a performance collection framework built on layered collection, on-demand enablement, and unified reporting. It relies on the following core mechanisms:

- **Initialization**: Profiling is configured through options or environment variables, and `ProfilingInit` initializes the msprof library.
- **Compilation Period**: `ProfilingTaskUtils` inserts ProfilerTrace tasks into the computational graph for training trace collection.
- **API Layer**: The RAII-style `AclProfilingReporter` and `GraphProfilingReporter` collect API call duration.
- **Host Layer**: `ScopeProfiler` and `GlobalProfiler` collect framework-internal execution duration.
- **Runtime Layer**: `CannProfilerV2` (V2) and `ProfilerCollector` (V1) collect operator execution time and Tensor info.
- **Unified Reporting**: All data is reported through the msprof library to the MSProfiler tool for analysis.