mhc_res Performance
Optimization Strategy
- Multi-core parallelism: each AI Core handles one or more (batch, target_stream) pairs
- Dynamic tiling: the tile size is computed from the Unified Buffer (UB) size and the element dtype
- Double buffering: data movement overlaps with computation
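
As an illustration of the multi-core split, the sketch below assigns (batch, target_stream) pairs to cores. It is not taken from the kernel source: the helper name `PairsForCore`, the flattening order, and the contiguous near-equal split are all assumptions, and the real kernel may map pairs differently.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: lists the (batch, target_stream) pairs owned by one core.
// Assumption: pairs are flattened as batch * numStreams + targetStream and split
// into contiguous, near-equal ranges across cores.
std::vector<std::pair<uint32_t, uint32_t>> PairsForCore(
    uint32_t coreIdx, uint32_t coreNum, uint32_t batch, uint32_t numStreams)
{
    const uint32_t totalPairs = batch * numStreams;
    const uint32_t base = totalPairs / coreNum;   // pairs every core receives
    const uint32_t extra = totalPairs % coreNum;  // first `extra` cores take one more
    const uint32_t begin = coreIdx * base + (coreIdx < extra ? coreIdx : extra);
    const uint32_t count = base + (coreIdx < extra ? 1u : 0u);

    std::vector<std::pair<uint32_t, uint32_t>> pairs;
    pairs.reserve(count);
    for (uint32_t p = begin; p < begin + count; ++p) {
        // Recover (batch index, target stream index) from the flat pair index.
        pairs.emplace_back(p / numStreams, p % numStreams);
    }
    return pairs;
}
```

With batch=8 and 4 streams there are 32 pairs; on a hypothetical 20-core device the first 12 cores would each take two pairs and the remaining 8 cores one pair.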
Tile Size Calculation
// 5 buffers: inQue(2) + outQue(2) + tmpBuf(1)
tile = UB_SIZE / (5 * sizeof(T))
For bf16, budget 6 buffers instead of 5; the extra one is an fp32 accumulator.
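
The calculation above can be sketched as a small host-side helper. The name `TileElems`, the 32-byte round-down, and the 192 KiB UB size used in the usage note are assumptions for illustration, not values from the kernel. The sketch follows the buffer count as stated; because an fp32 accumulator is wider than a bf16 element, a byte-exact budget for bf16 could be stricter.

```cpp
#include <cstdint>

// Assumption: tiles are rounded down to a 32-byte multiple, a common
// alignment unit for UB data movement; adjust if the kernel differs.
const uint32_t kAlignBytes = 32;

// Hypothetical helper mirroring tile = UB_SIZE / (bufferCount * sizeof(T)).
// elemBytes stands in for sizeof(T); isBf16 selects 6 buffers instead of 5.
// Returns the tile size in elements.
uint32_t TileElems(uint32_t ubBytes, uint32_t elemBytes, bool isBf16)
{
    const uint32_t bufferCount = isBf16 ? 6u : 5u;
    const uint32_t rawTile = ubBytes / (bufferCount * elemBytes);
    const uint32_t elemsPerBlock = kAlignBytes / elemBytes;
    return rawTile / elemsPerBlock * elemsPerBlock;  // round down to an aligned count
}
```

For example, with an assumed UB of 192 KiB, `TileElems(192 * 1024, 2, false)` gives the fp16 tile and `TileElems(192 * 1024, 2, true)` the smaller bf16 tile.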
Benchmark vs einsum
| Config | einsum | AscendC | Speedup |
|---|---|---|---|
| batch=8, seq=1024, dim=512, ns=4 | 0.44ms | 0.18ms | 2.5x |
| batch=16, seq=256, dim=1024, ns=8 | 1.11ms | 0.36ms | 3.1x |
Memory Bandwidth
- Read (input): batch × num_streams × seq × dim × sizeof(T)
- Read (weights): num_streams × num_streams × sizeof(T)
- Write (output): batch × num_streams × seq × dim × sizeof(T)
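
To turn these terms into numbers, the sketch below evaluates the traffic for a given shape and derives an effective bandwidth from a measured latency. `Traffic`, `EstimateTraffic`, and `EffectiveGBps` are hypothetical names. The benchmark dtype is not stated, so the element size must be supplied by the caller, and treating `ns` in the benchmark table as num_streams is an assumption.

```cpp
#include <cstdint>

struct Traffic {
    uint64_t readBytes;   // input tensor plus weights
    uint64_t writeBytes;  // output tensor
};

// Hypothetical helper: evaluates the three terms listed above.
// elemBytes stands in for sizeof(T).
Traffic EstimateTraffic(uint64_t batch, uint64_t numStreams, uint64_t seq,
                        uint64_t dim, uint64_t elemBytes)
{
    const uint64_t tensorBytes = batch * numStreams * seq * dim * elemBytes;
    const uint64_t weightBytes = numStreams * numStreams * elemBytes;
    return Traffic{tensorBytes + weightBytes, tensorBytes};
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a latency given in ms.
double EffectiveGBps(const Traffic& t, double latencyMs)
{
    const double totalBytes = static_cast<double>(t.readBytes + t.writeBytes);
    return totalBytes / (latencyMs * 1e-3) / 1e9;
}
```

Calling `EstimateTraffic(8, 4, 1024, 512, 2)` models the first benchmark row under an assumed 2-byte dtype; the weight term is negligible next to the tensor reads and writes, so the kernel's traffic is dominated by streaming the tensor once in and once out.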