mhc_pre Performance Analysis

Comparison with mhc_post

Metric     mhc_post                 mhc_pre
Read       1× input [B,S,D]         N× input [B×N,S,D]
Compute    1 Muls per element       N Muls + (N-1) Adds per element
Write      N× output [B×N,S,D]      1× output [B,S,D]
blockDim   B × N                    B

Memory Bandwidth Analysis

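For a typical N = 4 streams, counting traffic in units of one [B,S,D] tensor:

  • mhc_post: reads B×S×D, writes B×N×S×D elements → (1 + N) = 5× data movement
  • mhc_pre: reads B×N×S×D, writes B×S×D elements → (N + 1) = 5× data movement

Both operators move the same total volume of data, so their memory bandwidth requirements are similar. They differ only in direction: mhc_post is write-dominated, mhc_pre is read-dominated.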
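
The arithmetic above can be checked with a small host-side helper. This is only a sketch: the names Traffic, MhcPostTraffic and MhcPreTraffic are hypothetical and not part of the operators, and the model counts tensor traffic only (any scale factors or other side inputs are ignored).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: bytes read from and written to global memory.
struct Traffic {
    std::uint64_t readBytes;
    std::uint64_t writeBytes;
    std::uint64_t Total() const { return readBytes + writeBytes; }
};

// mhc_post: reads one [B,S,D] tensor, writes N of them ([B×N,S,D]).
inline Traffic MhcPostTraffic(std::uint64_t b, std::uint64_t n, std::uint64_t s,
                              std::uint64_t d, std::uint64_t elemBytes) {
    assert(n >= 1);
    const std::uint64_t base = b * s * d * elemBytes;  // bytes in one [B,S,D] tensor
    return Traffic{base, n * base};
}

// mhc_pre: reads N [B,S,D] tensors ([B×N,S,D]), writes one.
inline Traffic MhcPreTraffic(std::uint64_t b, std::uint64_t n, std::uint64_t s,
                             std::uint64_t d, std::uint64_t elemBytes) {
    assert(n >= 1);
    const std::uint64_t base = b * s * d * elemBytes;
    return Traffic{n * base, base};
}
```

For any shape, both totals equal (N + 1) × B×S×D×sizeof(T), which is where the 5× figure for N = 4 comes from.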

Optimization Notes

Dynamic Tiling

  • Tile size is computed from UB_SIZE (192 KB)
  • Formula for fp32/fp16: tile = UB_SIZE / ((BUFFER_NUM * 2 + 1) * sizeof(T))
    • Buffers, with BUFFER_NUM = 2: inQue (2) + outQue (2) + tmpBuf (1) = 5 buffers
  • Formula for bf16: tile = UB_SIZE / ((BUFFER_NUM * 2 + 2) * sizeof(float))
    • One additional fp32 accBuf for intermediate accumulation, giving 6 buffers sized in fp32
  • Tile is aligned to 8 (fp32) or 16 (fp16/bf16) elements, i.e. 32 bytes in both cases
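
As a worked example of the formulas above, the following host-side sketch evaluates them for each data type. It assumes BUFFER_NUM = 2 (implied by the 5-buffer count), that 192 KB means 192 × 1024 bytes, and that alignment rounds the tile down; the function names are hypothetical and do not come from the kernel source.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t UB_SIZE = 192 * 1024;  // unified buffer size in bytes (assumed 192 KiB)
constexpr std::size_t BUFFER_NUM = 2;        // assumed double buffering

// Round v down to a multiple of a (assumption: tiles are aligned downward).
inline std::size_t AlignDown(std::size_t v, std::size_t a) { return v / a * a; }

// fp32 / fp16 path: inQue (2) + outQue (2) + tmpBuf (1) buffers of type T.
inline std::size_t TileElems(std::size_t elemBytes) {
    assert(elemBytes == 2 || elemBytes == 4);
    const std::size_t raw = UB_SIZE / ((BUFFER_NUM * 2 + 1) * elemBytes);
    return AlignDown(raw, 32 / elemBytes);  // 8 elements for fp32, 16 for fp16
}

// bf16 path: one extra fp32 accBuf, all buffers budgeted at sizeof(float).
inline std::size_t TileElemsBf16() {
    const std::size_t raw = UB_SIZE / ((BUFFER_NUM * 2 + 2) * sizeof(float));
    return AlignDown(raw, 16);
}
```

Under these assumptions the tile sizes come out to 9824 elements for fp32, 19648 for fp16, and 8192 for bf16.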

BF16 Handling

  • BF16 lacks native Muls support on some AI cores
  • Compute path: Cast(bf16→fp32) → Muls → Add → Cast(fp32→bf16)
  • Accumulation done in fp32 for numerical stability
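
The compute path above can be emulated on the host to build a reference for checking kernel output. This sketch is plain C++, not Ascend C: it assumes mhc_pre forms a per-stream scaled sum of N streams (as the N Muls + (N-1) Adds count suggests), stores bf16 values as raw 16-bit patterns, and uses round-to-nearest-even when narrowing. All names are hypothetical, and NaN handling is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using bf16_t = std::uint16_t;  // raw bf16 bit pattern (upper 16 bits of an fp32)

// Cast(bf16 -> fp32): exact, just widen the bit pattern.
inline float Bf16ToFloat(bf16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Cast(fp32 -> bf16): round to nearest even (NaN inputs not handled).
inline bf16_t FloatToBf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<bf16_t>(bits >> 16);
}

// out[i] = sum over n of weights[n] * streams[n][i], accumulated in fp32.
inline std::vector<bf16_t> WeightedSumBf16(
        const std::vector<std::vector<bf16_t>>& streams,
        const std::vector<float>& weights) {
    assert(streams.size() == weights.size());
    const std::size_t len = streams.empty() ? 0 : streams[0].size();
    std::vector<float> acc(len, 0.0f);  // plays the role of the fp32 accBuf
    for (std::size_t n = 0; n < streams.size(); ++n) {
        assert(streams[n].size() == len);
        for (std::size_t i = 0; i < len; ++i) {
            const float scaled = weights[n] * Bf16ToFloat(streams[n][i]);  // Cast + Muls
            acc[i] = (n == 0) ? scaled : acc[i] + scaled;                  // Add (N-1 times)
        }
    }
    std::vector<bf16_t> out(len);
    for (std::size_t i = 0; i < len; ++i) out[i] = FloatToBf16(acc[i]);   // single final Cast
    return out;
}
```

Keeping the accumulator in fp32 means rounding to bf16 happens once per output element rather than after every Add.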

Benchmark Results

Run ./perf_test and ./perf_sweep to collect hardware-specific data.