mhc_pre Performance Analysis

Comparison with mhc_post

For a typical configuration of N=4 streams, both operators have similar memory bandwidth requirements.

Tile size (in elements) is computed from UB_SIZE (192 KB):
Formula for fp32/fp16: tile = UB_SIZE / ((BUFFER_NUM * 2 + 1) * sizeof(T))
- Buffers with BUFFER_NUM = 2: inQue (2) + outQue (2) + tmpBuf (1) = 5 buffers
Formula for bf16: tile = UB_SIZE / ((BUFFER_NUM * 2 + 2) * sizeof(float))
- One additional fp32 accBuf holds the intermediate accumulation, so the divisor uses sizeof(float)
The tile is then aligned to 8 (fp32) or 16 (fp16/bf16) elements, which is 32 bytes in either case.
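
The tile-size arithmetic above can be checked with a short sketch. This is an illustration only, not the operator's actual code: the function name tile_elems is hypothetical, BUFFER_NUM = 2 is inferred from the buffer counts listed above, and alignment is assumed to round down so the tile still fits in UB.

```python
UB_SIZE = 192 * 1024   # bytes, from the notes above
BUFFER_NUM = 2         # assumption: matches inQue (2) + outQue (2)

ELEM_SIZE = {"fp32": 4, "fp16": 2, "bf16": 2}   # bytes per element
ALIGN = {"fp32": 8, "fp16": 16, "bf16": 16}     # elements, i.e. 32 bytes


def tile_elems(dtype, ub_size=UB_SIZE, buffer_num=BUFFER_NUM):
    """Return the aligned tile size in elements for the given dtype."""
    if dtype == "bf16":
        # bf16 adds an fp32 accBuf; the formula divides by sizeof(float)
        bytes_per_elem = (buffer_num * 2 + 2) * 4
    else:
        bytes_per_elem = (buffer_num * 2 + 1) * ELEM_SIZE[dtype]
    raw = ub_size // bytes_per_elem
    # Assumption: align down to the nearest multiple of the alignment
    return raw // ALIGN[dtype] * ALIGN[dtype]
```

Under these assumptions, tile_elems("fp32") gives 9824, tile_elems("fp16") gives 19648, and tile_elems("bf16") gives 8192 elements.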

Run ./perf_test and ./perf_sweep to collect hardware-specific data.