Acceleration Library RingMLA C++ Demo

Introduction

This directory contains a C++ example of calling the acceleration library's RingMLA operator.

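The data generated in this example does not represent a real-world scenario. For a reference on data generation, see the Python test case directory under the repository root: tests/apitest/opstest/python/operations/ring_mla/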

RingMLA is supported only on the Atlas A2/A3 series.

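RingMLA:
- Basic scenario: for query and key, the matrices with and without the rope transform are passed in separately
- A fixed-shape 512x512 upper-triangular mask is passed in
- Builds and runs with the default build script
- This demo runs only on the Atlas A2/A3 series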
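
As an illustration of the fixed-shape mask mentioned in the list above, the following is a minimal host-side sketch. The function name `MakeUpperTriangularMask` is hypothetical, and the value convention (1 above the diagonal, 0 elsewhere) and the use of `float` instead of bf16 are assumptions; check the Python test cases for the values the operator actually expects.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a size x size upper-triangular mask in
// row-major order on the host.
// Assumption: 1.0f marks a masked position (col > row), 0.0f everywhere else.
std::vector<float> MakeUpperTriangularMask(std::size_t size)
{
    std::vector<float> mask(size * size, 0.0f);
    for (std::size_t row = 0; row < size; ++row) {
        // Only entries strictly above the diagonal are set.
        for (std::size_t col = row + 1; col < size; ++col) {
            mask[row * size + col] = 1.0f;
        }
    }
    return mask;
}
```

Calling `MakeUpperTriangularMask(512)` yields a buffer of 512 * 512 elements that would still need converting to bf16 and copying to the NPU before use.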

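RingMLA differs from most other operators: the first computation does not take a previously generated prevOut and prevLse, but every computation from the second one onward must. The two cases are therefore described separately: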

Parameter settings (first round):

Note: qkScale is set according to the combined query/key headSize before RingMLA applies the rope transform, i.e. 128 (nope) + 64 (rope) = 192
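
The arithmetic in the note above can be sketched as follows. The constant and function names are hypothetical, and the `1 / sqrt(headSize)` formula is the conventional attention scaling, assumed here rather than taken from the demo source.

```cpp
#include <cmath>

// Head sizes from the note above.
constexpr int kNopeHeadSize = 128;
constexpr int kRopeHeadSize = 64;
// Combined query/key head size: 128 + 64 = 192.
constexpr int kCombinedHeadSize = kNopeHeadSize + kRopeHeadSize;

// Assumption: qkScale follows the conventional 1 / sqrt(headSize) scaling.
inline float ComputeQkScale(int headSize)
{
    return 1.0f / std::sqrt(static_cast<float>(headSize));
}
```

Under this assumption, `ComputeQkScale(kCombinedHeadSize)` gives the value to assign to qkScale; scaling by 128 alone would ignore the rope part of the head.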

Data specifications:

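| Tensor name | Data type | Data format | Shape | cpu/npu |
| --- | --- | --- | --- | --- |
| `queryNope` | bf16 | nd | [1228, 16, 128] | npu |
| `queryRope` | bf16 | nd | [1228, 16, 64] | npu |
| `keyNope` | bf16 | nd | [828, 8, 128] | npu |
| `keyRope` | bf16 | nd | [828, 8, 64] | npu |
| `value` | bf16 | nd | [828, 8, 128] | npu |
| `mask` | bf16 | nd | [512, 512] | npu |
| `seqLen` | int32 | nd | [2, 3] | cpu |
| Output | | | | |
| `output` | bf16 | nd | [1228, 16, 128] | npu |
| `softmaxLse` | float | nd | [16, 1228] | npu |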

The first dimension of q is the total number of tokens, equal to sum(seqLen[0]); the first dimension of k and v equals sum(seqLen[1]).
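
The relationship between seqLen and the tensor shapes can be checked with a short sketch. The per-batch lengths below are hypothetical values chosen only so that they sum to 1228 and 828; the demo's actual lengths may differ, and the names `SeqLen`, `TotalTokens`, and `MakeExampleSeqLen` are illustrative.

```cpp
#include <array>
#include <cstdint>
#include <numeric>
#include <vector>

// seqLen has shape [2, batch]: row 0 holds the query lengths,
// row 1 holds the key/value lengths.
using SeqLen = std::array<std::vector<int32_t>, 2>;

// Sums one row of seqLen to get the first dimension of the matching tensors.
inline int64_t TotalTokens(const std::vector<int32_t> &lens)
{
    return std::accumulate(lens.begin(), lens.end(), int64_t{0});
}

// Hypothetical example values for a batch of 3 (not taken from the demo).
inline SeqLen MakeExampleSeqLen()
{
    SeqLen seqLen;
    seqLen[0] = {400, 400, 428}; // query lengths, sum = 1228
    seqLen[1] = {300, 300, 228}; // key/value lengths, sum = 828
    return seqLen;
}
```

With these values, `TotalTokens(seqLen[0])` matches the first dimension of `queryNope` and `queryRope`, and `TotalTokens(seqLen[1])` matches that of `keyNope`, `keyRope`, and `value`.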

Parameter settings (second round onward):

The calcType field in param must additionally be changed to CALC_TYPE_DEFAULT; all other settings stay the same.

Data specifications:

| Tensor name | Data type | Data format | Shape | cpu/npu |
| --- | --- | --- | --- | --- |
| `queryNope` | bf16 | nd | [1228, 16, 128] | npu |
| `queryRope` | bf16 | nd | [1228, 16, 64] | npu |
| `keyNope` | bf16 | nd | [828, 8, 128] | npu |
| `keyRope` | bf16 | nd | [828, 8, 64] | npu |
| `value` | bf16 | nd | [828, 8, 128] | npu |
| `mask` | bf16 | nd | [512, 512] | npu |
| `seqLen` | int32 | nd | [2, 3] | cpu |
| `prevOut` | bf16 | nd | [1228, 16, 128] | npu |
| `prevLse` | float | nd | [16, 1228] | npu |
| Output | | | | |
| `output` | bf16 | nd | [1228, 16, 128] | npu |
| `softmaxLse` | float | nd | [16, 1228] | npu |

The second round uses the output and softmaxLse newly produced by the first round as its prevOut and prevLse.
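
To give some intuition for why the previous output and log-sum-exp are carried forward, here is a minimal CPU sketch of the log-sum-exp merge commonly used in ring attention. Whether the RingMLA kernel performs exactly this computation is an assumption; `RingState` and `MergeRingStep` are hypothetical names, and `float` stands in for bf16.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side state mirroring the shapes in the tables above.
struct RingState {
    std::vector<float> out; // flattened [tokens][heads][headDim]
    std::vector<float> lse; // flattened [heads][tokens]
};

// Merges the partial result `cur` into the running state `prev` in place.
// Assumption: both partial outputs are softmax-normalized over their own keys,
// so they are recombined with weights exp(lse) (the usual ring-attention merge).
void MergeRingStep(RingState &prev, const RingState &cur,
                   std::size_t tokens, std::size_t heads, std::size_t headDim)
{
    for (std::size_t t = 0; t < tokens; ++t) {
        for (std::size_t h = 0; h < heads; ++h) {
            const std::size_t lseIdx = h * tokens + t;
            const float a = prev.lse[lseIdx];
            const float b = cur.lse[lseIdx];
            // Subtract the max before exponentiating for numerical stability.
            const float m = std::max(a, b);
            const float wa = std::exp(a - m);
            const float wb = std::exp(b - m);
            const float denom = wa + wb;
            const std::size_t base = (t * heads + h) * headDim;
            for (std::size_t d = 0; d < headDim; ++d) {
                prev.out[base + d] =
                    (wa * prev.out[base + d] + wb * cur.out[base + d]) / denom;
            }
            // New log-sum-exp covers the keys seen in both steps.
            prev.lse[lseIdx] = m + std::log(denom);
        }
    }
}
```

In this sketch the merged state plays the role of prevOut and prevLse for the next round, which is why the first round has nothing to merge with and takes no previous inputs.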