
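Operator List

Usage notes:

Operator directory: each directory is named after its operator in lowercase snake_case and holds all deliverables for that operator, including the code implementation, examples, and documentation. For the directory layout, see the project directory description.

Execution hardware unit: most operators run on AI Core and a few run on AI CPU. Unless stated otherwise, operators mentioned in this project are AI Core operators. For details on AI Core and AI CPU, see the Ascend C Operator Development guide (《Ascend C算子开发》). In versions 8.5.0 and later the relevant chapter is "Hardware Implementation" (硬件实现); in other versions it is "Concepts, Principles, and Terminology > Hardware Architecture and Data Processing Principles" (概念原理和术语 > 硬件架构与数据处理原理).

Operator API list: to simplify invocation, CANN provides a set of C APIs for executing operators, generally prefixed with aclnn. For the full set of interfaces, see the aclnn list.

V-version evolution: some operators have multiple V versions. Use the highest V version, since a higher version is compatible with all capabilities of the lower versions.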
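As an illustration of the "use the highest V version" rule, the following minimal sketch picks the highest-versioned directory for each operator. It assumes that V versions show up as a trailing `_v<N>` suffix on the directory name (as in `moe_init_routing_v3`) and that a name without a suffix counts as version 1; both are assumptions made for this example, and `split_version` and `highest_versions` are hypothetical helpers, not part of CANN.

```python
import re

# Assumption: a V version is a trailing "_v<number>" on the directory name.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+?)_v(?P<ver>\d+)$")


def split_version(directory):
    """Return (base_name, version); a name without a suffix counts as version 1."""
    match = _VERSION_SUFFIX.match(directory)
    if match:
        return match.group("base"), int(match.group("ver"))
    return directory, 1


def highest_versions(directories):
    """Map each base operator name to its highest-versioned directory name."""
    best = {}
    for name in directories:
        base, ver = split_version(name)
        if base not in best or ver > best[base][0]:
            best[base] = (ver, name)
    return {base: name for base, (_, name) in best.items()}


names = [
    "moe_init_routing", "moe_init_routing_v2", "moe_init_routing_v3",
    "mla_prolog", "mla_prolog_v2", "mla_prolog_v3",
    "moe_finalize_routing_v2_grad",  # "_v2" is not trailing, so it is kept as-is
]
latest = highest_versions(names)
print(latest["moe_init_routing"])
```

Names such as `moe_finalize_routing_v2_grad` are treated as their own operator because the version marker is not at the end of the name.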

All operator categories and operators provided by the project are listed below:

Category	Operator directory	op_kernel (implementation)	op_host (implementation)	op_api (aclnn call)	op_graph (graph mode call)	Execution hardware unit	Description
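attention	attention_update	✓	✓	✓	✗	AI Core	Updates the intermediate results lse and localOut, two local results output by the PA operator in each SP domain, into global results.
attention	attention_worker_scheduler	✓	✓	✓	✗	AI CPU	Data scanning operator on the Attention side when Attention and FFN are deployed separately.
attention	block_sparse_attention_grad	✓	✓	✓	✗	AI Core	Computes the backward output of attention in training scenarios, i.e. the backward computation of BlockSparseAttention.
attention	chunk_gated_delta_rule	✓	✓	✓	✗	AI Core	Chunk version of the Gated Delta Rule linear attention for inference scenarios, used on prefill nodes.
attention	flash_attention_score	✓	✓	✓	✗	AI Core	Computes self-attention using the FlashAttention algorithm.
attention	flash_attention_score_grad	✓	✓	✓	✗	AI Core	Computes the backward output of attention in training scenarios, i.e. the backward computation of FlashAttentionScore.
attention	fused_causal_conv1d	✓	✓	✓	✓	AI Core	Performs causal 1D convolution on sequences. Along the sequence dimension, cached data (of length kernel width minus 1) pads the head of each sequence so that the output depends on current and past inputs; after the convolution, part of the current sequence is written back to the cache; the original input is then added to the causal convolution output to form a residual connection. Supports APC (Automatic Prefix Caching), MTP (speculative decoding), and residual connections.
attention	fused_floyd_attention	✓	✓	✓	✗	AI Core	In training scenarios, computes multi-dimensional self-attention using the FloydAttention algorithm.
attention	fused_floyd_attention_grad	✓	✓	✓	✗	AI Core	In training scenarios, computes the backward output of Floyd attention. Compared with conventional FA, FloydAttn additionally treats seq as a batch axis when computing qk/pv attention, which turns the computation into a batchMatmul.
attention	fused_infer_attention_score	✓	✓	✓	✓	AI Core	FlashAttention operator for decode and prefill scenarios.
attention	gather_pa_kv_cache	✓	✓	✓	✓	AI Core	Based on the blockId values in blockTables and the key/value seqLen values in seqLens, gathers tokens that are non-contiguous in memory from keyCache/valueCache and concatenates them into contiguous key/value sequences.
attention	incre_flash_attention	✓	✓	✓	✓	AI Core	FlashAttention operator for incremental inference scenarios.
attention	inplace_fused_causal_conv1d	✓	✓	✓	✓	AI Core	Performs causal 1D convolution on sequences. Along the sequence dimension, cached data (of length kernel width minus 1) pads the head of each sequence so that the output depends on current and past inputs; after the convolution, part of the current sequence is written back to the cache; the original input is then added to the causal convolution output to form a residual connection. Supports APC (Automatic Prefix Caching), MTP (speculative decoding), residual connections, and in-place update.
attention	kv_quant_sparse_flash_attention	✓	✓	✗	✓	AI Core	Extends Sparse Flash Attention with support for Per-Token-Head-Tile-128 quantized inputs.
attention	lightning_indexer	✓	✓	✗	✓	AI Core	Obtains the Top-k positions for each token through a series of operations.
attention	lightning_indexer_grad	✓	✓	✓	✗	AI Core	Backward gradient operator for lightning_indexer.
attention	masked_causal_conv1d	✓	✓	✓	✓	AI Core	Performs masked causal 1D grouped convolution across tokens in the hidden layer.
attention	masked_causal_conv1d_backward	✓	✓	✓	✗	AI Core	Backward gradient computation of the 1D grouped convolution across tokens in the hidden layer.
attention	mla_preprocess	✓	✓	✗	✗	AI Core	MlaPreprocess operator for inference.
attention	mla_prolog	✓	✓	✗	✗	AI Core	MlaProlog operator for inference.
attention	mla_prolog_v2	✓	✓	✓	✗	AI Core	MlaPrologV2WeightNz operator for inference.
attention	mla_prolog_v3	✓	✓	✓	✗	AI Core	MlaPrologV3WeightNz operator for inference.
attention	nsa_compress	✓	✓	✓	✗	AI Core	In training scenarios, uses the NSA Compress algorithm to reduce the cost of long-context attention by compressing along the KV sequence dimension.
attention	nsa_compress_attention	✓	✓	✓	✗	AI Core	Compress attention and select topk index computation in NSA.
attention	nsa_compress_attention_infer	✓	✓	✓	✗	AI Core	Computes Compress Attention during Native Sparse Attention inference.
attention	nsa_compress_grad	✓	✓	✗	✗	AI Core	Backward computation of the aclnnNsaCompress operator.
attention	nsa_compress_with_cache	✓	✓	✓	✗	AI Core	Performs KV compression in the Native-Sparse-Attention inference stage.
attention	nsa_selected_attention_infer	✓	✓	✓	✗	AI Core	Computes Selected Attention during Native Sparse Attention inference.
attention	nsa_selected_attention	✓	✓	✓	✗	AI Core	In training scenarios, computes selected-attention in the NativeSparseAttention algorithm.
attention	sparse_lightning_indexer_grad_kl_loss	✓	✓	✓	✗	AI Core	The SparselightningIndexerGradKlLoss operator is the backward operator of LightningIndexer, with Loss computation additionally fused into its output.
attention	nsa_selected_attention_grad	✓	✓	✓	✗	AI Core	Selects and rearranges data of size selectedBlockSize from key and value according to topkIndices, then computes the backward output of attention in training scenarios.
attention	prompt_flash_attention	✓	✓	✓	✓	AI Core	FlashAttention operator for full inference scenarios.
attention	quant_lightning_indexer	✓	✓	✗	✓	AI Core	Preprocessing for SparseFlashAttention in inference scenarios: selects the key sparse tokens and quantizes the input query and key so that data is both stored and computed in 8-bit.
attention	recurrent_gated_delta_rule	✓	✓	✓	✓	AI Core	Recurrent Gated Delta Rule operator for incremental inference scenarios.
attention	ring_attention_update	✓	✓	✓	✗	AI Core	In training scenarios, updates the results of two FlashAttention computations.
attention	scatter_pa_cache	✓	✓	✓	✓	AI Core	Updates the key at specified positions in KCache.
attention	sparse_flash_attention	✓	✓	✗	✓	AI Core	Efficient attention computation module for inference with long sequence lengths.
attention	sparse_flash_attention_grad	✓	✓	✗	✗	AI Core	In training scenarios, computes the backward output of sparse_flash_attention.
attention	sparse_flash_mla_grad	✓	✓	✗	✗	AI Core	In training scenarios, computes the backward output of sparse_flash_mla.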
ffn	ffn	✓	✓	✓	✗	AI Core	Provides MoeFFN and FFN computation.
ffn	swin_attention_ffn	✓	✓	✗	✓	AI Core	FlashAttention operator for full inference scenarios.
ffn	ffn_worker_scheduler	✓	✓	✓	✗	AI CPU	Data scanning operator on the FFN side when Attention and FFN are separated.
ffn	ffn_worker_batching	✓	✓	✓	✓	AI Core	Data scanning and token reordering on the FFN side when Attention and FFN are separated.
ffn	swin_transformer_ln_qkv	✓	✓	✗	✓	AI Core	Computes Q, K, and V for the Swin Transformer network model with fp16 weights.
ffn	swin_transformer_ln_qkv_quant	✓	✓	✗	✓	AI Core	Computes Q, K, and V for the Swin Transformer network model.
gmm	grouped_matmul	✓	✓	✓	✗	AI Core	Performs grouped matrix multiplication.
gmm	grouped_matmul_add	✓	✓	✓	✗	AI Core	Performs grouped matrix multiplication; the dimensions of each group's matrix multiplication may differ.
gmm	grouped_matmul_finalize_routing	✓	✓	✓	✗	AI Core	Fused operator of GroupedMatmul and MoeFinalizeRouting; the GroupedMatmul output is combined according to the indices.
gmm	grouped_matmul_swiglu_quant	✓	✓	✓	✗	AI Core	Fuses GroupedMatmul, dequant, swiglu, and quant.
gmm	grouped_matmul_swiglu_quant_v2	✓	✓	✓	✗	AI Core	Fuses GroupedMatmul, dequant, swiglu, and quant; adds the MXFP8 quantization scenario (supported only on Ascend 950PR/Ascend 950DT AI processors).
gmm	quant_grouped_matmul_inplace_add	✓	✓	✓	✗	AI Core	Performs grouped matrix multiplication and addition.
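mc2	all_gather_matmul	✓	✓	✓	✓	AI Core	Fuses AllGather communication with MatMul computation.
mc2	allto_all_all_gather_batch_mat_mul	✓	✓	✓	✓	AI Core	Fuses and parallelizes AllToAll and AllGather collective communication with BatchMatMul computation.
mc2	allto_all_matmul	✓	✓	✓	✗	AI Core	Fuses AlltoAll communication with MatMul computation.
mc2	allto_allv_grouped_mat_mul	✓	✓	✓	✓	AI Core	Fuses AlltoAllv, Permute, and GroupedMatMul for routed experts and runs them in parallel with the shared-expert MatMul; communication first, then computation.
mc2	ffn_to_attention	✓	✓	✓	✓	AI Core	FFN nodes in a communication domain send data to Attention nodes and write status flags to check that the communication link is working.
mc2	attention_to_ffn	✓	✓	✓	✓	AI Core	Attention nodes in a communication domain send data to FFN nodes and write status flags to check that the communication link is working.
mc2	batch_mat_mul_reduce_scatter_allto_all	✓	✓	✓	✓	AI Core	Runs BatchMatMul computation in parallel with ReduceScatter and AllToAll collective communication.
mc2	distribute_barrier	✓	✓	✓	✓	AI Core	Synchronizes all cards in the communication domain. xRef is used only to build a Tensor dependency; the interface performs no operation on xRef.
mc2	distribute_barrier_extend	✓	✓	✗	✓	AI Core	Synchronizes all cards in the communication domain. xRef is used only to build a Tensor dependency; the interface performs no operation on xRef.
mc2	grouped_mat_mul_all_reduce	✓	✓	✓	✓	AI Core	Adds multi-card parallel AllReduce on top of fused GroupedMatMul; performs grouped matrix multiplication, where the dimensions of each group's matrix multiplication may differ.
mc2	grouped_mat_mul_allto_allv	✓	✓	✓	✓	AI Core	Fuses GroupedMatMul, Unpermute, and AlltoAllv for routed experts and runs them in parallel with the shared-expert MatMul; computation first, then communication.
mc2	inplace_matmul_all_reduce_add_rms_norm	✓	✓	✓	✓	AI Core	Performs the mm + all_reduce + add + rms_norm computation.
mc2	matmul_allto_all	✓	✓	✓	✗	AI Core	Fuses MatMul computation with AlltoAll communication.
mc2	matmul_all_reduce	✓	✓	✓	✓	AI Core	Fuses MatMul computation with AllReduce communication.
mc2	matmul_all_reduce_add_rms_norm	✓	✓	✓	✓	AI Core	Performs the mm + all_reduce + add + rms_norm computation.
mc2	matmul_reduce_scatter	✓	✓	✓	✓	AI Core	Performs the mm + reduce_scatter_base computation.
mc2	mega_moe	✓	✓	✗	✗	AI Core	Performs the end-to-end fused computation of dispatch + group_matmul1 + swiglu_quant + group_matmul2 + combine.
mc2	moe_distribute_combine	✓	✓	✓	✓	AI Core	With TP-domain communication: performs ReduceScatterV communication, then AlltoAllV communication, and finally merges the received data (multiply by weights, then sum). Without TP-domain communication: performs AlltoAllV communication, then merges the received data (multiply by weights, then sum).
mc2	moe_distribute_combine_add_rms_norm	✓	✓	✓	✓	AI Core	With TP-domain communication: performs ReduceScatterV communication, then AlltoAllV communication, and finally merges the received data (multiply by weights, then sum). Without TP-domain communication: performs AlltoAllV communication, then merges the received data (multiply by weights, then sum). The Add + RmsNorm fusion is performed afterwards.
mc2	moe_distribute_combine_v2	✓	✓	✓	✓	AI Core	With TP-domain communication: performs ReduceScatterV communication, then AlltoAllV communication, and finally merges the received data (multiply by weights, then sum). Without TP-domain communication: performs AlltoAllV communication, then merges the received data (multiply by weights, then sum).
mc2	moe_distribute_dispatch	✓	✓	✓	✓	AI Core	Optionally quantizes the Token data. With TP-domain communication: performs AllToAllV communication in the EP (Expert Parallelism) domain, then AllGatherV communication in the TP (Tensor Parallelism) domain. Without TP-domain communication: performs AllToAllV communication in the EP domain.
mc2	moe_distribute_dispatch_v2	✓	✓	✓	✓	AI Core	Optionally quantizes the Token data. With TP-domain communication: performs AllToAllV communication in the EP (Expert Parallelism) domain, then AllGatherV communication in the TP (Tensor Parallelism) domain. Without TP-domain communication: performs AllToAllV communication in the EP domain.
mc2	moe_distribute_dispatch_v3	✓	✓	✓	✓	AI Core	Optionally quantizes the Token data. With TP-domain communication: performs AllToAllV communication in the EP (Expert Parallelism) domain, then AllGatherV communication in the TP (Tensor Parallelism) domain. Without TP-domain communication: performs AllToAllV communication in the EP domain.
mc2	moe_update_expert	✓	✓	✓	✓	AI Core	Maps the logical expert IDs of each token's topK experts to physical card IDs.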
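moe	moe_compute_expert_tokens	✓	✓	✗	✓	AI Core	In MoE computation, uses binary search to find the position of the last row handled by each expert.
moe	moe_finalize_routing	✓	✓	✗	✓	AI Core	In MoE computation, the final step that merges the MoE FFN outputs.
moe	moe_finalize_routing_v2	✓	✓	✓	✓	AI Core	In MoE computation, the final step that merges the MoE FFN outputs.
moe	moe_finalize_routing_v2_grad	✓	✓	✓	✓	AI Core	Backward propagation of aclnnMoeFinalizeRoutingV2.
moe	moe_gating_top_k	✓	✓	✓	✓	AI Core	In MoE computation, applies Sigmoid to the input x, sorts the results by group, and selects the top k experts based on the grouped sorting.
moe	moe_gating_top_k_backward	✓	✓	✓	✗	AI Core	Backward operator of MoeGatingTopK.
moe	moe_gating_top_k_softmax	✓	✓	✗	✓	AI Core	In MoE computation, applies Softmax to the output of x and then takes the TopK.
moe	moe_gating_top_k_softmax_v2	✓	✓	✓	✓	AI Core	In MoE computation: if renorm=0, applies Softmax to the output of x and then takes the topk; if renorm=1, takes the topk of the output of x and then applies Softmax.
moe	moe_init_routing	✓	✓	✓	✓	AI Core	MoE routing computation; performs routing based on the result of aclnnMoeGatingTopKSoftmax.
moe	moe_init_routing_quant	✓	✓	✗	✓	AI Core	MoE routing computation; performs routing based on the result of aclnnMoeGatingTopKSoftmax and quantizes the result.
moe	moe_init_routing_quant_v2	✓	✓	✓	✓	AI Core	MoE routing computation; performs routing based on the result of aclnnMoeGatingTopKSoftmaxV2.
moe	moe_init_routing_v2	✓	✓	✓	✓	AI Core	Takes the outputs x and expert_idx of the MoeGatingTopKSoftmax operator as inputs and outputs the Routing matrix expanded_x and other results for subsequent computation.
moe	moe_init_routing_v2_grad	✓	✓	✓	✓	AI Core	Backward propagation of aclnnMoeInitRoutingV2; computes the weighted sum of tokens.
moe	moe_init_routing_v3	✓	✓	✓	✓	AI Core	MoE routing computation; performs routing based on the result of aclnnMoeGatingTopKSoftmaxV2 and supports both no-quantization and dynamic-quantization modes.
moe	moe_re_routing	✓	✓	✗	✓	AI Core	In MoE networks, after the AlltoAll operation fetches the tokens to be computed from other cards, rearranges the tokens in expert order.
moe	moe_token_permute	✓	✓	✓	✓	AI Core	MoE permute computation; broadcasts and sorts tokens according to indices.
moe	moe_token_permute_grad	✓	✓	✓	✓	AI Core	Backward propagation of aclnnMoeTokenPermute.
moe	moe_token_permute_with_ep	✓	✓	✗	✗	AI Core	MoE permute computation; broadcasts tokens and the optional probs according to indices, sorts them, and slices them by the range in rangeOptional.
moe	moe_token_permute_with_ep_grad	✓	✓	✗	✗	AI Core	Backward propagation of aclnnMoeTokenPermuteWithEp.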
moe	moe_token_permute_with_routing_map	✓	✓	✓	✗	AI Core	MoE permute computation; broadcasts tokens and the optional probs according to indices, sorts them, and slices them by the range in rangeOptional.
moe	moe_token_permute_with_routing_map_grad	✓	✓	✓	✗	AI Core	Backward propagation of aclnnMoeTokenPermuteWithRoutingMap.
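moe	moe_token_unpermute	✓	✓	✓	✓	AI Core	Fetches the input data stored in permutedTokens using the indices stored in sortedIndices; if probs is present, permutedTokens is multiplied by probs; finally accumulates the sum and outputs the result.
moe	moe_token_unpermute_grad	✓	✓	✓	✓	AI Core	Backward propagation of aclnnMoeTokenUnpermuteGrad.
moe	moe_token_unpermute_with_ep	✓	✓	✗	✗	AI Core	Fetches the input data in permutedTokens at the index positions stored in sortedIndices, multiplies it by probs, and merges the results by accumulation.
moe	moe_token_unpermute_with_ep_grad	✓	✓	✗	✗	AI Core	Backward propagation of aclnnMoeTokenUnpermuteWithEp.
moe	moe_token_unpermute_with_routing_map	✓	✓	✓	✗	AI Core	Accumulates the permutedTokens processed by aclnnMoeTokenpermuteWithRoutingMap back into the original unpermutedTokens. Fetches the input data stored in permutedTokens using the indices stored in sortedIndices; if probs is present, permutedTokens is multiplied by probs; finally accumulates the sum and outputs the result.
moe	moe_token_unpermute_with_routing_map_grad	✓	✓	✗	✗	AI Core	Backward propagation of aclnnMoeTokenUnpermuteWithRoutingMap.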
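mhc	mhc_sinkhorn	✓	✓	✓	✓	AI Core	Uses the Sinkhorn-Knopp iterative algorithm to project the hyper-connection mixing matrix onto the manifold of doubly stochastic matrices, which stabilizes signal propagation in deep networks and addresses vanishing/exploding gradients.
mhc	mhc_sinkhorn_backward	✓	✓	✓	✗	AI Core	Backward operator of mhc_sinkhorn.
mhc	mhc_post	✓	✓	✓	✓	AI Core	Through a series of computations, applies Post Mapping to the previous layer's output and Res Mapping to the previous layer's input in the mHC architecture, then combines the two with a residual connection to obtain the next layer's input.
mhc	mhc_post_backward	✓	✓	✓	✓	AI Core	mhc_post applies Post Mapping to the previous layer's output and Res Mapping to the previous layer's input in the mHC architecture, then combines the two with a residual connection to obtain the next layer's input. This operator implements the backward pass of that process.
mhc	mhc_pre	✓	✓	✓	✓	AI Core	Through a series of computations, obtains the $H^{res}$ and $H^{post}$ projection matrices of the hidden layer in the MHC architecture and the input matrix $h^{in}$ of the Attention or MLP layer.
mhc	mhc_pre_backward	✓	✓	✓	✗	AI Core	Backward propagation of the mhc_pre operator; obtains the gradients of the hidden layer in the MHC architecture through a series of computations.
mhc	mhc_pre_sinkhorn	✓	✓	✓	✓	AI Core	The mhc_pre_sinkhorn operator; supports output of the Hin, Hpre, Hres, and Hpost matrices in the MHC architecture.
mhc	mhc_pre_sinkhorn_backward	✓	✓	✓	✓	AI Core	Backward propagation of the mhc_pre_sinkhorn operator; obtains the gradients of the hidden layer in the MHC architecture through a series of computations.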
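posembedding	apply_rotary_pos_emb	✓	✓	✓	✓	AI Core	Performs rotary position embedding; to improve inference network performance, the query and key operators are fused into a single path.
posembedding	dequant_rope_quant_kvcache	✓	✓	✗	✓	AI Core	Optionally dequantizes the input tensor (x), splits its last axis according to `sizeSplits` (the split lengths) into q, k, and vOut, applies rotary position embedding to q and k to produce qOut and kOut, then quantizes kOut and vOut and updates them into kCacheRef and vCacheRef according to `indices`.
posembedding	interleave_rope	✓	✓	✓	✓	AI Core	Applies rotary position embedding to a single input x.
posembedding	qkv_rms_norm_rope_cache	✓	✓	✓	✓	AI Core	Takes a fused qkv tensor as input, splits it into q, k, and v tensors via SplitVD, performs the fused RmsNorm, ApplyRotaryPosEmb, Quant, and Scatter operations, and outputs q_out, k_cache, v_cache, q_out_before_quant (optional), k_out_before_quant (optional), and v_out_before_quant (optional).
posembedding	rope_quant_kvcache	✓	✓	✗	✓	AI Core	Splits the last axis of the input tensor.
posembedding	rope_with_sin_cos_cache	✓	✓	✓	✓	AI Core	To improve inference network performance, takes the sin and cos inputs from a cache and performs rotary position embedding.
posembedding	rotary_position_embedding	✓	✓	✓	✓	AI Core	Performs single-path rotary position embedding.
posembedding	rotary_position_embedding_grad	✓	✓	✗	✓	AI Core	Performs the backward computation of single-path rotary position embedding.
posembedding	kv_rms_norm_rope_cache	✓	✓	✓	✓	AI Core	Splits the last axis of the input tensor (kv): the left half is used for rms_norm and the right half for rope; the results are then scattered into two separate caches.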
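
The table describes fused_causal_conv1d and inplace_fused_causal_conv1d only in prose. The sketch below is a plain-Python reference for that description (pad with the cache, convolve, refresh the cache, add the residual). It is not the operator's implementation or its aclnn interface: the function name `causal_conv1d_with_cache`, the depthwise (per-channel) weight layout, and the choice to refresh the cache with the last kernel_width - 1 rows are all assumptions made for illustration, and APC/MTP handling is not modeled.

```python
def causal_conv1d_with_cache(x, weight, cache, residual=True):
    """Reference sketch (hypothetical helper, not the CANN API).

    x:      seq_len rows, each a list of `dim` floats (current tokens).
    weight: kernel_width rows, each a list of `dim` floats (assumed depthwise).
    cache:  kernel_width - 1 rows holding the most recent past inputs.
    Returns (output, new_cache).
    """
    k = len(weight)
    if len(cache) != k - 1:
        raise ValueError("cache must hold kernel_width - 1 rows")
    dim = len(weight[0])
    padded = cache + x  # history first, then the current tokens
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]  # current token plus its k - 1 predecessors
        row = [sum(weight[j][c] * window[j][c] for j in range(k))
               for c in range(dim)]
        if residual:
            row = [row[c] + x[t][c] for c in range(dim)]  # residual connection
        out.append(row)
    # Assumed cache update: keep the last k - 1 rows seen so far.
    new_cache = padded[len(padded) - (k - 1):]
    return out, new_cache


x = [[1.0], [2.0], [3.0]]
weight = [[1.0], [1.0], [1.0]]  # kernel width 3, one channel
cache0 = [[0.0], [0.0]]
full_out, full_cache = causal_conv1d_with_cache(x, weight, cache0)

# Feeding the same tokens in two chunks, carrying the cache between calls.
out_a, cache_a = causal_conv1d_with_cache(x[:2], weight, cache0)
out_b, cache_b = causal_conv1d_with_cache(x[2:], weight, cache_a)
print(full_out, out_a + out_b)
```

Carrying the cache between calls is what lets a chunked or token-by-token run reproduce the result of processing the whole sequence at once, which is the property the cache padding in the description is meant to provide.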