FusedInferAttentionScore

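Product Support

Product	Supported
Ascend 950PR/Ascend 950DT	√
Atlas A3 training series products/Atlas A3 inference series products	√
Atlas A2 training series products/Atlas A2 inference series products	√
Atlas 200I/500 A2 inference products	×
Atlas inference series accelerator card products	×
Atlas training series products	×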

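Function Description

Operator function: a FlashAttention operator adapted to both incremental and full inference scenarios. It supports the full computation scenario (PromptFlashAttention) as well as the incremental computation scenario (IncreFlashAttention).
Formula: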

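Self-attention builds an attention model from the relationships within the input sample itself. Assume an input sample sequence $x$ of length $n$, where each element of $x$ is a $d$-dimensional vector that can be viewed as a token embedding. Passing this sequence through 3 weight matrices yields 3 matrices of dimension $n \times d$.

The self-attention formula is generally defined as follows, where $Q$, $K$, and $V$ are key attributes of the input sample, obtained by applying spatial transformations to it, and can be unified in a single feature space. "Attention" in the formula and in the operator name is short for "self-attention".
$Attention(Q,K,V)=Score(Q,K)V$
In this operator the Score function is Softmax, so the self-attention formula is:
$Attention(Q,K,V)=Softmax(\frac{QK^T}{\sqrt{d}})V$
Here, the product of $Q$ and $K^T$ represents the attention of the input $x$. To keep this value from growing too large, it is usually scaled by dividing by the square root of $d$ and then softmax-normalized row by row. Multiplying the result by $V$ yields an $n \times d$ matrix.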
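
As a reading aid, the formula above can be sketched in plain Python. This is only a single-head reference computation on nested lists, not the operator's implementation; the function names `softmax` and `attention` are illustrative and do not come from this repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    row_max = max(row)
    exps = [math.exp(value - row_max) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Compute Softmax(Q K^T / sqrt(d)) V for one head.

    q is an n x d matrix, k and v are m x d matrices (lists of rows).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # One row of Q K^T, scaled by 1 / sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the rows of V.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out
```

Each output row is a convex combination of the rows of `v`, which is why the result keeps the $n \times d$ shape described above.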

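Parameter Description

Parameter	Input/Output	Description	Data Type	Data Format
query	Input	Input Q in the formula.	FLOAT16, BFLOAT16, INT8	ND
key	Input	Input K in the formula.	FLOAT16, BFLOAT16, INT8, INT4	ND
value	Input	Input V in the formula.	FLOAT16, BFLOAT16, INT8, INT4	ND
attentionOut	Output	Output of the formula.	FLOAT16, BFLOAT16, INT8	ND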

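Constraints

When this interface is used together with PyTorch, the versions of the CANN packages and the PyTorch packages must match.
Handling of empty inputs: the operator checks internally whether query is empty and returns immediately if so. If query is a non-empty tensor while key and value are empty tensors (that is, S2 is 0), attentionOut is filled with zeros. If attentionOut is an empty tensor, the AscendCLNN framework handles it. For the other inputs marked "nullptr allowed" in the parameter description above, no processing is done when a null pointer is passed.
The shapes of the corresponding tensors in key and value must be identical. In the non-contiguous scenario, each tensor in the key and value tensorlists must have a batch of 1, the number of tensors must equal B of query, and N and D must match. Because of a tensorlist limit, B cannot exceed 256 in the non-contiguous scenario.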
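When Q_S is greater than 1, the following restrictions apply to the query, key, and value inputs:
- The B axis must be less than or equal to 65536.
- If the input type is INT8 and the D axis is not 32-byte aligned, the B axis supports at most 128. If the input type is FLOAT16 or BFLOAT16 and the D axis is not 16-byte aligned, the B axis likewise supports at most 128.
- The N axis must be less than or equal to 256 and the D axis less than or equal to 512. When inputLayout is BSH or BSND, N*D is recommended to be less than 65535.
- S must be less than or equal to 20971520 (20M). In some long-sequence scenarios, an excessive amount of computation can cause the PFA (PromptFlashAttention) operator to time out (an aicore error whose errorStr is "timeout or trap error"); splitting along S is recommended in that case. The amount of computation depends on B, S, N, D, and so on: the larger the values, the more computation. Typical long-sequence scenarios that time out (that is, where the product of B, S, N, and D is large) include but are not limited to:

  B	Q_N	Q_S	D	KV_N	KV_S
  1	20	2097152	256	1	2097152
  1	2	20971520	256	2	20971520
  20	1	2097152	256	1	2097152
  1	10	2097152	512	1	2097152

- D-axis restrictions: when the type of query, key, value, or attentionOut includes INT8, the D axis must be 32-aligned; when it includes INT4, the D axis must be 64-aligned; when all types are FLOAT16 or BFLOAT16, the D axis must be 16-aligned.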
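
The upper bounds and D-axis alignment rules listed for Q_S greater than 1 can be checked before calling the operator. The sketch below is a hypothetical pre-check written for illustration: `check_prefill_shape` and `D_ALIGNMENT` are not part of the aclnn API, and the helper does not model the separate rule that caps B at 128.

```python
# Hypothetical pre-check; the numbers are copied from the constraint list above.
D_ALIGNMENT = {"INT4": 64, "INT8": 32, "FLOAT16": 16, "BFLOAT16": 16}

MAX_B = 65536
MAX_N = 256
MAX_D = 512
MAX_S = 20971520


def check_prefill_shape(b, n, s, d, dtypes):
    """Return a list of violated constraints; an empty list means none were found.

    dtypes holds the type names of query, key, value, and attentionOut.
    """
    problems = []
    if b > MAX_B:
        problems.append(f"B={b} exceeds {MAX_B}")
    if n > MAX_N:
        problems.append(f"N={n} exceeds {MAX_N}")
    if d > MAX_D:
        problems.append(f"D={d} exceeds {MAX_D}")
    if s > MAX_S:
        problems.append(f"S={s} exceeds {MAX_S}")
    # The strictest alignment among the involved types applies:
    # INT4 -> 64, INT8 -> 32, all FLOAT16/BFLOAT16 -> 16.
    align = max(D_ALIGNMENT[t] for t in dtypes)
    if d % align != 0:
        problems.append(f"D={d} is not {align}-aligned")
    return problems
```

For example, `check_prefill_shape(1, 20, 4096, 128, ["FLOAT16"])` returns an empty list, while a D of 48 with INT8 inputs is reported as not 32-aligned.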
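When Q_S equals 1, the following restrictions apply to the query, key, and value inputs:
- The B axis must be less than or equal to 65536, the N axis less than or equal to 256, and the D axis less than or equal to 512.
- The scenario in which query, key, and value are all INT8 is not yet supported.
- In the INT4 pseudo-quantization scenario, an aclnn single-operator call accepts KV either as INT4 input or as INT4 values packed into INT32 input (generating the INT4-format data through dynamicQuant is recommended, because dynamicQuant stores 8 INT4 values in one INT32).
- In the INT4 pseudo-quantization scenario, if KV is passed as INT4 values packed into INT32, the N, D, or H of KV is one-eighth of the actual value (the same applies to prefix).
- key and value have D-axis restrictions for specific data types:
  - When key and value are of type INT4 (INT32), the D axis must be 64-aligned (for INT32 input, D must be 8-aligned).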
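
To illustrate why a packed KV tensor reports one-eighth of the real D, the sketch below packs eight signed INT4 values into one INT32 word. The nibble order (first value in the lowest 4 bits) is an assumption made for this example; the source does not state the order dynamicQuant uses, and both function names are hypothetical.

```python
def pack_int4_to_int32(values):
    """Pack signed INT4 values (-8..7) into INT32 words, eight per word.

    Assumed order: the first value of each group goes in the lowest nibble.
    """
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for index, value in enumerate(values[start:start + 8]):
            if not -8 <= value <= 7:
                raise ValueError(f"{value} does not fit in INT4")
            word |= (value & 0xF) << (4 * index)
        # Reinterpret the 32-bit pattern as a signed integer.
        if word >= 1 << 31:
            word -= 1 << 32
        words.append(word)
    return words


def unpack_int32_to_int4(words):
    """Inverse of pack_int4_to_int32 under the same assumed nibble order."""
    values = []
    for word in words:
        word &= 0xFFFFFFFF
        for index in range(8):
            nibble = (word >> (4 * index)) & 0xF
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values
```

A row with an actual D of 64 INT4 values packs into 8 INT32 words, which matches the 64-aligned (INT4) and 8-aligned (INT32) requirements above.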
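The data layout of query, key, and value can be interpreted along several dimensions: B (Batch) is the batch size of the input samples, S (Seq-Length) is the sequence length of the input samples, H (Head-Size) is the hidden-layer size, N (Head-Num) is the number of heads, D (Head-Dim) is the smallest unit of the hidden layer and satisfies D=H/N, and T is the sum of the sequence lengths of the input samples over all batches.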
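
The relationship between these dimensions can be made concrete with a small sketch. The helpers `layout_shape` and `total_tokens` are hypothetical names used only for illustration; the sketch covers the BSH, BSND, and BNSD layouts and assumes every sample in the batch has the same S.

```python
def layout_shape(layout, b, s, n, d):
    """Return the tensor shape for a layout string, using H = N * D."""
    h = n * d
    shapes = {
        "BSH": (b, s, h),
        "BSND": (b, s, n, d),
        "BNSD": (b, n, s, d),
    }
    if layout not in shapes:
        raise ValueError(f"layout {layout!r} is not covered by this sketch")
    return shapes[layout]


def total_tokens(seq_lengths):
    """T is the sum of the per-batch sequence lengths."""
    return sum(seq_lengths)
```

With B=2, S=128, N=8, and D=64, the BSH shape is (2, 128, 512) because H = N * D = 512, and a batch with sequence lengths 3, 5, and 7 gives T = 15.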

Invocation

Invocation Method	Sample Code	Description
aclnn interface	test_aclnn_FusedInferAttentionScoreV4	Calls the PromptFlashAttentionV3 operator through aclnnFusedInferAttentionScoreV4