Triton Operator ATK Testing Guide

Environment Setup

1. Install the ATK tool

ATK download

# For example: atk-26.3.6-py3-none-any.whl
pip install atk-26.3.6-py3-none-any.whl

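2. Configure the PATH environment variable (conda environment)

export PATH="/home/miniconda3/envs/{conda environment name}/bin:$PATH"

3. Add ATK operator test cases under {install path}/MindSpeed-Ops/tests/atk_tests/triton, using the following layout: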

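├── triton # {install path}/MindSpeed-Ops/tests/atk_tests/triton
│   ├── {op_name} # one folder per operator, named after the operator
│   │   ├── {op_name}.yaml # test design for case generation (covers accuracy, performance, and other metrics)
│   │   ├── generate_{op_name}.py # parameter constraints for test case generation
│   │   └── triton_{op_name}.py # operator execution logic
│   ....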

For example, the add operator:
│   ├── add
│   │   ├── add.yaml
│   │   ├── generate_add.py
│   │   └── triton_add.py
│   ....

4. Example commands for running ATK test cases to evaluate accuracy and performance

# The add operator is used as the example below
cd {install path}/MindSpeed-Ops/tests/atk_tests/triton/add/

# Generate test cases
atk case -f add.yaml -p generate_add.py

# Run the accuracy evaluation
atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task accuracy -tup ./ -p triton_add.py

# Run the performance evaluation
atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task performance_device -tup ./ -p triton_add.py

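Test Case Generation

Writing the test design YAML

Write a test design YAML file based on the input parameters of the operator under test. For the full parameter reference, see: Case Generation

The parameters in the YAML file that relate to customization are:

Custom operators

generate: defaults to default, meaning no parameter constraints are needed. If constraints are needed, set it to the name registered with @GENERATOR_REGISTRY.register("generate_add") in the custom constraint script, for example generate_add.
api_type: defaults to function, which executes eval(api_name), where api_name is the name field in the YAML file. A custom Python script can be used instead.
triton_api_type: required. Identifies the custom Python script that calls the Triton kernel/API, for example triton_add, matching the name registered with @register("triton_add") in that script. Note: this is the logical identifier of the Triton interface, used to match the call.
triton_name: the concrete function to call. ATK splits the value on ".", so triton_add.add is split into module triton_add and function add; ATK then loads module triton_add and retrieves add from it.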
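
The module/function split described for triton_name can be illustrated with a minimal sketch. This is not ATK's actual code: the helper name resolve_dotted_name is hypothetical, and splitting at the last dot is an assumption (for triton_add.add the result is the same either way). A standard-library function is used so the sketch runs without ATK installed.

```python
import importlib


def resolve_dotted_name(dotted_name: str):
    """Split 'module.function' at the last dot, import the module, return the function."""
    module_name, _, attr_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# In this guide the value would be "triton_add.add"; "math.sqrt" is a stand-in
# that resolves the same way: module "math", function "sqrt".
sqrt = resolve_dotted_name("math.sqrt")
print(sqrt(16.0))  # 4.0
```

A value without a dot cannot be split into a module and a function, which is why the sketch rejects it explicitly.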

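Accuracy and performance metrics

The following two parameters set the accuracy and performance evaluation standards. You can set them yourself or use the defaults. For details, see: Result Analysis

acc: sets the accuracy evaluation standard. single_bm is ATK's built-in single-benchmark comparison with the high-precision option (see: Custom thresholds for a single benchmark). If only acc: single_bm is set, the default values are:

standard:
  acc:
    single_bm:
      # type selects the standard: high precision, high performance, or Vector fused operator; the default is high precision
      # For fp16 and bf16, the NPU benchmark operator is upcast to fp32; for fp32, it is not upcast and stays fp32
      # High precision: high_precision; high performance: high_performance; Vector fused operator: vector_fused
      type: high_precision          # default: high_precision
      fp16_error: 2 ** -11          # default float16 error threshold
      fp16_eb: 2 ** -10             # default float16 EB threshold
      bf16_error: 2 ** -8
      bf16_eb: 2 ** -7
      fp32_error: 2 ** -14
      fp32_eb: 2 ** -14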
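
For a feel of how tight these defaults are, the powers of two can be evaluated directly. The snippet below is only a convenience sketch: the dictionary DEFAULT_SINGLE_BM is a hypothetical name, and its values are copied from the defaults listed above, not read from ATK.

```python
# Default single_bm thresholds from the listing above, keyed by dtype.
DEFAULT_SINGLE_BM = {
    "fp16": {"error": 2 ** -11, "eb": 2 ** -10},
    "bf16": {"error": 2 ** -8, "eb": 2 ** -7},
    "fp32": {"error": 2 ** -14, "eb": 2 ** -14},
}

for dtype, limits in DEFAULT_SINGLE_BM.items():
    # Print each threshold in scientific notation for easier comparison.
    print(f"{dtype}: error <= {limits['error']:.3e}, eb <= {limits['eb']:.3e}")
```

The fp16 error threshold, for instance, is 2 ** -11 = 0.00048828125, while the bf16 thresholds are looser because bf16 carries fewer mantissa bits.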

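To use a double benchmark instead (see: Custom thresholds for a double benchmark), set acc: cv_fused_double_benchmark. If only that is set, the default values are:

## High-performance double benchmark & CV fused operators (cv_fused_double_benchmark)
standard:
  acc:
    cv_fused_double_benchmark:
      max_re_ratio: 10
      avg_re_ratio: 2
      root_mean_squared_ratio: 2

Performance standard

perf: sets the performance evaluation standard. It can be set to the default not_key; if no custom values are given, the thresholds default to [0.95, 0.98]. For details, see: Performance Standard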

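Other parameters

max_length: limits the maximum length of a tensor dimension in generated test cases. max_length: 196608 means that no dimension of a generated tensor exceeds 196608. This keeps test cases to a manageable size and avoids overly long dimensions that would slow testing down or exhaust resources.
dtype_numbers: the number of test cases generated per data type. With dtype_numbers: 70 and three dtypes [ fp32, fp16, bf16 ], ATK generates 70 × 3 = 210 test cases in total. The parameter controls how thoroughly each data type is covered, so that the operator's behavior is verified across all configured dtypes. Reducing the value reduces the number of test cases proportionally.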
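
The case-count arithmetic can be written out as a short sketch. The function name total_cases is hypothetical, and the formula simply restates the multiplication given above; it is not taken from ATK's source.

```python
def total_cases(dtype_numbers: int, dtypes: list[str]) -> int:
    """Number of generated cases: dtype_numbers cases for each configured dtype."""
    return dtype_numbers * len(dtypes)


dtypes = ["fp32", "fp16", "bf16"]
print(total_cases(70, dtypes))  # 210, the configuration used in this guide
print(total_cases(10, dtypes))  # 30, a smaller run for quick local checks
```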

For Triton interface parameters, see the official Triton documentation: https://triton-lang.org/main/index.html

The complete add.yaml for the add operator is as follows:

# add.yaml

api: pytorch
version: v2.1
name: torch.add
triton_name: triton_add.add
api_type: function
triton_api_type: triton_add
generate: generate_add
dtype_numbers: 70
standard:
  acc: single_bm
  perf: not_key
inputs:
  - name:
    type: tensor
    required: true
    dtypes:
      values: [ fp32, fp16, bf16 ]
    ranges:
      valid:
        values: [ [-10,10], [-100,100] ]
      invalid:
        values: [ [-10,10], [-100,100] ]
    shapes:
      dim_numbers:
        values: [ 1, 2, 3, 4, 5 ]
      dim_values:
        values: [ 1,2,3,4,5,7,8,11,15,16,17,22,23,25,27,255,256,257 ]
      max_length: 196608
  - name:
    type: tensor
    required: true
    dtypes:
      values: [ fp32, fp16, bf16 ]
    ranges:
      valid:
        values: [ [-10,10], [-100,100] ]
      invalid:
        values: [ [-10,10], [-100,100] ]
    shapes:
      dim_numbers:
        values: [ 1, 2, 3, 4, 5 ]
      dim_values:
        values: [ 1,2,3,4,5,7,8,11,15,16,17,22,23,25,27,255,256,257 ]
      max_length: 196608

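Writing custom parameter constraints

If the operator's input parameters constrain one another, write a parameter constraint script. For the method, see: Custom Rule Constraints

The complete generate_add.py for the add operator is as follows: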

# generate_add.py

import random
from atk.case_generator.generator.generate_types import GENERATOR_REGISTRY
from atk.case_generator.generator.base_generator import CaseGenerator
from atk.configs.case_config import CaseConfig


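@GENERATOR_REGISTRY.register("generate_add")  # generate_add is the registered generator name; it matches the generate parameter in the YAML
class ReduceGenerator(CaseGenerator):

    def after_case_config(self, case_config: CaseConfig) -> CaseConfig:
        '''
        Entry point for adjusting case parameters to satisfy constraints.
        :param case_config: the generated case, which may violate constraints between parameters and therefore be invalid
        :return: the adjusted case, which must satisfy the constraints between parameters so that the case is valid
        '''
        case_config.inputs[1].shape = case_config.inputs[0].shape  # make the second input's shape match the first so they can be added
        case_config.inputs[1].dtype = case_config.inputs[0].dtype  # make the second input's dtype match the first so they can be added

        return case_config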
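
To see what this constraint does without installing ATK, here is a minimal sketch that applies the same two assignments to stand-in objects. FakeInput and FakeCaseConfig are hypothetical classes invented for illustration; they only mimic the inputs, shape, and dtype attributes used above and are not ATK types.

```python
from dataclasses import dataclass, field


@dataclass
class FakeInput:
    # Hypothetical stand-in for one entry of case_config.inputs.
    shape: list
    dtype: str


@dataclass
class FakeCaseConfig:
    # Hypothetical stand-in for atk.configs.case_config.CaseConfig.
    inputs: list = field(default_factory=list)


def align_second_input(case_config: FakeCaseConfig) -> FakeCaseConfig:
    """Apply the same constraint as after_case_config: input 1 copies input 0."""
    case_config.inputs[1].shape = case_config.inputs[0].shape
    case_config.inputs[1].dtype = case_config.inputs[0].dtype
    return case_config


# A randomly generated case whose two inputs could not be added as-is.
raw = FakeCaseConfig(inputs=[FakeInput([4, 16], "fp16"), FakeInput([7, 3, 2], "bf16")])
fixed = align_second_input(raw)
print(fixed.inputs[1])  # FakeInput(shape=[4, 16], dtype='fp16')
```

Because the first input is left untouched, the generated variety of shapes and dtypes is preserved; only the second input is overwritten to make the pair valid for add.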

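Running ATK to generate test cases

Run the following command to generate generalized test cases for the operator. XXX.yaml is the test design YAML file (required) and XXX.py is the parameter constraint script (optional).

atk case -f XXX.yaml -p XXX.py

# Example: atk case -f add.yaml -p generate_add.py

The output looks like this: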

[2026-04-11 17:41:14] [INFO] [MainProcess] [223249] [reports.py:239]  save case json file: result/add/json/all_add.json
[2026-04-11 17:41:14] [INFO] [MainProcess] [223249] [reports.py:268]  save case excel file: result/add/excel/add20260411174114.xlsx
[2026-04-11 17:41:14] [INFO] [MainProcess] [223249] [reports.py:305]  save case csv file: result/add/csv/add20260411174114.csv

The file result/add/json/all_add.json is used later as the input for the test runs.

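Custom Benchmark Adaptation

If torch has no corresponding benchmark interface, see: Custom API Implementation
If torch has a corresponding benchmark interface, call it by setting name: <interface name> and api_type: function in the YAML file, for example: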

# add.yaml
...
name: torch.add
api_type: function
...

The benchmark code in function.py is shown below:

# function.py
import torch

from atk.configs.dataset_config import InputDataset
from atk.tasks.api_execute import register
from atk.tasks.api_execute.base_api import BaseApi


@register("function")    # function matches api_type in the YAML file
class FunctionApi(BaseApi):
    def __call__(self, input_data: InputDataset, with_output: bool = False):
        if not with_output:
            eval(self.api_name)(*input_data.args, **input_data.kwargs)   # self.api_name matches name in the YAML file
            return
        output = eval(self.api_name)(*input_data.args, **input_data.kwargs)
        return output

To implement a custom benchmark for an add + matmul operator, for example:

# addmm.py
import torch

from atk.configs.dataset_config import InputDataset
from atk.tasks.api_execute import register
from atk.tasks.api_execute.base_api import BaseApi


@register("addmm")    # matches api_type in the YAML file
class AddmmApi(BaseApi):
    def __call__(self, input_data: InputDataset, with_output: bool = False):
        # CPU: compose small operators as the benchmark
        if self.device == "cpu":
            a = input_data.args[0]
            b = input_data.args[1]
            tmp = torch.matmul(a, b)
            c = input_data.args[2]
            output = torch.add(tmp, c)

        # NPU: use the fused operator as the benchmark
        elif self.device == "npu":
            import torch_npu
            a = input_data.args[0]
            b = input_data.args[1]
            c = input_data.args[2]
            output = torch_npu.addmm(a, b, c)
        return output

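Custom API Adaptation

At present, calling a Triton operator interface requires implementing the call in a custom API. For reference implementations of Triton operator interfaces, see the UT test scripts in the triton_ascend repository: https://gitcode.com/Ascend/triton-ascend

Write the ATK custom API file: implement the Triton operator call in the __call__ function.

The following triton_add.py is a sample adaptation for the triton_add operator: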

# triton_add.py

from atk.configs.dataset_config import InputDataset
from atk.tasks.api_execute import register
from atk.tasks.api_execute.triton_base_api import TritonBaseApi
from mindspeed_ops.api.triton.add import add


@register("triton_add")  # triton_add matches triton_api_type in the YAML file
class TritonFunctionApi(TritonBaseApi):  # custom class that inherits from the base class TritonBaseApi
    def __call__(self, input_data: InputDataset, with_output: bool = False):
        # Read the input arguments from input_data and build the output
        x = input_data.args[0]
        y = input_data.args[1]

        # Call the add operator from mindspeed_ops/triton directly (it invokes the triton_add kernel internally)
        output = add(x, y)

        return output

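Accuracy Testing

For accuracy testing, set the first node to --backend triton and the second node to --backend npu or --backend cpu, depending on the chosen benchmark, and select --task accuracy.

Key parameter: -tup (optional) is the directory containing the file that defines the Triton kernel function under test, triton_add. For example, if the triton_add interface under test is in test_general_add.py and that file is in the test_kernel directory, pass -tup ./test_kernel.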

├── test_kernel
│   ├── test_general_add.py # contains the triton_add operator under test
│   ├── test_abs.py         # contains the triton_abs operator under test
│   .....

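Some ready-made Triton kernel files can be downloaded from the triton-ascend repository: generalization_cases

For the other execution parameters, see: Task Execution - Parameter Reference

Example commands:

# CPU as the benchmark
atk node --backend triton --devices 0 node --backend cpu task -c all_torch.add.json --task accuracy -tup ./test_kernel -p triton_add.py

# NPU as the benchmark
atk node --backend triton --devices 0 node --backend npu --devices 0 task -c all_torch.add.json --task accuracy -tup ./test_kernel -p triton_add.py

To compare against a GPU, install the ATK tool on the GPU node and start the ATK service (see: Multi-machine Execution), then run the following test command:

# GPU as the benchmark
atk node --backend triton --devices 0 node --backend gpu -h <GPU host IP> -p <port> --devices 0  -c all_torch.add.json --task accuracy -tup ./test_kernel -p triton_add.py

For example, in the MindSpeed-Ops/tests/atk_tests/triton/add folder, the accuracy command is:

atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task accuracy -tup ./ -p triton_add.py

After all accuracy test cases have run, save the generated Excel report, located at ./atk_output/all_<operator name>_<timestamp>/report/all_<operator name>_reports_<timestamp>.xlsx, to the ATK_CIDA repository.

If any test case fails, check the accuracy details in the spreadsheet to find out why. You can also add --save_data input to save inputs and --save_data output to save outputs, then load the input tensors of the failing case to reproduce the problem locally. After the run, results are saved under atk_output/all_<operator name>_<timestamp> in the directory where the command was executed, in the following form: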

├── input
│   └── all_add
│       ├── 0 # input tensors of test case 0
│       ...
├── output
│   ├── triton_0
│   │   └── all_add
│   │       ├── 0 # Triton output tensors of test case 0
│   │       ...
│   └── npu_0
│       └── all_add
│           ├── 0 # NPU benchmark output tensors of test case 0
│           ...
├── report
│   └── all_<operator name>_reports_<timestamp>.xlsx # operator report
├── mss
└── log
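
When reproducing a failing case, the first step is usually locating its saved directories. The sketch below builds those paths from the layout shown above. The helper name case_dirs and the run directory name are hypothetical, and the guide does not state the on-disk format of the tensors, so the sketch stops at locating the directories.

```python
from pathlib import Path
import tempfile


def case_dirs(run_dir: Path, case_file: str, case_id: int, backends: list[str]) -> dict[str, Path]:
    """Return the input directory and one output directory per backend for a case."""
    dirs = {"input": run_dir / "input" / case_file / str(case_id)}
    for backend in backends:
        dirs[backend] = run_dir / "output" / backend / case_file / str(case_id)
    return dirs


# Build a throwaway copy of the layout so the sketch runs without ATK.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "all_add_20260411174114"  # hypothetical run directory name
    wanted = case_dirs(run_dir, "all_add", 0, ["triton_0", "npu_0"])
    for path in wanted.values():
        path.mkdir(parents=True)
    for name, path in wanted.items():
        # Show each directory relative to the run directory and confirm it exists.
        print(name, path.relative_to(run_dir), path.is_dir())
```

In a real run the directories are produced by ATK when --save_data input and --save_data output are passed; only the path construction carries over from this sketch.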

The output looks like this:

+-------+-------------+-------------+------------------+--------------+-----------------------+-----------+--------------+
| Name  | Total cases | Executed OK | Execution failed | Passed cases | Error message matched | Pass rate | Accuracy met |
+-------+-------------+-------------+------------------+--------------+-----------------------+-----------+--------------+
| npu_0 |     210     |     210     |        0         |     210      |           0           |   100.0   |     Pass     |
+-------+-------------+-------------+------------------+--------------+-----------------------+-----------+--------------+

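Performance Testing

Comparing performance directly against the benchmark

For performance testing, the node parameters are the same as for accuracy testing, and the task is --task performance_device. To save the collected profiling data, add --save_data profile. For the other execution parameters, see: Task Execution - Parameter Reference

Example command:

atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task performance_device -tup ./ -p triton_add.py

For a single test case, the performance metric is the inverse ratio of the device_perf times (in us), that is, the benchmark time divided by the Triton time. For example, 1.197000 (NPU standard interface time) / 1.145000 (Triton interface time) = 1.0454 > 0.98 (the per-case threshold defined by not_key), so the case meets the performance requirement and passes.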
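
The per-case check can be expressed as a short sketch. The function names are hypothetical, and the strict > comparison follows the worked example in the text; whether ATK uses > or >= internally is an assumption.

```python
def device_perf_ratio(benchmark_us: float, triton_us: float) -> float:
    """Inverse time ratio: above 1.0 means the Triton kernel is faster than the benchmark."""
    return benchmark_us / triton_us


def case_passes(benchmark_us: float, triton_us: float, threshold: float = 0.98) -> bool:
    """A case passes when its ratio exceeds the per-case threshold (0.98 for not_key)."""
    return device_perf_ratio(benchmark_us, triton_us) > threshold


# Numbers from the example above: NPU benchmark 1.197 us, Triton 1.145 us.
print(round(device_perf_ratio(1.197, 1.145), 4))  # 1.0454
print(case_passes(1.197, 1.145))                  # True
```

Because the ratio divides the benchmark time by the Triton time, a Triton kernel that is up to about 2% slower than the benchmark still clears the 0.98 threshold.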

The output looks like this:

+-------+-------------+-----------------------+-----------------------+-----------------+
| Name  | Total cases | Device perf pass rate | Avg device perf ratio | Device perf met |
+-------+-------------+-----------------------+-----------------------+-----------------+
| npu_0 |     210     |        98.5714        |        1.0746         |      Pass       |
+-------+-------------+-----------------------+-----------------------+-----------------+

In this result, the average device performance ratio of 1.0746 is greater than 0.95 (the operator-level threshold defined by not_key), so the Triton operator's performance meets the requirement.
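
The summary row can be reproduced in outline with the sketch below. It assumes the pass rate is the share of cases whose ratio exceeds 0.98 and that the overall verdict depends on the average ratio exceeding 0.95, as described above. The function name and the list of ratios are made up for illustration; a pass rate of 98.5714 over 210 cases is consistent with 207 passing cases, which is an inference and not stated in the report.

```python
from statistics import mean


def summarize(ratios: list[float], case_threshold: float = 0.98, avg_threshold: float = 0.95) -> dict:
    """Aggregate per-case performance ratios into a pass rate, average, and verdict."""
    passed = sum(1 for r in ratios if r > case_threshold)
    avg_ratio = mean(ratios)
    return {
        "pass_rate": round(100.0 * passed / len(ratios), 4),
        "avg_ratio": round(avg_ratio, 4),
        "meets_standard": avg_ratio > avg_threshold,
    }


# Made-up ratios: 207 cases above the per-case threshold, 3 below it.
ratios = [1.08] * 207 + [0.90] * 3
summary = summarize(ratios)
print(summary["pass_rate"])       # 98.5714
print(summary["meets_standard"])  # True
```

The example shows how an operator can still meet the standard overall while a few individual cases fall below the per-case threshold.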

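Comparing against imported baseline performance

To check for performance regressions against another CANN version or previously saved baseline performance data, proceed as follows:

First, save the baseline performance data with the following command (if the ATK_CIDA repository already contains baseline performance data, skip this step and go straight to step 2):

atk node --backend triton --devices 0 node --backend cpu --is_compare false  task -c ./result/torch.add/json/all_torch.add.json --task performance_device -tup ./test_kernel -e 10 -p triton_add.py

This produces the baseline performance Excel file all_torch.add_reports_<timestamp>.xlsx; save it to the ATK_CIDA repository. The output is: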

+----------+-----------------+--------------------+
|   id_0   | device_perf(us) | perf_device_result |
+----------+-----------------+--------------------+
| triton_0 |     1.044000    |         -          |
+----------+-----------------+--------------------+

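Import the baseline performance Excel file obtained in step 1 (if the ATK_CIDA repository already contains baseline performance data, download it from the repository) and run the operator performance test.

atk node --backend triton --devices 0 node --backend cpu --is_compare False node --backend triton --devices 0 --bm_file <baseline performance Excel file path> task -c <case JSON path> --task performance_device -tup <Triton UT test case directory> -e 10

# Example:
atk node --backend triton --devices 0 node --backend cpu --is_compare False node --backend triton --devices 0 --bm_file all_torch.add_reports_<timestamp>.xlsx task -c ./result/torch.add/json/all_torch.add.json --task performance_device -tup ./test_kernel -e 10 -p triton_add.py

The key execution parameter is:

Parameter	Sub-parameter	Description
node	--bm_file	Path to the Excel file with benchmark/baseline performance data, used for comparison against existing performance data

The output is: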

+----------+-------------+-----------------------+-----------------------+-----------------+
|   Name   | Total cases | Device perf pass rate | Avg device perf ratio | Device perf met |
+----------+-------------+-----------------------+-----------------------+-----------------+
| triton_1 |     10      |         90.0          |        1.0171         |      Pass       |
+----------+-------------+-----------------------+-----------------------+-----------------+

References

ATK User Guide