
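ATK Testing Guide for Triton Operators

Environment Setup

1. Install the ATK tool

ATK download

# For example, atk-26.3.6-py3-none-any.whl
pip install atk-26.3.6-py3-none-any.whl

2. Configure the PATH environment variable (conda environment)

export PATH="/home/miniconda3/envs/{conda environment name}/bin:$PATH"

3. Add ATK operator test cases under {install path}/MindSpeed-Ops/tests/atk_tests/triton, using the following layout: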

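├── triton # {install path}/MindSpeed-Ops/tests/atk_tests/triton
│   ├── {op_name} # directory named after the operator
│   │   ├── {op_name}.yaml # test design file for case generation (covers accuracy, performance, and other metrics)
│   │   ├── generate_{op_name}.py # parameter constraints for test case generation
│   │   └── triton_{op_name}.py # operator execution logic
│   ....

For example, for the add operator:
│   ├── add
│   │   ├── add.yaml
│   │   ├── generate_add.py
│   │   └── triton_add.py
│   ....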

4. Example commands for running ATK test cases to evaluate accuracy and performance

# The add operator is used as the example below
cd {install path}/MindSpeed-Ops/tests/atk_tests/triton/add/

# Generate the test cases
atk case -f add.yaml -p generate_add.py

# Run the accuracy evaluation
atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task accuracy -tup ./ -p triton_add.py

# Run the performance evaluation
atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task performance_device -tup ./ -p triton_add.py
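
The commands in step 4 follow one naming pattern, so they can be generated per operator. The sketch below only formats the command lines documented in this guide; atk_commands is a hypothetical helper, not part of ATK, and it assumes the file layout from step 3.

```python
import shlex


def atk_commands(op_name: str) -> dict[str, str]:
    """Build the case-generation, accuracy, and performance commands for one operator.

    Hypothetical helper: it only formats the commands documented in this guide.
    """
    case_json = f"result/{op_name}/json/all_{op_name}.json"
    nodes = "atk node --backend triton --devices 0 node --backend npu --devices 0"
    script = f"triton_{op_name}.py"
    return {
        "case": f"atk case -f {op_name}.yaml -p generate_{op_name}.py",
        "accuracy": f"{nodes} task -c {case_json} --task accuracy -tup ./ -p {script}",
        "performance": f"{nodes} task -c {case_json} --task performance_device -tup ./ -p {script}",
    }


commands = atk_commands("add")
for name, command in commands.items():
    # shlex.split yields the argv list that subprocess.run would accept
    print(name, shlex.split(command))
```

Run the printed commands from the operator's own directory, as in the cd step above, because the paths are relative.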

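Test Case Generation

Writing the test design yaml

Write a test design yaml file based on the input parameters of the operator under test. For a detailed description of the parameters, see: Case generation

The parameters in the yaml that relate to customization are:

Custom operators

  • generate: defaults to default, meaning no parameter constraints are needed. If constraints are needed, set it to the name registered with @GENERATOR_REGISTRY.register("generate_add") in the custom constraint script, for example generate_add.
  • api_type: defaults to function, which executes eval(api_name), where api_name is the name field in the yaml file. A custom Python script can also be used.
  • triton_api_type: required. Identifies the custom Python script that calls the triton kernel/api, for example triton_add, matching the name registered with @register("triton_add") in that script. Note: this is the logical identifier of the triton interface, used to match the call.
  • triton_name: the concrete function to call. ATK splits the value on "."; for example, triton_add.add is split into module triton_add and interface add, and ATK loads module triton_add to obtain add.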
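
The module/interface split described for triton_name can be sketched with the standard library. This is an illustration of the idea, not ATK's actual code: resolve_dotted_name is a hypothetical name, and splitting on the last dot is an assumption.

```python
import importlib


def resolve_dotted_name(dotted_name: str):
    """Split 'module.attr' on the last dot, import the module, return the attribute."""
    module_name, _, attr_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {dotted_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# math.sqrt stands in for triton_add.add so the sketch runs without ATK or triton
sqrt = resolve_dotted_name("math.sqrt")
print(sqrt(9.0))
```

For triton_add.add to resolve this way, the triton_add module has to be importable, which is presumably why the -tup directory is passed on the command line.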

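Accuracy and performance metrics

The following two parameters set the accuracy and performance evaluation standards. You can set them yourself or use the defaults. For detailed settings, see: Result analysis

  • acc: configures the accuracy evaluation standard. single_bm is ATK's built-in single-benchmark comparison with the high-precision option (see: Custom thresholds for a single benchmark). If only acc: single_bm is set, the defaults are:
    standard:
      acc:
        single_bm:
          # type selects the standard: high precision, high performance, or Vector fused operator; defaults to high precision
          # For fp16 and bf16, the NPU benchmark operator is upcast to fp32; for fp32, it is not upcast and stays fp32
          # High precision: high_precision; high performance: high_performance; Vector fused operator: vector_fused
          type: high_precision          # defaults to high_precision
          fp16_error: 2 ** -11          # default float16 error threshold
          fp16_eb: 2 ** -10             # default float16 EB threshold
          bf16_error: 2 ** -8
          bf16_eb: 2 ** -7
          fp32_error: 2 ** -14
          fp32_eb: 2 ** -14

    To use a dual benchmark instead (see: Custom thresholds for a dual benchmark), set acc: cv_fused_double_benchmark. If only that is set, the defaults are:
    ## High-performance dual benchmark & CV fused operators (cv_fused_double_benchmark)
    standard:
      acc:
        cv_fused_double_benchmark:
          max_re_ratio: 10
          avg_re_ratio: 2
          root_mean_squared_ratio: 2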
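
For a sense of scale, the default single_bm thresholds listed above can be evaluated as plain numbers. The sketch assumes the yaml values are read as Python power expressions; how ATK parses and applies them is not shown in this guide.

```python
# Default single_bm thresholds as listed above, evaluated to floats
single_bm_defaults = {
    "fp16_error": 2 ** -11,
    "fp16_eb": 2 ** -10,
    "bf16_error": 2 ** -8,
    "bf16_eb": 2 ** -7,
    "fp32_error": 2 ** -14,
    "fp32_eb": 2 ** -14,
}

for name, value in single_bm_defaults.items():
    print(f"{name}: {value:.10f}")
```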
    
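Performance standard
  • perf: configures the performance evaluation standard. It can be set to the default not_key; without custom parameters, not_key uses [0.95, 0.98]. For detailed settings, see: Performance standard

Other parameters

  • max_length: limits the maximum length of a tensor dimension during test case generation. max_length: 196608 means that no dimension of any generated tensor exceeds 196608. This keeps test cases at a manageable size and avoids overly long dimensions that would slow testing or exhaust resources.
  • dtype_numbers: the number of test cases generated per data type, 70 in this example. Multiplied by the 3 configured dtypes [ fp32, fp16, bf16 ], this gives 210 test cases in total. The parameter controls how thoroughly each data type is covered, so that the operator's behavior is validated across data types. Reducing it reduces the number of test cases proportionally.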
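
The case count described for dtype_numbers is a simple product. A minimal sketch of that arithmetic, using the values from the add example:

```python
dtype_numbers = 70                  # cases per data type, from add.yaml
dtypes = ["fp32", "fp16", "bf16"]   # the dtypes configured for each input

total_cases = dtype_numbers * len(dtypes)
print(total_cases)  # 210

# Halving dtype_numbers halves the total
print((dtype_numbers // 2) * len(dtypes))  # 105
```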

For the parameters of triton interfaces, see the official triton documentation: https://triton-lang.org/main/index.html

Taking the add operator as an example, the complete add.yaml is:

# add.yaml

api: pytorch
version: v2.1
name: torch.add
triton_name: triton_add.add
api_type: function
triton_api_type: triton_add
generate: generate_add
dtype_numbers: 70
standard:
  acc: single_bm
  perf: not_key
inputs:
  - name:
    type: tensor
    required: true
    dtypes:
      values: [ fp32, fp16, bf16 ]
    ranges:
      valid:
        values: [ [-10,10], [-100,100] ]
      invalid:
        values: [ [-10,10], [-100,100] ]
    shapes:
      dim_numbers:
        values: [ 1, 2, 3, 4, 5 ]
      dim_values:
        values: [ 1,2,3,4,5,7,8,11,15,16,17,22,23,25,27,255,256,257 ]
      max_length: 196608
  - name:
    type: tensor
    required: true
    dtypes:
      values: [ fp32, fp16, bf16 ]
    ranges:
      valid:
        values: [ [-10,10], [-100,100] ]
      invalid:
        values: [ [-10,10], [-100,100] ]
    shapes:
      dim_numbers:
        values: [ 1, 2, 3, 4, 5 ]
      dim_values:
        values: [ 1,2,3,4,5,7,8,11,15,16,17,22,23,25,27,255,256,257 ]
      max_length: 196608

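Writing custom parameter constraints

If the operator's input parameters constrain one another, write a parameter constraint script. For details, see: Custom rule constraints

Taking the add operator as an example, the complete generate_add.py is:

# generate_add.py

from atk.case_generator.generator.generate_types import GENERATOR_REGISTRY
from atk.case_generator.generator.base_generator import CaseGenerator
from atk.configs.case_config import CaseConfig


@GENERATOR_REGISTRY.register("generate_add")  # generate_add is the registered generator name, matching the generate parameter in the yaml
class ReduceGenerator(CaseGenerator):

    def after_case_config(self, case_config: CaseConfig) -> CaseConfig:
        '''
        Entry point for adjusting case parameters to satisfy constraints.
        :param case_config: the generated case, which may violate inter-parameter constraints and therefore be invalid
        :return: the modified case satisfying the constraints; the implementation must ensure the case is valid
        '''
        case_config.inputs[1].shape = case_config.inputs[0].shape  # make the second input's shape match the first so they can be added
        case_config.inputs[1].dtype = case_config.inputs[0].dtype  # make the second input's dtype match the first so they can be added

        return case_config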
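
The effect of this constraint can be tried without ATK installed. In the sketch below, SimpleNamespace objects stand in for CaseConfig and its inputs; the real classes may carry more fields and different value types, so treat the shapes and dtype strings as assumptions.

```python
from types import SimpleNamespace


def align_second_input(case_config):
    """Apply the same adjustment as after_case_config to any object with .inputs."""
    case_config.inputs[1].shape = case_config.inputs[0].shape
    case_config.inputs[1].dtype = case_config.inputs[0].dtype
    return case_config


# Stand-in for a generated case whose two inputs disagree (not a real CaseConfig)
case = SimpleNamespace(inputs=[
    SimpleNamespace(shape=[4, 256], dtype="fp16"),
    SimpleNamespace(shape=[7], dtype="fp32"),
])

fixed = align_second_input(case)
print(fixed.inputs[1].shape, fixed.inputs[1].dtype)  # [4, 256] fp16
```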

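Generating test cases with ATK

Run the following command to generate generalized test cases for the operator, where XXX.yaml is the test design yaml file (required) and XXX.py is the parameter constraint script (optional).

atk case -f XXX.yaml -p XXX.py

# Example: atk case -f add.yaml -p generate_add.py

The output looks like this: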

[2026-04-11 17:41:14] [INFO] [MainProcess] [223249] [reports.py:239]  save case json file: result/add/json/all_add.json
[2026-04-11 17:41:14] [INFO] [MainProcess] [223249] [reports.py:268]  save case excel file: result/add/excel/add20260411174114.xlsx
[2026-04-11 17:41:14] [INFO] [MainProcess] [223249] [reports.py:305]  save case csv file: result/add/csv/add20260411174114.csv

The file result/add/json/all_add.json is used later as the test case input.

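Custom Benchmark Adaptation

  • If torch has no corresponding benchmark interface, see: Custom API implementation

  • If torch has a corresponding benchmark interface, call it by setting name: {interface name} and api_type: function in the yaml file, for example: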

# add.yaml
...
name: torch.add
api_type: function
...

The benchmark code in function.py is:

# function.py
import torch

from atk.configs.dataset_config import InputDataset
from atk.tasks.api_execute import register
from atk.tasks.api_execute.base_api import BaseApi


@register("function")    # function matches api_type in the yaml file
class FunctionApi(BaseApi):
    def __call__(self, input_data: InputDataset, with_output: bool = False):
        if not with_output:
            eval(self.api_name)(*input_data.args, **input_data.kwargs)   # self.api_name matches name in the yaml file
            return
        output = eval(self.api_name)(*input_data.args, **input_data.kwargs)
        return output

To implement a custom benchmark for an add + matmul operator, for example:

# addmm.py
import torch

from atk.configs.dataset_config import InputDataset
from atk.tasks.api_execute import register
from atk.tasks.api_execute.base_api import BaseApi


@register("addmm")    # matches api_type in the yaml file
class AddmmApi(BaseApi):
    def __call__(self, input_data: InputDataset, with_output: bool = False):
        # On CPU, compose small operators as the benchmark
        if self.device == "cpu":
            a = input_data.args[0]
            b = input_data.args[1]
            tmp = torch.matmul(a, b)
            c = input_data.args[2]
            output = torch.add(tmp, c)

        # On NPU, use the fused operator as the benchmark
        elif self.device == "npu":
            import torch_npu
            a = input_data.args[0]
            b = input_data.args[1]
            c = input_data.args[2]
            output = torch_npu.addmm(a, b, c)

        # Any other device would leave output undefined, so fail explicitly
        else:
            raise ValueError(f"unsupported device: {self.device}")
        return output

Custom API Adaptation

Currently, calling a Triton operator interface requires implementing the call in a custom API. For Triton operator interface implementations, refer to the UT scripts in triton_ascend (https://gitcode.com/Ascend/triton-ascend).

  • Write the ATK custom API file: implement the Triton operator call in the __call__ function.

The following triton_add.py is a sample that adapts the triton_add operator:

# triton_add.py

from atk.configs.dataset_config import InputDataset
from atk.tasks.api_execute import register
from atk.tasks.api_execute.triton_base_api import TritonBaseApi
from mindspeed_ops.api.triton.add import add


@register("triton_add")  # triton_add matches triton_api_type in the yaml file
class TritonFunctionApi(TritonBaseApi):  # inherits from the base class TritonBaseApi to implement a custom class
    def __call__(self, input_data: InputDataset, with_output: bool = False):
        # Read the input arguments from input_data and build the output
        x = input_data.args[0]
        y = input_data.args[1]

        # Call the add operator under mindspeed_ops/triton directly (it calls the triton_add kernel internally)
        output = add(x, y)

        return output

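Accuracy Testing

When running accuracy tests, set the first node to --backend triton and the second node to --backend npu or --backend cpu depending on the chosen benchmark, and select --task accuracy.

Key parameter: -tup, optional. It is the directory containing the file that defines the triton kernel function under test, triton_add. For example, if the triton_add interface under test is in test_general_add.py and that file is in the directory test_kernel, pass -tup ./test_kernel

├── test_kernel
│   ├── test_general_add.py # contains the triton_add operator under test
│   ├── test_abs.py         # contains the triton_abs operator under test
│   .....

Some ready-made triton kernel files can be downloaded from the triton-ascend repository: generalization_cases

For the other execution parameters, see: Task execution - parameter description

Example commands:

# CPU as the benchmark
atk node --backend triton --devices 0 node --backend cpu task -c all_torch.add.json --task accuracy -tup ./test_kernel -p triton_add.py

# NPU as the benchmark
atk node --backend triton --devices 0 node --backend npu --devices 0 task -c all_torch.add.json --task accuracy -tup ./test_kernel -p triton_add.py

To compare against a GPU, install ATK on the GPU node and start the ATK service (see: Multi-machine execution), then run the following test command:

# GPU as the benchmark
atk node --backend triton --devices 0 node --backend gpu -h {GPU host IP} -p {port} --devices 0 task -c all_torch.add.json --task accuracy -tup ./test_kernel -p triton_add.py

For example, in the MindSpeed-Ops/tests/atk_tests/triton/add directory, the accuracy command is:

atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task accuracy -tup ./ -p triton_add.py

After all accuracy test cases have run, save the generated excel report, located at ./atk_output/all_{operator name}_{timestamp}/report/all_{operator name}_reports_{timestamp}.xlsx, to the ATK_CIDA repository.

If any test cases fail, check the accuracy details in the report to see why. You can also add --save_data input to save inputs and --save_data output to save outputs to the atk command, then load the input tensors of the failing cases to reproduce the problem locally. After the run, results are saved under atk_output/all_{operator name}_{timestamp} in the directory where the command was run, laid out as follows: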

├── input
│   └── all_add
│       ├── 0 # input tensors of test case 0
│       ...
├── output
│   ├── triton_0
│   │   └── all_add
│   │       ├── 0 # triton output tensors of test case 0
│   │       ...
│   └── npu_0
│       └── all_add
│           ├── 0 # NPU benchmark output tensors of test case 0
│           ...
├── report
│   └── all_{operator name}_reports_{timestamp}.xlsx # operator report
├── mss
└── log

The result looks like this:

+-------+-------------+-----------------------+-------------------+--------+-----------------------+-----------+--------------+
| Name  | Total cases | Executed successfully | Failed to execute | Passed | Error message matched | Pass rate | Accuracy met |
+-------+-------------+-----------------------+-------------------+--------+-----------------------+-----------+--------------+
| npu_0 |     210     |          210          |         0         |  210   |           0           |   100.0   |     Pass     |
+-------+-------------+-----------------------+-------------------+--------+-----------------------+-----------+--------------+

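Performance Testing

Comparing performance directly against the benchmark

When running performance tests, the node parameters are the same as for accuracy tests; select --task performance_device. To save the collected profiling data, add --save_data profile. For the other execution parameters, see: Task execution - parameter description

Example command:

atk node --backend triton --devices 0 node --backend npu --devices 0 task -c result/add/json/all_add.json --task performance_device -tup ./ -p triton_add.py

For each test case, the performance metric is the inverse ratio of the device_perf times (in us), and it decides whether the case passes. For example, 1.197000 (NPU benchmark interface time) / 1.145000 (triton interface time) = 1.0454 > 0.98 (the per-case threshold defined by not_key), so this case meets the performance requirement and passes.

The result looks like this:

+-------+-------------+-----------------------+-----------------------+-----------------+
| Name  | Total cases | Device perf pass rate | Avg device perf ratio | Device perf met |
+-------+-------------+-----------------------+-----------------------+-----------------+
| npu_0 |     210     |        98.5714        |        1.0746         |      Pass       |
+-------+-------------+-----------------------+-----------------------+-----------------+

In the result above, the average device performance ratio is 1.0746, which is greater than 0.95 (the operator-level threshold defined by not_key), so the triton operator's performance meets the requirement.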
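
The two not_key checks can be worked through in a few lines. This is a sketch under stated assumptions, not ATK's implementation: the function names are hypothetical, the average is taken as a plain arithmetic mean, comparisons are strict, and the second timing pair is made up for illustration.

```python
def perf_ratio(benchmark_us: float, triton_us: float) -> float:
    """Benchmark time divided by triton time; above 1.0 means triton is faster."""
    return benchmark_us / triton_us


def evaluate(cases, case_threshold=0.98, average_threshold=0.95):
    """Summarize (benchmark_us, triton_us) pairs against the not_key thresholds."""
    ratios = [perf_ratio(bench, triton) for bench, triton in cases]
    passed = sum(ratio > case_threshold for ratio in ratios)
    average = sum(ratios) / len(ratios)
    return {
        "pass_rate": 100.0 * passed / len(ratios),
        "average_ratio": average,
        "meets_standard": average > average_threshold,
    }


# The worked example from the text
ratio = perf_ratio(1.197, 1.145)
print(round(ratio, 4))  # 1.0454

# The second pair is hypothetical: it fails the per-case check (ratio below 0.98)
summary = evaluate([(1.197, 1.145), (2.0, 2.1)])
print(summary)
```

In this sketch one case fails individually while the operator still meets the standard, which matches the table above, where the pass rate is below 100 but the result is Pass.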

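Comparing against imported baseline performance

To compare against another CANN version or against previously saved baseline performance data, and check whether the operator's performance has regressed, proceed as follows:

  1. First save the baseline performance data with the following command (if the ATK_CIDA repository already has baseline performance data, skip this step and go to step 2):
atk node --backend triton --devices 0 node --backend cpu --is_compare false  task -c ./result/torch.add/json/all_torch.add.json --task performance_device -tup ./test_kernel -e 10 -p triton_add.py

This produces the baseline performance excel file all_torch.add_reports_{timestamp}.xlsx; save it to the ATK_CIDA repository. The result looks like this: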

+----------+-----------------+--------------------+
|   id_0   | device_perf(us) | perf_device_result |
+----------+-----------------+--------------------+
| triton_0 |     1.044000    |         -          |
+----------+-----------------+--------------------+
  2. Import the baseline performance excel file from step 1 (if the ATK_CIDA repository already has baseline performance data, download it from there) and run the operator performance test.
atk node --backend triton --devices 0 node --backend cpu --is_compare False node --backend triton --devices 0 --bm_file {baseline performance excel file path} task -c {case json path} --task performance_device -tup {triton UT test case directory} -e 10

# Example:
atk node --backend triton --devices 0 node --backend cpu --is_compare False node --backend triton --devices 0 --bm_file all_torch.add_reports_{timestamp}.xlsx task -c ./result/torch.add/json/all_torch.add.json --task performance_device -tup ./test_kernel -e 10 -p triton_add.py

Key execution parameters:

| Parameter | Sub-parameter | Description |
|-----------|---------------|-------------|
| node | --bm_file | Path to the excel file with benchmark/baseline performance data, used for comparison against existing performance data |

The result looks like this:

+----------+-------------+-----------------------+-----------------------+-----------------+
|   Name   | Total cases | Device perf pass rate | Avg device perf ratio | Device perf met |
+----------+-------------+-----------------------+-----------------------+-----------------+
| triton_1 |     10      |         90.0          |        1.0171         |      Pass       |
+----------+-------------+-----------------------+-----------------------+-----------------+

References

ATK User Guide