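MindSpeed-LLM Test Case Contribution Guide
All test cases support only the Megatron-Mcore model structure.
CI Gate Coverage List
CI gate test cases guard the repository's key models and basic features and cover smoke-test scenarios. Every PR must pass the full set of CI gate test cases before it is merged.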
| Tests | Module | Features | Scripts | Acc. | Throu. | Mem. |
|---|---|---|---|---|---|---|
| ST | Pretrain | TP,PP,VPP,distributed_optimizer,o2_gradient,o2_optimizer,recomputation,enable_recompute_layers_per_pp_rank,FA_TND,use_fused_rotary_pos_emb | llama2_tp2_pp4_vpp2_ptd.sh | Y | Y | Y |
| | | swap_attention,recompute_activation_function,enable_recompute_layers_per_pp_rank,reuse_fp32_param | llama2_tp2_pp4_vpp2_swap.sh | Y | Y | Y |
| | | cp_ring,general_cp,double_ring,distributed optimizer,reuse_fp32_param,recompute_activation_function,fused_rmsnorm,fused_swiglu,fused_rope,overlap_grad_reduce, overlap_param_gather | llama2_tp2_cp4_general_double_ring.sh | Y | Y | Y |
| | | noop_layers, recompute_norm | llama3_mcore_tp2_pp2_vpp2_noop_layer.sh | Y | Y | Y |
| | | cp_hybrid,gqa | chatglm3_gqa_cp4.sh | Y | Y | Y |
| | | mla_attention,moe_grouped_gemm,EP,allgather_dispatcher,moe_allgather_overlap_comm,use_fused_rotary_pos_emb,recompute_norm | deepseek_v2_mcore_tp1_pp1_ep8.sh | Y | Y | Y |
| | | n_group,seq_aux,gradient_accumulation_fusion,recompute_mtp_layer,recompute_mtp_norm | deepseek_v3_mcore_tp1_pp2_ep4.sh | Y | Y | Y |
| | | moe_alltoall_overlap_comm,moe-zero-memory,swap-attention,reuse_fp32_param,fused_rmsnorm,fused_swiglu | deepseek_500b_tp1_pp2_ep2_cp2_overlap.sh | Y | Y | Y |
| | | moe-fb-overlap, mtp-mem-efficient-logits, mla-mm-split, mla-fa-without-pad | deepseek32_tp1_pp2_vpp1_ep4.sh | Y | Y | Y |
| | | EP,CP,num_experts,moe_router_topk,aux_loss,moe_allgather,group_query_attention,rotary_base | mixtral_mcore_tp4_cp2_ep2_ptd.sh | Y | Y | Y |
| | | mamba_cp_algo,distributed optimizer,reuse_fp32_param,recompute-granularity,enable-recompute-layers-per-pp-rank | mamba2_8b_tp4_pp1_cp2_recompute_4k_ptd.sh | Y | Y | Y |
| | | triton, topk-softmax-in-fp32, moe-router-pre-softmax | qwen3_next_80b_4K_A3_ptd.sh | Y | Y | Y |
| | Finetune | CCLoRA, QLoRA | tune_llama2_tp1_pp1_qlora_ptd.sh | Y | Y | Y |
| | | LoRA, lora-fusion, llama3-rope, no-pad-to-seq-lengths, enable-hf2mg-convert, auto_data_process | tune_llama3_8b_lora_tp1pp8.sh | Y | Y | Y |
| | DPO | is_pairwise_dataset, cyclic | dpo_llama2_tp1_pp1_cyclic_pairwise.sh | Y | Y | Y |
| | FSDP | fsdp pretrain | pretrain_qwen3_8b_4k_fsdp2.sh | Y | Y | Y |
| | | fsdp sft | tune_gpt_oss_20b_a3b_4k_fsdp2.sh | Y | Y | Y |
| UT | Inference | greedy_search, lora_inference, deterministic_computation | test_inference.py | Y | | |
| | Evaluation | mmlu, prompt_mmlu, qwen2_mmlu, agieval, bbh | test_evaluate.py | Y | | |
| | Checkpoint | hf2mcore, mcore2hf, TP, PP, EP, DPP, VPP, moe, noop_layers, lora | test_checkpoint.py | Y | | |
| | | deepseek2, deepseek2_lite, llama2, llama3, qwen2 | | Y | | |
| | ProcessData | pretrain_data_alpaca, pretrain_merge_datasets, instruction_data_alpaca, instruction_merge_datasets | test_preprocess_data.py | Y | | |
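Pipeline Coverage List
Pipeline test cases cover all models and all features in the repository. They are launched every night, and a test report is produced the next day.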
| Tests | Module | Features | Scripts | Acc. | Throu. | Mem. |
|---|---|---|---|---|---|---|
| ST | baichuan2-13B | baichuan2_13b, no-gradient-accumulation-fusion | baichuan2_13b_tp8_pp1_mcore.sh | Y | Y | Y |
| | chatglm3-6B | chatglm3-6B, use-glm-rope, overlap-grad-reduce, overlap-param-gather | chatglm3_tp1_pp2_rope.sh | Y | Y | Y |
| | deepseek | deepseekv3, dualpipev, mla-up-proj-tp-overlap, moe-fb-overlap | deepseek_v3_mcore_tp2_pp2_ep2_dualpipev_fb.sh | Y | Y | Y |
| | | deepseekv2, moe-grouped-gemm, moe-permutation-async-comm, first-k-dense-replace | deepseek2_tp1_pp1_mcore_moe.sh | Y | Y | Y |
| | glm4 | glm4, overlap_grad_reduce, overlap_param_gather, distributed_optimizer, GQA, GLM-rope | glm4_9b_8k_tp2_pp2_ptd.sh | Y | Y | Y |
| | gpt4 | gpt4, distributed_optimizer, overlap_grad_reduce, overlap_param_gather | gpt4_mcore_tp4_cp2_32k_moe_drop.sh | Y | Y | Y |
| | grok1 | grok1, distributed_optimizer, reuse_fp32_param | grok1_40b_tp4_ep2_ptd.sh | Y | Y | Y |
| | high_availability | high_availability error_dump | high_availability_error_dump_ptd.sh | Y | Y | Y |
| | | high_availability uce_error | high_availability_uce_error_ptd.sh | Y | Y | Y |
| | hunyuan | hunyuan, distributed_optimizer, share_kvstates | tune_hunyuanLarge_389b_tp1_pp1_ep8_ptd.sh | Y | Y | Y |
| | internlm3 | internlm3, distributed_optimizer, fused_ring_attention_update | internlm3_8b_tp1_pp4_cp2_ptd.sh | Y | Y | Y |
| | llama2 | llama2, distributed_optimizer, overlap_grad_reduce, gloo | llama2_tp1_pp8_patch_gloo_ptd.sh | Y | Y | Y |
| | | llama2, TP-2D, ring_cp, distributed_optimizer, overlap_grad_reduce | llama2_tp4cp2pp1_tp2d_tpx2tpy2_ringcp.sh | Y | Y | Y |
| | | llama2, TP-2D, ulysses_cp, distributed_optimizer, overlap_grad_reduce | llama2_tp4cp2pp1_tp2d_tpx2tpy2_ulysses.sh | Y | Y | Y |
| | | llama2, TP-2D, VPP, distributed_optimizer, overlap_grad_reduce | llama2_tp4pp2vpp2_tp2d_tpx2tpy2.sh | Y | Y | Y |
| | | llama2, distributed_optimizer, overlap_grad_reduce, ascend_coc | llama2_tp8_pp1_coc_ptd.sh | Y | Y | Y |
| | | llama2, LoRA, lora-fusion, SFT | tune_llama2_tp1_pp1_lora_ptd.sh | Y | Y | Y |
| | | llama2, LU-LoRA, lora-fusion, SFT | tune_llama2_tp1_pp1_lu_lora_ptd.sh | Y | Y | Y |
| | | llama2, VPP, distributed_optimizer, overlap_grad_reduce, SFT | tune_llama2_tp2_pp4_vpp2_mcore_full.sh | Y | Y | Y |
| | llama3 | llama3, VPP, GQA, recompute, manual_gc | llama3_tp2_pp2_vpp1.sh | Y | Y | Y |
| | longcat-flash | longcat-flash, ETP, MLA, distributed_optimizer | longcat_flash_560b_tp2pp1ep2etp1.sh | Y | Y | Y |
| | mamba2 | mamba2, mamba_cp_algo, distributed_optimizer, reuse_fp32_param, overlap_grad_reduce, overlap_param_gather | mamba2_2.7b_tp1_pp1.sh | Y | Y | Y |
| | phi35-moe | phi35-moe, distributed_optimizer, overlap_grad_reduce, overlap_param_gather, longrope | phi35_moe_tp1_pp8_mcore.sh | Y | Y | Y |
| | qwen25 | qwen2-moe, distributed_optimizer, overlap_grad_reduce, overlap_param_gather, profile | qwen2_moe_tp1_pp2_ep2_cp2_32k.sh | Y | Y | Y |
| | | qwen25, distributed_optimizer, overlap_grad_reduce, SFT, neat_pack, padded_samples | tune_qwen25_0point5b_tp1_pp1_pack.sh | Y | Y | Y |
| | qwen3-30b | qwen3-30b DPO, distributed_optimizer, recompute, DPO | dpo_qwen3_30b_a3b_16K_A3_ptd_tp2pp4.sh | Y | Y | Y |
| | rlhf | rlhf GRPO | test_rlhf_qwen25_7b_tp2_pp2.sh | Y | Y | Y |
| | seed-oss | seed-oss, distributed_optimizer, calculate-per-token-loss | seed_oss_36b_tp2pp2.sh | Y | Y | Y |
| UT | checkpoint | test_checkpoint_param | test_checkpoint_param.py | Y | | |
| | | test_checkpoint_v2 | test_checkpoint_v2.py | Y | | |
| | | test_checkpoint | test_checkpoint.py | Y | | |
| | context_parallel | test_hybrid_context_parallel | test_hybrid_context_parallel.py | Y | | |
| | | test_ringattn_context_parallel | test_ringattn_context_parallel.py | Y | | |
| | | test_ulysses_context_parallel | test_ulysses_context_parallel.py | Y | | |
| | elastic_training | test_elastic_training_common | test_elastic_training_common.py | Y | | |
| | | test_elastic_training_register | test_elastic_training_register.py | Y | | |
| | | test_elastic_training_repair | test_elastic_training_repair.py | Y | | |
| | | test_elastic_training_rollback | test_elastic_training_rollback.py | Y | | |
| | | test_elastic_training_scale_in_rebuild | test_elastic_training_scale_in_rebuild.py | Y | | |
| | | test_elastic_training_scale_out_rebuild | test_elastic_training_scale_out_rebuild.py | Y | | |
| | evaluation | mmlu, cmmlu, humaneval, ceval, boolq, gsm8k, agieval, bbh, needlebench | test_evaluate.py | Y | | |
| | inference | greedy_search, do_sample_search, beam_search, chat | test_inference.py | Y | | |
| | model_module | test_attention, DotProductAttention, FlashSelfAttention, alibi | test_attention.py | Y | | |
| | | test_rotary_pos_embedding | test_rotary_pos_embedding.py | Y | | |
| | | test_topk_router, sparsemixer_topk | test_topk_router.py | Y | | |
| | process_data | test_process_instruction_data, alpaca, sharegpt, openai, merge_datasets, multi_handler, template | test_process_instruction_data.py | Y | | |
| | | test_process_pairwise_data, alpaca, sharegpt | test_process_pairwise_data.py | Y | | |
| | | test_process_pretrain_data, merge_datasets, GPTSentencePieceTokenizer | test_process_pretrain_data.py | Y | | |
DT Coverage Monitoring
Coverage analysis script
When running the tests/run_coverage.sh script, pass UT, ST, PIPELINE, or all as the run argument:
cd MindSpeed-LLM
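bash tests/run_coverage.sh UT # analyze coverage of the UT test cases
bash tests/run_coverage.sh ST # analyze coverage of the ST test cases
bash tests/run_coverage.sh PIPELINE # analyze coverage of the PIPELINE test cases
bash tests/run_coverage.sh all # analyze coverage of all UT, ST, and PIPELINE test cases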
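When branch is set to False in the script, only line coverage is analyzed. Change branch to True to measure branch coverage as well.
Coverage report
After run_coverage.sh runs on an NPU machine, a COVERAGE folder is generated in the project directory. COVERAGE/logs stores the detailed test case execution logs, and COVERAGE/report stores the repository's test coverage report.
COVERAGE/report/htmlcov.tgz contains detailed coverage information for every file in the repository. Copy this file to your local machine, extract it, and open htmlcov/index.html in a browser to view it.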
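The extract-and-open step can be scripted. The following is a minimal sketch, assuming the archive unpacks to a top-level htmlcov/ directory (as the path htmlcov/index.html above suggests). The function name `extract_coverage_report` and the local paths in the usage note are illustrative and not part of the repository.

```python
import tarfile
from pathlib import Path


def extract_coverage_report(archive: Path, dest: Path) -> Path:
    """Extract htmlcov.tgz into dest and return the path to htmlcov/index.html."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    # Assumption: the archive contains a top-level "htmlcov" directory.
    index = dest / "htmlcov" / "index.html"
    if not index.is_file():
        raise FileNotFoundError(f"{index} not found after extraction")
    return index
```

For example, after copying the archive to the current directory, `webbrowser.open(extract_coverage_report(Path("htmlcov.tgz"), Path("coverage_report")).resolve().as_uri())` should open the report in the default browser (this needs `import webbrowser`).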
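Development workflow
1. Weights and dataset configuration
The weights, tokenizer, and dataset files required by a test case must be stored under the /data directory on the blue-zone server as described below, and must be registered in the blue-zone resource list. Otherwise the test case will not be accepted into the repository!
Note:
- The /data/ci directory holds only files related to test cases; do not add unrelated files.
- To save blue-zone storage and improve run efficiency, reuse existing weights and data wherever possible. If you must upload weights, reduce the number of layers to the minimum.
- Model names must match those on Hugging Face; do not abbreviate them or invent your own.
Data paths and naming rules:
- HF weights and vocabulary path: /data/ci/models/{model name}/hf/{weight or vocabulary files}
- MG weights path: /data/ci/models/{model name}/mg/{model name}_{parallel split}
- Original datasets: /data/ci/datasets/origin/{dataset name}
- Processed datasets: /data/ci/datasets/processed/{dataset name}
- Evaluation datasets: /data/ci/datasets/eva_dataset/{dataset name}
- Cache folder: /data/ci/cache/{cache files}. Before the test case finishes, call shutil.rmtree(dir_path) to delete the cache files.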
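The path rules and the cache cleanup requirement above can be sketched in Python as follows. This is an illustration only: the helper names (`hf_dir`, `mg_dir`, `dataset_dir`, `ci_cache_dir`) are hypothetical and do not exist in the repository, and the model and split names in the usage note are examples.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path

CI_ROOT = Path("/data/ci")  # root directory named in the rules above


def hf_dir(model_name: str) -> Path:
    """Directory for Hugging Face weights and vocabulary files."""
    return CI_ROOT / "models" / model_name / "hf"


def mg_dir(model_name: str, split: str) -> Path:
    """Directory for converted weights, named {model name}_{parallel split}."""
    return CI_ROOT / "models" / model_name / "mg" / f"{model_name}_{split}"


def dataset_dir(kind: str, dataset_name: str) -> Path:
    """Directory for an original, processed, or evaluation dataset."""
    if kind not in {"origin", "processed", "eva_dataset"}:
        raise ValueError(f"unknown dataset kind: {kind!r}")
    return CI_ROOT / "datasets" / kind / dataset_name


@contextmanager
def ci_cache_dir(name: str, root: Path = CI_ROOT / "cache"):
    """Create a cache directory and always remove it when the block exits."""
    dir_path = root / name
    dir_path.mkdir(parents=True, exist_ok=True)
    try:
        yield dir_path
    finally:
        # Cleanup required by the rule above, even if the test case fails.
        shutil.rmtree(dir_path, ignore_errors=True)
```

A test case could then write `with ci_cache_dir("my_case") as cache:` around the code that produces temporary files, so the cache is removed whether the case passes or fails. Calling `mg_dir("Llama-2-7b-hf", "tp2_pp4")` gives /data/ci/models/Llama-2-7b-hf/mg/Llama-2-7b-hf_tp2_pp4.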
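2. Local verification
After writing a test case, first make sure it runs correctly locally, then generate its baseline on the blue-zone backup server. Submit the test case and the baseline together.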
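3. Test case registration
- Test case information registration: to make later maintenance easier, annotate each test case with its author, the date it was added to the repository, a brief description, and any other relevant information.
ST test cases must include the following information at the top of the run script: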
#=============================================
# Author: xxx
# Date: xxxx-xx-xx
# Description: Model or feature covered by the testcase
# Remarks: Instructions for the checkpoint, datasets and tokenizer or other more information
#=============================================
UT test cases include the following information inside the test function:
def test_featureA():
    '''
    Author: xxx
    Date: xxxx-xx-xx
    Description: Model or feature covered by the testcase
    Remarks: Instructions for the checkpoint, datasets and tokenizer or other more information
    '''
    ...
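- Coverage list registration: register the models and features guarded by the test case in /tests/README.md.
- Blue-zone resource registration: register the weights, vocabulary, and datasets used by the test case in /tests/resource_record.md.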
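A quick local check for the four registration fields (Author, Date, Description, Remarks) might look like the sketch below. The function `missing_header_fields` is hypothetical and not part of the repository. It only checks that each field name is followed by a non-empty value, and it accepts an optional leading `#` so the same check covers shell headers and Python docstrings.

```python
import re
from typing import List

REQUIRED_FIELDS = ("Author", "Date", "Description", "Remarks")


def missing_header_fields(text: str) -> List[str]:
    """Return the required fields that have no 'Field: value' line in text."""
    missing = []
    for field in REQUIRED_FIELDS:
        # Optional '#' prefix, then the field name, a colon, and a non-empty
        # value on the same line.
        pattern = rf"^[ \t]*#?[ \t]*{field}[ \t]*:[ \t]*\S"
        if not re.search(pattern, text, flags=re.MULTILINE):
            missing.append(field)
    return missing
```

Running it on the text of a script or test file before submitting returns an empty list when all four fields are filled in.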
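4. Submitting test cases
A feature must be submitted together with the test cases that guard it. A PR that contains only feature code and no guarding test cases will not be merged!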
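Development rules
All test cases are placed under the tests directory, organized as follows:
- tests/st/ holds the ST test cases launched by the CI gate
- tests/ut/ holds the UT test cases launched by the CI gate
- tests/pipeline/st holds the ST test cases launched by the daily PIPELINE run
- tests/pipeline/ut holds the UT test cases launched by the daily PIPELINE run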
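ST
① Place contributed script test cases in the st/shell_scripts folder. Name them {model name}_{parallel strategy} or {model name}_{feature name}, for example llama2_tp2_pp4_vpp2_ptd.sh. Contributors must follow this convention strictly;
② Test scripts do not need to redirect logs themselves; log collection is handled centrally in st_run.sh;
③ Place baseline data in the st/baseline_results folder, with a name that exactly matches the shell script; otherwise the automation scripts will not find it;
④ To obtain baseline data, run the gate job to get the first results and save them to a local log or txt file. Then extract them locally with the transfer_logs_as_json function in st/st_utils/common.py, and submit the result together with the test script;
⑤ When contributing, decide which metrics are ultimately checked: accuracy (Acc.), performance (Throu.), and device memory (Mem.). Enter Y in the corresponding column and leave unchecked metrics blank.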
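Because a baseline whose name does not match its shell script is not picked up, it can help to check the pairing locally before submitting. The sketch below is one way to do that. The function name is hypothetical, and the `.json` baseline suffix is an assumption inferred from the transfer_logs_as_json helper name rather than something stated above.

```python
from pathlib import Path
from typing import List


def scripts_without_baseline(st_root: Path, baseline_suffix: str = ".json") -> List[str]:
    """List shell scripts under st_root/shell_scripts with no same-named baseline."""
    scripts_dir = st_root / "shell_scripts"
    baselines_dir = st_root / "baseline_results"
    missing = []
    for script in sorted(scripts_dir.glob("*.sh")):
        # The baseline must share the script's name, differing only in suffix
        # (assumed to be .json here).
        baseline = baselines_dir / f"{script.stem}{baseline_suffix}"
        if not baseline.is_file():
            missing.append(script.name)
    return missing
```

Calling `scripts_without_baseline(Path("tests/st"))` from the repository root returns the scripts that still need a baseline file under this assumption.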
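UT
① All UT test cases should preferably be launched through distributed pytest, that is, by inheriting DistributedTest from tests/common.py and specifying world_size. Refer to existing test cases for details;
② Name folders by functional feature, with at most two directory levels. All test cases use test as the name prefix;
③ New test cases can be added as additional test_xxx functions alongside existing ones, keeping tests of the same functionality together where possible. For test cases that have a .json file, add a test_xxx entry to the .json and pass the parameters in the .py file through @pytest.mark.parametrize to construct the cases. Note that the key names in the .json must match the test_xxx names in the .py;
④ When contributing, decide which metrics are ultimately checked: accuracy (Acc.), performance (Throu.), and device memory (Mem.). Enter Y in the corresponding column and leave unchecked metrics blank.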
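The JSON-driven pattern in rule ③ can be sketched as follows. This is an assumption-laden illustration, not the repository's actual helper: the JSON layout, the parameter names, and the function `cases_for` are all hypothetical. The point it shows is that each top-level JSON key matches a test function name and maps to the list of parameter sets for that function.

```python
import json
from typing import Any, Dict, List

# Hypothetical content of a test_feature.json file next to test_feature.py.
# Each top-level key matches the name of a test function in the .py file.
EXAMPLE_JSON = """
{
  "test_featureA": [
    {"param": {"seq-length": 128, "top-k": 2}},
    {"param": {"seq-length": 256, "top-k": 4}}
  ]
}
"""


def cases_for(config_text: str, test_name: str) -> List[Dict[str, Any]]:
    """Return the parameter sets registered under test_name in the JSON text."""
    config = json.loads(config_text)
    if test_name not in config:
        raise KeyError(f"no entry named {test_name!r}; JSON keys must match test names")
    return config[test_name]


# In a real test file the list would be handed to pytest, for example:
#
#   @pytest.mark.parametrize("params", cases_for(json_text, "test_featureA"))
#   def test_featureA(self, params):
#       ...
```

With this layout pytest generates one test per entry in the list, so adding a scenario only requires a new JSON entry under the matching key.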