TransformerEngineNPU/tests/pytorch · Ascend/TransformerEngineNPU - AtomGit

ascend-robot · fix: support attn_mask_type=padding_causal in NPU FlashAttention

File	Last commit	Last updated
conftest.py	test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com>	19 days ago
distributed_testing.py	test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com>	19 days ago
test_basic_fuser_ops.py	feat(ops): basic fuser ops API Co-authored-by: mingzhenwang<wangmingzhen4@huawei.com>	1 month ago
test_cpu_offloading.py	refactor(grouped_linear,gemm,fp8): overhaul grouped matmul with unified dispatch, NPU version check, and FP8 quantization cleanup Co-authored-by: Muu<koimuu@163.com>	17 days ago
test_distributed_ops.py	feat: add BasicLinear UT Co-authored-by: junhang<wangjunhang7@huawei.com>	15 days ago
test_fsdp.py	refactor(grouped_linear,gemm,fp8): overhaul grouped matmul with unified dispatch, NPU version check, and FP8 quantization cleanup Co-authored-by: Muu<koimuu@163.com>	17 days ago
test_fused_optimizer.py	test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com>	19 days ago
test_fused_rope.py	feat:tenpu rope adaptor Co-authored-by: Liz_<lizhi166@huawei.com>	1 month ago
test_fusible_ops.py	feat: add BasicLinear UT Co-authored-by: junhang<wangjunhang7@huawei.com>	15 days ago
test_multi_tensor.py	[feat][Optimizers] multi_tensor related implementation Co-authored-by: Keilo_W<wangkaiyu11@h-partners.com>	1 month ago
test_mxfp4_tensor.py	feat(FP4): Support W4A4-MXFP4 Co-authored-by: mingzhenwang<wangmingzhen4@huawei.com>	15 days ago
test_mxfp8_forward_only_then_train.py	fix: set input_quantizer usage explicitly so a forward-only pass does not pollute the next training forward. Co-authored-by: Bruce-rl-hw<okwsl201210@gmail.com>. Auto-generated message for no-merge-commit merge !95 (fix/mxfp8-thd-columnwise into main). Created-by: Bruce-rl-hw. Commit-by: Bruce-rl-hw. Merged-by: ascend-robot. Description: Fix mxfp8 training in LayerNormLinear by setting the input quantizer's usage explicitly on every forward (rowwise=True, columnwise=backward_needs_input), so that a prior forward-only pass (e.g. an RL log-prob recompute) cannot leave columnwise=False and make the backward weight-gradient GEMM crash with "Cannot access storage of UndefinedTensorImpl". Verified on qwen3-0.6B/Ascend 950 DT. See merge request: Ascend/TransformerEngineNPU!95	13 hours ago
test_ops_fuser.py	Added support for the fused operator framework (graph capture mode not supported). Co-authored-by: dao_qian<tanao4@huawei.com>	1 month ago
test_permutation.py	fix: ut for ci Co-authored-by: clc2025<chenlucong@huawei.com>	23 days ago
test_quantized_tensor.py	fix: stabilize quantized tensor npu paths	1 month ago
test_router.py	adapt aux_loss Co-authored-by: dao_qian<tanao4@huawei.com>	1 month ago
test_sanity.py	test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com>	19 days ago
test_selective_activation_checkpoint.py	tenpu layernorm mlp Co-authored-by: clc2025<chenlucong@huawei.com>	2 months ago
test_sequential.py	add sequential support Co-authored-by: dao_qian<tanao4@huawei.com>	1 month ago
test_thd_padding_causal.py	fix: support attn_mask_type=padding_causal in NPU FlashAttention. Co-authored-by: Bruce-rl-hw<okwsl201210@gmail.com>. Auto-generated message for no-merge-commit merge !93 (fix/thd-padding-causal into main). Created-by: Bruce-rl-hw. Commit-by: Bruce-rl-hw. Merged-by: ascend-robot. Description: Megatron-core passes attn_mask_type=padding_causal for packed (qkv_format=thd) sequences, which NVIDIA TransformerEngine treats as causal masking within each packed segment. TE-NPU only handled causal: get_fa_config() let padding_causal fall through to sparse_mode=0, and DotProductAttention only built the compressed causal mask for attn_mask_type == causal, so packed/thd attention ran with no causal mask at all (fully bidirectional). This corrupts per-token logprobs in RL training (actor recompute Pearson correlation ~0.2 vs ~0.999 for the non-packed path). The fix maps padding_causal (and the documented equivalents padding,causal / causal,padding) to the same path as causal: sparse_mode=2 plus the compressed causal mask. Per-segment padding is handled via actual_seq_qlen/cu_seqlens, not the attention_mask. Only top-left causal is covered; *_bottom_right is intentionally excluded (that is right-down causal, i.e. sparse_mode 3, not 2). Adds tests/pytorch/test_thd_padding_causal.py, which asserts the get_fa_config mapping and that a packed thd padding_causal forward matches per-segment causal ground truth. Also verified on Qwen3-0.6B / Ascend 950DT: isolated packed-vs-nonpacked forward correlation 0.33 -> 1.0; GRPO thd rollout_actor_probs_pearson_corr 0.18 -> 0.999. See merge request: Ascend/TransformerEngineNPU!93	9 hours ago
utils.py	feat(FP4): Support W4A4-MXFP4 Co-authored-by: mingzhenwang<wangmingzhen4@huawei.com>	15 days ago
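
The commit message for test_thd_padding_causal.py describes two ideas: a mapping from attn_mask_type to sparse_mode, and causal masking applied within each packed segment delimited by cu_seqlens. The following is a minimal, pure-Python sketch of both, for illustration only. The names `sparse_mode_for` and `packed_causal_allowed` are hypothetical and are not the repository's get_fa_config() or its test code; returning 3 for `*_bottom_right` is an assumption based on the message's remark that right-down causal corresponds to sparse_mode 3.

```python
# Illustrative sketch only; these helpers are hypothetical, not TE-NPU APIs.

# Mask types the commit message says should share the causal path.
CAUSAL_EQUIVALENTS = {"causal", "padding_causal", "padding,causal", "causal,padding"}


def sparse_mode_for(attn_mask_type: str) -> int:
    """Map an attn_mask_type string to a sparse_mode value."""
    if attn_mask_type in CAUSAL_EQUIVALENTS:
        return 2  # top-left causal, used with the compressed causal mask
    if attn_mask_type.endswith("_bottom_right"):
        return 3  # assumption: right-down causal, kept separate from mode 2
    return 0  # no causal masking


def packed_causal_allowed(cu_seqlens):
    """Build allowed[i][j]: True if packed token i may attend to token j.

    cu_seqlens holds cumulative segment boundaries, e.g. [0, 2, 5] means
    segment 0 covers tokens 0..1 and segment 1 covers tokens 2..4.
    """
    total = cu_seqlens[-1]
    segment_of = [0] * total
    for seg, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
        for token in range(start, end):
            segment_of[token] = seg
    # A token sees only earlier-or-equal positions inside its own segment.
    return [
        [segment_of[i] == segment_of[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    for row in packed_causal_allowed([0, 2, 5]):
        print("".join("x" if ok else "." for ok in row))
```

A reference mask like this is the kind of per-segment ground truth a packed thd forward could be compared against: each segment gets its own lower-triangular block, and no token attends across a segment boundary.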