| File | Last commit | Last updated |
|---|---|---|
| | fix: support attn_mask_type=padding_causal in NPU FlashAttention. Co-authored-by: Bruce-rl-hw <okwsl201210@gmail.com>. Message auto-generated for no-merge-commit merge: !93, merge fix/thd-padding-causal into main. Created-by: Bruce-rl-hw. Commit-by: Bruce-rl-hw. Merged-by: ascend-robot. Description: Megatron-core passes `attn_mask_type=padding_causal` for packed (`qkv_format=thd`) sequences, which NVIDIA TransformerEngine treats as causal masking within each packed segment. TE-NPU only handled `causal`: `get_fa_config()` let `padding_causal` fall through to `sparse_mode=0`, and `DotProductAttention` only built the compressed causal mask when the type was exactly `causal`, so packed/thd attention ran with no causal mask at all (fully bidirectional). This corrupts per-token logprobs in RL training (actor recompute Pearson correlation ~0.2, versus ~0.999 for the non-packed path). The fix maps `padding_causal` (and the documented equivalents `padding,causal` / `causal,padding`) to the same path as `causal`: `sparse_mode=2` plus the compressed causal mask. Per-segment padding is handled via `actual_seq_qlen`/`cu_seqlens`, not the attention mask. Top-left causal only; `*_bottom_right` is intentionally excluded (that is right-down causal, `sparse_mode` 3, not 2). Adds `tests/pytorch/test_thd_padding_causal.py`, which asserts the `get_fa_config` mapping and that a packed thd `padding_causal` forward matches the per-segment causal ground truth. Also verified on Qwen3-0.6B / Ascend 950DT: isolated packed-vs-nonpacked forward correlation 0.33 -> 1.0; GRPO thd `rollout_actor_probs_pearson_corr` 0.18 -> 0.999. See merge request: Ascend/TransformerEngineNPU!93 | 16 hours ago |
| | feat:adaptor attention kvallagther/ulysses. Co-authored-by: Liz <lizhi166@huawei.com> | 1 month ago |
| | feat:tenpu rope adaptor. Co-authored-by: Liz_ <lizhi166@huawei.com> | 1 month ago |
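
The mask-type mapping and the per-segment causal semantics described in the first commit message can be illustrated with a minimal sketch. This is not the TE-NPU code: `sparse_mode_for` and `packed_causal_allowed` are hypothetical helper names, the real `get_fa_config()` signature is not shown in the source, and the explicit boolean matrix is only a reference for what "causal within each packed segment" means (the commit states that the NPU path uses a compressed causal mask plus `cu_seqlens` instead).

```python
from bisect import bisect_right

# Mask types the commit says should share the top-left causal path (sparse_mode=2).
TOP_LEFT_CAUSAL_TYPES = {"causal", "padding_causal", "padding,causal", "causal,padding"}


def sparse_mode_for(attn_mask_type):
    """Hypothetical stand-in for the mapping described in the commit message."""
    if attn_mask_type in TOP_LEFT_CAUSAL_TYPES:
        return 2
    if attn_mask_type.endswith("_bottom_right"):
        # The commit excludes these variants; this sketch does not model them.
        raise ValueError("bottom-right causal is outside the scope of this sketch")
    return 0  # default the commit describes for types that fall through


def packed_causal_allowed(cu_seqlens):
    """Reference semantics: allowed[q][k] is True when token q may attend to token k.

    cu_seqlens holds cumulative segment boundaries, e.g. [0, 2, 5] means
    two packed segments covering tokens 0..1 and 2..4.
    """
    total = cu_seqlens[-1]
    # Segment index of every token in the packed sequence.
    seg = [bisect_right(cu_seqlens, t) - 1 for t in range(total)]
    return [
        [seg[q] == seg[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    print(sparse_mode_for("padding_causal"))
    for row in packed_causal_allowed([0, 2, 5]):
        print("".join("x" if ok else "." for ok in row))
```

In the printed matrix, each segment forms its own lower-triangular block, so the first token of the second segment cannot see any token of the first segment. Running attention with no mask, as the commit describes for the pre-fix behavior, would instead let every token attend to every other token.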