| File | Last commit | Last updated |
|---|---|---|
| | test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com> | 19 days ago |
| | test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com> | 19 days ago |
| | feat(ops): basic fuser ops API Co-authored-by: mingzhenwang<wangmingzhen4@huawei.com> | 1 month ago |
| | refactor(grouped_linear,gemm,fp8): overhaul grouped matmul with unified dispatch, NPU version check, and FP8 quantization cleanup Co-authored-by: Muu<koimuu@163.com> | 17 days ago |
| | feat: add BasicLinear UT Co-authored-by: junhang<wangjunhang7@huawei.com> | 15 days ago |
| | refactor(grouped_linear,gemm,fp8): overhaul grouped matmul with unified dispatch, NPU version check, and FP8 quantization cleanup Co-authored-by: Muu<koimuu@163.com> | 17 days ago |
| | test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com> | 19 days ago |
| | feat:tenpu rope adaptor Co-authored-by: Liz_<lizhi166@huawei.com> | 1 month ago |
| | feat: add BasicLinear UT Co-authored-by: junhang<wangjunhang7@huawei.com> | 15 days ago |
| | [feat][Optimizers] multi_tensor-related implementation Co-authored-by: Keilo_W<wangkaiyu11@h-partners.com> | 1 month ago |
| | feat(FP4): Support W4A4-MXFP4 Co-authored-by: mingzhenwang<wangmingzhen4@huawei.com> | 15 days ago |
| | fix: set input_quantizer usage explicitly so a forward-only pass does not pollute the next training forward Co-authored-by: Bruce-rl-hw<okwsl201210@gmail.com> # message auto-generated for no-merge-commit merge: !95 merge fix/mxfp8-thd-columnwise into main. Created-by: Bruce-rl-hw Commit-by: Bruce-rl-hw Merged-by: ascend-robot Description: Fix mxfp8 training in LayerNormLinear by setting the input quantizer's usage explicitly on every forward (rowwise=True, columnwise=backward_needs_input), so that a prior forward-only pass (e.g. an RL log-prob recompute) cannot leave columnwise=False and make the backward weight-gradient GEMM crash with "Cannot access storage of UndefinedTensorImpl". Verified on qwen3-0.6B/Ascend 950 DT. See merge request: Ascend/TransformerEngineNPU!95 | 13 hours ago |
| | Added support for the fused operator framework (graph capture mode not supported). Co-authored-by: dao_qian<tanao4@huawei.com> | 1 month ago |
| | fix: ut for ci Co-authored-by: clc2025<chenlucong@huawei.com> | 23 days ago |
| | fix: stabilize quantized tensor npu paths | 1 month ago |
| | adapt aux_loss Co-authored-by: dao_qian<tanao4@huawei.com> | 1 month ago |
| | test: add distributed runner for test Co-authored-by: yanzhengyang<yanzhengyang@huawei.com> | 19 days ago |
| | tenpu layernorm mlp Co-authored-by: clc2025<chenlucong@huawei.com> | 2 months ago |
| | add sequential support Co-authored-by: dao_qian<tanao4@huawei.com> | 1 month ago |
| | fix: support attn_mask_type=padding_causal in NPU FlashAttention Co-authored-by: Bruce-rl-hw<okwsl201210@gmail.com> # message auto-generated for no-merge-commit merge: !93 merge fix/thd-padding-causal into main. Created-by: Bruce-rl-hw Commit-by: Bruce-rl-hw Merged-by: ascend-robot Description: Megatron-core passes attn_mask_type=padding_causal for packed (qkv_format=thd) sequences, which NVIDIA TransformerEngine treats as causal masking within each packed segment. TE-NPU only handled causal: get_fa_config() let padding_causal fall through to sparse_mode=0, and DotProductAttention only built the compressed causal mask when attn_mask_type == causal, so packed/thd attention ran with NO causal mask (fully bidirectional). This corrupts per-token logprobs in RL training (actor recompute Pearson corr ~0.2 vs ~0.999 for the non-packed path). The fix maps padding_causal (and the documented equivalents padding,causal / causal,padding) to the same path as causal: sparse_mode=2 + compressed causal mask. Per-segment padding is handled via actual_seq_qlen/cu_seqlens, not the attention_mask. Top-left causal only; *_bottom_right is intentionally excluded (that is right-down causal = sparse_mode 3, not 2). Adds tests/pytorch/test_thd_padding_causal.py, which asserts the get_fa_config mapping and that a packed thd padding_causal forward matches the per-segment causal ground truth. Also verified on Qwen3-0.6B / Ascend 950DT: isolated packed-vs-nonpacked forward corr 0.33 -> 1.0; GRPO thd rollout_actor_probs_pearson_corr 0.18 -> 0.999. See merge request: Ascend/TransformerEngineNPU!93 | 9 hours ago |
| | feat(FP4): Support W4A4-MXFP4 Co-authored-by: mingzhenwang<wangmingzhen4@huawei.com> | 15 days ago |