# Layers Module API Reference

This document describes the public interfaces exposed by the layers module in the mindiesd package. All interfaces can be imported directly via from mindiesd import <interface_name>.

Interface Overview

FA Series

The FA (Flash Attention) series provides Ascend-optimized attention computation capabilities, covering standard attention, variable-length sequence attention, and sparse attention scenarios.

Interface Type Description
attention_forward Function Standard attention forward computation with automatic operator selection
attention_forward_varlen Function Variable-length sequence attention forward computation
sparse_attention Function Sparse attention forward computation, supporting rf_v2 / ada_bsa strategies

Fused Operator Series

The fused operator series provides high-performance Ascend fused operators, covering position encoding, normalization, and activation functions.

Interface Type Description
rotary_position_embedding Function Rotary Position Embedding (RoPE) fused operator
RMSNorm Class RMS normalization fused operator
fast_layernorm Function High-performance LayerNorm fused operator
layernorm_scale_shift Function Adaptive LayerNorm (AdaLayerNorm) fused operator
get_activation_layer Function Get an activation function instance (includes NPU-accelerated variants)

FA Series

attention_forward

Standard attention forward computation interface, supporting multiple underlying operators (PFA, FASCore, LaserAttention, etc.) and automatic tuning.

from mindiesd import attention_forward

Function Signature

attention_forward(
    query, key, value,
    attn_mask=None,
    scale=None,
    fused=True,
    head_first=False,
    **kwargs
) -> torch.Tensor

Parameters

Parameter Type Required Default Description
query torch.Tensor Yes - Query tensor, 4D, layout [B,S,N,D] or [B,N,S,D]
key torch.Tensor Yes - Key tensor, 4D, same layout as query
value torch.Tensor Yes - Value tensor, 4D, same layout as query
attn_mask torch.Tensor No None Attention mask
scale float No None Scale factor; defaults to head_dim ** -0.5 when None
fused bool No True Whether to use fused operators; falls back to native computation when False
head_first bool No False Whether the head dimension precedes the sequence dimension; True means [B,N,S,D], False means [B,S,N,D]
kwargs.opt_mode str No "runtime" Operator dispatch mode: "runtime", "static", or "manual"
kwargs.op_type str No "fused_attn_score" Operator type, only effective when opt_mode="manual"; supports "prompt_flash_attn", "fused_attn_score", "ascend_laser_attention"
kwargs.layout str No "BNSD" Operator layout, only effective when opt_mode="manual"; supports "BNSD", "BSND", "BSH"

Returns

torch.Tensor: Attention computation result, same layout as input.

Example

import torch
from mindiesd import attention_forward

query = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
key = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
value = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)

out = attention_forward(query, key, value)

Migration Guide

  • When migrating from torch.nn.functional.scaled_dot_product_attention, adjust the input layout from [B,N,S,D] to [B,S,N,D] and remove transpose operations.
  • When migrating from flash_attn.flash_attn_func, the input layout is already [B,S,N,D] and can be replaced directly.
  • This interface provides forward inference only and does not support backward gradient computation. Remove dropout and set requires_grad=False on input tensors when migrating.

attention_forward_varlen

Variable-length sequence attention forward computation interface, suitable for scenarios where sequence lengths vary within a batch.

from mindiesd import attention_forward_varlen

Function Signature

attention_forward_varlen(
    q, k, v,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q=None,
    max_seqlen_k=None,
    dropout_p=0.0,
    softmax_scale=None,
    causal=False,
    window_size=None,
    softcap=None,
    alibi_slopes=None,
    deterministic=None,
    return_attn_probs=None,
    block_table=None
) -> torch.Tensor

Parameters

Parameter Type Required Default Description
q torch.Tensor Yes - Query tensor, 3D, layout [T, N, D] (T is total token count across all sequences)
k torch.Tensor Yes - Key tensor, 3D, layout [T, N, D]
v torch.Tensor Yes - Value tensor, 3D, layout [T, N, D]
cu_seqlens_q list[torch.Tensor] Yes - Cumulative sequence lengths for query, shape (batch_size + 1,), dtype torch.int32
cu_seqlens_k list[torch.Tensor] Yes - Cumulative sequence lengths for key, shape (batch_size + 1,), dtype torch.int32
max_seqlen_q int No None Reserved parameter
max_seqlen_k int No None Reserved parameter
dropout_p float No 0.0 Dropout probability; currently only 0.0 is supported
softmax_scale float No None Scale factor; defaults to head_dim ** -0.5 when None
causal bool No False Whether to apply causal attention mask
window_size int No None Reserved parameter
softcap float No None Reserved parameter
alibi_slopes torch.Tensor No None Reserved parameter
deterministic bool No None Reserved parameter
return_attn_probs bool No None Reserved parameter
block_table torch.Tensor No None Reserved parameter

Returns

torch.Tensor: Attention computation result, shape (total, nheads, headdim).

Example

import torch
from mindiesd import attention_forward_varlen

q = torch.randn(8192, 24, 128, device="npu", dtype=torch.float16)
k = torch.randn(8192, 24, 128, device="npu", dtype=torch.float16)
v = torch.randn(8192, 24, 128, device="npu", dtype=torch.float16)
cu_seqlens_q = torch.tensor([0, 2048, 4096, 6144, 8192], dtype=torch.int32, device="npu")
cu_seqlens_k = torch.tensor([0, 2048, 4096, 6144, 8192], dtype=torch.int32, device="npu")

out = attention_forward_varlen(q, k, v, cu_seqlens_q, cu_seqlens_k, causal=False)

Migration Guide

  • When migrating from flash_attn.flash_attn_varlen_func, the interface parameters are largely compatible and can be replaced directly.

sparse_attention

Sparse attention forward computation interface, supporting two sparse strategies: RainFusion (rf_v2) and Adaptive Block Sparse Attention (ada_bsa).

from mindiesd import sparse_attention

Function Signature

sparse_attention(
    q, k, v,
    attn_mask=None,
    scale=None,
    is_causal=False,
    head_num=1,
    input_layout="BNSD",
    inner_precise=0,
    sparse_type=None,
    txt_len=0,
    block_size=128,
    latent_shape_q=None,
    latent_shape_k=None,
    keep_sink=True,
    keep_recent=True,
    cdf_threshold=1.0,
    sparsity=0.0,
    **kwargs
) -> torch.Tensor

Parameters

Parameter Type Required Default Description
q torch.Tensor Yes - Query tensor, 4D, layout determined by input_layout
k torch.Tensor Yes - Key tensor, 4D, layout determined by input_layout
v torch.Tensor Yes - Value tensor, 4D, layout determined by input_layout
attn_mask torch.Tensor No None Attention mask, reserved parameter
scale float No None Scale factor; defaults to head_dim ** -0.5 when None
is_causal bool No False Whether to apply causal attention mask
head_num int No 1 Number of attention heads
input_layout str No "BNSD" Tensor layout, supports "BNSD" or "BSND"
inner_precise int No 0 Compute precision mode: 0 for high precision, 1 for high performance
sparse_type str No None Sparse type: None, "rf_v2", or "ada_bsa"
txt_len int No 0 Text sequence length, only effective when sparse_type="rf_v2"
block_size int No 128 Block size; currently only 128 is supported
latent_shape_q list No None Latent shape for query [t, h, w], t*h*w = qseqlen, only effective when sparse_type="rf_v2"
latent_shape_k list No None Latent shape for key [t, h, w], t*h*w = kseqlen, only effective when sparse_type="rf_v2"
keep_sink bool No True Whether to retain sink tokens, only effective when sparse_type="ada_bsa"
keep_recent bool No True Whether to retain recent tokens, only effective when sparse_type="ada_bsa"
cdf_threshold float No 1.0 CDF threshold, only effective when sparse_type="ada_bsa"
sparsity float No 0.0 Sparsity ratio, range [0, 1]; 0 disables sparse algorithm

Returns

torch.Tensor: Attention computation result, same layout as input.

Example

import torch
from mindiesd import sparse_attention

q = torch.randn(2, 24, 4096, 128, device="npu", dtype=torch.float16)
k = torch.randn(2, 24, 4096, 128, device="npu", dtype=torch.float16)
v = torch.randn(2, 24, 4096, 128, device="npu", dtype=torch.float16)

out = sparse_attention(
    q, k, v,
    head_num=24,
    input_layout="BNSD",
    sparse_type="ada_bsa",
    sparsity=0.5
)

Fused Operator Series

rotary_position_embedding

Rotary Position Embedding (RoPE) fused operator, injecting positional information into query and key tensors through rotation matrices.

from mindiesd import rotary_position_embedding

Function Signature

rotary_position_embedding(
    x, cos, sin,
    rotated_mode="rotated_half",
    head_first=False,
    fused=True
) -> torch.Tensor

Parameters

Parameter Type Required Default Description
x torch.Tensor Yes - Query or key tensor, 4D, supports layouts [B,N,S,D], [B,S,N,D], [S,B,N,D]
cos torch.Tensor Yes - Precomputed cosine frequency tensor, 2D [S,D] or 4D [1,1,S,D]/[1,S,1,D]/[S,1,1,D]
sin torch.Tensor Yes - Precomputed sine frequency tensor, same dimensions as cos
rotated_mode str No "rotated_half" Rotation mode: "rotated_half" for half rotation, "rotated_interleaved" for interleaved rotation
head_first bool No False Whether the head dimension precedes the sequence dimension
fused bool No True Whether to use fused operators

Returns

torch.Tensor: Tensor with rotary position embeddings applied, same shape as input x.

Example

import torch
from mindiesd import rotary_position_embedding

x = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
cos = torch.randn(1, 4096, 1, 128, device="npu", dtype=torch.float16)
sin = torch.randn(1, 4096, 1, 128, device="npu", dtype=torch.float16)

out = rotary_position_embedding(x, cos, sin, rotated_mode="rotated_half", head_first=False, fused=True)

Rotation Mode Description

  • rotated_half: Suitable for models such as OpenSoraPlan and Stable Audio. Splits x into front and back halves for rotation.
  • rotated_interleaved: Suitable for models such as HunyuanDiT, OpenSora, Flux, and CogVideox. Rotates x by interleaving adjacent elements.

RMSNorm

RMS normalization fused operator, equivalent to T5LayerNorm. It avoids explicit mean computation and focuses on the root mean square of the input tensor.

from mindiesd import RMSNorm

Class Signature

RMSNorm(hidden_size, eps=1e-6)

Constructor Parameters

Parameter Type Required Default Description
hidden_size int Yes - Hidden dimension size
eps float No 1e-6 Numerical stability parameter

forward Method

forward(hidden_states, if_fused=True) -> torch.Tensor
Parameter Type Required Default Description
hidden_states torch.Tensor Yes - Input tensor, dimension range 2~8
if_fused bool No True Whether to use NPU fused operators

Example

import torch
from mindiesd import RMSNorm

norm = RMSNorm(1024, eps=1e-6)
x = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)
out = norm(x)

fast_layernorm

High-performance LayerNorm fused operator, supporting multiple compute precision modes.

from mindiesd import fast_layernorm

Function Signature

fast_layernorm(
    norm, x,
    impl_mode=0,
    fused=True
) -> torch.Tensor

Parameters

Parameter Type Required Default Description
norm torch.nn.LayerNorm Yes - PyTorch LayerNorm instance
x torch.Tensor Yes - Input tensor, 3D, layout [B,S,H]
impl_mode int No 0 Compute mode: 0 high precision, 1 high performance, 2 float16 mode (only available when all inputs are float16)
fused bool No True Whether to use fused operators; falls back to standard torch.nn.LayerNorm when False

Returns

torch.Tensor: LayerNorm computation result, same shape as input x.

Example

import torch
import torch.nn as nn
from mindiesd import fast_layernorm

norm = nn.LayerNorm(1024, eps=1e-5)
x = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)

out = fast_layernorm(norm, x, impl_mode=0, fused=True)

layernorm_scale_shift

Adaptive LayerNorm (AdaLayerNorm) fused operator, adding adaptive scaling and shifting on top of LayerNorm.

Computation formula: out = layernorm(x) * (1 + scale) + shift

from mindiesd import layernorm_scale_shift

Function Signature

layernorm_scale_shift(
    layernorm, x, scale, shift,
    fused=True
) -> torch.Tensor

Parameters

Parameter Type Required Default Description
layernorm torch.nn.LayerNorm Yes - PyTorch LayerNorm instance
x torch.Tensor Yes - Input tensor, 3D, layout [B,S,H]
scale torch.Tensor Yes - Adaptive scaling parameter, 2D [B,H] or 3D [B,1,H]
shift torch.Tensor Yes - Adaptive shifting parameter, 2D [B,H] or 3D [B,1,H]
fused bool No True Whether to use fused operators

Returns

torch.Tensor: AdaLayerNorm computation result, same shape as input x.

Example

import torch
import torch.nn as nn
from mindiesd import layernorm_scale_shift

norm = nn.LayerNorm(1024, eps=1e-5)
x = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)
scale = torch.randn(2, 1024, device="npu", dtype=torch.float16)
shift = torch.randn(2, 1024, device="npu", dtype=torch.float16)

out = layernorm_scale_shift(norm, x, scale, shift, fused=True)

Constraints

  • The last dimension of x must equal the last dimensions of scale and shift.
  • If scale or shift is a 3D tensor, the second dimension must be 1.

get_activation_layer

Get an activation function instance by name. Some activation functions provide NPU-accelerated variants.

from mindiesd import get_activation_layer

Function Signature

get_activation_layer(act_type: str) -> nn.Module

Parameters

Parameter Type Required Default Description
act_type str Yes - Activation function name, case-insensitive

Supported Activation Functions

Name Implementation Description
"swish" nn.SiLU Swish activation function
"silu" nn.SiLU SiLU activation function (equivalent to swish)
"mish" nn.Mish Mish activation function
"gelu" GELU Standard GELU
"relu" nn.ReLU ReLU activation function
"gelu-tanh" GELU(approximate="tanh") tanh-approximated GELU
"gelu-fast" GELU(approximate="fast") Fast GELU, accelerated by NPU npu_fast_gelu operator

Returns

nn.Module: Instance of the requested activation function.

Example

from mindiesd import get_activation_layer

act = get_activation_layer("gelu-fast")
out = act(hidden_states)