# Layers Module API Reference
This document describes the public interfaces exposed by the layers module in the `mindiesd` package. All interfaces can be imported directly via `from mindiesd import <interface_name>`.
## Interface Overview

### FA Series

The FA (Flash Attention) series provides Ascend-optimized attention computation capabilities, covering standard attention, variable-length sequence attention, and sparse attention scenarios.

| Interface | Type | Description |
|---|---|---|
| `attention_forward` | Function | Standard attention forward computation with automatic operator selection |
| `attention_forward_varlen` | Function | Variable-length sequence attention forward computation |
| `sparse_attention` | Function | Sparse attention forward computation, supporting `rf_v2` / `ada_bsa` strategies |
### Fused Operator Series

The fused operator series provides high-performance Ascend fused operators, covering position encoding, normalization, and activation functions.

| Interface | Type | Description |
|---|---|---|
| `rotary_position_embedding` | Function | Rotary Position Embedding (RoPE) fused operator |
| `RMSNorm` | Class | RMS normalization fused operator |
| `fast_layernorm` | Function | High-performance LayerNorm fused operator |
| `layernorm_scale_shift` | Function | Adaptive LayerNorm (AdaLayerNorm) fused operator |
| `get_activation_layer` | Function | Get an activation function instance (includes NPU-accelerated variants) |
## FA Series

### attention_forward

Standard attention forward computation interface, supporting multiple underlying operators (PFA, FASCore, LaserAttention, etc.) and automatic tuning.

```python
from mindiesd import attention_forward
```

#### Function Signature

```python
attention_forward(
    query, key, value,
    attn_mask=None,
    scale=None,
    fused=True,
    head_first=False,
    **kwargs
) -> torch.Tensor
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `torch.Tensor` | Yes | - | Query tensor, 4D, layout `[B,S,N,D]` or `[B,N,S,D]` |
| `key` | `torch.Tensor` | Yes | - | Key tensor, 4D, same layout as `query` |
| `value` | `torch.Tensor` | Yes | - | Value tensor, 4D, same layout as `query` |
| `attn_mask` | `torch.Tensor` | No | `None` | Attention mask |
| `scale` | `float` | No | `None` | Scale factor; defaults to `head_dim ** -0.5` when `None` |
| `fused` | `bool` | No | `True` | Whether to use fused operators; falls back to native computation when `False` |
| `head_first` | `bool` | No | `False` | Whether the head dimension precedes the sequence dimension; `True` means `[B,N,S,D]`, `False` means `[B,S,N,D]` |
| `kwargs.opt_mode` | `str` | No | `"runtime"` | Operator dispatch mode: `"runtime"`, `"static"`, or `"manual"` |
| `kwargs.op_type` | `str` | No | `"fused_attn_score"` | Operator type, only effective when `opt_mode="manual"`; supports `"prompt_flash_attn"`, `"fused_attn_score"`, `"ascend_laser_attention"` |
| `kwargs.layout` | `str` | No | `"BNSD"` | Operator layout, only effective when `opt_mode="manual"`; supports `"BNSD"`, `"BSND"`, `"BSH"` |
#### Returns

`torch.Tensor`: Attention computation result, same layout as input.

#### Example

```python
import torch
from mindiesd import attention_forward

# Default layout is [B, S, N, D] because head_first=False.
query = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
key = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
value = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
out = attention_forward(query, key, value)
```
#### Migration Guide

- When migrating from `torch.nn.functional.scaled_dot_product_attention`, adjust the input layout from `[B,N,S,D]` to `[B,S,N,D]` and remove `transpose` operations.
- When migrating from `flash_attn.flash_attn_func`, the input layout is already `[B,S,N,D]` and can be replaced directly.
- This interface provides forward inference only and does not support backward gradient computation. Remove `dropout` and set `requires_grad=False` on input tensors when migrating.
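
To make the default `scale` concrete, the following is a minimal sketch of textbook scaled dot-product attention for a single head, written in plain Python so it runs without an NPU. The function name `reference_attention` is hypothetical and is not part of `mindiesd`; the fused operators may differ in precision and internal ordering.

```python
import math


def reference_attention(q, k, v, scale=None):
    """Single-head attention over lists of row vectors (illustration only)."""
    head_dim = len(q[0])
    if scale is None:
        # Same default as documented for attention_forward.
        scale = head_dim ** -0.5
    out = []
    for q_row in q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax over the key axis.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value rows.
        out.append([
            sum(w * v_row[d] for w, v_row in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
out = reference_attention(q, k, v)
```

Because the value rows here are one-hot, each output row equals the softmax weights, so the query's preference for the first key is directly visible in `out[0]`.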
### attention_forward_varlen

Variable-length sequence attention forward computation interface, suitable for scenarios where sequence lengths vary within a batch.

```python
from mindiesd import attention_forward_varlen
```

#### Function Signature

```python
attention_forward_varlen(
    q, k, v,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q=None,
    max_seqlen_k=None,
    dropout_p=0.0,
    softmax_scale=None,
    causal=False,
    window_size=None,
    softcap=None,
    alibi_slopes=None,
    deterministic=None,
    return_attn_probs=None,
    block_table=None
) -> torch.Tensor
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `q` | `torch.Tensor` | Yes | - | Query tensor, 3D, layout `[T, N, D]` (`T` is total token count across all sequences) |
| `k` | `torch.Tensor` | Yes | - | Key tensor, 3D, layout `[T, N, D]` |
| `v` | `torch.Tensor` | Yes | - | Value tensor, 3D, layout `[T, N, D]` |
| `cu_seqlens_q` | `list[torch.Tensor]` | Yes | - | Cumulative sequence lengths for query, shape `(batch_size + 1,)`, dtype `torch.int32` |
| `cu_seqlens_k` | `list[torch.Tensor]` | Yes | - | Cumulative sequence lengths for key, shape `(batch_size + 1,)`, dtype `torch.int32` |
| `max_seqlen_q` | `int` | No | `None` | Reserved parameter |
| `max_seqlen_k` | `int` | No | `None` | Reserved parameter |
| `dropout_p` | `float` | No | `0.0` | Dropout probability; currently only `0.0` is supported |
| `softmax_scale` | `float` | No | `None` | Scale factor; defaults to `head_dim ** -0.5` when `None` |
| `causal` | `bool` | No | `False` | Whether to apply causal attention mask |
| `window_size` | `int` | No | `None` | Reserved parameter |
| `softcap` | `float` | No | `None` | Reserved parameter |
| `alibi_slopes` | `torch.Tensor` | No | `None` | Reserved parameter |
| `deterministic` | `bool` | No | `None` | Reserved parameter |
| `return_attn_probs` | `bool` | No | `None` | Reserved parameter |
| `block_table` | `torch.Tensor` | No | `None` | Reserved parameter |
#### Returns

`torch.Tensor`: Attention computation result, shape `(total, nheads, headdim)`, i.e. the same `[T, N, D]` layout as `q`.

#### Example

```python
import torch
from mindiesd import attention_forward_varlen

# Four sequences of 2048 tokens each, packed into T = 8192 rows.
q = torch.randn(8192, 24, 128, device="npu", dtype=torch.float16)
k = torch.randn(8192, 24, 128, device="npu", dtype=torch.float16)
v = torch.randn(8192, 24, 128, device="npu", dtype=torch.float16)
cu_seqlens_q = torch.tensor([0, 2048, 4096, 6144, 8192], dtype=torch.int32, device="npu")
cu_seqlens_k = torch.tensor([0, 2048, 4096, 6144, 8192], dtype=torch.int32, device="npu")
out = attention_forward_varlen(q, k, v, cu_seqlens_q, cu_seqlens_k, causal=False)
```
#### Migration Guide

- When migrating from `flash_attn.flash_attn_varlen_func`, the interface parameters are largely compatible and can be replaced directly.
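
The cumulative sequence lengths are a running sum of the per-sequence token counts with a leading zero. The sketch below derives them in plain Python; the helper name `build_cu_seqlens` is hypothetical and not part of `mindiesd`.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return [0, l0, l0 + l1, ...], one entry more than the batch size."""
    return [0] + list(accumulate(seq_lens))


seq_lens = [2048, 2048, 2048, 2048]
cu_seqlens = build_cu_seqlens(seq_lens)
total_tokens = cu_seqlens[-1]  # T, the first dimension of q, k and v

# Sequence i occupies rows cu_seqlens[i]:cu_seqlens[i + 1] of the packed tensor.
second_sequence_rows = (cu_seqlens[1], cu_seqlens[2])

# On device the list would be converted before the call, for example:
# torch.tensor(cu_seqlens, dtype=torch.int32, device="npu")
```

With equal-length sequences this reproduces the boundaries used in the example for this interface; unequal lengths work the same way.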
### sparse_attention

Sparse attention forward computation interface, supporting two sparse strategies: RainFusion (`rf_v2`) and Adaptive Block Sparse Attention (`ada_bsa`).

```python
from mindiesd import sparse_attention
```

#### Function Signature

```python
sparse_attention(
    q, k, v,
    attn_mask=None,
    scale=None,
    is_causal=False,
    head_num=1,
    input_layout="BNSD",
    inner_precise=0,
    sparse_type=None,
    txt_len=0,
    block_size=128,
    latent_shape_q=None,
    latent_shape_k=None,
    keep_sink=True,
    keep_recent=True,
    cdf_threshold=1.0,
    sparsity=0.0,
    **kwargs
) -> torch.Tensor
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `q` | `torch.Tensor` | Yes | - | Query tensor, 4D, layout determined by `input_layout` |
| `k` | `torch.Tensor` | Yes | - | Key tensor, 4D, layout determined by `input_layout` |
| `v` | `torch.Tensor` | Yes | - | Value tensor, 4D, layout determined by `input_layout` |
| `attn_mask` | `torch.Tensor` | No | `None` | Attention mask, reserved parameter |
| `scale` | `float` | No | `None` | Scale factor; defaults to `head_dim ** -0.5` when `None` |
| `is_causal` | `bool` | No | `False` | Whether to apply causal attention mask |
| `head_num` | `int` | No | `1` | Number of attention heads |
| `input_layout` | `str` | No | `"BNSD"` | Tensor layout, supports `"BNSD"` or `"BSND"` |
| `inner_precise` | `int` | No | `0` | Compute precision mode: `0` for high precision, `1` for high performance |
| `sparse_type` | `str` | No | `None` | Sparse type: `None`, `"rf_v2"`, or `"ada_bsa"` |
| `txt_len` | `int` | No | `0` | Text sequence length, only effective when `sparse_type="rf_v2"` |
| `block_size` | `int` | No | `128` | Block size; currently only `128` is supported |
| `latent_shape_q` | `list` | No | `None` | Latent shape for query `[t, h, w]`, `t*h*w = qseqlen`, only effective when `sparse_type="rf_v2"` |
| `latent_shape_k` | `list` | No | `None` | Latent shape for key `[t, h, w]`, `t*h*w = kseqlen`, only effective when `sparse_type="rf_v2"` |
| `keep_sink` | `bool` | No | `True` | Whether to retain sink tokens, only effective when `sparse_type="ada_bsa"` |
| `keep_recent` | `bool` | No | `True` | Whether to retain recent tokens, only effective when `sparse_type="ada_bsa"` |
| `cdf_threshold` | `float` | No | `1.0` | CDF threshold, only effective when `sparse_type="ada_bsa"` |
| `sparsity` | `float` | No | `0.0` | Sparsity ratio, range `[0, 1]`; `0` disables the sparse algorithm |
#### Returns

`torch.Tensor`: Attention computation result, same layout as input.

#### Example

```python
import torch
from mindiesd import sparse_attention

# Layout is [B, N, S, D], matching input_layout="BNSD".
q = torch.randn(2, 24, 4096, 128, device="npu", dtype=torch.float16)
k = torch.randn(2, 24, 4096, 128, device="npu", dtype=torch.float16)
v = torch.randn(2, 24, 4096, 128, device="npu", dtype=torch.float16)
out = sparse_attention(
    q, k, v,
    head_num=24,
    input_layout="BNSD",
    sparse_type="ada_bsa",
    sparsity=0.5
)
```
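
For the `rf_v2` strategy, the parameter table requires the latent shape `[t, h, w]` to multiply out to the sequence length. A small pre-flight check can catch mismatches before the operator is called. This is a sketch only: `check_latent_shape` is a hypothetical helper, not part of `mindiesd`, and it encodes nothing beyond the `t*h*w = seqlen` rule stated above.

```python
import math


def check_latent_shape(latent_shape, seq_len, name="latent_shape"):
    """Validate that [t, h, w] has three entries whose product is seq_len."""
    if len(latent_shape) != 3:
        raise ValueError(f"{name} must be [t, h, w], got {latent_shape}")
    product = math.prod(latent_shape)
    if product != seq_len:
        raise ValueError(
            f"{name} product {product} does not match sequence length {seq_len}"
        )
    return True


seq_len = 4096
latent_shape_q = [16, 16, 16]  # 16 * 16 * 16 == 4096
latent_shape_k = [16, 16, 16]
ok = (
    check_latent_shape(latent_shape_q, seq_len, "latent_shape_q")
    and check_latent_shape(latent_shape_k, seq_len, "latent_shape_k")
)
```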
## Fused Operator Series

### rotary_position_embedding

Rotary Position Embedding (RoPE) fused operator, injecting positional information into query and key tensors through rotation matrices.

```python
from mindiesd import rotary_position_embedding
```

#### Function Signature

```python
rotary_position_embedding(
    x, cos, sin,
    rotated_mode="rotated_half",
    head_first=False,
    fused=True
) -> torch.Tensor
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `x` | `torch.Tensor` | Yes | - | Query or key tensor, 4D, supports layouts `[B,N,S,D]`, `[B,S,N,D]`, `[S,B,N,D]` |
| `cos` | `torch.Tensor` | Yes | - | Precomputed cosine frequency tensor, 2D `[S,D]` or 4D `[1,1,S,D]` / `[1,S,1,D]` / `[S,1,1,D]` |
| `sin` | `torch.Tensor` | Yes | - | Precomputed sine frequency tensor, same dimensions as `cos` |
| `rotated_mode` | `str` | No | `"rotated_half"` | Rotation mode: `"rotated_half"` for half rotation, `"rotated_interleaved"` for interleaved rotation |
| `head_first` | `bool` | No | `False` | Whether the head dimension precedes the sequence dimension |
| `fused` | `bool` | No | `True` | Whether to use fused operators |
#### Returns

`torch.Tensor`: Tensor with rotary position embeddings applied, same shape as input `x`.

#### Example

```python
import torch
from mindiesd import rotary_position_embedding

# x is [B, S, N, D]; cos and sin use the broadcastable [1, S, 1, D] form.
x = torch.randn(2, 4096, 24, 128, device="npu", dtype=torch.float16)
cos = torch.randn(1, 4096, 1, 128, device="npu", dtype=torch.float16)
sin = torch.randn(1, 4096, 1, 128, device="npu", dtype=torch.float16)
out = rotary_position_embedding(x, cos, sin, rotated_mode="rotated_half", head_first=False, fused=True)
```
#### Rotation Mode Description

- `rotated_half`: Suitable for models such as OpenSoraPlan and Stable Audio. Splits `x` into front and back halves for rotation.
- `rotated_interleaved`: Suitable for models such as HunyuanDiT, OpenSora, Flux, and CogVideox. Rotates `x` by interleaving adjacent elements.
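
The two rotation modes can be illustrated on a single head vector. The sketch below follows the conventional RoPE formulation `x * cos + rotate(x) * sin`, which is an assumption here: this document does not spell out the fused operator's exact arithmetic, and the function names are hypothetical.

```python
def rotate_half(x):
    """Split x into front and back halves and return [-back, front]."""
    half = len(x) // 2
    return [-value for value in x[half:]] + list(x[:half])


def rotate_interleaved(x):
    """Rotate each adjacent pair (x0, x1) into (-x1, x0)."""
    out = []
    for i in range(0, len(x), 2):
        out.extend([-x[i + 1], x[i]])
    return out


def apply_rope(x, cos, sin, rotated_mode="rotated_half"):
    """Apply x * cos + rotate(x) * sin element-wise along the head dimension."""
    if rotated_mode == "rotated_half":
        rotated = rotate_half(x)
    elif rotated_mode == "rotated_interleaved":
        rotated = rotate_interleaved(x)
    else:
        raise ValueError(f"unknown rotated_mode: {rotated_mode}")
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


x = [1.0, 2.0, 3.0, 4.0]
half_rotated = rotate_half(x)
pair_rotated = rotate_interleaved(x)
```

The two modes pair up different elements of the head dimension, so `cos` and `sin` must be precomputed with the ordering that matches the chosen mode.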
### RMSNorm

RMS normalization fused operator, equivalent to T5LayerNorm. It avoids explicit mean computation and focuses on the root mean square of the input tensor.

```python
from mindiesd import RMSNorm
```

#### Class Signature

```python
RMSNorm(hidden_size, eps=1e-6)
```
#### Constructor Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `hidden_size` | `int` | Yes | - | Hidden dimension size |
| `eps` | `float` | No | `1e-6` | Numerical stability parameter |

#### forward Method

```python
forward(hidden_states, if_fused=True) -> torch.Tensor
```

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `hidden_states` | `torch.Tensor` | Yes | - | Input tensor with 2 to 8 dimensions |
| `if_fused` | `bool` | No | `True` | Whether to use NPU fused operators |
#### Example

```python
import torch
from mindiesd import RMSNorm

norm = RMSNorm(1024, eps=1e-6)
x = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)
out = norm(x)
```
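
For reference, the T5LayerNorm-style computation divides each vector by the root mean square of its elements, with no mean subtraction, then applies a learned per-channel weight. The following plain-Python sketch shows this for one hidden vector; `rms_norm` is a hypothetical name, and the explicit `weight` argument is an assumption standing in for the module's learned parameter.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then by a per-channel weight."""
    mean_square = sum(value * value for value in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * value * inv_rms for value, w in zip(x, weight)]


row = [3.0, 4.0]
out = rms_norm(row, weight=[1.0, 1.0])

# With unit weights the output has a mean square of approximately 1.
out_mean_square = sum(value * value for value in out) / len(out)
```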
### fast_layernorm

High-performance LayerNorm fused operator, supporting multiple compute precision modes.

```python
from mindiesd import fast_layernorm
```

#### Function Signature

```python
fast_layernorm(
    norm, x,
    impl_mode=0,
    fused=True
) -> torch.Tensor
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `norm` | `torch.nn.LayerNorm` | Yes | - | PyTorch LayerNorm instance |
| `x` | `torch.Tensor` | Yes | - | Input tensor, 3D, layout `[B,S,H]` |
| `impl_mode` | `int` | No | `0` | Compute mode: `0` high precision, `1` high performance, `2` float16 mode (only available when all inputs are float16) |
| `fused` | `bool` | No | `True` | Whether to use fused operators; falls back to standard `torch.nn.LayerNorm` when `False` |
#### Returns

`torch.Tensor`: LayerNorm computation result, same shape as input `x`.

#### Example

```python
import torch
import torch.nn as nn
from mindiesd import fast_layernorm

norm = nn.LayerNorm(1024, eps=1e-5)
x = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)
out = fast_layernorm(norm, x, impl_mode=0, fused=True)
```
### layernorm_scale_shift

Adaptive LayerNorm (AdaLayerNorm) fused operator, adding adaptive scaling and shifting on top of LayerNorm.

Computation formula: `out = layernorm(x) * (1 + scale) + shift`

```python
from mindiesd import layernorm_scale_shift
```

#### Function Signature

```python
layernorm_scale_shift(
    layernorm, x, scale, shift,
    fused=True
) -> torch.Tensor
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `layernorm` | `torch.nn.LayerNorm` | Yes | - | PyTorch LayerNorm instance |
| `x` | `torch.Tensor` | Yes | - | Input tensor, 3D, layout `[B,S,H]` |
| `scale` | `torch.Tensor` | Yes | - | Adaptive scaling parameter, 2D `[B,H]` or 3D `[B,1,H]` |
| `shift` | `torch.Tensor` | Yes | - | Adaptive shifting parameter, 2D `[B,H]` or 3D `[B,1,H]` |
| `fused` | `bool` | No | `True` | Whether to use fused operators |
#### Returns

`torch.Tensor`: AdaLayerNorm computation result, same shape as input `x`.

#### Example

```python
import torch
import torch.nn as nn
from mindiesd import layernorm_scale_shift

norm = nn.LayerNorm(1024, eps=1e-5)
x = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)
# scale and shift are [B, H]; one vector per batch element is applied to every token.
scale = torch.randn(2, 1024, device="npu", dtype=torch.float16)
shift = torch.randn(2, 1024, device="npu", dtype=torch.float16)
out = layernorm_scale_shift(norm, x, scale, shift, fused=True)
```
#### Constraints

- The last dimension of `x` must equal the last dimensions of `scale` and `shift`.
- If `scale` or `shift` is a 3D tensor, the second dimension must be 1.
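
The formula `out = layernorm(x) * (1 + scale) + shift` can be checked on a single hidden vector. The sketch below is plain Python with hypothetical function names; it assumes a LayerNorm without learned affine parameters to keep the arithmetic visible, whereas the real operator takes a `torch.nn.LayerNorm` instance.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize x to zero mean and unit variance (no affine weight or bias)."""
    mean = sum(x) / len(x)
    variance = sum((value - mean) ** 2 for value in x) / len(x)
    return [(value - mean) / math.sqrt(variance + eps) for value in x]


def layernorm_scale_shift_ref(x, scale, shift, eps=1e-5):
    """Reference for out = layernorm(x) * (1 + scale) + shift on one vector."""
    normed = layer_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)

# Zero scale and zero shift reduce to plain LayerNorm.
identity = layernorm_scale_shift_ref(x, [0.0] * 4, [0.0] * 4)

# scale = 1 doubles the normalized values, shift = 0.5 offsets them.
modulated = layernorm_scale_shift_ref(x, [1.0] * 4, [0.5] * 4)
```

The `1 + scale` form means a zero `scale` leaves the normalized activations unchanged, which is why the identity case above matches plain LayerNorm.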
### get_activation_layer

Get an activation function instance by name. Some activation functions provide NPU-accelerated variants.

```python
from mindiesd import get_activation_layer
```

#### Function Signature

```python
get_activation_layer(act_type: str) -> nn.Module
```
#### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `act_type` | `str` | Yes | - | Activation function name, case-insensitive |

#### Supported Activation Functions

| Name | Implementation | Description |
|---|---|---|
| `"swish"` | `nn.SiLU` | Swish activation function |
| `"silu"` | `nn.SiLU` | SiLU activation function (equivalent to swish) |
| `"mish"` | `nn.Mish` | Mish activation function |
| `"gelu"` | `GELU` | Standard GELU |
| `"relu"` | `nn.ReLU` | ReLU activation function |
| `"gelu-tanh"` | `GELU(approximate="tanh")` | tanh-approximated GELU |
| `"gelu-fast"` | `GELU(approximate="fast")` | Fast GELU, accelerated by the NPU `npu_fast_gelu` operator |
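#### Returns

`nn.Module`: Instance of the requested activation function.

#### Example

```python
import torch
from mindiesd import get_activation_layer

act = get_activation_layer("gelu-fast")
hidden_states = torch.randn(2, 4096, 1024, device="npu", dtype=torch.float16)
out = act(hidden_states)
```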
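
To show how the named variants differ numerically, the sketch below evaluates the standard scalar formulas for SiLU, exact GELU, and tanh-approximated GELU in plain Python, with a case-insensitive lookup mirroring the `act_type` behavior. The names `REFERENCE_ACTIVATIONS` and `reference_activation` are hypothetical, and `"gelu-fast"` is omitted because this document does not give the formula used by `npu_fast_gelu`.

```python
import math


def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x):
    """Exact GELU: x * Phi(x), using the Gaussian CDF via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


REFERENCE_ACTIVATIONS = {
    "swish": silu,
    "silu": silu,
    "gelu": gelu,
    "gelu-tanh": gelu_tanh,
}


def reference_activation(act_type):
    """Case-insensitive lookup of a scalar reference activation."""
    return REFERENCE_ACTIVATIONS[act_type.lower()]


exact = reference_activation("GELU")(1.0)
approx = reference_activation("gelu-tanh")(1.0)
gap = abs(exact - approx)  # small, but not zero
```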