RFC: DeepSeek-V4 Model Adaptation Support (Flash/Pro)

Metadata

Item	Content
Status	Approved
Author	—
Creation Date	2026-06-03
Related Links	DeepSeek-V4 Adaptation Design

1. Overview

This proposal adds TensorCast support for the DeepSeek-V4 (Flash/Pro) models, covering Head Compression (HC), layered KV compression, shared-KV sparse attention, the Lightning Indexer, and Hash Routing MoE, so that the models can be compiled and their performance simulated.

2. Motivation

DeepSeek-V4 is the latest version of the DeepSeek series, introducing several key architectural upgrades compared to V3/V3.2:

Head Compression: Token-level information compression and restoration via Sinkhorn-based mixing, improving model expressiveness and inference efficiency.
Layered KV Compression: Per-layer configurable compression strategies (sliding window / indexed compression / heavy compression), balancing KV cache size and attention coverage.
Shared KV Attention: Uses a single wkv projection to generate shared KV vectors, paired with a grouped O projection to reduce computational complexity.
Lightning Indexer: Learned sparse attention selection that picks the critical KV entries for attention computation in ratio=4 layers.
Hash Routing MoE: Token-id based hash routing instead of softmax top-k routing, achieving deterministic and efficient expert routing.

Without V4 adaptation, TensorCast cannot load or simulate this model, so the performance of the latest DeepSeek architecture cannot be evaluated.

3. Goals

3.1 Goals

Support DeepSeek-V4 model loading and compilation
Support Head Compression (HC) performance modeling
Support layered KV compression strategies (ratio=0/4/128)
Support shared KV sparse attention
Support Lightning Indexer (ratio=4)
Support Hash Routing MoE
Support Clamped SwiGLU expert activation

3.2 Non-Goals

No support for parallel combination testing with other models
No support for KV cache block and indexer joint optimization
No support for CP (Context Parallel) combined with V4 sparse attention

4. Use Case Analysis

4.1 DeepSeek-V4 Model Inference Simulation

Scenario: Simulate DeepSeek-V4-Flash/Pro model performance under different parallel configurations.

Features:

Model loading: AutoConfig / AutoModel registration
HC pre/post wrapping: Applied before and after the attention and FFN blocks of each decoder layer
Layered attention: Different attention paths for ratio=0/4/128
MoE routing: Hash routing and non-hash routing

Performance Indicators:

Prefill-stage performance estimation error < 30%
Decode-stage performance estimation error < 20%

DFX Requirements:

Compatibility: Support coexistence with V3.2 model profile
Maintainability: V4-specific logic concentrated in isolated files
Testability: Provide unit-test and integration-test coverage

4.2 V4-Specific Operator Performance Evaluation

Scenario: Evaluate computational and memory overhead of V4 new semantic operators.

Features:

HC operators: Sinkhorn iterations, weighted reduction
Compressor: Layered KV compression cost
Lightning Indexer: Learned sparse selection cost
Sparse Attention: Shared KV attention cost

5. Design

5.1 Overall Design

V4 adaptation is divided into five layers:

Model Registration Layer: deepseek_v4 model profile registration
HC Semantic Operator Layer: 4 semantic operators for Head Compression
KV Compression Operator Layer: compressor and scatter_nd_update_mla
Sparse Attention Operator Layer: quant_lightning_indexer and sparse_attn_sharedkv
MoE Routing Operator Layer: moe_gating_top_k/hash and v4_clamped_swiglu
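
As an illustration of the hash-routing half of the MoE routing layer, the sketch below maps each token id to a fixed set of experts with no learned scores involved. The function name hash_route, the multiplicative hash constant, and the choice of consecutive expert slots are all assumptions made for illustration; this document does not specify the actual V4 mapping.

```python
def hash_route(token_ids, num_experts, experts_per_token):
    """Deterministically map each token id to `experts_per_token` distinct experts.

    Hypothetical hash: a multiplicative hash picks a base expert, and the
    remaining slots take the following experts modulo `num_experts`.
    """
    if not 0 < experts_per_token <= num_experts:
        raise ValueError("experts_per_token must be in [1, num_experts]")
    routes = []
    for tid in token_ids:
        base = (tid * 2654435761) % num_experts
        # Consecutive offsets are distinct because experts_per_token <= num_experts.
        routes.append([(base + slot) % num_experts for slot in range(experts_per_token)])
    return routes


if __name__ == "__main__":
    # The same token id always lands on the same experts.
    print(hash_route([7, 7, 42], num_experts=8, experts_per_token=2))
```

Because the route depends only on the token id, the cost model for this path needs no softmax or top-k term, which is the distinction the non-hash path has to account for.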

5.2 Key Design Decisions

5.2.1 HC Semantic Operator Merge Strategy

hc_pre_sinkhorn merges the Sinkhorn iterations and the weighted reduction into one semantic operator so that the cost model accounts for them together. Separating them would have two drawbacks:

The cost model could not capture the dependency between the Sinkhorn iterations and the subsequent reduction
Performance estimates could underestimate the benefit of fusing both steps into one kernel
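
The two steps that the merged operator covers can be sketched in plain Python as follows. This is a generic Sinkhorn normalization (alternating row and column normalization) followed by a weighted sum over HC streams, not the V4 kernel: how hc_scale and hc_base enter, and how the post output is produced, are not described in this document, and the helper names sinkhorn, weighted_reduce, and hc_pre_sketch are hypothetical.

```python
import math


def sinkhorn(logits, iters=20, eps=1e-6):
    """Alternately normalize rows and columns of exp(logits) (square matrix)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        # Row step: each row sums to (almost) 1.
        m = [[v / (sum(row) + eps) for v in row] for row in m]
        # Column step: each column sums to (almost) 1.
        col_sums = [sum(m[i][j] for i in range(n)) + eps for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def weighted_reduce(weights, streams):
    """Collapse Hc streams of width D into one vector: sum_h weights[h] * streams[h]."""
    dim = len(streams[0])
    return [sum(w * s[d] for w, s in zip(weights, streams)) for d in range(dim)]


def hc_pre_sketch(mix_logits, pre_weights, streams, iters=20):
    """Hypothetical merged op: one call covers the iterations and the reduction."""
    comb = sinkhorn(mix_logits, iters)
    reduced = weighted_reduce(pre_weights, streams)
    return reduced, comb
```

Treating both steps as one call is what lets a cost model charge them as a single fused unit instead of two independent kernels.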

5.2.2 KV Write Functional Handle

scatter_nd_update_mla returns a functional handle instead of a data handle, so that torch.compile cannot apply dead code elimination (DCE) to the upstream KV producer chain. The design pattern is consistent with V3.2's mlapo_quant, which returns a cache handle.
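
Why a returned handle protects the producer chain can be shown with a toy reachability-based DCE. This is a simplified model for illustration only, not torch.compile's actual pass, and the node names are hypothetical.

```python
def live_nodes(graph, outputs):
    """Return every node reachable from `outputs` by walking inputs backwards."""
    live, stack = set(), list(outputs)
    while stack:
        node = stack.pop()
        if node not in live:
            live.add(node)
            stack.extend(graph[node])
    return live


# Each entry maps a node to the nodes it consumes.
without_handle = {
    "hidden": [],
    "kv_proj": ["hidden"],
    "compressor": ["kv_proj"],
    "kv_write": ["compressor"],  # side effect only, nothing consumes its result
    "attention": ["hidden"],
}
# Same graph, but attention now consumes the handle returned by kv_write.
with_handle = dict(without_handle, attention=["hidden", "kv_write"])

if __name__ == "__main__":
    print(sorted(live_nodes(without_handle, ["attention"])))
    print(sorted(live_nodes(with_handle, ["attention"])))
```

In the first graph the KV write and everything feeding it are unreachable from the output and would be dropped; once the handle is an input to attention, the whole chain stays live.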

5.2.3 Dynamic Topk Width

quant_lightning_indexer output width uses min(topk_limit, active_seq_len) instead of fixed topk_limit:

Prevents the top-k width from exceeding the compressed sequence length actually available during decode
Keeps the performance estimate aligned with actual runtime behavior
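
A minimal sketch of the width rule follows. The min(topk_limit, active_seq_len) expression comes from this section; the helper compressed_seq_len, which derives the active length by floor-dividing the sequence length by the compression ratio, is an assumption added for the example.

```python
def compressed_seq_len(seq_len, compress_ratio):
    """Assumed helper: number of compressed KV entries for a ratio > 0 layer."""
    if compress_ratio <= 0:
        raise ValueError("compress_ratio must be positive for compressed layers")
    return seq_len // compress_ratio


def indexer_topk_width(topk_limit, active_seq_len):
    """Output width of the indexer: never wider than what is actually cached."""
    if topk_limit <= 0 or active_seq_len < 0:
        raise ValueError("topk_limit must be positive and active_seq_len non-negative")
    return min(topk_limit, active_seq_len)


if __name__ == "__main__":
    # Short decode context: only 250 compressed entries exist, so the width is 250.
    print(indexer_topk_width(512, compressed_seq_len(1000, 4)))
    # Long context: the limit caps the width.
    print(indexer_topk_width(512, compressed_seq_len(16384, 4)))
```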

5.2.4 O Projection Independent TP Group

V4's O projection uses an independent o_proj_tp_group, which can differ from the attention tp_group:

wo_a and o_proj must be registered separately in the TP plan
o_proj_tp_group.world_size must be ≤ o_groups
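
The world-size constraint could be enforced with a check along these lines. The function name and the error wording are hypothetical; only the rule world_size ≤ o_groups is taken from this section.

```python
def validate_o_proj_tp(o_proj_tp_world_size, o_groups):
    """Reject O-projection TP sizes that exceed the number of O groups."""
    if o_proj_tp_world_size < 1:
        raise ValueError("o_proj_tp_group.world_size must be at least 1")
    if o_proj_tp_world_size > o_groups:
        raise ValueError(
            f"o_proj_tp_group.world_size ({o_proj_tp_world_size}) "
            f"must be <= o_groups ({o_groups})"
        )


if __name__ == "__main__":
    validate_o_proj_tp(4, 8)  # accepted
    try:
        validate_o_proj_tp(16, 8)
    except ValueError as err:
        print(err)
```

Failing early with both values in the message makes a misconfigured parallel plan easier to diagnose than a shape error deep inside compilation.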

5.3 Technology Selection

Selected Approach: New independent V4 semantic operators and performance models

Alternative: Reuse V3.2 operators with parameter control

Alternative not selected because:

The V4 HC mechanism is completely orthogonal to V3.2 and cannot be expressed as a parameter of existing operators
V4 compressor and indexer semantics differ significantly from V3.2 DSA indexer
Layered compression requires per-layer configuration support

5.4 Security, Privacy, and DFX Design

Compatibility:

deepseek_v4 is an independent model_type and does not affect the V3/V3.2 paths
The V4 Config subclasses DeepseekV3Config and reuses its common fields

Maintainability:

V4-specific logic concentrated in deepseek_v4.py
Performance models concentrated in builtin_model/deepseek_v4.py
Uses _tensor_cast_patched / getattr markers to prevent duplicate patches

Testability:

Unit tests cover each semantic operator
Integration tests cover end-to-end inference flow

5.5 Programming and Integration Design

5.5.1 Model Loading Interface

from tensor_cast.transformers.builtin_model.deepseek_v4 import DeepseekV4Config, DeepseekV4Model

# Load via HF AutoModel
from transformers import AutoModel, AutoConfig
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")
model = AutoModel.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", config=config)

5.5.2 V4-Specific Configuration Parameters

Parameter	Type	Default	Description
`compress_ratios`	List[int]	Required	Per-layer KV compression strategy
`layer_types`	List[str]	Optional	Per-layer attention type
`topk_limit`	int	Optional	Lightning indexer top-k limit
`num_hash_layers`	int	0	Hash routing MoE layer count
`hc_mult`	int	4	HC expansion factor
`hc_sinkhorn_iters`	int	20	Sinkhorn iteration count
`o_groups`	int	1	O projection group count
`o_lora_rank`	int	Optional	O projection lora rank
`score_func`	str	"sqrtsoftplus"	Routing score function
`swiglu_limit`	float	Optional	Clamped SwiGLU limit value
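
The parameters and defaults in the table can be mirrored in a small standalone dataclass, shown here only to make the defaults and the ratio range check concrete. The class name V4ConfigSketch is hypothetical, the real config is described above as a DeepseekV3Config subclass rather than a dataclass, and restricting ratios to 0, 4, and 128 is inferred from the ratios named elsewhere in this document.

```python
from dataclasses import dataclass
from typing import List, Optional

# Ratios named in this document and the strategy each one selects.
ALLOWED_RATIOS = {0: "sliding window", 4: "indexed compression", 128: "heavy compression"}


@dataclass
class V4ConfigSketch:
    compress_ratios: List[int]                 # required, one entry per layer
    layer_types: Optional[List[str]] = None
    topk_limit: Optional[int] = None
    num_hash_layers: int = 0
    hc_mult: int = 4
    hc_sinkhorn_iters: int = 20
    o_groups: int = 1
    o_lora_rank: Optional[int] = None
    score_func: str = "sqrtsoftplus"
    swiglu_limit: Optional[float] = None

    def __post_init__(self):
        # Assumed range check: reject any ratio outside the three named strategies.
        bad = [r for r in self.compress_ratios if r not in ALLOWED_RATIOS]
        if bad:
            raise ValueError(f"unsupported compress_ratios entries: {bad}")

    def strategy(self, layer_idx):
        """Name of the compression strategy used by the given layer."""
        return ALLOWED_RATIOS[self.compress_ratios[layer_idx]]
```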

5.5.3 Semantic Operator Interfaces

hc_pre_sinkhorn

torch.ops.tensor_cast.hc_pre_sinkhorn(
    x,                    # HC mix tensor [B,S,mix_hc]
    hidden_states,        # Original HC-expanded [B,S,Hc,D]
    hc_scale,             # Sinkhorn scale [3, Hc]
    hc_base,              # Sinkhorn base [3, Hc]
    hc_mult,              # HC expansion factor
    sinkhorn_iters,       # Sinkhorn iterations
    hc_eps,               # Epsilon for sinkhorn
) -> (reduced, post, comb)

quant_lightning_indexer

torch.ops.tensor_cast.quant_lightning_indexer(
    q_states,             # RoPE-processed q [B,S,H,D]
    weights,              # weights_proj output [B,S,H]
    indexer_cache,        # Indexer KV cache
    topk_limit,           # Top-k limit
    tp_world_size,        # TP world size
    seq_lens,             # Per-request seq lengths
    query_lens,           # Per-request query lengths
) -> topk_indices
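
To show what kind of computation produces topk_indices, here is a reference-style sketch for a single query token. The scoring rule (per-head ReLU of the query-key dot product, weighted and summed over heads) is an assumption borrowed from common learned-indexer formulations, quantization and batching are omitted, and the function name is hypothetical; none of this is the operator's actual implementation.

```python
import heapq


def lightning_indexer_topk(q_heads, head_weights, cached_keys, topk_limit):
    """Return indices of the highest-scoring cached keys for one query token.

    q_heads: list of H query vectors; head_weights: list of H scalars;
    cached_keys: list of key vectors from the indexer cache.
    """
    scores = []
    for pos, key in enumerate(cached_keys):
        score = 0.0
        for q, w in zip(q_heads, head_weights):
            dot = sum(a * b for a, b in zip(q, key))
            score += w * max(dot, 0.0)  # assumed weighted-ReLU scoring
        scores.append((score, pos))
    # Output width follows min(topk_limit, number of cached entries).
    width = min(topk_limit, len(cached_keys))
    top = heapq.nlargest(width, scores, key=lambda item: item[0])
    return [pos for _, pos in top]


if __name__ == "__main__":
    keys = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0]]
    print(lightning_indexer_topk([[1.0, 0.0]], [1.0], keys, topk_limit=2))
```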

6. Test Design

6.1 Unit Tests

Test Case	Verification Points
`test_v4_config_parsing`	compress_ratios validation, ratio range check
`test_hc_semantic_ops`	hc_pre_*/hc_post/hc_head appear in trace
`test_v4_attention_wrapper`	Different paths for ratio=0/4/128
`test_lightning_indexer`	Dynamic topk width, indexer_cache layout
`test_moe_routing`	Hash routing and non-hash routing distinction
`test_clamped_swiglu`	v4_clamped_swiglu vs standard SiLU behavior
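
The difference that test_clamped_swiglu is meant to expose can be illustrated on scalars. The clamping rule used here (cap the gate at swiglu_limit before applying SiLU) is an assumption; this document only says that the gate is clamped, and the real v4_clamped_swiglu may clamp differently.

```python
import math


def silu(x):
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu(gate, up):
    """Standard SwiGLU on scalars."""
    return silu(gate) * up


def clamped_swiglu(gate, up, limit=None):
    """Assumed clamped variant: the gate is capped at `limit` before SiLU."""
    if limit is None:
        return swiglu(gate, up)
    return silu(min(gate, limit)) * up


if __name__ == "__main__":
    print(swiglu(10.0, 2.0), clamped_swiglu(10.0, 2.0, limit=7.0))
```

Below the limit the two functions agree, so a test only needs inputs above the limit to tell them apart.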

6.2 Integration Tests

python -m pytest tests/test_tensor_cast/test_deepseek_v4.py -v
python -m pytest tests/test_tensor_cast/test_mla.py -v
python -m pytest tests/test_tensor_cast/test_moe_layer.py -v

6.3 End-to-End Verification

Use the DeepSeek-V4 text generation compile command for end-to-end verification:

Model construction succeeds, and TensorCast recognizes the HF DeepseekV4Config
Compilation no longer fails with missing HC-related attribute errors
HC semantic operators appear in trace
quant_lightning_indexer appears in ratio=4 layers
Existing model tests are unaffected

7. Drawbacks and Risks

7.1 Potential Risks

Breaking Change: V4's O projection uses an independent TP group, which may affect existing parallel configurations
Performance Regression: HC operators may introduce additional overhead at low hc_mult values
Complexity Increase: 11 new semantic operators and their corresponding performance models

7.2 Mitigation Measures

Declare the o_proj_tp_group constraints explicitly in ModelProfile and provide clear error messages
Verify HC operator overhead across hc_mult configurations through the performance model
Cover all new operators with unit tests to check cost model accuracy

8. Existing Technology Reference

DeepSeek-V3.2 Adaptation: The GLM5 design doc contains the detailed design of the V3.2 DSA indexer; this proposal extends it with V4-specific HC and layered compression
DeepSeek Official Implementation: deepseek-ai/DeepSeek-V4-Flash/inference/model.py serves as the reference implementation

9. Unresolved Questions

KV Cache Block Joint Optimization: The indexer currently returns topk_indices, but the mechanism for loading KV cache blocks according to the indexer results is not yet implemented
CP + V4 Sparse Attention: Context Parallel combined with V4 layered sparse attention is not yet supported

Appendix

Glossary

Term	Description
HC	Head Compression, token-level compression via Sinkhorn mixing
Compressor	V4 KV compressor, generates coarse-grained KV stream in layers
Lightning Indexer	V4 learned sparse attention selection
Hash Routing	Token-id based MoE routing, replaces softmax top-k
Clamped SwiGLU	SwiGLU activation with gate clamping
Ratio	KV compression ratio, 0=sliding window, 4=indexed compression, 128=heavy compression