RFC: DeepSeek-V4 Model Adaptation Support (Flash/Pro)
Metadata
| Item | Content |
|---|---|
| Status | Approved |
| Author | — |
| Creation Date | 2026-06-03 |
| Related Links | DeepSeek-V4 Adaptation Design |
1. Overview
This proposal aims to add TensorCast support for DeepSeek-V4 (Flash/Pro) model, covering Head Compression (HC), layered KV compression, shared KV sparse attention, Lightning Indexer, and Hash Routing MoE, enabling proper model compilation and performance simulation.
2. Motivation
DeepSeek-V4 is the latest version of the DeepSeek series, introducing several key architectural upgrades compared to V3/V3.2:
- Head Compression: Token-level information compression and restoration via Sinkhorn-based mixing, improving model expressiveness and inference efficiency.
- Layered KV Compression: Per-layer configurable compression strategies (sliding window / indexed compression / heavy compression), balancing KV cache size and attention coverage.
- Shared KV Attention: Uses single wkv projection to generate shared KV vectors, paired with grouped O projection to reduce computational complexity.
- Lightning Indexer: Learned sparse attention selection, selecting critical KVs for attention computation in ratio=4 layers.
- Hash Routing MoE: Token-id based hash routing instead of softmax top-k routing, achieving deterministic and efficient expert routing.
Without V4 adaptation, TensorCast cannot load and simulate this model, affecting the ability to evaluate performance of the latest DeepSeek architecture.
3. Goals
3.1 Goals
- Support DeepSeek-V4 model loading and compilation
- Support Head Compression (HC) performance modeling
- Support layered KV compression strategies (ratio=0/4/128)
- Support shared KV sparse attention
- Support Lightning Indexer (ratio=4)
- Support Hash Routing MoE
- Support Clamped SwiGLU expert activation
3.2 Non-Goals
- No support for parallel combination testing with other models
- No support for KV cache block and indexer joint optimization
- No support for CP (Context Parallel) combined with V4 sparse attention
4. Use Case Analysis
4.1 DeepSeek-V4 Model Inference Simulation
Scenario: Simulate DeepSeek-V4-Flash/Pro model performance under different parallel configurations.
Features:
- Model loading:
AutoConfig/AutoModelregistration - HC pre/post wrapping: Each decoder layer attention and FFN before/after
- Layered attention: Different attention paths for ratio=0/4/128
- MoE routing: Hash routing and non-hash routing
Performance Indicators:
- Prefill stage error < 30%
- Decode stage error < 20%
DFX Requirements:
- Compatibility: Support coexistence with V3.2 model profile
- Maintainability: V4-specific logic concentrated in isolated files
- Testability: Provide unit tests and integration tests coverage
4.2 V4-Specific Operator Performance Evaluation
Scenario: Evaluate computational and memory overhead of V4 new semantic operators.
Features:
- HC operators: Sinkhorn iterations, weighted reduction
- Compressor: Layered KV compression cost
- Lightning Indexer: Learned sparse selection cost
- Sparse Attention: Shared KV attention cost
5. Design
5.1 Overall Design
V4 adaptation is divided into five layers:
- Model Registration Layer:
deepseek_v4model profile registration - HC Semantic Operator Layer: 4 semantic operators for Head Compression
- KV Compression Operator Layer: compressor and scatter_nd_update_mla
- Sparse Attention Operator Layer: quant_lightning_indexer and sparse_attn_sharedkv
- MoE Routing Operator Layer: moe_gating_top_k/hash and v4_clamped_swiglu
5.2 Key Design Decisions
5.2.1 HC Semantic Operator Merge Strategy
hc_pre_sinkhorn merges Sinkhorn iterations and weighted reduction into one semantic operator, ensuring the cost model accounts for them together. Separating them would cause:
- Cost model unable to correlate Sinkhorn iterations and subsequent reduction dependency
- Performance estimation may underestimate combined kernel fusion benefits
5.2.2 KV Write Functional Handle
scatter_nd_update_mla returns a functional handle instead of a data handle, ensuring torch.compile cannot perform DCE on the upstream KV producer chain. Design pattern is consistent with V3.2's mlapo_quant returning cache handle.
5.2.3 Dynamic Topk Width
quant_lightning_indexer output width uses min(topk_limit, active_seq_len) instead of fixed topk_limit:
- Avoids decode stage topk exceeding actual available compressed sequence length
- Ensures performance estimation aligns with actual runtime behavior
5.2.4 O Projection Independent TP Group
V4's O projection uses independent o_proj_tp_group, separable from attention's tp_group:
- Needs separate registration of
wo_aando_projin TP plan o_proj_tp_group.world_sizemust be ≤o_groups
5.3 Technology Selection
Selected Approach: New independent V4 semantic operators and performance models
Alternative: Reuse V3.2 operators with parameter control
Alternative not selected because:
- V4 HC mechanism is completely orthogonal to V3.2, cannot be parameterized
- V4 compressor and indexer semantics differ significantly from V3.2 DSA indexer
- Layered compression requires per-layer configuration support
5.4 Security, Privacy, and DFX Design
Compatibility:
deepseek_v4as independent model_type, does not affect V3/V3.2 paths- V4 Config subclasses
DeepseekV3Config, reuses common fields
Maintainability:
- V4-specific logic concentrated in
deepseek_v4.py - Performance models concentrated in
builtin_model/deepseek_v4.py - Uses
_tensor_cast_patched/getattrmarkers to prevent duplicate patches
Testability:
- Unit tests cover each semantic operator
- Integration tests cover end-to-end inference flow
5.5 Programming and Integration Design
5.5.1 Model Loading Interface
from tensor_cast.transformers.builtin_model.deepseek_v4 import DeepseekV4Config, DeepseekV4Model
# Load via HF AutoModel
from transformers import AutoModel, AutoConfig
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")
model = AutoModel.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", config=config)
5.5.2 V4-Specific Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
compress_ratios |
List[int] | Required | Per-layer KV compression strategy |
layer_types |
List[str] | Optional | Per-layer attention type |
topk_limit |
int | Optional | Lightning indexer top-k limit |
num_hash_layers |
int | 0 | Hash routing MoE layer count |
hc_mult |
int | 4 | HC expansion factor |
hc_sinkhorn_iters |
int | 20 | Sinkhorn iteration count |
o_groups |
int | 1 | O projection group count |
o_lora_rank |
int | Optional | O projection lora rank |
score_func |
str | "sqrtsoftplus" | Routing score function |
swiglu_limit |
float | Optional | Clamped SwiGLU limit value |
5.5.3 Semantic Operator Interfaces
hc_pre_sinkhorn
torch.ops.tensor_cast.hc_pre_sinkhorn(
x, # HC mix tensor [B,S,mix_hc]
hidden_states, # Original HC-expanded [B,S,Hc,D]
hc_scale, # Sinkhorn scale [3, Hc]
hc_base, # Sinkhorn base [3, Hc]
hc_mult, # HC expansion factor
sinkhorn_iters, # Sinkhorn iterations
hc_eps, # Epsilon for sinkhorn
) -> (reduced, post, comb)
quant_lightning_indexer
torch.ops.tensor_cast.quant_lightning_indexer(
q_states, # RoPE-processed q [B,S,H,D]
weights, # weights_proj output [B,S,H]
indexer_cache, # Indexer KV cache
topk_limit, # Top-k limit
tp_world_size, # TP world size
seq_lens, # Per-request seq lengths
query_lens, # Per-request query lengths
) -> topk_indices
6. Test Design
6.1 Unit Tests
| Test Case | Verification Points |
|---|---|
test_v4_config_parsing |
compress_ratios validation, ratio range check |
test_hc_semantic_ops |
hc_pre_*/hc_post/hc_head appear in trace |
test_v4_attention_wrapper |
Different paths for ratio=0/4/128 |
test_lightning_indexer |
Dynamic topk width, indexer_cache layout |
test_moe_routing |
Hash routing and non-hash routing distinction |
test_clamped_swiglu |
v4_clamped_swiglu vs standard SiLU behavior |
6.2 Integration Tests
python -m pytest tests/test_tensor_cast/test_deepseek_v4.py -v
python -m pytest tests/test_tensor_cast/test_mla.py -v
python -m pytest tests/test_tensor_cast/test_moe_layer.py -v
6.3 End-to-End Verification
Use DeepSeek-V4 text generation compile command for end-to-end verification:
- Model construction succeeds, HF
DeepseekV4Configcan be recognized by TensorCast - Compilation process no longer has HC-related attribute missing errors
- HC semantic operators appear in trace
quant_lightning_indexerappears in ratio=4 layers- Existing model tests are unaffected
7. Drawbacks and Risks
7.1 Potential Risks
- Breaking Change: V4's O projection uses independent TP group, may affect existing parallel configurations
- Performance Regression: HC operators may introduce additional overhead at low hc_mult values
- Complexity Increase: 11 new semantic operators and corresponding performance models
7.2 Mitigation Measures
- Clearly declare
o_proj_tp_groupconstraints in ModelProfile, provide clear error messages - Verify HC operator overhead at various hc_mult configurations through performance model
- Unit tests cover all new operators, ensuring cost model accuracy
8. Existing Technology Reference
- DeepSeek-V3.2 Adaptation: GLM5 design doc contains detailed design of V3.2 DSA indexer; this proposal extends for V4-specific HC and layered compression
- DeepSeek Official Implementation:
deepseek-ai/DeepSeek-V4-Flash/inference/model.pyas reference implementation
9. Unresolved Questions
- KV Cache Block Joint Optimization: Currently indexer returns topk_indices, but mechanism for loading KV cache blocks according to indexer results is not yet implemented
- CP + V4 Sparse Attention: Context Parallel combined with V4 layered sparse attention is not yet supported
Appendix
References
Glossary
| Term | Description |
|---|---|
| HC | Head Compression, token-level compression via Sinkhorn mixing |
| Compressor | V4 KV compressor, generates coarse-grained KV stream in layers |
| Lightning Indexer | V4 learned sparse attention selection |
| Hash Routing | Token-id based MoE routing, replaces softmax top-k |
| Clamped SwiGLU | SwiGLU activation with gate clamping |
| Ratio | KV compression ratio, 0=sliding window, 4=indexed compression, 128=heavy compression |