RFC: Pipeline Parallel Simulation Support

Metadata

Item	Content
Status	Designing
Author	Secluded_Ocean
Creation Date	2026-05-11
Updated Date	2026-05-19
Related Links	https://gitcode.com/Ascend/msmodeling/pull/187
Chinese Version	rfc_pipeline_parallel_support_zh.md

1. Problem Statement

TensorCast and the serving throughput optimizer need Pipeline Parallel (PP) configuration support on top of existing TP/DP/EP/MoE estimation. PP partitions one decoder-only LLM replica into multiple pipeline stages by layer. Each stage owns only part of the decoder layers and its stage-local KV cache, and adjacent stages exchange hidden states. If the throughput optimizer keeps estimating the model as a full single-stage graph, it overestimates per-rank weights/cache memory and cannot reflect inter-stage communication, pipeline bubbles, or the tp_size * pp_size * dp_size == world_size constraint in search results.

This RFC adopts a stage-first PP trace design. The model is split into PP stages before runtime tracing, and each stage runs its own stage-local forward, torch.compile/runtime trace, and performance-model estimation. Logical send/recv events are then inserted between adjacent stages, and a pipeline scheduler aggregates latency, bubbles, and breakdowns.

This avoids the previous full-model-first approach. A full-model trace may include cross-layer fusion, compiler graph optimizations, or framework overhead that crosses a future PP boundary. In real PP execution, stage boundaries and send/recv break those optimizations. Splitting or attributing a full-model trace after the fact can therefore overestimate or underestimate stage compute in corner cases.

1.1 Goals

Add --pp-sizes to the throughput optimizer CLI and combine it with --tp-sizes, --ep-sizes, and --moe-dp-sizes to generate valid search spaces.
Expose parallel_config.pp_size in serving YAML and pass it to TensorCast UserInputConfig.
Add pp_group in ParallelGroupManager, sharing the same dimensional layout semantics as TP/DP rank groups.
Build stage-local models/graphs before runtime trace, so each stage only contains its own decoder layers and edge-stage modules.
Run forward, torch.compile, runtime trace, and performance-model estimation independently for each stage, preventing cross-PP-boundary fused ops from polluting estimates.
Insert logical send/recv events between adjacent stages and estimate communication time from hidden-state message size, topology bandwidth, and latency.
Estimate weights memory, KV cache per token, and total KV cache in a stage-aware way when PP is enabled.
Present PP Compute, PP Comm, and PP Bubble separately in serving result breakdowns.
Preserve current behavior, search results, and output format when pp_size=1.

1.2 Non-Goals

The first version does not implement real cross-process distributed execution. send/recv are logical communication events in runtime trace and performance modeling, not real tensor transfers.
The first version does not implement strict event-level 1F1B, interleaved PP, or virtual pipeline stage scheduling.
The first version does not parse _pp_plan automatically and does not support user-declared non-uniform stage partitions.
The first version does not provide full stage-local behavior for VL, MTP, or multimodal models. These models fall back to conservative estimates or skip stage-first tracing on related paths.
The first version does not redefine the combination semantics of MoE/EP rank groups and PP stages. MoE groups continue to use the existing global EP * MOE-TP * MOE-DP == world_size semantics.
The first version does not require profiling databases to already contain stage-local samples. Profiling/empirical PP support can be added through a separate contract later.

2. Design

2.1 Recommended Approach

The recommended approach is a layered design: configuration search, stage-first graph partitioning, per-stage tracing/modeling, logical send/recv, and pipeline scheduling.

flowchart TD
    A[throughput_optimizer CLI or service YAML] --> B[UserInputConfig]
    B --> C[ParallelConfig]
    C --> D[ParallelGroupManager]
    D --> E[tp_group / dp_group / pp_group]
    E --> F[PipelineStagePlan]
    F --> G[Stage-local model/graph builder]
    G --> H[Per-stage forward + torch.compile + runtime trace]
    H --> I[Performance model estimates stage compute]
    F --> J[Logical send/recv at stage boundaries]
    J --> K[Performance model estimates PP comm]
    I --> L[PipelineRuntimeEstimate]
    K --> L
    L --> M[ModelRunnerMetrics]
    M --> N[serving optimizer summary]

The design does not hard-code PP estimation in CLI or serving. CLI and serving only produce candidates and pass pp_size; the TensorCast model and runner handle the stage plan, stage-local graph construction, memory estimation, runtime traces, communication events, and latency aggregation; the serving summary only formats PP breakdowns.

2.1.1 Class Diagram

classDiagram
    class ParallelConfig {
        +int world_size
        +int tensor_parallel_size
        +int data_parallel_size
        +int pipeline_parallel_size
    }

    class ParallelGroupManager {
        +ParallelGroup tp_group
        +ParallelGroup dp_group
        +ParallelGroup pp_group
        +initialize_model_parallel()
    }

    class ParallelGroup {
        +int rank
        +int world_size
        +list rank_group
        +int rank_in_group
    }

    class PipelineStageInfo {
        +int stage_id
        +int num_stages
        +int start_layer
        +int end_layer
        +int total_layers
        +bool enabled
        +bool is_first_stage
        +bool is_last_stage
        +range active_layer_indices
        +owns_layer(layer_idx) bool
    }

    class PipelineStagePlan {
        +list stage_infos
        +list stage_module_specs
        +list send_recv_edges
        +get_stage(stage_id)
    }

    class PipelineStageModule {
        +PipelineStageInfo stage_info
        +forward(input_ids or inputs_embeds, **kwargs)
    }

    class PipelineSendRecvEvent {
        +int src_stage
        +int dst_stage
        +int message_bytes
        +float latency_s
    }

    class PipelineRuntimeEstimate {
        +float latency_s
        +list stage_compute_times_s
        +list stage_comm_times_s
        +list stage_times_s
        +list send_recv_events
    }

    class TransformerModel {
        +ParallelGroupManager parallel_group_manager
        +PipelineStagePlan pipeline_stage_plan
        +build_pipeline_stage_modules()
        +pipeline_stage_weight_sizes
        +pipeline_max_stage_weight_size
    }

    class ModelRunner {
        +run(requests) ModelRunnerMetrics
        -_run_pipeline_stage_first_trace(input_kwargs)
        -_estimate_pipeline_runtime(stage_traces, send_recv_events)
    }

    ParallelConfig --> ParallelGroupManager
    ParallelGroupManager --> ParallelGroup
    ParallelGroupManager --> PipelineStageInfo
    PipelineStagePlan --> PipelineStageInfo
    PipelineStagePlan --> PipelineSendRecvEvent
    TransformerModel --> PipelineStagePlan
    TransformerModel --> PipelineStageModule
    ModelRunner --> PipelineRuntimeEstimate
    PipelineRuntimeEstimate --> PipelineSendRecvEvent

2.1.2 Sequence Diagram

sequenceDiagram
    participant CLI as CLI/Serving Optimizer
    participant UIC as UserInputConfig
    participant TM as TransformerModel
    participant PGM as ParallelGroupManager
    participant PP as pipeline_parallel helpers
    participant MR as ModelRunner
    participant RT as Runtime
    participant PM as Performance Model
    participant OUT as Metrics/Summary

    CLI->>UIC: Set world_size/tp_size/pp_size/dp_size
    UIC->>TM: Build TensorCast model
    TM->>PGM: Initialize TP/DP/PP groups from ParallelConfig
    PGM-->>TM: Return pp_group.rank_in_group/world_size
    TM->>PP: build_pipeline_stage_plan(total_layers, pp_group)
    PP-->>TM: PipelineStagePlan
    MR->>MR: Generate inputs, KV cache, and attention metadata
    loop Each pipeline stage
        MR->>TM: build_pipeline_stage_module(stage_info)
        TM-->>MR: stage-local model/graph
        MR->>PP: build_pipeline_stage_inputs(...)
        PP-->>MR: Stage 0 uses input_ids; later stages use inputs_embeds
        MR->>RT: stage-local forward / torch.compile / runtime trace
        RT-->>MR: stage-local op trace
        MR->>PM: Model stage-local trace
        PM-->>MR: stage_compute_time_s
    end
    MR->>PP: build_pipeline_send_recv_events(...)
    PP-->>MR: logical send/recv events
    MR->>PM: Model send/recv events
    PM-->>MR: stage_comm_times_s
    MR->>PP: estimate_pipeline_runtime_from_stage_times(...)
    PP-->>MR: latency_s, compute, comm, bubble
    MR->>OUT: Write ModelRunnerMetrics.breakdowns
    OUT-->>CLI: Display PP Compute / PP Comm / PP Bubble

2.2 Configuration and Search Space

2.2.1 CLI Argument

cli/inference/throughput_optimizer.py adds --pp-sizes:

python -m cli.inference.throughput_optimizer \
  --input-length 2048 \
  --output-length 512 \
  --num-devices 8 \
  --tp-sizes 1 2 \
  --pp-sizes 1 2 4 \
  Qwen/Qwen3-32B

CLI contract:

Argument State	Behavior
No search-size argument is specified	Backward-compatible behavior: search TP by default and fix `pp_size` to 1.
`--pp-sizes` is not specified	`resolve_search_sizes(None, num_devices, 1)`, so PP is fixed to `[1]`.
`--pp-sizes` is specified with values	Use explicit values; each value must be a positive integer not greater than `num_devices`.
`--pp-sizes` is specified without values	Use `resolve_search_sizes([], num_devices, 1)` to generate powers-of-two candidates.
Search combinations are invalid	If no candidate combination satisfies divisibility constraints, argument parsing exits with an error.

2.2.2 serving YAML

serving_cast.config.ParallelConfig adds pp_size: int = 1. service YAML can declare:

parallel_config:
  world_size: 8
  tp_size: 2
  pp_size: 2
  dp_size: 2

serving_cast.model_runner.ModelRunner.init_tensor_cast_model_runner() passes pp_size when constructing TensorCast UserInputConfig, so serving configuration and TensorCast runtime share the same PP entry point.
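
The constraint behind the YAML block above can be sketched as a small validation step. The class below is a hypothetical stand-in, not the actual serving_cast.config.ParallelConfig: the class name and the validate() method are assumptions made for illustration. Only the four field names and the pp_size default of 1 come from this RFC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfigSketch:
    """Hypothetical stand-in for the serving ParallelConfig (illustration only)."""

    world_size: int
    tp_size: int
    dp_size: int
    pp_size: int = 1  # defaults to 1 so YAML files without pp_size keep working

    def validate(self) -> None:
        """Raise ValueError unless tp_size * pp_size * dp_size == world_size."""
        for name in ("world_size", "tp_size", "pp_size", "dp_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")
        expected = self.tp_size * self.pp_size * self.dp_size
        if expected != self.world_size:
            raise ValueError(
                f"tp_size * pp_size * dp_size = {expected}, "
                f"but world_size = {self.world_size}"
            )


# Mirrors the YAML example: 2 * 2 * 2 == 8, so validation passes.
config = ParallelConfigSketch(world_size=8, tp_size=2, pp_size=2, dp_size=2)
config.validate()
```

Validating at configuration load time, rather than inside the runner, keeps invalid combinations from reaching stage planning at all.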

2.2.3 Search Space Generation

serving_cast.parallel_runner.ParallelRunner._get_user_config() extends the search space from:

tp_sizes x ep_sizes x moe_dp_sizes

to:

tp_sizes x pp_sizes x ep_sizes x moe_dp_sizes

Candidate filtering rules:

Rule	Meaning
`target_devices % (tp * pp) == 0`	TP and PP together determine the number of devices for one complete pipeline replica.
`dp_size = target_devices // (tp * pp)`	DP is the number of complete pipeline replicas.
`target_devices % ep == 0`	Preserve the existing EP divisibility constraint.
`target_devices % (ep * moe_dp) == 0`	Preserve the existing MoE-DP divisibility constraint.
`moe_tp_size = target_devices // (ep * moe_dp)`	MoE-TP is still derived from the existing global formula.
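
The filtering rules above can be sketched as a plain enumeration over the Cartesian product. This is a minimal illustration under stated assumptions: enumerate_candidates and the dictionary keys are hypothetical names, and the real ParallelRunner._get_user_config() may structure its candidates differently.

```python
from itertools import product


def enumerate_candidates(target_devices, tp_sizes, pp_sizes, ep_sizes, moe_dp_sizes):
    """Yield one dict per combination that passes every divisibility rule."""
    for tp, pp, ep, moe_dp in product(tp_sizes, pp_sizes, ep_sizes, moe_dp_sizes):
        # TP and PP together size one complete pipeline replica.
        if target_devices % (tp * pp) != 0:
            continue
        # Existing EP and MoE-DP divisibility constraints are preserved.
        if target_devices % ep != 0:
            continue
        if target_devices % (ep * moe_dp) != 0:
            continue
        yield {
            "tp_size": tp,
            "pp_size": pp,
            "dp_size": target_devices // (tp * pp),
            "ep_size": ep,
            "moe_dp_size": moe_dp,
            "moe_tp_size": target_devices // (ep * moe_dp),
        }


# Same sizes as the CLI example: 8 devices, TP in {1, 2}, PP in {1, 2, 4}.
candidates = list(enumerate_candidates(8, [1, 2], [1, 2, 4], [1], [1]))
for candidate in candidates:
    print(candidate)
```

With these inputs every tp * pp product divides 8, so all six combinations survive; a value such as tp=3 would be dropped by the first rule.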

2.3 Rank Group Semantics

ParallelGroupManager.initialize_model_parallel() reshapes global ranks with the following dimensions:

[-1, data_parallel_size, pipeline_parallel_size, expert_parallel_size, tensor_parallel_size]

Group expansion semantics:

Group	Dimensional Meaning	Behavior
`tp_group`	Tensor parallel ranks within the same stage	Passes `pipeline_parallel_size`, so TP groups are generated locally inside the PP dimension.
`dp_group`	Data parallel ranks across complete pipeline replicas	Passes `pipeline_parallel_size`, so DP groups are separated from the PP dimension.
`pp_group`	Cross-stage ranks at the same TP/DP coordinate	Adds `ParallelGroupType.PIPELINE_PARALLEL` and generates rank groups along the PP dimension.
`ep_group` / `moe_tp_group` / `moe_dp_group`	Existing MoE groups	Temporarily pass `pipeline_parallel_size=1`, preserving global MoE semantics without introducing stage-local MoE combinations.

Example: with world_size=8, tp_size=2, pp_size=4, dp_size=1, ranks are organized by [DP, PP, TP]. For rank=4, pp_group.rank_group == [0, 2, 4, 6] and rank_in_group == 2. This means rank 4 is the third pipeline stage (zero-based stage index 2) on the same TP lane as ranks 0, 2, and 6.
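
The rank arithmetic in this example can be checked with a short sketch. It assumes a row-major [DP, PP, TP] layout with TP as the fastest-varying dimension and ignores the expert-parallel dimension; pp_rank_group is a hypothetical helper, not part of ParallelGroupManager.

```python
def pp_rank_group(rank, tp_size, pp_size, dp_size):
    """Return (rank_group, rank_in_group) for the PP group that contains rank.

    Assumes global ranks are laid out row-major as [DP, PP, TP].
    """
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world_size {world_size}")
    tp_idx = rank % tp_size                  # position inside the TP group
    pp_idx = (rank // tp_size) % pp_size     # pipeline stage index
    dp_idx = rank // (tp_size * pp_size)     # pipeline replica index
    # Ranks sharing the same DP and TP coordinates differ only in the PP index.
    base = dp_idx * pp_size * tp_size + tp_idx
    group = [base + stage * tp_size for stage in range(pp_size)]
    return group, pp_idx


group, rank_in_group = pp_rank_group(rank=4, tp_size=2, pp_size=4, dp_size=1)
print(group, rank_in_group)  # [0, 2, 4, 6] 2
```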

2.4 Pipeline Stage Modeling

2.4.1 `PipelineStageInfo`

tensor_cast/pipeline_parallel.py defines an immutable dataclass:

@dataclasses.dataclass(frozen=True)
class PipelineStageInfo:
    stage_id: int
    num_stages: int
    start_layer: int
    end_layer: int
    total_layers: int

Derived properties:

Property or Method	Meaning
`enabled`	True when `num_stages > 1`.
`is_first_stage`	Whether the current stage is the first stage.
`is_last_stage`	Whether the current stage is the last stage.
`active_layer_indices`	The `[start_layer, end_layer)` range.
`owns_layer(layer_idx)`	Whether a layer belongs to the current stage.
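
One way the derived properties could be implemented is sketched below. The five fields come from the dataclass above; the property bodies are assumptions inferred from the table and may differ from the actual tensor_cast/pipeline_parallel.py code.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class PipelineStageInfo:
    stage_id: int
    num_stages: int
    start_layer: int
    end_layer: int
    total_layers: int

    @property
    def enabled(self) -> bool:
        # PP is only active when there is more than one stage.
        return self.num_stages > 1

    @property
    def is_first_stage(self) -> bool:
        return self.stage_id == 0

    @property
    def is_last_stage(self) -> bool:
        return self.stage_id == self.num_stages - 1

    @property
    def active_layer_indices(self) -> range:
        # Half-open range [start_layer, end_layer).
        return range(self.start_layer, self.end_layer)

    def owns_layer(self, layer_idx: int) -> bool:
        return layer_idx in self.active_layer_indices


# Second of four stages in a 64-layer model.
stage = PipelineStageInfo(stage_id=1, num_stages=4, start_layer=16, end_layer=32, total_layers=64)
print(stage.enabled, stage.is_first_stage, stage.owns_layer(16), stage.owns_layer(32))
# True False True False
```

Because the dataclass is frozen, a stage description cannot be mutated after the plan is built, which keeps per-stage traces reproducible.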

2.4.2 Uniform Layer Partitioning

The first version partitions by decoder layer count:

layers_per_stage = ceil(total_layers / num_stages)
start_layer = min(stage_id * layers_per_stage, total_layers)
end_layer = min((stage_id + 1) * layers_per_stage, total_layers)

Characteristics:

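Earlier stages receive the ceil layer count first, and whatever remains goes to the later stages.
Both start_layer and end_layer are clamped with min() to avoid out-of-range indices.
Later stages may own an empty layer range. This happens when num_stages > total_layers, and also when the ceil split exhausts the layers early (for example, total_layers=9 and num_stages=4 yields stage sizes 3, 3, 3, 0). Actual configurations should be evaluated through candidate filtering or tests.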
Partitioning is only by decoder layer. Embedding, norm, and lm_head are non-layer modules and are handled separately by edge-stage rules.
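
The partitioning formula can be exercised directly. The sketch below is a minimal transcription of the three formula lines; uniform_stage_ranges is a hypothetical helper name, and it returns plain (start_layer, end_layer) tuples rather than PipelineStageInfo objects.

```python
import math


def uniform_stage_ranges(total_layers, num_stages):
    """Return a list of half-open (start_layer, end_layer) ranges, one per stage."""
    if total_layers < 0 or num_stages < 1:
        raise ValueError("total_layers must be >= 0 and num_stages must be >= 1")
    layers_per_stage = math.ceil(total_layers / num_stages)
    ranges = []
    for stage_id in range(num_stages):
        # Both bounds are clamped so trailing stages never index past the model.
        start_layer = min(stage_id * layers_per_stage, total_layers)
        end_layer = min((stage_id + 1) * layers_per_stage, total_layers)
        ranges.append((start_layer, end_layer))
    return ranges


print(uniform_stage_ranges(64, 4))  # [(0, 16), (16, 32), (32, 48), (48, 64)]
print(uniform_stage_ranges(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(uniform_stage_ranges(9, 4))   # [(0, 3), (3, 6), (6, 9), (9, 9)]
```

The last call shows an empty final stage even though num_stages is smaller than total_layers, which is the kind of candidate that filtering or tests should catch.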

2.4.3 Stage-First Graph Construction

PP tracing should not run the full model first and then split the trace result. The recommended approach builds a stage-local graph before tracing:

Component	Responsibility
`PipelineStagePlan`	Stores all `PipelineStageInfo` objects, per-stage module ownership, stage boundaries, and send/recv edges.
`build_pipeline_stage_module(stage_info)`	Builds a stage-local module/view from the full model, so the current stage contains only modules it should execute.
`PipelineStageModule`	Exposes the stage-local forward entry point. The first stage receives `input_ids`; later stages receive `inputs_embeds`.
`build_pipeline_stage_inputs()`	Builds runtime trace inputs for each stage; non-first stages receive correctly shaped hidden states.
`restore`/context management	If temporary module views or wrappers are used, the original model state must be restored through context management.

With this strategy, torch.compile and runtime trace see the stage-local graph. Cross-stage fused ops are broken by the stage module boundary and logical send/recv, which is closer to real PP deployment.

2.4.4 Non-Layer Module Ownership

The PP stage-local graph must avoid executing edge-only modules in middle stages:

Module Category	stage 0	Middle Stage	Last Stage
embedding / input ids	Kept	Not included; simulate upstream hidden states with `inputs_embeds`	Not included; simulate upstream hidden states with `inputs_embeds`
decoder layers	Keep only current stage layers	Keep only current stage layers	Keep only current stage layers
norm	Not included	Not included	Kept
lm_head	Not included	Not included	Kept

build_pipeline_stage_inputs() builds empty inputs_embeds for non-first stages:

inputs_embeds.shape = (*token_shape, hidden_size)

token_shape is derived from input_ids or position_ids. This allows middle and last stages to trace from the hidden states entry point without embedding.

2.5 TransformerModel Integration

TransformerModel.__init__() calls the following after ParallelGroupManager initialization:

self.pipeline_stage_plan = build_pipeline_stage_plan(
    self.text_config.num_hidden_layers,
    self.parallel_group_manager.pp_group,
)

New properties and methods:

Interface	Behavior
`pipeline_stage_plan`	Describes every stage's layer range, module ownership, and communication boundaries.
`build_pipeline_stage_modules()`	Builds all stage-local modules or stage-local module views.
`pipeline_stage_weight_size`	Weight memory estimate for the current stage.
`pipeline_stage_weight_sizes`	Weight memory estimates for all stages.
`pipeline_max_stage_weight_size`	Maximum weight memory among all stages; `ModelRunner` uses it as the model weight memory in PP mode.
`get_language_layers()`	Locates the decoder layer list through `custom_model_registry.get_language_layers(model_type)`.

Weight memory estimation rules:

Model or Stage	Estimation
`pp_size=1`	Return full-model weights.
VL or MTP model	Fall back to full-model weights and log a warning.
Unable to locate language layers	Fall back to full-model weights and log a warning.
First stage	`embedding + active_layer_size`.
Middle stage	`active_layer_size`.
Last stage	`active_layer_size + norm + lm_head`.

If the framework cannot precisely split non-layer weights yet, it can temporarily use a conservative estimate:

edge_stage_weight = active_layer_size + non_layer_size

and report pipeline_max_stage_weight_size externally to avoid underestimating peak per-card weight memory.
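
The per-stage rules in the table above can be sketched as follows. This is an illustration under assumptions: stage_weight_sizes and its parameters are hypothetical, sizes are in arbitrary units, and the VL/MTP and missing-layer fallbacks are not modeled.

```python
def stage_weight_sizes(layer_sizes, stage_ranges, embedding_size, norm_size, lm_head_size):
    """Return the estimated weight size of each stage.

    layer_sizes: per-decoder-layer weight sizes.
    stage_ranges: half-open (start_layer, end_layer) ranges, one per stage.
    """
    sizes = []
    last_stage = len(stage_ranges) - 1
    for stage_id, (start, end) in enumerate(stage_ranges):
        size = sum(layer_sizes[start:end])      # active decoder layers only
        if stage_id == 0:
            size += embedding_size              # first stage owns the embedding
        if stage_id == last_stage:
            size += norm_size + lm_head_size    # last stage owns norm and lm_head
        sizes.append(size)
    return sizes


sizes = stage_weight_sizes(
    layer_sizes=[10] * 8,
    stage_ranges=[(0, 2), (2, 4), (4, 6), (6, 8)],
    embedding_size=5,
    norm_size=1,
    lm_head_size=5,
)
print(sizes, max(sizes))  # [25, 20, 20, 26] 26
```

With a single stage both edge conditions apply to stage 0, so the sketch returns the full-model weight size, matching the pp_size=1 row.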

2.6 KV Cache and Indexer Cache Estimation

tensor_cast/core/input_generator.py connects cache estimation to the stage plan:

Function	PP Behavior
`_get_kv_cache_info()`	`kv_cache_by_layers` may keep metadata for all layers, but `kv_cache_per_token` only accumulates active layers for the current stage.
`get_kv_cache_info()`	Same behavior for the generic KV cache path.
`get_dsa_indexer_cache_info()`	DSA indexer cache per token only accumulates active layers for the current stage.

When PP is enabled, ModelRunner.run() traverses all stages in PipelineStagePlan and reports the maximum KV cache size among stages:

kv_cache_size_gb = max(stage_kv_cache_bytes) / 1024^3

The reported value is the largest single-stage KV cache requirement rather than the global sum across all model layers.
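
A minimal sketch of this maximum-over-stages rule is shown below. The helper name and its inputs (per-layer KV bytes per token, stage ranges, token count) are assumptions for illustration; the real code derives these values from the input generator.

```python
def max_stage_kv_cache_gb(kv_bytes_per_token_by_layer, stage_ranges, num_tokens):
    """Return the largest single-stage KV cache size, in units of 1024**3 bytes."""
    if not stage_ranges:
        raise ValueError("stage_ranges must contain at least one stage")
    stage_kv_cache_bytes = [
        # Only layers owned by the stage contribute to its per-token cost.
        sum(kv_bytes_per_token_by_layer[start:end]) * num_tokens
        for start, end in stage_ranges
    ]
    return max(stage_kv_cache_bytes) / 1024**3


# Six layers at 1 KiB per token, split 4 + 2, with 1024**2 cached tokens.
kv_gb = max_stage_kv_cache_gb([1024] * 6, [(0, 4), (4, 6)], num_tokens=1024**2)
print(kv_gb)  # 4.0
```

The full model would need 6.0 in the same units, so reporting the sum instead of the maximum would overstate per-rank memory by 50% in this example.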

2.7 Pipeline Communication and Latency Model

2.7.1 Hidden States Message Size

The first version models logical send/recv of hidden states between adjacent stages:

message_bytes = num_tokens * hidden_size * dtype_size

Implementation entry:

estimate_hidden_states_message_bytes(num_tokens, hidden_size, dtype)

Inputs require num_tokens >= 0 and hidden_size >= 0.
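
As a worked instance of the message-size formula, the sketch below takes the element size in bytes directly instead of a dtype object. hidden_states_message_bytes is a hypothetical simplification and does not reproduce the signature of estimate_hidden_states_message_bytes.

```python
def hidden_states_message_bytes(num_tokens, hidden_size, dtype_size_bytes):
    """Return num_tokens * hidden_size * dtype_size_bytes after validating inputs."""
    if num_tokens < 0 or hidden_size < 0:
        raise ValueError("num_tokens and hidden_size must be non-negative")
    if dtype_size_bytes <= 0:
        raise ValueError("dtype_size_bytes must be positive")
    return num_tokens * hidden_size * dtype_size_bytes


# 2048 tokens, hidden size 5120, 2-byte elements (for example bf16 or fp16).
message_bytes = hidden_states_message_bytes(2048, 5120, 2)
print(message_bytes, message_bytes / 1024**2)  # 20971520 20.0
```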

2.7.2 Send/Recv Communication Events

build_pipeline_send_recv_events() generates logical communication edges between adjacent stages:

stage_i --send(hidden_states)--> stage_i+1
stage_i+1 <--recv(hidden_states)-- stage_i

estimate_pipeline_send_recv_time() uses CommAnalyticModel(device_profile) and queries bandwidth and latency through PP group ranks:

one_way_time_s = latency + message_bytes / bandwidth

Communication terms by stage:

Stage	incoming	outgoing
First stage	0	one-way time
Middle stage	one-way time	one-way time
Last stage	one-way time	0
`pp_size=1` or `message_bytes=0`	0	0
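
The table above translates into a short per-stage calculation. The sketch assumes a single latency and bandwidth value for every stage boundary, whereas the real estimate queries CommAnalyticModel per PP rank group; stage_comm_times is a hypothetical helper.

```python
def stage_comm_times(pp_size, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Return incoming + outgoing communication time for each stage, in seconds."""
    if pp_size < 1:
        raise ValueError("pp_size must be >= 1")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth_bytes_per_s must be positive")
    if pp_size == 1 or message_bytes == 0:
        return [0.0] * pp_size
    one_way_time_s = latency_s + message_bytes / bandwidth_bytes_per_s
    times = []
    for stage_id in range(pp_size):
        incoming = 0.0 if stage_id == 0 else one_way_time_s            # first stage receives nothing
        outgoing = 0.0 if stage_id == pp_size - 1 else one_way_time_s  # last stage sends nothing
        times.append(incoming + outgoing)
    return times


# 1 MB message over a 1 GB/s link with zero base latency: 1 ms per hop.
print(stage_comm_times(4, 1_000_000, 0.0, 1e9))
```

Middle stages pay for two hops, so they carry twice the communication time of the edge stages.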

The first version may represent send/recv as pseudo ops in runtime trace or pass them through a side channel as PipelineSendRecvEvent objects. The key constraint is that communication events are placed at stage boundaries, not attributed vaguely after full-model tracing.

2.7.3 Stage Compute Time

Stage compute time comes from each stage-local graph's independent trace and performance-model estimate:

stage_compute_time[i] = performance_model(stage_i_runtime_trace)

It no longer depends on distributing full-model analytic time. If a stage trace cannot be generated, a short-term fallback is allowed:

stage_compute_time[i] =
    estimated_full_model_time * stage_layer_count[i] / assigned_layers

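where assigned_layers is the total number of decoder layers assigned across all stages (the sum of stage_layer_count). The result must be marked as an approximation so users do not mistake it for completed stage-first tracing.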

2.7.4 Pipeline Latency Formula

Each stage time is:

stage_time[i] = stage_compute_time[i] + stage_comm_time[i]

Overall pipeline latency is:

latency_s = sum(stage_times) + (num_microbatches - 1) * max(stage_times)

num_microbatches can be derived from len(input_kwargs["attention_meta"].query_lens) and is at least 1. The formula is a fill-drain approximation: the first microbatch passes through all stages, and later microbatches progress at the pace of the slowest stage.

PipelineRuntimeEstimate stores:

Field	Meaning
`latency_s`	Overall latency after PP adjustment.
`stage_compute_times_s`	Compute time for each stage.
`stage_comm_times_s`	Incoming + outgoing communication time for each stage.
`stage_times_s`	Compute + communication time per stage.
`send_recv_events`	Logical inter-stage communication events.
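
The fill-drain formula and the fields above can be combined in a small sketch. The dataclass mirrors the listed fields; estimate_fill_drain is a hypothetical name, and its signature is not that of estimate_pipeline_runtime_from_stage_times().

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineRuntimeEstimate:
    latency_s: float
    stage_compute_times_s: list
    stage_comm_times_s: list
    stage_times_s: list
    send_recv_events: list = field(default_factory=list)


def estimate_fill_drain(stage_compute_times_s, stage_comm_times_s, num_microbatches=1):
    """Apply latency = sum(stage_times) + (num_microbatches - 1) * max(stage_times)."""
    if not stage_compute_times_s:
        raise ValueError("at least one stage is required")
    if len(stage_compute_times_s) != len(stage_comm_times_s):
        raise ValueError("compute and comm lists must have the same length")
    num_microbatches = max(1, num_microbatches)  # the microbatch count is at least 1
    stage_times = [c + m for c, m in zip(stage_compute_times_s, stage_comm_times_s)]
    latency_s = sum(stage_times) + (num_microbatches - 1) * max(stage_times)
    return PipelineRuntimeEstimate(
        latency_s=latency_s,
        stage_compute_times_s=list(stage_compute_times_s),
        stage_comm_times_s=list(stage_comm_times_s),
        stage_times_s=stage_times,
    )


estimate = estimate_fill_drain([1.5, 1.5], [0.5, 0.5], num_microbatches=2)
print(estimate.latency_s)  # 6.0
```

Here the first microbatch takes 2.0 + 2.0 to cross both stages, and the second adds one more slowest-stage interval of 2.0.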

2.8 ModelRunner Flow and Outputs

When PP is enabled, ModelRunner.run() performs:

Generate inputs, KV cache, and attention metadata.
Build PipelineStagePlan from pp_size, pp_group, and decoder layer count.
Build a stage-local model/graph for each stage.
Run forward, torch.compile, runtime trace, and performance-model estimation independently for each stage to get stage_compute_time_s.
Build logical send/recv events from input_ids.numel(), hidden_size, dtype, and PP rank groups, then estimate stage_comm_time_s.
Use estimate_pipeline_runtime_from_stage_times() to aggregate PP latency.
Use PipelineRuntimeEstimate.latency_s as the total execution time for the corresponding performance model.
Add {model_name}_pipeline_parallel to breakdowns:

{
    "compute": sum(estimate.stage_compute_times_s),
    "communication": sum(estimate.stage_comm_times_s),
    "bubble": max(0.0, estimate.latency_s - sum(estimate.stage_times_s)),
}

Then, serving_cast.service.utils.format_breakdowns() formats ordinary op-bound breakdowns and PP breakdowns separately:

Mem 25.00 | Comm 25.00 | Cube 50.00 | Vec 0.00 | PP Compute 50.00 | PP Comm 16.67 | PP Bubble 33.33
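
The PP part of this line can be reproduced from a two-stage estimate, assuming the three PP entries are normalized to percentages of their own sum. That normalization and the helper name format_pp_breakdown are assumptions; the actual format_breakdowns() logic may differ.

```python
def format_pp_breakdown(compute_s, comm_s, bubble_s):
    """Format the PP breakdown as percentages of compute + comm + bubble."""
    total = compute_s + comm_s + bubble_s
    if total <= 0:
        return "PP Compute 0.00 | PP Comm 0.00 | PP Bubble 0.00"
    parts = (("PP Compute", compute_s), ("PP Comm", comm_s), ("PP Bubble", bubble_s))
    return " | ".join(f"{label} {100.0 * value / total:.2f}" for label, value in parts)


# Two stages, two microbatches: stage_times = [2.0, 2.0], latency = 4.0 + 2.0.
stage_compute_times_s = [1.5, 1.5]
stage_comm_times_s = [0.5, 0.5]
latency_s = 6.0

compute = sum(stage_compute_times_s)
communication = sum(stage_comm_times_s)
bubble = max(0.0, latency_s - (compute + communication))

line = format_pp_breakdown(compute, communication, bubble)
print(line)  # PP Compute 50.00 | PP Comm 16.67 | PP Bubble 33.33
```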

2.9 Compatibility and Constraints

Scenario	Constraint
`pp_size=1`	Does not build multi-stage graphs, does not insert send/recv, and degrades to the existing single-stage behavior.
analytic-only	Stage-first tracing primarily serves the analytic performance model; if no analytic model is present, it may fall back to an approximation and mark it.
profiling/empirical	Profiling + PP needs a stage-local profiling data contract; until then, profiling mode must not claim complete PP modeling.
VL/MTP	Stage-local graph construction and weight estimation fall back or are skipped to avoid incorrectly pruning non-standard model structures.
MoE	EP/MoE groups are not included in the PP stage dimension for now, preserving existing global MoE semantics.
Cross-layer fusion	`torch.compile` and runtime trace must run on stage-local graphs to avoid cross-PP-boundary fusion.
Output fields	The `parallel` label can continue to display `TP=... | PP=... | DP=...`; the new PP breakdown does not change existing columns such as `ttft`, `tpot`, and `token/s`.

3. Alternatives

3.1 Only Extend Configuration and Search

This option adds only --pp-sizes, serving pp_size, and pp_group, while still estimating the model with the full layer count and full KV cache.

Advantages:

Minimal change scope.
Lower risk to existing TP/EP paths.

Disadvantages:

Search results overestimate PP per-card memory and ignore or underestimate inter-stage communication and bubbles.
Differences between pp_size > 1 and pp_size=1 mainly come from DP derivation rather than PP itself.
Users can easily assume complete PP performance modeling already exists.

This RFC does not choose this option.

3.2 Full-Model Forward/Trace First, Then Split PP Results

This option first runs TensorCast forward/runtime trace on the unsplit full model, obtains full-model analytic time, then attributes time to stages through stage-local replacement or layer-count ratios.

Advantages:

Smaller change scope.
Reuses existing full-model trace and fallback logic.
Simple compatibility path for pp_size=1.

Disadvantages:

Full-model trace may include cross-layer fused ops across future PP boundaries, which does not match real PP stage boundaries.
Shared overhead requires after-the-fact attribution and is sensitive to model structure, wrappers, and compile graphs.
Send/recv can only be appended as a formula and does not naturally appear at stage boundaries in trace.
It does not match deployment semantics such as vLLM-Ascend, where stages are split before compile/trace.

This RFC keeps this option only as a short-term fallback, not the recommended main path.

3.3 Real Pipeline Runtime Simulation

This option explicitly constructs send/recv ops, microbatch events, stage queues, and overlap in Runtime, and outputs Chrome traces.

Advantages:

Higher precision and explainability.
Future support for 1F1B, interleaved PP, virtual stages, and communication-computation overlap becomes natural.

Disadvantages:

Requires changes to Runtime scheduling, input generation, trace schema, and performance model interfaces.
Has broad impact on both analytic and profiling performance models.
Implementation cost is too high for the first stage of PP search support.

This RFC uses stage-first tracing plus a fill-drain latency approximation as an intermediate architecture before real pipeline runtime simulation.

3.4 Model-Declared `_pp_plan`

This option reads _pp_plan or base_model_pp_plan from HuggingFace or model configuration and partitions stages according to model declarations.

Advantages:

Closer to real deployment strategies.
Can express embedding, norm, lm_head, special blocks, and non-uniform layer assignments.

Disadvantages:

Declaration formats and completeness differ across models.
Requires maintaining mappings from model structure to TensorCast wrappers.
Not required for first-version PP search and memory trend validation.

This RFC uses uniform decoder layer partitioning and leaves _pp_plan for later extensions.

4. Implementation Plan and Future Evolution

4.1 Implementation Items

Item	Status	Implementation Scope	Acceptance Gate
CLI PP search entry	To do	Add `--pp-sizes`, candidate validation, and trailing model id normalization.	Tests cover explicit PP sizes and invalid combination exits.
serving PP configuration	To do	Add `serving_cast.config.ParallelConfig.pp_size` and pass it into TensorCast `UserInputConfig`.	Tests cover YAML parsing and `ModelRunner` argument propagation.
PP rank group	To do	Add `ParallelGroupType.PIPELINE_PARALLEL` and `pp_group`.	Tests cover PP rank group dimensions.
Stage plan and stage-local graph	To do	Add `PipelineStagePlan`, uniform partitioning, stage-local modules, and edge-stage module ownership.	Tests cover even/uneven partitioning and first/middle/last stage graph behavior.
Stage-aware memory estimation	To do	Add `pipeline_max_stage_weight_size`, active-layer KV cache per token, and maximum stage KV cache.	Tests cover weight and KV cache active-layer behavior.
Stage-first runtime trace	To do	Run per-stage forward, compile, trace, and performance-model estimation.	Tests cover no cross-stage fusion, stage trace fallback, and runtime estimates.
Logical send/recv modeling	To do	Add hidden-states message bytes, stage-boundary send/recv pseudo events, and communication latency.	Tests cover communication boundaries, first/middle/last communication direction, and topology bandwidth selection.
PP latency estimation	To do	Add stage compute + comm, fill-drain latency, and bubble calculation.	Tests cover latency formula and microbatch boundaries.
serving breakdown display	To do	Separate op-bound breakdowns from PP breakdowns in `format_breakdowns()`.	Tests cover `PP Compute`, `PP Comm`, and `PP Bubble` formatting.

4.2 Test Plan

Minimum test gate:

python -m pytest tests/test_throughput_optimizer.py -q
python -m pytest tests/test_tensor_cast/test_pipeline_parallel.py -q
python -m pytest serving_cast/tests/ut/test_config.py -q
python -m pytest serving_cast/tests/ut/test_service/test_common.py -q
python -m pytest serving_cast/tests/ut/test_service/test_base_optimizer.py -q
python -m pytest serving_cast/tests/ut/test_service/test_parallel_runnner.py -q
python -m pytest serving_cast/tests/ut/test_tensor_cast_model_runner.py -q

Required coverage:

Area	Required Coverage
CLI arguments	Explicit `--pp-sizes`, defaults, invalid combinations, and trailing model id.
Search space	`tp * pp` divisibility filtering, `dp_size = num_devices // (tp * pp)`.
Configuration propagation	service YAML to serving `ParallelConfig`, then to TensorCast `UserInputConfig`.
Rank groups	PP dimension rank groups, `rank_in_group`, and relationship with TP/DP dimensions.
Stage partitioning	Even/uneven layer split, empty-stage boundary, and active layer indices.
Stage-local graph	First stage keeps embedding, middle stages use `inputs_embeds`, and last stage keeps norm/lm_head.
Compile/trace boundary	`torch.compile` and runtime trace execute on stage-local graphs and do not fuse across PP boundaries.
Cache estimation	KV cache and DSA indexer cache only accumulate active stage layers.
Communication modeling	Message bytes, send/recv pseudo events, first/middle/last communication direction, topology bandwidth, and latency.
Latency model	Stage compute, stage comm, fill-drain latency, microbatch count, and bubble.
Output format	PP breakdown is displayed separately from the original op-bound breakdown.

4.3 Future Evolution

Evolution Item	Trigger	Scope	Exit Criteria
Configurable stage partition	Uniform partitioning does not match target model deployment.	Support manual stage layer ranges or `_pp_plan`.	RFC updated and non-uniform partition tests added.
VL/MTP/multimodal PP	Target models need PP search.	Define ownership for vision tower, MTP head, language layers, and output layers.	No full-model weight fallback; stage traces run stably.
PP + MoE stage-local groups	MoE models need simultaneous PP and EP search.	Redefine stage-local EP/MoE-TP/MoE-DP groups and global group relationships.	Rank groups, cache, dispatch/combine communication all have test coverage.
Real send/recv kernels	Real distributed communication or end-to-end pipeline execution is needed.	Evolve logical send/recv pseudo events into real Runtime ops.	Real communication kernels appear in trace and align with device execution.
Strict microbatch scheduling	Fill-drain approximation error is not acceptable.	Support event-level 1F1B, interleaved PP, and bubble/overlap simulation.	Aligned with real scheduling or a reference simulator, with error reports.
Profiling/empirical PP	Profiling database needs to support PP.	Define stage-local profiling data, communication CSV, and empirical stage aggregation contract.	PP output in profiling mode is explainable and has coverage metrics.

4.4 Runtime Constraints

Users must ensure world_size == tp_size * pp_size * dp_size; invalid combinations should surface during configuration validation or candidate filtering.
PP estimation primarily targets decoder-only LLMs. Non-standard model structures require explicit validation.
PP latency is a simulation estimate and does not mean TensorCast has performed real distributed pipeline execution.
torch.compile, runtime trace, and performance-model estimation must use stage-local graphs instead of post-processing a full-model trace.
In the first version, send/recv are logical communication events; real communication kernels and overlap modeling are future work.
pipeline_max_stage_weight_size and maximum stage KV cache are conservative per-rank peak metrics and are not equal to global total memory.
PP Bubble comes from the fill-drain formula. When the microbatch count is 1, bubble is 0.
