RFC: Current Mixed-Batch Variable-Token Modeling in Throughput Optimizer

Metadata

Item	Content
Status	Approved
Author	stormchasingg
Updated Date	2026-05-15
Related Links

1. Summary

This document records the current implementation of variable-token throughput optimization in the codebase.

Compared with the original RFC version, the current implementation still supports mixed-batch modeling for variable-length prefill workloads, but the code structure has been simplified and several components have been renamed or relocated:

the CLI uses a boolean --length-distribution switch instead of a user-provided path
distribution parsing and workload construction live in OptimizerData
mixed-batch execution is implemented through _get_batched_forward_info()
final reporting no longer relies on a dedicated summary subclass
batched detail-row expansion is handled directly inside OptimizerSummary

The current implementation only applies to:

cli.inference.throughput_optimizer
disaggregation mode
prefill-only runs
TTFT-constrained searches

It does not apply to:

PD ratio optimization mode
Monte Carlo sampling
request arrival distribution modeling

2. Current Scope and Entry Conditions

2.1 CLI behavior

The CLI now exposes:

--input-length
--length-distribution

with the rule:

exactly one of them must be provided

This validation is enforced in cli/inference/throughput_optimizer.py.
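
A minimal sketch of how the exactly-one rule could be expressed with argparse. The parser below is illustrative only: the option types and the use of a mutually exclusive group are assumptions, and the actual check in cli/inference/throughput_optimizer.py may be written differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the two options named above are modeled."""
    parser = argparse.ArgumentParser(prog="throughput_optimizer")
    # required=True plus mutual exclusion yields "exactly one of them".
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--input-length", type=int, default=None)
    group.add_argument("--length-distribution", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--length-distribution"])
    print(args.input_length, args.length_distribution)
```

With this shape, passing neither option or both options makes argparse exit with a usage error.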

2.2 Built-in distribution mode

--length-distribution is a boolean switch, not a file-path argument.

When enabled, the CLI loads the built-in distribution file:

serving_cast/example/length_distribution.yaml

This mode is currently restricted to:

--disagg
--ttft-limits
no --tpot-limits
no PD ratio optimization

If these conditions are not satisfied, the CLI rejects the run.
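
The restrictions above can be sketched as a single guard function. The attribute names on the namespace (disagg, ttft_limits, tpot_limits, pd_ratio_optimization) and the error messages are hypothetical; only the four conditions themselves come from this document.

```python
from types import SimpleNamespace


def check_distribution_mode(args) -> None:
    """Raise ValueError if distribution mode is combined with unsupported options.

    All attribute names are hypothetical stand-ins for the real CLI namespace.
    """
    if not args.length_distribution:
        return  # fixed-length mode: none of these restrictions apply
    if not args.disagg:
        raise ValueError("--length-distribution requires --disagg")
    if args.ttft_limits is None:
        raise ValueError("--length-distribution requires --ttft-limits")
    if args.tpot_limits is not None:
        raise ValueError("--length-distribution does not support --tpot-limits")
    if args.pd_ratio_optimization:
        raise ValueError("--length-distribution does not support PD ratio optimization")


if __name__ == "__main__":
    ok = SimpleNamespace(
        length_distribution=True,
        disagg=True,
        ttft_limits=[2000],
        tpot_limits=None,
        pd_ratio_optimization=False,
    )
    check_distribution_mode(ok)  # passes silently
```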

3. Data Model

3.1 Distribution types

serving_cast/service/utils.py defines:

LengthBin
LengthDistribution

Each bin contains:

min_tokens
max_tokens
weight

Validation rules:

min_tokens >= 0
max_tokens > min_tokens
weight > 0
adjacent bins must not overlap

Weights are not required to sum to 1. The implementation normalizes them internally when building representative rows.
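
To make the validation rules concrete, here is a small sketch using dataclasses. It is not the code in serving_cast/service/utils.py: the field layout of LengthDistribution, the normalized_weights() helper, and the choice to allow bins that share an edge are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LengthBin:
    min_tokens: int
    max_tokens: int
    weight: float

    def __post_init__(self) -> None:
        if self.min_tokens < 0:
            raise ValueError("min_tokens must be >= 0")
        if self.max_tokens <= self.min_tokens:
            raise ValueError("max_tokens must be > min_tokens")
        if self.weight <= 0:
            raise ValueError("weight must be > 0")


@dataclass(frozen=True)
class LengthDistribution:
    bins: List[LengthBin]

    def __post_init__(self) -> None:
        ordered = sorted(self.bins, key=lambda b: b.min_tokens)
        for left, right in zip(ordered, ordered[1:]):
            # Assumption: bins that merely touch at an edge do not overlap.
            if right.min_tokens < left.max_tokens:
                raise ValueError("adjacent bins must not overlap")

    def normalized_weights(self) -> List[float]:
        """Hypothetical helper: weights rescaled so they sum to 1."""
        total = sum(b.weight for b in self.bins)
        return [b.weight / total for b in self.bins]
```

For example, bins with weights 1 and 3 normalize to ratios 0.25 and 0.75, so the YAML file can carry raw counts rather than probabilities.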

3.2 OptimizerData fields

OptimizerData currently contains both fixed-length and distribution-mode fields:

input_length
length_distribution
output_length
batch_size
ttft_limits
tpot_limits
prefix_cache_hit_rate
other serving and search parameters

Distribution mode is identified by:

optimizer_data.length_distribution is not None

4. Variable-Token Workload Construction

4.1 Representative rows

OptimizerData.get_representative_rows() converts each length bin into a representative row.

The current implementation uses the bin midpoint by default and returns rows with:

num_input_tokens
query_len
request_ratio

Semantics:

num_input_tokens is the representative original input-token count
query_len is the effective prefill length after applying prefix-cache reduction
request_ratio is the normalized bin weight
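
The row semantics can be illustrated as follows. This is a sketch under two assumptions that this document does not confirm: the midpoint is taken with integer division, and prefix-cache reduction scales the length by (1 - prefix_cache_hit_rate). The function name and tuple-based input are also hypothetical.

```python
from typing import Dict, List, Tuple


def representative_rows(
    bins: List[Tuple[int, int, float]],
    prefix_cache_hit_rate: float = 0.0,
) -> List[Dict[str, float]]:
    """Build one row per bin; bins holds (min_tokens, max_tokens, weight)."""
    total_weight = sum(weight for _, _, weight in bins)
    rows = []
    for min_tokens, max_tokens, weight in bins:
        num_input_tokens = (min_tokens + max_tokens) // 2  # bin midpoint
        # Assumed reduction rule: cached prefix tokens are not recomputed.
        query_len = int(num_input_tokens * (1.0 - prefix_cache_hit_rate))
        rows.append(
            {
                "num_input_tokens": num_input_tokens,
                "query_len": query_len,
                "request_ratio": weight / total_weight,
            }
        )
    return rows
```

With bins (0, 1024, 1.0) and (1024, 4096, 3.0) and a hit rate of 0.5, the rows carry num_input_tokens 512 and 2560, query_len 256 and 1280, and request_ratio 0.25 and 0.75.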

4.2 Effective input length

OptimizerData.get_effective_input_length() behaves differently by mode:

fixed-length mode:
- returns the scalar effective input length after prefix-cache reduction
distribution mode:
- returns the weighted average of representative query_len

OptimizerData.get_max_effective_input_length() is distribution-specific and is used by the CLI for:

max_prefill_tokens validation

It computes the maximum effective prefill length from the configured bins.
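
As a rough illustration of the distribution-mode behavior, the two helpers below operate on representative rows given as plain dictionaries. Whether the real maximum is taken over the representative query_len or over the upper edge of each bin is not stated here; the sketch uses the representative value, and both function names are hypothetical.

```python
from typing import Dict, List


def weighted_effective_length(rows: List[Dict[str, float]]) -> float:
    """Weighted average of query_len, using request_ratio as the weight."""
    return sum(row["query_len"] * row["request_ratio"] for row in rows)


def max_effective_length(rows: List[Dict[str, float]]) -> float:
    """Largest representative query_len (assumed basis for the max check)."""
    return max(row["query_len"] for row in rows)


if __name__ == "__main__":
    rows = [
        {"query_len": 256, "request_ratio": 0.25},
        {"query_len": 1280, "request_ratio": 0.75},
    ]
    print(weighted_effective_length(rows))  # 1024.0
    print(max_effective_length(rows))       # 1280
```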

4.3 Integer sample allocation

OptimizerData.build_concurrency_samples(concurrency) expands the distribution into a concrete mixed batch.

The implementation:

computes ideal sample counts from concurrency * request_ratio
takes floor(...) as the base allocation
assigns the remaining requests using the largest-remainder method

Returned rows contain:

num_input_tokens
query_len
request_ratio
samples

This produces a deterministic mixed-batch composition for a given concurrency.
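
The allocation steps above can be sketched as a standalone function. The largest-remainder method itself is standard; the function name and the tie-breaking rule (lower bin index wins) are assumptions, since this document does not say how ties are resolved.

```python
import math
from typing import List


def allocate_samples(ratios: List[float], concurrency: int) -> List[int]:
    """Split `concurrency` requests across bins by largest remainder."""
    ideal = [concurrency * ratio for ratio in ratios]
    counts = [math.floor(value) for value in ideal]  # base allocation
    leftover = concurrency - sum(counts)
    # Rank bins by fractional remainder, largest first.
    # Assumption: ties are broken in favor of the lower bin index.
    order = sorted(
        range(len(ratios)),
        key=lambda i: (-(ideal[i] - counts[i]), i),
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    print(allocate_samples([0.5, 0.3, 0.2], 7))  # [4, 2, 1]
```

In the example, the ideal counts are 3.5, 2.1 and 1.4, the floors sum to 6, and the one remaining request goes to the first bin because it has the largest remainder. The counts always sum to the requested concurrency when the ratios are normalized.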

5. Execution Path

5.1 Fixed-length path

BaseThroughputOptimizer._get_forward_info() is still the standard path for:

fixed-length prefill
decode

It constructs a single RequestInfo template and runs:

generate_inputs

5.2 Mixed-batch path

BaseThroughputOptimizer._get_batched_forward_info() is the current mixed-batch path.

It:

calls optimizer_data.build_concurrency_samples(concurrency)
expands those rows into a real heterogeneous List[RequestInfo]
repeats each row according to samples
runs inference with:
- generate_inputs_varlen

Request fields are aligned with RequestInfo semantics:

num_input_tokens for original input-token count
query_len for actual prefill computation length
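
The expansion step can be pictured with the sketch below. The RequestInfo class here is a two-field stand-in defined only for this example, not the project's class, and expand_requests is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestInfo:
    """Hypothetical stand-in carrying only the two fields discussed above."""

    num_input_tokens: int  # original input-token count
    query_len: int         # tokens actually computed during prefill


def expand_requests(sample_rows: List[Dict[str, int]]) -> List[RequestInfo]:
    """Repeat each composition row `samples` times to form a mixed batch."""
    requests: List[RequestInfo] = []
    for row in sample_rows:
        requests.extend(
            RequestInfo(row["num_input_tokens"], row["query_len"])
            for _ in range(row["samples"])
        )
    return requests


if __name__ == "__main__":
    batch = expand_requests(
        [
            {"num_input_tokens": 512, "query_len": 256, "samples": 2},
            {"num_input_tokens": 2560, "query_len": 1280, "samples": 1},
        ]
    )
    print(len(batch))  # 3
```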

6. Disaggregation Integration

DisaggThroughputOptimizer.get_inference_info() now supports both modes:

fixed-length
variable-token mixed-batch

The branch condition is:

variable_input_mode = optimizer_data.length_distribution is not None

6.1 Mixed-batch prefill

Under variable-token prefill:

_get_batched_forward_info() is used
latency_ms is computed from model execution time plus serving_cost
throughput is computed from the true batch token count:

total_input_tokens = Σ(num_input_tokens * samples)
token/s = total_input_tokens / ttft * 1000    (ttft in milliseconds)

This replaces the old scalar formula based on one input_length.
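
A short worked example of the formula, written as a helper whose name is hypothetical. It assumes ttft is expressed in milliseconds, which is what the factor of 1000 implies.

```python
from typing import Dict, List


def prefill_throughput(sample_rows: List[Dict[str, int]], ttft_ms: float) -> float:
    """Tokens per second for one mixed batch, given TTFT in milliseconds."""
    total_input_tokens = sum(
        row["num_input_tokens"] * row["samples"] for row in sample_rows
    )
    return total_input_tokens / ttft_ms * 1000


if __name__ == "__main__":
    rows = [
        {"num_input_tokens": 512, "samples": 3},
        {"num_input_tokens": 2560, "samples": 7},
    ]
    # 512 * 3 + 2560 * 7 = 19456 tokens processed in 512 ms
    print(prefill_throughput(rows, ttft_ms=512.0))  # 38000.0
```

Note that the numerator uses num_input_tokens rather than query_len, so tokens served from the prefix cache still count toward throughput.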

6.2 Summary rows

The resulting DataFrame contains:

one aggregate row
multiple composition detail rows

The aggregate row uses:

num_input_tokens = "all"
request_ratio = 1.0
samples = concurrency

Detail rows reuse the same configuration fields but clear performance columns such as:

ttft
tpot
token/s
token/s/device
percentage_breakdowns
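
The row layout can be sketched with plain dictionaries in place of a DataFrame. The function name, the use of None for cleared performance columns, and the configuration keys in the example are assumptions; only the aggregate-row values and the list of cleared columns come from the text above.

```python
from typing import Any, Dict, List

PERF_COLUMNS = ("ttft", "tpot", "token/s", "token/s/device", "percentage_breakdowns")


def build_summary_rows(
    config: Dict[str, Any],
    sample_rows: List[Dict[str, Any]],
    perf: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return one aggregate row followed by one detail row per composition entry."""
    aggregate = {
        **config,
        **perf,
        "num_input_tokens": "all",
        "request_ratio": 1.0,
        "samples": config["concurrency"],
    }
    details = []
    for row in sample_rows:
        # Detail rows reuse the configuration but carry no performance values.
        detail = {**config, **{column: None for column in PERF_COLUMNS}}
        detail.update(
            num_input_tokens=row["num_input_tokens"],
            request_ratio=row["request_ratio"],
            samples=row["samples"],
        )
        details.append(detail)
    return [aggregate] + details
```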

7. Final Report and Table Rendering

7.1 Summary class structure

The current implementation does not use a dedicated summary subclass for mixed-batch mode.

Instead, OptimizerSummary itself handles both:

regular fixed-length final output
mixed-batch final output

7.2 Best-row selection

OptimizerSummary._prepare_agg_disagg_results() still performs the base filtering and ranking:

filter by ttft and tpot limits
sort by token/s
keep the best row for each parallel

This selection happens on the aggregate rows.
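
A rough sketch of the filter-and-rank logic over aggregate rows, using dictionaries instead of a DataFrame. The function name, the inclusive limit comparison, and the descending final order are assumptions; the real method may differ in all three.

```python
from typing import Any, Dict, List, Optional


def select_best_rows(
    rows: List[Dict[str, Any]],
    ttft_limit: float,
    tpot_limit: Optional[float] = None,
) -> List[Dict[str, Any]]:
    """Keep the highest-throughput row per parallel setting within the limits."""
    best: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        if row["ttft"] > ttft_limit:
            continue  # violates the TTFT limit
        if tpot_limit is not None and row["tpot"] > tpot_limit:
            continue  # violates the TPOT limit, when one is set
        current = best.get(row["parallel"])
        if current is None or row["token/s"] > current["token/s"]:
            best[row["parallel"]] = row
    # Assumed presentation order: highest throughput first.
    return sorted(best.values(), key=lambda r: r["token/s"], reverse=True)
```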

7.3 Composition-row expansion

If args.length_distribution is not None, OptimizerSummary._get_agg_disagg_final_out() dispatches to:

_get_agg_disagg_final_out_batched()

That path:

selects the best aggregate rows
calls _expand_composition_rows()
appends the matching detail rows from self._summary_df

The matching keys are:

parallel
batch_size
concurrency
num_devices

Ordering rules:

aggregate row first
detail rows after
detail rows sorted by num_input_tokens
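
The matching and ordering rules can be sketched as follows, again with dictionaries standing in for DataFrame rows. The function below is not the real _expand_composition_rows(); it assumes that aggregate rows are the ones whose num_input_tokens equals "all".

```python
from typing import Any, Dict, List

MATCH_KEYS = ("parallel", "batch_size", "concurrency", "num_devices")


def expand_composition_rows(
    best_rows: List[Dict[str, Any]],
    all_rows: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Place each best aggregate row first, followed by its detail rows."""
    output: List[Dict[str, Any]] = []
    for best in best_rows:
        key = tuple(best[name] for name in MATCH_KEYS)
        details = [
            row
            for row in all_rows
            if row["num_input_tokens"] != "all"  # skip aggregate rows
            and tuple(row[name] for name in MATCH_KEYS) == key
        ]
        details.sort(key=lambda row: row["num_input_tokens"])
        output.append(best)
        output.extend(details)
    return output
```

Because all four keys must match, detail rows from a different concurrency or parallel setting are never attached to the selected aggregate row.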

7.4 Batched final table

The mixed-batch final table is rendered by:

_get_disagg_table_buf_batched()

This table is currently prefill-only and shows:

Top
num_devices
num_input_tokens
request_ratio
samples
concurrency
TTFT (ms)
Throughput (token/s)
parallel
batch_size

input_length and output_length are intentionally not shown in the batched final table because the composition rows are centered on:

original representative token count
request ratio
allocated sample count

Performance columns on detail rows are rendered as -.

8. Module Interaction Diagram

CLI Argument Parsing (throughput_optimizer.py)
    │
    ├─ Exactly one of --input-length / --length-distribution
    │
    ├─ --length-distribution enabled?
    │   ├─ No
    │   │   └─ Use scalar input_length path
    │   │
    │   └─ Yes
    │       ├─ load_length_distribution()
    │       ├─ Build OptimizerData(length_distribution=...)
    │       ├─ Validate:
    │       │   ├─ disagg only
    │       │   ├─ prefill only (--ttft-limits set)
    │       │   └─ no --tpot-limits / no PD ratio optimization
    │       └─ Use distribution-aware prefill path
    │
    └─ ParallelRunner(args)
        │
        └─ run_disagg()
            │
            ├─ For each TP/parallel candidate
            │   └─ _get_df_list()
            │       └─ DisaggThroughputOptimizer.run()
            │           │
            │           ├─ Binary-search batch size
            │           └─ For each candidate batch
            │               └─ get_inference_info()
            │                   │
            │                   ├─ length_distribution is None?
            │                   │   ├─ Yes → _get_forward_info()
            │                   │   └─ No  → _get_batched_forward_info()
            │                   │            │
            │                   │            ├─ build_concurrency_samples(concurrency)
            │                   │            ├─ Expand rows into heterogeneous RequestInfo list
            │                   │            └─ run_inference(generate_inputs_varlen)
            │                   │
            │                   ├─ Compute TTFT / throughput
            │                   └─ Build:
            │                       ├─ one aggregate row
            │                       └─ multiple composition detail rows
            │
            └─ OptimizerSummary.report_final_result(args)
                │
                ├─ length_distribution is None?
                │   ├─ Yes → _get_agg_disagg_final_out()
                │   │         └─ _get_disagg_table_buf()
                │   │
                │   └─ No  → _get_agg_disagg_final_out_batched()
                │             │
                │             ├─ _prepare_agg_disagg_results()
                │             ├─ _expand_composition_rows()
                │             └─ _get_disagg_table_buf_batched()
                │
                └─ Print overall best configuration + final table

9. Ongoing Work and Limitations

The following directions are already identified and are still in progress:

variable-token mixed-batch modeling for aggregation mode
variable-token mixed-batch modeling for decode-only scenarios

Beyond that, current limitations include:

only the built-in YAML file is supported from the CLI
distribution mode only works for disaggregation prefill with TTFT limits
PD ratio optimization does not support variable-token mixed-batch modeling
best-row selection still happens on aggregate rows before detail-row expansion

10. Notes for Future Changes

If the implementation evolves again, the following areas are most sensitive and should be updated together:

CLI contract for --length-distribution
OptimizerData naming and workload-construction helpers
BaseThroughputOptimizer mixed-batch execution entry
DisaggThroughputOptimizer summary row schema
OptimizerSummary batched final-report formatting

In particular, any future reintroduction of:

custom distribution file paths
summary subclasses
decode-mode batched reporting
aggregation-mode variable-token support

should be documented as a separate follow-up RFC update.