RFC: Current Mixed-Batch Variable-Token Modeling in Throughput Optimizer
Metadata
| Item | Content |
|---|---|
| Status | Approved |
| Author | stormchasingg |
| Updated Date | 2026-05-15 |
| Related Links | |
1. Summary
This document records the current implementation of variable-token throughput optimization in the codebase.
Compared with the original RFC version, the current implementation still supports mixed-batch modeling for variable-length prefill workloads, but the code structure has been significantly simplified and renamed:
- the CLI uses a boolean --length-distribution switch instead of a user-provided path
- distribution parsing and workload construction live in OptimizerData
- mixed-batch execution is implemented through _get_batched_forward_info()
- final reporting no longer relies on a dedicated summary subclass
- batched detail-row expansion is handled directly inside OptimizerSummary
The current implementation only applies to:
- cli.inference.throughput_optimizer
- disaggregation mode
- prefill-only runs
- TTFT-constrained searches
It does not apply to:
- PD ratio optimization mode
- Monte Carlo sampling
- request arrival distribution modeling
2. Current Scope and Entry Conditions
2.1 CLI behavior
The CLI now exposes:
- --input-length
- --length-distribution
with the rule:
- exactly one of them must be provided
This validation is enforced in cli/inference/throughput_optimizer.py.
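The "exactly one of them" rule can be sketched with a required mutually exclusive group in argparse. This is a minimal illustration, not the code in cli/inference/throughput_optimizer.py: the real CLI may validate the arguments by hand, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two length-related options."""
    parser = argparse.ArgumentParser(prog="throughput_optimizer")
    # required=True rejects "neither"; mutual exclusion rejects "both".
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--input-length", type=int, default=None)
    # Boolean switch: it takes no value and is False unless present.
    group.add_argument("--length-distribution", action="store_true")
    return parser


args = build_parser().parse_args(["--length-distribution"])
print(args.input_length, args.length_distribution)
```

With this layout, passing both options or neither makes argparse exit with a usage error.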
2.2 Built-in distribution mode
--length-distribution is a boolean switch, not a file-path argument.
When enabled, the CLI loads the built-in distribution file:
serving_cast/example/length_distribution.yaml
This mode is currently restricted to:
- --disagg
- --ttft-limits
- no --tpot-limits
- no PD ratio optimization
If these conditions are not satisfied, the CLI rejects the run.
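The restrictions above could be checked with a small guard along the following lines. The attribute names (disagg, ttft_limits, tpot_limits, pd_ratio_optimization) and the function name are assumptions made for illustration; the actual CLI may use different names and error reporting.

```python
from types import SimpleNamespace


def check_distribution_mode(args) -> None:
    """Reject option combinations that distribution mode does not support."""
    if not args.length_distribution:
        return  # fixed-length runs are not restricted by this check
    if not args.disagg:
        raise ValueError("--length-distribution requires --disagg")
    if args.ttft_limits is None:
        raise ValueError("--length-distribution requires --ttft-limits")
    if args.tpot_limits is not None:
        raise ValueError("--length-distribution does not support --tpot-limits")
    if args.pd_ratio_optimization:  # hypothetical attribute name
        raise ValueError("--length-distribution does not support PD ratio optimization")


# A prefill-only disaggregation run passes the check.
check_distribution_mode(SimpleNamespace(
    length_distribution=True, disagg=True, ttft_limits=2000.0,
    tpot_limits=None, pd_ratio_optimization=False,
))
```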
3. Data Model
3.1 Distribution types
serving_cast/service/utils.py defines:
- LengthBin
- LengthDistribution
Each bin contains:
- min_tokens
- max_tokens
- weight
Validation rules:
- min_tokens >= 0
- max_tokens > min_tokens
- weight > 0
- adjacent bins must not overlap
Weights are not required to sum to 1.
The implementation normalizes them internally when building representative rows.
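As a rough sketch of these rules, the two types might look like the stand-in dataclasses below. They are not the definitions in serving_cast/service/utils.py: the normalized_weights helper, the half-open bin convention, and the sort by min_tokens are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LengthBin:
    min_tokens: int
    max_tokens: int
    weight: float

    def __post_init__(self) -> None:
        if self.min_tokens < 0:
            raise ValueError("min_tokens must be >= 0")
        if self.max_tokens <= self.min_tokens:
            raise ValueError("max_tokens must be > min_tokens")
        if self.weight <= 0:
            raise ValueError("weight must be > 0")


@dataclass(frozen=True)
class LengthDistribution:
    bins: List[LengthBin]

    def __post_init__(self) -> None:
        if not self.bins:
            raise ValueError("at least one bin is required")
        ordered = sorted(self.bins, key=lambda b: b.min_tokens)
        for left, right in zip(ordered, ordered[1:]):
            # Assumption: bins are half-open, so touching edges are allowed.
            if right.min_tokens < left.max_tokens:
                raise ValueError("adjacent bins must not overlap")

    def normalized_weights(self) -> List[float]:
        """Scale raw weights so that they sum to 1 (hypothetical helper)."""
        total = sum(b.weight for b in self.bins)
        return [b.weight / total for b in self.bins]
```

For example, two bins with raw weights 3 and 1 normalize to request ratios 0.75 and 0.25.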
3.2 OptimizerData fields
OptimizerData currently contains both fixed-length and distribution-mode fields:
- input_length
- length_distribution
- output_length
- batch_size
- ttft_limits
- tpot_limits
- prefix_cache_hit_rate
- other serving and search parameters
Distribution mode is identified by:
optimizer_data.length_distribution is not None
4. Variable-Token Workload Construction
4.1 Representative rows
OptimizerData.get_representative_rows() converts each length bin into a representative row.
The current implementation uses the bin midpoint by default and returns rows with:
- num_input_tokens
- query_len
- request_ratio
Semantics:
- num_input_tokens is the representative original input-token count
- query_len is the effective prefill length after applying prefix-cache reduction
- request_ratio is the normalized bin weight
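The row construction can be illustrated as follows. This is a sketch under stated assumptions, not the body of OptimizerData.get_representative_rows(): the midpoint uses integer division, and the prefix-cache reduction is assumed to be a proportional scaling by (1 - prefix_cache_hit_rate), which the document does not spell out.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[int, float]]


def representative_rows(
    bins: List[Tuple[int, int, float]],
    prefix_cache_hit_rate: float = 0.0,
) -> List[Row]:
    """Turn (min_tokens, max_tokens, weight) bins into representative rows."""
    total_weight = sum(weight for _, _, weight in bins)
    rows: List[Row] = []
    for min_tokens, max_tokens, weight in bins:
        num_input_tokens = (min_tokens + max_tokens) // 2  # bin midpoint
        # Assumption: cached prefix tokens are removed proportionally.
        query_len = max(1, round(num_input_tokens * (1.0 - prefix_cache_hit_rate)))
        rows.append({
            "num_input_tokens": num_input_tokens,
            "query_len": query_len,
            "request_ratio": weight / total_weight,
        })
    return rows


rows = representative_rows([(0, 1024, 3.0), (1024, 3072, 1.0)], prefix_cache_hit_rate=0.5)
```

In this example the two bins yield midpoints 512 and 2048, effective lengths 256 and 1024, and request ratios 0.75 and 0.25.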
4.2 Effective input length
OptimizerData.get_effective_input_length() behaves differently by mode:
- fixed-length mode:
  - returns the scalar effective input length after prefix-cache reduction
- distribution mode:
  - returns the weighted average of representative query_len
OptimizerData.get_max_effective_input_length() is distribution-specific and is used by the CLI for:
- max_prefill_tokens validation
It computes the maximum effective prefill length from the configured bins.
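The two distribution-mode quantities can be sketched over representative rows as shown below. The function names are hypothetical, and taking the maximum over representative query_len values is an assumption: the real get_max_effective_input_length() may instead work from the bin upper bounds.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[int, float]]


def weighted_effective_input_length(rows: List[Row]) -> float:
    """Weighted average of query_len, using request_ratio as the weight."""
    return sum(row["query_len"] * row["request_ratio"] for row in rows)


def max_effective_input_length(rows: List[Row]) -> int:
    """Largest effective prefill length among the representative rows."""
    return int(max(row["query_len"] for row in rows))


rows = [
    {"query_len": 256, "request_ratio": 0.75},
    {"query_len": 1024, "request_ratio": 0.25},
]
average = weighted_effective_input_length(rows)  # 256 * 0.75 + 1024 * 0.25
longest = max_effective_input_length(rows)
```

A max_prefill_tokens check would then compare longest, rather than average, against the configured limit, since a single long request must still fit.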
4.3 Integer sample allocation
OptimizerData.build_concurrency_samples(concurrency) expands the distribution into a concrete mixed batch.
The implementation:
- computes ideal sample counts from concurrency * request_ratio
- takes floor(...) as the base allocation
- assigns the remaining requests using the largest-remainder method
Returned rows contain:
- num_input_tokens
- query_len
- request_ratio
- samples
This produces a deterministic mixed-batch composition for a given concurrency.
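The allocation steps can be sketched as a standalone function. This is an illustration of the largest-remainder method, not the body of build_concurrency_samples(); the function name and the tie-breaking rule (earlier bin wins) are assumptions.

```python
import math
from typing import List


def allocate_samples(concurrency: int, request_ratios: List[float]) -> List[int]:
    """Split `concurrency` requests across bins with the largest-remainder method."""
    ideal = [concurrency * ratio for ratio in request_ratios]
    samples = [math.floor(value) for value in ideal]  # base allocation
    remaining = concurrency - sum(samples)
    # Hand out leftovers by descending fractional remainder.
    # Assumption: ties are broken in favor of the earlier bin.
    order = sorted(range(len(ideal)), key=lambda i: (-(ideal[i] - samples[i]), i))
    for i in order[:remaining]:
        samples[i] += 1
    return samples


print(allocate_samples(7, [0.5, 0.25, 0.25]))  # [3, 2, 2]
```

With concurrency 7 the ideal counts are 3.5, 1.75, and 1.75, so the floors sum to 5 and the two leftover requests go to the bins with remainder 0.75. Because the ordering is fully determined by the remainders and bin index, the same inputs always produce the same composition.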
5. Execution Path
5.1 Fixed-length path
BaseThroughputOptimizer._get_forward_info() is still the standard path for:
- fixed-length prefill
- decode
It constructs a single RequestInfo template and runs:
generate_inputs
5.2 Mixed-batch path
BaseThroughputOptimizer._get_batched_forward_info() is the current mixed-batch path.
It:
- calls optimizer_data.build_concurrency_samples(concurrency)
- expands those rows into a real heterogeneous List[RequestInfo]
- repeats each row according to samples
- runs inference with generate_inputs_varlen
Request fields are aligned with RequestInfo semantics:
- num_input_tokens for original input-token count
- query_len for actual prefill computation length
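The expansion step can be pictured as below. RequestInfoSketch is a hypothetical stand-in that carries only the two fields named above; the real RequestInfo has more fields, and the real path goes on to call generate_inputs_varlen, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestInfoSketch:
    """Hypothetical stand-in for RequestInfo with only the two length fields."""
    num_input_tokens: int  # original input-token count
    query_len: int         # tokens actually computed during prefill


def expand_requests(rows: List[Dict[str, int]]) -> List[RequestInfoSketch]:
    """Repeat each composition row `samples` times to form a heterogeneous batch."""
    requests: List[RequestInfoSketch] = []
    for row in rows:
        requests.extend(
            RequestInfoSketch(row["num_input_tokens"], row["query_len"])
            for _ in range(row["samples"])
        )
    return requests


batch = expand_requests([
    {"num_input_tokens": 512, "query_len": 256, "samples": 2},
    {"num_input_tokens": 2048, "query_len": 1024, "samples": 1},
])
```

The resulting list has one entry per request, so its length equals the concurrency that was passed to the allocation step.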
6. Disaggregation Integration
DisaggThroughputOptimizer.get_inference_info() now supports both modes:
- fixed-length
- variable-token mixed-batch
The branch condition is:
variable_input_mode = optimizer_data.length_distribution is not None
6.1 Mixed-batch prefill
Under variable-token prefill:
- _get_batched_forward_info() is used
- latency_ms is computed from model execution time plus serving_cost
- throughput is computed from the true batch token count:
total_input_tokens = Σ(num_input_tokens * samples)
token/s = total_input_tokens / ttft * 1000
This replaces the old scalar formula based on one input_length.
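A small worked version of the formula follows, assuming ttft is expressed in milliseconds, as the factor of 1000 and the TTFT (ms) column in the final table suggest. The function name is hypothetical.

```python
from typing import Dict, List


def mixed_batch_throughput(rows: List[Dict[str, int]], ttft_ms: float) -> float:
    """Tokens per second for one mixed batch, given its TTFT in milliseconds."""
    total_input_tokens = sum(row["num_input_tokens"] * row["samples"] for row in rows)
    return total_input_tokens / ttft_ms * 1000


rows = [
    {"num_input_tokens": 1000, "samples": 2},
    {"num_input_tokens": 3000, "samples": 1},
]
print(mixed_batch_throughput(rows, ttft_ms=500.0))  # 10000.0
```

Here the batch carries 2 * 1000 + 1 * 3000 = 5000 input tokens and finishes prefill in 0.5 s, which gives 10000 token/s.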
6.2 Summary rows
The resulting DataFrame contains:
- one aggregate row
- multiple composition detail rows
The aggregate row uses:
- num_input_tokens = "all"
- request_ratio = 1.0
- samples = concurrency
Detail rows reuse the same configuration fields but clear performance columns such as:
- ttft
- tpot
- token/s
- token/s/device
- percentage_breakdowns
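The row schema can be sketched with plain dictionaries instead of a DataFrame. Everything here is illustrative: build_summary_rows is a hypothetical helper, and "clearing" a performance column is assumed to mean setting it to None.

```python
from typing import Any, Dict, List

PERF_COLUMNS = ("ttft", "tpot", "token/s", "token/s/device", "percentage_breakdowns")


def build_summary_rows(
    config: Dict[str, Any],
    perf: Dict[str, Any],
    composition: List[Dict[str, Any]],
    concurrency: int,
) -> List[Dict[str, Any]]:
    """Return one aggregate row followed by one detail row per composition entry."""
    aggregate = {
        **config,
        **perf,
        "num_input_tokens": "all",
        "request_ratio": 1.0,
        "samples": concurrency,
    }
    rows = [aggregate]
    for entry in composition:
        detail = {**config, **entry}  # same configuration fields as the aggregate
        detail.update({column: None for column in PERF_COLUMNS})  # cleared
        rows.append(detail)
    return rows


rows = build_summary_rows(
    config={"parallel": "tp4", "batch_size": 3, "concurrency": 3, "num_devices": 4},
    perf={"ttft": 500.0, "token/s": 10000.0},
    composition=[
        {"num_input_tokens": 512, "request_ratio": 0.75, "samples": 2},
        {"num_input_tokens": 2048, "request_ratio": 0.25, "samples": 1},
    ],
    concurrency=3,
)
```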
7. Final Report and Table Rendering
7.1 Summary class structure
The current implementation does not use a dedicated summary subclass for mixed-batch mode.
Instead, OptimizerSummary itself handles both:
- regular fixed-length final output
- mixed-batch final output
7.2 Best-row selection
OptimizerSummary._prepare_agg_disagg_results() still performs the base filtering and ranking:
- filter by ttft and tpot limits
- sort by token/s
- keep the best row for each parallel
This selection happens on the aggregate rows.
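The filter, sort, and keep-best sequence can be sketched over dictionaries as follows. This is not the DataFrame code in _prepare_agg_disagg_results(): the function name is hypothetical, and only the ttft limit is applied because distribution mode is prefill-only.

```python
from typing import Any, Dict, List


def best_rows_per_parallel(rows: List[Dict[str, Any]], ttft_limit: float) -> List[Dict[str, Any]]:
    """Keep the highest-throughput aggregate row for each `parallel` value."""
    candidates = [
        row for row in rows
        if row["num_input_tokens"] == "all" and row["ttft"] <= ttft_limit
    ]
    best: Dict[Any, Dict[str, Any]] = {}
    # After sorting by throughput, the first row seen per key is the best one.
    for row in sorted(candidates, key=lambda r: r["token/s"], reverse=True):
        best.setdefault(row["parallel"], row)
    return list(best.values())


rows = [
    {"parallel": "tp2", "num_input_tokens": "all", "ttft": 50.0, "token/s": 100.0},
    {"parallel": "tp2", "num_input_tokens": "all", "ttft": 300.0, "token/s": 200.0},
    {"parallel": "tp4", "num_input_tokens": "all", "ttft": 80.0, "token/s": 150.0},
    {"parallel": "tp4", "num_input_tokens": 512, "ttft": None, "token/s": None},
]
selected = best_rows_per_parallel(rows, ttft_limit=100.0)
```

In this example the tp2 row with 200 token/s is dropped because its TTFT exceeds the limit, so the 100 token/s row is kept for tp2.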
7.3 Composition-row expansion
If args.length_distribution is not None, OptimizerSummary._get_agg_disagg_final_out() dispatches to:
_get_agg_disagg_final_out_batched()
That path:
- selects the best aggregate rows
- calls _expand_composition_rows()
- appends the matching detail rows from self._summary_df
The matching keys are:
- parallel
- batch_size
- concurrency
- num_devices
Ordering rules:
- aggregate row first
- detail rows after
- detail rows sorted by num_input_tokens
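The matching and ordering rules can be illustrated with a dictionary-based sketch. It is not the implementation of _expand_composition_rows(), which works on self._summary_df; the function signature here is an assumption.

```python
from typing import Any, Dict, List

MATCH_KEYS = ("parallel", "batch_size", "concurrency", "num_devices")


def expand_composition_rows(
    best_rows: List[Dict[str, Any]],
    summary_rows: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Place each best aggregate row first, then its detail rows by token count."""
    output: List[Dict[str, Any]] = []
    for best in best_rows:
        key = tuple(best[name] for name in MATCH_KEYS)
        details = [
            row for row in summary_rows
            if row["num_input_tokens"] != "all"
            and tuple(row[name] for name in MATCH_KEYS) == key
        ]
        output.append(best)
        output.extend(sorted(details, key=lambda row: row["num_input_tokens"]))
    return output


config = {"parallel": "tp4", "batch_size": 3, "concurrency": 3, "num_devices": 4}
summary = [
    {**config, "num_input_tokens": 2048},
    {**config, "num_input_tokens": "all"},
    {**config, "num_input_tokens": 512},
    {**config, "batch_size": 8, "num_input_tokens": 512},  # different batch, no match
]
final_rows = expand_composition_rows([summary[1]], summary)
```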
7.4 Batched final table
The mixed-batch final table is rendered by:
_get_disagg_table_buf_batched()
This table is currently prefill-only and shows:
- Top
- num_devices
- num_input_tokens
- request_ratio
- samples
- concurrency
- TTFT (ms)
- Throughput (token/s)
- parallel
- batch_size
input_length and output_length are intentionally not shown in the batched final table because the composition rows are centered on:
- original representative token count
- request ratio
- allocated sample count
Performance columns on detail rows are rendered as -.
8. Module Interaction Diagram
CLI Argument Parsing (throughput_optimizer.py)
│
├─ Exactly one of --input-length / --length-distribution
│
├─ --length-distribution enabled?
│ ├─ No
│ │ └─ Use scalar input_length path
│ │
│ └─ Yes
│ ├─ load_length_distribution()
│ ├─ Build OptimizerData(length_distribution=...)
│ ├─ Validate:
│ │ ├─ disagg only
│ │ ├─ prefill only (--ttft-limits set)
│ │ └─ no --tpot-limits / no PD ratio optimization
│ └─ Use distribution-aware prefill path
│
└─ ParallelRunner(args)
│
└─ run_disagg()
│
├─ For each TP/parallel candidate
│ └─ _get_df_list()
│ └─ DisaggThroughputOptimizer.run()
│ │
│ ├─ Binary-search batch size
│ └─ For each candidate batch
│ └─ get_inference_info()
│ │
│ ├─ length_distribution is None?
│ │ ├─ Yes → _get_forward_info()
│ │ └─ No → _get_batched_forward_info()
│ │ │
│ │ ├─ build_concurrency_samples(concurrency)
│ │ ├─ Expand rows into heterogeneous RequestInfo list
│ │ └─ run_inference(generate_inputs_varlen)
│ │
│ ├─ Compute TTFT / throughput
│ └─ Build:
│ ├─ one aggregate row
│ └─ multiple composition detail rows
│
└─ OptimizerSummary.report_final_result(args)
│
├─ length_distribution is None?
│ ├─ Yes → _get_agg_disagg_final_out()
│ │ └─ _get_disagg_table_buf()
│ │
│ └─ No → _get_agg_disagg_final_out_batched()
│ │
│ ├─ _prepare_agg_disagg_results()
│ ├─ _expand_composition_rows()
│ └─ _get_disagg_table_buf_batched()
│
└─ Print overall best configuration + final table
9. Ongoing Work and Limitations
The following directions are already identified and are still in progress:
- variable-token mixed-batch modeling for aggregation mode
- variable-token mixed-batch modeling for decode-only scenarios
Beyond that, current limitations include:
- only the built-in YAML file is supported from the CLI
- distribution mode only works for disaggregation prefill with TTFT limits
- PD ratio optimization does not support variable-token mixed-batch modeling
- best-row selection still happens on aggregate rows before detail-row expansion
10. Notes for Future Changes
If the implementation evolves again, the following areas are most sensitive and should be updated together:
- CLI contract for --length-distribution
- OptimizerData naming and workload-construction helpers
- BaseThroughputOptimizer mixed-batch execution entry
- DisaggThroughputOptimizer summary row schema
- OptimizerSummary batched final-report formatting
In particular, any future reintroduction of:
- custom distribution file paths
- summary subclasses
- decode-mode batched reporting
- aggregation-mode variable-token support
should be documented as a separate follow-up RFC update.